The Observed Antibody Space (OAS) [1,2] is an amazing resource for investigating observed antibodies and for training antibody-specific models. However, its size (over 2.4 billion unpaired and 1.5 million paired antibody sequences as of June 2023) can make it painful to work with. Additionally, OAS is extremely information-rich, with nearly 100 columns for each antibody heavy or light chain, which further complicates handling the data.
Having spent a lot of time working with OAS, I wanted to share a few tricks and insights that I hope will reduce the pain and increase the joy of working with it!
Reading individual data units
OAS is structured into files of at most 2 GB, which we call data units, with each data unit consisting of sequences from the same experimental run, chain and isotype. Individual data units can be downloaded from the OAS website for local querying. However, instead of downloading each data unit first, you can read it directly from its URL. While reading directly from a URL is slightly slower and requires an internet connection, it conveniently removes the need to keep a local copy of OAS (~1.1 TB).
The following is a code snippet for reading a data unit URL using pandas.
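    import pandas as pd

    data_unit = "http://opig.stats.ox.ac.uk/webapps/ngsdb/unpaired/Chen_2020/csv/SRR11937587_1_Heavy_IGHG.csv.gz"

    # The first line of a data unit holds the metadata, so the column names are on the second line (header=1).
    pd.read_csv(data_unit, header=1).head(5)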
To extract the metadata, the following code snippet can be used.
    import json

    # Read only the first line, which pandas parses as the header row, and decode it as JSON.
    json.loads(','.join(pd.read_csv(data_unit, nrows=0).columns))

Each data unit can now easily be read and processed without first having to download it.
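If you read many data units, it can be handy to wrap the two snippets above in one small helper. The following is a minimal sketch and not part of OAS itself: the name read_data_unit is made up for illustration, and it assumes the layout used above (metadata as JSON on the first line, column names on the second). The usecols argument is a standard pandas.read_csv parameter and lets you load only the columns you need out of the nearly 100 available.

```python
import json

import pandas as pd


def read_data_unit(source, usecols=None):
    """Return (metadata, sequences) for one data unit (hypothetical helper).

    `source` can be a local path or a URL, as accepted by pandas.read_csv.
    """
    # Line 1: metadata stored as JSON, which pandas parses as the header row.
    header = pd.read_csv(source, nrows=0)
    metadata = json.loads(",".join(header.columns))
    # Line 2: the real column names, hence header=1.
    sequences = pd.read_csv(source, header=1, usecols=usecols)
    return metadata, sequences
```

Calling read_data_unit(data_unit, usecols=["sequence_alignment_aa", "Redundancy"]) would then give the metadata dictionary and a two-column table. Note that this sketch reads the source twice, once for the metadata and once for the sequences.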
Filtering OAS for potentially problematic antibody sequences
Since OAS is derived from high-throughput sequencing studies, its sequences are of varying quality. While sequences with obvious and detrimental errors are removed when the data is originally processed, less obvious or less relevant errors, such as extremely long indels, might still be present. For each sequence, the “ANARCI_status” column records unusual residues, indels, truncation and missing conserved cysteines, and can be used to filter OAS for whatever a given researcher considers problematic. For example, some studies need only sequences that cover the whole variable domain, and “ANARCI_status” can then be used to filter out truncated sequences.
In the following code snippet, we filter out all truncated sequences.
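    sequence_data = pd.read_csv(data_unit, header=1)
    sequence_data.shape  # number of sequences before filtering

    sequence_data.ANARCI_status.head()  # inspect the status strings

    # Drop sequences whose status mentions 'Shorter', i.e. truncated sequences.
    # na=False treats a missing status as "not truncated".
    full_length_data = sequence_data[~sequence_data.ANARCI_status.str.contains('Shorter', na=False)]
    full_length_data.shape  # number of sequences after filtering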
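The same idea extends to other flags. Below is a small sketch of a helper that removes every sequence whose “ANARCI_status” mentions any of a list of keywords. The helper name drop_flagged is hypothetical, and 'Shorter' is the only keyword taken from the snippet above, so check the status strings in your own data units before relying on any other keyword.

```python
import re

import pandas as pd


def drop_flagged(sequences, keywords, column="ANARCI_status"):
    """Drop rows whose status column mentions any keyword (hypothetical helper)."""
    if not keywords:
        return sequences.copy()
    # Escape the keywords so they are matched literally, then OR them together.
    pattern = "|".join(re.escape(keyword) for keyword in keywords)
    # Missing statuses become empty strings and are therefore never flagged.
    status = sequences[column].fillna("").astype(str)
    flagged = status.str.contains(pattern)
    return sequences[~flagged]
```

For example, drop_flagged(sequence_data, ["Shorter"]) should give the same rows as the full_length_data table above.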
Another trick for removing lower-quality data is to use the “Redundancy” column to filter out sequences seen fewer than a certain number of times. This is based on the idea that sequencing errors are highly unlikely to occur at the exact same position, so a sequence seen multiple times is most likely correctly sequenced. However, this trick also greatly reduces the data size and removes many valid sequences, as seen in the following code snippet.
    sequence_data = pd.read_csv(data_unit, header=1)
    sequence_data.shape  # number of sequences before filtering

    # Keep only sequences observed at least twice.
    sequence_data.query("Redundancy>=2").shape
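Before settling on a cutoff, it can help to see how much data each choice would keep. The following sketch tabulates the number and fraction of sequences retained for a few thresholds. The function name retention_by_threshold and the default thresholds are my own choices for illustration, not something OAS prescribes.

```python
import pandas as pd


def retention_by_threshold(sequences, thresholds=(1, 2, 5, 10), column="Redundancy"):
    """Count how many sequences survive each minimum-redundancy cutoff."""
    counts = sequences[column]
    rows = []
    for threshold in thresholds:
        kept = counts >= threshold
        rows.append(
            {
                "min_redundancy": threshold,
                "n_sequences": int(kept.sum()),
                "fraction_kept": float(kept.mean()),
            }
        )
    return pd.DataFrame(rows)
```

Running retention_by_threshold(sequence_data) on a data unit gives one row per threshold, which makes the trade-off between data quality and data size easy to inspect.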
Additional information
Other than the antibody sequence, OAS also contains a lot of additional information. This includes information about the source of each sequence, i.e. the study, species, or disease state of the subject, as well as 97 columns of sequence-specific information. The sequence-specific information includes the nucleotide and amino acid sequence, but also information about its germlines, such as the estimated germlines and the estimated aligned germline after rearrangement but before somatic hypermutation. Further, the columns “v_identity”, “d_identity” and “j_identity” contain the nucleotide sequence identity between the estimated germline and the antibody sequence.
This information is great for understanding how mutated a given sequence is, and where the mutations are. The following code snippet shows how to view this.
    sequence_data = pd.read_csv(data_unit, header=1)

    # Aligned antibody and germline sequences next to the germline identities.
    sequence_data[['sequence_alignment_aa', 'germline_alignment_aa', 'v_identity', 'd_identity', 'j_identity']].head()
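To see where the mutations are, the two aligned columns can be compared position by position. The sketch below assumes that “sequence_alignment_aa” and “germline_alignment_aa” are aligned to each other, so that the same index refers to the same position; if the two strings differ in length, only the shared prefix is compared. The function name mutated_positions and the choice of gap characters are assumptions for illustration.

```python
def mutated_positions(sequence_aa, germline_aa, gap_chars="-."):
    """List (position, germline residue, observed residue) for every mismatch.

    Positions are 0-based indices into the aligned strings, not IMGT numbers.
    """
    differences = []
    for position, (observed, germline) in enumerate(zip(sequence_aa, germline_aa)):
        # Skip positions where either sequence has a gap (assumed gap characters).
        if observed in gap_chars or germline in gap_chars:
            continue
        if observed != germline:
            differences.append((position, germline, observed))
    return differences
```

Applied to a single row, for example mutated_positions(row.sequence_alignment_aa, row.germline_alignment_aa), this gives a quick list of the amino acid changes relative to the estimated germline.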
While the focus of OAS was to collect the sequences of antibody variable domains, for raw sequences that contain part of the constant domain, the nucleotide sequence of that part is stored in the ‘c_region’ column. However, although this is changing, most entries in OAS currently either lack the constant domain or contain only a very short fragment of it, as seen in the following code snippet.
    sequence_data = pd.read_csv(data_unit, header=1)

    # Nucleotide sequence of the constant part, where available.
    sequence_data['c_region']
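Rather than eyeballing the column, the fragment lengths can be summarised. This is a minimal sketch with a made-up helper name, c_region_lengths, that treats a missing entry as a fragment of length 0.

```python
import pandas as pd


def c_region_lengths(sequences, column="c_region"):
    """Length in nucleotides of each stored fragment; missing entries count as 0."""
    return sequences[column].fillna("").astype(str).str.len()
```

With this, c_region_lengths(sequence_data).describe() summarises the fragment lengths, and (c_region_lengths(sequence_data) == 0).mean() gives the fraction of sequences without any fragment.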
OAS is truly an amazing dataset with many potential use cases. I therefore hope these tricks are useful and can help make exploring/querying OAS easier!
References
[1] Kovaltsuk, A. et al. Observed Antibody Space: A Resource for Data Mining Next-Generation Sequencing of Antibody Repertoires. J Immunol. 2018; 201 (8): 2502–2509. https://doi.org/10.4049/jimmunol.1800708
[2] Olsen, TH, Boyles, F, Deane, CM. Observed Antibody Space: A diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences. Protein Science. 2022; 31: 141–146. https://doi.org/10.1002/pro.4205