Today brings another (potentially penultimate) blog post from me. I would like to use this opportunity to introduce our recent update to the Observed Antibody Space (OAS) resource.
The OAS resource was built with the structural viability of antibody sequences in mind: researchers can download amino acid sequences that are already numbered and cleaned. The following structural filtering steps are performed: 1) checking for HMM alignment, 2) checking for indels, 3) checking for the presence of conserved positions, 4) checking for chimeric sequences. These steps ensure high-quality sequences with all three complementarity-determining regions (CDRs) present.
These filtering steps can reduce the size of the antibody repertoire, which may be undesirable, especially for researchers who do not need such strict filtering criteria. Therefore, instead of removing those sequences, we now provide a short description of each sequence's viability.
We have also decided to use a new file format for all new studies that we add to the OAS resource. This format complies with the minimal AIRR community [1] requirements, and we provide the full tab-delimited IgBLAST outputs [2].
So far, we have added only a single study in this new format (Nielsen et al., 2020) [3]. We are planning to release more studies next month.
Below, I provide code snippets showing how to work with the new OAS format.
Processing the new OAS format
Metadata
First, we import the Python packages and load the data unit metadata. The metadata is stored as the first row of the file.
Note that in the new format we have MiAIRR == “yes”
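Here is a minimal sketch of reading that first row. It assumes the metadata line is a plain JSON object; if the real files quote or encode it differently, the parsing step would need adjusting. The helper name `read_metadata`, the demo file and the `Example_key` field are made up for illustration.

```python
import gzip
import json
import os
import tempfile


def read_metadata(path):
    """Return the metadata stored on the first line of a data unit.

    Assumption: the first line is a JSON object.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        first_line = handle.readline().strip()
    return json.loads(first_line)


# Tiny made-up data unit so that the sketch runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_unit.tsv")
with open(demo_path, "w") as handle:
    handle.write(json.dumps({"MiAIRR": "yes", "Example_key": "example"}) + "\n")
    handle.write("sequence_alignment_aa\tANARCI_status\n")
    handle.write("EVQLVESGGG\tgood\n")

metadata = read_metadata(demo_path)
print(metadata["MiAIRR"])  # prints: yes
```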
Loading the data
To load the numbered outputs, we need to skip the first row (the metadata).
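A minimal sketch of this step with pandas, assuming the table that follows the metadata row is tab-delimited and starts with a header line. The file written here is a made-up stand-in for a real data unit, and the sequences and isotype labels in it are illustrative only.

```python
import json
import os
import tempfile

import pandas as pd

# Write a tiny stand-in data unit: metadata row, header row, two records.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_unit.tsv")
with open(demo_path, "w") as handle:
    handle.write(json.dumps({"MiAIRR": "yes"}) + "\n")
    handle.write("sequence_alignment_aa\tIsotype\tRedundancy\tANARCI_status\n")
    handle.write("EVQLVESGGG\tIGHG\t3\tgood\n")
    handle.write("QVQLQESGPG\tIGHM\t1\tgood\n")

# skiprows=1 drops the metadata row, so the header is taken from the second line.
df = pd.read_csv(demo_path, sep="\t", skiprows=1)
print(df.shape)  # prints: (2, 4)
```

On a real data unit, `df.head()` would show the full set of columns rather than the four used in this toy file.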
The output is the familiar IgBLAST output. We also supply five extra columns covering the antibody isotype and the ANARCI numbering information:
- c_region – nucleotide sequence isolated after framework 4
- Isotype – isotype identified by running a Smith-Waterman local alignment on c_region
- Redundancy – count after sequence collapse (on sequence_alignment_aa and isotype)
- ANARCI_numbering – ANARCI numbering output of sequence_alignment_aa
- ANARCI_status – whether the sequence has any liabilities. In our case, the first five sequences were good (no liabilities)
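To make the Redundancy column more concrete, here is a sketch of what collapsing on sequence_alignment_aa and isotype amounts to. This illustrates the idea only and is not the OAS pipeline code; the reads and isotype labels are made up.

```python
import pandas as pd

# Made-up reads: the first two share both the amino acid sequence and the isotype.
reads = pd.DataFrame(
    {
        "sequence_alignment_aa": ["EVQLVESGGG", "EVQLVESGGG", "EVQLVESGGG", "QVQLQESGPG"],
        "Isotype": ["IGHG", "IGHG", "IGHM", "IGHM"],
    }
)

# Collapse identical (sequence, isotype) pairs and keep how often each occurred.
collapsed = (
    reads.groupby(["sequence_alignment_aa", "Isotype"])
    .size()
    .reset_index(name="Redundancy")
)
print(collapsed)
```

The same amino acid sequence observed with two different isotypes stays as two rows, each with its own count.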
ANARCI_status column
The ANARCI_status column shows whether a given antibody amino acid sequence has any liabilities. Below you can see a list of all unique liability combinations encountered in the current data unit. We can see that many of the liabilities concern indels.
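One way to produce such a list is to count the distinct values of the column. The sketch below uses a toy DataFrame; the liability strings in it are placeholders and not real OAS status values.

```python
import pandas as pd

# Toy stand-in for a loaded data unit; the liability strings are placeholders.
df = pd.DataFrame(
    {
        "ANARCI_status": [
            "good",
            "good",
            "['placeholder liability A']",
            "good",
            "['placeholder liability A', 'placeholder liability B']",
        ]
    }
)

# Each distinct status string is one liability combination.
status_counts = df["ANARCI_status"].value_counts()
print(status_counts)
```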
If you want to work only with the highest-quality sequence dataset:
df = df.query("ANARCI_status == 'good'")
You can also keep sequences flagged with “[‘fwh4 length is shorter IMGT defined’]” if you do not care about missing residues at the antibody C-terminus.
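A sketch of that more relaxed filter, using a toy DataFrame. The status string is copied from the sentence above with straight quotes assumed, so it is worth checking it against the values in your own data unit; the third status is a placeholder.

```python
import pandas as pd

# Assumption: the status is stored exactly like this, with straight quotes.
SHORT_FWH4 = "['fwh4 length is shorter IMGT defined']"

df = pd.DataFrame(
    {
        "sequence_alignment_aa": ["EVQLVESGGG", "QVQLQESGPG", "QVQLVQSGAE"],
        "ANARCI_status": ["good", SHORT_FWH4, "['placeholder liability']"],
    }
)

# Keep clean sequences plus those whose only liability is a short framework 4.
accepted = ["good", SHORT_FWH4]
relaxed = df[df["ANARCI_status"].isin(accepted)]
print(len(relaxed))  # prints: 2
```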
Importance of structural filtering
Below, I provide one example where the IgBLAST productive column == True, but the ANARCI_status column indicates that the sequence is Chimeric.
I define a sequence as “Chimeric” when a large portion of it cannot be numbered with ANARCI [4] after it has been aligned with IgBLAST. In many cases, two antibody sequences are fused and ANARCI numbers only the first one. In this case, a frameshift went unnoticed by IgBLAST, which resulted in only half of the amino acid sequence being numbered with ANARCI.
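To look for such cases yourself, you can select rows that IgBLAST marks as productive but whose ANARCI_status mentions a chimera. The sketch below assumes the productive column has been parsed as booleans and that the status text contains the word "Chimeric"; the exact status string here is a placeholder.

```python
import pandas as pd

# Toy stand-in for a loaded data unit; the chimeric status text is a placeholder.
df = pd.DataFrame(
    {
        "productive": [True, True, False],
        "ANARCI_status": ["good", "['placeholder: Chimeric']", "good"],
    }
)

# Case-insensitive match, so "Chimeric" and "chimeric" are both caught.
is_chimeric = df["ANARCI_status"].str.contains("chimeric", case=False, na=False)

# Productive according to IgBLAST, yet flagged by the structural check.
suspect = df[df["productive"] & is_chimeric]
print(suspect)
```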
References
[1] Vander Heiden, Jason Anthony, et al. “AIRR community standardized representations for annotated immune repertoires.” Frontiers in Immunology 9 (2018): 2206.
[2] Ye, Jian, et al. “IgBLAST: an immunoglobulin variable domain sequence analysis tool.” Nucleic Acids Research 41.W1 (2013): W34-W40.
[3] Nielsen, Sandra CA, et al. “B cell clonal expansion and convergent antibody responses to SARS-CoV-2.” (2020).
[4] Dunbar, James, and Charlotte M. Deane. “ANARCI: antigen receptor numbering and receptor classification.” Bioinformatics 32.2 (2016): 298-300.