The Observed Antibody Space (OAS) [1,2] is an amazing resource for investigating observed antibodies or for training antibody-specific models; however, its size (over 2.4 billion unpaired and 1.5 million paired antibody sequences as of June 2023) can make it painful to work with. Additionally, OAS is extremely information rich, with nearly 100 columns for each antibody heavy or light chain, which further complicates handling the data.
From spending a lot of time working with OAS, I wanted to share a few tricks and insights, which I hope will reduce the pain and increase the joy of working with OAS!
From 19th-22nd February I was fortunate enough to participate in the joint Keystone Symposium on Next-Generation Antibody Therapeutics and Multispecific Immune Cell Engagers, held in Banff, Canada. Now in their 51st year, the Keystone Symposia are a comprehensive programme of scientific conferences spanning the full range of topics relating to human health, from studies on fundamental bodily processes through to drug discovery.
The lineup for the Royal Society of Chemistry’s 5th “Artificial Intelligence in Chemistry” Symposium (Thursday-Friday, 1st-2nd September 2022) is now complete for both oral and poster presentations. It really is a fantastic selection of topics and speakers and it is clear this event is now a highlight of the scientific calendar. Our very own Prof. Charlotte M. Deane, MBE will be giving a keynote.
It marks a return to in-person meetings: it will be held at Churchill College, Cambridge, with a conference dinner at Trinity Hall.
Yep, it is very well known that the sugar coating (aka glycosylation) of viruses makes them invisible to the immune system. The strategy is so effective that HIV, whose spike is almost entirely covered by glycans, is very difficult for the human immune system to target.
Unsurprisingly, coronaviruses such as SARS, MERS, and SARS-CoV-1(2) benefit from this evolutionary strategy. There is now also evidence that the sugars stabilise their spikes by glueing the spike chains together, which lets the spikes bind effectively and hence makes the viruses infectious.
This is the major finding of this paper, which presents very interesting results from all-atom MD simulations of a fully glycosylated model of the SARS-CoV-2 spike protein embedded in a realistic viral membrane. The researchers looked into the stability of the spike protein chains (A, B, and C) in the “open” and “closed” conformations, and at how this stability changed upon mutating key residues, to test how glycans sitting in the inter-chain space affect it. They also aimed to quantify the glycans’ shielding effect against molecules ranging from 2 to 15 Angstroms, i.e., from small-sized to peptide- and antibody-sized molecules.
Eve, Brennan and I were delighted to attend the sixth AIRR (adaptive immune receptor repertoire) Community Meeting: Exploring New Frontiers in San Diego. Eve and I had been awaiting this meeting for a mere 3 years, since it was announced during the last in-person AIRR Community Meeting back in 2019. Fortunately, San Diego did not disappoint.
After a rocky start (featuring many hours stuck in traffic on the M40, one missed flight and one delayed flight), we made it to California! The three day conference had ~230 participants (remote and in-person) and featured great talks from academia and industry. We particularly enjoyed keynote talks from Dennis Burton on rational vaccine design using broadly neutralising antibodies, Gunilla Karlsson Hedestam on functional consequences of allelic variation, Shane Crotty on covid and HIV vaccine design, and Atul Butte on uses of electronic health record data and how we should all found start-ups.
We had fun delivering a tutorial on OPIG antibody tools and, most importantly, we all won AIRR t-shirts in the raffle (potentially we were the only people who noticed how to enter on the conference app). Highlights outside of the conference included paddle boarding and seeing hummingbirds, pelicans, sealions, seals, ‘Garibaldi’ the state fish, and meeting Bob the golden retriever at a surfing shop. We’re now off to find jobs on the West Coast so we can live at the beach….
Between the 27th of April and the 1st of May, I was very fortunate to be able to attend the Antibodies as Drugs Keystone Symposium and give my first international conference talk, in which I spoke about the methods our group has developed for using structure to make predictions about where an antibody binds relative to other antibodies. These included paratyping [1], Ab-Ligity [2] and most recently SPACE [3].
I will preface this by saying that lots of the work people spoke about was unpublished, which was so exciting, but makes for a difficult blog post to write. To avoid any possibility of putting my foot in my mouth I will keep the science very surface level. The conference was held at the Keystone resort in Colorado, and the science combined with a kind of landscape I have never experienced before made for an extremely cool experience. This meeting was originally combined with a protein design meeting, and the two were split by COVID – this meant that in-silico methods were the minority in the program, but I didn’t mind that as the computational work that was presented was quite diverse so it was definitely a good representation of the field still. I also really enjoyed the large number of infectious disease talks in which we got a good range of the major human pathogens – ebolaviruses, SARS-CoV-2 of course, dengue, hantaviruses, metapneumovirus, HIV, TB and malaria all featured. The bispecific session was another highlight for me. The conference was very well organised and I liked how we were all asked to share a fun fact about ourselves – one speaker shared that he is a Christmas tree farmer in his spare time (I won’t share his name in case he is keeping that under wraps). That made me reconsider how fun I can truly consider myself…
Without turning this into a travel blog, I also want to add that Keystone was insanely beautiful and make you look at some pics I got.
Benford’s law is the observation that in numerical data (produced by many kinds of process), the leading digit tends to be small. Wikipedia tells you that in datasets obeying Benford’s law, the number 1 appears as the leading digit about 30% of the time while 9 appears less than 5% of the time (p(n) = log10(1+1/n), where n is the leading digit). Wikipedia further lists multiple kinds of data where this tends to be true, such as electricity bills, population numbers and physical and mathematical constants, and particularly data that can be described by a power law.
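To make the formula concrete, here is a minimal sketch that evaluates p(n) = log10(1+1/n) for each leading digit. The function name benford_probability is just an illustrative choice.

```python
import math


def benford_probability(digit: int) -> float:
    """Expected frequency of `digit` as the leading digit under Benford's law."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


# Print the expected frequency of each possible leading digit.
for d in range(1, 10):
    print(d, round(benford_probability(d), 3))
```

The nine probabilities sum to 1, with 1 at roughly 30% and 9 at a little under 5%, matching the numbers quoted above.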
Power laws and antibodies have been co-discussed in reference to network descriptions of antigen-experienced BCR repertoires [1], which are often described as scale-free to use the network terminology (following a power law). This means a few highly-connected nodes in the network and lots of nodes with few or no connections. This is an obvious candidate for Benford’s law.
This is of no practical relevance, but I wondered if I could see Benford’s law in other kinds of data besides clone counts in the Observed Antibody Space (OAS). For example, I looked at the leading digit in the number of sequences in all of the data units in OAS. It looks like a good fit for Benford’s law (though with more density at the smaller leading digits) and has a chi-squared value of 0.007 (Figure 1A).
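A comparison of this kind can be sketched roughly as follows. This is not the exact calculation behind the figure: it assumes the chi-squared statistic is computed on leading-digit proportions rather than raw counts, and it uses powers of two as stand-in data instead of the real OAS data unit sizes. The function names are hypothetical.

```python
import math
from collections import Counter


def leading_digit(n: int) -> int:
    """Return the first (most significant) digit of a non-zero integer."""
    return int(str(abs(n))[0])


def benford_chi_squared(values) -> float:
    """Chi-squared statistic between observed leading-digit proportions
    and the Benford expectation (assumption: proportions, not counts)."""
    digits = [leading_digit(v) for v in values if v != 0]
    if not digits:
        raise ValueError("need at least one non-zero value")
    observed = Counter(digits)
    total = len(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)
        proportion = observed[d] / total
        stat += (proportion - expected) ** 2 / expected
    return stat


# Stand-in data: powers of two are a classic Benford-like sequence.
toy_counts = [2 ** k for k in range(1, 200)]
print(benford_chi_squared(toy_counts))
```

To try this on OAS, you would replace toy_counts with the number of sequences in each data unit; smaller values indicate a closer match to Benford’s law.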
Last year, the Structural Antibody Database (SAbDab) listed a record-breaking 894 new antibody structures, driven in no small part by the continued efforts of researchers to understand SARS-CoV-2.
Fig. 1: The aggregate growth in antibody structure data (all methods) over time. Taken from http://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/stats/ on 25th May 2022.
In this blog post I wanted to highlight the major driving force behind this curve – the huge increase in cryo electron microscopy (cryoEM) data – and the implications of this for the field of structure-based antibody informatics.
When writing code, you have inevitably needed to store data throughout your pipeline. In these cases you store your value, list or data frame as a variable to easily use it elsewhere in your code. However, sometimes your data has an awkward form, consisting of a number of different length lists or data of different types and sizes. While it is still doable to work with, and using tuples or dictionaries can help, accessing different elements in your data quickly becomes messy and it is less intuitive what your code is actually doing.
To solve the above-stated problem, data classes were introduced as a new feature in Python 3.7. A data class is a regular Python class, but with certain methods already implemented for you. This makes data classes easy to create and removes a lot of boilerplate (repeated code), making your code simpler, more intuitive and prettier. Further, as data classes are part of the standard library, you can import them directly without needing to install any external dependencies (noice).
With the sales pitch out of the way, let us look at how we can use data classes.
from dataclasses import dataclass
from typing import Any


@dataclass
class Antibody:
    vgene: str
    jgene: None
    sequence: Any = 'EVQ'
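As a quick illustration of the methods a data class implements for you, here is a small self-contained sketch. It redefines Antibody with plain str annotations for simplicity, and the gene names are just example values.

```python
from dataclasses import dataclass


@dataclass
class Antibody:
    vgene: str
    jgene: str
    sequence: str = "EVQ"  # default used when no sequence is given


# The generated __init__ accepts keyword or positional arguments.
ab = Antibody(vgene="IGHV1-69", jgene="IGHJ4")
same = Antibody("IGHV1-69", "IGHJ4", "EVQ")

print(ab)           # generated __repr__ shows the class name and all fields
print(ab == same)   # generated __eq__ compares field by field
print(ab.sequence)  # fields are accessed by name
```

Without the decorator you would have to write __init__, __repr__ and __eq__ yourself to get the same behaviour.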