In this so-called ‘big data’ age, the quest to find the signal amidst the noise is becoming more difficult than ever. Though we have sophisticated systems that can extract and parse data incredibly efficiently, the amount of noise has grown just as much, if not more, masking the signals that we crave. Oddly enough, it sometimes seems that we are churning through and gathering a vast amount of data just for the sake of it, rather than looking for highly relevant, high-quality data.
One such example is antibody (Ab) binding data. Even though there are several Ab-specific databases (e.g. abYsis, IMGT), none of these, to our knowledge, has any information on an Ab’s binding affinity to its antigen (Ag), despite the fact that an Ab’s affinity is one of the few quantitative metrics of its performance. Gathering Ab binding data would therefore not only help us to create more accurate models of Ab binding; in the long term, it would also facilitate the in silico maturation and design/re-design of Abs. If this seems like a dream, have a read of this paper: they made an incredibly effective Ab using computationally-inspired methods.
Given the tools at our disposal, and the fact that several protein-protein binding databases are available in the public domain, this task may seem somewhat trivial. However, there’s the ever-present issue of gathering only the highest-quality data points, which is what the applications mentioned earlier depend on.
Over the past few weeks, we have gathered the binding data for 228 Ab-Ag complexes across two major protein-protein binding databases: PDBbind and the structure-based benchmark from Kastritis et al. Ultimately, 36 entries were removed from further analyses as they had irrelevant data (e.g. IC50 instead of KD; IC50 relates to inhibition, which is not the same as the Ab’s affinity for its Ag). With this dataset, we performed some initial tests on existing energy functions and docking programs to see if there is any correlation between the programs’ scores and protein binding affinities.
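To make the filtering step above a bit more concrete, here is a minimal sketch of how entries reporting something other than KD might be dropped, and how the remaining KD values could be put on a free-energy scale using the standard relation ΔG = RT ln(KD). The field names (`complex`, `measure`, `value_molar`) and the three example entries are made up for illustration; they are not the actual benchmark records, and the conversion to ΔG is a common convention rather than something we necessarily did in the analysis.

```python
import math

# Gas constant in kcal/(mol*K).
R_KCAL = 0.0019872


def kd_to_delta_g(kd_molar, temperature_k=298.15):
    """Convert a dissociation constant (in mol/L) to a binding free
    energy in kcal/mol, using dG = RT ln(KD)."""
    if kd_molar <= 0:
        raise ValueError("KD must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)


def keep_kd_entries(entries):
    """Keep only entries whose reported measure is a KD.

    IC50 (and anything else) is dropped, since it describes
    inhibition rather than the Ab's affinity for its Ag.
    """
    return [e for e in entries if e["measure"] == "KD"]


# Hypothetical entries: names and values are illustrative only.
entries = [
    {"complex": "example_1", "measure": "KD", "value_molar": 1e-9},
    {"complex": "example_2", "measure": "IC50", "value_molar": 5e-8},
    {"complex": "example_3", "measure": "KD", "value_molar": 1e-6},
]

kept = keep_kd_entries(entries)
delta_g = {e["complex"]: kd_to_delta_g(e["value_molar"]) for e in kept}

for name, dg in delta_g.items():
    print(f"{name}: {dg:.2f} kcal/mol")
```

Working on a log scale like this is convenient because KD values span several orders of magnitude, whereas scoring functions typically produce energy-like numbers.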
As the graphs show, there is no clear correlation between a program/function’s score and the affinity of an Ab. Having said this, these programs were trained on general protein-protein interfaces (though that does occasionally include Abs!), so we trained DCOMPLEX and RAPDF specifically on Ab structures (~130 structures). The end results were poor nonetheless (top-centre and top-right graphs, above), but the interatomic heatmaps show clear differences in the interaction patterns between Ab-Ag interfaces and general protein-protein interfaces.
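For readers who want to try this kind of check themselves, the comparison between a program’s scores and the measured affinities boils down to a Pearson correlation coefficient. The sketch below computes it from scratch; the `scores` and `affinities` lists are invented numbers purely to show the call, not our results.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length
    sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration: one score and one affinity
# (e.g. a binding free energy in kcal/mol) per complex.
scores = [-10.2, -8.1, -12.5, -9.0, -7.3]
affinities = [-11.0, -12.3, -9.5, -10.1, -13.0]

r = pearson_r(scores, affinities)
print(f"Pearson r = {r:.2f}")
```

A value of r near zero is what ‘no correlation’ looks like in practice; a useful scoring function would push r towards 1 if lower scores are meant to track stronger binding on the same scale.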
Now, with this new information, the search for signals continues. It is evident that Ab binding differs markedly from general protein-protein binding. Therefore, the next step is to gather more high-quality data and see if there is any correlation between an Ab’s distinct binding mode and its affinity. However, we are not interested in just getting whatever affinity data is available. The rigorous standards we have applied over the past few weeks in building the current benchmark set must be maintained; otherwise we risk masking the signal with unnecessary noise.
Currently, the results are disappointing, but if the past few weeks in OPIG have taught me anything, it is that this is only the beginning of a long and difficult search for a good model. BUT this is what makes research so exciting! We learn from the low Pearson correlation coefficients, the (almost) random distribution of data, and the not-so-pretty plots of our data in order to form useful models for practical applications like Ab design. I think a quote from The Great Gatsby accurately ‘models’ my optimism for making sense of the incoming stream of data:
Gatsby believed in the green light, the orgastic future that year by year recedes before us. It eluded us then, but that’s no matter — to-morrow we will run faster, stretch out our arms farther. . . . And one fine morning ——
So we beat on, boats against the current, borne back ceaselessly into the past.