Category Archives: Machine Learning

Is bigger better?

Recent work in Natural Language Processing (NLP) indicates that the bigger your model is, the better performance you will get. Kaplan et al. show that loss scales as a power law with model size, dataset size, and the amount of compute used for training.

Kaplan, Jared, et al. “Scaling laws for neural language models.” arXiv preprint arXiv:2001.08361 (2020).
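
To make "scales as a power law" concrete, here is a minimal sketch that fits loss ≈ a · N^(−α) to a handful of points by ordinary least squares in log-log space. The numbers are synthetic and purely illustrative (they are not taken from the paper), and `fit_power_law` is a hypothetical helper name.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss ≈ a * size**(-alpha) via least squares on log(loss) vs log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope and intercept of the straight line in log-log space.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, noise-free data: made-up constants, NOT values from Kaplan et al.
true_a, true_alpha = 10.0, 0.1
model_sizes = [10 ** k for k in range(5, 10)]
losses = [true_a * n ** (-true_alpha) for n in model_sizes]

a_hat, alpha_hat = fit_power_law(model_sizes, losses)
print(f"a = {a_hat:.3f}, alpha = {alpha_hat:.3f}")
```

A power law shows up as a straight line on a log-log plot, which is why a linear fit on the logarithms recovers the exponent. With real training runs the points would be noisy and the fit only approximate.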

Better understanding of correlation

Although correlation is often used to mean the linear relationship between two sets of points, in the following text I will use it more broadly to mean any relationship between two sets of points.

You have tasked yourself with finding the correlation between the different features in your dataset. Your purpose could be to remove highly correlated features or just to improve your understanding of your data. Either way, calculating the Pearson Correlation Coefficient (PCC) or Spearman’s rank Correlation Coefficient (SCC) to get an overview of the correlations might be the first thing that comes to mind.

Unfortunately, both of these are limited: PCC to linear relationships and SCC to monotonic ones. In datasets with many complex features, many of them will be highly correlated, just not linearly (or monotonically). Instead, these correlations can be non-linear and, as seen in the third row of the figure below, are not detected by PCC.

Figure: PCC of different sets of x and y points. https://en.wikipedia.org/wiki/Correlation_and_dependence
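
The limitation is easy to reproduce on toy data. The sketch below uses hand-rolled, standard-library-only versions of both coefficients (the helper names `pearson`, `average_ranks` and `spearman` are my own, and in practice you would more likely reach for a statistics library). It compares a monotonic but non-linear relationship with a symmetric parabola.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


xs = [float(x) for x in range(-5, 6)]
monotonic = [math.exp(x) for x in xs]  # non-linear but monotonic
parabola = [x ** 2 for x in xs]        # fully determined by x, not monotonic

for name, ys in [("exp(x)", monotonic), ("x^2", parabola)]:
    print(f"{name}: PCC = {pearson(xs, ys):.2f}, SCC = {spearman(xs, ys):.2f}")
```

For the exponential, SCC picks up the perfect monotonic relationship while PCC understates it. For the parabola, both coefficients are essentially zero even though y is completely determined by x.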

The Coronavirus Antibody Database: 10 months on, 10x the data!

Back in May 2020, we released the Coronavirus Antibody Database (‘CoV-AbDab’) to capture molecular information on existing coronavirus-binding antibodies, and to track what we anticipated would be a boon of data on antibodies able to bind SARS-CoV-2. At the time, we had found around 300 relevant antibody sequences and a handful of solved crystal structures, most of which were characterised shortly after the SARS-CoV epidemic of 2003. We had no idea just how many SARS-CoV-2 binding antibody sequences would come to be released into the public domain…

10 months later (2nd March 2021), we have now tracked 2,673 coronavirus-binding antibodies, ~95% with full Fv sequence information and ~5% with solved structures. These datapoints originate from hundreds of independent studies reported in either the academic literature or patent filings.

The entire contents of the CoV-AbDab database as of 2nd March 2021.

Tracking machine learning projects with Weights & Biases

Optimising machine learning models requires extensive comparison of architectures and hyperparameter combinations. There are many frameworks that make logging and visualising performance metrics across model runs easier. I recently started using Weights & Biases. In the following, I give a brief overview of some basic code snippets to add to your machine learning Python code to get started with this tool.


CASP14: what Google DeepMind’s AlphaFold 2 really achieved, and what it means for protein folding, biology and bioinformatics

Disclaimer: this post is an opinion piece based on the experience and opinions derived from attending the CASP14 conference as a doctoral student researching protein modelling. When provided, quotes have been extracted from my notes of the event, and while I hope to have captured them as accurately as possible, I cannot guarantee that they are a word-by-word facsimile of what the individuals said. Neither the Oxford Protein Informatics Group nor I accept any responsibility for the content of this post.

You might have heard it from the scientific or regular press, perhaps even from DeepMind’s own blog. Google’s AlphaFold 2 indisputably won the 14th Critical Assessment of protein Structure Prediction (CASP) competition, a biennial blind test where computational biologists try to predict the structure of several proteins whose structure has been determined experimentally but not yet publicly released. Their results are so incredibly accurate that many have hailed this code as the solution to the long-standing protein structure prediction problem.


BioDataScience101: a fantastic initiative to learn bioinformatics and data science

Last Wednesday, I was fortunate enough to be invited as a guest lecturer to the 3rd BioDataScience101 workshop, an initiative spearheaded by Paolo Marcatili, Professor of Bioinformatics at the Technical University of Denmark (DTU). This session, on amino acid sequence analysis applied to both proteomics and antibody drug discovery, was designed and organised by OPIG’s very own Tobias Olsen.


NeurIPS 2020: Chemistry / Biology papers

Another blog post, another look at accepted papers for a major ML conference. NeurIPS joins the other major machine learning conferences (and others) in moving virtual this year, running from 6th – 12th December 2020. In a continuation of past posts (ICML 2020, NeurIPS 2019), I will highlight several papers of potential interest to the chem-/bio-informatics communities.

The list of accepted papers can be found here, with 1,903 papers accepted out of 9,467 submissions (20% acceptance rate).

In addition to the main conference, there are several workshops highly related to the type of research undertaken in OPIG: Machine Learning in Structural Biology and Machine Learning for Molecules.

The usual caveat: given the large number of papers, these were selected either by “accident” (i.e. I stumbled across them in one way or another) or through a basic search (e.g. Ctrl+f “molecule”). If you find any I have missed, please reach out and I will update accordingly.


Learning from Biased Datasets

Both the beauty and the downfall of learning-based methods is that the data used for training will largely determine the quality of any model or system.

While there have been numerous algorithmic advances in recent years, the most successful applications of machine learning have been in areas where either (i) you can generate your own data in a fully understood environment (e.g. AlphaGo/AlphaZero), or (ii) data is so abundant that you’re essentially training on “everything” (e.g. GPT2/3, CNNs trained on ImageNet).

This covers only a narrow range of applications, with most data not falling into one of these two categories. Unfortunately, when this is true (and even sometimes when you are in one of those rare cases) your data is almost certainly biased – you just may or may not know it.


No labels, no problem! A quick introduction to Gaussian Mixture Models

Statistical Modelling Big Data Analytics™ is in vogue at the moment, and there’s nothing quite so fashionable as the neural network. Capable of capturing complex non-linear relationships and scalable for high-dimensional datasets, they’re here to stay.

For your garden-variety neural network, you need two things: a set of features, X, and a label, Y. But what do you do if labelling is prohibitively expensive or your expert labeller goes on holiday for 2 months and all you have in the meantime is a set of features? Happily, we can still learn something about the labels, even if we might not know what they are!


K-Means clustering made simple

The 21st century is often referred to as the age of “Big Data” due to the unprecedented increase in the volumes of data being generated. As most of this data comes without labels, making sense of it is a non-trivial task. To gain insight from unlabelled data, unsupervised machine learning algorithms have been developed and continue to be refined. These algorithms determine underlying relationships within the data by grouping data points into cluster families. The resulting clusters not only highlight associations within the data, but they are also critical for creating predictive models for new data.
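
As a rough illustration of what "grouping data points into clusters" looks like in code, here is a minimal sketch of Lloyd's algorithm, the textbook formulation of K-means, using only the standard library. The toy points, the choice of starting centroids and the function name `kmeans` are assumptions made for this example. Real projects would normally use a library implementation with a smarter initialisation.

```python
import math


def kmeans(points, initial_centroids, n_iter=100):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid as is
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, labels


# Toy, hand-made data: two visually obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, initial_centroids=[points[0], points[3]])
print("centroids:", centroids)
print("labels:", labels)

# A new, unseen point is assigned to the cluster with the nearest centroid.
new_point = (8.2, 7.9)
new_label = min(range(len(centroids)), key=lambda j: math.dist(new_point, centroids[j]))
print("new point goes to cluster", new_label)
```

The last few lines show the predictive use mentioned above: once the centroids are fixed, a new point can be labelled simply by finding its nearest centroid.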
