Category Archives: Machine Learning

Highlights from the European Antibody Congress 2021

Last month, I was fortunate enough to attend (in person!) and present at the Festival of Biologics European Antibody Congress (9–11 November 2021) in Basel, Switzerland. The Festival of Biologics is an annual conference that brings together researchers from industry and academia. It was an excellent opportunity to learn about exciting research and meet people working in the antibody development field.

Here are some of my highlights from the European Antibody Congress, with a focus on antibody design and engineering:

Continue reading

New review on BCR/antibody repertoire analysis out in MAbs!

In our latest immunoinformatics review, OPIG has teamed up with experienced antibody consultant Dr. Anthony Rees to outline the evidence for BCR/antibody repertoire convergence on common epitopes post-pathogen exposure, and all the ways we can go about detecting it from repertoire gene sequencing data. We highlight the new advances in the repertoire functional analysis field, including the role for OPIG’s latest tools for structure-aware antibody analytics: Structural Annotation of AntiBody repertoires+ (SAAB+), Paratyping, Ab-Ligity, Repertoire Structural Profiling & Structural Profiling of Antibodies to Cluster by Epitope (‘SPACE’).

Continue reading

A logical brain teaser to derail your afternoon

Brain teasers have a strange power. For many they evoke nothing more than a mild and transient sense of curiosity. But for a certain subset of people they create an irresistible intellectual temptation, one that at times must be actively avoided so as not to completely derail conversations and take over whole afternoons.

For better or worse, I am in the camp of people who are highly susceptible to brain teasers. I just love them too much. More than once I have had to ask a friend not to tell me about a particular brain teaser they had heard, because I knew it would inevitably take over my mind and send me down an almost hypnotic spiral of thoughts whose only escape would be finding the solution.

While brain teasers can admittedly turn into ridiculously powerful distractions for some of us, they are not necessarily a waste of time. They have high recreational value and help the mind enter a playful and creative state. They serve as mental gymnastics that directly train logical thinking, and logical thinking is arguably one of the most powerful transferable skills there is. And last but not least, brain teasers are nowadays a staple of job interviews at some of the world’s top employers (Google, Facebook, Microsoft, prestigious hedge funds, …).

In this post, I will present one of my favourite brain teasers to see if I can get you hooked. It is a slightly modified and self-contained version of the so-called pirate game. You can find the solution at the end of the page. Enjoy responsibly! Continue reading

Issues with graph neural networks: the cracks are where the light shines through

Deep convolutional neural networks have led to astonishing breakthroughs in the area of computer vision in recent years. The reason for the extraordinary performance of convolutional architectures in the image domain is their strong ability to extract informative high-level features from visual data. For prediction tasks on images, this has led to superhuman performance in a variety of applications and to an almost universal shift from classical feature engineering to differentiable feature learning.

Unfortunately, the picture is not quite as rosy yet in the area of molecular machine learning. Feature learning techniques which operate directly on raw molecular graphs without intermediate feature-engineering steps have only emerged in the last few years in the form of graph neural networks (GNNs). GNNs, however, still have not managed to definitively outcompete and replace more classical non-differentiable molecular representation methods such as extended-connectivity fingerprints (ECFPs). There is an increasing awareness in the computational chemistry community that GNNs have not quite lived up to the initial hype and still suffer from a number of technical limitations.
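To make the contrast concrete, here is a minimal sketch of how a classical, non-differentiable ECFP representation is computed with RDKit (assuming RDKit is installed; the molecule and fingerprint settings are purely illustrative):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Illustrative molecule (aspirin) given as a SMILES string
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# ECFP4-style fingerprint: Morgan algorithm with radius 2,
# hashed into a fixed-length bit vector of 2048 bits
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

# The result is a fixed, non-learned feature vector that can be
# passed to any classical ML model, unlike a learned GNN embedding
print(fp.GetNumOnBits(), "bits set out of", fp.GetNumBits())
```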

Continue reading

How to interact with small molecules in Jupyter Notebooks

The combination of Python and the cheminformatics toolkit RDKit has opened up so many ways to explore chemistry on a computer. Jupyter — named for the three languages, Julia, Python, and R — ties interactivity and visualization together, creating wonderful environments (Notebooks and JupyterLab) to carry out, share and reproduce research, including:

“data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.”

—https://jupyter.org

At this year’s annual RDKit UGM (User Group Meeting), Cédric Bouysset shared a tutorial explaining how to create a grid of molecules that you can interact with, using his “mols2grid”:
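For anyone who wants to try it before working through the tutorial, a minimal sketch might look like the following (mols2grid and pandas are assumed to be installed; the molecules and column names are just placeholders):

```python
import pandas as pd
import mols2grid

# A few illustrative molecules given as SMILES strings
df = pd.DataFrame({
    "SMILES": ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"],
    "Name": ["ethanol", "benzene", "aspirin"],
})

# Inside a Jupyter notebook this renders an interactive, filterable grid
# of 2D depictions that can be sorted, selected and exported
mols2grid.display(df, smiles_col="SMILES", subset=["img", "Name"])
```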

Continue reading

AlphaFold 2 is here: what’s behind the structure prediction miracle

Nature has now released the AlphaFold 2 paper, after eight long months of waiting. The main text reports more or less what we have known for nearly a year, with some added tidbits, although it is accompanied by a painstaking description of the architecture in the supplementary information. Perhaps more importantly, the authors have released the entirety of the code, including all details to run the pipeline, on GitHub. And there is no small print this time: you can run inference on any protein (I’ve checked!).

Have you not heard the news? Let me refresh your memory. In November 2020, a team of AI scientists from Google DeepMind indisputably won the 14th Critical Assessment of protein Structure Prediction (CASP14), a biennial blind test where computational biologists try to predict the structures of several proteins that have been determined experimentally but not yet publicly released. Their results were so astounding, and the problem so central to biology, that it took the entire world by surprise and left an entire discipline, computational biology, wondering what had just happened.

Continue reading

Out-of-distribution generalisation and scaffold splitting in molecular property prediction

The ability to successfully apply previously acquired knowledge to novel and unfamiliar situations is one of the main hallmarks of successful learning and general intelligence. This capability to effectively generalise is amongst the most desirable properties a prediction model (or a mind, for that matter) can have.

In supervised machine learning, the standard way to evaluate the generalisation power of a prediction model for a given task is to randomly split the whole available data set X into two sets – a training set X_{\text{train}} and a test set X_{\text{test}}. The model is then trained on the examples in the training set X_{\text{train}}, and its prediction abilities are subsequently measured on the untouched examples in the test set X_{\text{test}} via a suitable performance metric.
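As a concrete illustration of this procedure, a random split takes only a couple of lines (scikit-learn is used here purely for convenience; the feature matrix and labels are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: e.g. molecular fingerprints and measured properties
X = np.random.rand(1000, 2048)
y = np.random.rand(1000)

# Random 80/20 split into a training set and a held-out test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```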

Since in this scenario the model has never seen any of the examples in X_{\text{test}} during training, its performance on X_{\text{test}} must be indicative of its performance on novel data X_{\text{new}} which it will encounter in the future. Right?

Continue reading

CAML: Courses in Applied Machine Learning

*Shameless self-promotion klaxon!! Have a look at my new website!*

I’m excited to share a project I’ve been working on for the past few months! One of the biggest challenges of working on an interdisciplinary research project is getting to grips with the core principles of the disciplines which you don’t have much formal training in. For me, that means learning the basics of Medicinal Chemistry and Structural Biology so that when someone mentions pi-stacking I don’t think they’re talking about the logistics of managing a bakery; for people coming from Bio/Chem backgrounds it can mean understanding the Maths and Statistics necessary to make sense of the different algorithms which are central to their work.

Continue reading

Can few-shot language models perform bioinformatics tasks?

In 2019, I tried my hand at using large language models, specifically GPT-2, for text generation. In that blogpost, I used Hansard files to fine-tune the public release of GPT-2 to generate speeches by several speakers in the House of Commons (link).

In 2020, OpenAI released GPT-3, their new and improved text generation model (paper), which uses a whopping 175 billion parameters (as opposed to its predecessor’s 1.5 billion) and not only proved capable of state-of-the-art performance on common text prediction benchmarks, but also generated a considerable amount of interest in the news media:

Continue reading

Bioinformatics Hackathon Reflection

A week ago I participated in the Copenhagen Bioinformatics Hackathon 2021, a hackathon focusing on machine learning and proteins, as a mentor for a challenge proposed by our group. The whole experience was fun, but I am also sitting here contemplating a lot of things I wish I had done differently. In this blog post, I therefore want to highlight two changes which I believe would have greatly improved my challenge and which can hopefully also serve as inspiration for others presenting a hackathon challenge.

Going into this event I had some experience from a few hackathons I had previously attended. Based on this, I wanted to create a challenge with two parts: first, a simple task which everyone would be able to create a solution for, and second, a more challenging extension of the first task for more experienced participants. I decided to go with the challenge of predicting which heavy and light chains can form a pair, where the additional challenge was to try to visualize which residues were relevant for this interaction. With OAS providing a really nice positive dataset of paired chains, I thought this was going to be an amazing challenge, but as soon as the event began I started seeing its flaws.
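For context, one minimal way the simple part of the challenge could be framed is as a binary classification problem. This is an illustrative sketch, not the reference solution given to participants: the pseudo-negatives made by shuffling light chains, the amino-acid-composition features and the classifier choice are all placeholders.

```python
import random
import numpy as np
from sklearn.linear_model import LogisticRegression

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Simple amino-acid composition vector for one chain sequence."""
    counts = np.array([seq.count(aa) for aa in AMINO_ACIDS], dtype=float)
    return counts / max(len(seq), 1)

def build_dataset(true_pairs, seed=0):
    """true_pairs: list of (heavy_seq, light_seq) tuples, e.g. from OAS.
    Negatives are made by shuffling the light chains (a crude heuristic:
    a shuffled partner may occasionally still be a compatible pair)."""
    rng = random.Random(seed)
    shuffled_lights = [light for _, light in true_pairs]
    rng.shuffle(shuffled_lights)
    X, y = [], []
    for (heavy, light), wrong_light in zip(true_pairs, shuffled_lights):
        X.append(np.concatenate([composition(heavy), composition(light)]))
        y.append(1)  # observed pairing
        X.append(np.concatenate([composition(heavy), composition(wrong_light)]))
        y.append(0)  # shuffled pseudo-negative
    return np.array(X), np.array(y)

# X, y = build_dataset(true_pairs)
# clf = LogisticRegression(max_iter=1000).fit(X, y)
```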

Continue reading