Category Archives: Machine Learning

How to turn a SMILES string into a molecular graph for PyTorch Geometric

Despite some of their technical issues, graph neural networks (GNNs) are quickly being adopted as one of the state-of-the-art methods for molecular property prediction. The differentiable extraction of molecular features from low-level molecular graphs has become a viable (although not always superior) alternative to classical molecular representation techniques such as Morgan fingerprints and molecular descriptor vectors.

But molecular data usually comes in the sequential form of labeled SMILES strings. It is not obvious to beginners how best to transform a SMILES string into a structured molecular graph object that can be used as an input for a GNN. In this post, we show how to convert a SMILES string into a molecular graph object which can subsequently be used for graph-based machine learning. We do so within the framework of PyTorch Geometric, which is currently one of the best and most commonly used Python-based GNN libraries.

We divide our task into three high-level steps:

  1. We define a function that maps an RDKit atom object to a suitable atom feature vector.
  2. We define a function that maps an RDKit bond object to a suitable bond feature vector.
  3. We define a function that takes as its input a list of SMILES strings and associated labels and then uses the functions from steps 1 and 2 to create a list of labeled PyTorch Geometric graph objects as its output.
Continue reading

Antibody Engineering and Therapeutics Conference

I was invited to speak at the Antibody Engineering and Therapeutics Conference (presenting Matt’s and my recently published epitope profiling paper) in San Diego (December 12th – 16th). Unfortunately, the pandemic had other ideas, so I decided not to travel, but luckily the conference was hybrid.

The conference included 1 day of pre-conference workshops and 4 days of presentations from academia and industry, with livestreaming of the initial keynotes (including one from Charlotte). The remaining talks were recorded and made available after the conference. I’ve highlighted a few of my favourite talks and conference themes, with links to papers where available.

Naturally, a lot of the presented research related to covid-19. I was speaking in the ‘Antibody Repertoires and Covid-19’ session, where there were interesting presentations from Professor Eline Luning Prak from the University of Pennsylvania and Elaine Chen from Vanderbilt University analysing antibody responses in covid-recovered individuals, and comparing vaccine responses in covid-recovered vs covid-naive individuals. Other talks around SARS-CoV-2 vaccines included Dr Laura Walker from Adimab/Adagio Therapeutics comparing BCR repertoire responses to different types of vaccinations, and the effect of using different booster types.

Continue reading

Highlights from the European Antibody Congress 2021

Last month, I was fortunate enough to be able to attend (in person!) and present at the Festival of Biologics European Antibody Congress (9-11 November, 2021) in Basel, Switzerland. The Festival of Biologics is an annual conference, which brings together researchers from industry and academia. It was an excellent opportunity to learn about exciting research and meet people working in the antibody development field.

Here are some of my highlights from the European Antibody Congress, with a focus on antibody design and engineering:

Continue reading

New review on BCR/antibody repertoire analysis out in mAbs!

In our latest immunoinformatics review, OPIG has teamed up with experienced antibody consultant Dr. Anthony Rees to outline the evidence for BCR/antibody repertoire convergence on common epitopes post-pathogen exposure, and all the ways we can go about detecting it from repertoire gene sequencing data. We highlight the new advances in the repertoire functional analysis field, including the role for OPIG’s latest tools for structure-aware antibody analytics: Structural Annotation of AntiBody repertoires+ (SAAB+), Paratyping, Ab-Ligity, Repertoire Structural Profiling & Structural Profiling of Antibodies to Cluster by Epitope (‘SPACE’).

Continue reading

A logical brain teaser to derail your afternoon

Brain teasers have a strange power. For many they evoke nothing more than a mild and transient sense of curiosity. But for a certain subset of people they create an irresistible intellectual temptation which at times even needs to be actively avoided so as not to completely derail conversations and take over whole afternoons.

For better or worse, I am in the camp of people who are highly susceptible to brain teasers. I just love them too much. More than once in my lifetime I have had to ask a friend not to tell me about a particular brain teaser they had heard about because I knew it would inevitably take over my mind and send me down an almost hypnotic spiral of thoughts whose only escape would be finding the solution.

While brain teasers can admittedly turn into ridiculously powerful distractions for some of us, they are not necessarily a waste of time. They have high recreational value and help the mind to enter a playful and creative state. They serve as mental gymnastics to directly train logical thinking skills, and logical thinking is arguably one of the most powerful transferable skills that exist. And last but not least, brain teasers are commonly used nowadays in job interviews at some of the world’s top employers (Google, Facebook, Microsoft, prestigious hedge funds, …).

In this post, I will present one of my favourite brain teasers to see if I can get you hooked. It is a slightly modified and self-contained version of the so-called pirate game. You can find the solution at the end of the page. Enjoy responsibly!

Continue reading

Issues with graph neural networks: the cracks are where the light shines through

Deep convolutional neural networks have led to astonishing breakthroughs in the area of computer vision in recent years. The reason for the extraordinary performance of convolutional architectures in the image domain is their strong ability to extract informative high-level features from visual data. For prediction tasks on images, this has led to superhuman performance in a variety of applications and to an almost universal shift from classical feature engineering to differentiable feature learning.

Unfortunately, the picture is not quite as rosy yet in the area of molecular machine learning. Feature learning techniques which operate directly on raw molecular graphs without intermediate feature-engineering steps have only emerged in the last few years in the form of graph neural networks (GNNs). GNNs, however, still have not managed to definitively outcompete and replace more classical non-differentiable molecular representation methods such as extended-connectivity fingerprints (ECFPs). There is an increasing awareness in the computational chemistry community that GNNs have not quite lived up to the initial hype and still suffer from a number of technical limitations.

Continue reading

How to interact with small molecules in Jupyter Notebooks

The combination of Python and the cheminformatics toolkit RDKit has opened up so many ways to explore chemistry on a computer. Jupyter — named for the three languages, Julia, Python, and R — ties interactivity and visualization together, creating wonderful environments (Notebooks and JupyterLab) to carry out, share and reproduce research, including:

“data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.”

—https://jupyter.org

At this year’s annual RDKit UGM (User Group Meeting), Cédric Bouysset shared a tutorial explaining how to create a grid of molecules that you can interact with, using his “mols2grid”:

Continue reading

AlphaFold 2 is here: what’s behind the structure prediction miracle

Nature has now released that AlphaFold 2 paper, after eight long months of waiting. The main text reports more or less what we have known for nearly a year, with some added tidbits, although it is accompanied by a painstaking description of the architecture in the supplementary information. Perhaps more importantly, the authors have released the entirety of the code, including all details to run the pipeline, on GitHub. And there is no small print this time: you can run inference on any protein (I’ve checked!).

Have you not heard the news? Let me refresh your memory. In November 2020, a team of AI scientists from Google DeepMind indisputably won the 14th Critical Assessment of protein Structure Prediction (CASP14) competition, a biennial blind test where computational biologists try to predict the structure of several proteins whose structure has been determined experimentally but not publicly released. Their results were so astounding, and the problem so central to biology, that they took the entire world by surprise and left an entire discipline, computational biology, wondering what had just happened.

Continue reading

Out-of-distribution generalisation and scaffold splitting in molecular property prediction

The ability to successfully apply previously acquired knowledge to novel and unfamiliar situations is one of the main hallmarks of successful learning and general intelligence. This capability to effectively generalise is amongst the most desirable properties a prediction model (or a mind, for that matter) can have.

In supervised machine learning, the standard way to evaluate the generalisation power of a prediction model for a given task is to randomly split the whole available data set X into two sets – a training set X_{\text{train}} and a test set X_{\text{test}}. The model is then trained on the examples in the training set X_{\text{train}}, and afterwards its prediction abilities are measured on the untouched examples in the test set X_{\text{test}} via a suitable performance metric.
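
The random split described above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the helper names `random_split` and `mean_absolute_error`, the toy data set and the mean-predicting placeholder "model" are all hypothetical and stand in for whatever data, model and performance metric one actually uses.

```python
import random


def random_split(dataset, test_fraction=0.2, seed=0):
    """Randomly split a list of examples into a training and a test list."""
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = round(test_fraction * len(dataset))
    test_indices, train_indices = indices[:n_test], indices[n_test:]
    train = [dataset[i] for i in train_indices]
    test = [dataset[i] for i in test_indices]
    return train, test


def mean_absolute_error(y_true, y_pred):
    """A simple performance metric: average absolute deviation."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data set of (input, label) pairs; purely illustrative.
X = [(i, 2.0 * i) for i in range(10)]
X_train, X_test = random_split(X, test_fraction=0.2, seed=0)

# Placeholder "model": always predict the mean label seen during training.
train_mean = sum(y for _, y in X_train) / len(X_train)

# Measure performance only on the untouched test examples.
test_labels = [y for _, y in X_test]
test_mae = mean_absolute_error(test_labels, [train_mean] * len(X_test))
```

The important property is that no example appears in both lists, so the test score is computed on data the model never saw during training.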

Since in this scenario the model has never seen any of the examples in X_{\text{test}} during training, its performance on X_{\text{test}} must be indicative of its performance on novel data X_{\text{new}} which it will encounter in the future. Right?

Continue reading

CAML: Courses in Applied Machine Learning

*Shameless self-promotion klaxon!! Have a look at my new website!*

I’m excited to share a project I’ve been working on for the past few months! One of the biggest challenges of working on an interdisciplinary research project is getting to grips with the core principles of the disciplines which you don’t have much formal training in. For me, that means learning the basics of Medicinal Chemistry and Structural Biology so that when someone mentions pi-stacking I don’t think they’re talking about the logistics of managing a bakery; for people coming from Bio/Chem backgrounds it can mean understanding the Maths and Statistics necessary to make sense of the different algorithms which are central to their work.

Continue reading