Tag Archives: machine learning

Entering a Stable Relationship with your Neural Network

Over the past year, I have been working on building a graph-based paratope (antibody binding site) prediction tool – Paragraph. Fortunately, I have had moderate success with this and you can now check out the preprint of this work here.

However, for a long time, I struggled with a highly unstable network, where different random seeds yielded very different results. I believe this instability was largely due to the high class imbalance in my data – only ~10% of all residues in the Fv (variable region of the antibody) belong to the paratope.

I tried many different things in an attempt to stabilise my training, most of which failed. I will share all of these ideas with you though – successful or not – as what works for one person/network is never guaranteed to work for another. I hope that the below may provide some ideas to try out for others facing similar issues. Where possible, I also provide some example hyperparameter values that could act as sensible starting points.

Continue reading

Visualise with Weight and Biases

Understanding what’s going on when you’ve started training your shiny new ML model is hard enough. Will it work? Have I got the right parameters? Is it the data? Probably.  Any tool that can help with that process is a Godsend. Weights and biases is a great tool to help you visualise and track your model throughout your production cycle. In this blog post, I’m going to detail some basics on how you can initialise and use it to visualise your next project.

Installation

To use weights and biases (wandb), you need to make an account. For individuals it is free, however, for team-oriented features, you will have to pay. Wandb can then be installed using pip or conda.

$ 	conda install -c conda-forge wandb

or 

$   pip install wandb

To initialise your project, import the package, sign in, and then use the following command using your chosen project name and username (if you want):

import wandb

wandb.login()

wandb.init(project='project1')

In addition to your project, you can also initialise a config dictionary with starting parameter values:

Continue reading

How to turn a SMILES string into a vector of molecular descriptors using RDKit

Molecular descriptors are quantities associated with small molecules that specify physical or chemical properties of interest. They can be used to numerically describe many different aspects of a molecule such as:

  • molecular graph structure,
  • lipophilicity (logP),
  • molecular refractivity,
  • electrotopological state,
  • druglikeness,
  • fragment profile,
  • molecular charge,
  • molecular surface,

Vectors whose components are molecular descriptors can be used (amongst other things) as high-level feature representations for molecular machine learning. In my experience, molecular descriptor vectors tend to fall slightly short of more low-level molecular representation methods such as extended-connectivity fingerprints or graph neural networks when it comes to predictive performance on large and medium-sized molecular property prediction data sets. However, one advantage of molecular descriptor vectors is their interpretability; there is a reasonable chance that the meaning of a physicochemical descriptor can be intuitively understood by a chemical expert.

A wide variety of useful molecular descriptors can be automatically and easily computed via RDKit purely on the basis of the SMILES string of a molecule. Here is a code snippet to illustrate how this works:

Continue reading

CryoEM is now the dominant technique for solving antibody structures

Last year, the Structural Antibody Database (SAbDab) listed a record-breaking 894 new antibody structures, driven in no small part by the continued efforts of the researchers to understand SARS-CoV-2.

Fig. 1: The aggregate growth in antibody structure data (all methods) over time. Taken from http://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/stats/ on 25th May 2022.

In this blog post I wanted to highlight the major driving force behind this curve – the huge increase in cryo electron microscopy (cryoEM) data – and the implications of this for the field of structure-based antibody informatics.

Continue reading

From code to molecules: The future of chemical synthesis

In June, after I finish my PhD, I will be joining Chemify, a new startup based in Glasgow that aims to make chemical synthesis universally accessible, reproducible and fully automated using AI and robotics. After previously talking about “Why you should care about startups as a researcher” and a quick guide on “Commercialising your research: Where to start?” on this blog, I have now joined a science-based startup fresh out of university myself.

Chemify is a spinout from the University of Glasgow originating from the group of Prof. Lee Cronin. The core of the technology is the chemical programming language χDL (pronounced “chi DL”) that, in combination with a natural language processing AI that reads and understands chemical synthesis procedures, can be used to plan and autonomously executed chemical reactions on robotic hardware. The Cronin group has also already build the modular robotic hardware needed to carry out almost any chemical reaction, the “Chemputer”. Due to the flexibility of both the Chemputer and the χDL language, Chemify has already shown that the applications go way beyond simple synthesis and can be applied to drug formulation, the discovery of new materials or the optimisation of reaction conditions.

Armed with this transformational software and hardware, Chemify is now fully operational and is hiring exceptional talent into their labs in Glasgow. I am excited to see how smart, AI-driven automation techniques like Chemify will change how small scale chemical synthesis and chemical discovery more broadly is done in the future. I’m super excited to be part of the journey.

Paper review: “EquiBind”

Molecular docking helps us understand how small-molecules interact with proteins. This is especially useful in early drug development stages such as target identification and compound screening. Quick and accurate docking software allows researchers to focus their attention on a smaller set of lead molecules for further testing. Traditionally, docking software has employed first principles from physics and chemistry. Recently, deep learning has become all the rage for molecular docking, maybe motivated by the successful application of deep learning to molecular folding.

Method

EquiBind is a deep learning unconstrained docking method which models a fixed receptor and a ligand with selected rotatable bonds. It predicts the binding pocket and the ligand’s conformation within the pocket in one go. Under the hood, EquiBind employs two great ideas from a recent ICLR 2022 Paper: a SE3-invariant graph neural network based architecture and the idea to generate fixed sets of matching key points to define a rotation and translation between receptor and ligand. In addition, the authors innovate a fast method to project a deformed ligand onto the space spanned by the rotatable bonds of a pre-generated ligand conformation.

Continue reading

3 Key Questions to Think About When Designing Proteins Computationally

We have reached the era of design, not just ‘hunting’. Particularly exciting to me is the de novo design of proteins, which have a wide and ever increasing range of applications from therapeutics to consumer products, biomanufacturing to biomaterials. Protein design has been a) enabled by decades of research that contributed to our understanding of protein sequence, structure & function and b) accelerated by computational advances – capturing the information we have learned from proteins and representing it for computers and machine learning algorithms.

In this blog post, I will discuss three key methodological considerations for computational protein design:

  1. Sequence- vs structure-based design
  2. ML- vs physics-based design
  3. Target-agnostic vs target-aware design
Continue reading

How to turn a SMILES string into a molecular graph for Pytorch Geometric

Despite some of their technical issues, graph neural networks (GNNs) are quickly being adopted as one of the state-of-the-art methods for molecular property prediction. The differentiable extraction of molecular features from low-level molecular graphs has become a viable (although not always superior) alternative to classical molecular representation techniques such as Morgan fingerprints and molecular descriptor vectors.

But molecular data usually comes in the sequential form of labeled SMILES strings. It is not obvious for beginners how to optimally transform a SMILES string into a structured molecular graph object that can be used as an input for a GNN. In this post, we show how to convert a SMILES string into a molecular graph object which can subsequently be used for graph-based machine learning. We do so within the framework of Pytorch Geometric which currently is one of the best and most commonly used Python-based GNN-libraries.

We divide our task into three high-level steps:

  1. We define a function that maps an RDKit atom object to a suitable atom feature vector.
  2. We define a function that maps an RDKit bond object to a suitable bond feature vector.
  3. We define a function that takes as its input a list of SMILES strings and associated labels and then uses the functions from 1.) and 2.) to create a list of labeled Pytorch Geometric graph objects as its output.
Continue reading

Issues with graph neural networks: the cracks are where the light shines through

Deep convolutional neural networks have lead to astonishing breakthroughs in the area of computer vision in recent years. The reason for the extraordinary performance of convolutional architectures in the image domain is their strong ability to extract informative high-level features from visual data. For prediction tasks on images, this has lead to superhuman performance in a variety of applications and to an almost universal shift from classical feature engineering to differentiable feature learning.

Unfortunately, the picture is not quite as rosy yet in the area of molecular machine learning. Feature learning techniques which operate directly on raw molecular graphs without intermediate feature-engineering steps have only emerged in the last few years in the form of graph neural networks (GNNs). GNNs, however, still have not managed to definitively outcompete and replace more classical non-differentiable molecular representation methods such as extended-connectivity fingerprints (ECFPs). There is an increasing awareness in the computational chemistry community that GNNs have not quite lived up to the initial hype and still suffer from a number of technical limitations.

Continue reading

AlphaFold 2 is here: what’s behind the structure prediction miracle

Nature has now released that AlphaFold 2 paper, after eight long months of waiting. The main text reports more or less what we have known for nearly a year, with some added tidbits, although it is accompanied by a painstaking description of the architecture in the supplementary information. Perhaps more importantly, the authors have released the entirety of the code, including all details to run the pipeline, on Github. And there is no small print this time: you can run inference on any protein (I’ve checked!).

Have you not heard the news? Let me refresh your memory. In November 2020, a team of AI scientists from Google DeepMind  indisputably won the 14th Critical Assessment of Structural Prediction competition, a biennial blind test where computational biologists try to predict the structure of several proteins whose structure has been determined experimentally but not publicly released. Their results were so astounding, and the problem so central to biology, that it took the entire world by surprise and left an entire discipline, computational biology, wondering what had just happened.

Continue reading