Optimising machine learning models requires extensive comparison of architectures and hyperparameter combinations. There are many frameworks that make it easier to log and visualise performance metrics across model runs. I recently started using Weights & Biases. In the following, I give a brief overview of some basic code snippets to get started with this tool in your machine learning Python code.
Category Archives: Machine Learning
CASP14: what Google DeepMind’s AlphaFold 2 really achieved, and what it means for protein folding, biology and bioinformatics
Disclaimer: this post is an opinion piece based on the experience and opinions I formed while attending the CASP14 conference as a doctoral student researching protein modelling. Where provided, quotes have been taken from my notes of the event, and while I hope to have captured them as accurately as possible, I cannot guarantee that they are a word-for-word transcript of what the individuals said. Neither the Oxford Protein Informatics Group nor I accept any responsibility for the content of this post.
You might have heard it from the scientific or popular press, perhaps even from DeepMind's own blog. Google DeepMind's AlphaFold 2 indisputably won the 14th Critical Assessment of Structure Prediction (CASP) competition, a biennial blind test in which computational biologists try to predict the structures of several proteins that have been determined experimentally but not yet publicly released. Its results are so accurate that many have hailed this work as the solution to the long-standing protein structure prediction problem.
BioDataScience101: a fantastic initiative to learn bioinformatics and data science
Last Wednesday, I was fortunate enough to be invited as a guest lecturer to the 3rd BioDataScience101 workshop, an initiative spearheaded by Paolo Marcatili, Professor of Bioinformatics at the Technical University of Denmark (DTU). This session, on amino acid sequence analysis applied to both proteomics and antibody drug discovery, was designed and organised by OPIG’s very own Tobias Olsen.
NeurIPS 2020: Chemistry / Biology papers
Another blog post, another look at accepted papers for a major ML conference. NeurIPS joins the other major machine learning conferences in going virtual this year, running from 6th–12th December 2020. In a continuation of past posts (ICML 2020, NeurIPS 2019), I will highlight several papers of potential interest to the chem-/bio-informatics communities.
The list of accepted papers can be found here, with 1,903 papers accepted out of 9,467 submissions (20% acceptance rate).
In addition to the main conference, there are several workshops highly related to the type of research undertaken in OPIG: Machine Learning in Structural Biology and Machine Learning for Molecules.
The usual caveat: given the large number of papers, these were selected either by “accident” (i.e. I stumbled across them in one way or another) or through a basic search (e.g. Ctrl+f “molecule”). If you find any I have missed, please reach out and I will update accordingly.
Learning from Biased Datasets
Both the beauty and the downfall of learning-based methods is that the data used for training will largely determine the quality of any model or system.
While there have been numerous algorithmic advances in recent years, the most successful applications of machine learning have been in areas where either (i) you can generate your own data in a fully understood environment (e.g. AlphaGo/AlphaZero), or (ii) data is so abundant that you’re essentially training on “everything” (e.g. GPT2/3, CNNs trained on ImageNet).
This covers only a narrow range of applications, and most data does not fall into either category. Unfortunately, outside of these two cases (and sometimes even within them), your data is almost certainly biased – you just may or may not know it.
No labels, no problem! A quick introduction to Gaussian Mixture Models
Statistical Modelling Big Data Analytics™ is in vogue at the moment, and there's nothing quite so fashionable as the neural network. Capable of capturing complex non-linear relationships and scaling to high-dimensional datasets, neural networks are here to stay.
For your garden-variety neural network, you need two things: a set of features, X, and a label, Y. But what do you do if labelling is prohibitively expensive or your expert labeller goes on holiday for 2 months and all you have in the meantime is a set of features? Happily, we can still learn something about the labels, even if we might not know what they are!
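As a toy illustration of this idea, the sketch below fits a two-component 1D Gaussian mixture to unlabelled data with expectation-maximisation, using only the standard library. The data, the initialisation, and the iteration count are all illustrative assumptions; in practice you would reach for something like scikit-learn's GaussianMixture.

```python
import math
import random

def gaussian_pdf(x, mean, std):
    """Density of a 1D Gaussian at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def fit_gmm(data, n_iter=50):
    """Fit a two-component 1D mixture by EM; returns (weights, means, stds)."""
    weights, means, stds = [0.5, 0.5], [min(data), max(data)], [1.0, 1.0]
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
            total = sum(p)
            resp.append([pi / total for pi in p])
        # M-step: re-estimate parameters from the soft assignments
        for k in range(2):
            rk = [r[k] for r in resp]
            nk = sum(rk)
            weights[k] = nk / len(data)
            means[k] = sum(r * x for r, x in zip(rk, data)) / nk
            var = sum(r * (x - means[k]) ** 2 for r, x in zip(rk, data)) / nk
            stds[k] = max(1e-3, math.sqrt(var))  # floor to avoid collapse
    return weights, means, stds

random.seed(0)
# Unlabelled data drawn from two well-separated Gaussians
data = [random.gauss(0.0, 1.0) for _ in range(200)] + \
       [random.gauss(6.0, 1.0) for _ in range(200)]
weights, means, stds = fit_gmm(data)
print(sorted(means))  # recovers component means near 0 and 6
```

Even though no point was ever labelled, the soft assignments in the E-step effectively act as probabilistic labels, which is exactly the "learning something about the labels" idea above.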

K-Means clustering made simple
The 21st century is often referred to as the age of “Big Data” due to the unprecedented increase in the volumes of data being generated. As most of this data comes without labels, making sense of it is a non-trivial task. To gain insight from unlabelled data, unsupervised machine learning algorithms have been developed and continue to be refined. These algorithms determine underlying relationships within the data by grouping data points into cluster families. The resulting clusters not only highlight associations within the data, but they are also critical for creating predictive models for new data.
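As a sketch of the idea, here is Lloyd's algorithm (the standard k-means procedure) in plain Python: assign each point to its nearest centroid, then move each centroid to the mean of its cluster, and repeat. The 2D toy points and the fixed initial centroids are illustrative assumptions; production code would typically use scikit-learn's KMeans with k-means++ initialisation.

```python
import math

def kmeans(points, centroids, n_iter=20):
    """Alternate nearest-centroid assignment and centroid recomputation."""
    for _ in range(n_iter):
        # Assignment step: group points by their nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [math.dist(p, c) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster
        for k, members in enumerate(clusters):
            if members:
                centroids[k] = tuple(sum(coord) / len(members)
                                     for coord in zip(*members))
    return centroids, clusters

# Two obvious groups of unlabelled 2D points
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (5, 5)])
print(centroids)  # one centroid settles in each group
```

The algorithm converges here in a couple of iterations, with one centroid near (0.33, 0.33) and the other near (9.33, 9.33); the resulting clusters can then serve as the "cluster families" described above.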
ICML 2020: Chemistry / Biology papers
ICML is one of the largest machine learning conferences and, like many other conferences this year, is running virtually from 12th – 18th July.
The list of accepted papers can be found here, with 1,088 papers accepted out of 4,990 submissions (22% acceptance rate). Similar to my post on NeurIPS 2019 papers, I will highlight several papers of potential interest to the chem-/bio-informatics communities. As before, given the large number of papers, these were selected either by "accident" (i.e. I stumbled across them in one way or another) or through a basic search (e.g. Ctrl+f "molecule").
Understanding the synthesizability of molecules proposed by generative models
De novo molecular design is a computational technique for generating molecules with desired properties from scratch. Classical generative approaches are based on Genetic Algorithms (GAs) and the iterative construction of molecules from molecular fragments. More recently, Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs) have been developed for this task; however, the synthesizability of the proposed molecular structures remains an issue. Gao and Coley [1] analysed the synthesizability of the molecules proposed by these de novo generative algorithms and discussed their strengths and weaknesses.
DeLinker – Deep Generative Models for 3D Linker Design
*** Disclaimer: This blog post represents some shameless self-promotion. ***
I am delighted to announce that our most recent work, DeLinker, was recently published in the Journal of Chemical Information and Modeling (link).
