Category Archives: Machine Learning

ICML 2020: Chemistry / Biology papers

ICML is one of the largest machine learning conferences and, like many other conferences this year, is running virtually from 12th to 18th July.

The list of accepted papers can be found here, with 1,088 papers accepted out of 4,990 submissions (22% acceptance rate). As in my post on NeurIPS 2019 papers, I will highlight several of potential interest to the chem-/bio-informatics communities. As before, given the large number of papers, these were selected either by “accident” (i.e. I stumbled across them in one way or another) or through a basic search (e.g. Ctrl+F “molecule”).

Understanding the synthesizability of molecules proposed by generative models

De novo molecular design is a computational technique to generate molecules with desired properties from scratch. Classical generative algorithms are based on Genetic Algorithms (GAs) and the iterative construction of molecules from molecular fragments. Recently, Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs) have been developed for this task; however, the synthesizability of the proposed molecular structures remains an issue. Gao and Coley[1] analyse the synthesizability of the molecules proposed by these de novo generative algorithms and discuss their strengths and weaknesses.

Identifying shared antibodies using deep learning

Antibody convergence is the presence of similar antibodies in different individuals, suggesting that the individuals have been exposed to a common antigen that stimulated the production of similar, antigen-specific antibodies. We want to be able to identify these shared antibodies, sometimes referred to as ‘public clones’, as this could lead to the development of immunodiagnostic tests against the shared antibodies and potentially assist in the design of vaccines and therapeutic antibodies. A recent paper on bioRxiv by Sai Reddy’s group[i] applies machine learning techniques, namely variational autoencoders (VAEs) and support vector machines (SVMs), to the problem of identifying shared antibodies.

CCK-18 is Going Virtual

We are going virtual! Our next Comp Chem Kitchen, CCK-18, will be held as a Zoom Webinar on Friday, March 27, 2020, from 5 to 6 pm. We are delighted to announce that Prof. Andreas Bender from the University of Cambridge will be speaking, as well as Dr Vicky Hellon from F1000 Research. To attend the CCK-18 webinar, you must sign up for a free Eventbrite ticket (limit 100).

Visualisation of very large high-dimensional data sets as minimum spanning trees

Large high-dimensional data sets are frequently used in the chemical and biological sciences. For example, the ChEMBL database contains millions of bioactive molecules from the scientific literature, and their associated biological assay data are widely used in drug discovery. Visualising such databases helps in understanding the structure of the data.
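To make the idea in the title concrete, here is a minimal sketch of building a minimum spanning tree over a handful of points. It is not the method from the post: it assumes plain Euclidean distance and uses Prim's algorithm on the complete graph, which costs O(n²) and would not scale to millions of molecules. The toy coordinates and the function name `minimum_spanning_tree` are made up for illustration.

```python
import math


def minimum_spanning_tree(points):
    """Return MST edges as (i, j, distance) using Prim's algorithm.

    Assumption: points are equal-length numeric tuples compared with
    Euclidean distance; real molecular data would need a fingerprint
    and a suitable similarity measure instead.
    """
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known link into the tree
    best_parent = [-1] * n       # tree node providing that link
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest node that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_parent[u] != -1:
            edges.append((best_parent[u], u, best_dist[u]))
        # Update the cheapest links for the remaining nodes.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_parent[v] = u
    return edges


# Hypothetical toy data: four points in a 3-dimensional descriptor space.
toy_points = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (5, 5, 5)]
mst_edges = minimum_spanning_tree(toy_points)
for i, j, d in mst_edges:
    print(f"{i} -- {j}  (distance {d:.2f})")
```

The resulting edge list can be passed to any graph layout tool; the tree keeps only n − 1 links, which is what makes the picture readable for large collections.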

State of the art in AI for drug discovery: more wet-lab please

The medchem community has received ML approaches to the drug discovery pipeline with some skepticism, especially those focused on the hit-to-lead optimization process. One of the main drivers for this is the way many ML publications benchmark their models: historic datasets are split into two parts, with the larger part used to train and the smaller to test ML models. To standardize that validation process, computational chemists have constructed widely used benchmark datasets such as the DUD-E set, which is commonly used as a standard for protein-ligand binding classification tasks. Common criticism from medicinal chemists centers on the main problem associated with benchmark datasets: the absence of direct lab validation.
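The benchmarking pattern described above, a retrospective split of a historic dataset, can be sketched in a few lines. This is only an illustration under stated assumptions: the split is uniformly random, the 80/20 ratio is arbitrary, and the records and the function name `split_historic_dataset` are hypothetical rather than taken from any particular publication or from DUD-E.

```python
import random


def split_historic_dataset(records, test_fraction=0.2, seed=0):
    """Shuffle records and return (train, test), with train the larger part."""
    if not 0.0 < test_fraction < 0.5:
        raise ValueError("test_fraction must be between 0 and 0.5")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy data: (compound id, active/inactive label).
records = [(f"cmpd_{i}", i % 2) for i in range(10)]
train, test = split_historic_dataset(records)
print(len(train), "training records,", len(test), "test records")
```

Both parts come from the same historic source, so a good test score says nothing about how a proposed compound would behave in a new wet-lab experiment, which is the gap the criticism points at.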

Cooking Up a (Deep)STORM with a Little Cup of Super Resolution Microscopy

Recently, I attended the Quantitative BioImaging (QBI) Conference 2020, served right here in Oxford. Amongst the many methods on the menu were new recipes for spicing up your Cryo-EM images with a bit of CiNNamon, alongside a peppering of Poisson point processes in the inhomogeneous spatial case. However, like many of today’s top-tier restaurants, most of the courses on offer were on the smaller side, nano-scale in fact, serving up the new field of Super Resolution Microscopy!

Transforming Parliament – Training and deploying speech generation transformers for parliamentary speakers

Introduction

I recently wanted to explore areas of machine learning that I do not usually interact with as part of my DPhil research on antibody drug discovery. This post explores how to train and deploy a speech generation model for parliamentary speeches in the style of Jeremy Corbyn and Boris Johnson. You can play around with the resulting model at https://con-schneider.github.io/theytalktoyou.html.

Journal Club: Is our data biased, and should it be?

Jia, X., Lynch, A., Huang, Y. et al. Anthropogenic biases in chemical reaction data hinder exploratory inorganic synthesis. Nature 573, 251–255 (2019) doi:10.1038/s41586-019-1540-5 https://www.nature.com/articles/s41586-019-1540-5

Last week I presented the above paper at group meeting. While it is a little different from a typical OPIG journal club paper, the data we have access to almost certainly suffers from the same range of (possible) biases explored in this paper.
