Antibody convergence is the presence of similar antibodies in different individuals – suggesting that the individuals have had exposure to a common antigen, which has stimulated the production of similar, antigen-specific antibodies. We want to be able to identify these shared antibodies, sometimes referred to as ‘public clones’, as it could lead to the development of immunodiagnostic tests against the shared antibodies, and potentially assist in the design of vaccines and therapeutic antibodies. A recent paper on bioRxiv by Sai Reddy’s group[i] has applied machine learning techniques – variational autoencoders (VAE) and support vector machines (SVM) – to the problem of how to identify shared antibodies.
Category Archives: Machine Learning
CCK-18 is Going Virtual
We are going virtual! Our next Comp Chem Kitchen, CCK-18, will be via a Zoom Webinar, on Friday, March 27, 2020, at 5-6 pm. We are delighted to announce that Prof. Andreas Bender from the University of Cambridge will be speaking, as well as Dr Vicky Hellon from F1000 Research. To attend the CCK-18 webinar, you must sign up for a free Eventbrite ticket (limit 100).
Visualisation of very large high-dimensional data sets as minimum spanning trees
Large high-dimensional data sets are frequently used in the chemical and biological sciences. For example, the ChEMBL database contains millions of bioactive molecules from the scientific literature, and their associated biological assay data are often used for drug discovery. Visualising such databases helps us understand the structure of the data.
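As a toy illustration of the idea in the title, the sketch below builds a minimum spanning tree over a small set of random high-dimensional points using Prim's algorithm. It is not the method from the post: the function name `minimum_spanning_tree`, the 16-dimensional random points, and the Euclidean distance are all assumptions made for illustration. The all-pairs approach here scales quadratically, so for millions of molecules one would presumably need to start from an approximate nearest-neighbour graph instead.

```python
import math
import random


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph.

    Returns a list of (i, j, distance) edges. Hypothetical helper,
    written for illustration only.
    """
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n  # cheapest known link from each point into the tree
    best_from = [-1] * n        # tree node that provides that cheapest link
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]),
                key=best_dist.__getitem__)
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u, best_dist[u]))
        # Relax the distances of the remaining points against the new node.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges


# Assumed toy data: 50 random points in 16 dimensions.
random.seed(0)
points = [[random.random() for _ in range(16)] for _ in range(50)]
mst = minimum_spanning_tree(points)
print(len(mst))  # a tree over n points has n - 1 edges
```

The resulting edge list can be handed to any graph layout tool to draw the tree in two dimensions.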
State of the art in AI for drug discovery: more wet-lab please
The medchem community has received ML approaches for the drug discovery pipeline rather skeptically, especially those focused on the hit-to-lead optimization process. One of the main drivers for that is the way many ML publications benchmark their models: historic datasets are split into two parts, with the larger part used to train and the smaller to test ML models. In order to standardize that validation process, computational chemists have constructed widely used benchmark datasets such as the DUD-E set, which is commonly used as a standard for protein-ligand binding classification tasks. Common criticism from medicinal chemists centers on the main problem associated with benchmark datasets: the absence of direct lab validation.
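The benchmarking protocol described above can be sketched in a few lines. This is a minimal illustration, assuming a made-up list of (molecule, label) records and an 80/20 random split; the helper name `train_test_split` and the data are hypothetical and are not taken from DUD-E or from any specific publication.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle the records and hold out a fraction of them for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The larger part trains the model, the smaller part tests it.
    return shuffled[n_test:], shuffled[:n_test]


# Assumed toy "historic dataset": 100 molecules with binary activity labels.
historic = [(f"mol_{i}", i % 2) for i in range(100)]
train, test = train_test_split(historic)
print(len(train), len(test))  # 80 20
```

Both parts come from the same historic data, so a good test score says nothing about how new molecules would behave in the lab, which is the gap the criticism points at.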
Cooking Up a (Deep)STORM with a Little Cup of Super Resolution Microscopy
Recently, I attended the Quantitative BioImaging (QBI) Conference 2020, served right here in Oxford. Amongst the many methods on the menu were new recipes for spicing up your Cryo-EM images with a bit of CiNNamon with a peppering of Poisson point processes in the inhomogeneous spatial case, amongst many others. However, like many of today’s top-tier restaurants, most of the courses on offer were on the smaller side, nano-scale in fact, serving up the new field of Super Resolution Microscopy!
Transforming Parliament – Training and deploying speech generation transformers for parliamentary speakers
Introduction
I recently wanted to explore areas of machine learning that I do not usually interact with as part of my DPhil research on antibody drug discovery. This post explores how to train and deploy a speech generation model for parliamentary speeches in the style of Jeremy Corbyn and Boris Johnson. You can play around with the resulting model at https://con-schneider.github.io/theytalktoyou.html.
Journal Club: Is our data biased, and should it be?

Last week I presented the above paper at group meeting. While a little different from a typical OPIG journal club paper, the data we have access to almost certainly suffers from the same range of (possible) biases explored in this paper.
de novo Small Molecule Design using Deep Learning
This is an interesting paper by Zhavoronkov et al., recently published in Nature Biotechnology as a brief communication: https://www.nature.com/articles/s41587-019-0224-x. The paper describes a new deep generative model called generative tensorial reinforcement learning (GENTRL), which enables optimization for synthetic feasibility, novelty, and biological activity. In this work, the authors designed, synthesized, and experimentally validated molecules targeting discoidin domain receptor 1 (DDR1) in less than two months. The code for GENTRL is available here: https://github.com/insilicomedicine/gentrl.
Reference: Zhavoronkov, A. et al. Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nature Biotechnology 2019, 37, 1038-1040.
A Gentle Introduction to the GPyOpt Module
Manually tuning hyperparameters in a neural network is slow and boring. Using Bayesian Optimisation to do it for you is slightly less slow and you can go do other things whilst it’s running. Susan recently highlighted some of the resources available to get to grips with GPyOpt. Below is a copy of a Jupyter Notebook where we walk through a couple of simple examples and hopefully shed a little bit of light on how the algorithm works.
NeurIPS 2019: Chemistry/Biology papers
NeurIPS is the largest machine learning conference (by number of participants), with over 8,000 attendees in 2017. This year, the conference will be held in Vancouver, Canada from 8th-14th December.
Recently, the list of accepted papers was announced, with 1430 papers accepted. Here, I will highlight several of potential interest to the chem-/bio-informatics communities. Given the large number of papers, these were selected either by “accident” (i.e. I stumbled across them in one way or another) or through a basic search (e.g. Ctrl+f “molecule”).