Category Archives: Small Molecules

Fragment Based Drug Discovery with Crystallographic Fragment Screening at XChem and Beyond

Disclaimer: I’m a current PhD student working on PanDDA 2 for Frank von Delft and Charlotte Deane, sponsored by Global Phasing, and some of this is my opinion. If a claim isn’t obviously from one of the references, I probably said it, so take it with a pinch of salt.

Fragment Based Drug Discovery

Principle

Fragment based drug discovery (FBDD) is a technique for finding lead compounds for medicinal chemistry. In FBDD a protein target of interest is identified for inhibition and a small library of fragments, typically a few hundred compounds, is screened against it. Though these fragments typically bind weakly, they can be used as a starting point for chemical elaboration towards something more lead-like. This approach is primarily contrasted with high throughput screening (HTS), in which an enormous number of larger, more complex molecules are screened in order to find ones which bind. The key idea is that the molecules in these HTS libraries can typically be broken down into a much smaller number of common substructures, or fragments, so screening these ought to be more informative: between them they describe more of the “chemical space” which interacts with the protein. Since it first appeared about 25 years ago, FBDD has delivered four drugs for clinical use and over 40 molecules to clinical trials.

Continue reading

How to turn a SMILES string into a molecular graph for PyTorch Geometric

Despite some of their technical issues, graph neural networks (GNNs) are quickly being adopted as one of the state-of-the-art methods for molecular property prediction. The differentiable extraction of molecular features from low-level molecular graphs has become a viable (although not always superior) alternative to classical molecular representation techniques such as Morgan fingerprints and molecular descriptor vectors.

But molecular data usually comes in the sequential form of labeled SMILES strings. It is not obvious to beginners how best to transform a SMILES string into a structured molecular graph object that can be used as an input for a GNN. In this post, we show how to convert a SMILES string into a molecular graph object which can subsequently be used for graph-based machine learning. We do so within the framework of PyTorch Geometric, which is currently one of the best and most commonly used Python-based GNN libraries.

We divide our task into three high-level steps:

  1. We define a function that maps an RDKit atom object to a suitable atom feature vector.
  2. We define a function that maps an RDKit bond object to a suitable bond feature vector.
  3. We define a function that takes as its input a list of SMILES strings and associated labels and then uses the functions from 1.) and 2.) to create a list of labeled PyTorch Geometric graph objects as its output.
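
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the feature set used in the full post: the atom vocabulary, the particular atom and bond features, and the helper names (`one_hot`, `atom_features`, `bond_features`, `smiles_to_arrays`, `smiles_to_pyg`) are all assumptions made for this example. The graph is first built as plain Python lists so that the RDKit part can be checked on its own, and only the last function needs PyTorch and PyTorch Geometric.

```python
from rdkit import Chem

# Assumed, illustrative vocabularies; a real model would likely use richer ones.
ATOM_SYMBOLS = ["C", "N", "O", "S", "F", "Cl", "Br", "I", "P"]
BOND_TYPES = [
    Chem.rdchem.BondType.SINGLE,
    Chem.rdchem.BondType.DOUBLE,
    Chem.rdchem.BondType.TRIPLE,
    Chem.rdchem.BondType.AROMATIC,
]
N_BOND_FEATURES = len(BOND_TYPES) + 1 + 2  # one-hot (+ "other") + 2 flags


def one_hot(value, choices):
    """One-hot encode value over choices, with a final slot for 'other'."""
    vec = [0.0] * (len(choices) + 1)
    idx = choices.index(value) if value in choices else len(choices)
    vec[idx] = 1.0
    return vec


def atom_features(atom):
    """Step 1: map an RDKit atom to a feature vector (a plain list)."""
    return one_hot(atom.GetSymbol(), ATOM_SYMBOLS) + [
        float(atom.GetDegree()),
        float(atom.GetFormalCharge()),
        float(atom.GetIsAromatic()),
    ]


def bond_features(bond):
    """Step 2: map an RDKit bond to a feature vector (a plain list)."""
    return one_hot(bond.GetBondType(), BOND_TYPES) + [
        float(bond.GetIsConjugated()),
        float(bond.IsInRing()),
    ]


def smiles_to_arrays(smiles):
    """Build node features, directed edge index and edge features."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    x = [atom_features(atom) for atom in mol.GetAtoms()]
    src, dst, edge_attr = [], [], []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        feats = bond_features(bond)
        # Store each undirected bond as two directed edges.
        src += [i, j]
        dst += [j, i]
        edge_attr += [feats, feats]
    return x, [src, dst], edge_attr


def smiles_to_pyg(smiles_list, labels):
    """Step 3: wrap the arrays in labeled PyTorch Geometric Data objects."""
    import torch
    from torch_geometric.data import Data

    graphs = []
    for smiles, label in zip(smiles_list, labels):
        x, edge_index, edge_attr = smiles_to_arrays(smiles)
        graphs.append(Data(
            x=torch.tensor(x, dtype=torch.float),
            edge_index=torch.tensor(edge_index, dtype=torch.long),
            # reshape keeps the (num_edges, num_features) shape for single atoms
            edge_attr=torch.tensor(edge_attr, dtype=torch.float)
                           .reshape(-1, N_BOND_FEATURES),
            y=torch.tensor([label], dtype=torch.float),
        ))
    return graphs
```

For example, `smiles_to_pyg(["CCO", "c1ccccc1"], [0.5, 1.2])` should return two graph objects, with made-up labels here. Each bond is stored in both directions because PyTorch Geometric represents an undirected graph as a pair of directed edges.
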
Continue reading

A quantitative way to measure targeted protein degradation

Whenever we order consumables in the Chemistry department, the whole lab gets an email notification once they arrive. So I can understand why I got some puzzled reactions from my colleagues when one such email arrived saying that my ‘artichoke’ was ready to collect from stores. Had I been sneakily doing my grocery shopping on a university research budget?

Artichoke is, in fact, the name of a plasmid designed by the Ebert lab (https://www.addgene.org/73320/), which I have been using in some of my research on targeted protein degradation. The premise is simple enough: the plasmid carries genes for two different fluorescent proteins, one of which is fused to a protein of interest.

Continue reading

Stochastic chemical kinetics – things randomly bumping into each other

In this blog post I describe the advantages of taking a stochastic view of chemical systems, based on the work of D. T. Gillespie and subsequent publications. Gillespie presented his formalism for stochastic chemical kinetics, now referred to as the Gillespie Algorithm, in two papers published in 1976 and 1977 (Gillespie, D. T. J. Comp. Phys. 1976; Gillespie, D. T. J. Phys. Chem. 1977). If you want to see the full derivation of the Gillespie Algorithm along with many examples, I recommend giving them both a read.

The essential question of chemical kinetics as stated by Gillespie is:

“If a fixed volume V contains a spatially uniform mixture of N chemical species which can inter-react through M specified chemical reaction channels, then given the numbers of molecules of each species present at some initial time, what will these molecular population levels be at any later time?”
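
To make the question concrete, here is a minimal sketch of the standard "direct method" form of the Gillespie Algorithm, using only the Python standard library. The function name `gillespie`, its arguments, and the toy reaction A → B with rate constant 1.0 are assumptions chosen for illustration and are not taken from the papers. At each step the waiting time to the next reaction is drawn from an exponential distribution with rate equal to the total propensity, and the reaction channel is chosen with probability proportional to its own propensity.

```python
import random


def gillespie(x0, stoich, propensity, t_max, seed=0):
    """Simulate one stochastic trajectory with the direct method.

    x0         -- initial molecule counts, one per species
    stoich     -- one tuple per reaction channel: change in each species count
    propensity -- function mapping current counts to a list of propensities
    t_max      -- stop once the next reaction would occur after this time
    """
    rng = random.Random(seed)
    t, x = 0.0, list(x0)
    times, states = [t], [tuple(x)]
    while True:
        a = propensity(x)
        a0 = sum(a)
        if a0 <= 0:
            break  # no reaction can fire any more
        tau = rng.expovariate(a0)  # exponential waiting time with rate a0
        if t + tau > t_max:
            break
        # Pick channel j with probability a[j] / a0.
        r = rng.random() * a0
        j, acc = 0, a[0]
        while acc <= r and j < len(a) - 1:
            j += 1
            acc += a[j]
        t += tau
        x = [xi + d for xi, d in zip(x, stoich[j])]
        times.append(t)
        states.append(tuple(x))
    return times, states


# Toy system (an assumption for illustration): A -> B with rate constant 1.0.
times, states = gillespie(
    x0=(100, 0),
    stoich=[(-1, 1)],
    propensity=lambda x: [1.0 * x[0]],
    t_max=1000.0,
    seed=1,
)
```

Each run gives one possible answer to Gillespie's question; repeating the simulation with different seeds gives an estimate of the distribution of population levels at a later time.
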

Continue reading

Model validation in Crystallographic Fragment Screening

Fragment based drug discovery is a powerful technique for finding lead compounds for medicinal chemistry. Crystallographic fragment screening is particularly useful because it shows not just whether a fragment binds, but also how it binds. This information allows for rational elaboration and merging of fragments.

However, this comes with a unique challenge: the confidence in the experimental readout, if and how a fragment binds, is tied to the quality of the crystallographic model that can be built. This intimately links crystallographic fragment screening to the general statistical idea of a “model”, and the statistical ideas of goodness of fit and overfitting.

Continue reading

Targeted protein degradation phenotypic studies using HaloTag CRISPR/Cas9 endogenous target tagging and HaloPROTAC

Biologists currently have several options in their arsenal when it comes to gene silencing. If you want to vanquish the gene in question entirely, you can use CRISPR to knock it out. This is a great way to eliminate the gene, and hence compare cell phenotypes with and without it, but it’s less good if the gene is essential and the cells won’t grow without it in the first place.

Otherwise you can use RNA interference, where small pieces of RNA that complement the mRNA for that gene are introduced to the cell, with the overall effect of blocking translation of that gene’s mRNA, hence silencing it. However, this method suffers from side effects and varying levels of gene knockdown efficiency. Moreover, it does not vanquish existing protein; it just stops more from being produced.

Continue reading

Benchmarks in De Novo Drug Design

I recently came across the review “De novo molecular drug design benchmarking” by Lauren L. Grant and Clarissa S. Sit [4], which highlights recently proposed benchmarking methods, including Fréchet ChemNet Distance [1], GuacaMol [2], and Molecular Sets (MOSES) [3], together with their current and potential future applications and the next steps for validating benchmarking methods.

From this review, I particularly wanted to note the issues with current benchmarking methods and the points we should be aware of when using them to benchmark our own de novo molecular design methods. Goal-directed models are de novo molecular design methods that optimize for a particular scoring function [2].

Continue reading

Getting the PDB structures of compounds in ChEMBL

Recently I was dealing with a set of compounds with known target activities from the ChEMBL database, and I wanted to find out which of them also had PDB crystal structures in complex with that target.

Cross-referencing this manually is easy enough when we are interested in only 2-3 compounds, but for any larger number, using the ChEMBL and PDB web services greatly reduces the number of clicks.

Continue reading

Issues with graph neural networks: the cracks are where the light shines through

Deep convolutional neural networks have led to astonishing breakthroughs in the area of computer vision in recent years. The reason for the extraordinary performance of convolutional architectures in the image domain is their strong ability to extract informative high-level features from visual data. For prediction tasks on images, this has led to superhuman performance in a variety of applications and to an almost universal shift from classical feature engineering to differentiable feature learning.

Unfortunately, the picture is not quite as rosy yet in the area of molecular machine learning. Feature learning techniques which operate directly on raw molecular graphs without intermediate feature-engineering steps have only emerged in the last few years in the form of graph neural networks (GNNs). GNNs, however, still have not managed to definitively outcompete and replace more classical non-differentiable molecular representation methods such as extended-connectivity fingerprints (ECFPs). There is an increasing awareness in the computational chemistry community that GNNs have not quite lived up to the initial hype and still suffer from a number of technical limitations.

Continue reading

How to interact with small molecules in Jupyter Notebooks

The combination of Python and the cheminformatics toolkit RDKit has opened up so many ways to explore chemistry on a computer. Jupyter — named for the three languages, Julia, Python, and R — ties interactivity and visualization together, creating wonderful environments (Notebooks and JupyterLab) to carry out, share and reproduce research, including:

“data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.”

—https://jupyter.org

At this year’s annual RDKit UGM (User Group Meeting), Cédric Bouysset shared a tutorial explaining how to create a grid of molecules that you can interact with, using his “mols2grid”:

Continue reading