Category Archives: Small Molecules

Getting the PDB structures of compounds in ChEMBL

Recently I was dealing with a set of compounds with known target activities from the ChEMBL database, and I wanted to find out which of them also had PDB crystal structures in complex with that target.

Referencing this manually is very easy for cases where we are interested in 2-3 compounds, but for any larger number, using the ChEMBL and PDB web services greatly reduces the number of clicks.

Continue reading →

Issues with graph neural networks: the cracks are where the light shines through

Deep convolutional neural networks have lead to astonishing breakthroughs in the area of computer vision in recent years. The reason for the extraordinary performance of convolutional architectures in the image domain is their strong ability to extract informative high-level features from visual data. For prediction tasks on images, this has lead to superhuman performance in a variety of applications and to an almost universal shift from classical feature engineering to differentiable feature learning.

Unfortunately, the picture is not quite as rosy yet in the area of molecular machine learning. Feature learning techniques which operate directly on raw molecular graphs without intermediate feature-engineering steps have only emerged in the last few years in the form of graph neural networks (GNNs). GNNs, however, still have not managed to definitively outcompete and replace more classical non-differentiable molecular representation methods such as extended-connectivity fingerprints (ECFPs). There is an increasing awareness in the computational chemistry community that GNNs have not quite lived up to the initial hype and still suffer from a number of technical limitations.

Continue reading →

How to interact with small molecules in Jupyter Notebooks

The combination of Python and the cheminformatics toolkit RDKit has opened up so many ways to explore chemistry on a computer. Jupyter — named for the three languages, Julia, Python, and R — ties interactivity and visualization together, creating wonderful environments (Notebooks and JupyterLab) to carry out, share and reproduce research, including:

“data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.”
—https://jupyter.org

At this year’s annual RDKit UGM (User Group Meeting), Cédric Bouysset shared a tutorial explaining how to create a grid of molecules that you can interact with, using his “mols2grid“:

Continue reading →

Watch out when using PDBbind!

Now that PDBbind 2020 has been released, I want to draw some attention to an issue with using the SDF files that are supplied in the PDBbind refined set 2020.

Normally, SDF files save the chirality information of compounds in the atom block of the file which is shown belowas a snipped of the full sdf file for the ligand of PDB entry 4qsv. The column that defines chirality is marked in red.

As you can see, all columns shown here are 0. The SDF files supplied by PDBbind for some reason do NOT encode chirality information explicitly. This will be a problem when using RDKit to read the molecule and transform it into a smiles string. By using the following commands to read the ligand for 4qsv from PDBBind 2020 and write a SMILES string, we get:

Continue reading →

Out-of-distribution generalisation and scaffold splitting in molecular property prediction

The ability to successfully apply previously acquired knowledge to novel and unfamiliar situations is one of the main hallmarks of successful learning and general intelligence. This capability to effectively generalise is amongst the most desirable properties a prediction model (or a mind, for that matter) can have.

In supervised machine learning, the standard way to evaluate the generalisation power of a prediction model for a given task is to randomly split the whole available data set $X$ into two sets – a training set $X_{\text{train}}$ and a test set $X_{\text{test}}$ . The model is then subsequently trained on the examples in the training set $X_{\text{train}}$ and afterwards its prediction abilities are measured on the untouched examples in the test set $X_{\text{test}}$ via a suitable performance metric.

Since in this scenario the model has never seen any of the examples in $X_{\text{test}}$ during training, its performance on $X_{\text{test}}$ must be indicative of its performance on novel data $X_{\text{new}}$ which it will encounter in the future. Right?

Continue reading →

Automated intermolecular interaction detection using the ODDT Python Module

Detecting intermolecular interactions is often one of the first steps when assessing the binding mode of a ligand. This usually involves the human researcher opening up a molecular viewer and checking the orientations of the ligand and protein functional groups, sometimes aided by the viewer’s own interaction detecting functionality. For looking at single digit numbers of structures, this approach works fairly well, especially as more experienced researchers can spot cases where the automated interaction detection has failed. When analysing tens or hundreds of binding sites, however, an automated way of detecting and recording interaction information for downstream processing is needed. When I had to do this recently, I used an open-source Python module called ODDT (Open Drug Discovery Toolkit, its full documentation can be found here).

My use case was fairly standard: starting with a list of holo protein structures as pdb files and their corresponding ligands in .sdf format, I wanted to detect any hydrogen bonds between a ligand and its native protein crystal structure. Specifically, I needed the number and name of the the interacting residue, its chain ID, and the name of the protein atom involved in the interaction. A general example on how to do this can be found in the ODDT documentation. Below, I show how I have used the code on PDB structure 1a9u.

Continue reading →

ORDER!: Returning bond order information to your docked poses

Common docking software, such as AutoDock Vina or AutoDock 4, require the ligand and receptor files to be converted into the PDBQT format. Once a correct pose has been identified, the pose will be produced also as a .pdbqt file.

Continue reading →

Fragment-to-Lead Successes in 2019

In this blogpost, I want to highlight the excellent work by Jahnke and collaborators. For the past 5 years, they have published an annual perspective covering fragment-to-lead success stories from the previous year. Very helpfully, their work includes a table detailing the hit fragment(s) and lead molecule, together with key experimental results and parameters.

Continue reading →

Curious About the Origins of Computerized Molecules? Free Webinar Dec 22…

After the stunning announcement at CASP14 that DeepMind’s AlphaFold 2 had successfully predicted the structures of proteins from their sequence alone, it’s hard to believe we began this journey by representing molecules with punched cards…

Image of a punched card, showing 80 columns and 12 rows, with particular rectangular holes representing the 1 bits of binary numbers. The upper right corner is cut at an angle, to facilitate feeding the card into a punched card reader. The column numbers are printed along the bottom. The words “IBM UNITED KINGDOM LIMITED” are printed along the very bottom. This card is line 12 from a Fortran program, “12 PIFRA=(A(JB,37)-A(JB,99))/A(JB,47) PUX 0430”. Image Credit: Pete Birkinshaw, Manchester, U.K. CC BY 2.0

Tales of carrying stacks of punched cards to the computer centre with a line drawn diagonally on the side of the stack, to help put them back in order should you trip and fall—seem like another universe—but this is what passed for the human-computer interface in much of the mid-20th century.

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends

Category Archives: Small Molecules

Getting the PDB structures of compounds in ChEMBL

Issues with graph neural networks: the cracks are where the light shines through

How to interact with small molecules in Jupyter Notebooks

Watch out when using PDBbind!

Out-of-distribution generalisation and scaffold splitting in molecular property prediction

Automated intermolecular interaction detection using the ODDT Python Module

ORDER!: Returning bond order information to your docked poses

Fragment-to-Lead Successes in 2019

Curious About the Origins of Computerized Molecules? Free Webinar Dec 22…

When a ring flips, how do we calculate RMSD?