Category Archives: Small Molecules

Meeko: Docking straight from SMILES string

When docking, using software like AutoDock Vina, you must prepare your ligand by protonating the molecule, generating 3D coordinates, and converting it to a specific file format (in the case of Vina, PDBQT). Docking software typically needs the protein and ligand file inputs to be written on disk. This is limiting as generating 10,000s of files for a large virtual screen can be annoying and hinder the speed at which you dock.

Fortunately, the Forli group in Scripps Research have developed a Python package, Meeko, to prepare ligands directly from SMILES or other molecule formats for docking to AutoDock 4 or Vina, without writing any files to disk. This means you can dock directly from a single file containing all the SMILES of the ligands you are investigating!

Continue reading

5th Artificial Intelligence in Chemistry Symposium

The lineup for the Royal Society of Chemistry’s 5th “Artificial Intelligence in Chemistry” Symposium (Thursday-Friday, 1st-2nd September 2022) is now complete for both oral and poster presentations. It really is a fantastic selection of topics and speakers and it is clear this event is now a highlight of the scientific calendar. Our very own Prof. Charlotte M. Deane, MBE will be giving a keynote.

5th RSC-BMCS/RSC-CICAG Airtificial Intelligence in Chemistry Symposium, 1st-2nd September, Churchill College, Cambridge + Zoom broadcast.

It marks a return to in-person meetings: it will be held at Churchill College, Cambridge, with a conference dinner at Trinity Hall.

More details are here: https://www.rscbmcs.org/events/aichem22/.

Registration for in person attendance is open until Monday 29th August 17:00 (BST).

It is also possible to register for virtual attendance; the meeting will be broadcast on Zoom.

Exploring topological fingerprints in RDKit

Finding a way to express the similarity of irregular and discrete molecular graphs to enable quantitative algorithmic reasoning in chemical space is a fundamental problem in data-driven small molecule drug discovery.

Virtually all algorithms that are widely and successfully used in this setting boil down to extracting and comparing (multi-)sets of subgraphs, differing only in the space of substructures they consider and the extent to which they are able to adapt to specific downstream applications.

A large body of recent work has explored approaches centred around graph neural networks (GNNs), which can often maximise both of these considerations. However, the subgraph-derived embeddings learned by these algorithms may not always perform well beyond the specific datasets they are trained on and for many generic or resource-constrained applications more traditional “non-parametric” topological fingerprints may still be a viable and often preferable choice .

This blog post gives an overview of the topological fingerprint algorithms implemented in RDKit. In general, they count the occurrences of a certain family of subgraphs in a given molecule and then represent this set/multiset as a bit/count vector, which can be compared to other fingerprints with the Jaccard/Dice similarity metric or further processed by other algorithms.

Continue reading

Viewing fragment elaborations in RDKit

As a reasonably new RDKit user, I was relieved to find that using its built-in functionality for generating basic images from molecules is quite easy to use. However, over time I have picked up some additional tricks to make the images generated slightly more pleasing on the eye!

The first of these (which I definitely stole from another blog post at some point…) is to ask it to produce SVG images rather than png:

#ensure the molecule visualisation uses svg rather than png format
IPythonConsole.ipython_useSVG=True

Now for something slightly more interesting: as a fragment elaborator, I often need to look at a long list of elaborations that have been made to a starting fragment. As these have usually been docked, these don’t look particularly nice when loaded straight into RDKit and drawn:

#load several mols from a single sdf file using SDMolSupplier
#add these to a list
elabs = [mol for mol in Chem.SDMolSupplier('frag2/elabsTestNoRefine_Docked_0.sdf')]

#get list of ligand efficiencies so these can be displayed alongside the molecules
LEs = [(float(mol.GetProp('Gold.PLP.Fitness'))/mol.GetNumHeavyAtoms()) for mol in elabs]

Draw.MolsToGridImage(elabs, legends = [str(LE) for LE in LEs])
Fig. 1: Images generated without doing any tinkering

Two quick changes that will immediately make this image more useful are aligning the elaborations by a supplied substructure (here I supplied the original fragment so that it’s always in the same place) and calculating the 2D coordinates of the molecules so we don’t see the twisty business happening in the bottom right of Fig. 1:

Continue reading

How to turn a SMILES string into a vector of molecular descriptors using RDKit

Molecular descriptors are quantities associated with small molecules that specify physical or chemical properties of interest. They can be used to numerically describe many different aspects of a molecule such as:

  • molecular graph structure,
  • lipophilicity (logP),
  • molecular refractivity,
  • electrotopological state,
  • druglikeness,
  • fragment profile,
  • molecular charge,
  • molecular surface,

Vectors whose components are molecular descriptors can be used (amongst other things) as high-level feature representations for molecular machine learning. In my experience, molecular descriptor vectors tend to fall slightly short of more low-level molecular representation methods such as extended-connectivity fingerprints or graph neural networks when it comes to predictive performance on large and medium-sized molecular property prediction data sets. However, one advantage of molecular descriptor vectors is their interpretability; there is a reasonable chance that the meaning of a physicochemical descriptor can be intuitively understood by a chemical expert.

A wide variety of useful molecular descriptors can be automatically and easily computed via RDKit purely on the basis of the SMILES string of a molecule. Here is a code snippet to illustrate how this works:

Continue reading

From code to molecules: The future of chemical synthesis

In June, after I finish my PhD, I will be joining Chemify, a new startup based in Glasgow that aims to make chemical synthesis universally accessible, reproducible and fully automated using AI and robotics. After previously talking about “Why you should care about startups as a researcher” and a quick guide on “Commercialising your research: Where to start?” on this blog, I have now joined a science-based startup fresh out of university myself.

Chemify is a spinout from the University of Glasgow originating from the group of Prof. Lee Cronin. The core of the technology is the chemical programming language χDL (pronounced “chi DL”) that, in combination with a natural language processing AI that reads and understands chemical synthesis procedures, can be used to plan and autonomously executed chemical reactions on robotic hardware. The Cronin group has also already build the modular robotic hardware needed to carry out almost any chemical reaction, the “Chemputer”. Due to the flexibility of both the Chemputer and the χDL language, Chemify has already shown that the applications go way beyond simple synthesis and can be applied to drug formulation, the discovery of new materials or the optimisation of reaction conditions.

Armed with this transformational software and hardware, Chemify is now fully operational and is hiring exceptional talent into their labs in Glasgow. I am excited to see how smart, AI-driven automation techniques like Chemify will change how small scale chemical synthesis and chemical discovery more broadly is done in the future. I’m super excited to be part of the journey.

Paper review: “EquiBind”

Molecular docking helps us understand how small-molecules interact with proteins. This is especially useful in early drug development stages such as target identification and compound screening. Quick and accurate docking software allows researchers to focus their attention on a smaller set of lead molecules for further testing. Traditionally, docking software has employed first principles from physics and chemistry. Recently, deep learning has become all the rage for molecular docking, maybe motivated by the successful application of deep learning to molecular folding.

Method

EquiBind is a deep learning unconstrained docking method which models a fixed receptor and a ligand with selected rotatable bonds. It predicts the binding pocket and the ligand’s conformation within the pocket in one go. Under the hood, EquiBind employs two great ideas from a recent ICLR 2022 Paper: a SE3-invariant graph neural network based architecture and the idea to generate fixed sets of matching key points to define a rotation and translation between receptor and ligand. In addition, the authors innovate a fast method to project a deformed ligand onto the space spanned by the rotatable bonds of a pre-generated ligand conformation.

Continue reading

Better Models Through Molecular Standardization

“Cheminformatics is hard.”

— Paul Finn

I would add: “Chemistry is nuanced”… Just as there are many different ways of drawing the same molecule, SMILES is flexible enough to allow us to write the same molecule in different ways. While canonical SMILES can resolve this problem, we sometimes have different problem. In some situations, e.g., in machine learning, we need to map all these variants back to the same molecule. We also need to make sure we clean up our input molecules and eliminate invalid or incomplete structures.

Different Versions of the Same Molecule: Salt, Neutral or Charged?

Sometimes, a chemical supplier or compound vendor provides a salt of the compound, e.g., sodium acetate, but all we care about is the organic anion, i.e., the acetate. Very often, our models are built on the assumption we have only one molecule as input—but a salt will appear as two molecules (the sodium ion and the acetate ion). We might also have been given just the negatively-charged acetate instead of the neutral acetic acid.

Tautomers

Another important chemical phenomenon exists where apparently different molecules with identical heavy atoms and a nearby hydrogen can be easily interconverted: tautomers. By moving just one hydrogen atom and exchanging adjacent bond orders, the molecule can convert from one form to another. Usually, one tautomeric form is most stable. Warfarin, a blood-thinning drug, can exist in solution in 40 distinct tautomeric forms. A famous example is keto-enol tautomerism: for example, ethenol (not ethanol) can interconvert with the ketone form. When one form is more stable than the other form(s), we need to make sure we convert the less stable form(s) into the most stable form. Ethenol, a.k.a. vinyl alcohol, (SMILES: ‘C=CO[H]’), will be more stable when it is in the ketone form (SMILES: ‘CC(=O)([H])’):

from IPython.display import SVG # to use Scalar Vector Graphics (SVG) not bitmaps, for cleaner lines

import rdkit
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw # to draw molecules
from rdkit.Chem.Draw import IPythonConsole # to draw inline in iPython
from rdkit.Chem import rdDepictor  # to generate 2D depictions of molecules
from rdkit.Chem.Draw import rdMolDraw2D # to draw 2D molecules using vectors

AllChem.ReactionFromSmarts('[C:1]-[C:2](-[O:3]-[H:4])>>[C:1]-[C:2](=[O:3])(-[H:4])')
Continue reading

How to prepare a molecule for RDKit

RDKit is very fussy when it comes to inputs in SDF format. Using the SDMolSupplier, we get a significant rate of failure even on curated datasets such as the PDBBind refined set. Pymol has no such scruples, and with that, I present a function which has proved invaluable to me over the course of my DPhil. For reasons I have never bothered to explore, using pymol to convert from sdf, into mol2 and back to sdf format again (adding in missing hydrogens along the way) will almost always make a molecule safe to import using RDKit:

from pathlib import Path
from pymol import cmd

def py_mollify(sdf, overwrite=False):
    """Use pymol to sanitise an SDF file for use in RDKit.

    Arguments:
        sdf: location of faulty sdf file
        overwrite: whether or not to overwrite the original sdf. If False,
            a new file will be written in the form <sdf_fname>_pymol.sdf
            
    Returns:
        Original sdf filename if overwrite == False, else the filename of the
        sanitised output.
    """
    sdf = Path(sdf).expanduser().resolve()
    mol2_fname = str(sdf).replace('.sdf', '_pymol.mol2')
    new_sdf_fname = sdf if overwrite else str(sdf).replace('.sdf', '_pymol.sdf')
    cmd.load(str(sdf))
    cmd.h_add('all')
    cmd.save(mol2_fname)
    cmd.reinitialize()
    cmd.load(mol2_fname)
    cmd.save(str(new_sdf_fname))
    return new_sdf_fname

Post-processing for molecular docking: Assigning the correct bond order using RDKit.

AutoDock4 and AutoDock Vina are the most commonly used open-source software for protein-ligand docking. However, they both rely on a derivative of the “PDB” (Protein Data Base) file format: the “PDBQT” file (Protein Data Bank, Partial Charge (Q), & Atom Type (T)). In addition to the information contained in normal PDB files, PDBQT files have an additional column that lists the partial charge (Q) and the assigned AutoDock atom type (T) for each atom in the molecule. AutoDock atom types offer a more granular differentiation between atoms such as listing aliphatic carbons and aromatic carbons as separate AutoDock atom types.

The biggest drawback about the PDBQT format is that it does not encode for the bond order in molecules explicitly. Instead, the bond order is inferred based on the atom type, distance and angle to nearby atoms in the molecule. For normal sp3 carbons and molecules with mostly single bonds this system works fine, however, for more complex structures containing for example aromatic rings, conjugated systems and hypervalent atoms such as sulphur, the bond order is often not displayed correctly. This leads to issues downstream in the screening pipeline when molecules suddenly change their bond order or have to be discarded after docking because of impossible bond orders.

The solution to this problem is included in RDKit: The AssignBondOrdersFromTemplate function. All you need to do is load the original molecule used for docking as a template molecule and the docked pose PDBQT file into RDKIT as a PDB, without the bond order information. Then assign the original bond order from your template molecule. The following code snippet covers the necessary functions and should help you build a more accurate and reproducible protein-ligand docking pipeline:

#import RDKit AllChem
from rdkit import Chem
from rdkit.Chem import AllChem


#load original molecule from smiles
SMILES_STRING = "CCCCCCCCC" #the smiles string of your ligand
template = Chem.MolFromSmiles(SMILES_STRING)

#load the docked pose as a PDB file
loc_of_docked_pose = "docked_pose_mol.pdb" #file location of the docked pose converted to PDB file
docked_pose = AllChem.MolFromPDBFile(loc_of_docked_pose)

#Assign the bond order to force correct valence
newMol = AllChem.AssignBondOrdersFromTemplate(template, docked_pose)

#Add Hydrogens if desired. "addCoords = True" makes sure the hydrogens are added in 3D. This does not take pH/pKa into account. 
newMol_H = Chem.AddHs(newMol, addCoords=True)

#save your new correct molecule as a sdf file that encodes for bond orders correctly
output_loc = "docked_pose_assigned_bond_order.sdf" #output file name
Chem.MolToMolFile(newMol_H, output_loc)