Category Archives: Python

How to turn a SMILES string into a vector of molecular descriptors using RDKit

Molecular descriptors are quantities associated with small molecules that specify physical or chemical properties of interest. They can be used to numerically describe many different aspects of a molecule such as:

  • molecular graph structure,
  • lipophilicity (logP),
  • molecular refractivity,
  • electrotopological state,
  • druglikeness,
  • fragment profile,
  • molecular charge,
  • molecular surface,

Vectors whose components are molecular descriptors can be used (amongst other things) as high-level feature representations for molecular machine learning. In my experience, molecular descriptor vectors tend to fall slightly short of more low-level molecular representation methods such as extended-connectivity fingerprints or graph neural networks when it comes to predictive performance on large and medium-sized molecular property prediction data sets. However, one advantage of molecular descriptor vectors is their interpretability; there is a reasonable chance that the meaning of a physicochemical descriptor can be intuitively understood by a chemical expert.

A wide variety of useful molecular descriptors can be automatically and easily computed via RDKit purely on the basis of the SMILES string of a molecule. Here is a code snippet to illustrate how this works:

Continue reading

Make your code do more, with less

When you wrangle data for a living, you start to wonder why everything takes so darn long. Through five years of introspection, I have come to conclude that two simple factors limit every computational project. One is, of course, your personal productivity. Your time of focused work, minus distractions (and yes, meetings figure here), times your energy and mental acuity. All those things you have little control over, unfortunately. But the second is the productivity of your code and tools. And this, in principle, is a variable that you have full control over.

Even quick calculations, when applied to tens of millions of sequences, can take quite some time!

This is a post about how to increase your productivity, by helping you navigate all those instances when the progress bar does not seem to go fast enough. I want to discuss actionable tools to make your code run faster, and generate more results, with less effort, in less time. Instructions to tinker less and think more, so you can do the science that you truly want to be doing. And, above all, I want to give out advice that is so counter-intuitive that you should absolutely consider following it.

Continue reading

Better Models Through Molecular Standardization

“Cheminformatics is hard.”

— Paul Finn

I would add: “Chemistry is nuanced”… Just as there are many different ways of drawing the same molecule, SMILES is flexible enough to allow us to write the same molecule in different ways. While canonical SMILES can resolve this problem, we sometimes have different problem. In some situations, e.g., in machine learning, we need to map all these variants back to the same molecule. We also need to make sure we clean up our input molecules and eliminate invalid or incomplete structures.

Different Versions of the Same Molecule: Salt, Neutral or Charged?

Sometimes, a chemical supplier or compound vendor provides a salt of the compound, e.g., sodium acetate, but all we care about is the organic anion, i.e., the acetate. Very often, our models are built on the assumption we have only one molecule as input—but a salt will appear as two molecules (the sodium ion and the acetate ion). We might also have been given just the negatively-charged acetate instead of the neutral acetic acid.

Tautomers

Another important chemical phenomenon exists where apparently different molecules with identical heavy atoms and a nearby hydrogen can be easily interconverted: tautomers. By moving just one hydrogen atom and exchanging adjacent bond orders, the molecule can convert from one form to another. Usually, one tautomeric form is most stable. Warfarin, a blood-thinning drug, can exist in solution in 40 distinct tautomeric forms. A famous example is keto-enol tautomerism: for example, ethenol (not ethanol) can interconvert with the ketone form. When one form is more stable than the other form(s), we need to make sure we convert the less stable form(s) into the most stable form. Ethenol, a.k.a. vinyl alcohol, (SMILES: ‘C=CO[H]’), will be more stable when it is in the ketone form (SMILES: ‘CC(=O)([H])’):

from IPython.display import SVG # to use Scalar Vector Graphics (SVG) not bitmaps, for cleaner lines

import rdkit
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw # to draw molecules
from rdkit.Chem.Draw import IPythonConsole # to draw inline in iPython
from rdkit.Chem import rdDepictor  # to generate 2D depictions of molecules
from rdkit.Chem.Draw import rdMolDraw2D # to draw 2D molecules using vectors

AllChem.ReactionFromSmarts('[C:1]-[C:2](-[O:3]-[H:4])>>[C:1]-[C:2](=[O:3])(-[H:4])')
Continue reading

Einops: Powerful library for tensor operations in deep learning

Tobias and I recently gave a talk at the OPIG retreat on tips for using PyTorch. For this we created a tutorial on Google Colab notebook (link can be found here). I remember rambling about the advantages of implementing your own models against using other peoples code. Well If I convinced you, einops is for you!!

Basically, einops lets you perform operations on tensors using the Einstein Notation. This package comes with a number of advantages a few of which I will try and summarise here:

Continue reading

How to prepare a molecule for RDKit

RDKit is very fussy when it comes to inputs in SDF format. Using the SDMolSupplier, we get a significant rate of failure even on curated datasets such as the PDBBind refined set. Pymol has no such scruples, and with that, I present a function which has proved invaluable to me over the course of my DPhil. For reasons I have never bothered to explore, using pymol to convert from sdf, into mol2 and back to sdf format again (adding in missing hydrogens along the way) will almost always make a molecule safe to import using RDKit:

from pathlib import Path
from pymol import cmd

def py_mollify(sdf, overwrite=False):
    """Use pymol to sanitise an SDF file for use in RDKit.

    Arguments:
        sdf: location of faulty sdf file
        overwrite: whether or not to overwrite the original sdf. If False,
            a new file will be written in the form <sdf_fname>_pymol.sdf
            
    Returns:
        Original sdf filename if overwrite == False, else the filename of the
        sanitised output.
    """
    sdf = Path(sdf).expanduser().resolve()
    mol2_fname = str(sdf).replace('.sdf', '_pymol.mol2')
    new_sdf_fname = sdf if overwrite else str(sdf).replace('.sdf', '_pymol.sdf')
    cmd.load(str(sdf))
    cmd.h_add('all')
    cmd.save(mol2_fname)
    cmd.reinitialize()
    cmd.load(mol2_fname)
    cmd.save(str(new_sdf_fname))
    return new_sdf_fname

How to Install Open Source PyMOL on Windows 10

It is possible to get an installer for the crystallographer’s favourite molecular visualization tool for Windows machines, that is if you are willing to pay a fee. Fortunately, Christoph Gohlke has made available free, pre-compiled Windows versions of the latest PyMOL software, along with all of it’s requirements, it’s just not particularly straightforward to install. The PyMOLWiki offers a three-step guide on how to do this and I will break it down to make it somewhat clearer.

1. Install the latest version of Python 3 for Windows

Download the Windows Installer (x-bit) for Python 3 from their website, x being your Windows architecture – 32 or 64.

Then, follow the instructions on how to install it. You can check if it has installed by running the following in PowerShell:

Continue reading

Post-processing for molecular docking: Assigning the correct bond order using RDKit.

AutoDock4 and AutoDock Vina are the most commonly used open-source software for protein-ligand docking. However, they both rely on a derivative of the “PDB” (Protein Data Base) file format: the “PDBQT” file (Protein Data Bank, Partial Charge (Q), & Atom Type (T)). In addition to the information contained in normal PDB files, PDBQT files have an additional column that lists the partial charge (Q) and the assigned AutoDock atom type (T) for each atom in the molecule. AutoDock atom types offer a more granular differentiation between atoms such as listing aliphatic carbons and aromatic carbons as separate AutoDock atom types.

The biggest drawback about the PDBQT format is that it does not encode for the bond order in molecules explicitly. Instead, the bond order is inferred based on the atom type, distance and angle to nearby atoms in the molecule. For normal sp3 carbons and molecules with mostly single bonds this system works fine, however, for more complex structures containing for example aromatic rings, conjugated systems and hypervalent atoms such as sulphur, the bond order is often not displayed correctly. This leads to issues downstream in the screening pipeline when molecules suddenly change their bond order or have to be discarded after docking because of impossible bond orders.

The solution to this problem is included in RDKit: The AssignBondOrdersFromTemplate function. All you need to do is load the original molecule used for docking as a template molecule and the docked pose PDBQT file into RDKIT as a PDB, without the bond order information. Then assign the original bond order from your template molecule. The following code snippet covers the necessary functions and should help you build a more accurate and reproducible protein-ligand docking pipeline:

#import RDKit AllChem
from rdkit import Chem
from rdkit.Chem import AllChem


#load original molecule from smiles
SMILES_STRING = "CCCCCCCCC" #the smiles string of your ligand
template = Chem.MolFromSmiles(SMILES_STRING)

#load the docked pose as a PDB file
loc_of_docked_pose = "docked_pose_mol.pdb" #file location of the docked pose converted to PDB file
docked_pose = AllChem.MolFromPDBFile(loc_of_docked_pose)

#Assign the bond order to force correct valence
newMol = AllChem.AssignBondOrdersFromTemplate(template, docked_pose)

#Add Hydrogens if desired. "addCoords = True" makes sure the hydrogens are added in 3D. This does not take pH/pKa into account. 
newMol_H = Chem.AddHs(newMol, addCoords=True)

#save your new correct molecule as a sdf file that encodes for bond orders correctly
output_loc = "docked_pose_assigned_bond_order.sdf" #output file name
Chem.MolToMolFile(newMol_H, output_loc)

Non-linear Dependence? Mutual Information to the Rescue!

We are all familiar with the idea of a correlation. In the broadest sense of the word, a correlation can refer to any kind of dependence between two variables. There are three widely used tests for correlation:

  • Spearman’s r: Used to measure a linear relationship between two variables. Requires linear dependence and each marginal distribution to be normal.
  • Pearson’s ρ: Used to measure rank correlations. Requires the dependence structure to be described by a monotonic relationship
  • Kendall’s 𝛕: Used to measure ordinal association between variables.

While these three measures give us plenty of options to work with, they do not work in all cases. Take for example the following variables, Y1 and Y2. These might be two variables that vary in a concerted manner.

Perhaps we suspect that a state change in Y1 leads to a state change in Y2 or vice versa and we want to measure the association between these variables. Using the three measures of correlation, we get the following results:

Continue reading

How to turn a SMILES string into a molecular graph for Pytorch Geometric

Despite some of their technical issues, graph neural networks (GNNs) are quickly being adopted as one of the state-of-the-art methods for molecular property prediction. The differentiable extraction of molecular features from low-level molecular graphs has become a viable (although not always superior) alternative to classical molecular representation techniques such as Morgan fingerprints and molecular descriptor vectors.

But molecular data usually comes in the sequential form of labeled SMILES strings. It is not obvious for beginners how to optimally transform a SMILES string into a structured molecular graph object that can be used as an input for a GNN. In this post, we show how to convert a SMILES string into a molecular graph object which can subsequently be used for graph-based machine learning. We do so within the framework of Pytorch Geometric which currently is one of the best and most commonly used Python-based GNN-libraries.

We divide our task into three high-level steps:

  1. We define a function that maps an RDKit atom object to a suitable atom feature vector.
  2. We define a function that maps an RDKit bond object to a suitable bond feature vector.
  3. We define a function that takes as its input a list of SMILES strings and associated labels and then uses the functions from 1.) and 2.) to create a list of labeled Pytorch Geometric graph objects as its output.
Continue reading

Python’s Data Classes

When writing code, you have inevitably needed to store data throughout your pipeline. In these cases you store your value, list or data frame as a variable to easily use it elsewhere in your code. However, sometimes your data has an awkward form, consisting of a number of different length lists or data of different types and sizes. While it is still doable to work with, and using tuples or dictionaries can help, accessing different elements in your data quickly becomes messy and it is less intuitive what your code is actually doing.

To solve the above stated problem, data classes were introduced as a new feature in Python 3.7. A data class is a regular Python class, but with certain methods already implemented for you. This makes them easy to create and removes a lot of boilerplate (repeated code) making them simpler, more intuitive and pretty. Further, as data classes are part of the standard library, you can directly import it without needing to install any external dependencies (noice).

With the sales pitch out of the way, let us look at how we can use data classes.

from dataclasses import dataclass
from typing import Any

@dataclass
class Antibody:
    vgene: str
    jgene: None
    sequence: Any = 'EVQ'
Continue reading