FLAML and LazyPredict are two packages designed to quickly train and test machine learning models from scikit-learn so that you can determine which is the best type of model for learning from your data.
Continue readingTag Archives: Python
How to turn a SMILES string into an extended-connectivity fingerprint using RDKit
After my posts on how to turn a SMILES string into a molecular graph and how to turn a SMILES string into a vector of molecular descriptors I now complete this series by illustrating how to turn the SMILES string of a molecular compound into an extended-connectivity fingerprint (ECFP).
ECFPs were originally described in a 2010 article of Rogers and Hahn [1] and still belong to the most popular and efficient methods to turn a molecule into an informative vectorial representation for downstream machine learning tasks. The ECFP-algorithm is dependent on two predefined hyperparameters: the fingerprint-length L and the maximum radius R. An ECFP of length L takes the form of an L-dimensional bitvector containing only 0s and 1s. Each component of an ECFP indicates the presence or absence of a particular circular substructure in the input compound. Each circular substructure has a center atom and a radius that determines its size. The hyperparameter R defines the maximum radius of any circular substructure whose presence or absence is indicated in the ECFP. Circular substructures for a central nitrogen atom in an example compound are depicted in the image below.
Continue readingUsing Conda environments with Flask and Apache
With the advent of ABlooper, we’ve recently introduced OpenMM as a new dependency for the SAbDab-SAbPred antibody modelling platform. By far the easiest way to install the OpenMM Python API is via Conda, so we’ve moved to Conda environments for the entire platform. This has made installation of the platform much easier, but introduces complications when it comes to running its web applications under Apache. In this post, I’ll briefly explain the reason for this, and provide a basic guide for running Flask apps using Conda environments under Apache.
Continue readingExploring topological fingerprints in RDKit
Finding a way to express the similarity of irregular and discrete molecular graphs to enable quantitative algorithmic reasoning in chemical space is a fundamental problem in data-driven small molecule drug discovery.
Virtually all algorithms that are widely and successfully used in this setting boil down to extracting and comparing (multi-)sets of subgraphs, differing only in the space of substructures they consider and the extent to which they are able to adapt to specific downstream applications.
A large body of recent work has explored approaches centred around graph neural networks (GNNs), which can often maximise both of these considerations. However, the subgraph-derived embeddings learned by these algorithms may not always perform well beyond the specific datasets they are trained on and for many generic or resource-constrained applications more traditional “non-parametric” topological fingerprints may still be a viable and often preferable choice .
This blog post gives an overview of the topological fingerprint algorithms implemented in RDKit. In general, they count the occurrences of a certain family of subgraphs in a given molecule and then represent this set/multiset as a bit/count vector, which can be compared to other fingerprints with the Jaccard/Dice similarity metric or further processed by other algorithms.
Continue readingVisualise with Weight and Biases
Understanding what’s going on when you’ve started training your shiny new ML model is hard enough. Will it work? Have I got the right parameters? Is it the data? Probably. Any tool that can help with that process is a Godsend. Weights and biases is a great tool to help you visualise and track your model throughout your production cycle. In this blog post, I’m going to detail some basics on how you can initialise and use it to visualise your next project.
Installation
To use weights and biases (wandb), you need to make an account. For individuals it is free, however, for team-oriented features, you will have to pay. Wandb can then be installed using pip or conda.
$ conda install -c conda-forge wandb or $ pip install wandb
To initialise your project, import the package, sign in, and then use the following command using your chosen project name and username (if you want):
import wandb wandb.login() wandb.init(project='project1')
In addition to your project, you can also initialise a config dictionary with starting parameter values:
Continue readingMaking better plots with matplotlib.pyplot in Python3
The default plots made by Python’s matplotlib.pyplot module are almost always insufficient for publication. With a ~20 extra lines of code, however, you can generate high-quality plots suitable for inclusion in your next article.
Let’s start with code for a very default plot:
import matplotlib.pyplot as plt import numpy as np np.random.seed(1) d1 = np.random.normal(1.0, 0.1, 1000) d2 = np.random.normal(3.0, 0.1, 1000) xvals = np.arange(1, 1000+1, 1) plt.plot(xvals, d1, label='data1') plt.plot(xvals, d2, label='data2') plt.legend(loc='best') plt.xlabel('Time, ns') plt.ylabel('RMSD, Angstroms') plt.savefig('bad.png', dpi=300)
The result of this will be:
The fake data I generated for the plot look something like Root Mean Square Deviation (RMSD) versus time for a converged molecular dynamics simulation, so let’s pretend they are. There are a number of problems with this plot: it’s overall ugly, the color scheme is not very attractive and may not be color-blind friendly, the y-axis range of the data extends outside the range of the tick labels, etc.
We can easily convert this to a much better plot:
Continue readingHow to turn a SMILES string into a vector of molecular descriptors using RDKit
Molecular descriptors are quantities associated with small molecules that specify physical or chemical properties of interest. They can be used to numerically describe many different aspects of a molecule such as:
- molecular graph structure,
- lipophilicity (logP),
- molecular refractivity,
- electrotopological state,
- druglikeness,
- fragment profile,
- molecular charge,
- molecular surface,
- …
Vectors whose components are molecular descriptors can be used (amongst other things) as high-level feature representations for molecular machine learning. In my experience, molecular descriptor vectors tend to fall slightly short of more low-level molecular representation methods such as extended-connectivity fingerprints or graph neural networks when it comes to predictive performance on large and medium-sized molecular property prediction data sets. However, one advantage of molecular descriptor vectors is their interpretability; there is a reasonable chance that the meaning of a physicochemical descriptor can be intuitively understood by a chemical expert.
A wide variety of useful molecular descriptors can be automatically and easily computed via RDKit purely on the basis of the SMILES string of a molecule. Here is a code snippet to illustrate how this works:
Continue readingGitHub actions can be useful
GitHub actions is a (relatively) novel GitHub feature that allows you to run code on GitHub when a predefined event is triggered. The most widespread use case for GitHub actions is for Continuous Integration, because it allows you to automatically test your code on any machine immediately after each push. For a great tutorial on how to use it for this see here.
But you can do so much more with them!! Basically you can set up any workflow to run after any event. An event is basically when a specific activity on GitHub happens, while a workflow is basically the script you want to run after the event has happened. For a full list of the events you can use see here. Workflow scripts are written in a .yml file and should be saved within the .github/worflows directory within your repository. I am incapable of writing a better tutorial for these than what is already on their documentation, but I will show a copy of a workflow script I recently put together and walk you through it.
In one of my previous blog posts I wrote about how to upload your code to PyPI. Hopefully I convinced you that this is quite easy, but it does require a few steps that you may not want to be doing every time you come up with a new feature (find a bug) and have to re-upload it. Luckily, you don’t have to!! Just stick the code into a GitHub actions workflow so it will automatically re-upload it for you. Here is the script I use for this:
Continue readingPost-processing for molecular docking: Assigning the correct bond order using RDKit.
AutoDock4 and AutoDock Vina are the most commonly used open-source software for protein-ligand docking. However, they both rely on a derivative of the “PDB” (Protein Data Base) file format: the “PDBQT” file (Protein Data Bank, Partial Charge (Q), & Atom Type (T)). In addition to the information contained in normal PDB files, PDBQT files have an additional column that lists the partial charge (Q) and the assigned AutoDock atom type (T) for each atom in the molecule. AutoDock atom types offer a more granular differentiation between atoms such as listing aliphatic carbons and aromatic carbons as separate AutoDock atom types.
The biggest drawback about the PDBQT format is that it does not encode for the bond order in molecules explicitly. Instead, the bond order is inferred based on the atom type, distance and angle to nearby atoms in the molecule. For normal sp3 carbons and molecules with mostly single bonds this system works fine, however, for more complex structures containing for example aromatic rings, conjugated systems and hypervalent atoms such as sulphur, the bond order is often not displayed correctly. This leads to issues downstream in the screening pipeline when molecules suddenly change their bond order or have to be discarded after docking because of impossible bond orders.
The solution to this problem is included in RDKit: The AssignBondOrdersFromTemplate function. All you need to do is load the original molecule used for docking as a template molecule and the docked pose PDBQT file into RDKIT as a PDB, without the bond order information. Then assign the original bond order from your template molecule. The following code snippet covers the necessary functions and should help you build a more accurate and reproducible protein-ligand docking pipeline:
#import RDKit AllChem from rdkit import Chem from rdkit.Chem import AllChem #load original molecule from smiles SMILES_STRING = "CCCCCCCCC" #the smiles string of your ligand template = Chem.MolFromSmiles(SMILES_STRING) #load the docked pose as a PDB file loc_of_docked_pose = "docked_pose_mol.pdb" #file location of the docked pose converted to PDB file docked_pose = AllChem.MolFromPDBFile(loc_of_docked_pose) #Assign the bond order to force correct valence newMol = AllChem.AssignBondOrdersFromTemplate(template, docked_pose) #Add Hydrogens if desired. "addCoords = True" makes sure the hydrogens are added in 3D. This does not take pH/pKa into account. newMol_H = Chem.AddHs(newMol, addCoords=True) #save your new correct molecule as a sdf file that encodes for bond orders correctly output_loc = "docked_pose_assigned_bond_order.sdf" #output file name Chem.MolToMolFile(newMol_H, output_loc)
How to turn a SMILES string into a molecular graph for Pytorch Geometric
Despite some of their technical issues, graph neural networks (GNNs) are quickly being adopted as one of the state-of-the-art methods for molecular property prediction. The differentiable extraction of molecular features from low-level molecular graphs has become a viable (although not always superior) alternative to classical molecular representation techniques such as Morgan fingerprints and molecular descriptor vectors.
But molecular data usually comes in the sequential form of labeled SMILES strings. It is not obvious for beginners how to optimally transform a SMILES string into a structured molecular graph object that can be used as an input for a GNN. In this post, we show how to convert a SMILES string into a molecular graph object which can subsequently be used for graph-based machine learning. We do so within the framework of Pytorch Geometric which currently is one of the best and most commonly used Python-based GNN-libraries.
We divide our task into three high-level steps:
- We define a function that maps an RDKit atom object to a suitable atom feature vector.
- We define a function that maps an RDKit bond object to a suitable bond feature vector.
- We define a function that takes as its input a list of SMILES strings and associated labels and then uses the functions from 1.) and 2.) to create a list of labeled Pytorch Geometric graph objects as its output.