Environment modules is a great tool for high-performance computing as it is a modular system to quickly and painlessly enable preset configurations of environment variables, for example a user may be provided with modulefile for an antiquated version of a tool and a bleeding-edge alpha version of that same tool and they can easily load whichever they wish. In many clusters the modules are created with a tool called EasyBuild, which delivered an out-of-the-box installation. This works for things like a single binary, but for conda this severely falls short as there are many many configuration changes needed.
Continue readingTag Archives: Python
Ligands of CASF-2016
CASF-2016 is a commonly used benchmark for docking tools. Unfortunately, some of the provided ligand files cannot be loaded using RDKit (version 2022.09.1) but there is an easy remedy.
Continue readingHow to easily use pharmacophoric atom features to turn ECFPs into FCFPs
Today’s post builds on my earlier blogpost on how to turn a SMILES string into an extended-connectivity fingerprint using RDKit and describes an interesting and easily implementable modification of the extended-connectivity fingerprint (ECFP) featurisation. This modification is based on representing the atoms in the input compound at a different (and potentially more useful) level of abstraction.
We remember that each binary component of an ECFP indicates the presence or absence of a particular circular subgraph in the input compound. Circular subgraphs that are structurally isomorphic are further distinguished according to their inherited atom- and bond features, i.e. two structurally isomorphic circular subgraphs with distinct atom- or bond features correspond to different components of the ECFP. For chemical bonds, this distinction is made on the basis of simple bond types (single, double, triple, or aromatic). To distinguish atoms, standard ECFPs use seven features based on the Daylight atomic invariants [1]; but there is also another less commonly used and often overlooked version of the ECFP that uses pharmacophoric atom features instead [2]. Pharmacophoric atom features attempt to describe atomic properties that are critical for biological activity or binding to a target protein. These features try to capture the potential for important chemical interactions such as hydrogen bonding or ionic bonding. ECFPs that use pharmacophoric atom features instead of standard atom features are called functional-connectivity fingerprints (FCFPs). The exact sets of standard- vs. pharmacophoric atom features for ECFPs vs. FCFPs are listed in the table below.
In RDKit, ECFPs can be changed to FCFPs extremely easily by changing a single input argument. Below you can find a Python/RDKit implementation of a function that turns a SMILES string into an FCFP if use_features = True and into an ECFP if use_features = False.
# import packages import numpy as np from rdkit.Chem import AllChem # define function that transforms a SMILES string into an FCFP if use_features = True and into an ECFP if use_features = False def FCFP_from_smiles(smiles, R = 2, L = 2**10, use_features = True, use_chirality = False): """ Inputs: - smiles ... SMILES string of input compound - R ... maximum radius of circular substructures - L ... fingerprint-length - use_features ... if true then use pharmacophoric atom features, if false then use standard DAYLIGHT atom features - use_chirality ... if true then append tetrahedral chirality flags to atom features Outputs: - np.array(feature_list) ... FCFP/ECFP with length L and maximum radius R """ molecule = AllChem.MolFromSmiles(smiles) feature_list = AllChem.GetMorganFingerprintAsBitVect(molecule, radius = R, nBits = L, useFeatures = use_features, useChirality = use_chirality) return np.array(feature_list)
The use of pharmacophoric atom features makes FCFPs more specific to molecular interactions that drive biological activity. In certain molecular machine-learning applications, replacing ECFPs with FCFPs can therefore lead to increased performance and decreased learning time, as important high-level atomic properties are presented to the learning algorithm from the start and do not need to be inferred statistically. However, the standard atom features used in ECFPs contain more detailed low-level information that could potentially still be relevant for the prediction task at hand and thus be utilised by the learning algorithm. It is often unclear from the outset whether FCFPs will provide a substantial advantage over ECFPs in a given application; however, given how easy it is to switch between the two, it is almost always worth trying out both options.
[1] Weininger, David, Arthur Weininger, and Joseph L. Weininger. “SMILES. 2. Algorithm for generation of unique SMILES notation.” Journal of Chemical Information and Computer Sciences 29.2 (1989): 97-101.
[2] Rogers, David, and Mathew Hahn. “Extended-connectivity fingerprints.” Journal of Chemical Information and Modeling 50.5 (2010): 742-754.
Two useful modules to help you find the best ML model for your task
FLAML and LazyPredict are two packages designed to quickly train and test machine learning models from scikit-learn so that you can determine which is the best type of model for learning from your data.
Continue readingHow to turn a SMILES string into an extended-connectivity fingerprint using RDKit
After my posts on how to turn a SMILES string into a molecular graph and how to turn a SMILES string into a vector of molecular descriptors I now complete this series by illustrating how to turn the SMILES string of a molecular compound into an extended-connectivity fingerprint (ECFP).
ECFPs were originally described in a 2010 article of Rogers and Hahn [1] and still belong to the most popular and efficient methods to turn a molecule into an informative vectorial representation for downstream machine learning tasks. The ECFP-algorithm is dependent on two predefined hyperparameters: the fingerprint-length L and the maximum radius R. An ECFP of length L takes the form of an L-dimensional bitvector containing only 0s and 1s. Each component of an ECFP indicates the presence or absence of a particular circular substructure in the input compound. Each circular substructure has a center atom and a radius that determines its size. The hyperparameter R defines the maximum radius of any circular substructure whose presence or absence is indicated in the ECFP. Circular substructures for a central nitrogen atom in an example compound are depicted in the image below.
Continue readingUsing Conda environments with Flask and Apache
With the advent of ABlooper, we’ve recently introduced OpenMM as a new dependency for the SAbDab-SAbPred antibody modelling platform. By far the easiest way to install the OpenMM Python API is via Conda, so we’ve moved to Conda environments for the entire platform. This has made installation of the platform much easier, but introduces complications when it comes to running its web applications under Apache. In this post, I’ll briefly explain the reason for this, and provide a basic guide for running Flask apps using Conda environments under Apache.
Continue readingExploring topological fingerprints in RDKit
Finding a way to express the similarity of irregular and discrete molecular graphs to enable quantitative algorithmic reasoning in chemical space is a fundamental problem in data-driven small molecule drug discovery.
Virtually all algorithms that are widely and successfully used in this setting boil down to extracting and comparing (multi-)sets of subgraphs, differing only in the space of substructures they consider and the extent to which they are able to adapt to specific downstream applications.
A large body of recent work has explored approaches centred around graph neural networks (GNNs), which can often maximise both of these considerations. However, the subgraph-derived embeddings learned by these algorithms may not always perform well beyond the specific datasets they are trained on and for many generic or resource-constrained applications more traditional “non-parametric” topological fingerprints may still be a viable and often preferable choice .
This blog post gives an overview of the topological fingerprint algorithms implemented in RDKit. In general, they count the occurrences of a certain family of subgraphs in a given molecule and then represent this set/multiset as a bit/count vector, which can be compared to other fingerprints with the Jaccard/Dice similarity metric or further processed by other algorithms.
Continue readingVisualise with Weight and Biases
Understanding what’s going on when you’ve started training your shiny new ML model is hard enough. Will it work? Have I got the right parameters? Is it the data? Probably. Any tool that can help with that process is a Godsend. Weights and biases is a great tool to help you visualise and track your model throughout your production cycle. In this blog post, I’m going to detail some basics on how you can initialise and use it to visualise your next project.
Installation
To use weights and biases (wandb), you need to make an account. For individuals it is free, however, for team-oriented features, you will have to pay. Wandb can then be installed using pip or conda.
$ conda install -c conda-forge wandb or $ pip install wandb
To initialise your project, import the package, sign in, and then use the following command using your chosen project name and username (if you want):
import wandb wandb.login() wandb.init(project='project1')
In addition to your project, you can also initialise a config dictionary with starting parameter values:
Continue readingMaking better plots with matplotlib.pyplot in Python3
The default plots made by Python’s matplotlib.pyplot module are almost always insufficient for publication. With a ~20 extra lines of code, however, you can generate high-quality plots suitable for inclusion in your next article.
Let’s start with code for a very default plot:
import matplotlib.pyplot as plt import numpy as np np.random.seed(1) d1 = np.random.normal(1.0, 0.1, 1000) d2 = np.random.normal(3.0, 0.1, 1000) xvals = np.arange(1, 1000+1, 1) plt.plot(xvals, d1, label='data1') plt.plot(xvals, d2, label='data2') plt.legend(loc='best') plt.xlabel('Time, ns') plt.ylabel('RMSD, Angstroms') plt.savefig('bad.png', dpi=300)
The result of this will be:
The fake data I generated for the plot look something like Root Mean Square Deviation (RMSD) versus time for a converged molecular dynamics simulation, so let’s pretend they are. There are a number of problems with this plot: it’s overall ugly, the color scheme is not very attractive and may not be color-blind friendly, the y-axis range of the data extends outside the range of the tick labels, etc.
We can easily convert this to a much better plot:
Continue readingHow to turn a SMILES string into a vector of molecular descriptors using RDKit
Molecular descriptors are quantities associated with small molecules that specify physical or chemical properties of interest. They can be used to numerically describe many different aspects of a molecule such as:
- molecular graph structure,
- lipophilicity (logP),
- molecular refractivity,
- electrotopological state,
- druglikeness,
- fragment profile,
- molecular charge,
- molecular surface,
- …
Vectors whose components are molecular descriptors can be used (amongst other things) as high-level feature representations for molecular machine learning. In my experience, molecular descriptor vectors tend to fall slightly short of more low-level molecular representation methods such as extended-connectivity fingerprints or graph neural networks when it comes to predictive performance on large and medium-sized molecular property prediction data sets. However, one advantage of molecular descriptor vectors is their interpretability; there is a reasonable chance that the meaning of a physicochemical descriptor can be intuitively understood by a chemical expert.
A wide variety of useful molecular descriptors can be automatically and easily computed via RDKit purely on the basis of the SMILES string of a molecule. Here is a code snippet to illustrate how this works:
Continue reading