Have you ever had an annoying dataset that looks something like this?
or even worse, just several of them
In this blog post, I will introduce basic techniques you can use and implement with Python to identify and clean outliers. The objective will be to get something more eye-pleasing (and mostly less troublesome for further data analysis) like this
ECFPs were originally described in a 2010 article of Rogers and Hahn [1] and still belong to the most popular and efficient methods to turn a molecule into an informative vectorial representation for downstream machine learning tasks. The ECFP-algorithm is dependent on two predefined hyperparameters: the fingerprint-length L and the maximum radius R. An ECFP of length L takes the form of an L-dimensional bitvector containing only 0s and 1s. Each component of an ECFP indicates the presence or absence of a particular circular substructure in the input compound. Each circular substructure has a center atom and a radius that determines its size. The hyperparameter R defines the maximum radius of any circular substructure whose presence or absence is indicated in the ECFP. Circular substructures for a central nitrogen atom in an example compound are depicted in the image below.
Graph neural networks (GNNs) have quickly become one of the most important tools in computational chemistry and molecular machine learning. GNNs are a type of deep learning architecture designed for the adaptive extraction of vectorial features directly from graph-shaped input data, such as low-level molecular graphs. The feature-extraction mechanism of most modern GNNs can be decomposed into two phases:
Message-passing: In this phase the node feature vectors of the graph are iteratively updated following a trainable local neighbourhood-aggregation scheme often referred to as message-passing. Each iteration delivers a set of updated node feature vectors which is then imagined to form a new “layer” on top of all the previous sets of node feature vectors.
Global graph pooling: After a sufficient number of layers has been computed, the updated node feature vectors are used to generate a single vectorial representation of the entire graph. This step is known as global graph readout or global graph pooling. Usually only the top layer (i.e. the final set of updated node feature vectors) is used for global graph pooling, but variations of this are possible that involve all computed graph layers and even the set of initial node feature vectors. Commonly employed global graph pooling strategies include taking the sum or the average of the node features in the top graph layer.
While a lot of research attention has been focused on designing novel and more powerful message-passing schemes for GNNs, the global graph pooling step has often been treated with relative neglect. As mentioned in my previous post on the issues of GNNs, I believe this to be problematic. Naive global pooling methods (such as simply summing up all final node feature vectors) can potentially form dangerous information bottlenecks within the neural graph learning pipeline. In the worst case, such information bottlenecks pose the risk of largely cancelling out the information signal delivered by the message-passing step, no matter how sophisticated the message-passing scheme.
The lineup for the Royal Society of Chemistry’s 5th “Artificial Intelligence in Chemistry” Symposium (Thursday-Friday, 1st-2nd September 2022) is now complete for both oral and poster presentations. It really is a fantastic selection of topics and speakers and it is clear this event is now a highlight of the scientific calendar. Our very own Prof. Charlotte M. Deane, MBE will be giving a keynote.
Finding a way to express the similarity of irregular and discrete molecular graphs to enable quantitative algorithmic reasoning in chemical space is a fundamental problem in data-driven small molecule drug discovery.
Virtually all algorithms that are widely and successfully used in this setting boil down to extracting and comparing (multi-)sets of subgraphs, differing only in the space of substructures they consider and the extent to which they are able to adapt to specific downstream applications.
A large body of recent work has explored approaches centred around graph neural networks (GNNs), which can often maximise both of these considerations. However, the subgraph-derived embeddings learned by these algorithms may not always perform well beyond the specific datasets they are trained on and for many generic or resource-constrained applications more traditional “non-parametric” topological fingerprints may still be a viable and often preferable choice .
This blog post gives an overview of the topological fingerprint algorithms implemented in RDKit. In general, they count the occurrences of a certain family of subgraphs in a given molecule and then represent this set/multiset as a bit/count vector, which can be compared to other fingerprints with the Jaccard/Dice similarity metric or further processed by other algorithms.
So you’ve submitted your paper, made your code publicly available, and maybe even provided documentation to ensure somebody can reproduce your work. But what about the data your work is based on? Is that readily available to your readers, too?
Maybe it’s too large to put on GitHub alongside your code. Maybe it’s sensitive, or subject to GDPR restrictions, so you can’t just stick a download link on your website. Maybe it’s in a proprietary format that needs non-open software to read. There are many reasons sharing data can be less straightforward than sharing code, and often it’s not entirely clear what ‘best practices’ are for a given situation. Data management is a complicated topic, and to do it justice would require far more than a quick blog post. Instead, I’d like to focus on a single source of guidance that serves as a useful starting point for thinking about responsible data management: the FAIR principles.
The default plots made by Python’s matplotlib.pyplot module are almost always insufficient for publication. With a ~20 extra lines of code, however, you can generate high-quality plots suitable for inclusion in your next article.
The fake data I generated for the plot look something like Root Mean Square Deviation (RMSD) versus time for a converged molecular dynamics simulation, so let’s pretend they are. There are a number of problems with this plot: it’s overall ugly, the color scheme is not very attractive and may not be color-blind friendly, the y-axis range of the data extends outside the range of the tick labels, etc.
Molecular descriptors are quantities associated with small molecules that specify physical or chemical properties of interest. They can be used to numerically describe many different aspects of a molecule such as:
molecular graph structure,
lipophilicity (logP),
molecular refractivity,
electrotopological state,
druglikeness,
fragment profile,
molecular charge,
molecular surface,
…
Vectors whose components are molecular descriptors can be used (amongst other things) as high-level feature representations for molecular machine learning. In my experience, molecular descriptor vectors tend to fall slightly short of more low-level molecular representation methods such as extended-connectivity fingerprints or graph neural networks when it comes to predictive performance on large and medium-sized molecular property prediction data sets. However, one advantage of molecular descriptor vectors is their interpretability; there is a reasonable chance that the meaning of a physicochemical descriptor can be intuitively understood by a chemical expert.
A wide variety of useful molecular descriptors can be automatically and easily computed via RDKit purely on the basis of the SMILES string of a molecule. Here is a code snippet to illustrate how this works:
Last year, the Structural Antibody Database (SAbDab) listed a record-breaking 894 new antibody structures, driven in no small part by the continued efforts of the researchers to understand SARS-CoV-2.
Fig. 1: The aggregate growth in antibody structure data (all methods) over time. Taken from http://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/stats/ on 25th May 2022.
In this blog post I wanted to highlight the major driving force behind this curve – the huge increase in cryo electron microscopy (cryoEM) data – and the implications of this for the field of structure-based antibody informatics.
I would add: “Chemistry is nuanced”… Just as there are many different ways of drawing the same molecule, SMILES is flexible enough to allow us to write the same molecule in different ways. While canonical SMILES can resolve this problem, we sometimes have different problem. In some situations, e.g., in machine learning, we need to map all these variants back to the same molecule. We also need to make sure we clean up our input molecules and eliminate invalid or incomplete structures.
Different Versions of the Same Molecule: Salt, Neutral or Charged?
Sometimes, a chemical supplier or compound vendor provides a salt of the compound, e.g., sodium acetate, but all we care about is the organic anion, i.e., the acetate. Very often, our models are built on the assumption we have only one molecule as input—but a salt will appear as two molecules (the sodium ion and the acetate ion). We might also have been given just the negatively-charged acetate instead of the neutral acetic acid.
Tautomers
Another important chemical phenomenon exists where apparently different molecules with identical heavy atoms and a nearby hydrogen can be easily interconverted: tautomers. By moving just one hydrogen atom and exchanging adjacent bond orders, the molecule can convert from one form to another. Usually, one tautomeric form is most stable. Warfarin, a blood-thinning drug, can exist in solution in 40 distinct tautomeric forms. A famous example is keto-enol tautomerism: for example, ethenol (not ethanol) can interconvert with the ketone form. When one form is more stable than the other form(s), we need to make sure we convert the less stable form(s) into the most stable form. Ethenol, a.k.a. vinyl alcohol, (SMILES: ‘C=CO[H]’), will be more stable when it is in the ketone form (SMILES: ‘CC(=O)([H])’):
from IPython.display import SVG # to use Scalar Vector Graphics (SVG) not bitmaps, for cleaner lines
import rdkit
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw # to draw molecules
from rdkit.Chem.Draw import IPythonConsole # to draw inline in iPython
from rdkit.Chem import rdDepictor # to generate 2D depictions of molecules
from rdkit.Chem.Draw import rdMolDraw2D # to draw 2D molecules using vectors
AllChem.ReactionFromSmarts('[C:1]-[C:2](-[O:3]-[H:4])>>[C:1]-[C:2](=[O:3])(-[H:4])')