Category Archives: Small Molecules

Atom mapping with RXNMapper

When recently looking at some reaction data, I was confronted with the problem of atom-to-atom mapping (AAM) and what tools are available to tackle it. AAM refers to the process of mapping individual atoms in reactants to their corresponding atoms in the products, which is important for defining a reaction template and identifying which bonds are being formed and broken. This has many downstream uses for computational chemists, such as for reaction searching and forward and retrosynthesis planning1. The problem is that many reaction databases do not contain these mappings, and annotation by expert chemists is impractical for databases containing thousands (or more) data points.

Continue reading

How to easily use pharmacophoric atom features to turn ECFPs into FCFPs

Today’s post builds on my earlier blogpost on how to turn a SMILES string into an extended-connectivity fingerprint using RDKit and describes an interesting and easily implementable modification of the extended-connectivity fingerprint (ECFP) featurisation. This modification is based on representing the atoms in the input compound at a different (and potentially more useful) level of abstraction.

We remember that each binary component of an ECFP indicates the presence or absence of a particular circular subgraph in the input compound. Circular subgraphs that are structurally isomorphic are further distinguished according to their inherited atom- and bond features, i.e. two structurally isomorphic circular subgraphs with distinct atom- or bond features correspond to different components of the ECFP. For chemical bonds, this distinction is made on the basis of simple bond types (single, double, triple, or aromatic). To distinguish atoms, standard ECFPs use seven features based on the Daylight atomic invariants [1]; but there is also another less commonly used and often overlooked version of the ECFP that uses pharmacophoric atom features instead [2]. Pharmacophoric atom features attempt to describe atomic properties that are critical for biological activity or binding to a target protein. These features try to capture the potential for important chemical interactions such as hydrogen bonding or ionic bonding. ECFPs that use pharmacophoric atom features instead of standard atom features are called functional-connectivity fingerprints (FCFPs). The exact sets of standard- vs. pharmacophoric atom features for ECFPs vs. FCFPs are listed in the table below.

In RDKit, ECFPs can be changed to FCFPs extremely easily by changing a single input argument. Below you can find a Python/RDKit implementation of a function that turns a SMILES string into an FCFP if use_features = True and into an ECFP if use_features = False.

# import packages
import numpy as np
from rdkit.Chem import AllChem

# define function that transforms a SMILES string into an FCFP if use_features = True and into an ECFP if use_features = False
def FCFP_from_smiles(smiles,
                     R = 2,
                     L = 2**10,
                     use_features = True,
                     use_chirality = False):
    """
    Inputs:
    
    - smiles ... SMILES string of input compound
    - R ... maximum radius of circular substructures
    - L ... fingerprint-length
    - use_features ... if true then use pharmacophoric atom features, if false then use standard DAYLIGHT atom features
    - use_chirality ... if true then append tetrahedral chirality flags to atom features
    
    Outputs:
    - np.array(feature_list) ... FCFP/ECFP with length L and maximum radius R
    """
    
    molecule = AllChem.MolFromSmiles(smiles)
    feature_list = AllChem.GetMorganFingerprintAsBitVect(molecule,
                                                         radius = R,
                                                         nBits = L,
                                                         useFeatures = use_features,
                                                         useChirality = use_chirality)
    return np.array(feature_list)

The use of pharmacophoric atom features makes FCFPs more specific to molecular interactions that drive biological activity. In certain molecular machine-learning applications, replacing ECFPs with FCFPs can therefore lead to increased performance and decreased learning time, as important high-level atomic properties are presented to the learning algorithm from the start and do not need to be inferred statistically. However, the standard atom features used in ECFPs contain more detailed low-level information that could potentially still be relevant for the prediction task at hand and thus be utilised by the learning algorithm. It is often unclear from the outset whether FCFPs will provide a substantial advantage over ECFPs in a given application; however, given how easy it is to switch between the two, it is almost always worth trying out both options.

[1] Weininger, David, Arthur Weininger, and Joseph L. Weininger. “SMILES. 2. Algorithm for generation of unique SMILES notation.” Journal of Chemical Information and Computer Sciences 29.2 (1989): 97-101.

[2] Rogers, David, and Mathew Hahn. “Extended-connectivity fingerprints.” Journal of Chemical Information and Modeling 50.5 (2010): 742-754.

Some ponderings on generalisability

Now that machine learning has managed to get its proverbial fingers into just about every pie, people have started to worry about the generalisability of methods used. There are a few reasons for these concerns, but a prominent one is that the pressure and publication biases that have led to reproducibility issues in the past are also very present in ML-based science.

The Center for Statistics and Machine Learning at Princeton University hosted a workshop last July highlighting the scale of this problem. Alongside this, they released a running list of papers highlighting reproducibility issues in ML-based science. Included on this list are 20 papers highlighting errors from 17 distinct fields, collectively affecting a whopping 329 papers.

Continue reading

Happy 10th Birthday, Blopig!

OPIG recently celebrated its 20th year; and on 10 January 2023 I gave a talk just a day before the 10th anniversary of BLOPIG’s first blog post. It’s worth reflecting on what’s stayed the same and what’s changed since then.

Continue reading

Bad chemistry in old protein-ligand binding complex data set

The Astex Diverse set [1] is a dataset containing the crystallized poses of 85 protein-ligand complexes. It was introduced in 2007 to address problems in previous datasets such as incorrect ligand representation.

Loading the 85 ligand files with today’s version of the cheminformatics toolkit RDKit [2] is, however, not as straightforward as you might expect.

Continue reading

A ChatGPT rap battle

The AI chatbot revolution is here. Last week, OpenAI released ChatGPT, a freely accessible language model fine-tuned for human conversations. The new model is based on InstructGPT, trained especially for following user instructions and with human feedback in the training loop. 

ChatGPT remembers the previous discussion, admits its mistakes and can even ask for clarification on ambiguous questions. It is also trained to refuse answering questions it deems inappropriate or goes against OpenAI’s AI alignment policy.

In the meanwhile, the internet is having immense fun circumventing its safety filters by asking it to only “PRETEND to be evil”, making it take SAT tests, and even simulating an entire virtual computer within its neural weights. Some are even using it to replace Google searches, and it excels at writing bioinformatics code across most programming languages.

Continue reading

How to turn a SMILES string into an extended-connectivity fingerprint using RDKit

After my posts on how to turn a SMILES string into a molecular graph and how to turn a SMILES string into a vector of molecular descriptors I now complete this series by illustrating how to turn the SMILES string of a molecular compound into an extended-connectivity fingerprint (ECFP).

ECFPs were originally described in a 2010 article of Rogers and Hahn [1] and still belong to the most popular and efficient methods to turn a molecule into an informative vectorial representation for downstream machine learning tasks. The ECFP-algorithm is dependent on two predefined hyperparameters: the fingerprint-length L and the maximum radius R. An ECFP of length L takes the form of an L-dimensional bitvector containing only 0s and 1s. Each component of an ECFP indicates the presence or absence of a particular circular substructure in the input compound. Each circular substructure has a center atom and a radius that determines its size. The hyperparameter R defines the maximum radius of any circular substructure whose presence or absence is indicated in the ECFP. Circular substructures for a central nitrogen atom in an example compound are depicted in the image below.

Continue reading

Code your own molecule sketcher in 4 easy steps!

Drawing molecules on your laptop usually requires access to proprietary software such as ChemDraw (link) or free websites such as PubChem’s online sketcher (link). However, if you are feeling adventurous, you can build your personal sketcher in React/Typescript using the Ketcher package!

Ketcher is an open-source package that allows easy implementation of a molecule sketcher into a web application. Unfortunately, it does require TypeScript so the script to run it cannot be imported directly into an HTML page. Therefore we will set up a simple React app to get it working.

The sketcher is very sleek and has a vast array of functionality, such as choosing any atom from the periodic table and being able to directly import molecules from either SMILES or Mol2/SDF file format into the sketcher. These molecules can then be edited and saved to a new file in the chemical file format of your choosing.

Continue reading

Graphormer: Merging GNNs and Transformers for Cheminformatics

This is my first OPIG blog! I’m going to start with a summary of the Graphormer, a Graph Neural Network (GNN) that borrows concepts from Transformers to boost performance on graph tasks. This post is largely based on the NeurIPS paper Do Transformers Really Perform Bad for Graph Representation? by Ying et. al., which introduces the Graphormer, and which we read for our last deep learning journal club. The project has now been integrated as a Microsoft Research project.

I’ll start with a cheap and cheerful summary of Transformers and GNNs before diving into the changes in the Graphormer. Enjoy!

Continue reading

Universal graph pooling for GNNs

Graph neural networks (GNNs) have quickly become one of the most important tools in computational chemistry and molecular machine learning. GNNs are a type of deep learning architecture designed for the adaptive extraction of vectorial features directly from graph-shaped input data, such as low-level molecular graphs. The feature-extraction mechanism of most modern GNNs can be decomposed into two phases:

  • Message-passing: In this phase the node feature vectors of the graph are iteratively updated following a trainable local neighbourhood-aggregation scheme often referred to as message-passing. Each iteration delivers a set of updated node feature vectors which is then imagined to form a new “layer” on top of all the previous sets of node feature vectors.
  • Global graph pooling: After a sufficient number of layers has been computed, the updated node feature vectors are used to generate a single vectorial representation of the entire graph. This step is known as global graph readout or global graph pooling. Usually only the top layer (i.e. the final set of updated node feature vectors) is used for global graph pooling, but variations of this are possible that involve all computed graph layers and even the set of initial node feature vectors. Commonly employed global graph pooling strategies include taking the sum or the average of the node features in the top graph layer.

While a lot of research attention has been focused on designing novel and more powerful message-passing schemes for GNNs, the global graph pooling step has often been treated with relative neglect. As mentioned in my previous post on the issues of GNNs, I believe this to be problematic. Naive global pooling methods (such as simply summing up all final node feature vectors) can potentially form dangerous information bottlenecks within the neural graph learning pipeline. In the worst case, such information bottlenecks pose the risk of largely cancelling out the information signal delivered by the message-passing step, no matter how sophisticated the message-passing scheme.

Continue reading