Tag Archives: Cheminformatics

Visualising and validating differences between machine learning models on small benchmark datasets

Author

Sam Money-Kyrle

Introduction

An epidemic is sweeping through cheminformatics (and machine learning) research: ugly results tables. These tables are typically bloated with metrics (such as regression and classification metrics next to each other), vastly differing tasks, erratic bold text, and many models. As a consequence, results become difficult to analyse and interpret. Additionally, it is rare to see convincing evidence, such as statistical tests, for whether one model is ‘better’ than another (something Pat Walters has previously discussed). Tables are a practical way to present results and are appropriate in many cases; however, this practicality should not come at the cost of clarity.

The terror of ugly tables extends to benchmark leaderboards, such as the Therapeutics Data Commons (TDC). These leaderboard tables do not show:

  1. whether differences in metrics between methods are statistically significant,
  2. whether methods use ensembles or single models,
  3. whether methods use classical (such as Morgan fingerprints) or learned (such as Graph Neural Networks) representations,
  4. whether methods are pre-trained or not,
  5. whether pre-trained models are supervised, self-supervised, or both,
  6. the data and tasks that pre-trained models are pre-trained on.

This lack of context makes meaningful comparisons between approaches challenging, obscuring whether performance discrepancies are due to variance, ensembling, overfitting, exposure to more data, or novelties in model architecture and molecular featurisation. Confirming the statistical significance of performance differences (under consistent experimental conditions!) is crucial in constructing a more lucid picture of machine learning in drug discovery. Using figures to share results in a clear, non-tabular format would also help.

Statistical validation is particularly relevant in domains with small datasets, such as drug discovery, as the small number of test samples leads to high variance in performance between different splits. Recent work by Ash et al. (2024) sought to alleviate the lack of statistical validation in cheminformatics by sharing a helpful set of guidelines for researchers. Here, we explore implementing some of the methods they suggest (plus some others) in Python.
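One of the simplest such checks is a paired significance test over repeated train/test splits: evaluate both models on the same splits, record a metric per split, and test whether the paired differences are consistently non-zero. A minimal sketch (the scores are invented for illustration, and the Wilcoxon signed-rank test is just one of several sensible choices):

import numpy as np
from scipy.stats import wilcoxon

# Hypothetical R^2 scores for two models over the same ten random splits
model_a = np.array([0.71, 0.68, 0.74, 0.70, 0.69, 0.73, 0.72, 0.67, 0.70, 0.71])
model_b = np.array([0.69, 0.66, 0.73, 0.68, 0.70, 0.71, 0.70, 0.66, 0.68, 0.69])

# Paired, non-parametric test on the per-split differences; a small p-value
# is evidence that the gap is not just split-to-split noise
stat, p_value = wilcoxon(model_a, model_b)
print(f"Wilcoxon statistic = {stat:.1f}, p = {p_value:.3f}")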

Continue reading

How to turn a SMILES string into a vector of molecular descriptors using RDKit

Molecular descriptors are quantities associated with small molecules that specify physical or chemical properties of interest. They can be used to numerically describe many different aspects of a molecule such as:

  • molecular graph structure,
  • lipophilicity (logP),
  • molecular refractivity,
  • electrotopological state,
  • druglikeness,
  • fragment profile,
  • molecular charge,
  • molecular surface.

Vectors whose components are molecular descriptors can be used (amongst other things) as high-level feature representations for molecular machine learning. In my experience, molecular descriptor vectors tend to fall slightly short of lower-level molecular representation methods such as extended-connectivity fingerprints or graph neural networks when it comes to predictive performance on large- and medium-sized molecular property prediction data sets. However, one advantage of molecular descriptor vectors is their interpretability; there is a reasonable chance that the meaning of a physicochemical descriptor can be intuitively understood by a chemical expert.

A wide variety of useful molecular descriptors can be automatically and easily computed via RDKit purely on the basis of the SMILES string of a molecule. Here is a code snippet to illustrate how this works:
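In outline, it might look something like this (a minimal sketch using RDKit's built-in descriptor list; the example molecule is arbitrary):

from rdkit import Chem
from rdkit.Chem import Descriptors

# Example molecule: caffeine
mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C)C(=O)N2C")

# Descriptors.descList is a list of (name, function) pairs covering
# roughly 200 descriptors such as MolLogP, TPSA and MolWt
descriptor_names = [name for name, _ in Descriptors.descList]
descriptor_vector = [func(mol) for _, func in Descriptors.descList]

print(len(descriptor_vector), "descriptors computed")

Stacking these vectors row-wise over a list of SMILES strings gives a ready-made feature matrix for scikit-learn-style models.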

Continue reading

Out-of-distribution generalisation and scaffold splitting in molecular property prediction

The ability to successfully apply previously acquired knowledge to novel and unfamiliar situations is one of the main hallmarks of successful learning and general intelligence. This capability to effectively generalise is amongst the most desirable properties a prediction model (or a mind, for that matter) can have.

In supervised machine learning, the standard way to evaluate the generalisation power of a prediction model for a given task is to randomly split the whole available data set X into two sets – a training set X_{\text{train}} and a test set X_{\text{test}}. The model is trained on the examples in the training set X_{\text{train}}, and its predictive ability is then measured on the untouched examples in the test set X_{\text{test}} via a suitable performance metric.

Since in this scenario the model has never seen any of the examples in X_{\text{test}} during training, its performance on X_{\text{test}} must be indicative of its performance on novel data X_{\text{new}} which it will encounter in the future. Right?
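The post's title already hints at one way to stress-test that assumption: scaffold splitting, in which molecules are grouped by their Bemis-Murcko scaffolds and each group is kept entirely in either the training or the test set. A minimal sketch with RDKit (the fill-the-test-set-with-small-groups heuristic below is one common convention, not necessarily the one used in the post):

from collections import defaultdict
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_fraction=0.2):
    """Assign whole Bemis-Murcko scaffold groups to train or test,
    so that no scaffold appears on both sides of the split."""
    scaffold_to_indices = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        scaffold = Chem.MolToSmiles(MurckoScaffold.GetScaffoldForMol(mol))
        scaffold_to_indices[scaffold].append(i)

    # Small scaffold groups fill the test set first, so the training set
    # retains the most populous scaffolds
    groups = sorted(scaffold_to_indices.values(), key=len)
    n_test = int(test_fraction * len(smiles_list))
    train_idx, test_idx = [], []
    for group in groups:
        (test_idx if len(test_idx) < n_test else train_idx).extend(group)
    return train_idx, test_idx

A model that looks strong under a random split will often shed a noticeable amount of performance under a scaffold split, which is usually a more honest preview of its behaviour on X_{\text{new}}.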

Continue reading

Trying out some code from the Eighth Joint Sheffield Conference on Chemoinformatics: finding the most common functional groups present in the DSPL library

Last month a bunch of us attended the Sheffield Chemoinformatics Conference. We heard many great presentations, and there were plenty of invitations to check out the speakers’ GitHub pages. I decided now was the perfect time to try out some code shown by one of the presenters.

Peter Ertl from Novartis presented his work on The encyclopedia of functional groups. He described a method that automatically detects functional groups without relying on a pre-defined list (which is what most other detection methods use). His method recursively searches through a molecule to identify groups of atoms that meet certain criteria, and he used it to answer questions such as: how many functional groups are there, and which are the most common in synthetic molecules versus bioactive molecules versus natural products? Since I, like many others in the group, am interested in fragment libraries (possibly due to a supervisor in common), I thought I could try it out on one of these.
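Conveniently, RDKit ships an implementation of Ertl’s algorithm in its Contrib directory (Contrib/IFG). A minimal sketch of calling it, assuming the IFG directory has been added to the Python path (the path and example molecule below are placeholders):

import sys
sys.path.append("/path/to/rdkit/Contrib/IFG")  # adjust to your RDKit installation

from rdkit import Chem
import ifg  # Ertl functional-group identification from RDKit Contrib

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example
for group in ifg.identify_functional_groups(mol):
    # each hit carries the atom indices and a SMILES-like type string
    print(group.atomIds, group.type)

Counting the type strings across a whole library with collections.Counter then surfaces its most common functional groups.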

Continue reading

Advances in Conformer Generation: ETKDG and ETDG

Predicting the possible shapes a small molecule can adopt is essential to understanding its chemistry and the possible biological roles of candidate drugs. Thus, conformer generation, the process of converting a topological description of a molecule into a set of 3D positions of its constituent atoms, is an essential component of computational drug discovery.

A former member of OPIG, Dr Jean-Paul Ebejer, together with myself and Prof. Charlotte Deane, compared freely-available conformer generation methods, aiming:

“(i) to identify which tools most accurately reproduce experimentally determined structures;
(ii) to examine the diversity of the generated conformational set; and
(iii) to benchmark the computational time expended.”

in Ebejer et al., 2012. JP assembled a set of 708 crystal structures of drug-like molecules from the OMEGA validation set and the Astex Diverse Set with which to test the various methods. We found that RDKit, combining its Distance Geometry (DG) algorithm with energy minimization using the MMFF94 force field, proved to be a “valid free alternative to commercial, proprietary software”, and was able to generate “a diverse and representative set of conformers which also contains a close conformer to the known structure”.

Following on from our work at InhibOx, and building on the same benchmark set JP assembled, Greg Landrum and Sereina Riniker recently described (Riniker & Landrum, 2015) two new conformer generation methods, ETDG and ETKDG, that improve upon the classical distance geometry (DG) algorithm. They do this by combining DG with knowledge of preferred torsional angles derived from experimentally determined crystal structures (ETDG), and by further adding constraints from chemical knowledge, such as ‘aromatic rings are flat’ or ‘bonds connected to triple bonds are collinear’ (ETKDG). They compared DG, ETDG, ETKDG, and a knowledge-based method, CONFECT, and found:

“ETKDG was found to outperform standard DG and the knowledge-based conformer generator CONFECT in reproducing crystal conformations from both small-molecule crystals (CSD data set) and protein–ligand complexes (PDB data set). With ETKDG, 84% of a set of 1290 small-molecule crystal structures from the CSD could be reproduced within an RMSD of 1.0 Å and 38% within an RMSD of 0.5 Å. The experimental torsional-angle preferences or the K terms alone each performed better than standard DG but were not sufficient to obtain the full performance of ETKDG.

Comparison of ETKDG with the DG conformers optimized using either the Universal Force Field (UFF) or the Merck Molecular Force Field (MMFF) showed different results for the two data sets. While FF-optimized DG performed better on the CSD data set, the two approaches were comparable for the PDB data set.”

They also showed (Fig. 13) that their ETKDG method was faster than DG followed by energy minimization, but not quite as accurate in reproducing the crystal structure.

“ETKDG takes 3 times as long as DG. The addition of the K terms, i.e., generating ETKDG embeddings instead of ETDG embeddings, increases runtime by only 10% over ETDG (results not shown). Despite the longer runtime per conformer, ETKDG requires on average one-quarter of the number of conformers to achieve performance similar to DG (Figure S12 and Table S4 in the Supporting Information). This results in a net performance improvement, at least when it comes to reproducing crystal conformers.

As measured by performance in reproducing experimental crystal structures, ETKDG is a viable alternative to plain DG followed by a UFF-optimization, so it is of interest how their runtimes compare. Figure 13 (right) plots the runtime for ETKDG versus the runtime for DG + UFF-optimization. The median ratio of the DG + UFF optimization and ETKDG runtimes is 1.97, i.e., DG + UFF optimization takes almost twice as long as ETKDG. Thus, although ETKDG is significantly slower than DG on a per-conformer basis, when higher-quality conformations are required it can provide structures that are the equivalent of those obtained using DG + UFF-optimization in about half the time.”

ETKDG looks like a great addition to the RDKit toolbox for conformer generation (and it was great to see JP thanked in the Acknowledgments!).
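In recent RDKit versions, switching from plain DG to ETKDG is essentially a one-line change; a minimal sketch (the example molecule is arbitrary):

from rdkit import Chem
from rdkit.Chem import AllChem

# Example molecule; adding explicit hydrogens before embedding is recommended
mol = Chem.AddHs(Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O"))

# Classical distance geometry
AllChem.EmbedMolecule(mol)

# ETKDG: DG plus experimental torsion preferences plus the 'K' chemical
# knowledge terms
params = AllChem.ETKDG()
params.randomSeed = 42  # make the embedding reproducible
AllChem.EmbedMolecule(mol, params)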


References

Ebejer, J. P., G. M. Morris and C. M. Deane (2012). “Freely available conformer generation methods: how good are they?” J Chem Inf Model, 52(5): 1146-1158. 10.1021/ci2004658.

Riniker, S. and G. A. Landrum (2015). “Better Informed Distance Geometry: Using What We Know To Improve Conformation Generation.” J Chem Inf Model, 55(12): 2562-2574. 10.1021/acs.jcim.5b00654.