The ability to apply previously acquired knowledge to novel and unfamiliar situations is one of the main hallmarks of successful learning and general intelligence. This capability to generalise effectively is amongst the most desirable properties a prediction model (or a mind, for that matter) can have.
In supervised machine learning, the standard way to evaluate the generalisation power of a prediction model for a given task is to randomly split the whole available data set $\mathcal{D}$ into two sets – a training set $\mathcal{D}_{\text{train}}$ and a test set $\mathcal{D}_{\text{test}}$. The model is then trained on the examples in $\mathcal{D}_{\text{train}}$, and its prediction abilities are afterwards measured on the untouched examples in $\mathcal{D}_{\text{test}}$ via a suitable performance metric.
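As a minimal sketch of this evaluation protocol (assuming a scikit-learn-style workflow; the synthetic data, the random forest model and the mean absolute error metric below are purely illustrative placeholders), this could look as follows:

```python
# minimal sketch of a random-split evaluation (the data, model and metric choices
# below are illustrative placeholders, not part of the discussion in the text)
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# toy data standing in for a real feature matrix X and label vector y
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))
y = X[:, 0] + 0.1 * rng.normal(size=1000)

# split the whole data set uniformly at random into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# train the model on the training set ...
model = RandomForestRegressor(random_state=0)
model.fit(X_train, y_train)

# ... and measure its predictive abilities on the untouched test set
print("MAE on the random test set:", mean_absolute_error(y_test, model.predict(X_test)))
```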
Since in this scenario the model has never seen any of the examples in $\mathcal{D}_{\text{test}}$ during training, its performance on $\mathcal{D}_{\text{test}}$ must be indicative of its performance on novel data which it will encounter in the future. Right?
No.
In practice, one can regularly observe a situation where a machine learning model which performs well on a randomly selected test set fails spectacularly when confronted with novel data $\mathcal{D}_{\text{new}}$ which was collected at a later point in time, by a different lab, in a different environment, or in some other context that differs from the original context in which the initial data set was collected. The reason for this can be found in the distributional shift between $\mathcal{D}$ and $\mathcal{D}_{\text{new}}$ which frequently occurs when the data collection context (and thus the data-generating process) is altered in some way.
If the data split for the initial data set $\mathcal{D}$ into training set $\mathcal{D}_{\text{train}}$ and test set $\mathcal{D}_{\text{test}}$ is done uniformly at random (as is usual), then both $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{test}}$ follow the same distribution. This random uniform data split is very much in accordance with the framework of classical statistical learning theory [1], where one assumes that a learning model is primarily built to deal with training and test examples that have all been sampled independently from the same underlying probability distribution.
Unfortunately, a random uniform data split is rarely a good simulation of practical reality, where a newly collected data set $\mathcal{D}_{\text{new}}$ which is fed into a machine learning model to obtain predictions almost never follows the data distribution of the data set $\mathcal{D}$ on which the model was originally trained. This distributional shift between the initial training data set $\mathcal{D}_{\text{train}}$ and the newly collected data set $\mathcal{D}_{\text{new}}$ normally leads to a substantial drop in the performance of the model on $\mathcal{D}_{\text{new}}$ compared to its performance on a test set $\mathcal{D}_{\text{test}}$ which follows the same distribution as $\mathcal{D}_{\text{train}}$. Thus, splitting the initial data set $\mathcal{D}$ uniformly at random into a test set $\mathcal{D}_{\text{test}}$ and a training set $\mathcal{D}_{\text{train}}$ often leads to overoptimistic results when trying to estimate the predictive abilities of a machine learning model in a practical setting.
To get a more reliable picture of the real-world predictive capabilities of a trained machine learning model, one must find a way to model a meaningful distributional shift and build it into the test set $\mathcal{D}_{\text{test}}$. Evaluating the model on $\mathcal{D}_{\text{test}}$ can then provide a measure for the out-of-distribution generalisation abilities of the model.
Measuring out-of-distribution generalisation is of particular relevance in the field of molecular property prediction, where distributional shifts tend to be large and difficult for machine learning models to handle. Different molecular data sets obtained by distinct pharmaceutical companies and research groups often contain compounds from vastly different areas of chemical space that exhibit high structural heterogeneity. An elegant solution for modelling such distributional shifts in chemical space is given by the idea of scaffold splitting.
The notion of a (two-dimensional) molecular scaffold is described in the article by Bemis and Murcko [2]. A molecular scaffold reduces the chemical structure of a compound to its core components, essentially by removing all side chains and only keeping ring systems and parts which link together ring systems. An additional option for making molecular scaffolds even more general is to “forget” the identities of the bonds and atoms by replacing all atoms with carbons and all bonds with single bonds.
Bemis-Murcko scaffolds can be automatically generated in RDKit via the following Python code:
```python
# how to extract the Bemis-Murcko scaffold of a molecular compound via RDKit

# import packages
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

# define compound via its SMILES string
smiles = "CN1CCCCC1CCN2C3=CC=CC=C3SC4=C2C=C(C=C4)SC"

# convert SMILES string to RDKit mol object
mol = Chem.MolFromSmiles(smiles)

# create RDKit mol object corresponding to the Bemis-Murcko scaffold of the original compound
mol_scaffold = MurckoScaffold.GetScaffoldForMol(mol)

# make the scaffold generic by replacing all atoms with carbons and all bonds with single bonds
mol_scaffold_generic = MurckoScaffold.MakeScaffoldGeneric(mol_scaffold)

# convert the generic scaffold mol object back to a SMILES string
smiles_scaffold_generic = Chem.CanonSmiles(Chem.MolToSmiles(mol_scaffold_generic))

# display the compound and its generic Bemis-Murcko scaffold
# (display() assumes a Jupyter notebook environment)
display(mol)
print(smiles)
display(mol_scaffold_generic)
print(smiles_scaffold_generic)
```
If we now have a molecular data set $\mathcal{D}$, we can map each compound in $\mathcal{D}$ to its respective scaffold. Let us assume that a total number of $k$ pairwise distinct scaffolds appear in $\mathcal{D}$ and that these scaffolds are numbered consecutively from $1$ to $k$. We can then define an equivalence relation on $\mathcal{D}$ by calling two compounds equivalent if they share the same scaffold. The associated equivalence classes consist of compound sets $\mathcal{S}_1, \dots, \mathcal{S}_k$, whereby a given set $\mathcal{S}_i$ contains all compounds in $\mathcal{D}$ which share the $i$-th scaffold. It is not hard to see that the sets $\mathcal{S}_1, \dots, \mathcal{S}_k$ form a partition of the original data set $\mathcal{D}$. Without loss of generality, we assume that the equivalence classes are ordered by size in descending order, i.e. we assume that $\mathcal{S}_1$ contains at least as many molecules as $\mathcal{S}_2$, and so on.
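Assuming the data set $\mathcal{D}$ is given as a plain list of SMILES strings, these equivalence classes could be computed roughly as follows, reusing the RDKit scaffold functions from the code above (the helper names are only illustrative):

```python
# sketch: group the compounds of a data set into scaffold-based equivalence classes
# (assumes the data set is given as a list of SMILES strings; function names are illustrative)
from collections import defaultdict

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def generic_scaffold_smiles(smiles):
    """Return the canonical SMILES of the generic Bemis-Murcko scaffold of a compound."""
    mol = Chem.MolFromSmiles(smiles)
    scaffold = MurckoScaffold.GetScaffoldForMol(mol)
    scaffold_generic = MurckoScaffold.MakeScaffoldGeneric(scaffold)
    return Chem.CanonSmiles(Chem.MolToSmiles(scaffold_generic))

def scaffold_classes(dataset_smiles):
    """Group compounds sharing the same generic scaffold; return the classes sorted by size, largest first."""
    classes = defaultdict(list)
    for smiles in dataset_smiles:
        classes[generic_scaffold_smiles(smiles)].append(smiles)
    return sorted(classes.values(), key=len, reverse=True)
```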
One appropriate way to now produce a scaffold split of the molecular data set $\mathcal{D}$ into a training set $\mathcal{D}_{\text{train}}$ and a test set $\mathcal{D}_{\text{test}}$ for machine learning is to define $\mathcal{D}_{\text{train}}$ as the union of the first (larger) sets $\mathcal{S}_1, \dots, \mathcal{S}_m$ and $\mathcal{D}_{\text{test}}$ as the union of the last (smaller) sets $\mathcal{S}_{m+1}, \dots, \mathcal{S}_k$. Here $m$ is a custom index parameter which can be used to control the respective sizes of $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{test}}$; frequently $m$ is chosen such that $\mathcal{D}_{\text{train}}$ contains approximately $80\%$ of the examples in $\mathcal{D}$.
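Building on the (hypothetical) `scaffold_classes` helper sketched above, such a scaffold split could then be implemented roughly like this (the 80/20 ratio is only an illustrative default):

```python
# sketch of a scaffold split into a training set and a test set
# (builds on the scaffold_classes helper sketched above; the split ratio is illustrative)
def scaffold_split(dataset_smiles, train_fraction=0.8):
    """Split a list of SMILES strings into a scaffold-based train/test partition."""
    classes = scaffold_classes(dataset_smiles)  # equivalence classes, largest first
    train_set, test_set = [], []
    # assign the larger scaffold classes to the training set until the desired fraction
    # is reached, then assign all remaining (smaller) classes to the test set
    for compound_class in classes:
        if len(train_set) < train_fraction * len(dataset_smiles):
            train_set.extend(compound_class)
        else:
            test_set.extend(compound_class)
    return train_set, test_set

# usage sketch with a handful of illustrative SMILES strings
smiles_list = ["c1ccccc1CCN", "c1ccccc1CCO", "C1CCNCC1CC", "c1ccc2ccccc2c1O"]
train_smiles, test_smiles = scaffold_split(smiles_list, train_fraction=0.8)
print(len(train_smiles), len(test_smiles))
```

Since whole scaffold classes are assigned to either $\mathcal{D}_{\text{train}}$ or $\mathcal{D}_{\text{test}}$, no scaffold that appears in the test set is ever seen during training.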
While a scaffold split is certainly not perfect, it is already a lot better than a uniform random split at providing a relevant measure of the practical utility of a molecular property prediction model. It mimics a situation where the training set $\mathcal{D}_{\text{train}}$ was sampled from a structurally different area of chemical space than the test set $\mathcal{D}_{\text{test}}$. This creates a distributional shift between $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{test}}$ which is comparable to the distributional shifts which are commonly observed in real chemical data sets. Evaluating a molecular machine learning model using a scaffold split rather than a uniform random split thus leads to significantly more robust results.
References:
[1] Cucker, Felipe, and Steve Smale. “On the mathematical foundations of learning.” Bulletin of the American Mathematical Society 39.1 (2002): 1-49.
[2] Bemis, Guy W., and Mark A. Murcko. “The properties of known drugs. 1. Molecular frameworks.” Journal of medicinal chemistry 39.15 (1996): 2887-2893.