The ability to apply previously acquired knowledge to novel and unfamiliar situations is one of the main hallmarks of successful learning and general intelligence. This capability to generalise effectively is amongst the most desirable properties a prediction model (or a mind, for that matter) can have.
In supervised machine learning, the standard way to evaluate the generalisation power of a prediction model for a given task is to randomly split the whole available data set $\mathcal{D}$ into two sets – a training set $\mathcal{D}_{\text{train}}$ and a test set $\mathcal{D}_{\text{test}}$. The model is subsequently trained on the examples in the training set $\mathcal{D}_{\text{train}}$, and afterwards its prediction abilities are measured on the untouched examples in the test set $\mathcal{D}_{\text{test}}$ via a suitable performance metric. Since in this scenario the model has never seen any of the examples in $\mathcal{D}_{\text{test}}$ during training, its performance on $\mathcal{D}_{\text{test}}$ must be indicative of its performance on novel data which it will encounter in the future. Right?
No.
In practice, one can regularly observe a situation where a machine learning model which performs well on a randomly selected test set fails spectacularly when confronted with novel data $\mathcal{D}_{\text{new}}$ which was collected at a later point in time, by a different lab, in a different environment, or in some other context that differs from the original context in which the initial data set $\mathcal{D}$ was collected. The reason for this can be found in the distributional shift between $\mathcal{D}$ and $\mathcal{D}_{\text{new}}$ which frequently occurs when the data collection context (and thus the data-generating process) is altered in some way.
If the split of the initial data set $\mathcal{D}$ into training set $\mathcal{D}_{\text{train}}$ and test set $\mathcal{D}_{\text{test}}$ is done uniformly at random (as is usual), then both $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{test}}$ follow the same distribution. This random uniform data split is very much in accordance with the framework of classical statistical learning theory [1], where one assumes that a learning model is primarily built to deal with training and test examples that have all been sampled independently from the same underlying probability distribution.
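For concreteness, such a uniform random split can be sketched in a few lines of Python, e.g. via scikit-learn's train_test_split; the placeholder arrays, the 80/20 ratio and the random seed below are purely illustrative choices, not part of the original discussion:

```python
# minimal sketch of a uniform random data split (illustrative data and split ratio)
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 16)  # placeholder feature matrix
y = np.random.rand(1000)      # placeholder labels

# split uniformly at random; both resulting parts follow the same underlying distribution
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```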
Unfortunately, a random uniform data split is rarely a good simulation of practical reality where a newly collected data set $\mathcal{D}_{\text{new}}$ which is fed into a machine learning model to obtain predictions almost never follows the data distribution of the data set $\mathcal{D}_{\text{train}}$ on which the model was originally trained. This distributional shift between the initial training data set $\mathcal{D}_{\text{train}}$ and the newly collected data set $\mathcal{D}_{\text{new}}$ normally leads to a substantial drop in performance of the model on $\mathcal{D}_{\text{new}}$ compared to its performance on a test set $\mathcal{D}_{\text{test}}$ which follows the same distribution as $\mathcal{D}_{\text{train}}$. Thus, splitting the initial data set $\mathcal{D}$ uniformly at random into a test set $\mathcal{D}_{\text{test}}$ and a training set $\mathcal{D}_{\text{train}}$ often leads to overoptimistic results when trying to estimate the predictive abilities of a machine learning model in a practical setting.
To get a more reliable picture of the real-world predictive capabilities of a trained machine learning model, one must find a way to model a meaningful distributional shift and build it into the test set $\mathcal{D}_{\text{test}}$. Evaluating the model on $\mathcal{D}_{\text{test}}$ can then provide a measure for the out-of-distribution generalisation abilities of the model.
Measuring out-of-distribution generalisation is of particular relevance in the field of molecular property prediction where distributional shifts tend to be large and difficult to handle for machine learning models. Different molecular data sets obtained by distinct pharmaceutical companies and research groups often contain compounds from vastly different areas of chemical space that exhibit high structural heterogeneity. An elegant solution for the modelling of such distributional shifts in chemical space is given by the idea of scaffold splitting.
The notion of a (two-dimensional) molecular scaffold is described in the article by Bemis and Murcko [2]. A molecular scaffold reduces the chemical structure of a compound to its core components, essentially by removing all side chains and only keeping ring systems and parts which link together ring systems. An additional option for making molecular scaffolds even more general is to “forget” the identities of the bonds and atoms by replacing all atoms with carbons and all bonds with single bonds.
Bemis-Murcko scaffolds can be automatically generated in RDKit via the following Python code:
```python
# how to extract the Bemis-Murcko scaffold of a molecular compound via RDKit

# import packages
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold
from IPython.display import display  # available by default inside a Jupyter notebook

# define compound via its SMILES string
smiles = "CN1CCCCC1CCN2C3=CC=CC=C3SC4=C2C=C(C=C4)SC"

# convert SMILES string to RDKit mol object
mol = Chem.MolFromSmiles(smiles)

# create RDKit mol object corresponding to Bemis-Murcko scaffold of original compound
mol_scaffold = MurckoScaffold.GetScaffoldForMol(mol)

# make the scaffold generic by replacing all atoms with carbons and all bonds with single bonds
mol_scaffold_generic = MurckoScaffold.MakeScaffoldGeneric(mol_scaffold)

# convert the generic scaffold mol object back to a SMILES string
smiles_scaffold_generic = Chem.CanonSmiles(Chem.MolToSmiles(mol_scaffold_generic))

# display compound and its generic Bemis-Murcko scaffold
display(mol)
print(smiles)
display(mol_scaffold_generic)
print(smiles_scaffold_generic)
```

If we now have a molecular data set $\mathcal{D}$, we can map each compound in $\mathcal{D}$ to its respective scaffold. Let us assume that a total number of $m$ pairwise distinct scaffolds appear in $\mathcal{D}$ and that these scaffolds are numbered consecutively from $1$ to $m$. We can then define an equivalence relation on $\mathcal{D}$ by calling two compounds equivalent if they share the same scaffold. The associated equivalence classes consist of compound sets $S_1, \dots, S_m$, whereby a given set $S_i$ contains all compounds in $\mathcal{D}$ which share the $i$-th scaffold. It is not hard to see that the sets $S_1, \dots, S_m$ form a partition of the original data set $\mathcal{D}$. Without loss of generality, we assume that the equivalence classes $S_1, \dots, S_m$ are ordered by size in descending order, i.e. we assume that $S_1$ contains at least as many molecules as $S_2$, and so on.
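Assuming the data set $\mathcal{D}$ is given as a list of SMILES strings, this partition into scaffold-based equivalence classes could be sketched as follows; the helper functions generic_scaffold and scaffold_classes are illustrative names and not part of RDKit:

```python
# illustrative sketch: partition a list of SMILES strings into the equivalence classes
# S_1, ..., S_m of compounds that share the same generic Bemis-Murcko scaffold
from collections import defaultdict
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def generic_scaffold(smiles: str) -> str:
    # map a compound to the canonical SMILES of its generic Bemis-Murcko scaffold
    mol = Chem.MolFromSmiles(smiles)
    scaffold = MurckoScaffold.GetScaffoldForMol(mol)
    scaffold_generic = MurckoScaffold.MakeScaffoldGeneric(scaffold)
    return Chem.CanonSmiles(Chem.MolToSmiles(scaffold_generic))

def scaffold_classes(smiles_list):
    # group compounds by their generic scaffold and return the classes S_1, ..., S_m
    # ordered by size in descending order (largest class first)
    classes = defaultdict(list)
    for smiles in smiles_list:
        classes[generic_scaffold(smiles)].append(smiles)
    return sorted(classes.values(), key=len, reverse=True)
```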
One appropriate way to now produce a scaffold split of the molecular data set $\mathcal{D}$ into a training set $\mathcal{D}_{\text{train}}$ and a test set $\mathcal{D}_{\text{test}}$ for machine learning is to define $\mathcal{D}_{\text{train}} = S_1 \cup \dots \cup S_k$ as the union of the first (larger) sets and $\mathcal{D}_{\text{test}} = S_{k+1} \cup \dots \cup S_m$ as the union of the last (smaller) sets. Here $k$ is a custom index parameter which can be used to control the respective sizes of $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{test}}$; frequently $k$ is chosen such that $\mathcal{D}_{\text{train}}$ contains approximately $80\%$ of the examples in $\mathcal{D}$.
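Continuing the sketch above, this construction could look roughly as follows; it builds on the illustrative scaffold_classes helper from the previous snippet, and the greedy 80% threshold (which determines $k$ implicitly) is simply an assumption reflecting the common convention just mentioned:

```python
# illustrative sketch: scaffold split assembled from the size-ordered classes S_1, ..., S_m
def scaffold_split(smiles_list, train_fraction=0.8):
    classes = scaffold_classes(smiles_list)  # S_1, ..., S_m, largest first
    train_set, test_set = [], []
    for compound_class in classes:
        # keep filling the training set with the larger classes until it holds roughly
        # train_fraction of all compounds, then divert the remaining (smaller) classes
        # to the test set
        if len(train_set) < train_fraction * len(smiles_list):
            train_set.extend(compound_class)
        else:
            test_set.extend(compound_class)
    return train_set, test_set

# usage: train_smiles, test_smiles = scaffold_split(smiles_list, train_fraction=0.8)
```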
While a scaffold split is certainly not perfect, it is already a lot better than a uniform random split at providing a relevant measure of the practical utility of a molecular property prediction model. It mimics a situation where the training set $\mathcal{D}_{\text{train}}$ was sampled from a structurally different area of chemical space than the test set $\mathcal{D}_{\text{test}}$. This creates a distributional shift between $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{test}}$ which is comparable to the distributional shifts commonly observed in real chemical data sets. Evaluating a molecular machine learning model using a scaffold split rather than a uniform random split thus leads to a significantly more realistic estimate of its real-world predictive performance.
References:
[1] Cucker, Felipe, and Steve Smale. “On the mathematical foundations of learning.” Bulletin of the American Mathematical Society 39.1 (2002): 1-49.
[2] Bemis, Guy W., and Mark A. Murcko. “The properties of known drugs. 1. Molecular frameworks.” Journal of Medicinal Chemistry 39.15 (1996): 2887-2893.