Machine learning (ML) has significantly advanced key computational tasks in drug discovery, including virtual screening, binding affinity prediction, protein-ligand structure prediction (co-folding), and docking. However, the extent to which these models generalise beyond their training data is often overestimated due to shortcomings in benchmarking datasets. Existing benchmarks frequently fail to account for similarities between the training and test sets, leading to inflated performance estimates. This issue is particularly pronounced in tasks where models tend to memorise training examples rather than learning generalisable biophysical principles. The figure below demonstrates two examples of model performance decreasing with increased dissimilarity between training and test data, for co-folding (left) and binding affinity prediction (right).
![](https://i0.wp.com/i.postimg.cc/cH26XWyS/Screenshot-2025-02-11-at-16-32-43.png?w=625&ssl=1)
To address this, more rigorous methods for evaluating ML models are required. One solution is to use a time-based splitting procedure that separates training and test cases by their date of publication. However, given that new drugs are often developed for existing targets and/or using existing scaffolds, this approach does not rule out overlap between training and test sets. Examples of time-based splits for structure-based data include the PoseBusters set and the new Runs N’ Poses benchmark. Time-based splitting is practical because it does not require retraining existing models, and it is particularly useful when the test set is stratified by similarity to the training data, as in the Runs N’ Poses paper.
Alternatively, when retraining is viable, constructing an out-of-distribution (OOD) test set, similar to the OOD Test introduced for binding affinity prediction, offers a more robust means of assessing generalisability. In this post, we discuss how to construct such maximally separated test sets.
## Constructing a Maximally Separated Benchmark Set
To construct a robust OOD test set, we seek to minimise the maximum similarity between training and test sets across multiple dimensions:
- Ligand Similarity: e.g., Morgan fingerprint Tanimoto similarity or SuCOS score (see the sketch after this list).
- Protein Similarity: e.g., sequence identity.
- Binding Pocket Similarity: e.g., SuCOS-pocket scores.
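As an illustration of the first of these metrics, here is a minimal sketch of how a pairwise ligand similarity matrix might be computed with RDKit Morgan fingerprints. The SMILES strings are placeholders, and the protein and pocket similarity matrices would be built analogously with your tool of choice (e.g. MMseqs2 for sequence identity).

```python
# Minimal sketch (not the exact pipeline used in the papers above): pairwise
# ligand similarity from Morgan fingerprints with RDKit.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]  # placeholder ligands
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius=2, nBits=2048) for m in mols]

# Fill the symmetric ligand similarity matrix row by row.
n = len(fps)
ligand_sim = np.zeros((n, n))
for i, fp in enumerate(fps):
    ligand_sim[i] = DataStructs.BulkTanimotoSimilarity(fp, fps)
```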
![](https://i0.wp.com/i.postimg.cc/Fzx51bxB/Screenshot-2025-02-11-at-16-33-04.png?w=625&ssl=1)
With single-linkage clustering, two clusters are merged whenever any pair of their members exceeds the similarity threshold, so no complex in a held-out cluster can resemble a training complex above that threshold. Once you have performed the single-linkage clustering, you can randomly select whole clusters into your test set and leave the rest for training. The clustering itself is straightforward with AgglomerativeClustering from scikit-learn, as sketched below.
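To make this concrete, here is a minimal sketch of such a split, assuming you have already computed ligand_sim, protein_sim and pocket_sim similarity matrices over the same set of complexes. The elementwise-maximum combination, the 0.3 similarity threshold and the 10% test fraction are illustrative choices rather than anything prescribed above.

```python
# Minimal sketch of a single-linkage train/test split. Assumes ligand_sim,
# protein_sim and pocket_sim are (n, n) similarity matrices over the same
# complexes; the combination rule and thresholds below are illustrative.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(42)

# Two complexes are "close" if they are similar in ANY dimension, so take the
# elementwise maximum of the similarities and convert it to a distance.
combined_sim = np.maximum.reduce([ligand_sim, protein_sim, pocket_sim])
distance = 1.0 - combined_sim

# Single linkage merges clusters whenever any cross-pair is closer than the
# threshold, so any two complexes with combined similarity > 0.3 end up in
# the same cluster. (In scikit-learn < 1.2, use affinity= instead of metric=.)
clustering = AgglomerativeClustering(
    n_clusters=None,
    metric="precomputed",
    linkage="single",
    distance_threshold=0.7,
)
labels = clustering.fit_predict(distance)

# Randomly add whole clusters to the test set until it holds ~10% of the data.
test_mask = np.zeros(len(labels), dtype=bool)
for cluster_id in rng.permutation(np.unique(labels)):
    test_mask |= labels == cluster_id
    if test_mask.mean() >= 0.1:
        break

train_idx, test_idx = np.where(~test_mask)[0], np.where(test_mask)[0]
```

Because single linkage uses the minimum pairwise distance between clusters, raising distance_threshold lowers the maximum similarity allowed between train and test complexes (at the cost of fewer, larger clusters), while lowering it gives a looser but more balanced split.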