Machine learning (ML) has significantly advanced key computational tasks in drug discovery, including virtual screening, binding affinity prediction, protein-ligand structure prediction (co-folding), and docking. However, the extent to which these models generalise beyond their training data is often overestimated because of shortcomings in benchmarking datasets. Existing benchmarks frequently fail to account for similarity between the training and test sets, which inflates performance estimates. The problem is particularly pronounced in tasks where models tend to memorise training examples rather than learn generalisable biophysical principles. The figure below shows two examples of model performance decreasing as the test data become less similar to the training data, for co-folding (left) and binding affinity prediction (right).
