Now that machine learning has managed to get its proverbial fingers into just about every pie, people have started to worry about the generalisability of the methods being used. There are a few reasons for these concerns, but a prominent one is that the pressures and publication biases that have led to reproducibility issues in the past are also very much present in ML-based science.
The Center for Statistics and Machine Learning at Princeton University hosted a workshop last July on the scale of this problem. Alongside it, they released a running list of papers documenting reproducibility issues in ML-based science. The list currently includes 20 such papers, covering errors in 17 distinct fields and collectively affecting a whopping 329 papers.
Unsurprisingly, many of the ML methods proposed for small molecule drug discovery also suffer from data leakage, resulting in overoptimistic reported performance. Models can perform well on their respective test sets, but fail spectacularly when asked to make predictions on a novel target that is dissimilar to anything they have seen before. Similarity clustering and leave-cluster-out cross-validation experiments have been used in the past to assess the generalisability of structure-based scoring functions. In that work, entries in the PDBbind dataset were grouped into clusters based on protein sequence similarity, and the scoring function being assessed, RF-Score, saw a significant drop in affinity prediction performance when tested this way. Since then, similar generalisation assessments have been carried out on a number of other ML-based scoring functions.
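For context, the leave-cluster-out recipe itself is quite simple: cluster the proteins, hold out one cluster at a time, train on everything else and see how much the predictions suffer. Here is a minimal sketch of that idea, using scikit-learn's LeaveOneGroupOut, a random forest as a rough stand-in for an RF-Score-style model, and hypothetical pre-computed sequence-similarity clusters; none of this is the original authors' code.

```python
# Minimal sketch of leave-cluster-out cross-validation for an affinity model.
# Assumes you already have a feature matrix X, binding affinities y, and one
# cluster label per complex (e.g. from clustering PDBbind proteins by sequence
# similarity with a tool like CD-HIT or MMseqs2) -- all hypothetical here.
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut

def leave_cluster_out_cv(X, y, cluster_labels):
    """Train on all clusters but one, test on the held-out cluster."""
    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=cluster_labels):
        model = RandomForestRegressor(n_estimators=200, random_state=0)
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict(X[test_idx])
        r, _ = pearsonr(y[test_idx], preds)  # Pearson r, as commonly reported
        scores.append(r)
    return np.array(scores)

# Toy data standing in for real features: 100 complexes, 10 features, 5 clusters.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.normal(size=100)
clusters = rng.integers(0, 5, size=100)
print(leave_cluster_out_cv(X, y, clusters).mean())
```

The point of holding out whole clusters rather than random rows is that the test proteins are (supposedly) unlike anything in the training set, which is exactly the situation a scoring function faces on a genuinely novel target.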
Being vaguely aware of this approach and the problem it aims to solve (put it this way: a game of ‘drink whenever somebody mentions dataset leakage’ would quickly get you in big trouble in any of our group meetings), I was interested to read this paper recently published in JCIM. The authors report that sequence similarity-based clustering is actually quite unreliable for cross-validation, especially when it comes to structurally similar proteins with low sequence similarity. Instead, they suggest using the protein families database (Pfam) to perform the clustering. They benchmarked a total of 12 scoring functions with Pfam-based cross-validation, sequence-based cross-validation and a randomly clustered baseline, and found that for most scoring functions the largest drop from their originally reported performance was seen with the Pfam-based splits. The paper concludes with a plea to researchers in the area to consider cross-validating their scoring functions with a non-sequence-based approach, but more importantly, to take care when developing their ML-based methods and to be more conscious of the biases these may be learning.
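To make the comparison a bit more concrete, here is a rough sketch of how the same model could be scored under a random split, a sequence-cluster split and a Pfam-family split. The cluster and Pfam labels are hypothetical stand-ins rather than the annotations used in the paper, and the toy data is obviously no substitute for PDBbind.

```python
# Sketch comparing random, sequence-based and Pfam-based grouped splits for the
# same model. In practice you would pull Pfam annotations for each PDBbind
# protein (e.g. via SIFTS/InterPro) and compute sequence clusters yourself;
# here both label arrays are random placeholders.
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, KFold

def grouped_cv_score(X, y, groups=None, n_splits=5):
    """Mean Pearson r over folds; grouped folds if labels are supplied."""
    splitter = GroupKFold(n_splits) if groups is not None else KFold(n_splits, shuffle=True, random_state=0)
    rs = []
    for tr, te in splitter.split(X, y, groups):
        model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[tr], y[tr])
        rs.append(pearsonr(y[te], model.predict(X[te]))[0])
    return np.mean(rs)

# Toy data standing in for PDBbind features, affinities and annotations.
rng = np.random.default_rng(1)
X, y = rng.normal(size=(200, 16)), rng.normal(size=200)
seq_clusters = rng.integers(0, 20, size=200)   # hypothetical sequence clusters
pfam_families = rng.integers(0, 12, size=200)  # hypothetical Pfam family labels

for name, g in [("random", None), ("sequence", seq_clusters), ("Pfam", pfam_families)]:
    print(f"{name:>8} split: mean Pearson r = {grouped_cv_score(X, y, g):.2f}")
```

The interesting number, of course, is how far the grouped splits fall below the random baseline once real features and annotations are plugged in; that gap is exactly the overoptimism the authors are warning about.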
Overall, as the researchers at Princeton pointed out, generalisability problems in machine learning should be the expected state of affairs until best practices become better established. Until then, keeping a keen eye out for performances that seem suspiciously optimistic and taking care when building our own models will have to do!