Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data

I’m delighted to report our collaboration (Ísak Valsson, Matthew Warren, Aniket Magarkar, Phil Biggin, & Charlotte Deane), on “Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data”, has been published in Nature’s Communications Chemistry (https://doi.org/10.1038/s42004-025-01428-y).

During his MSc dissertation project in the Department of Statistics, University of Oxford, OPIG member Ísak Valsson developed “AEV-PLIG”, an attention-based graph neural network (GNN) that predicts protein-ligand binding affinity. It featurizes a ligand’s atoms using Atomic Environment Vectors (AEVs) to describe the Protein-Ligand Interactions found in a 3D protein-ligand complex. AEV-PLIG is free and open source (BSD 3-Clause), available from GitHub at https://github.com/oxpig/AEV-PLIG, and forked at https://github.com/bigginlab/AEV-PLIG.
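For readers unfamiliar with AEVs: they are ANI-style symmetry functions that summarize each atom’s local chemical environment. Purely as an illustration (not the paper’s implementation — the actual AEV-PLIG featurization, element pairings, and hyperparameters are defined in the repository above), a single radial symmetry-function component can be sketched like this, with the shift and eta values below being arbitrary toy choices:

```python
import math

def cutoff(r, r_c=5.0):
    # Smooth cosine cutoff: 1 at r = 0, falling to 0 at r >= r_c,
    # so atoms beyond the cutoff radius contribute nothing.
    if r >= r_c:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0)

def radial_aev(distances, shifts=(1.0, 2.0, 3.0, 4.0), eta=4.0, r_c=5.0):
    # One radial symmetry-function value per shift R_s:
    #   G_s = sum_j exp(-eta * (r_ij - R_s)^2) * f_c(r_ij)
    # where r_ij are distances from the central atom to its neighbours.
    return [
        sum(math.exp(-eta * (r - r_s) ** 2) * cutoff(r, r_c) for r in distances)
        for r_s in shifts
    ]

# Toy distances (angstroms) from one atom to its neighbours
aev = radial_aev([0.9, 2.1, 3.8, 6.0])
```

Each shift R_s probes a different distance shell around the central atom, and the cutoff guarantees that only the local environment is encoded; the 6.0 Å neighbour above contributes nothing.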

Ísak also developed a protein-ligand binding affinity prediction benchmark that is far more challenging than CASF-2016, called the “Out-Of-Distribution Test” (OOD Test), which is also available (https://github.com/isakvals/OOD-Test). It is designed to assess how well a method generalizes to ligands and proteins more dissimilar than those seen in its training set. AEV-PLIG achieved the best Pearson correlation coefficient on the OOD Test, as well as on another tough benchmark developed in OPIG called “0-Ligand Bias” (https://doi.org/10.1093/bioinformatics/btaf040), proving more accurate than RF-score, Pafnucy, OnionNet-2, PointVS, SIGN, and AEScore (Table 1).
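The OOD Test defines its own similarity measures and thresholds (see the repository above); purely to illustrate the underlying idea of filtering test cases by similarity to the training set, here is a toy sketch, where the `tanimoto` helper over fingerprint bit sets and the 0.4 threshold are my own illustrative choices:

```python
def tanimoto(a, b):
    # Tanimoto similarity between two fingerprints
    # represented as sets of "on" bits.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def ood_split(train_fps, candidate_fps, threshold=0.4):
    # Keep only candidates whose maximum similarity to any
    # training fingerprint falls below the threshold, i.e.
    # candidates that are "out of distribution".
    return [
        i for i, fp in enumerate(candidate_fps)
        if all(tanimoto(fp, t) < threshold for t in train_fps)
    ]

# A candidate identical to a training molecule is excluded;
# a dissimilar one is kept.
kept = ood_split([{1, 2, 3}], [{1, 2, 3}, {7, 8, 9}])
```

Applying the same idea with real chemical fingerprints and protein-level similarity yields held-out sets that probe genuine generalization rather than memorization.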

Matthew Warren showed that augmenting our training data (PDBbind v2020) with semi-synthetic data (BindingNet) boosted the performance of AEScore (https://github.com/RMeli/aescore), and we confirmed the same effect with AEV-PLIG.

Together, we found that with even more data (BindingDB), our augmented AEV-PLIG model’s prediction accuracy starts to approach that of Free Energy Perturbation (FEP+) for congeneric series of ligands that bind the same protein (see Figures 3 & 4 in our paper), yet it is ~400,000 times faster and uses a single GPU instead of several. We also showed that AEV-PLIG performed better on our FEP benchmark than the other ML-based scoring functions we evaluated (Table 1).

Another take-home: the performance of AEV-PLIG steadily improved as we increased the fraction of augmented training data, with no sign of leveling off (Figure S5). Whether this putative ‘scaling law’ continues to hold for protein-ligand binding affinity prediction remains to be seen, but the notion of a simpler model architecture with more physically relevant features, combined with more data, shows great promise…
