Effect of Debiasing Protein-Ligand Binding Data on Generalization

Virtual screening is a computational technique used in drug discovery to search libraries of small molecules for structures that bind tightly and specifically to a given protein target. Many machine learning (ML) models have been proposed for virtual screening; however, it is not clear whether these models truly predict molecular properties accurately across chemical space or simply overfit the training data. Because chemical space contains clusters of molecules around common scaffolds, memorising the properties of a few scaffolds can be enough to score well, masking the fact that a model may not generalise beyond close analogues. Several debiasing algorithms have been introduced to address this problem; they systematically partition the data to reduce bias and provide a more accurate measure of model performance.
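To make the scaffold-clustering concern concrete, the sketch below shows one simple way to partition data so that no scaffold appears in both the training and test sets. The function names and the idea of passing a precomputed scaffold identifier (e.g. a Bemis-Murcko scaffold SMILES) are illustrative assumptions, not the specific procedure used in [1]:

```python
import random
from collections import defaultdict

def scaffold_split(molecules, scaffold_of, test_fraction=0.2, seed=0):
    """Split molecules so that no scaffold appears in both sets.

    `molecules` is any sequence; `scaffold_of` maps a molecule to a
    hashable scaffold identifier (e.g. a Bemis-Murcko scaffold SMILES
    computed elsewhere -- this sketch does not depend on a chemistry
    toolkit).
    """
    # Group molecules by their scaffold so each cluster stays whole.
    groups = defaultdict(list)
    for mol in molecules:
        groups[scaffold_of(mol)].append(mol)

    clusters = list(groups.values())
    random.Random(seed).shuffle(clusters)

    # Fill the test set cluster by cluster until the target size is met;
    # remaining clusters go to the training set.
    test, train = [], []
    target = test_fraction * len(molecules)
    for cluster in clusters:
        (test if len(test) < target else train).extend(cluster)
    return train, test
```

Under a scaffold-disjoint split like this, a model cannot score well merely by memorising scaffolds seen during training, which is the failure mode a random split can hide.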

Recently, Sundar and Colwell [1] studied the effect of debiasing protein-ligand binding data on model performance. Two debiasing approaches, MUV (maximum unbiased validation) and AVE (asymmetric validation embedding), were considered in the work. A genetic algorithm was used to split the data so that the measured bias is minimised. They also developed a new metric, far-AUC, the AUC on a distant held-out test set, to assess generalisation performance. The distant held-out test set consists of ligands whose distance to every ligand in the training set is at least 0.4 (Jaccard distance on ECFP6 fingerprints with 2048 bits). They trained models on the debiased data and evaluated their performance. They showed that debiasing did not improve performance on the distant held-out test set, and therefore did not systematically improve the ability to generalise (see Fig 3 in [1]). They also compared the standard AUC with debiasing (i.e. AUC on the random held-out validation set) against the far-AUC with debiasing. The standard AUC on the validation set was not representative of the far-AUC (see Fig 5 in [1]), suggesting that it did not correlate with real-world model performance or generalisation ability.
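The distant held-out criterion above can be sketched as a simple fingerprint filter. In practice the fingerprints would be 2048-bit ECFP6 vectors from a chemistry toolkit such as RDKit; here they are represented abstractly as sets of on-bit indices, and the function names are illustrative assumptions rather than code from [1]:

```python
def jaccard_distance(fp_a, fp_b):
    """Jaccard (1 - Tanimoto) distance between two fingerprints,
    each represented as a set of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    inter = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - inter
    return 1.0 - inter / union

def distant_test_set(test_fps, train_fps, min_distance=0.4):
    """Keep only test ligands at least `min_distance` from every
    training ligand (the far-AUC test-set criterion in [1])."""
    return [
        fp for fp in test_fps
        if all(jaccard_distance(fp, t) >= min_distance for t in train_fps)
    ]
```

For example, a test ligand sharing four of five on-bits with a training ligand has Jaccard distance 1 - 4/5 = 0.2 and would be excluded, while a ligand with no shared bits (distance 1.0) would be kept.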

In summary, dataset bias is an important issue in computational chemistry. Current debiasing techniques do not necessarily improve model performance on distant held-out test sets, and the performance measured after debiasing is not representative of the model's ability to generalise. A better framework is required to detect and remove bias in chemical data.

[1] Vikram Sundar and Lucy Colwell. The Effect of Debiasing Protein-Ligand Binding Data on Generalization. J. Chem. Inf. Model. (2020) 60, 56-62.
