Both the beauty and the downfall of learning-based methods lie in the fact that the data used for training largely determines the quality of any model or system.
While there have been numerous algorithmic advances in recent years, the most successful applications of machine learning have been in areas where either (i) you can generate your own data in a fully understood environment (e.g. AlphaGo/AlphaZero), or (ii) data is so abundant that you’re essentially training on “everything” (e.g. GPT2/3, CNNs trained on ImageNet).
This covers only a narrow range of applications; most datasets do not fall into either category. Unfortunately, when that is the case (and even sometimes when you are in one of those rare situations), your data is almost certainly biased – you just may or may not know it.
This can have drastic consequences for any model you train using such data. In the world of structure-based scoring functions, this has recently been reported in three separate publications (here, here, and here).
There are two clear strategies for overcoming such issues: (i) fix/remove such biases from the data, or (ii) develop algorithms that can learn despite the presence of such biases.
My interest in this topic continues to grow, and in OPIG we are actively working on both approaches. We are currently preparing a manuscript on work presented at ISMB 2020 that adopts strategy (i), while a recent publication from the group (link) is an example of strategy (ii) that employs data augmentation.
One of my favourite talks from the recent ISMB 2020 (virtual) conference was a presentation from Ayse Dincer of the University of Washington (bioRxiv link):
Adversarial Deconfounding Autoencoder for Learning Robust Gene Expression Embeddings
Ayse B. Dincer, Joseph D. Janizek, Su-In Lee
bioRxiv 2020.04.28.065052; doi: https://doi.org/10.1101/2020.04.28.065052
In their work, they presented an autoencoder designed to learn representations of gene expression data that do not capture “confounders” or biases (Fig. 1). These confounders can range from technical artifacts (e.g. batch effects) to uninteresting biological variables (e.g. age) or simply random noise.
This is achieved using an auxiliary neural network that is trained to predict the value of the “confounding” variable from the autoencoder’s latent representation (Fig. 2). The autoencoder, in turn, is trained to produce a latent representation from which the input expression data can be reconstructed, but from which the auxiliary network cannot predict the confounding variable.
This is an interesting approach with seemingly broad applicability, as long as the confounder or bias is known and quantifiable (either with a class label or specific value).
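To make the idea concrete, here is a minimal PyTorch sketch of this kind of adversarial deconfounding. This is not the authors’ implementation: the layer sizes, learning rates and the weighting term lam are placeholders, and “fooling” the adversary is approximated by simply subtracting its loss from the autoencoder objective (other formulations, e.g. gradient reversal, are possible).

```python
# Minimal sketch of adversarial deconfounding (illustrative only, not the
# authors' code): an autoencoder reconstructs expression profiles while an
# adversary tries to predict the confounder from the latent code.
import torch
import torch.nn as nn

n_genes, latent_dim, n_confounder_classes = 1000, 50, 4  # assumed sizes

encoder = nn.Sequential(nn.Linear(n_genes, 256), nn.ReLU(), nn.Linear(256, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, n_genes))
adversary = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_confounder_classes))

opt_ae = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-3)
recon_loss = nn.MSELoss()
adv_loss = nn.CrossEntropyLoss()
lam = 1.0  # weight of the deconfounding term (assumed hyperparameter)

def training_step(x, confounder_labels):
    # 1) Update the adversary to predict the confounder from the (detached) latent code.
    z = encoder(x).detach()
    opt_adv.zero_grad()
    loss_adv = adv_loss(adversary(z), confounder_labels)
    loss_adv.backward()
    opt_adv.step()

    # 2) Update the autoencoder to reconstruct x while *fooling* the adversary
    #    (maximising its loss), so the latent code carries little confounder signal.
    opt_ae.zero_grad()
    z = encoder(x)
    loss_ae = recon_loss(decoder(z), x) - lam * adv_loss(adversary(z), confounder_labels)
    loss_ae.backward()
    opt_ae.step()
    return loss_ae.item(), loss_adv.item()

# Toy usage with random data (batch of 32 samples):
x = torch.randn(32, n_genes)
c = torch.randint(0, n_confounder_classes, (32,))
print(training_step(x, c))
```

In practice the two updates are alternated over many epochs, so the adversary keeps adapting while the encoder keeps removing whatever confounder signal the adversary can still find.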
A similar approach is explored by Kim and colleagues in the realm of computer vision (link):
Learning Not to Learn: Training Deep Neural Networks with Biased Data
Byungju Kim, Hyunwoo Kim, Kyungsu Kim, Sungjin Kim, Junmo Kim
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9012-9020
Their model is trained such that the features produced by their convolutional neural network (labelled f in Fig. 3) cannot be used to predict the known bias (network h in Fig. 3), but can be used to label the image (network g).
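The adversarial part of this setup can be sketched with a gradient reversal layer, as below. Again, this is an illustrative PyTorch sketch rather than the authors’ code (their full objective is more involved), and the network sizes, input shape and number of task/bias classes are made up.

```python
# Sketch of adversarially removing a known bias from learned features:
# f extracts features, g predicts the task label, h predicts the bias, and a
# gradient reversal layer pushes f away from bias-predictive features.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the feature extractor.
        return -ctx.lam * grad_output, None

f = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU())  # feature extractor f
g = nn.Linear(128, 10)   # task classifier g (e.g. the image label)
h = nn.Linear(128, 10)   # bias predictor h (e.g. a known nuisance attribute)

opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()) + list(h.parameters()), lr=1e-3)
ce = nn.CrossEntropyLoss()

def training_step(x, y_task, y_bias, lam=1.0):
    opt.zero_grad()
    feats = f(x)
    # g must succeed on the task; h is trained to read the bias from feats, but the
    # reversed gradient drives f towards features from which the bias cannot be read.
    loss = ce(g(feats), y_task) + ce(h(GradReverse.apply(feats, lam)), y_bias)
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with random "images":
x = torch.randn(32, 1, 28, 28)
print(training_step(x, torch.randint(0, 10, (32,)), torch.randint(0, 10, (32,))))
```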
Both approaches are promising and much-needed advances. While there is clearly much more work to be done (for example, these methods require the bias to be known a priori, which often isn’t the case), we will be much better off by both acknowledging our data is biased and trying to do something about it!