
Covariate Shift in Virtual Screening

In supervised learning, we assume that the training and testing data are drawn from the same distribution, i.e. P_{train}(x,y) = P_{test}(x,y). However, this assumption is often violated in virtual screening. For example, a chemist initially focuses on one series of compounds, and information from this series is used to train a model. If the chemist later shifts their focus to a new, structurally distinct series, we would not expect the model to accurately predict the labels in the testing set. Here, we introduce some methods to address this problem.

Methods such as Kernel Mean Matching (KMM) and the Kullback-Leibler Importance Estimation Procedure (KLIEP) have been proposed to address this problem. These methods typically assume that the concept remains unchanged and only the feature distribution changes, i.e. P_{train}(y|x) = P_{test}(y|x) and P_{train}(x) \neq P_{test}(x). In general, these methods reweight instances in the training data so that the distribution of training instances is more closely aligned with the distribution of instances in the testing set. The appropriate importance weight w(x) for each instance x in the training set is:

w(x) = \frac{p_{test}(x)}{p_{train}(x)}

where p_{train}(x) is the training set density and p_{test}(x) is the testing set density. Note that only the feature vectors (not their labels) are used in reweighting. The major difference between KMM and KLIEP is the objective function: KLIEP is based on minimisation of the Kullback-Leibler divergence, while KMM is based on minimisation of the Maximum Mean Discrepancy (MMD). For more details, please see the references below.
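To make the reweighting step concrete, here is a minimal sketch of KMM-style weight estimation. It assumes an RBF kernel and follows the quadratic program described in reference 2, but solves it with scipy's general-purpose SLSQP optimiser rather than a dedicated QP solver; the function name and the default values of gamma, B, and eps are our own illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics.pairwise import rbf_kernel

def kmm_weights(X_train, X_test, gamma=1.0, B=10.0, eps=0.01):
    """Estimate w(x) ~ p_test(x) / p_train(x) for each training instance
    by matching the kernel mean of the reweighted training set to that
    of the testing set."""
    n_tr, n_te = len(X_train), len(X_test)

    # K_ij = k(x_i, x_j) over training instances;
    # kappa_i = (n_tr / n_te) * sum_j k(x_i, x'_j) over testing instances.
    K = rbf_kernel(X_train, X_train, gamma=gamma)
    kappa = (n_tr / n_te) * rbf_kernel(X_train, X_test, gamma=gamma).sum(axis=1)

    # Minimising 0.5 w'Kw - kappa'w is equivalent to minimising the MMD
    # between the reweighted training sample and the testing sample.
    def objective(w):
        return 0.5 * w @ K @ w - kappa @ w

    # Constrain each weight to [0, B] and the mean weight to within eps of 1.
    constraints = [
        {"type": "ineq", "fun": lambda w: eps - (w.mean() - 1.0)},
        {"type": "ineq", "fun": lambda w: eps + (w.mean() - 1.0)},
    ]
    result = minimize(objective, x0=np.ones(n_tr), method="SLSQP",
                      bounds=[(0.0, B)] * n_tr, constraints=constraints)
    return result.x
```

The resulting weights can then be passed to any learner that accepts per-instance weights, for example model.fit(X_train, y_train, sample_weight=w) in scikit-learn.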

References:

  1. Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P., Kawanabe, M.: Direct Importance Estimation for Covariate Shift Adaptation. Annals of the Institute of Statistical Mathematics, 2008.
  2. Huang, J., Smola, A., Gretton, A., Borgwardt, K., Schölkopf, B.: Correcting Sample Selection Bias by Unlabeled Data. NIPS 2006.
  3. McGaughey, G., Walters, W.P., Goldman, B.: Understanding covariate shift in model performance. F1000Research, 2016.


Seventh Joint Sheffield Conference on Cheminformatics Part 1 (#ShefChem16)

In early July I attended the Seventh Joint Sheffield Conference on Cheminformatics. There were a variety of talks from speakers at all stages of their careers. I was lucky enough to be invited to speak at the conference, and gave my first conference talk! I have written two blog posts about the conference: part 1 briefly describes a talk that I found interesting, and part 2 describes the work I spoke about at the conference.

One of the most interesting parts of the conference was the active Twitter presence under the hashtag #ShefChem16. All of the talks were live tweeted, which provided a summary of each talk and included links to software and references. It also allowed speakers to gain instant insight and feedback on their talks.

One of the talks I found most interesting presented the Protein-Ligand Interaction Profiler (PLIP), a method for detecting protein-ligand interactions. PLIP is open-source and is available both as a web-based tool and as a command-line tool. Unlike PyMOL, which only calculates polar contacts and does not classify the type of interaction, PLIP calculates eight different types of interactions: hydrogen bonds, hydrophobic contacts, π-π stacking, π-cation interactions, salt bridges, water bridges, halogen bonds, and metal complexes. For a given PDB file, the interactions are calculated and presented in a publication-quality figure, shown below.

[Figure: PLIP interaction diagram for a protein-ligand complex]

The output can also be downloaded as a PyMOL session so that the visualisation can be modified.

PLIP is an extremely useful tool for identifying and classifying the interactions formed in a protein-ligand complex.
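PLIP can also be used as a Python library, which is convenient for profiling many complexes in a pipeline. The sketch below follows the usage pattern given in PLIP's documentation; module paths and attribute names have changed between PLIP releases, so treat the specifics here as assumptions and check the documentation for the version you install.

```python
# A minimal sketch of programmatic PLIP usage; the import path below follows
# current PLIP documentation and may differ in older releases.
from plip.structure.preparation import PDBComplex

complex_structure = PDBComplex()
complex_structure.load_pdb('/path/to/structure.pdb')  # hypothetical input path
complex_structure.analyze()  # detect interactions for every binding site

# interaction_sets maps binding-site identifiers (HetID:Chain:Position)
# to the interactions detected for that site.
for bsid, site in complex_structure.interaction_sets.items():
    # e.g. hydrogen bonds with the protein as donor vs. the ligand as donor
    print(bsid, len(site.hbonds_pdon), len(site.hbonds_ldon))
```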

PLIP can be found here: https://projects.biotec.tu-dresden.de/plip-web/plip/