Category Archives: Machine Learning

Covariate Shift in Virtual Screening

In supervised learning, we assume that the training and testing data are drawn from the same distribution, i.e. P_{train}(x,y) = P_{test}(x,y). However, this assumption is often violated in virtual screening. For example, a chemist initially focuses on one series of compounds, and information from this series is used to train a model. If the chemist later shifts focus to a new, structurally distinct series, we would not expect the model to accurately predict the labels of compounds in this new series. Here, we introduce some methods to address this problem.

Methods such as Kernel Mean Matching (KMM) and the Kullback-Leibler Importance Estimation Procedure (KLIEP) have been proposed. These methods typically assume that the concept remains unchanged and only the input distribution changes, i.e. P_{train}(y|x) = P_{test}(y|x) and P_{train}(x) \neq P_{test}(x). In general, these methods reweight instances in the training data so that the distribution of training instances is more closely aligned with the distribution of instances in the testing set. The appropriate importance weighting factor w(x) for each instance x in the training set is:

w(x) = \frac{p_{test}(x)}{p_{train}(x)}

where p_{train}(x) is the training set density and p_{test}(x) is the testing set density. Note that only the feature vector values (not their labels) are used in reweighting. The major difference between KMM and KLIEP is the objective function: KLIEP is based on minimising the Kullback-Leibler divergence, while KMM is based on minimising the Maximum Mean Discrepancy (MMD). For more detail, please see the references below.
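
As a concrete illustration, here is a minimal sketch of KMM in Python, assuming an RBF kernel and solving the resulting quadratic programme with scipy's SLSQP optimiser. The variable names (X_train, X_test, sigma, B, eps) and the toy data at the end are illustrative choices, not anything taken from the references.

# Minimal Kernel Mean Matching (KMM) sketch: estimate w(x) = p_test(x)/p_train(x)
# for each training instance by matching the kernel mean of the reweighted
# training set to that of the test set (i.e. minimising the MMD).
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(A, B_mat, sigma=1.0):
    # Gaussian (RBF) kernel matrix between rows of A and rows of B_mat
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B_mat**2, axis=1)[None, :]
                - 2.0 * A @ B_mat.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def kmm_weights(X_train, X_test, sigma=1.0, B=10.0, eps=None):
    n_tr, n_te = len(X_train), len(X_test)
    if eps is None:
        eps = B / np.sqrt(n_tr)

    K = rbf_kernel(X_train, X_train, sigma)                              # n_tr x n_tr
    kappa = (n_tr / n_te) * rbf_kernel(X_train, X_test, sigma).sum(axis=1)

    # Objective: 0.5 * w^T K w - kappa^T w  (the MMD, up to constants)
    fun = lambda w: 0.5 * w @ K @ w - kappa @ w
    grad = lambda w: K @ w - kappa

    # Weights bounded in [0, B] and averaging to roughly 1
    cons = ({'type': 'ineq', 'fun': lambda w: n_tr * (1 + eps) - w.sum()},
            {'type': 'ineq', 'fun': lambda w: w.sum() - n_tr * (1 - eps)})
    bounds = [(0.0, B)] * n_tr

    res = minimize(fun, x0=np.ones(n_tr), jac=grad,
                   bounds=bounds, constraints=cons, method='SLSQP')
    return res.x  # one importance weight per training instance

# Toy example: the training density is shifted away from the test density
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(100, 1))
X_test = rng.normal(loc=1.0, scale=1.0, size=(100, 1))
w = kmm_weights(X_train, X_test, sigma=1.0)
print(w[:5])

Training instances that resemble the test set receive weights above one, while instances far from the test distribution are down-weighted towards zero; the resulting weights can be passed to any learner that accepts per-instance weights.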

Reference:

  1. Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, Motoaki Kawanabe: Direct Importance Estimation for Covariate Shift Adaptation. Annals of the Institute of Statistical Mathematics, 2008.
  2. Jiayuan Huang, Alex Smola, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf: Correcting Sample Selection Bias by Unlabeled Data. NIPS 2006.
  3. Georgia McGaughey, W. Patrick Walters, Brian Goldman: Understanding Covariate Shift in Model Performance. F1000Research, 2016.

 

Neuronal Complexity: A Little Goes a Long Way…To Clean My Apartment

The classical model of signal integration in both biological and artificial neural networks looks something like this,

f(\mathbf{s})=g\left(\sum\limits_i\alpha_is_i\right)

where g is some linear or non-linear output function and the weights \alpha_i adapt to feedback from the outside world through changes to protein-dense structures near the point of signal input, namely the Post-Synaptic Density (PSD). In this simple model, integration is implied to occur at the soma (cell body), where the input signals s_i are combined and broadcast to other neurons through downstream synapses via the axon. Generally speaking, neurons (both artificial and otherwise) exist in multilayer networks, composing the inputs of one neuron with the outputs of others and creating cross-linked chains of computation that have been shown to be universal in their ability to approximate any desired input-output behaviour.

See more at Khan Academy
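
To make the model concrete, here is a minimal sketch of this "point neuron" in Python; the sigmoid choice of g and the example inputs and weights are illustrative assumptions rather than anything specific from the text.

# Point-neuron model: all inputs are summed at a single locus (the soma)
# and passed through one output non-linearity g.
import numpy as np

def point_neuron(s, alpha, g=lambda u: 1.0 / (1.0 + np.exp(-u))):
    # f(s) = g(sum_i alpha_i * s_i)
    return g(np.dot(alpha, s))

s = np.array([0.2, 1.0, -0.5])      # input signals s_i (one per synapse)
alpha = np.array([0.8, 0.3, 1.5])   # synaptic weights alpha_i (the PSD's "knob")
print(point_neuron(s, alpha))       # a single scalar output, broadcast via the axon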

Models of learning and memory have relied heavily on modifications to the PSD to explain modifications in behaviour. Physically, these changes result from alterations in the concentration and density of the neurotransmitter receptors and ion channels that occur in abundance at the PSD; in actuality, though, these channels occur all along the membrane of the dendrite on which the PSD is located. Dendrites are something like a multi-branched mass of input connections belonging to each neuron. This raises the question of whether learning might in fact occur all along the length of each densely branched dendritic tree.
