Category Archives: Uncategorized

Clustering Algorithms

Clustering is a task of organizing data into groups (called clusters), such that members of each group are more similar to each other than to members of other groups. This is a brief description of three popular clustering algorithms – K-Means, UPGMA and DBSCAN.

Cluster analysis

K-Means is arguably the simplest and most popular clustering algorithm. It takes one parameter – the expected number of clusters k. At the initialization step a set of k means m₁, m₂,…, m_kis generated (hence the name). At each iteration step the objects in the data are assigned to the cluster whose mean yields the least within-cluster sum of squares. After the assignments, the means are updated to be centroids of the new clusters. The procedure is repeated until convergence (the convergence occurs when the means no longer change between iterations).

The main strength of K-means is that it’s simple and easy to implement. The largest drawback of the algorithm is that one needs to know in advance how many clusters the data contains. Another problem is that with wrong initialization the algorithm can easily converge to a local minimum, which may result in suboptimal partitioning of data.

UPGMA is a simple hierarchical clustering method, where the distance between two clusters is taken to be the average of distances between individual objects in the clusters. In each step the closest clusters are combined, until all objects are in clusters where the average distance between objects is lower than a specified cut-off.

The UPGMA algorithm is often used for construction of phenetic trees. The major issue with the algorithm is that the tree it constructs is ultrametric, which means that the distance from root to any leaf is the same. In context of evolution, this means that the UPGMA algorithm assumes a constant rate of accumulation of mutations, an assumption which is often incorrect.

DBSCAN is a density-based algorithm which tries to separate the data into regions of high density, labelling points that lie in low-density areas as outliers. The algorithm takes two parameters – ε and minPts and looks for points that are density-connected with respect to ε. A point p is said to be density reachable from point q if there are points between p and q, such that one can traverse the path from p to q never moving further than ε in any step. Because the concept of density-reachability is not symmetric a concept of density-connectivity is introduced. Two points p and q are density-connected if there is a point o such that both p and q are density reachable from o. A set of points is considered a cluster if all points in the set are mutually density-connected and the number of points in the set is equal to or greater than minPts. The points that cannot be put into clusters are classified as noise.

a) Illustrates the concept of density-reachability.
b) Illustrates the concept of density-connectivity

The DBSCAN algorithm can efficiently detect clusters with non-globular shapes since it is sensitive to changes in density only. Because of that, the clustering reflects real structure present in the data. The problem with the algorithm is the choice of parameter ε which controls how large the difference in density needs to be to separate two clusters.

Investigating structural mechanisms with customised Natural Moves

This is a brief overview of my current work on a protocol for studying molecular mechanisms in biomolecules. It is based on Natural Move Monte Carlo (NMMC), a technique pioneered by one of my supervisors, Peter Minary.

NMMC allows the user to decompose a protein or nucleic acid structure into segments of atoms/residues. Each segment is moved collectively and treated as a single body. This gives rise to a sampling strategy that considers the structure as a fluid of segments and probes the different arrangements in a Monte Carlo fashion.

Traditionally the initial decomposition would be the only one that is sampled. However, this decomposition might not always be optimal. Critical degrees of freedom might have been missed out or chosen sub-optimally. Additionally, if we want to test the causality of a structural mechanism it can be informative to perform NMMC simulations for a variety of decompositions. Here I show an example of how customised Natural Moves may be applied on a DNA system.

Investigating the effect of epigenetic marks on the structure of a DNA toy model

Figure 1. Three levels at which epigenetic activity takes place. Figure adopted from Charles C. Matouk, and Philip A. Marsden Circulation Research. 2008;102:873-887

Epigenetic marks on DNA nucleotides are involved in regulating gene expression (Point 1 in figure 1). We have a limited understanding of the underlying molecular mechanism of this process. There are two mechanisms that are thought to be involved: 1) Direct recognition of the epigenetic mark by DNA binding proteins and 2) indirect recognition of changes in the local DNA structure caused by the epigenetic mark. Using customised Natural Moves we are currently trying to to gain insight into these mechanisms.

One type of epigenetic mark is the 5-hydroxymethylation (5hm) on cytosines. Lercher et al. have recently solved a crystal structure of the Drew-Dickerson Dodecamer with two of these epigenetic marks attached. Given the right sequence context (e.g. CpG) they have found that this epigenetic mark can form a hydrogen bond between two neighbouring bases on the same strand. This raises the question: Can an intra-strand CpG hydrogen bond alter the DNA helical parameters? We used this as a toy model to test our technology.

Figure 2b

Figure 2a

Figure 2a shows the Drew-Dickerson Dodecamer schematically. Figure 2b shows the three sets of degrees of freedom that we used as test cases.

Top (11): Default sampling – Serves as a reference simulation.
Middle (01): Fixed 5hm torsions – By forcing the hydroxyl group of 5hmC towards the guanine we significantly increase the chance of the hydrogen bond forming.
Bottom (00): Collective movement of the neighbouring bases 5hmC and G – By grouping the two bases into a segment we aim to emulate the dampening effect that a hydrogen bond may have on their movements relative to each other.

By modifying these degrees of freedom (1=active/ 0=inactive) we attempted to amplify any effects that the CpG hydrogen bond may have on the DNA structure.

Below you can see animations of the three test cases 11, 01, 00 during simulation, zoomed in on the 5hm modification and the neighbouring base pair:

It appeared that the default degrees of freedom were not sufficient to detect a change in the DNA structure when comparing simulations of unmodified and modified structures. The other two test cases with customised Natural Moves, however, showed alterations in some of the helical parameters.

Lercher et al. saw no differences in their modified and unmodified crystal structures. It seems that single 5hm epigenetic marks are not sufficient to significantly alter DNA structure. Rather, clusters of these modifications with accumulated structural effects may be required to cause significant changes in DNA helical parameters. CpG islands may be a promising candidate.

@samdemharter

Journal Club: The Origin of CDR H3 Structural Diversity

Antibody binding site is broadly composed of the six hypervariable loops, the CDRs. There are three loops on the antibody light chain (L1, L2 and L3) and three loops on the antibody heavy chain (H1, H2 and H3).

Out of the six loops, five appear to adopt a constrained set of structural conformations (L1, L2, L3, H1 and H2). The conformations of H3 appear much less constrained, which was suggested to be the result of its higher relative importance in antigen recognition (however it is not a necessary condition). The only observations to date about the shapes of CDR-H3 is the existence of the extended and kinked conformations of its anchor.

The function of the kink was investigated recently by Weitzner et al. Here, the authors contrasted the geometry found in the antibody CDR-H3 loops to a set of 15k non-antibody polypeptides. They found that even though the extended conformation appears to be more favorable, the kinked one can also be found in many cases, particularly in the PDZ domains.

Weitzner et al. find that the extended conformation is much more common in non-antibody loops. However, the kinked conformation, even though less frequent is not outright rare. The situation is the opposite in antibodies where the majority of H3 conformations are kinked rather than extended.

The authors contrasted the sequence patterns of kinked antibody loops and kinked non-antibody loops and did not find anything predictive of the kinked conformation — suggesting that the effect might be non-local. Nonetheless, the secondary structure pattern of the kinked H3 and the kinked non-antibody loops appears similar.

Even though there might be no sequence-kink link, the authors indicate how their findings might improve H3 structure prediction. They demonstrate that admixing the kinked non-antibody loops into a template dataset for an H3 modeling software might provide more relevant templates.

In conclusion, the main message of the paper (selon moi) is putting forward of the hypothesis as to the role of the H3 kink. Since the kink is much more prevalent in H3 than in non-antibody proteins, there is a strong suggestion that there might be a special role for it. The authors suggest that the kinked conformation allows for more structural diversity, that would otherwise be restricted in the more rigid beta-stranded extended conformation. Thus, antibodies might have opted for a system wherein, they do not need to add dramatic mutations to their H3 in order to get more structural flexibility.

Hierachical Natural Move Monte Carlo using MOSAICS

After having recently published a large scale Molecular Dynamics simulations project of TCRpMHC [1,2] interaction I have extended my research to another technique of spatial sampling. At this week’s group meeting I presented the first results of my first MOSAICS [3] project.

The MOSAICS package is a software that allows for so called hierarchical natural Monte Carlo moves. That means that the user can specify regions in the protein of interest. These regions are indented to reflect “natural” sets of atoms and are expected to move together. An example would be a stable alpha-helix. “Hierarchical” means that region can be grouped together to super-regions. For example a helix that is broken by a kink [4] in its middle could have a region for the helix parts on both sides of the kink as well as for the overall helix. An example for peptide/MHC is illustrated below.

MOSAICS uses Monte Carlo moves to rearrange these region with respect to each other. A stochastic chain closure algorithm ensures that no chain breaks occur. An example of such movements in comparison to classical all-atom Molecular Dynamics is shown below.

In this study we used MOSAICS to simulate the detachment of peptides from MHCs for experimentally known binder and non-binder. An example of such a detaching peptide is shown below

Our results show that experimentally known non-binding peptides detach significantly faster from MHC than experimentally known binding peptides (results to be reported soon).

As a first conclusion of this project:
After having worked with both MOSAICS and Molecular Dynamics simulations, I think that both techniques have their advantages and disadvantages. They are summarized below:

Which technique should be chosen for which project depends mainly on what the aims of these projects are. If large moves of well defined segments are expected then MOSAICS might be the method of choice. If the aim is to investigate fine changes and detailed dynamics Molecular Dynamics simulations might be the better choice.

1.    Knapp B, Demharter S, Esmaielbeiki R, Deane CM (2015) Current Status and Future Challenges in T-cell receptor / peptide / MHC Molecular Dynamics Simulations. Brief Bioinform accepted.
2.    Knapp B, Dunbar J, Deane CM (2014) Large Scale Characterization of the LC13 TCR and HLA-B8 Structural Landscape in Reaction to 172 Altered Peptide Ligands: A Molecular Dynamics Simulation Study. PLoS Comput Biol 10: e1003748.
3.    Sim AY, Levitt M, Minary P (2012) Modeling and design by hierarchical natural moves. Proc Natl Acad Sci U S A 109: 2890-2895.
4.    Wilman HR, Ebejer JP, Shi J, Deane CM, Knapp B (2014) Crowdsourcing yields a new standard for kinks in protein helices. J Chem Inf Model 54: 2585-2593.

A new method to improve network topological similarity search: applied to fold recognition

Last week I discussed the recent paper by Lhota et al. proposing a network-based similarity search method, applied to the problem of protein fold prediction. Similarity search is the foundation of bioinformatics. It plays a key role in establishing structural, functional and evolutionary relationships between biological sequences. Although the power of the similarity search has increased steadily in recent years, a high percentage of sequences remain uncharacterized in the protein universe. Cumulative evidence suggests that the protein universe is continuous. As a result, conventional sequence homology search methods may be not able to detect novel structural, functional and evolutionary relationships between proteins from weak and noisy sequence signals. To overcome the limitations in existing similarity search methods, the authors propose a new algorithmic framework, Enrichment of Network Topological Similarity (ENTS). While the method is general in scope, in the paper, authors focus exclusively on the protein fold recognition problem.

Fig 1: ENTS pipeline for protein fold prediction.

To initialize ENTS for structure prediction, ENTS builds a structural similarity graph of protein domains (Fig 1). The structural similarity graph is a weighted graph with one node for each structural domain and an edge between two nodes only if their pairwise similarity exceeds a certain threshold. In this article, the structural similarity score is determined by TM align with a threshold of 0.4. Next, some or all the structural domains in the database are labeled with SCOP. Given a query domain sequence and the goal to predict its structure, ENTS first links the query to all nodes in the structural similarity graph. The weights of these new edges are based only on the sequence profile-profile similarity derived from HHSearch. Then random walk with restart (RWR) is applied to perform a probabilistic traversal of the instance graph across all paths leading away from the query, where the probability of choosing an edge will be proportional to its weight. The algorithm will output a ranked list of probabilities of reaching each node in the structural graph from the query, thus potentially uncovering relationships missed by pair-wise comparison methods. ENTS also uses an enrichment analysis step to assess the reliability of the detected relationships, by comparing the mean relationship strength of a SCOP cluster in the structural graph and the query, to that of random clusters.

For testing the method, the authors first constructed a structural graph using 36,003 non-redundant protein domains from the PDB. The query benchmark set consisted of 885 SCOP domains, constructed by randomly selecting each domain from folds spanning at least two super-families. An additional step before prediction on the query set was to remove all domains from the structral graph which were in the same super-family as the query. The method was compared to existing methods such as CNFPred and HHSearch and it’s network approach and enrichment analysis step were found to contribute significantly to the accuracy of fold prediction. While the method seems to be an improvement on existing methods and is a novel use of network-based approaches to fold prediction, the false positive rate is still very high. One way of overcoming this, suggested by the authos is the use of energy-based scoring functions to further prune the list of potential hits returned by the method.

What does a farm look like? – Evaluating Community Detection Methods

Let’s assume you have a child. Tom is 5 years old, he’s a piece of work. He loves running around the house and the only time you can get a bit of rest is when he’s drawing. Tom loves drawing, so you use that whenever you need a bit of time to sit down. Today is one of those days, you didn’t sleep well and there was trouble at work. However, when you pick up Tom from daycare, he’s excited and full of energy, so you suggest Tom draw a farm which you can put on the wall in your office at work. You sit down and Tom is busy for half an hour. After half an hour Tom proudly presents his work. There are pigs in sties, horses in stables and cows on the meadow. He looks up at you and asks: “Is this a good farm?”. Of course, you say it is an amazing farm.

Only, what if you didn’t know what a farm looked like? You could ask the neighbour’s kid, Emily, to draw a farm and see what the images have in common? You could do this at a whole different scale and start a drawing competition for all the 5-year-olds in the neighbourhood and look which 5-year-olds draw most accurately to see who you should believe what a farm looks like. Clearly we’re not talking about children drawing farms any more. But what if Tom were a functional similarity metric that has just evaluated the results of a community detection algorithm run on a protein interaction network to generate communities?

That was a bit of a jump there. Let me explain. I have spent the last 2 years looking into how protein interaction networks (pins) can be partitioned into functional biological modules. It is widely believed that functional modules are an important scale of organization in humans or any other organisms (c.f. Hartwell et al 1999). These modules consist of proteins which perform the same or similar biological functions via interacting with each other. Thus, there may be a module which contains proteins that interact to e.g. heal wounds.

Finding Functional Modules

We attempt to find these modules by looking at a network which contains data on which proteins interact with which other ones (pins) and then use algorithms to group proteins together based on these interactions. The resulting protein communities contain proteins that interact more with each other than with the rest of the network. So that’s that, right?

…No, sadly it isn’t. The issue is that there are probably tens, if not hundreds of algorithms to find these communities and they don’t agree with each other. Furthermore, the underlying network contains a lot of false interactions and is missing a lot of true interactions. This affects different community detection methods in different ways. So we need a way to evaluate the results these community detection algorithms are presenting. But how do we judge what a good community is when it is exactly this community that we are looking for? How do we know it is “a good farm”, when we don’t know what a farm looks like?

Evaluating Community Detection by Functional Similarity

Luckily we do have some idea of what is “good”. Most proteins have annotations which suggest what biological functions they are involved in. As we are looking for functional modules which group together proteins involved in the same or a similar function, this is exactly along our alleyway! Lucky us :). Right?

Again, you might have expected this one, the answer is “not entirely”. Not only are there a lot of ways to find communities in a network, there are also a huge number of ways to use the above-mentioned functional annotations (GO annotations, www.geneontology.org) to calculate how “functionally similar” two proteins are. Now you might think that all these functional similarity measures use the same functional annotation data, so they should generally agree on what is a “good” and what is a “bad” community. This was my first intuition, so I checked. Below are two graphs which show the results of this test. In both cases I evaluate exactly the same sets of communities which I get from partitioning a pin using Link Clustering (Ahn et al 2010) at different resolutions. The plots show the average functional homogeneity of communities in different community size bands as judged by the Pandey Method on the left (Pandey et al 2008), and the simGIC method on the right (Pesquita et al 2007).

Functional homogeneity plots of a protein interaction network partitioned into communities by Link Clustering at different resolutions. The average functional homogeneity of communities is shown in different community size bands to give the coloured lines. Functional homogeneity is calculated as the average pairwise functional similarity between all protein pairs in a community. In Figure A) the Pandey method is used to calculate functional similarity between two proteins, while in Figure B) the simGIC method is used

You can clearly see that the two plots look different. Now it would be okay if they looked different and still agreed on the most important part: where is my pin partitioned into communities in the best way? To check this I look for maxima in the plots. Figure A) tells me that at a resolution of about 0.2 communities of size 2-35 are on average quite functionally homogeneous. At a slightly lower resolution (approx 0.1) communities of size 36+ look like they are partitioned decently. So depending on the community sizes I’m looking for, I can say which resolution I should go for. However, these peaks don’t show up in Figure B). We clearly need an evaluation of the evaluation metric. Which one should I believe?

Features that are common to both cases (the yellow peak around 0.4 and the magenta maximum around 0.3) might be “true”. But what is to say that the first set of peaks isn’t also important? In our earlier analogy, the neighbour’s daughter Emily has drawn a farm that looks quite different. You’re not about to say Tom’s wrong only because Emily drew a different picture, right? So right about now we need that drawing competition!

Evaluating Evaluation metrics: The drawing competition

Now we’ve stumbled twice already in how to evaluate our results. This time, we should make sure that we can evaluate the different results confidently. The plan is that we let the kids draw something else. Not a farm¸ because we don’t know how that looks like. We ask them to draw something related, like a house. We live in a house, so that should be easier to evaluate. And apparently houses and farms are not too dissimilar, so that the kids that do well in their house drawings, may be the ones that have the best farm drawings as well (or so we hope).

In terms of PINs, we have to create a network with community structure which looks a bit like a PIN. We can then use a community detection algorithm to find the communities and let our kids (functional similarity metrics) go away and evaluate the communities at different resolutions. This time we however know which maxima should show up, as we created the network and therefore know the community structure that should be found.

To create a PIN-similar network is not a mean feat and can be the topic of a whole PhD thesis, so I have focussed on a small, simple network, which is hopefully PIN-similar enough to be a meaningful evaluation for the functional similarity metrics. My network is generated on the basis of functional labels ordered in the ontology below.

A tree defining the relationship between labels 1-15. Nodes are annotated with these labels to generate a network. Labels which share more parent labels are closer related (i.e. 14 and 15 share the parent label 5).

Each node is assigned one or two labels between 6 and 15 randomly without replacement. Then the ontology is traced upwards to get the ancestral label sets associated with each node (i.e. label 15 has the ancestral labels 5 and 1). Edge probabilities are calculated between two nodes based on the number of common labels these nodes have in their ancestral label sets. Finally edges are added based on the edge probabilities to give a network with a density comparable to a PIN.

When this network is clustered into communities and functionally evaluated at different resolutions, the results look like this:

Average functional homogeneity of communities of size 3+ in a PIN-like random graph of 500 nodes. The top left graph (black) is based on label overlaps and is thus ground truth, while the other three are generated by the Pandey method (red), simGIC (blue), and simUI (green). The coloured graphs are generated by using the above-mentioned functional similarity metrics after 33% of the labels from 6-15 and 20% of the labels from 2-5 where “hidden” (reverted to the parent label).

The black graph shows how many labels nodes in the same community have in common at different resolutions. As these labels overlaps were used to generate the edge probabilities, this graph serves as a gold standard. The other three graphs were generated using the functional similarity metrics to be compared. To make the PIN-like network more realistic some labels were hidden, as we don’t always know the exact function of every protein when we evaluate PINs.

Now that we have the results, we just need to see how all the coloured graphs compare to the black ones, or in our analogy: how all the kid’s drawings compare to what our house looks like. But because we are doing science, one drawing competition sadly won’t do. What if one child draws our house very well, but happens to do a bit worse exactly in the parts where the house is similar to a farm. We’d think he’s the guy to listen to, and start thinking a farm looks like a shed?!

What we really need, is more drawing competitions. Lots of them. And this is where I am happy I’m at a computer where running one of these competitions takes maybe a minute. To get a bit more confidence in the results, I ran every simulation 100 times, for 27 different sets of parameters. And the results? Well, that’s the best bit. The results say Tom was drawing a great farm all along. Tom, the kid you’ve seen grow up and been bringing to daycare for years now, he was right. You weren’t lying at all when you said it was a great farm. Only in real life he’s not called Tom, he’s called the Pandey method ;).

Predicting Antibody Affinities – Progress?

Antibodies are an invaluable class of therapeutic molecules — they have the capacity to bind any molecule (Martin and Thornton, 1996), and this comes from an antibody’s remarkable diversity (Georgiou et al., 2014). In particular, antibodies can bind their targets with high affinity, and most drug design campaigns seek to ‘mature’ an antibody, i.e. increase the antibody’s affinity for its target. However, the process of antibody maturation is, in experimental terms, time-consuming and expensive — if we had 6 CDRs (as in a typical antibody), with 10 residues each, and if you can have any of the 20 amino acids in the CDR positions, there are 20^60 mutants to test (and this is before considering any double or triple mutations!)

So hold on, what is affinity exactly? Affinity represents the strength of binding, and it’s calculated as either a ratio of concentrations, or as a ratio of rate constants, i.e.In the simplest affinity maturation protocol, three steps are compulsory:

Mutate the antibody’s structure correctly
Assess the impact of mutation on KD
Select and repeat.

For the past year, we have centralised around part 2 — affinity prediction. This is a fundamental aspect of the affinity maturation pipeline in order to rationalise ‘good’ and ‘bad’ mutations in the context of maturing an antibody. We developed a statistical potential, CAPTAIN; essentially the idea is to gather contact statistics that are represented in antibody-antigen complexes, and use this information to predict affinities.

But why use contact information? Does it provide anything useful? Based on analyses of the interfaces of antibody-antigen complexes in comparison to general protein-protein interfaces, we definitely see that antibodies rely on a different binding mode compared to general protein-protein complexes, and other studies have confirmed this notion (Ramaraj et al., 2012; Kunik and Ofran, 2013; Krawczyk et al., 2013).

For our methodology, we trained on general protein-protein complexes (as most scoring functions do!) and specifically on antibody-protein complexes from the PDB. For our test set of antibody-protein complexes, we outperformed 16 other published methods, though for our antibody-peptide test set, we were one of the worst performers. We found that other published methods predict antibody-protein affinities poorly, though they make better predictions for antibody-peptide binding affinities. Ultimately, we achieve stronger performance as we use a more appropriate training set (antibody-antigen complexes) for the problem in hand (predicting antibody-antigen affinities). Our correlations were by no means ideal, and we believe that there are other aspects of antibody structures that must be used for improving affinity prediction, such as conformational entropy (Haidar et al., 2013) and VH-VL domain orientation (Dunbar et al., 2013; Fera et al., 2014).

What’s clear though, is that using the right knowledge base is key to improving predictions for solving greater problems like affinity maturation. At present, most scoring functions are trained on general proteins, but this ‘one-fits-all’ approach has been subject to criticism (Ross et al., 2013). Our work supports the idea that scoring functions should be tailored specifically for the problem in hand.

OOMMPPAA: A tool to aid directed synthesis by the combined analysis of activity and structural data

Motivation

Recently I published a paper on OOMMPPAA, my 3D matched-molecular pair tool to aid drug discovery. The tool is available to try here, download here and read about here. OOMMPPAA aims to tackle the big-data problem in drug discovery – where X-ray structures and activity data have become increasingly abundant. Below I will explain what a 3D MMP is.

What are MMPs?

MMPs are two compounds that are identical apart from one structural change, as shown in the example below. Across a set of a thousand compounds, tens of thousands of MMPs can be found. Each pair can be represented as a transformation. Below would be Br >> OH. Within the dataset possibly hundreds of Br >> OH will exist.

An example of an MMP

Each of these transformations will also be associated with a change in a measurable compound property, (solubility, acidity, binding energy, number of bromine atoms…). Each general transformation (e.g. Br>>OH) therefore would have a distribution of values for a given property. From this distribution we can infer the likely effect of making this change on an unmeasured compound. From all the distributions (for all the transformations) we can then predict the most likely transformation to improve the property we are interested in.

The issue with MMPs

Until recently the MMP approach had been of limited use for predicting compound binding energies. This is for two core reasons.

1) Most distributions would be pretty well normally distributed around zero. Binding is complex and the same change can have both positive and negative effects.

2) Those distributions that are overwhelmingly positive, by definition, produce increased binding against many proteins. A key aim of drug discovery is to produce selective compounds that increase binding energy only for the protein you are interested in. So increasing binding energy like that is not overall very useful.

3D MMPs to the rescue

3D MMPs aim to resolve these issues and allow MMP analysis to be applied to binding energy predictions. 3D MMPs use structural data to place the transformations in the context of the protein. One method, VAMMPIRE, asks what is the overall effect of Br>>OH when it is near to a Leucine, Tyrosine and Tryptophan (for example). In this way selective changes can be found.

Another method by BMS aggregates these changes across a target class, in their cases kinases. From this they can ask questions like, “Is it overall beneficial to add a cyclo-proply amide in this region of the kinase binding site”.

What does my tool do differently?

OOMMPPAA differs from the above in two core ways. First, OOMMPPAA uses a pharmacophore abstraction to analyse changes between molecules. This is an effective way of increasing the number of observations for each transition. Secondly OOMMPPAA does not aggregate activity changes into general trends but considers positive and negative activity changes separately. We show in the paper that this allows for more nuanced analysis of confounding factors in the available data.

How to (not) perform a Molecular Dynamics simulation study

At this week’s group meeting I presented my recently published paper:

Knapp B, Dunbar J, Deane CM. Large scale characterization of the LC13 TCR and HLA-B8 structural landscape in reaction to 172 altered peptide ligands: a molecular dynamics simulation study. PLoS Comput Biol. 2014 Aug 7;10(8):e1003748. doi: 10.1371/journal.pcbi.1003748. 2014.

This paper was presented on the front page of PLoS Computational Biology:

The paper describes Molecular Dynamics simulations (MD) of T-cell receptor (TCR) / peptide / MHC complexes.

MD simulation calculate the movement of atoms of a given system over time. The zoomed-in movements of the TCR regions (CDRs) recognizing a peptide/MHC are shown here:

Often scientists perform MD simulations of two highly similar complexes with different immunological outcome. An example is the Epstein-Barr Virus peptide FLRGRAYGL which leads to induction of an immune response when presented by HLA-B*08:01 to the LC 13 TCR. If position 7 of the peptide is mutated to A then the peptide does not induce an immune reaction any-more.
In both simulations the TCR and MHC are identical only the peptide differs in one amino acid. Once the simulations of those two complexes are done one could for example investigate the root mean square fluctuation of both peptides to investigate if one of them is more flexible. And indeed this is the case:

Another type of analysis would be the solvent accessible surface area of the peptide or, as shown below, the CDR regions of the TCR beta chain:

… and again it is obvious that those two simulations strongly differ from each other.

Is it then a good idea to publish an article about these findings and state something like “Our results indicate that peptide flexibility and CDR solvent accessible surface area play an important in the activation of a T cell” ?

As we know from our statistics courses a 1 vs 1 comparison is an anecdotic example but those often do not hold true for larger sample sizes.

So, let’s try it with many more peptides and simulations (performing these simulations took almost 6 months on the Oxford Advanced Research Computing facility). We split the simulations in two categories, those which are more immunogenic and those which are less immunogenic:

Let’s see what happens to the difference in RMSF that we found before:

So much for “More immunogenic peptides are more/less flexible as less immunogenic ones” …

How about the SASA that we “found” before:

… again almost perfectly identical for larger sample sizes.

And yes, we have found some minor differences between more and less immunogenic peptides that hold true for larger sample sizes but non of them was striking enough to predict peptide immunogenicity based on MD simulations.

What one really learns from this study is that you should not compare 1 MD simulation against 1 other MD simulation as this will almost certainly lead to false positive findings. Exactly the same applies for experimental data such as x-ray structures because this is a statistical problem rather than a prediction based on. If you want to make sustainable claims about something that you found then you will need a large sample size and a proper statistical sample size estimation and power analysis (as done in clinical studies) before you run any simulations. Comparing structures is always massive multiple testing and therefore high sample sizes are needed to draw valid conclusions.

link pdf

Finding Communities in Networks: Link Clustering and the Affiliation Graph Model

The research I do revolves around how to break down protein interaction networks (PINs) into functional modules, or communities, using community detection methods. These communities are groups of proteins which interact more with one another than with the rest of the network. We decompose the PINs in this manner, as it is very difficult to determine how a protein functions through its interactions by simply looking at the entire network. The sheer amount of data in one of these networks, clouds the information we would actually like to see, as explained in my previous blog post.

In a journal club in August, I presented the overlapping community detection method Link Clustering by Ahn et al. which is the first community detection method I am aware of that assigns nodes to communities by clustering the edges between the nodes. I contrasted Link Clustering with the Affiliation Graph Model method proposed by Yang and Leskovec, as the authors use the same comparison technique to validate their methods. In a world where studies usually only work on a single network using a specific method, having two independent papers that use the same validation method is something that gets me much more excited than it probably should.

Link Clustering

Instead of assigning nodes to communities, Link Clustering focusses on the edges in a network. Clustering edges is essentially an equivalent problem to clustering nodes. This becomes obvious if you take a graph, decide to call all the edges “blobs” and say two of your new “blobs” have an edge between them if they share a node in the original network. This is called creating the “line graph” of the network and performing “blob” clustering on this is the same as performing edge clustering on the initial network.

To come back from my tangent, if an edge is assigned to community A, this means the nodes on either side of this edge are assigned community A by association. By assigning each edge to an individual community, Link Clustering creates overlapping node communities as shown in Figure 1 below.

Example of a simple, link-clustered network. Nodes can be seen to be in multiple communities as edges are assigned to communities (c.f. node 4 in red, green and yellow communities). [Ahn et al., 2010]

Figure 1: Example of a simple, link-clustered network. Nodes can be seen to be in multiple communities as edges are assigned to communities (c.f. node 4 in red, green and yellow communities). [Ahn et al., 2010]

In Link Clustering, edges are placed into the same community, based on similarity scores computed between all connected edges. Edge clusters are computed from these similarity scores using single-linkage clustering, where the edges with the highest similarity scores are iteratively merged into the same community. Using this method all edges end up in the same community in the end, so a threshold for the similarity score needs to be found at which the network is partitioned into communities in the best way. This threshold value represents a type of resolution parameter of the for community detection. At a low threshold there are few, large communities as many edges have been clustered together, while at a high threshold only highly similar edges are in the same community and there are many, small communities. Ahn et al. propose maximizing a function called the partition density to find the optimal resolution threshold. This function is simply a weighted average of the density of the communities. For those incredibly keen, it is given by the equation:

$D = \frac{2}{M} \sum_c m_c \frac{m_c - (n_c -1)}{(n_c - 2)(n_c -1)}$

Here, $M$ represents the total number of edges in the network and $m_c$ and $n_c$ represent the number of edges and nodes in the community $c$ respectively.

All of this partitioning is based on a similarity score between two edges. The crux of this method, the scoring measure, is explained in Figure 2.

Schematic Diagram showing how the similarity between edges e_ik and and e_jk is calculated in Link Clustering. The overlap of the inclusive neighbour sets of nodes i and j is divided by the union of these sets. The inclusive neighbour set of node i is computed by taking the neighbours of node i and including node i itself. The nodes in grey are overlapping between both sets, therefore the similarity between edges e_ik and e_jk is 4/12 or 1/3. [Ahn et al 2010]

Figure 2: Schematic Diagram showing how the similarity between edges $e_{ik}$ and and $e_{jk}$ is calculated in Link Clustering. The overlap of the inclusive neighbour sets of nodes i and j is divided by the union of these sets. The inclusive neighbour set of node i is computed by taking the neighbours of node i and including node i itself. The nodes in grey are overlapping between both sets, therefore the similarity between edges $e_{ik}$ and $e_{jk}$ is 4/12 or 1/3. [Ahn et al. 2010]

Affiliation Graph Model

The method which I would like to compare Link Clustering to is the Affiliation Graph Model (AGM). Briefly, this is a statistical community detection algorithm which generates overlapping communities. Node affiliations are recorded in the Community Affiliation Matrix shown in Figure 2.

Schematic of the Community Affiliation Matrix, which shows nodes (coloured squares) being affiliated with communities (circles A, B, and C) as shown by the arrows. Nodes can have multiple community affiliations as indicated by red and purple node squares. The probability of interaction between two nodes which are both in only community A, is $p_A$ . [Yang and Leskovec, 2012]

The AGM has a probability of interaction for each community, denoted $p_A$ for community $A$ . For nodes $u$ and $v$ , which may share several community affiliations, the probability of interaction, $p(u,v)$ is given by the equation:

$p(u,v) = 1 - \prod_{k \in C_{uv}} (1 - p_k),$

where $k$ denotes a community in the set of communities shared by nodes $u$ and $v$ , $C_{uv}$ . This model is be fitted to the network using a Metropolis-Hastings algorithm with a Markov chain constructed by pre-defined defined steps in the space of possible Community Affiliation Matrices (c.f. Yang and Leskovec 2012).

Result Comparison Method

Link Clustering itself has some very interesting results, especially relating to looking at real-world networks at different resolutions (which can be found in the paper). However, I want to focus on the validation method used in the paper which compares results generated from Link Clustering with that of other methods. This comparison method evaluates four different aspects of the partitions generated by different methods. These four aspects are: Community Quality, Overlap Quality, Community Coverage, and Overlap Coverage. The “Quality” aspects relate to network metadata indicating how good the obtained communities are, and the “Coverage” aspects relate to the amount of information extracted from a network.

The metadata referred to above is a general term for additional data available about the nodes in a network. In the case of PINs this metadata is related to the function of proteins in the network (their Gene Ontology annotations). Loosely, proteins with similar functions should be placed in the same community to improve the Community Quality measure. Overlap Quality concerns the boundaries between generated communities. If the functions assigned to proteins show similar boundaries between groups of functions, the Overlap Quality is high.

The “Coverage” values are more basic calculations. A partition has a high Community Coverage if the fraction of nodes assigned to communities with three or more nodes is high (non-trivial communities). A community of size two is uninformative, as it is the result of a single edge being in a community by itself. Overlap Coverage is simply the number of non-trivial community memberships per node. The two “Coverage” values are thus equal for non-overlapping community detection methods.

When comparing community detection methods, these four measures are computed for a partition generated by each method, and then rescaled so that the maximum score achieved by any method is 1 for each measure independently. These values are then added to give a score between 0 and 4 for each community detection method.

Results

The results shown in Figure 4 were generated by Yang and Leskovec to show their AGM outperforms Link Clustering (and other methods) on a variety of networks. While it is noteable that they used the same metric and networks used by Ahn et al. when proposing Link Clustering, this Figure must be taken with a pinch of salt. Yang and Leskovec acknowledge that the metric proposed by Ahn et al. rewards methods which find more communities in a network and thus do not fit the number of communities judged to be best by the AGM, but instead fit the same number as fitted in Link Clustering. Furthermore, it is peculiar that half of the networks used for comparison by Ahn et al. have disappeared in this comparison.

Community detection method comparison for Link Clustering (L), Clique Percolation (C), Mixed-Membership Stochastic Block Model (M) and the Affiliation Graph Model (A) on different PINs (PPI) and other social networks. Methods are compared by the metric proposed by Ahn et al. [Yang & Leskovec 2012]

Figure 4: Community detection method comparison for Link Clustering (L), Clique Percolation (C), Mixed-Membership Stochastic Block Model (M) and the Affiliation Graph Model (A) on different PINs (PPI) and other social networks. Methods are compared by the metric proposed by Ahn et al. [Yang & Leskovec 2012]

To conclude we can say that while it is very interesting that two papers use the same method to compare their methods to others for validation purposes, it wasn’t done completely accurately here. The comparison metric is however an interesting development to create a gold standard for method comparison. For my purposes, Community Quality is the most important, so maybe a weighted version of this may be more interesting.

Oxford Protein Informatics Group

or "OPIG" to friends