
A designed conformational shift to control protein binding specificity


Proteins and their binding partners form complexes through complementary shapes. Emil Fischer was onto something when he introduced the "lock and key" mechanism in 1894: for the first time, the shape of molecules was considered crucial for molecular recognition. Since then there have been various attempts to improve upon this theory in order to incorporate the structural diversity of proteins and their binding partners.

Koshland proposed the "induced fit" mechanism in 1958, which states that interacting partners undergo local conformational changes upon complex formation to strengthen binding. An additional mechanism, "conformational selection", was introduced by Monod, Wyman and Changeux, who argued that the conformational change occurs before binding, driven by the inherent conformational heterogeneity of proteins. Given a protein that fluctuates between two states A and B, and a substrate C that only interacts with one of these states, the chance of complex formation depends on the probability of the protein being in state A or B. Furthermore, one could imagine a scenario where a protein has multiple binding partners, each binding to a different conformational state. This means that some proteins exist in an equilibrium of different structural states, and this equilibrium determines the prevalence of interactions with the different binding partners.
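To make the population argument concrete, here is a small numerical illustration (the numbers are invented, not taken from the paper): the relative populations of two states follow directly from their free-energy difference, and a partner that only recognises one state effectively only "sees" that fraction of the protein.

```python
import math

kT = 0.593                 # kcal/mol at ~298 K
dG = 0.5                   # hypothetical free-energy difference G_open - G_closed, kcal/mol

# Boltzmann populations of the two states
p_closed = 1.0 / (1.0 + math.exp(-dG / kT))
p_open = 1.0 - p_closed
print(f"p(closed) = {p_closed:.2f}, p(open) = {p_open:.2f}")
# a partner that binds only the open state effectively sees ~30% of the protein here;
# shifting dG therefore shifts binding specificity
```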


Figure 1. The “pincer mode”.

Based on this observation, Michielssens et al. used various in silico methods to manipulate the populations of binding-competent states of ubiquitin in order to change its protein binding behaviour. Ubiquitin is known to adopt two equally populated states along the "pincer mode" (the collective motion described by the first PCA eigenvector): closed and open.

Figure 2. A schematic of the conformational equilibrium of ubiquitin, which can take on a closed or open state. Depending on its conformation it can bind different substrates.

Different binding partners prefer either the closed state, the open state, or both. By introducing point mutations in the core of ubiquitin, away from the binding interface, Michielssens et al. managed to shift the conformational equilibrium between the open and closed states, thereby changing binding specificity.

Point mutations were selected according to the following criteria:

⁃ residues must be located in the hydrophobic core
⁃ binding interface must be unchanged by the mutation
⁃ only hydrophobic residues may be introduced (as well as serine/threonine)
⁃ glycine and tryptophan were excluded because of their size


Figure 3. Conformational preference of ubiquitin mutants. ddG_mut = dG_open – dG_closed.

Fast growth thermodynamic integration (FGTI), a method based on molecular dynamics, was used to calculate the relative stabilisation or destabilisation of the open and closed states caused by each mutation. Interestingly, most mutations that stabilised the open state were concentrated on a single residue, isoleucine 36.
For the 15 most significant mutations a complete free energy profile was calculated using umbrella sampling.
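As a simplified illustration of the FGTI step: free-energy differences can be estimated from ensembles of short non-equilibrium "growth" simulations, for example via the Jarzynski exponential work average. The sketch below uses synthetic work values and that particular estimator as an assumption; the actual estimator used in the paper may well differ.

```python
import numpy as np

kT = 2.494                                  # kJ/mol at 300 K
rng = np.random.default_rng(0)

def jarzynski_dG(work):
    """dG = -kT * ln< exp(-W/kT) >, averaged over non-equilibrium work values W."""
    w = np.asarray(work)
    return -kT * np.log(np.mean(np.exp(-w / kT)))

# synthetic work distributions for the alchemical mutation performed
# in the open and in the closed conformation (placeholders, not real data)
work_open = rng.normal(loc=12.0, scale=2.0, size=200)
work_closed = rng.normal(loc=10.0, scale=2.0, size=200)

ddG_mut = jarzynski_dG(work_open) - jarzynski_dG(work_closed)   # ddG_mut = dG_open - dG_closed
print(f"ddG_mut ~ {ddG_mut:.1f} kJ/mol (positive: the mutation favours the closed state)")
```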


Figure 4. Free energy profiles for six different ubiquitin mutants, calculated using umbrella sampling simulations. Mutants preferring the closed substate are shown in red, open substate stabilizing mutants are depicted in blue, those without a preference are shown in gray.


Figure 5. Prediction of binding free energy differences between wild-type ubiquitin and different point mutants (ddG_binding = dG_binding,mutant – dG_binding,wild-type).

To further validate that the mutants were correctly categorised as stabilising the open or closed state, six X-ray structures of ubiquitin in complex with a binding partner that prefers either the open or closed state were simulated with each of the mutations. Figure 5 shows the change in binding free energy caused by each mutation in compatible, neutral and incompatible complexes (compatible refers, for example, to an "open-favouring" mutation (blue) in an open complex (blue), and vice versa).


Figure 6. Comparison of the change in binding free energy predicted from the calculations for ubiquitin with the experimental result.

In the last step, a selection of open and closed mutations was introduced into an open complex and the change in binding free energy was compared between experiment (NMR) and simulation. For this example the mutants behaved as expected: an increase in binding free energy was observed when the closed mutations were introduced into the open complex, while only subtle changes were seen when the "compatible" open mutations were introduced.

The authors suggest that in the future this computational protocol may become a cornerstone of designing allosteric switches. However, given that the approach requires pre-existing knowledge and is tailored to proteins with well-defined conformational states, it may take some time until we discover its full potential.

Molecular Dynamics of Antibody CDRs

Efficient design of antibodies as therapeutic agents requires an understanding of their structure and behaviour in solution. I have recently performed molecular dynamics simulations to investigate the flexibility and solution dynamics of complementarity determining regions (CDRs). Eight structures of the Fv region of antibody SPE7 with identical sequences were found in the Protein Data Bank. Twenty-five replicas of 100 ns simulations were performed on the Fv region of one of these structures to investigate whether the CDRs adopted the conformation of any of the other X-ray structures. The simulations showed the H3 and L3 loops starting from one conformation and adopting another experimentally determined conformation.
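For readers curious what such an analysis can look like in practice, below is a minimal sketch of tracking how close a loop gets to an alternative crystal conformation along a trajectory, using MDAnalysis. The file names, chain ID and residue range are placeholders, not the actual SPE7 set-up.

```python
import MDAnalysis as mda
from MDAnalysis.analysis.rms import rmsd

# starting structure plus one simulation replica, and an alternative crystal form
traj = mda.Universe("fv_start.pdb", "replica_01.xtc")
ref = mda.Universe("fv_alternative_xray.pdb")

sel = "backbone and segid H and resid 95-102"        # hypothetical H3 loop selection
loop = traj.select_atoms(sel)
loop_ref = ref.select_atoms(sel).positions.copy()

for ts in traj.trajectory:
    # RMSD of the loop backbone after optimal superposition onto the reference loop
    r = rmsd(loop.positions, loop_ref, superposition=True)
    print(f"{ts.time:10.1f} ps  H3 backbone RMSD to the alternative X-ray: {r:5.2f} A")
```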

This confirms the potential of molecular dynamics to be used to investigate antibody binding and flexibility. Further investigation would involve simulating different systems, for example using solution NMR-resolved structures, and comparing the conformations deduced here to the canonical forms of CDR loops. Looking forward, it is hoped that molecular dynamics could be used to predict the bound conformation of an antibody from the unbound structure.

Click here for simulation videos.

Using B factors to assess flexibility

In my work analysing antibody loops I have reached the point where I am interested in flexibility, more specifically in challenging the somewhat popular belief that these loops, especially the H3 loop, are highly flexible. For this I wanted to use the B factor (also known as the temperature or Debye-Waller factor), which can be interpreted as a measure of the temperature-dependent vibration of the atoms in the crystal or, in gentler terms, the flexibility at a certain position. I was keen to use the backbone atoms, and possibly the Cβ, but unfortunately the B factor shows some biases, as it is used to absorb other uncertainties arising from low resolution, poor electron density and, as a result, poor modelling. If we take a non-redundant set of loops and split them into resolution shells of 0.2 Å we see how pronounced this bias is (Fig. 1(a)).


Fig. 1(a) Comparison of average backbone B factors for loops found in structures at increasing resolution. A clear bias can be observed that correlates with the increase in resolution.


Fig. 1(b) Normalization using the average Z-score of the B factor of backbone atoms shows no bias at different resolution shells.

Comparing loops from different resolution shells is therefore virtually uninformative, and can lead to quite surprising results. In one analysis it came up that loops directly present in the binding site of antibodies had a higher average B factor than loops in structures without antigen, where the movement is less constrained.

The issue here is that a complex structure (antibody-antigen) is larger and tends to have a poorer resolution, and therefore more biased B factors. To solve this issue I decided to normalize the B factors using the Z-score within each PDB file, where the mean and the standard deviation are computed from all the backbone atoms of amino acids inside that file. To my knowledge this method was first described by Parthasarathy and Murthy (1997) [1], although I came to it without reading their paper, the normalization being quite intuitive. Using this measure we can finally compare loops from different structures at different resolutions with each other (Fig. 1(b)), and we see what is expected: loops found in bound structures are less flexible than loops in unbound structures (Fig. 2). We can also answer our original question: does the H3 loop present an increased flexibility? The answer from Fig. 2 is no, if we compare non-redundant sets of loops from antibodies to those from general proteins.
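For anyone wanting to try this, the normalisation is only a few lines. Below is a minimal sketch using Biopython; the file name and the loop residue range are illustrative, and this is not the exact script used for the figures.

```python
import numpy as np
from Bio.PDB import PDBParser

BACKBONE = {"N", "CA", "C", "O"}

structure = PDBParser(QUIET=True).get_structure("example", "example.pdb")
atoms = [a for a in structure.get_atoms()
         if a.get_name() in BACKBONE and a.get_parent().id[0] == " "]  # backbone atoms of standard residues

b = np.array([a.get_bfactor() for a in atoms])
z = (b - b.mean()) / b.std()                    # per-file Z-score normalisation of the B factors

# normalised B factor of a loop = mean Z-score over its backbone atoms (hypothetical residue range)
loop_idx = [i for i, a in enumerate(atoms) if 95 <= a.get_parent().id[1] <= 102]
print("normalised loop B factor:", z[loop_idx].mean())
```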


Fig. 2 Flexibility comparison using the normalized B factor between a non-redundant set of non-Ig-like protein loops and different sets of H3 loops: bound to antigen (H3 bound), unbound (H3 unbound), and both (H3). For each comparison, ten samples with the same number of examples and a similar length distribution were generated and amassed (LMS), to correct for a possible length bias, since the H3 loop is known to have a propensity for longer loops than average.

References

[1] Parthasarathy, S. and Murthy, M. R. N. (1997). Analysis of temperature factor distribution in high-resolution protein structures. Protein Science, 6(12), 2561-2567. ISSN 0961-8368.

Protein loops – why do we care?

In my DPhil research, I work on the development of new methods for predicting protein loop structures. But what exactly are loops, and why should we care about their structures?

Many residues in a given protein will form regions of regular structure, in α-helices and β-sheets. The segments of the protein that join these secondary structure elements together, and that do not have easily observable regular patterns in their structure, are referred to as loops. This does not mean, though, that loops are only a minor component of a protein structure – on average, half of the residues in a protein are found in loops [1], and they are typically found on the surface of the protein, which is largely responsible for its shape, dynamics and physicochemical properties [2].

Connecting different secondary structures together is often not the only purpose of a loop – they are often vitally important to a protein’s function. For example, they are known to play a role in protein-protein interactions, recognition sites, signalling cascades, ligand binding, DNA binding, and enzyme catalysis [3].

As regular readers of the blog are probably aware by now, one of the main areas of research for our group is antibodies. Loops are vital for an antibody’s function, since its ability to bind to an antigen is mainly determined by six hypervariable loops (the complementarity determining regions). The huge diversity in structure displayed by these loops is the key to how antibodies can bind to such different substances. Knowledge of loop structures is therefore extremely useful, enabling predictions to be made about the protein.

Loops involved in protein function: a methyltransferase binding to DNA (top left, PDB 1MHT); the active site of a triosephosphate isomerase enzyme (bottom left, PDB 1NEY); an antibody binding to its antigen (blue, surface representation) via its complementarity determining regions, shown as the coloured loops (centre, PDB 3NPS); the activation loop of a tyrosine kinase has a different conformation in the active (pink) and inactive (blue) forms (top right, PDBs 1IRK and 1IR3); a zinc finger, where the zinc ion is coordinated by the sidechain atoms of a loop (bottom right, PDB 4YH8).

More insertions, deletions and substitutions occur in loops than in the more conserved α-helices and β-sheets [4]. This means that, for a homologous set of proteins, the loop regions are the parts that vary the most between structures. While this variability often makes the protein's function possible, as in the case of antibodies, it leads to unaligned regions in a sequence alignment, so standard homology modelling techniques cannot be used. This makes prediction of their structure difficult – it is frequently the loop regions that are the least accurate parts of a protein model.

There are two types of loop modelling algorithm: knowledge-based and ab initio. Knowledge-based methods look for appropriate loop structures in a database of previously observed fragments, while ab initio methods generate possible loop structures without prior knowledge. There is some debate about which approach is best. Knowledge-based methods can be very accurate when the target loop is close in structure to one seen before, but perform poorly when this is not the case; ab initio methods are able to access regions of the conformational space that have not been seen before, but fail to take advantage of any structural data that is available. For this reason, we are currently working on developing a new method that combines aspects of the two approaches, allowing us to take advantage of the available structural data whilst still being able to predict novel structures.

[1] L. Regad, J. Martin, G. Nuel and A. Camproux, Mining protein loops using a structural alphabet and statistical exceptionality. BMC Bioinformatics, 2010, 11, 75.

[2] A. Fiser and A. Sali, ModLoop: automated modeling of loops in protein structures. Bioinformatics, 2003, 19, 2500-2501.

[3] J. Espadaler, E. Querol, F. X. Aviles and B. Oliva, Identification of function-associated loop motifs and application to protein function prediction. Bioinformatics, 2006, 22, 2237-2243.

[4] A. R. Panchenko and T. Madej, Structural similarity of loops in protein families: toward the understanding of protein evolution. BMC Evolutionary Biology, 2005, 5, 10.

Antibody binding site re-design

In this blog post I describe three successful studies on structure based re-design of antibody binding sites, leading to significant improvements of binding affinity.

In their study, Clark et al. [1] re-designed the binding site of antibody AQC2 to improve its binding affinity for the I domain of human integrin VLA1. The authors assessed the effects of the mutations on the binding energy using the CHARMM [2,3] potential, with the electrostatic and desolvation energies calculated using the ICE software [4]. In total, 83 variants were identified for experimental validation, some of which included multiple mutations. The mutated antibodies were expressed in E. coli and their affinity for the antigen was measured. The best mutant included a total of four mutations which improved the affinity by approximately one order of magnitude, from 7 nM to 850 pM. The crystal structure of the best mutant was solved to further study its interaction with the target.

Figure 1: Comparison of calculated and experimental binding free energies. (Lippow et al., 2007)

Lippow et al. [5] studied the interactions of three antibodies – the anti-epidermal growth factor receptor drug cetuximab [6], the anti-lysozyme antibody D44.1 and the anti-lysozyme antibody D1.3 – with their respective antigens. The energy calculations favoured mutations to large amino acids (such as Phe or Trp), most of which were found to be false positives. More accurate results were obtained using only the electrostatic term of the energy function. The authors improved the binding affinity of D44.1 by one order of magnitude and the affinity of cetuximab by two orders of magnitude. The antibody D1.3 didn't show many opportunities for electrostatic improvement, and the authors suggest it might be an anomalous antibody.

Computational methods have recently been used to successfully introduce non-canonical amino acids (NCAAs) into the antibody binding site. Xu et al. [7] introduced L-DOPA (L-3,4-dihydroxyphenylalanine) into the CDRs of the anti-protective antigen scFv antibody M18 to crosslink it with its native antigen. The authors used the program Rosetta 3.4 to create models of the antibody-antigen complex with L-DOPA residues. The distance between L-DOPA and a lysine nucleophile was used as a predictor of crosslinking. The crosslinking efficiency was quantified as the fraction of antibodies that underwent a mass change, measured using Western blot assays. The measured average efficiency of the mutants was 10%, with a maximum efficiency of 52%.

[1]      Clark LA, Boriack-Sjodin PA, Eldredge J, Fitch C, Friedman B, Hanf KJM, et al. Affinity enhancement of an in vivo matured therapeutic antibody using structure-based computational design. Protein Sci 2006;15:949–60. doi:10.1110/ps.052030506.

[2]      Brooks BR, Bruccoleri RE, Olafson DJ, States DJ, Swaminathan S, Karplus M. CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics Calculations. J Comput Chem 1983;4:187–217.

[3]      MacKerel Jr. AD, Brooks III CL, Nilsson L, Roux B, Won Y, Karplus M. CHARMM: The Energy Function and Its Parameterization with an Overview of the Program. In: v. R. Schleyer et al. P, editor. vol. 1, John Wiley & Sons: Chichester; 1998, p. 271–7.

[4]      Kangas E, Tidor B. Optimizing electrostatic affinity in ligand–receptor binding: Theory, computation, and ligand properties. J Chem Phys 1998;109:7522. doi:10.1063/1.477375.

[5]      Lippow SM, Wittrup KD, Tidor B. Computational design of antibody-affinity improvement beyond in vivo maturation. Nat Biotechnol 2007;25:1171–6. doi:10.1038/nbt1336.

[6]      Sato JD, Kawamoto T, Le AD, Mendelsohn J, Polikoff J, Sato GH. Biological effects in vitro of monoclonal antibodies to human epidermal growth factor receptors. Mol Biol Med 1983;1:511–29.

[7]      Xu J, Tack D, Hughes RA, Ellington AD, Gray JJ. Structure-based non-canonical amino acid design to covalently crosslink an antibody-antigen complex. J Struct Biol 2014;185:215–22. doi:10.1016/j.jsb.2013.05.003.

GO annotations: Warning – secondary IDs!

I'm writing this post after a bit of a scare that has cost me at least 2 hours of my life so far, and may possibly cost me a lot more. This is something I wish I had read about before starting to use gene ontology annotations (GO terms), so hopefully this will reach some people who are in the position I was in a couple of years ago. The post assumes you know about the GO directed acyclic graph (DAG), which describes the relationships between GO terms, and a bit about functional similarity. Good reviews on this were written by Pesquita et al. (2009) and Guzzi et al. (2012).

 

Secondary GO-terms

Yesterday evening I discovered that GO terms can have secondary accession numbers. For example, GO:0060555 is a secondary ID for GO:0070266, which describes the biological process term "necroptotic process". The existence of secondary IDs in itself is not particularly surprising, given that ontologies like the Gene Ontology are at the forefront of research and thus have to be updated with the latest knowledge. As is the case for UniProtKB and RefSeq, identifiers are merged if they are found to refer to the same thing. RefSeq has a good overview of when and why this occurs in their FAQ. Keeping a record of secondary IDs in the database is important for compatibility with past applications of the ontology.

 

Why is this a problem?

This becomes a problem when that backward compatibility is not fully committed to. For example, the downloadable annotation set for human proteins may contain GO terms that are secondary IDs, while the downloadable ontology structure does not include these secondary IDs in the directed acyclic graph (DAG). Someone may then download all the annotations and look up their parent terms in the DAG, but as these terms are not nodes in the DAG, they have no parent term set.

 

So why the scare?

It seems like this behaviour should become obvious very fast. It should either give a key error that you are looking for something that isn’t there, or an empty set should ring some alarm bells. The scare comes in when neither happens, and you just don’t notice it at all, only to find out a couple of years later…

In order to calculate the functional similarity between two proteins, I need the full GO term sets associated with every protein including ancestral terms. I have a local copy of the ontology structure which I can use to query for these ancestral terms based on the GO terms directly associated with the proteins. As I need to do this for between 6800 and 13000 proteins depending on the network, I automated the process. The problem is that MySQL returns an empty set when asking for a match on an accession number that isn’t there. Returning an empty set has the effect that terms not in the ontology structure are automatically filtered out. This came to light now that I’m looking to explicitly calculate the information content of each GO term (a metric for how prevalent a term is in a data set) and three terms could not be found in the ancestral term sets for any proteins.

 

How do I prevent this from happening to me?

As there are only a few secondary ID terms associated with proteins in the downloadable association file, it is easy to manually look for mappings to the respective primary IDs in the EBI QuickGO tool. Then, before querying for ancestral term sets, just map the secondary IDs across to the primary IDs and you're all set (a small sketch of such a mapping is given below). It's really not a hassle once you know it's there. Sadly, there is only a sentence on the Gene Ontology site about secondary IDs, and no mention whatsoever that these IDs are not incorporated in the go-basic ontology download, which is the one recommended for finding ancestral term sets.
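If you prefer to fix it programmatically, the alt_id fields in the ontology file itself can be used to build the mapping. A minimal sketch (assuming your copy of go-basic.obo lists alt_id lines in its term stanzas, which it normally does):

```python
def load_alt_id_map(obo_path):
    """Map secondary (alt_id) GO accessions to their primary IDs from an OBO file."""
    alt_to_primary = {}
    primary = None
    with open(obo_path) as fh:
        for line in fh:
            line = line.strip()
            if line == "[Term]":
                primary = None
            elif line.startswith("id: GO:"):
                primary = line.split("id: ", 1)[1]
            elif line.startswith("alt_id: GO:") and primary:
                alt_to_primary[line.split("alt_id: ", 1)[1]] = primary
    return alt_to_primary

alt_map = load_alt_id_map("go-basic.obo")
annotations = {"GO:0060555", "GO:0007165"}                          # example annotation set
annotations = {alt_map.get(go_id, go_id) for go_id in annotations}  # secondary -> primary
```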

 

Are you okay now?

As I hinted at before, there are very few proteins annotated with secondary ID terms. In my case I found three secondary ID GO terms annotated to five different proteins, out of a total of 6861 proteins I was checking. One protein was also annotated with the corresponding primary ID term, so there was no effect, and another protein was annotated with the only parent term of the secondary ID, which no other proteins were annotated to. Thus, the effect really boils down to three proteins with the same secondary ID. Of those three, only one is heavily affected in the worst case, with its similarity score to 10 other proteins changing from 9.3 to 2.6 without the primary ID annotation (scores range from ~1.5 to 12). Those are 10 scores out of a total of approximately 24,000,000… I think I will survive. But I have yet to test this on a further data set.

 

The effect is basically that you ignore a couple of annotations. Given that we do not have the full set of annotations anyway, this change is the same as if an experiment or two hadn't been conducted. Furthermore, the three proteins I found to be affected had a lot of annotations, so missing one should not have the worst-case effect I investigated. Nonetheless, you may want to do it properly the first time round and take all the data into account when you start out. It should save you a lot of hassle later.

 

Stay vigilant, my friends!

Ways to compare networks

Nowadays network comparison is becoming increasingly relevant. Why is this?  Mainly because it is a desirable way to compare complex systems that can often be represented as networks.

Network comparison aims to account for properties that are generated by the simultaneous interaction of all units rather than the properties of each single individual. Here are some cases where network comparison could be helpful:

– Showing and highlighting "significant" changes in network evolution. A particular characteristic of interest could be the speed with which information flows.

– Quantifying how "close" two networks are, even when the networks have a different number of nodes and edges or, in the case of spatially embedded networks, a different scale.

As an example, look at the following two networks. Are the structures of these two road networks similar?


(Images obtained from the mac Maps app)

Or what about the structure of these two other networks?

One of the difficulties in comparing networks is that there is no clear way to compare them as complete, whole entities. Network comparison methods only compare certain attributes of the network, among them: density of edges, global clustering coefficient, degree distribution, counts of smaller graphs embedded in the network, and others. Here are some ideas and methods that have been used to compare networks.

  • Networks can be compared by their global properties and summary statistics, like network density, degree distribution, transitivity, average shortest path length and others. Usually a statistic combining the differences of several global properties is used.
  • Another way to compare networks is based on the fit of a statistical network model (e.g. an ERGM) to each network. The coefficients of the fitted models can then be compared. Note that in this case a good fit of the models is required.
  • Statistics built directly for network comparison via subgraph counts. These statistics do not make any assumptions about the network generation process. For example, Netdis and GCD are two network comparison statistics that try to measure the structural difference between networks. These measures are based on counts of small subgraphs (3-5 nodes), like triangles, 2-stars, 3-stars, squares, cliques and others (see the figure below). They create frequency vectors of subgraphs and then compare these frequencies between the networks to obtain an idea of the similarity of the networks relative to their subgraph counts (a rough sketch of this idea is given after this list).
  • Lastly, another way to indirectly compare networks is via network alignment methods. The objective of these methods is to create a "mapping", f, from the node set of one network to the node set of another. The following figure shows two networks, light and dark blue; an alignment of the two networks is shown in red. One of the objectives of an alignment is to maximise the number of conserved interactions: if (u,v) is an edge in the first network and f is an alignment, then the edge (u,v) is conserved if (f(u),f(v)) is an edge in the second network. It can be noted that the edge demarcated by the green oval is a non-conserved interaction.
    NETAL is a commonly used network alignment method, although there are several, more recent, network alignment methodologies.
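To give a flavour of the subgraph-count idea (and of counting conserved edges under an alignment), here is a rough networkx sketch. It only counts edges, open 2-stars and triangles and is nothing like the full Netdis or GCD statistics, so treat it purely as an illustration.

```python
import networkx as nx
import numpy as np

def subgraph_count_vector(G):
    """Counts of edges, open 2-stars (paths of length two) and triangles."""
    triangles = sum(nx.triangles(G).values()) // 3            # each triangle is seen from 3 nodes
    paths2 = sum(d * (d - 1) // 2 for _, d in G.degree())      # all paths of length two
    return np.array([G.number_of_edges(), paths2 - 3 * triangles, triangles], dtype=float)

def subgraph_distance(G1, G2):
    """Compare relative subgraph frequencies, so networks of different size can be contrasted."""
    f1, f2 = (v / v.sum() for v in (subgraph_count_vector(G1), subgraph_count_vector(G2)))
    return float(np.abs(f1 - f2).sum())

def conserved_edges(G1, G2, f):
    """Number of edges (u, v) of G1 whose images (f[u], f[v]) are also edges of G2."""
    return sum(1 for u, v in G1.edges() if G2.has_edge(f.get(u), f.get(v)))

G1 = nx.erdos_renyi_graph(100, 0.05, seed=1)
G2 = nx.barabasi_albert_graph(100, 3, seed=1)
print("subgraph-count distance:", subgraph_distance(G1, G2))
print("conserved edges under the identity mapping:", conserved_edges(G1, G2, {n: n for n in G1}))
```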

In the end, despite the variety of ways to compare networks, saying that two networks are similar or different is not that easy, as all methods face their own particular challenges, for example networks that come from the same model but have different numbers of nodes.

 

Diseases exploiting the Power of Networks

Networks are pretty amazing. If you have a bunch of nodes that sit there and do very simple tasks that you can train a dog to do, and then you connect them in the right way, you can create a programme that can get through a level of Mario! In neural networks, all the nodes are doing is getting an input, then deciding what works best based on that input, and sending that information somewhere else. It’s like teaching your dog to get help when you’re in trouble. You say “Help!”, it makes a decision where to go to find help, and then someone else is alerted to your situation. Now you link those inputs and outputs, and suddenly you can solve some very interesting problems. Like making your dog get another dog to do something, who tells another dog… okay… at some point the analogy may break down. What I want to get across, is that networks can take very simple things and create complexity by just adding a few connections between them.

Getting back to the biological theme, proteins aren’t really all that simple. From what you’ll have read on other posts in this blog there’s a lot of effort that goes into modelling them. The position of even a small peptide loop can have a big effect on how a protein functions. So if you have a network of 15 – 25 thousand proteins you can imagine that the complexity of that system is quite incredible. That system is the human body.

Looking at systems from the perspective of interactions between components creating complex behaviour, it’s really not that surprising that human diseases often function by disrupting the network of protein interactions as found by this paper recently published in Cell. Assuming there’s a human disease which has survived for generations and generations, it is likely to be quite effective in how it functions. And if you have a complex system like the human body, the easiest way to disrupt that arising behaviour called “a healthy life”, is to alter a few of the interactions. Showing this for a large group of disease-associated missense mutations is however pretty impressive and time-consuming, which is probably why there are about 50 people from over 20 different institutions in the author list.

So what exactly did they show? They showed what happens when there is a mutation in a gene that causes a disease. Such a mutation replaces an amino acid in the protein sequence encoded by the gene. The resulting protein is then somehow different. The group around Marc Vidal and Susan Lindquist showed that rather than affecting the stability of the protein, the system tends to be targeted via the interactions of the protein. What is more, they showed that different mutations of the same gene can affect different sets of interactions. Using their “edgotyping” approach it is possible to characterize the type of effect a mutation will have and possibly predict the change in phenotype (i.e. the disease state).

Edgotyping classifies the effect of disease associated mutations (Sahni et al)

Now if the possibility of predicting how a single mutation (1 in ~300 amino acids), in a single protein (1 in ~ 25,000 proteins) affects the human body doesn’t convince you that networks are pretty powerful, I really don’t know what will. But now back to the important stuff…. how do you make a real life neural network with communicating dogs? And could you make them reach level two on Mario? Or… Super Mario Dog?

Clustering Algorithms

Clustering is a task of organizing data into groups (called clusters), such that members of each group are more similar to each other than to members of other groups. This is a brief description of three popular clustering algorithms – K-Means, UPGMA and DBSCAN.

Cluster analysis

K-Means is arguably the simplest and most popular clustering algorithm. It takes one parameter – the expected number of clusters k. At the initialization step a set of k means m1, m2,…, mk is generated (hence the name). At each iteration step the objects in the data are assigned to the cluster whose mean yields the least within-cluster sum of squares. After the assignments, the means are updated to be centroids of the new clusters. The procedure is repeated until convergence (the convergence occurs when the means no longer change between iterations).

The main strength of K-means is that it’s simple and easy to implement. The largest drawback of the algorithm is that one needs to know in advance how many clusters the data contains. Another problem is that with wrong initialization the algorithm can easily converge to a local minimum, which may result in suboptimal partitioning of data.
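The whole loop fits in a few lines of NumPy. Here is a minimal sketch of the algorithm as described above (random initialisation, assignment, centroid update); note that empty clusters are not handled and real implementations such as scikit-learn's KMeans are more robust.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]        # initialise the k means from random points
    for _ in range(n_iter):
        # assignment step: each object joins the cluster whose mean is closest
        labels = np.argmin(((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1), axis=1)
        # update step: means become the centroids of the new clusters
        new_means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_means, means):                        # convergence: means no longer change
            break
        means = new_means
    return labels, means

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0.0, 5.0)])   # toy 2D data with two groups
labels, means = kmeans(X, k=2)
```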

UPGMA is a simple hierarchical clustering method, where the distance between two clusters is taken to be the average of distances between individual objects in the clusters. In each step the closest clusters are combined, until all objects are in clusters where the average distance between objects is lower than a specified cut-off.

The UPGMA algorithm is often used for the construction of phenetic trees. The major issue with the algorithm is that the tree it constructs is ultrametric, which means that the distance from the root to any leaf is the same. In the context of evolution, this means that the UPGMA algorithm assumes a constant rate of accumulation of mutations, an assumption which is often incorrect.
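In practice UPGMA is just average-linkage hierarchical clustering, so SciPy can do the job; a small sketch on toy data with an arbitrary distance cut-off:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(10, 2)) for c in (0.0, 3.0)])  # toy 2D data

d = pdist(X)                                            # pairwise distances between the objects
tree = linkage(d, method="average")                     # "average" linkage corresponds to UPGMA
clusters = fcluster(tree, t=1.0, criterion="distance")  # stop merging at a distance cut-off
```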

DBSCAN is a density-based algorithm which tries to separate the data into regions of high density, labelling points that lie in low-density areas as outliers. The algorithm takes two parameters, ε and minPts, and looks for points that are density-connected with respect to ε. A point p is said to be density-reachable from a point q if there are points between p and q such that one can traverse the path from q to p never moving further than ε in any step. Because the concept of density-reachability is not symmetric, the concept of density-connectivity is introduced: two points p and q are density-connected if there is a point o such that both p and q are density-reachable from o. A set of points is considered a cluster if all points in the set are mutually density-connected and the number of points in the set is equal to or greater than minPts. The points that cannot be put into any cluster are classified as noise.

a) Illustrates the concept of density-reachability.
b) Illustrates the concept of density-connectivity

The DBSCAN algorithm can efficiently detect clusters with non-globular shapes since it is sensitive to changes in density only. Because of that, the clustering reflects real structure present in the data. The problem with the algorithm is the choice of parameter ε which controls how large the difference in density needs to be to separate two clusters.
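A short scikit-learn example on a toy data set with two non-globular clusters; eps and min_samples play the roles of ε and minPts, and points labelled -1 are noise:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # two interleaved, non-globular clusters
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print("clusters found:", len(set(labels) - {-1}), "| noise points:", int((labels == -1).sum()))
```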

Investigating structural mechanisms with customised Natural Moves

This is a brief overview of my current work on a protocol for studying molecular mechanisms in biomolecules. It is based on Natural Move Monte Carlo (NMMC), a technique pioneered by one of my supervisors, Peter Minary.

NMMC allows the user to decompose a protein or nucleic acid structure into segments of atoms/residues. Each segment is moved collectively and treated as a single body. This gives rise to a sampling strategy that considers the structure as a fluid of segments and probes the different arrangements in a Monte Carlo fashion.
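As a toy illustration of that idea (and only of the idea: this is not NMMC, which also handles rotations, internal degrees of freedom and chain closure), here is a sketch in which one of a few user-defined segments is picked at random, moved as a single body, and the move is accepted with the Metropolis criterion under a placeholder energy function:

```python
import numpy as np

rng = np.random.default_rng(0)
coords = rng.normal(size=(60, 3))                           # toy "structure" of 60 atoms
segments = [range(0, 20), range(20, 40), range(40, 60)]     # user-defined decomposition into segments

def energy(x):
    # placeholder energy: harmonic restraint to the starting coordinates
    return float(((x - coords) ** 2).sum())

def propose(x):
    # pick one segment and translate all of its atoms together, as a single body
    new = x.copy()
    seg = segments[rng.integers(len(segments))]
    new[list(seg)] += rng.normal(scale=0.1, size=3)
    return new

kT = 1.0
x, E = coords.copy(), energy(coords)
for step in range(1000):
    x_new = propose(x)
    E_new = energy(x_new)
    if E_new <= E or rng.random() < np.exp(-(E_new - E) / kT):   # Metropolis acceptance
        x, E = x_new, E_new
```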

Traditionally, the initial decomposition would be the only one that is sampled. However, this decomposition might not always be optimal: critical degrees of freedom might have been missed out or chosen sub-optimally. Additionally, if we want to test the causality of a structural mechanism, it can be informative to perform NMMC simulations for a variety of decompositions. Below I show an example of how customised Natural Moves may be applied to a DNA system.


Investigating the effect of epigenetic marks on the structure of a DNA toy model


Figure 1. Three levels at which epigenetic activity takes place. Figure adapted from Charles C. Matouk and Philip A. Marsden, Circulation Research 2008;102:873-887.

Epigenetic marks on DNA nucleotides are involved in regulating gene expression (point 1 in Figure 1). We have a limited understanding of the underlying molecular mechanism of this process. Two mechanisms are thought to be involved: 1) direct recognition of the epigenetic mark by DNA-binding proteins and 2) indirect recognition of changes in the local DNA structure caused by the epigenetic mark. Using customised Natural Moves we are currently trying to gain insight into these mechanisms.

One type of epigenetic mark is 5-hydroxymethylation (5hm) of cytosines. Lercher et al. have recently solved a crystal structure of the Drew-Dickerson Dodecamer with two of these epigenetic marks attached. Given the right sequence context (e.g. CpG), they found that this epigenetic mark can form a hydrogen bond between two neighbouring bases on the same strand. This raises the question: can an intra-strand CpG hydrogen bond alter the DNA helical parameters? We used this as a toy model to test our technology.

Figure 2a

Figure 2b

Figure 2a shows the Drew-Dickerson Dodecamer schematically. Figure 2b shows the three sets of degrees of freedom that we used as test cases.

Top (11): Default sampling – Serves as a reference simulation.
Middle (01): Fixed 5hm torsions – By forcing the hydroxyl group of 5hmC towards the guanine we significantly increase the chance of the hydrogen bond forming.
Bottom (00): Collective movement of the neighbouring bases 5hmC and G – By grouping the two bases into a segment we aim to emulate the dampening effect that a hydrogen bond may have on their movements relative to each other.

By modifying these degrees of freedom (1 = active, 0 = inactive) we attempted to amplify any effects that the CpG hydrogen bond may have on the DNA structure.

Below you can see animations of the three test cases 11, 01, 00 during simulation, zoomed in on the 5hm modification and the neighbouring base pair:



It appeared that the default degrees of freedom were not sufficient to detect a change in the DNA structure when comparing simulations of unmodified and modified structures. The other two test cases with customised Natural Moves, however, showed alterations in some of the helical parameters.

Lercher et al. saw no differences in their modified and unmodified crystal structures. It seems that single 5hm epigenetic marks are not sufficient to significantly alter DNA structure. Rather, clusters of these modifications with accumulated structural effects may be required to cause significant changes in DNA helical parameters. CpG islands may be a promising candidate.

@samdemharter