
Isoform-resolved protein interaction networks and poker

Every time I talk about protein interaction networks, I put up the nice little figure below. This figure suggests that experiments are done to detect which proteins interact with each other, and lines are drawn between them to create a network. Sounds pretty easy, right? It does, and it probably should, because otherwise you wouldn’t listen past the words “Yeast Two-Hybrid”. However, sometimes it’s important to hear about how it’s not at all easy, how there are compromises made, and how those compromises impose some inherent limitations. This post is one of those things it is important to hear if you care about protein interaction networks (and if you don’t… they’re pretty cool, so don’t be a downer!).

Classical image showing the design of a protein interaction network

Schematic image showing the assembly of a human protein interaction network with ~25 000 interactions

So what’s wrong with that nice figure? Just to deal with the obvious: the colour scheme isn’t great… there’s a red protein and the interactions are also red… and then a yellow protein and red interactions aren’t that easy on the eyes. But you’re not here to judge my choice of colours (I hope… otherwise you’d be better off criticizing the graphs in this beast). You’re here to hear me rant about networks… much more fun ;). So here goes:

  1. Not all interactions come from the same experiments; different experiments probe different “types” of binding.
  2. Each experiment has experimental errors associated with it, so not all reported interactions are correct.
  3. The network is not complete (estimates put the true number of interactions somewhere between 150k and 600k, depending on which paper you read).
  4. People tend to investigate proteins that they know are associated with a disease, so more interactions are known for certain proteins and their neighbours, resulting in an “inspection bias”.
  5. That’s a stupid depiction of a network; it’s just a blue blob and you can’t see anything (hence termed a “ridiculogram” by Mark Newman).
  6. The arrow is wrong: you are not reporting interactions between proteins!

Points 1-5 should more or less make sense as they’re written. Point 6, however, sounds a little cryptic. They’re called PROTEIN interaction networks, so why would they not be interactions between proteins, you might ask… and you’d be right to ask, because it’s really quite annoying. The problem is one of isoforms. The gene-to-protein relation is not 1-to-1. After a gene is transcribed into RNA, that piece of RNA is cut up and rearranged into mRNA, which is then translated into a protein. This rearranging process can occur in different ways, so a single gene may encode different proteins, or, as they are called, different protein isoforms. Testing for interactions between isoforms is difficult, so what tends to be done is that people test for interactions of THE ONE isoform of a protein to rule them all (the “reference isoform”) and then report these interactions as interactions for the gene. Sneaky! What you end up seeing are interactions mostly tested between reference isoforms (or any that happened to be in the soup) and reported as interactions for the gene product.

So how much does it matter if we don’t know isoform interaction information? Are there even different interacting partners for different isoforms? Do they have different functions? Well… yes, they can have different interactions and different functions. As to whether they matter… according to Corominas et al that answer is also a resounding yes… or at least in Autism Spectrum Disorder (ASD) it is.

The paper is the result of a five-year study investigating isoform interactions and the effect that knowing them (versus not knowing them) has on predicting candidate ASD genes of interest. And seeing as a bunch of people spent a lot of time on this stuff, it was definitely worth a read. Corominas et al. found that in an ASD-related protein interaction network, there is a significant number of interactions that would not be found if only the reference isoform interactions were used. Furthermore, compared to a “high-quality” literature-curated network, the generated isoform-resolved ASD network added a lot of interactions. They then went on to show that these additional interactions played an important role in deciding which genes to prioritize as important “players in Autism”.

Should these results make us renounce the use of currently available non-isoform-resolved protein interaction networks lest we burn in the depths of bioinformatics hell? Well… probably not. While the paper is very interesting and shows the importance of isoforms, it does so in the context of Autism only. The paper itself states that ASD is a brain-related disease, and the brain is an environment known for many isoforms. In many cases, it will likely be the case that the “dominant isoform” is just that… dominant. Moreover, the results may sound a little stronger than they are. The literature-curated network used for comparison, to argue that the isoform-resolved network is really important, was quoted as being “high-quality”. It is likely that many of the isoform interactions would be included in lower-quality networks; they have simply not been as well studied as dominant isoforms. Thus, the isoform-resolved network would just confirm lower-quality interactions as high-quality ones. That being said, if you want to look at the specific mechanisms causing a phenotype, it is likely that isoform information will be necessary. It really depends on what you want to achieve.

Let’s say you’re playing Texas Hold’em poker and you have two pairs. You’d like to have that full house, but it’s elusive and you’re stuck with the hand you have when your opponent bids high. That’s the situation we were in with protein interaction networks: you know you’re missing something, but you don’t know how bad it is that you’re missing it. This paper addresses part of that problem. We now know that your opponent could have the flush, but possibly only if you’re in Vegas. If you only want to play in a local casino, you’ll likely be fine.

Molecular Diversity and Drug Discovery

For my second short project I have developed Theox, a piece of molecular diversity software, to aid the selection of synthetically viable molecules from a subset of diverse molecules. The selection of molecules for synthesis is currently based on synthetic intuition. The developed software indicates whether the selection is an adequate representation of the initial dataset, or whether molecular diversity has been compromised. Theox plots the distribution of diversity indices for 10,000 randomly generated subsets of the same size as the chosen subset. The diversity index of the chosen subset can then be compared to this distribution, to determine whether the molecular diversity of the chosen subset is sufficient. The figure shows the distribution of the Tanimoto diversity indices, with the diversity index of the chosen subset of molecules shown in green.
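Theox itself isn’t public, but the resampling idea behind it is simple enough to sketch. The snippet below is an illustrative sketch only, not Theox’s actual code: fingerprints are represented as hypothetical sets of “on” bits, and the null distribution of diversity indices is built from randomly drawn subsets of the same size as the chosen one.

```python
import random
from itertools import combinations

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints stored as sets of 'on' bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def diversity_index(fps):
    """Mean pairwise Tanimoto *distance* (1 - similarity) over a subset."""
    pairs = list(combinations(fps, 2))
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)

def null_distribution(all_fps, subset_size, n_samples=10_000, seed=0):
    """Diversity indices of randomly drawn subsets, forming the null
    distribution against which the chosen subset's index is compared."""
    rng = random.Random(seed)
    return [diversity_index(rng.sample(all_fps, subset_size))
            for _ in range(n_samples)]
```

If the chosen subset’s index sits comfortably within (or above) this distribution, its diversity is an adequate representation of the dataset; a value far in the lower tail suggests diversity has been compromised.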

A designed conformational shift to control protein binding specificity

header

Proteins and their binding partners with complementary shapes form complexes. Fischer was onto something when he introduced the “lock and key” mechanism in 1894. For the first time, the shape of molecules was considered crucial for molecular recognition. Since then there have been various attempts to improve upon this theory in order to incorporate the structural diversity of proteins and their binding partners.

Koshland proposed the “induced fit” mechanism in 1958, which states that interacting partners undergo local conformational changes upon complex formation to strengthen binding. An additional mechanism, “conformational selection”, was introduced by Monod, Wyman and Changeux, who argued that the conformational change occurs before binding, driven by the inherent conformational heterogeneity of proteins. Given a protein that fluctuates between two states A and B, and a substrate C that only interacts with one of these states, the chance of complex formation depends on the probability of the protein being in state A or B. Furthermore, one could imagine a scenario where a protein has multiple binding partners, each binding to a different conformational state. This means that some proteins exist in an equilibrium of different structural states, and this equilibrium determines the prevalence of interactions with different binding partners.
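To make that concrete: if states A and B differ by a free energy ΔG, their equilibrium populations follow directly from the Boltzmann distribution. A minimal sketch (units in kJ/mol and the temperature are illustrative assumptions, not values from the paper):

```python
import math

def populations(dG_AB_kJ, T=298.0):
    """Equilibrium populations of two states A and B separated by a
    free-energy difference dG_AB = G_B - G_A (kJ/mol), via the
    Boltzmann distribution: K = [B]/[A] = exp(-dG_AB / RT)."""
    R = 8.314e-3            # gas constant in kJ/(mol*K)
    K = math.exp(-dG_AB_kJ / (R * T))
    p_B = K / (1.0 + K)
    return 1.0 - p_B, p_B   # (p_A, p_B)
```

With ΔG = 0 the two states are equally populated (the ubiquitin situation below); a difference of only ~5.7 kJ/mol already shifts the equilibrium to roughly 90/10, which is why small point mutations can re-weight binding-competent states so effectively.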

Figure1

Figure 1. The “pincer mode”.

Based on this observation, Michielssens et al. used various in silico methods to manipulate the populations of binding-competent states of ubiquitin in order to change its protein binding behaviour. Ubiquitin is known to adopt two equally visited states, closed and open, along the “pincer mode” (the collective motion described by the first PCA eigenvector).

Figure2

Figure 2. A schematic of the conformational equilibrium of ubiquitin, which can take on a closed or an open state. Depending on its conformation, it can bind different substrates.

Different binding partners prefer either the closed, open or both states. By introducing point mutations in the core of ubiquitin, away from the binding interface, Michielssens et al. managed to shift the conformational equilibrium between open and closed states, thereby changing binding specificity.

Point mutations were selected according to the following criteria:

⁃ residues must be located in the hydrophobic core
⁃ binding interface must be unchanged by the mutation
⁃ only hydrophobic residues may be introduced (as well as serine/threonine)
⁃ glycine and tryptophan were excluded because of their size

Figure3

Figure 3. Conformational preference of ubiquitin mutants. ddG_mut = dG_open – dG_closed.

Fast growth thermodynamic integration (FGTI), a method based on molecular dynamics, was used to calculate the relative de-/stabilisation of the open/closed state caused by each mutation. Interestingly, most mutations that stabilised the open state were concentrated on one residue, isoleucine 36.
For the 15 most significant mutations, a complete free energy profile was calculated using umbrella sampling.
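The “fast growth” idea is to run many short non-equilibrium transformations and combine the resulting work values via the Jarzynski equality, dG = −kT ln⟨exp(−W/kT)⟩. Below is a minimal sketch of that estimator (not the authors’ code; work values are assumed to be in kJ/mol):

```python
import math

def jarzynski_dG(work_values_kJ, T=298.0):
    """Free-energy estimate from nonequilibrium work values via the
    Jarzynski equality: dG = -RT * ln< exp(-W/RT) >."""
    RT = 8.314e-3 * T                      # kJ/mol
    w_min = min(work_values_kJ)            # shift for numerical stability
    avg = sum(math.exp(-(w - w_min) / RT)
              for w in work_values_kJ) / len(work_values_kJ)
    return w_min - RT * math.log(avg)
```

The exponential average weights low-work trajectories heavily, which is why many short “growth” runs per mutation are needed for a converged estimate.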

Figure4

Figure 4. Free energy profiles for six different ubiquitin mutants, calculated using umbrella sampling simulations. Mutants preferring the closed substate are shown in red, open substate stabilizing mutants are depicted in blue, those without a preference are shown in gray.

Figure5

Figure 5. Prediction of binding free energy differences between wild-type ubiquitin and different point mutants (ddG_binding = dG_binding,mutant – dG_binding,wild-type).

To further validate that they had correctly categorised their mutants as stabilising the open or the closed state, six X-ray structures of ubiquitin in complex with a binding partner that prefers either the open or the closed state were simulated with each of the mutations. Figure 5 shows the change in binding free energy caused by the mutation in compatible, neutral and incompatible complexes (“compatible” refers, for example, to an open-favouring mutation (blue) in an open complex (blue), and vice versa).

Figure6

Figure 6. Comparison of change in binding free energy predicted from the calculated results for ubiquitin and the experimental result.

In their last step, a selection of open and closed mutations was introduced into an open complex and the change in binding free energy was compared between experiment (NMR) and their simulations. For this example, the mutants behaved as expected: an increase in binding free energy was observed when the incompatible closed mutations were introduced into the open complex, while only subtle changes were seen when the compatible open mutations were introduced.

The authors suggest that in the future this computational protocol may be a cornerstone of designing allosteric switches. However, given that the approach requires pre-existing knowledge and is tailored to proteins with well-defined conformational states, it may take some time until we discover its full potential.

Molecular Dynamics of Antibody CDRs

Efficient design of antibodies as therapeutic agents requires understanding of their structure and behaviour in solution. I have recently performed molecular dynamics simulations to investigate the flexibility and solution dynamics of complementarity determining regions (CDRs). Eight structures of the Fv region of antibody SPE7, all with identical sequences, were found in the Protein Data Bank. Twenty-five replicas of 100 ns simulations were performed on the Fv region of one of these structures to investigate whether the CDRs adopted the conformation of one of the other X-ray structures. The simulations showed the H3 and L3 loops start in one conformation and adopt another experimentally determined conformation.
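The standard way to test whether a simulated loop has reached another crystal conformation is the RMSD after optimal rigid-body superposition (the Kabsch algorithm). A minimal sketch, assuming the backbone coordinates are available as (N, 3) NumPy arrays (this is the generic algorithm, not the exact analysis pipeline used here):

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal
    superposition of P onto Q via the Kabsch algorithm."""
    P = P - P.mean(axis=0)                  # remove translation
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                             # 3x3 covariance matrix
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_rot = P @ R.T                         # apply optimal rotation
    return float(np.sqrt(((P_rot - Q) ** 2).sum() / len(P)))
```

Tracking this RMSD of each CDR loop against every other crystal structure over the trajectory shows when a replica leaves its starting conformation and approaches an alternative experimentally determined one.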

This confirms the potential of molecular dynamics to be used to investigate antibody binding and flexibility. Further investigation would involve simulating different systems, for example using solution NMR resolved structures, and comparing the conformations deduced here to the canonical forms of CDR loops. Looking forward it is hoped molecular dynamics could be used to predict the bound conformation of an antibody from the unbound structure.

Click here for simulation videos.

Using B factors to assess flexibility

In my work analysing antibody loops I have reached the point where I am interested in flexibility, more specifically in challenging the somewhat popular belief that these loops, especially the H3 loop, are highly flexible. For this I wanted to use the B/temperature/Debye-Waller factor, which can be interpreted as a measure of the temperature-dependent vibration of the atoms in the crystal, or, in gentler terms, the flexibility at a certain position. I was keen to use the backbone atoms, and possibly the Cβ, but unfortunately the B factor shows some biases, as it is used to mask other uncertainties arising from low resolution, poor electron density and, as a result, poor modelling. If we take a non-redundant set of loops and split them into resolution shells of 0.2 Å, we see how pronounced this bias is (Fig. 1(a)).

b_factor_vs_res

Fig. 1(a) Comparison of average backbone B factors for loops found in structures at increasing resolution. A clear bias can be observed that correlates with the increase in resolution.

norm_b_factor_vs_res

Fig. 1(b) Normalization using the Z-score of the B factor of backbone atoms shows no bias across resolution shells.

Comparing loops in different resolution shells is therefore virtually uninformative, and can lead to misleading results. In one analysis it appeared that loops directly present in the binding site of antibodies have a higher average B factor than loops in structures without antigen, where the movement is less constrained.

The issue here is that a complex structure (antibody-antigen) is larger and typically solved at poorer resolution, and therefore has more biased B factors. To solve this issue I decided to normalize the B factors using the Z-score within each PDB file, where the mean and the standard deviation are computed from all the backbone atoms of amino acids in the file. To my knowledge this method was first described by Parthasarathy and Murthy (1997) [1], although I came to it without reading their paper, the normalization being quite intuitive. Using this measure we can finally compare loops from different structures at different resolutions (Fig. 1(b)), and we see what is expected: loops found in bound structures are less flexible than loops in unbound structures (Fig. 2). We can also answer our original question: does the H3 loop present an increased flexibility? The answer from Fig. 2 is no, if we compare non-redundant sets of loops from antibodies and general proteins.
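The normalization itself is just a per-structure Z-score over the backbone B factors; a minimal sketch of the idea described above:

```python
def normalise_b_factors(b_factors):
    """Z-score-normalise one structure's backbone B factors so that loops
    from structures solved at different resolutions can be compared:
    z_i = (B_i - mean(B)) / std(B), with mean/std taken over all
    backbone atoms of the same PDB file."""
    n = len(b_factors)
    mean = sum(b_factors) / n
    std = (sum((b - mean) ** 2 for b in b_factors) / n) ** 0.5
    return [(b - mean) / std for b in b_factors]
```

Because mean and standard deviation are computed within each file, the resolution-dependent offset cancels out, leaving only the relative flexibility of each position within its own structure.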

norm_b_factor_plot_heavy

Fig. 2 Flexibility comparison using the normalized B factor between a non-redundant set of non-Ig-like protein loops and different sets of H3 loops: bound to antigen (H3 bound), unbound (H3 unbound), both (H3). For each comparison, ten samples with the same number of examples and a similar length distribution were generated and amassed (LMS) to correct for a possible length bias, since H3 loops are known to be longer than average.

References

[1] Parthasarathy, S.; Murthy, M. R. N. (1997) Analysis of temperature factor distribution in high-resolution protein structures. Protein Science, 6(12), pp. 2561-2567. ISSN 0961-8368.

Protein loops – why do we care?

In my DPhil research, I work on the development of new methods for predicting protein loop structures. But what exactly are loops, and why should we care about their structures?

Many residues in a given protein will form regions of regular structure, in α-helices and β-sheets. The segments of the protein that join these secondary structure elements together, and that have no easily observable regular patterns in their structure, are referred to as loops. This does not mean, though, that loops are only a minor component of a protein structure – on average, half of the residues in a protein are found in loops [1], and they are typically found on the surface of the protein, which is largely responsible for its shape, dynamics and physicochemical properties [2].

Connecting different secondary structures together is often not the only purpose of a loop – they are often vitally important to a protein’s function. For example, they are known to play a role in protein-protein interactions, recognition sites, signalling cascades, ligand binding, DNA binding, and enzyme catalysis [3].

As regular readers of the blog are probably aware by now, one of the main areas of research for our group is antibodies. Loops are vital for an antibody’s function, since its ability to bind to an antigen is mainly determined by six hypervariable loops (the complementarity determining regions). The huge diversity in structure displayed by these loops is the key to how antibodies can bind to such different substances. Knowledge of loop structures is therefore extremely useful, enabling predictions to be made about the protein.

Loops involved in protein function: a methyltransferase binding to DNA (top left, PDB 1MHT); the active site of a triosephosphate isomerase enzyme (bottom left, PDB 1NEY); an antibody binding to its antigen (blue, surface representation) via its complementarity determining regions, shown as the coloured loops (centre, PDB 3NPS); the activation loop of a tyrosine kinase has a different conformation in the active (pink) and inactive (blue) forms (top right, PDBs 1IRK and 1IR3); a zinc finger, where the zinc ion is coordinated by the sidechain atoms of a loop (bottom right, PDB 4YH8).


More insertions, deletions and substitutions occur in loops than in the more conserved α-helices and β-sheets [4]. This means that, for a homologous set of proteins, the loop regions are the parts that vary the most between structures. While this variability often makes the protein’s function possible, as in the case of antibodies, it leads to unaligned regions in a sequence alignment, so standard homology modelling techniques cannot be used. This makes prediction of their structure difficult: it is frequently the loop regions that are the least accurate parts of a protein model.

There are two types of loop modelling algorithm: knowledge-based and ab initio. Knowledge-based methods look for appropriate loop structures in a database of previously observed fragments, while ab initio methods generate possible loop structures without prior knowledge. There is some debate about which approach is best. Knowledge-based methods can be very accurate when the target loop is close in structure to one seen before, but perform poorly when this is not the case; ab initio methods are able to access regions of the conformational space that have not been seen before, but fail to take advantage of any structural data that is available. For this reason, we are currently working on developing a new method that combines aspects of the two approaches, allowing us to take advantage of the available structural data whilst still being able to predict novel structures.
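As a toy illustration of the knowledge-based step (the field names here are hypothetical, and real methods match on much richer geometric criteria than this), a fragment database can be filtered by loop length and by how far apart the loop’s anchor residues sit:

```python
def candidate_fragments(db, loop_length, anchor_distance, tol=0.5):
    """Toy knowledge-based lookup: keep database fragments whose length
    matches the target loop and whose anchor C-alpha separation is within
    `tol` Angstroms of the target's. `db` is a list of dicts with the
    (hypothetical) keys 'length' and 'anchor_distance'."""
    return [f for f in db
            if f["length"] == loop_length
            and abs(f["anchor_distance"] - anchor_distance) <= tol]
```

A combined method can use such a database hit when one exists and fall back to (or refine with) ab initio sampling when the candidate list comes back empty.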

[1] L. Regad, J. Martin, G. Nuel and A. Camproux, Mining protein loops using a structural alphabet and statistical exceptionality. BMC Bioinformatics, 2010, 11, 75.

[2] A. Fiser and A. Sali, ModLoop: automated modeling of loops in protein structures. Bioinformatics, 2003, 19, 2500-2501.

[3] J. Espadaler, E. Querol, F. X. Aviles and B. Oliva, Identification of function-associated loop motifs and application to protein function prediction. Bioinformatics, 2006, 22, 2237-2243.

[4] A. R. Panchenko and T. Madej, Structural similarity of loops in protein families: toward the understanding of protein evolution. BMC Evolutionary Biology, 2005, 5, 10.

Antibody binding site re-design

In this blog post I describe three successful studies on structure-based re-design of antibody binding sites, leading to significant improvements in binding affinity.

In their study, Clark et al. [1] re-designed the binding site of antibody AQC2 to improve its binding affinity to the I domain of human integrin VLA1. The authors assessed the effects of the mutations on the binding energy using the CHARMM [2,3] potential, with the electrostatic and desolvation energies calculated using the ICE software [4]. In total, 83 variants were identified for experimental validation, some of which included multiple mutations. The mutated antibodies were expressed in E. coli and their affinity to the antigen was measured. The best mutant included a total of four mutations, which improved the affinity by approximately one order of magnitude, from 7 nM to 850 pM. The crystal structure of the best mutant was solved to further study its interaction with the target.

Figure 1: Comparison of calculated and experimental binding free energies. (Lippow et al., 2007)


Lippow et al. [5] studied the interactions of three antibodies – the anti-epidermal growth factor receptor drug cetuximab [6], the anti-lysozyme antibody D44.1 and the anti-lysozyme antibody D1.3 – with their respective antigens. The energy calculations favoured mutations to large amino acids (such as Phe or Trp), most of which were found to be false positives. More accurate results were obtained using only the electrostatic term of the energy function. The authors improved the binding affinity of D44.1 by one order of magnitude and the affinity of cetuximab by two orders of magnitude. The antibody D1.3 didn’t show many opportunities for electrostatic improvement, and the authors suggest it might be an anomalous antibody.

Computational methods have recently been used to successfully introduce non-canonical amino acids (NCAAs) into the antibody binding site. Xu et al. [7] introduced L-DOPA (L-3,4-dihydroxyphenylalanine) into the CDRs of the anti-protective antigen scFv antibody M18 to crosslink it with its native antigen. The authors used the program Rosetta 3.4 to create models of the antibody-antigen complex with L-DOPA residues. The distance between L-DOPA and a lysine nucleophile was used as a predictor of crosslinking. The crosslinking efficiency was quantified as the fraction of antibodies that underwent a mass change, measured using Western blot assays. The measured average efficiency of the mutants was 10%, with a maximum efficiency of 52%.
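That distance criterion is easy to sketch: measure the separation between the modelled L-DOPA position and the candidate lysine nucleophile, and flag pairs below a cutoff (the 8 Å cutoff below is an illustrative assumption, not the value used by Xu et al.):

```python
import math

def predict_crosslink(dopa_xyz, lysine_nz_xyz, cutoff=8.0):
    """Flag a candidate crosslink when the modelled L-DOPA site and the
    lysine NZ nucleophile fall within `cutoff` Angstroms of each other.
    Coordinates are plain (x, y, z) tuples; the cutoff is illustrative."""
    return math.dist(dopa_xyz, lysine_nz_xyz) <= cutoff
```

Ranking designs by this distance across an ensemble of Rosetta models gives a simple geometric prior on which mutants are worth taking to the Western blot.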

[1]      Clark LA, Boriack-Sjodin PA, Eldredge J, Fitch C, Friedman B, Hanf KJM, et al. Affinity enhancement of an in vivo matured therapeutic antibody using structure-based computational design. Protein Sci 2006;15:949–60. doi:10.1110/ps.052030506.

[2]      Brooks BR, Bruccoleri RE, Olafson BD, States DJ, Swaminathan S, Karplus M. CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics Calculations. J Comput Chem 1983;4:187–217.

[3]      MacKerel Jr. AD, Brooks III CL, Nilsson L, Roux B, Won Y, Karplus M. CHARMM: The Energy Function and Its Parameterization with an Overview of the Program. In: v. R. Schleyer et al. P, editor. vol. 1, John Wiley & Sons: Chichester; 1998, p. 271–7.

[4]      Kangas E, Tidor B. Optimizing electrostatic affinity in ligand–receptor binding: Theory, computation, and ligand properties. J Chem Phys 1998;109:7522. doi:10.1063/1.477375.

[5]      Lippow SM, Wittrup KD, Tidor B. Computational design of antibody-affinity improvement beyond in vivo maturation. Nat Biotechnol 2007;25:1171–6. doi:10.1038/nbt1336.

[6]      Sato JD, Kawamoto T, Le AD, Mendelsohn J, Polikoff J, Sato GH. Biological effects in vitro of monoclonal antibodies to human epidermal growth factor receptors. Mol Biol Med 1983;1:511–29.

[7]      Xu J, Tack D, Hughes RA, Ellington AD, Gray JJ. Structure-based non-canonical amino acid design to covalently crosslink an antibody-antigen complex. J Struct Biol 2014;185:215–22. doi:10.1016/j.jsb.2013.05.003.

GO annotations: Warning – secondary IDs!

I’m writing this post after a bit of a scare that has cost me at least 2 hours of my life so far, and possibly may cost me a lot more. This is something I wish I had read about before starting to use gene ontology annotations (GO terms), so hopefully this will reach some people who are in the position I was in a couple of years ago. The post assumes you know about the GO directed acyclic graph (DAG), which describes the relationships between GO terms, and a bit about functional similarity. Good reviews on this were written by Pesquita et al. (2009) and Guzzi et al. (2012).

Secondary GO-terms

Yesterday evening I discovered that GO terms can have secondary accession numbers. For example, GO:0060555 is a secondary ID for GO:0070266, which describes the biological process term “necroptotic process”. The existence of secondary IDs is in itself not particularly surprising, given that ontologies like the Gene Ontology are at the forefront of research and thus have to be updated with the latest knowledge. As is the case for the UniProtKB and RefSeq databases, identifiers are merged if they are found to refer to the same thing. RefSeq has a good overview of when and why this occurs in their FAQ. Keeping a record of secondary IDs in the database is important for compatibility with past applications of the ontology.

Why is this a problem?

This can become a problem when that backward compatibility is not fully committed to: for example, if the downloadable annotation set for human proteins contains GO terms that are secondary IDs, while the downloadable ontology structure does not include these secondary IDs in the directed acyclic graph (DAG). Someone may then download all the annotations and look up their parent terms in the DAG, but as these terms are not included in the DAG, they have no parent term set.

So why the scare?

It seems like this behaviour should become obvious very fast. It should either give a key error that you are looking for something that isn’t there, or an empty set should ring some alarm bells. The scare comes in when neither happens, and you just don’t notice it at all, only to find out a couple of years later…

In order to calculate the functional similarity between two proteins, I need the full GO term set associated with every protein, including ancestral terms. I have a local copy of the ontology structure which I can query for these ancestral terms, based on the GO terms directly associated with the proteins. As I need to do this for between 6,800 and 13,000 proteins depending on the network, I automated the process. The problem is that MySQL returns an empty set when asked for a match on an accession number that isn’t there. Returning an empty set has the effect that terms not in the ontology structure are silently filtered out. This came to light now that I’m looking to explicitly calculate the information content of each GO term (a metric for how prevalent a term is in a data set) and three terms could not be found in the ancestral term sets of any protein.

How do I prevent this from happening to me?

As there are only a few secondary ID terms associated with proteins in the downloadable association file, it is easy to manually look up mappings to the respective primary IDs in the EBI QuickGO tool. Then, before querying for ancestral term sets, just map the secondary IDs across to the newer primary IDs and you’re all set. It’s really not a hassle if you know it’s there. Sadly, there is only a sentence on the Gene Ontology site about secondary IDs, and no mention whatsoever that these IDs are not incorporated in the GO-basic ontology download, which is the one recommended for finding ancestral term sets.
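The fix is a one-line mapping pass before the ancestor query. A minimal sketch, assuming you have filled a plain dict with the secondary-to-primary pairs looked up in QuickGO:

```python
def map_to_primary(go_terms, secondary_to_primary):
    """Replace secondary GO IDs with their primary IDs before querying the
    ontology DAG for ancestral terms. IDs without an entry in the mapping
    are assumed to already be primary and are passed through unchanged."""
    return {secondary_to_primary.get(term, term) for term in go_terms}
```

Run this over each protein’s directly annotated term set and every ID you subsequently query will actually exist in the GO-basic DAG, so no terms get silently dropped.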

Are you okay now?

As I hinted at before, there are very few proteins annotated with secondary ID terms. In my case I found three secondary ID GO terms annotated to five different proteins, out of a total of 6,861 proteins I was checking. One protein was also annotated with the corresponding primary ID term, so there was no effect, and another was annotated with the only parent term of the secondary ID, which no other proteins were annotated to. Thus, the effect really boils down to three proteins with the same secondary ID. Of those three, only one is heavily affected in the worst case, its similarity scores with 10 other proteins changing from 9.3 to 2.6 without the primary ID annotation (scores range from ~1.5 to 12). Those are 10 scores out of a total of approximately 24,000,000… I think I will survive. But I have yet to test this on a further data set.

The effect is basically that you ignore a couple of annotations. Given that we do not have the full set of annotations anyway, this change is the same as if an experiment or two hadn’t been conducted. Furthermore, the three proteins I found to be affected had a lot of annotations, so missing one should not have the worst-case effect I investigated. Nonetheless, you may want to do it properly the first time round and take all the data into account when you start out. It should save you a lot of hassle later.

Stay vigilant, my friends!

Ways to compare networks

Nowadays network comparison is becoming increasingly relevant. Why is this? Mainly because it is a desirable way to compare complex systems that can often be represented as networks.

Network comparison aims to account for properties that are generated by the simultaneous interaction of all units rather than the properties of each single individual. Here are some cases where network comparison could be helpful:

– Showing and highlighting “significant” changes in network evolution. A particular characteristic of interest could be the speed with which information flows.

– Quantifying how “close” two networks are, even when the networks have a different number of nodes and edges, or, in the case of spatially embedded networks, a different scale.

As an example, look at the following two networks. Are the structures of these two road networks similar?

Screen Shot 2015-06-15 at 14.56.21

(Images obtained from the Mac Maps app)

Or what about the structure of these two other networks?
two_geometric

One of the difficulties in comparing networks is that there is no clear way to compare them as complete, whole entities. Network comparison methods compare only certain attributes of the network, among them: edge density, global clustering coefficient, degree distribution, counts of smaller graphs embedded in the network, and others. Here are some ideas and methods that have been used to compare networks.

  • Networks can be compared by their global properties and summary statistics, like network density, degree distribution, transitivity, average shortest path length and others. Usually a statistic combining the differences of several global properties is used.
  • Another way to compare networks is based on the fit of a statistical network model (e.g. an ERGM) to each network. Then the coefficients of the fitted models can be compared. Note that in this case a good fit of the models is required.
  • Statistics built directly for network comparison via subgraph counts. These statistics make no assumptions about the network generation process. For example, Netdis and GCD are two network comparison statistics that try to measure the structural difference between networks. These measures are based on counts of small subgraphs (3-5 nodes), like triangles, 2-stars, 3-stars, squares, cliques and others (see Figure below). [Image: small subgraphs on 3-5 nodes] These network comparison statistics create frequency vectors of subgraph counts and then compare these frequencies between the networks to get an idea of how similar the networks are with respect to their subgraph counts.
  • Lastly, another way to indirectly compare networks is via network alignment methods. The objective of these methods is to create a “mapping”, f, from the node set of one network to the node set of another. The following Figure shows two networks, light and dark blue; an alignment of the two networks is shown in red. [Image: alignment of two networks] One of the objectives of an alignment is to maximise the number of conserved interactions: if (u,v) is an edge in the first network and f is an alignment, then the edge (u,v) is conserved if (f(u),f(v)) is an edge in the second network. Note that the edge demarcated by the green oval is a non-conserved interaction.
    NETAL is a commonly used network alignment method, although there are several more recent network alignment methodologies.
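To make the idea of conserved interactions concrete, here is a minimal sketch in Python using networkx. The two small graphs and the mapping f are made up for this example; real alignment methods like NETAL have to search for a good f rather than being handed one.

```python
# Toy sketch: counting conserved interactions under a given alignment.
import networkx as nx

G1 = nx.Graph([(1, 2), (2, 3), (3, 4)])
G2 = nx.Graph([("a", "b"), ("b", "c"), ("c", "d")])

# An alignment: a mapping from the node set of G1 to the node set of G2.
f = {1: "a", 2: "b", 3: "c", 4: "d"}

# An edge (u, v) of G1 is conserved if (f(u), f(v)) is an edge of G2.
conserved = [(u, v) for u, v in G1.edges() if G2.has_edge(f[u], f[v])]
print(len(conserved))  # → 3, every edge of G1 maps onto an edge of G2
```

An alignment method would try many candidate mappings and keep the one that maximises this count (possibly alongside other criteria, such as sequence similarity for protein networks).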

In the end, despite the variety of ways to compare networks, saying that two networks are similar or different is not that easy, as all methods face their own particular challenges, such as comparing networks that come from the same model but have different numbers of nodes.
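The summary-statistic and subgraph-count ideas above can be sketched in a few lines of Python with networkx. The two random graphs here are toy examples, and I only count triangles; methods like Netdis and GCD compare full frequency vectors over all 3-5-node subgraphs.

```python
# Minimal sketch: comparing two networks by global summary statistics
# and a small-subgraph count (triangles). Toy graphs for illustration.
import networkx as nx

G1 = nx.erdos_renyi_graph(50, 0.1, seed=1)
G2 = nx.erdos_renyi_graph(50, 0.3, seed=2)

def summary(G):
    return {
        "density": nx.density(G),
        "transitivity": nx.transitivity(G),
        # nx.triangles counts per node; each triangle is counted 3 times.
        "triangles": sum(nx.triangles(G).values()) // 3,
    }

s1, s2 = summary(G1), summary(G2)
for key in s1:
    print(key, round(s1[key], 3), "vs", round(s2[key], 3))
```

The denser graph should come out with higher density and many more triangles; a comparison statistic would combine such differences into a single distance between the two networks.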

 

Diseases exploiting the Power of Networks

Networks are pretty amazing. If you have a bunch of nodes that sit there and do very simple tasks that you could train a dog to do, and you connect them in the right way, you can create a programme that can get through a level of Mario! In neural networks, all each node does is receive an input, decide what works best based on that input, and send that information somewhere else. It’s like teaching your dog to get help when you’re in trouble. You say “Help!”, it decides where to go to find help, and then someone else is alerted to your situation. Now you link those inputs and outputs, and suddenly you can solve some very interesting problems. Like making your dog get another dog to do something, who tells another dog… okay… at some point the analogy may break down. What I want to get across is that networks can take very simple things and create complexity just by adding a few connections between them.

Getting back to the biological theme, proteins aren’t really all that simple. From other posts on this blog you’ll have seen that a lot of effort goes into modelling them. The position of even a small peptide loop can have a big effect on how a protein functions. So if you have a network of 15-25 thousand proteins, you can imagine that the complexity of that system is quite incredible. That system is the human body.

Looking at systems from this perspective, where interactions between components create complex behaviour, it’s really not that surprising that human diseases often function by disrupting the network of protein interactions, as found by this paper recently published in Cell. Assuming a human disease has survived for generations and generations, it is likely to be quite effective in how it functions. And if you have a complex system like the human body, the easiest way to disrupt that emergent behaviour called “a healthy life” is to alter a few of the interactions. Showing this for a large group of disease-associated missense mutations is, however, pretty impressive and time-consuming, which is probably why there are about 50 people from over 20 different institutions in the author list.

So what exactly did they show? They showed what happens when there is a mutation in a gene that causes a disease. Such a mutation replaces an amino acid in the protein sequence encoded by the gene. The resulting protein is then somehow different. The group around Marc Vidal and Susan Lindquist showed that rather than affecting the stability of the protein, the system tends to be targeted via the interactions of the protein. What is more, they showed that different mutations of the same gene can affect different sets of interactions. Using their “edgotyping” approach it is possible to characterize the type of effect a mutation will have and possibly predict the change in phenotype (i.e. the disease state).

Edgotyping classifies the effect of disease associated mutations (Sahni et al)

Now if the possibility of predicting how a single mutation (1 in ~300 amino acids), in a single protein (1 in ~25,000 proteins), affects the human body doesn’t convince you that networks are pretty powerful, I really don’t know what will. But now back to the important stuff… how do you make a real-life neural network with communicating dogs? And could you make them reach level two on Mario? Or… Super Mario Dog?