
Journal Club: Quantification and functional analysis of modular protein evolution in a dense phylogenetic tree

For journal club this week I decided to look at a paper by Moore et al. on the modular evolution of proteins.

Modular evolution, or the rearrangement of the domain architecture of a protein, is one of the key drivers behind functional diversification in the protein universe. In my talk I used the example of the multi-domain protein Peptidase T, which contains a catalytic domain homologous to Carboxypeptidase A, a zinc-dependent protease. The additional domain in Peptidase T induces the formation of a dimer, which restricts the space around the active site and so affects the specificity of the enzyme.


The multi-domain protein Peptidase T in a dimer (taken from Bashton and Chothia 2007). The active site is circled in green. Carboxypeptidase A is made up of a single domain homologous to the catalytic domain (in blue) of Peptidase T.

 
I took this case study from a really interesting paper, The generation of new protein functions by the combination of domains (Bashton and Chothia, 2007), which explores several other comparisons between the functions of multi-domain proteins and their single domain homologues.

What this paper does not address, however, is the directionality of such domain reorganisations: in all these examples, it is not clear whether the multi-domain organisation evolved from the single domain enzyme or vice versa. This brings me back to the paper I was presenting, which attempts a reconstruction of ancestral domain arrangements followed by a categorisation of rearrangement events.

Essentially, given a phylogenetic tree of 20 closely related pancrustacean species, the paper takes the different domain arrangements on the genomes (step 1), assigns the presence or absence of each arrangement at interior nodes on the tree (step 2), and then assigns each gained arrangement to one of four possible rearrangement events (step 3).

1. Domain Annotation
The authors use different methods to annotate domains on the genomes. They conclude that the most effective methodology is to use the clan level of Pfam-A (high quality family alignments built from manually chosen seed sequences), where families with suspected homologies are joined together… similar to our beloved superfamily classification. Moreover, they collapse any consecutive stretches of identical domains into one “pseudo-domain”, eliminating the effect of the domain repeat number on an arrangement’s definition.
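As a minimal sketch of that collapsing step (my own toy version in Python, not the authors’ code), consecutive runs of the same domain can be merged with itertools.groupby:

```python
from itertools import groupby

def collapse_repeats(arrangement):
    """Collapse consecutive runs of the same domain into one pseudo-domain,
    so the repeat number no longer affects an arrangement's identity."""
    return tuple(domain for domain, _run in groupby(arrangement))

# (A, A, A, B, A) and (A, B, A) now map to the same arrangement
assert collapse_repeats(("A", "A", "A", "B", "A")) == ("A", "B", "A")
```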

2. Ancestral State Reconstruction
The ancestral state reconstruction of each domain arrangement (its presence/absence at each internal node of the tree) is the result of a two-pass sweep across the tree: the first from leaves to root, and the second from root to leaves. On the first pass, the presence of an arrangement at a parent node is decided by majority rule on the states of its children. If the arrangement is present in one child node but absent in the other, the state at the parent node is defined as uncertain. Any uncertain child nodes have a neutral impact on their parent node’s state (i.e. if a parent has one child with the arrangement and one child with an uncertain state, the arrangement will be annotated as present in the parent node). On the second pass (from root to leaves), uncertain nodes are decided by the state at their parent node. An uncertain arrangement at the root is annotated as present. For more details and a clearer explanation see Box 1 in the figure below.
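Here is a small sketch of that two-pass scheme as I understand it (a toy re-implementation for a binary tree, not the authors’ code; I assume two uncertain children leave the parent uncertain):

```python
PRESENT, ABSENT, UNCERTAIN = "present", "absent", "uncertain"

class Node:
    def __init__(self, children=(), state=None):
        self.children = list(children)
        self.state = state          # set on leaves, inferred elsewhere

def up_pass(node):
    """Leaves to root: majority rule, with uncertain children neutral
    and a present/absent conflict leaving the parent uncertain."""
    if not node.children:
        return node.state
    states = [up_pass(child) for child in node.children]
    informative = [s for s in states if s != UNCERTAIN]
    if not informative:
        node.state = UNCERTAIN
    elif all(s == informative[0] for s in informative):
        node.state = informative[0]
    else:
        node.state = UNCERTAIN
    return node.state

def down_pass(node, parent_state=PRESENT):
    """Root to leaves: uncertain nodes take their parent's state;
    an uncertain root defaults to present."""
    if node.state == UNCERTAIN:
        node.state = parent_state
    for child in node.children:
        down_pass(child, node.state)
```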


A schematic for the assignment of domain recombination events. Box 1 gives the algorithm for the ancestral state reconstruction. Figure S2 from Moore et al. 2013.

3. Rearrangement events
For each arrangement gained on a particular branch, the authors then postulate one of four simple rearrangement events, depending on the arrangements present in the parent’s predicted proteome.

i) Fusion: A gained domain arrangement (A,B,C) on a child’s proteome is the result of fusion if the parent’s proteome contains both the arrangements (A,B) AND (C) (as one example).
ii) Fission: A gained arrangement (A,B,C) is the result of fission if the parent contains the arrangement (A,B,C,D) AND the child also contains the arrangement (D).
iii) Terminal Loss: A gained arrangement (A,B,C) is the result of terminal loss if the parent contains the arrangement (A,B,C,D) AND the child does not contain the arrangement (D).
iv) Domain gain: A gained arrangement (A,B,C) is the result of domain gain if the parent contains (A,B) but not (C).

Any gained arrangement which cannot be explained by these cases (as a single-step solution) is annotated as having no solution.
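A toy classifier along these lines might look as follows (my own simplification: it only checks a single split point for fusion and only C-terminal remainders for fission/terminal loss, where the paper’s bookkeeping is more careful):

```python
def classify_gain(gained, parent, child):
    """Assign a gained arrangement (a tuple of domains) to one of the four
    events, given the parent's and child's sets of arrangements."""
    # Fusion: both pieces of some split already existed in the parent
    for i in range(1, len(gained)):
        if gained[:i] in parent and gained[i:] in parent:
            return "fusion"
    # Fission vs terminal loss: gained is the start of a longer parental
    # arrangement; the difference is the fate of the leftover piece
    for arr in parent:
        if len(arr) > len(gained) and arr[:len(gained)] == gained:
            rest = arr[len(gained):]
            return "fission" if rest in child else "terminal loss"
    # Domain gain: one brand-new terminal domain
    if len(gained) > 1 and gained[:-1] in parent and (gained[-1],) not in parent:
        return "domain gain"
    return "no solution"

parent = {("A", "B"), ("C",)}
child = {("A", "B", "C")}
assert classify_gain(("A", "B", "C"), parent, child) == "fusion"
```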

Results

The authors find, roughly speaking, that the domain arrangements they identify fall into a bimodal distribution. The vast majority are seen in only one genome, and of these over 80% are multi-domain arrangements. There is also a sizeable number of arrangements seen in every single genome, the vast majority of which consist of a single domain. I do wonder, though, how much of this signal is due to the relative difficulty of identifying and assigning multiple different domains compared to just a single one. While it seems unlikely that this would explain the entirety of the observation (on average, 75% of proteins per genome were assigned), it would have been interesting to see how the authors address this possible bias.

Interestingly, the authors also find a slight excess of fusion events over fission events across the tree (around one more per million years), although fusion events dominate nearer the root of the tree, with fission dominating nearer the leaves and, in particular, on the dense Drosophila subtree.

Finally, the authors performed a functional term enrichment analysis on the domain arrangements gained by fusion and fission events and showed that, in both cases, terms relating to signalling were significantly overrepresented in these populations, emphasising the potential importance that modular evolution may play in this area.

Journal Club: A mathematical framework for structure comparison

For my turn at journal club I decided to look at this paper by Liu et al. from FSU.

The paper, efficiently titled ‘A mathematical framework for protein structure comparison’, is motivated by the famously challenging problem of structure comparison. Historically, and indeed presently, the most respected methods of structural classification (dividing up the protein universe by structural similarity) have required a significant amount of manual comparison. This is essentially because we have no formal consensus on what similarity between protein structures truly means and how it should be measured. There are computational challenges to efficient alignment as well but, without a formal framework and a well-defined metric, structure comparison remains an ill-posed problem.

The solution for the authors was to represent each protein structure as a continuous, elastic curve and define the deformation between any two curves as the distance between their structures. We consider, hypothetically at first, a geometric space where each point represents a 3D curve (or potential protein structure). One problem with studying protein structures as geometric objects in Euclidean 3D space is that we really want to ignore certain shape-preserving transformations, such as translations, rotations, scaling and re-parametrization. Ideally therefore, we’d like our geometric space to represent curves unique up to a combination of these transformations. That is, if a protein structure is picked up, turned upside down, then put down somewhere else, it should still be treated as the same shape as it was before. Let’s call this space the shape space S.

Key to the formulation of this space is the representation of a protein structure by its square root velocity function (SRVF). We can represent a protein structure by a continuous function β, which maps the unit interval onto 3D space: β: [0,1] → ℝ³. The SRVF of this curve is then defined as:

q(t) = β'(t) / √|β'(t)|

where β'(t) is the derivative of β. q(t) contains both the speed and the direction of β but, since it is built purely from the derivative, it is invariant to any translation of β. So, simply by using the SRVF representation of a curve, we have eliminated one of the shape-preserving transformations. We can eliminate another, rescaling, by requiring β to have unit length: ∫|β'(t)|dt = ∫|q(t)|²dt = 1.
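Numerically, the SRVF of a discretised backbone is easy to compute. A minimal sketch with NumPy (my own illustration, assuming the curve is sampled uniformly on [0,1]):

```python
import numpy as np

def srvf(beta, eps=1e-8):
    """SRVF of a curve sampled as an (n, 3) array on uniform t in [0, 1]."""
    t = np.linspace(0.0, 1.0, len(beta))
    beta_dot = np.gradient(beta, t, axis=0)        # β'(t)
    speed = np.linalg.norm(beta_dot, axis=1)       # |β'(t)|
    q = beta_dot / np.sqrt(speed + eps)[:, None]   # q(t) = β'(t)/√|β'(t)|
    # rescale so that ∫|q(t)|²dt = 1, removing the effect of curve length
    return q / np.sqrt(np.trapz(np.sum(q**2, axis=1), t))
```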

The space of q‘s is not the shape space S, as we still have rotations and re-parametrizations to account for, but it is a well-defined space (a unit sphere in the Hilbert space L², if you were wondering). We call this space the preshape space C. The main advantage of the SRVF representation of curves is that the standard metric on this space explicitly measures the amount of deformation, in terms of stretching and bending, between two curves.

All that is left is to eliminate the effects of an arbitrary rotation and re-parametrization. In fact, instead of working directly in S, we choose to remain in the preshape space C, as its geometry is so well-defined (actually, it’s even easier on the tangent space of C, but that’s a story for another time). To compare two curves in C we fix the first curve, then find the optimal rotation and parametrization of the second curve so as to minimise the distance between them. Computationally, this is the most expensive step (although almost every other global alignment method needs this step and more) and consists of first applying a singular value decomposition to find the rotation, then a dynamic programming algorithm to find the parametrization (this is the matching function between the two backbone structures… you could add secondary structure/sequence constraints here to improve the relevance of the alignment).
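The rotation step is essentially orthogonal Procrustes. Continuing the sketch above (and leaving out the dynamic-programming re-parametrization, which is the longer part):

```python
def align_rotation(q1, q2):
    """Rotate q2 onto q1 via SVD (the Kabsch/Procrustes solution)."""
    u, _, vt = np.linalg.svd(q2.T @ q1)            # 3x3 cross-covariance
    d = np.sign(np.linalg.det(u @ vt))             # forbid reflections
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return q2 @ rot.T
```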

Now we can compare q1 and q2 in C. Not only can we calculate the distance between them in terms of the deformation from one to the other, we can also explicitly visualise this optimal deformation as a series of curves between q1 and q2, calculated as the geodesic path between the curves in C:

α(τ) = [sin((1 − τ)θ)·q1 + sin(τθ)·q2] / sin(θ),   τ ∈ [0,1],

where θ = cos⁻¹(∫⟨q1,q2⟩dt) is the distance between q1 and q2 in S.


The geodesic path between protein 1MP6, the left-most structure, and protein 2K98, the right-most structure.
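Since C is a unit sphere, the geodesic is just a great circle, so the intermediate structures fall out of a few lines. Continuing the NumPy sketch (assuming q2 has already been optimally rotated and re-parametrized):

```python
def geodesic_path(q1, q2, steps=5):
    """Equally spaced points on the geodesic from q1 to q2 in C."""
    t = np.linspace(0.0, 1.0, len(q1))
    inner = np.trapz(np.sum(q1 * q2, axis=1), t)     # ∫⟨q1,q2⟩dt
    theta = np.arccos(np.clip(inner, -1.0, 1.0))     # geodesic distance θ
    if theta < 1e-10:                                # identical shapes
        return [q1.copy() for _ in range(steps)]
    return [(np.sin((1 - tau) * theta) * q1 + np.sin(tau * theta) * q2)
            / np.sin(theta)
            for tau in np.linspace(0.0, 1.0, steps)]
```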

As a consequence of this formal framework for analysing protein structures it’s now possible, in a well-defined way, to treat structures as random variables. Practically, this means we can study populations of protein structures by calculating their mean structure and covariances.
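For instance, a mean structure can be computed as a Karcher mean on the preshape sphere. A rough sketch of the standard iteration (ignoring the per-curve rotation/re-parametrization that a real implementation would interleave):

```python
def karcher_mean(qs, iters=20):
    """Intrinsic mean of unit-norm SRVFs: average in the tangent space
    at the current estimate, then map back with the exponential map."""
    t = np.linspace(0.0, 1.0, len(qs[0]))
    inner = lambda a, b: np.trapz(np.sum(a * b, axis=1), t)
    mu = qs[0]
    for _ in range(iters):
        vs = []
        for q in qs:
            c = np.clip(inner(mu, q), -1.0, 1.0)
            theta = np.arccos(c)                    # distance from mu to q
            if theta > 1e-10:                       # log map at mu
                vs.append((q - c * mu) * theta / np.sin(theta))
            else:
                vs.append(np.zeros_like(mu))
        v = np.mean(vs, axis=0)                     # tangent-space average
        norm_v = np.sqrt(max(inner(v, v), 0.0))
        if norm_v < 1e-10:                          # converged
            break
        mu = np.cos(norm_v) * mu + np.sin(norm_v) * v / norm_v  # exp map
    return mu
```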

Journal Club: The complexity of Binding

Molecular recognition is the mechanism by which two or more molecules come together to form a specific complex. But how do molecules recognise and interact with each other?

In a TIBS Opinion article from Ruth Nussinov’s group, an extended conformational selection model is described. This model includes the classical lock-and-key, induced fit and conformational selection mechanisms, and their combinations.

The general concept of an equilibrium shift of the ensemble was proposed nearly 15 years ago, or perhaps earlier. The basic idea is that proteins in solution pre-exist in a number of conformational substates, including those with binding sites complementary to a ligand. The distribution of the substates can be visualised as a free energy landscape, which helps in understanding the dynamic nature of the conformational equilibrium.

This equilibrium is not static; it is sensitive to the environment and many other factors. An equilibrium shift can be achieved by (a) sequence modifications of special protein regions termed protein segments, (b) post-translational modifications of a protein, (c) ligand binding, etc.
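As a toy illustration of the idea (entirely my own made-up energies, not from the article), Boltzmann-weighting a handful of substates shows how stabilising one of them, say by ligand binding, shifts the whole ensemble:

```python
import numpy as np

kT = 0.593  # kcal/mol at ~298 K

def populations(energies):
    """Boltzmann populations of substates from free energies (kcal/mol)."""
    w = np.exp(-np.asarray(energies, dtype=float) / kT)
    return w / w.sum()

# Three hypothetical substates; the third is the binding-competent one
print(populations([0.0, 1.0, 2.0]).round(3))   # apo:   [0.82  0.152 0.028]
print(populations([0.0, 1.0, -1.0]).round(3))  # bound: [0.152 0.028 0.82 ]
```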

So why are these concepts discussed and published again?

While the theory is straightforward, proving conformational selection is hard, and it is even harder to quantify computationally. Experimental techniques such as nuclear magnetic resonance (NMR), single molecule studies (e.g. protein yoga) and targeted mutagenesis (and its effect on the energy landscape), together with molecular dynamics (MD) simulations, have helped to conceptualise conformational transitions. Still, there is a long way to go before a full understanding of atomic scale pathways is achieved.

Journal club: Principles for designing ideal protein structures

The goal of protein design is to generate a sequence that assumes a certain structure and/or performs a specific function. A recent paper in Nature has attempted to design sequences for each of five naturally occurring protein folds. The success rate ranges from 10% to 40%.

This recent work comes from the Baker group, who are best known for Rosetta and have made several previous steps in this direction. In a 2003 paper this group stripped several naturally occurring proteins down to the backbone, and then generated sequences whose side-chains were consistent with these backbone structures. The sequences were expressed and found to fold into proteins, but the structure of these proteins remained undetermined. Later that same year the group designed a protein, Top7, with a novel fold and confirmed that its structure closely matched that of the design (RMSD of 1.2 Å).

The proteins designed in these three pieces of work (the current paper and the two papers from 2003) all tend to be more stable than naturally occurring proteins. This increased stability may explain why, as with the earlier Top7, the final structures in the current work closely match the design (RMSD of 1–2 Å), despite ab initio structure prediction rarely being this accurate. These structures are designed to sit in a deep potential well in the Rosetta energy function, whereas natural proteins presumably have more complicated energy landscapes that allow for conformational changes and easy degradation. Designing a protein with two or more conformations is a challenge for the future.

In the current work, several sequences were designed for each of the fold types. These sequences have substantial sequence similarity to each other, but do not match existing protein families. The five folds all belong to the alpha + beta or alpha/beta SCOP classes. This is a pragmatic choice: all-alpha proteins often fold into undesired alternative topologies, and all-beta proteins are prone to aggregation. By contrast, rules such as the right-handedness of beta-alpha-beta turns have been known since the 1970s, and can be used to help design a fold.

The authors describe several other rules that influence the packing of beta-alpha-beta, beta-beta-alpha and alpha-beta-beta structural elements. These relate the lengths of the elements and their connective loops with the handedness of the resulting subunit. The rules and their derivations are impressive, but it is not clear to what extent they are applied in the design of the 5 folds. The designed folds contain 13 beta-alpha-beta subunits, but only 2 alpha-beta-beta subunits, and 1 beta-beta-alpha subunit.

An impressive feature of the current work is the use of the Rosetta@home project to select sequences with funnelled energy landscapes, which are less likely to misfold. Each candidate sequence was folded >200,000 times from an extended chain. Only ~10% of sequences had a funnelled landscape. It would have been interesting to validate whether the rejected sequences really were less likely to adopt the desired fold, especially given that this selection procedure requires vast computational resources.

The design of these five novel proteins is a great achievement, but even greater challenges remain. The present designs are facilitated by the use of short loops in regions connecting secondary structure elements. Functional proteins will probably require longer loops, more marginal stabilities, and a greater variety of secondary structure subunits.