
Expanding Anfinsen’s Principle (Journal Club)

Paper: Expanding Anfinsen’s Principle: Contributions of Synonymous Codon Selection to Rational Protein Design.

In 1961, Anfinsen performed his now (in)famous denaturing experiment on ribonuclease A, a small globular protein of 124 residues. He showed that it could both unfold and refold via the addition and subsequent removal of chemical denaturants. From this he concluded that a protein’s fold is that of its global free energy minimum and, consequently, that all the information required to determine the folded structure of a given protein is encoded solely within its sequence of amino acids. In 1972, Anfinsen was awarded the Nobel Prize for this work, from which stemmed the vast field of protein structure prediction: a global arms race to see who could best find the elusive global minimum for any given protein.

Unfortunately, protein fold prediction is still in its infancy with regard to its predictive power. As a scientific community, we have made huge progress using homology models, whereby the structure of a protein with a similar sequence to the one under investigation provides a reasonable starting point for refinement. However, when no similar structure exists, we struggle, forced to resort to de novo models. This lack of ability when we are given solely a sequence to work with shows that our fundamental understanding of the protein folding process must be incomplete.

An increasingly common viewpoint, one that is at odds with Anfinsen’s conclusions, is that additional information is required for a protein to fold. One suggested source of that information is the production of the protein itself at the ribosome. In what has been termed cotranslational folding, it has been shown that a protein undergoing synthesis will fold as it emerges from the ribosome, not waiting until the entire sequence has been synthesised. As such, the energy landscape that the protein must explore is under constant change as more and more of the protein emerges from the ribosome: an iterative process of smaller searches, with the landscape modulated in steps towards that of the complete amino acid sequence.

Another suggested source of information is the degeneracy of the genetic code. Each amino acid is encoded by up to six different codons, and as such one can never determine exactly the coding DNA that produced a given protein. While this degeneracy has been suggested as merely a buffer against the deleterious effects of point mutations, it has also been found that each of these codons is translated at a different rate. It is evident that information is consumed when RNA is converted into protein at the ribosome, since reverse translation is impossible, and it is hypothesised that these variations in speed can alter the final protein structure.
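
To make this degeneracy concrete, here is a minimal sketch that groups codons by the amino acid they encode; Biopython's standard codon table is an assumed convenience, and any codon table would do.

```python
# Group the 61 sense codons of the standard genetic code by the amino
# acid they encode, illustrating degeneracy of up to six codons per
# amino acid.
from collections import defaultdict
from Bio.Data import CodonTable

forward = CodonTable.unambiguous_dna_by_name["Standard"].forward_table
synonymous = defaultdict(list)
for codon, amino_acid in forward.items():
    synonymous[amino_acid].append(codon)

for amino_acid, codons in sorted(synonymous.items()):
    print(f"{amino_acid}: {len(codons)} codons -> {', '.join(sorted(codons))}")
# Leucine, serine and arginine each have six synonymous codons;
# methionine and tryptophan have only one.
```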

Figure 1. Experimental design for kinetically controlled folding. (a) Schematic of YKB, which consists of three half-domains connected by flexible (AGQ)5 linkers (black lines). The Y (yellow) and B (blue) half-domains compete to form a mutually exclusive kinetically trapped folded domain with the central K (black) half-domain. The red wedge indicates the location of synonymous codon substitutions (see text). (b) Energy landscapes for proteins that fold under kinetic control have multiple deep minima, representing alternative folded structures, separated by large barriers. The conformations of the unfolded protein and early folding intermediates (colored arrows) determine the final folded state of the protein. Forces that constrict the unfolded ensemble (gray cone) can bias folding toward one structure. (c) During translation of the nascent chain by the ribosome (orange), folding cannot be initiated from the untranslated C-terminus, which restricts the ensemble of unfolded states and leads to the preferential formation of one folded structure. Image sourced from J. Am. Chem. Soc., 2014, 136(3).

The journal club paper by Sander et al. looked experimentally at whether both cotranslational folding and codon choice can affect the resultant protein structure. This was achieved through the construction of a toy model protein consisting of three half-domains, as shown in Figure 1. Each of these half-domains was sourced from bifluorescent proteins, protein half-domains that fluoresce only when combined. The second half-domain (K) could combine with either the first (Y) or the third (B) half-domain to create a fluorophore; crucially, this occurs irreversibly, such that once one full domain has formed the other cannot. By choosing fluorophores that differed in wavelength, it was simple to measure the ratio in which the two species, YK-B and Y-KB, were formed.

They found that the ratio of these two species differed between in vitro and in vivo formation. When denatured Y-K-B species were allowed to refold, an equal mixture was produced, both species being found equally likely to form. In contrast, when synthesised at the ribosome, the protein showed an extreme bias towards forming the YK-B species, as shown in Figure 2. They concluded that this is caused by cotranslational folding, the half-domains Y and K having time to form the YK species before B had finished being produced. As some members of the OPIG group pointed out, it would have been good to see whether similar results were obtained with the construct reversed, such that B was synthesised first and Y last, but this point does not invalidate what was reported.

Figure 2. Translation alters YKB folded structure. (a) Fluorescence emission spectra of intact E. coli expressing the control fluorescent protein constructs YK (yellow) or KB (cyan). (b) Fluorescence emission spectra of intact E. coli expressing YKB constructs with common or rare codon usage (green versus red solid lines) versus the same YKB constructs folded in vitro upon dilution from a chemical denaturant (dashed lines). Numbers in parentheses correspond to synonymous codon usage; larger positive numbers correspond to more common codons. (c) E. coli MG1655 relative codon usage(3) for codons encoding three representative YKB synonymous mutants: (+65) (light green), (−54) (red), and (−100) (pink line). Image sourced from J. Am. Chem. Soc., 2014, 136(3).

Following the above, they also probed the role of codon choice using this toy model system. They varied the codon choice over a small segment of residues between the K and B half-domains, such that they had a multitude of species encoded either “faster” or “slower” across this region. Codon usage was used as the measure of translation speed, though its appropriateness has yet to be established in the literature. They found that the slower species increased the bias towards YK-B over Y-KB, while the faster species reduced it. This experiment clearly shows that codon choice affects a protein’s final structure, though only a large global effect is demonstrated. My work is primarily on whether codon choice has a role at the secondary structure level, so I will be avidly hoping for further experiments that show the role of codons at finer levels.

In conclusion, Sander et al. performed one of the cleanest experimental proofs of cotranslational folding to date. Older evidence is more anecdotal in nature, with reports of protein X or Y changing in response to a single synonymous mutation. In contrast, the experiment reported here is systematic in its approach and leaves little room for doubt over the results. Secondly, and more groundbreaking, is the (again) systematic manner in which codon choice is investigated and shown to affect global protein structure. This is one of those rare pieces of science in which the conclusions are clear and forthcoming to all readers.

Activity cliffs

Mini-perspective on activity cliffs as a medicinal chemistry tool

Recently in group meeting we discussed activity cliffs and their application to medicinal chemistry. The talk was based largely on the excellent mini-perspective on the subject written by Stumpfe et al.

What is an activity cliff?

Activity cliffs are pairs of compounds that represent a small structural change but a large activity change. They are commonly used in the design of new compounds targeting a particular protein (drugs). They work on the principle that if a given structural change has previously had a large effect on activity, it is likely to have a similar effect on a different compound series. In this way they can be used as predictive tools to suggest chemical transformations that are likely to improve activity for a given compound.

To define an activity cliff, one must consider what a small structural change and a large activity change mean.

Small structural change

Structural changes can be measured using a seemingly endless array of methods. Many methods condense the chemical information of the molecule into a bit-vector, where each bit indicates whether the molecule contains a particular type of chemical functionality, e.g. a methyl group. Molecular similarity is then assessed by comparing the bit-vectors, most commonly via the Tanimoto similarity between them. This returns a single value between 0 and 1 indicating how similar the two molecules are (the greater, the more similar). To define a small structural change, one must decide upon a threshold value above which two molecules are considered sufficiently similar.
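
As an illustration of the bit-vector approach, here is a minimal sketch using RDKit (an assumed toolkit, with hypothetical example compounds and an arbitrary 0.7 threshold; the perspective does not prescribe any of these).

```python
# Hash two molecules into Morgan fingerprint bit-vectors and compare
# them with Tanimoto similarity.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

mol_a = Chem.MolFromSmiles("c1ccccc1CC(=O)O")   # hypothetical compound
mol_b = Chem.MolFromSmiles("c1ccccc1CCC(=O)O")  # a close analogue

fp_a = AllChem.GetMorganFingerprintAsBitVect(mol_a, 2, nBits=2048)
fp_b = AllChem.GetMorganFingerprintAsBitVect(mol_b, 2, nBits=2048)

similarity = DataStructs.TanimotoSimilarity(fp_a, fp_b)
print(f"Tanimoto similarity: {similarity:.2f}")

# The threshold itself is the arbitrary part of the definition.
if similarity >= 0.7:
    print("Treat as a small structural change")
```
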
An alternative method is to find matched molecular pairs – compounds which are identical apart from one structural change. An example is shown below. For matched molecular pairs, the only parameter required is the size of the non-matching part of the pair, usually measured in non-hydrogen atoms. The threshold for this parameter is chosen equally arbitrarily; however, it has a much more intuitive effect.

An example of a matched molecular pair

Which method to use?

Similarity methods are less rigid and are capable of finding molecules that are very similar yet differ in two or more subtle ways. However, they are also liable to deem molecules similar when they would not be perceived as such. In this work Stumpfe et al. show that different similarity methods do not agree greatly on which molecules are “similar”. They compare six different fingerprint methods used to find similar molecules. Each method finds around 30% of molecules in the datasets to be similar, yet the consensus between the methods is only 15%. This indicates that there is no clear definition of “similar” using bit-string similarity. Interestingly, a third of the molecules found to be similar by all six fingerprint methods are not considered matched molecular pairs. This demonstrates a downside of the matched molecular pair approach: it is liable to miss highly similar molecules that differ in a couple of small ways.

Matched molecular pairs are, however, least liable to find false positives, i.e. compounds that are seen as similar but in fact are not. The transformations they represent are easily understood and readily applied to novel compounds. For these reasons matched molecular pairs were chosen by Stumpfe et al. to indicate small structural changes in this work.

Large activity change

A large activity change is an equally arbitrary threshold to set. The exact value that indicates an activity cliff will depend on the assay used and the protein being targeted. Stumpfe et al. reasonably suggest that approximate activity measures should not be used and that activity values from different assays should not be compared.

Rationales for activity cliffs

If structural data is available for an activity cliff, rationales for the corresponding activity change can be suggested. These can then be used to propose other alterations that might have a similar impact. Stumpfe et al. consider the five most common rationales for activity cliffs:

  • H-bond and/or ionic interactions: these form specific interactions with the protein, increasing the binding energy
  • Lipophilic and aromatic groups: these can form specific protein-ligand interactions, e.g. pi-pi stacking, and also form favourable interactions with hydrophobic residues in the protein
  • Water molecules: one molecule in the pair displaces water molecules from the active site, altering the binding energy
  • Stereochemistry changes: for example, altering the enantiomeric form of a compound alters the projection of a group, forming or losing favourable/unfavourable protein-ligand interactions
  • Multiple effects: a combination of the above, making it difficult to establish the dominant feature.

Are they generally useful?

Stumpfe et al. consider whether activity cliffs are more useful for some proteins or protein classes than others. They investigate how many compounds form activity cliffs for the many protein targets for which activity data is openly available. For proteins with more than 200 compounds with activity data, the proportion of activity-cliff-forming compounds is roughly equivalent (around 10%). This is an interesting and unexpected result: the proteins in this study have different binding sites offering different opportunities for protein-ligand interactions, so it would not naturally be expected that they would offer similar opportunities for generating activity cliffs. This result suggests that the activity cliff concept is generically useful, irrespective of the protein being targeted.

Are they predictive?

Although activity cliffs make intuitive sense, Stumpfe et al. consider whether the approach has been quantitatively successful in previous drug discovery efforts. They identify all of the cases where activity cliff information was available in openly accessible data. They then find all of the times this information was used in a different compound series and, where it was used, whether it had a positive or negative effect on activity.

Interestingly, available activity cliff information had not been used in 75% of cases. They suggest this indicates the information is an as-yet underused resource. Secondly, in the cases where it was used, 60% of the time it was successful in improving activity and 40% of the time it was unsuccessful. They suggest this indicates the activity cliff method is useful for suggesting novel additions to compounds. Indeed, a method that gives a 60% success rate in predicting more potent compounds would be considered useful by most, if not all, medicinal chemists. It would be interesting to investigate whether there are patterns in the protein environment or in the nature of the structural changes in the cases where the activity cliff method is unsuccessful.

Have they been successful?

Finally, Stumpfe et al. investigate whether using activity cliff information gives a higher probability of synthesising a compound among the 10% most active against the target protein. They show that in 54% of cases where activity cliff information was used, a compound in the 10% most active was produced. Conversely, when this information was not used, only 28% of pathways produced a compound in the 10% most active. They argue this indicates that using activity cliff information improves the chances of producing active compounds.

Conclusion

The paper discussed here offers an excellent overview of the activity cliff concept and its application. Stumpfe et al. demonstrate, in this work and others, that activity cliffs are generally useful, predictive and currently underused. The method could therefore be built into new tools to improve the efficiency of drug discovery.

CAPRI, pyDock and the state of protein-protein docking.

Intro

We have discussed a paper highlighting the latest progress in the field of protein-protein docking. It covers the results of the pyDock group’s participation in CAPRI. The latest round of the experiment is particularly interesting, as it challenged the participants with affinity prediction, going one step beyond modeling protein-protein complexes.

Role of the group in this CAPRI edition

For those that are not very familiar with CAPRI, one can participate in any of the following roles:

  • Predictors: Participants are given the unbound structures or sequences of one or both interacting partners. Predictors then submit a ranked list of modeled complexes, which can be produced by any docking/prediction tools plus some human intervention. This tests the current ad hoc capacity to model protein-protein complexes.
  • Servers: Participants are given the unbound structures or sequences of one or both interacting partners, which are submitted to an automatic server that must return the results (ideally) within 24 hours. This tests the automated capacity to model protein-protein complexes.
  • Scorers: The community contributes sets of poses (unranked) for each target, and the scorers re-rank them. This is a form of cross-validation of methods, since one might expect that groups which sample poses well also score them well; this exercise is supposed to quantify that.

The pyDock group participates in the capacity of Predictors and Scorers. Below we first outline the general docking methodology, followed by the particular implementation of the pyDock group.

General Docking Methodology

Protein-protein docking starts with two interacting structures as input, a ligand (L) and a receptor (R) (if a structure is unavailable, a model is created). Docking techniques can be broadly divided into two types: template-based and template-free. In template-based docking, one needs either a complex of homologs or one that is sufficiently structurally similar. However, since it is estimated that the complexes in the PDB cover only ~4% of the presumed total, this approach is not applicable in a great number of cases.

Template-free docking does not rely on pre-solved complexes, and it is the main focus of this post. The standard modus operandi of a template-free protein-protein docker is the following:

  1. Sample N poses.
  2. Re-score top D sampled poses.
  3. Refine the top K poses.

In the first step, possible poses of the ligand with respect to the receptor are generated. This is very often achieved by FFT-based correlation (ZDOCK) or more elaborate methods such as geometric hashing/pose clustering (PatchDock). The possible orientations are ranked by a simplified energy function or a statistical potential. In most cases this step is performed in rigid-body mode (intra-molecular distances are not allowed to change) for computational efficiency. The number of sampled poses is usually in the thousands (N ~ 2k-10k).
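
To make the FFT step concrete, here is a toy sketch of the translation scan at the heart of FFT samplers such as ZDOCK; the grids, scoring and sizes are invented for illustration, and rotations and electrostatics are ignored.

```python
# Score every translation of a voxelised ligand over a voxelised receptor
# at once using the FFT correlation theorem.
import numpy as np

def shape_scores_all_translations(receptor_grid, ligand_grid):
    """Return a grid of shape-complementarity scores, one per translation."""
    R = np.fft.fftn(receptor_grid)
    L = np.fft.fftn(ligand_grid)
    # corr(R, L) = IFFT(FFT(R) * conj(FFT(L))), evaluated for all shifts.
    return np.fft.ifftn(R * np.conj(L)).real

# Toy 32^3 occupancy grids standing in for the two structures.
rng = np.random.default_rng(0)
receptor = (rng.random((32, 32, 32)) > 0.9).astype(float)
ligand = (rng.random((32, 32, 32)) > 0.9).astype(float)

scores = shape_scores_all_translations(receptor, ligand)
best_shift = np.unravel_index(np.argmax(scores), scores.shape)
print("Best translation (voxels):", best_shift)
# A real docker repeats this over thousands of ligand rotations and keeps
# the top-scoring poses, yielding the N ~ 2k-10k candidates above.
```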

The re-scoring step uses a more elaborate energy function/statistical potential to identify near-native poses. Notable examples include pyDock and ZRANK. Since such functions are usually more expensive computationally (for instance, pyDock calculates a Lennard-Jones potential, Coulombic electrostatics and desolvation terms), it is more tractable to apply them to a smaller set (D << N) of top poses as returned by the rigid-body sampler. Additionally, the final poses might be pruned for redundancy by removing structures which are very similar to each other. One method in particular (ClusPro) owes its success to ranking poses by the number of similar decoys per cluster among the top predictions.
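
The redundancy pruning can be illustrated with a minimal greedy sketch; the pose representation and the 10 Å ligand-RMSD cutoff below are assumptions for illustration, not values taken from pyDock or ClusPro.

```python
# Walk a score-ranked pose list and drop any pose whose ligand RMSD to an
# already accepted, higher-scoring pose falls below a cutoff.
import numpy as np

def ligand_rmsd(coords_a, coords_b):
    """RMSD between two already-superposed (n_atoms, 3) coordinate sets."""
    return np.sqrt(np.mean(np.sum((coords_a - coords_b) ** 2, axis=1)))

def prune_redundant(ranked_poses, rmsd_cutoff=10.0):
    """Keep the best-scoring representative of each cluster of similar poses."""
    kept = []
    for pose in ranked_poses:  # assumed sorted best-first
        if all(ligand_rmsd(pose, k) >= rmsd_cutoff for k in kept):
            kept.append(pose)
    return kept

# Toy example: 100 random "poses" of a 50-atom ligand in a 30 A box.
rng = np.random.default_rng(1)
poses = [rng.random((50, 3)) * 30 for _ in range(100)]
print(f"{len(prune_redundant(poses))} non-redundant poses kept out of {len(poses)}")
```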

The final step constitutes the refinement of the selected top K poses (K << D). Here, flexibility might be introduced so as to account for conformational changes upon binding. This is achieved with computationally expensive force fields such as AMBER or CHARMM. The coordinates of side-chains or even backbones are perturbed, for instance using normal mode calculations, until the energy function reaches a minimum via some flavour of gradient descent.

Fig. 1: Breakdown of results of the pyDock group at the latest CAPRI round. The results are presented for each target, in the predictor as well as scorer capacity.

The pyDock group Methodology and Results

The pyDock group follows the pattern outlined above. As samplers they employ ZDOCK and FTDOCK, both fast rigid-body Fast Fourier Transform-based methods. They use their pyDock function to score the best models. Additionally, they remove redundancy from the final lists of decoys by removing entries which are too similar to higher-scoring ones according to ligand RMSD. The final refinement step is carried out using TINKER.

Receptor is shown in white. Best ligand as predictors in red, as scorers in blue.

The pipeline outlined above was carried out for most targets, with some exceptions. For some targets, additional docking poses were generated using SwarmDock (T53, T54, T57 and T58) and RotBUS (T46 and T47). In some cases CHARMM or AMBER was used instead of TINKER at the refinement stage.

Such ad hoc approaches were employed for almost every target, with the available information exploited to the fullest in order to achieve the best models. For instance, where good homology structures were available (T47), or where it was known which residues are of importance to the complex, appropriate distance constraints were imposed. When it comes to docking, the pyDock group achieved better accuracy as predictors than as scorers (67% correct models submitted against 57%), following a trend wherein predictors usually do better than scorers (see Fig. 1). This trend was, however, broken at the stage of predicting the water molecules at the interface of T47: using DOWSER, they correctly identified 20% of contact-mediating water molecules in the predictor capacity and 43% as scorers.

Overall, the performance of docking indicates that the field is steadily moving towards the goal of modeling complexes using sequence alone. There were some cases in this round where predictor groups started with sequence only and still produced reasonable models of complexes (T47, T50 and T53). In each of these cases one of the partners was an unbound structure and the other was a sequence. The only case where both partners were sequences did not produce any reasonable models; there, only two groups out of 40 managed to present meaningful solutions.

Furthermore, this CAPRI round was the first to introduce affinity prediction, in targets T55 and T56, where the aim was to predict binding energy changes upon mutation. The predictor designed by the pyDock group achieved good results on this test case, with more in-depth details on the approach to be found in a related community-wide experiment.

Efficient discovery of overlapping communities in massive networks

Detecting overlapping communities is essential to analyzing and exploring natural networks such as social networks, biological networks, and citation networks. However, most existing approaches do not scale to the size of networks that we regularly observe in the real world. In the paper by Gopalan et al. discussed in this week’s group meeting, a scalable approach to community detection is developed that discovers overlapping communities in massive real-world networks. The approach is based on a Bayesian model of networks that allows nodes to participate in multiple communities, and a corresponding algorithm that naturally interleaves subsampling from the network and updating an estimate of its communities.

The model assumes there are K communities and that each node i is associated with a vector of community memberships \theta_i. To generate a network, the model considers each pair of nodes. For each pair (i,j), it chooses a community indicator z_{i \rightarrow j} from the i^{th} node’s community memberships \theta_i and then chooses a community indicator z_{i \leftarrow j} from \theta_j. A connection is then drawn between the nodes with probability \beta if the indicators point to the same community or with a smaller probability \epsilon if they do not.
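
A minimal generative sketch of this process (assuming, for simplicity, symmetric edges and a single \beta shared by all communities; the paper’s model is richer) might look like the following.

```python
# Generate a network from per-node community memberships: a link forms
# with probability beta if the per-edge community indicators agree,
# and with a much smaller probability epsilon otherwise.
import numpy as np

rng = np.random.default_rng(42)
n_nodes, K = 30, 3
beta, epsilon = 0.8, 0.01

# Community membership vectors theta_i, one Dirichlet draw per node.
theta = rng.dirichlet(np.ones(K) * 0.1, size=n_nodes)

adjacency = np.zeros((n_nodes, n_nodes), dtype=int)
for i in range(n_nodes):
    for j in range(i + 1, n_nodes):
        z_ij = rng.choice(K, p=theta[i])  # indicator z_{i -> j}
        z_ji = rng.choice(K, p=theta[j])  # indicator z_{i <- j}
        p_link = beta if z_ij == z_ji else epsilon
        adjacency[i, j] = adjacency[j, i] = int(rng.random() < p_link)

print("Edges generated:", adjacency.sum() // 2)
```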

This model defines a joint probability p(\theta,\beta,z,y) where y is the observed data. To estimate the posterior p(\theta,\beta,z | y), the method uses a stochastic variational inference algorithm. This enables posterior estimation using only a sample of all possible node pairs at each step of the variational inference, making the method applicable to very large graphs (e.g. analyzing the large citation network of physics papers shown in the figure below identifies important papers impacting several sub-disciplines).

Community detection in a physics citation network.

One limitation of the method is that it does not incorporate automatic estimation of the number of communities, which is a general problem with clustering algorithms. Still, enabling sophisticated probabilistic analysis of structure in massive graphs is a significant step forward.


Journal club (Bernhard Knapp): NetMHCIIpan-3.0

This week’s topic of the journal club was the prediction of potential T cell epitopes (i.e. prediction of the binding affinity between peptides and MHC). In this context I presented the latest paper in the NetMHCpan series:

Karosiene E, Rasmussen M, Blicher T, Lund O, Buus S, Nielsen M. NetMHCIIpan-3.0, a common pan-specific MHC class II prediction method including all three human MHC class II isotypes, HLA-DR, HLA-DP and HLA-DQ. Immunogenetics. 2013 Oct;65(10):711-24

The reliable prediction of peptide/MHC binding affinity has been a scientific aim for several years. Early approaches were based on the concept of anchor residues, i.e. those peptide side-chains which point towards the MHC. The next “generation” of methods was matrix based, i.e. one simply counted how likely it is that a specific residue occurs at peptide position 1, 2, 3, … in binding and non-binding peptides of a specific MHC allele. Subsequent methods incorporated higher-order approaches in which positions within the peptide influence each other (e.g. SVMs, ANNs, …). While such methods are already quite reliable, their major limitation is that they can only be trained where sufficient experimental binding data exists per allele. Therefore a prediction for alleles with only a few experimentally measured peptide binding affinities is not possible, or rather poor in quality. This is especially important because there are several thousand HLA alleles: the latest version of the IPD-IMGT/HLA database lists, for example, 1740 HLA class I A and 2329 HLA class I B proteins (http://www.ebi.ac.uk/ipd/imgt/hla/stats.html). Some of them are very frequent and others are not, but the aim would be 100% coverage.
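
As an illustration of the matrix-based generation of methods, here is a minimal position-specific scoring matrix sketch; the toy binder peptides and the flat background frequency are assumptions for illustration.

```python
# Count how often each residue occurs at each peptide position among
# known binders, convert the counts to smoothed log-odds scores, and
# score new peptides against the resulting matrix.
import math
from collections import Counter

binders = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAV", "GLAEAAAAV"]  # toy 9-mers
alphabet = "ACDEFGHIKLMNPQRSTVWY"
background = 1.0 / len(alphabet)  # flat background for simplicity

def build_pssm(peptides):
    """One log-odds dictionary per peptide position, add-one smoothed."""
    pssm = []
    for pos in range(len(peptides[0])):
        counts = Counter(p[pos] for p in peptides)
        total = len(peptides) + len(alphabet)
        pssm.append({aa: math.log(((counts[aa] + 1) / total) / background)
                     for aa in alphabet})
    return pssm

def score(peptide, pssm):
    return sum(pssm[pos][aa] for pos, aa in enumerate(peptide))

pssm = build_pssm(binders)
print(f"known binder : {score('ALAKAAAAM', pssm):6.2f}")
print(f"random 9-mer : {score('WWWWWWWWW', pssm):6.2f}")
```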

Pan-specific approaches go one step further and allow binding prediction between a peptide and an arbitrary MHC allele for which the sequence is known. The NetMHCpan approach is one of them and is based on the extraction of a pseudo sequence (those MHC residues which are in close proximity to the peptide) per MHC allele. Subsequently, this MHC pseudo sequence as well as the peptide sequence are encoded using some scheme, e.g. a BLOSUM matrix. Finally, the encoded sequences and the experimental binding affinities are fed into an artificial neural network (ANN). For an illustration of the workflow see:

NetMHCpan workflow
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2981877/bin/1745-7580-6-S2-S3-2.jpg
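
The encoding step can be sketched as follows; Biopython’s bundled BLOSUM62 matrix is an assumed convenience, and the peptide is a toy example.

```python
# Encode each residue of a peptide as its row of BLOSUM62 scores over the
# 20 standard amino acids, producing a fixed-length numeric vector that an
# ANN can consume (NetMHCpan does the same for the MHC pseudo sequence).
import numpy as np
from Bio.Align import substitution_matrices

blosum62 = substitution_matrices.load("BLOSUM62")
alphabet = "ACDEFGHIKLMNPQRSTVWY"

def encode(sequence):
    """Encode a sequence as a (len(sequence), 20) array of BLOSUM62 rows."""
    return np.array([[blosum62[aa, other] for other in alphabet]
                     for aa in sequence])

peptide = "ALAKAAAAM"                  # toy 9-mer
features = encode(peptide).flatten()   # 9 x 20 = 180 ANN inputs
print(features.shape)
```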

This approach turned out to be quite successful: depending (not too strongly) on the allele, the area under the ROC curve reaches on average a healthy 0.8x, which is already quite competitive with experimental approaches.

The NetMHCpan series has progressed over the years, starting with MHC class I in humans (2007) and beyond (2009), then MHC class II DR (2010) and, in the latest version, also DP and DQ (2013). With the next paper expected to cover non-human MHC class II alleles, the prediction of the binding affinity between peptides and pretty much every MHC should soon be possible.

AH^AH – Alpha Helices Assessed by Humans.

Can you help our research, by sparing about 15 minutes of your time to participate in our crowd-sourcing experiment? Click here!

We have created a game-like web application to classify protein helices. If you participate, it will take about 15 minutes, and afterwards we will show you how your results relate to those of all the other participants. You do not need any specific training to take part – your natural human puzzle-solving skills are sufficient. Try to beat the others!


We are interested in kinked α-helices in proteins. Kinks are important for protein function, and understanding them will help to treat and design therapies for diseases. One hindrance to our work is that helix kinks are not well defined – there are many helices that are classed as kinked by some scientists, but not by others. Classifying helices as kinked, curved or straight is difficult computationally. There are lots of different ways to do it, but they all give different answers. We think that humans can do it better.

Can you help us by sparing about 15 minutes of your time to assess some α-helices? Click here!

Edit: To participate in this, you need to register. While the data will be made anonymous, registration is required for a number of reasons. It allows you to log back in and view your results at a later date, it gives us a record that you have participated (so that we can acknowledge your help), and it allows us to compare the results from people with different educational backgrounds. There are also some legal reasons, and it is much simpler if we can record that you have agreed to the privacy and cookies policy, rather than having to ask permission every time you return. We will not pass your data on to third parties, and the data is stored securely (with the passwords encrypted).

Wilcoxon-Mann-Whitney test and a small sample size

The Wilcoxon-Mann-Whitney (WMW) test (two samples) is a non-parametric test used to assess whether the distributions of two populations are shifted, i.e. f_1(x) = f_2(x+k) where k is the shift between the two distributions; thus if k=0 the two populations are actually the same. The test is based on the ranks of the observations of the two samples, which means it does not take into account how big the differences between the values of the two samples are: a WMW test comparing S1=(1,2) and S2=(100,300) gives exactly the same result as one comparing S1=(1,2) and S2=(4,5). With a small sample size this is a great loss of information.

Now, what happens when you perform a WMW test on two samples of size 2 that are as different as they can be (as far as the test is concerned), let’s say S1=(1,2) and S2=(4,5)? The p-value of this test is 0.333, which means that the smallest p-value you can obtain from a WMW test comparing two samples of size 2 is 0.333. Hence you would only be able to detect differences between the two samples when using a significance level greater than 0.333.

Finally, you must understand that a sample of two is usually not enough for a statistical test. The following table shows the smallest attainable p-value for different small sample sizes when the alternative hypothesis is two-sided (values in the table are rounded).

[Table: smallest attainable two-sided WMW p-values for a range of small sample sizes]
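
These values follow from a simple counting argument, reproduced in the sketch below: with no ties, all C(n1+n2, n1) assignments of ranks to the first sample are equally likely under the null, so the most extreme rank-sum in each tail has probability 1/C(n1+n2, n1), and the smallest two-sided p-value is 2/C(n1+n2, n1).

```python
# Smallest attainable two-sided p-value of the WMW test for small samples.
from math import comb

def smallest_two_sided_p(n1, n2):
    return 2 / comb(n1 + n2, n1)

for n1, n2 in [(2, 2), (2, 3), (3, 3), (3, 4), (4, 4), (5, 5)]:
    print(f"n1={n1}, n2={n2}: smallest p = {smallest_two_sided_p(n1, n2):.3f}")
# n1=2, n2=2 gives 2/6 = 0.333, matching the example above;
# scipy.stats.mannwhitneyu((1, 2), (4, 5), alternative='two-sided') agrees.
```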

CABS-flex

One of the still open questions in bioinformatics is that of protein flexibility, and it is one in which I am quite interested. Our main source of structural information is X-ray diffraction experiments, in which the crystalline protein is intrinsically rigid. While there is an ever-increasing body of NMR data, and molecular dynamics simulations are becoming faster and cheaper, complete information about the dynamics of every PDB structure is a long way off. All-atom MD simulations of a complete protein still take either days or a powerful computer. So, naturally, any paper that talks about ways to get flexibility information catches my attention.

The paper that caught my attention was CABS-flex: Server for fast simulation of protein structure fluctuations. There is also a connected paper – Consistent View of Protein Fluctuations from All-Atom Molecular Dynamics and Coarse-Grained Dynamics with Knowledge-Based Force-Field – which benchmarks their method against MD simulations.

The CABS model pops up in a number of other places – there is a CABS-fold server, and the model itself was first described in a paper from 2004.

What does the webserver do?

You give it a coordinate file, wait some hours, and it gives you back a series of (clustered) structures, a Cα trajectory, a residue flexibility profile, and a nice little video. Which is nice. It is, however, a little limited: you can only run a single chain (so no modelling of small-molecule or peptide interactions), there is no way of fixing part of the model so it does not flex, and it can be picky about your PDB files – no chain gaps, no strange amino acids. You can tell it to analyse a PDB chain by just entering its code, but these are frequently rejected for having unacceptable coordinate files.
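
As an aside, turning an ensemble like the returned Cα trajectory into a residue flexibility profile is straightforward; below is a minimal RMSF sketch on a made-up, already-superposed ensemble (an illustration, not the server’s actual post-processing).

```python
# Per-residue RMSF: each residue's root-mean-square fluctuation about its
# mean position across the models of a superposed ensemble.
import numpy as np

def rmsf_profile(ensemble):
    """ensemble: (n_models, n_residues, 3) array of superposed CA coords."""
    mean_coords = ensemble.mean(axis=0)              # (n_residues, 3)
    sq_dev = ((ensemble - mean_coords) ** 2).sum(axis=2)
    return np.sqrt(sq_dev.mean(axis=0))

# Toy ensemble: 10 models of a 100-residue chain, with noise growing
# towards the termini to mimic floppy ends.
rng = np.random.default_rng(7)
base = np.cumsum(rng.normal(size=(100, 3)), axis=0)
noise_scale = np.abs(np.linspace(-1.0, 1.0, 100))[:, None] + 0.1
ensemble = base + rng.normal(size=(10, 100, 3)) * noise_scale

profile = rmsf_profile(ensemble)
print("Most flexible residue:", int(profile.argmax()) + 1)
```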

How does it work?

The CABS model makes a reduced representation of the protein and explores its conformational space by making moves, which are accepted or rejected based on a (comparatively complex) force field. Many moves are attempted and, over a large number of iterations, this builds up an ensemble of structures showing the possible conformations of the protein. So, it’s a bit like MD, but rather than moving atoms based on calculated forces, you make random moves and accept the sensible ones.
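
This accept/reject logic is classic Metropolis Monte Carlo, which a toy sketch makes clear; the one-dimensional double-well "energy" below is invented and stands in for the CABS force field and lattice moves.

```python
# Metropolis Monte Carlo on a toy potential: propose a random move, always
# accept it if it lowers the energy, and accept uphill moves with a
# Boltzmann probability so the walk samples an ensemble of states.
import math
import random

def energy(x):
    return (x ** 2 - 1) ** 2  # toy double-well with minima at x = +/- 1

def metropolis(n_steps=10_000, step_size=0.2, kT=0.3):
    x, samples = 0.0, []
    for _ in range(n_steps):
        x_new = x + random.uniform(-step_size, step_size)  # random move
        delta = energy(x_new) - energy(x)
        if delta <= 0 or random.random() < math.exp(-delta / kT):
            x = x_new  # accept the move
        samples.append(x)
    return samples

random.seed(0)
samples = metropolis()
mean_abs = sum(abs(s) for s in samples) / len(samples)
print(f"Mean |x| over the ensemble: {mean_abs:.2f}")
# The ensemble clusters around the two minima, just as the CABS ensemble
# clusters around low-energy conformations.
```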

[Figure: the CABS reduced protein representation and its move set]

The model is shown above. The protein structure is reduced to four atoms per residue. There are five types of move, and in each step a random move of each type is tried ~n times (n being the number of amino acids in the protein). Moves are accepted or rejected based on the force field, and the process is repeated several hundred times. The force field used to accept/reject moves is quite complex: there are short-range (sequence dependent and sequence independent), hydrogen bond, and long-range (again, sequence dependent and sequence independent) interactions. There are sub-categories within these, and the relative contributions of the interactions can be changed. There is no solvent, so the long-range interactions have to provide the forces that keep the protein together, and the hydrogen bond interactions act to maintain the secondary structure elements. More detail can be found in the 2004 paper.

Is it true?

The authors justify the method by comparing it to MD simulations taken from a database, and conclude that CABS-flex gives answers as similar to MD simulations as MD simulations using different force fields are to each other. Simply put, the same residues move in their simulations as move in MD simulations. They move more, which backs up the claim that CABS-flex demonstrates flexibility over a longer time-frame than MD simulations. The authors do admit that they get the best agreement when they heavily restrain the secondary structure – raising the question of how much of the agreement between all of the methods is down to the simulations sharing the same secondary structure.

To conclude, this could be a useful method, particularly in that it can give long time-frame flexibility – provided you treat it, like any other measure of flexibility, with a pinch of salt. It is a bit of a shame that the interface is (perhaps frustratingly) simple and that you need to pre-process your coordinate files, as many PDB coordinate files will not be accepted.

Conservation of codon optimality

The unidirectional process of translation converts one-dimensional mRNA sequences into three-dimensional protein structures. The reactant, the mRNA, is built from 64 codon variants, while the product, the protein, is constructed from the 20 biologically relevant amino acids. This reduction of 64 variants to a mere 20 reflects the degeneracy contained within the genetic code: multiple codons encode the same amino acid, and hence any given protein can be encoded by a multitude of synonymous mRNA sequences. Current structural biology research often assumes that these synonymous mRNA sequences are equivalent, the ability of certain proteins to unfold and refold without detrimental effects having led to the assumption that the information pertaining to the folded structure is contained exclusively within the protein sequence. In actuality, experimental evidence is mounting that synonymous sequences can in fact produce proteins with different properties, with synonymous codon switches observed to cause a wide range of detrimental effects, such as decreases in enzyme activity, reductions in solubility, and even misfolding.

The ribosome (yellow) passes along the mRNA, turning the sequence of codons into a protein. Under our model, the speed of translation for each codon varies, in turn varying the time available to the nascent peptide chain to explore the fold space. Through this mechanism, codon choice becomes an additional source of structural information.

For my 10-week DTC project within the Deane group, I was tasked with resurrecting the Coding Sequence and Structure database (CSandS; read: Sea Sands) and using it to test for the evolutionary conservation of codon choice over groups of proteins with similar structures. With experimental differences in protein product structure noted between synonymous sequences, we wished to investigate whether codon choice is an evolutionary constraint on protein structure and, if so, to what degree and in what manner. To test for evolutionary conservation we combined the theories of codon optimality and cotranslational folding, our hypothesis being that the choice of codon directly affects the translation efficiency of the ribosome; consequently, different codons give the nascent polypeptide emerging from the ribosome varying amounts of time to explore the fold space. By assigning optimality scores to each codon, we can search for regions of fast and slow translation; then, by looking for correlations within aligned sets of structurally similar proteins, we can identify sites where codon choice is of importance. While the 10-week project focussed mainly on methodology and implementation, my preliminary results indicate that a large percentage of proteins are under selective pressure with regard to codon choice. I found that in most sets of proteins there is a greater amount of correlation than would be expected by chance, which crucially suggests that there is additional structural information contained within the mRNA that is lost once translation occurs.
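
The per-site scoring step can be sketched as follows; the optimality scores here are invented placeholders, not the values used in the project.

```python
# Map each codon of a coding sequence to an optimality score and smooth
# the resulting profile with a sliding window to expose fast- and
# slow-translating stretches.
import numpy as np

optimality = {  # hypothetical codon -> optimality scores
    "CTG": 0.9, "TTA": 0.2, "GCC": 0.8, "GCG": 0.4,
    "AAA": 0.7, "AAG": 0.6, "GAT": 0.65, "GAC": 0.55,
}

def optimality_profile(cds, window=3):
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    scores = np.array([optimality.get(c, 0.5) for c in codons])  # 0.5 default
    kernel = np.ones(window) / window
    return np.convolve(scores, kernel, mode="valid")  # sliding-window mean

cds = "CTGGCCAAAGATTTAGCGAAGGACCTGGCC"  # toy coding sequence
print(np.round(optimality_profile(cds), 2))
# Dips in the profile mark putative slow-translation regions; the project
# then asks whether such dips recur at equivalent sites across aligned,
# structurally similar proteins.
```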

For additional information on the background, methodology and implementation of this research, please feel free to view the presentation slides at:  http://www.slideshare.net/AlistairMartin/evolut-26981404