Tag Archives: Protein Structure Prediction

Slow and steady improvements in the prediction of one-dimensional protein features

What do you do when you have a big, complex problem whose solution is not necessarily trivial? You break the problem into smaller, easier to solve parts, solve each of these sub-problems and merge the results to find the solution of the original, bigger problem. This is an algorithm design paradigm known as the divide and conquer approach.

In protein informatics, we use divide and conquer strategies to deal with a plethora of large and complicated problems. From protein structure prediction to protein-protein interaction networks, we have a wide range of sub and sub-sub problems whose solutions are supposed to help us with the bigger picture.

In particular, prediction of the so called one-dimensional protein features are fundamental sub-problems with a wide range of applications such as protein structure modelling, homology detection, functional characterization and others. Here, one-dimensional protein features refer to secondary structure, backbone dihedral and C-alpha angles, and solvent accessible surface area.

In this week’s group meeting, I discussed the latest advancements in prediction of one-dimensional features as described in an article published by Heffernan R. and colleagues in Scientific Reports (2015):

“Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning.”

In this article, the authors describe the implementation of SPIDER2, a deep learning approach to predict secondary structure, solvent accessible surface area, and four backbone angles (the traditional dihedrals phi and psi, and the recently explored theta and tau).

“Deep learning” is the buzzword (buzz-two-words or buzzsentence, maybe?) of the moment. For those of you who have no idea what I am talking about, deep learning is an umbrella term for a series of convoluted machine learning methods. The term deep comes from the multiple hidden layers of neurons used during learning.

Deep learning is a very fashionable term for a reason. These methods have been shown to produce state-of-the-art results for a wide range of applications in several fields, including bioinformatics. As a matter of fact, one of the leading methods for contact prediction (previously introduced in this blog post), uses a deep learning approach to improve the precision of predicted protein contacts.

Machine learning has already been explored to predict one-dimensional protein features, showing promising (and more importantly, useful) results. With the emergence of new, more powerful machine learning techniques such as deep learning, previous software are now becoming obsolete.

Based on this premise, Heffernan R. and colleagues implemented and applied their deep learning approach to improve the prediction of one-dimensional protein features. Their training process was rigorous: they performed a 10-fold cross validation using their training set of ~4500 proteins and, on top of that, they also had two independent test sets (a ~1200 protein test set and a set based on the targets of CASP11). Proteins in all sets did not share more than 25% (30% sequence identity for the CASP set) to any other protein in any of the sets.

The method described in the paper, SPIDER2, was thoroughly compared with state-of-the art prediction software for each of the one-dimensional protein features that it is capable of predicting. Results show that SPIDER2 achieves a small, yet significant improvement compared to other methods.

It is just like they say, slow and steady wins the race, right? In this case, I am not so sure. It would be interesting to see how much the small increments in precision obtained by SPIDER2 can improve the bigger picture, whichever your bigger picture is. The thing about divide and conquer is that if you become marginally better at solving one of the parts, that doesn’t necessarily imply that you will improve the solution of the bigger, main problem.

If we think about it, during the “conquer” stage (that is, when you are merging the solution of the smaller parts to get to the bigger picture), you may make compromises that completely disregard any minor improvements for the sub-problems. For instance, in my bigger picture, de novo protein structure prediction, predicted local properties can be sacrificed to ensure a more globally consistent model. More than that, most methods that perform de novo structure prediction already account for a certain degree of error or uncertainty for, say, secondary structure prediction. This is particularly important for the border regions between secondary structure elements (i.e. where an alpha-helix ends and a loop begins). Therefore, even if you improve the precision of your predictions for those border regions, the best approach for structure prediction may still consider those slightly more precise border predictions as unreliable.

The other moral of this story is far more pessimistic. If you think about it, there were significant advancements in machine learning, which led to the creation of ever-more-so complicated neural network architectures. However, when we look back to how much improvement we observed when these highly elaborate techniques were applied to an old problem (prediction of one-dimensional protein features), it seems that the pay-off wasn’t as significant (at least as I would expect). Maybe, I am a glass half-empty kind of guy, but given the buzz surrounding deep learning, I think minor improvements is a bit of a let down. Not to take any credit away from the authors. Their work was rigorous and scientifically very sound. It is just that maybe we are reaching our limits when it comes to applying machine learning to predict secondary structure. Maybe when the next generation of buzzword-worthy machine learning techniques appear, we will observe an even smaller improvement to secondary structure prediction. Which leaves a very bitter unanswered question in all our minds: if machine learning is not the answer, what is?

Predicted protein contacts: is it the solution to (de novo) protein structure prediction?

So what is this buzz I hear about predicted protein contacts? Is it really the long awaited solution for one of the biggest open problems in biology today? Has protein structure prediction been solved?

Well, first things first. Let me give you a quick introduction to this predicted protein contact business (probably not quick enough for an elevator pitch, but hopefully you are not reading this in an elevator).

Nowadays, the scientific community has become very good at sequencing things (and by things I mean genetic things, like whole genomes of a bunch of different people and organisms). We are so good at it that mountains of sequence data are now available: genes, mRNAs, protein sequences. The question is what do we do with all this data?

Good scientists are coming up with new and creative ideas to extract knowledge from these mountains of data. For instance, one can build multiple sequence alignments using protein sequences for a given protein family. One of the ways in which information can be extracted from these multiple sequence alignments is by identifying extremely conserved columns (think of the alignment as a big matrix). Residues in these conserved positions are good candidates for being functionally important for the proteins in that particular family.

Another interesting thing that can be done is to look for pairs of residues that are mutating in a correlated fashion. In more practical terms, you are ascertaining how correlated is the information between two columns of a multiple sequence alignment; how often a change in one of them is countered by a change in the other. Why would anyone care about that? Simple. There is an assumption that residues that mutate in a correlated fashion are co-evolving. In other words, they share some sort of functional dependence (i.e. spatial proximity) that is under selective pressure.

Ok, that was a lot of hypotheticals, does it work? For many years, it didn’t. There were lots of issues with the way these correlations were computed and one of the biggest problems was to identify (and correct for) transitivity. Transitivity is the idea that you observe a false correlation between residues A and C because residues A,B and residues B,C are mutating in a correlated fashion. AS more powerful statistical methods were developed (borrowing some ideas from mechanical statistics), the transitivity issue has seemingly been solved.

The newest methods that detect co-evolving residues in a multiple sequence alignment are capable of detecting protein contacts with high precision. In this context, a contact is defined as two residues that are close together in a protein structure. How close? Their C-betas must be 8 Angstroms or less apart. When sufficient sequence information is available (at least 500 sequences in the MSA), the average precision of the predicted contacts can reach 80%.

This is a powerful way of converting sequence information into distance constraints, which can be used for protein structure modelling. If a sufficient number of correct distance constraints is used, we can accurately predict the topology of a protein [1]. Recently, we have also observed great advances in the way that models are refined (that is, refining a model that contains the correct topology to atomic, near-experimental resolution). If you put those two things together, we start to look at a very nice picture.

So what’s the catch? The catch was there. Very subtle. “When sufficient sequence information is available”. Currently, there is an estimate that only 15% of the de novo protein structure prediction cases present sufficient sequence information for the prediction of protein contacts. One potential solution would be to sit and wait for more and more sequences to be obtained. Yet a potential pitfall of sitting and waiting is that there is no guarantee that we will have sufficient sequence information for a large number of protein families, as they may as well present less than 500 members.

Furthermore, scientists are not very good at sitting around and waiting. They need to keep themselves busy. There are many things that the community as whole can invest time on while we wait for more sequences to be generated. For instance, we want to be sure that, for the cases where there is a sufficient number of sequences, that we get the modelling step right (and predict the accurate protein topology). Predicted contacts also show potential as a tool for quality assessment and may prove to be a nice way of ascertaining whether you have confidence that a model with correct topology was created. More than that, model refinement still needs to improve if we want to make sure that we get from the correct topology to near-experimental resolution.

Protein structure prediction is a hard problem and with so much room for improvement, we still have a long way to go. Yet, this predicted contact business is a huge step in the right direction. Maybe, it won’t be long before models generated ab initio are considered as reliable as the ones generated using a template. Who knows what promised the future holds.

References:

[1] Kim DE, Dimaio F, Yu-Ruei Wang R, Song Y, Baker D. One contact for every twelve residues allows robust and accurate topology-level protein structure modeling. Proteins. 2014 Feb;82 Suppl 2:208-18. doi: 10.1002/prot.24374. Epub 2013 Sep 10.

Hypotheses and Perspectives onto de novo protein structure prediction

Before I start with my musings about my work and the topic of my D. Phil thesis, I would like to direct you to a couple of previous entries here on BLOPIG. If you are completely new to the field of protein structure prediction or if you just need to refresh your brain a bit, here are two interesting pieces that may give you a bit of context:

A very long introductory post about protein structure prediction

and

de novo Protein Structure Prediction software: an elegant “monkey with a typewriter”

Brilliant! Now, we are ready to start.

In this OPIG group meeting, I presented some results that were obtained during my long quest to predict protein structures.

Of course, no good science can happen without the postulation of question-driving hypotheses. This is where I will start my scientific rant: the underlying hypotheses that inspired me to inquire, investigate, explore, analyse, and repeat. A process all so familiar to many.

As previously discussed (you did read the previous posts as suggested, didn’t you?), de novo protein structure prediction is a very hard problem. Computational approaches often struggle to search the humongous conformational space efficiently. Who can blame them? The number of possible protein conformations is so astronomically large that it would take MUCH longer than the age of the universe to look at every single possible protein conformation.

If we go back to biology, protein molecules are constantly undergoing folding. More so, they manage to do so efficiently and accurately. How is that possible? And can we use that information to improve our computational methods?

The initial hypothesis we formulated in the course of my degree was the following:

“We [the scientific community] can benefit from better understanding the context under which protein molecules are folding in vivo. We can use biology as a source of inspiration to improve existing methods that perform structure prediction.”

Hence came the idea to look at biology and search for inspiration. [Side note: It is my personal belief that there should be a back and forth process, a communication, between computational methods and biology. Biology can inspire computational methods, which in turn can shed light on biological hypotheses that are hard to validate experimentally]

To direct the search for biological inspiration, it was paramount to understand the limitations of current prediction methods. I have narrowed down the limitations of de novo protein structure prediction approaches to three major issues:

1- The heuristics that rely on sampling the conformational space using fragments extracted from know structures will fail when those fragments do not encompass or correctly describe the right answer.

2- Even when the conformational space is reduced, say, to fragment space, the combinatorial problem persists. The energy landscape is rugged and unrepresentative of the actual in vivo landscape. Heuristics are not sampling the conformational space efficiently.

3- Following from the previous point, the reason why the energy landscape is unrepresentative of the in vivo landscape is due to the inaccuracy of the knowledge-based potentials used in de novo structure prediction.

Obviously, there are other relevant issues with de novo structure prediction. Nonetheless, I only have a limited amount of time for my D.Phil and those are the limitations I decided to focus on.

To counter each of these offsets, we have looked for inspiration in biology.

Our understanding from looking at different protein structures is that several conformational constraints are imposed by alpha-helices and beta-strands. That is a consequence of hydrogen bond formation within secondary structure elements. Unsurprisingly, when looking for fragments that represent the correct structure of a protein, it is much easier to identify good fragments for alpha-helical or beta-strand regions. Loop regions, on the other hand, are much harder to be described correctly by fragments extracted from known structures. We have incorporated this important information into a fragment library generation software in an attempt to address limitation number 1.

We have investigated the applicability of a biological hypothesis, cotranslational protein folding, into a structure prediction context. Cotranslational protein folding is the notion that some proteins begin their folding process as they are being synthesised. We further hypothesise that cotranslational protein folding restricts the conformational space, promoting the formation of energetically-favourable intermediates, thus steering the folding path towards the right conformation. This hypothesis has been tested in order to improve the efficiency of the heuristics used to search the conformational space.

Finally, following the current trend in protein structure prediction, we used evolutionary information to improve our knowledge-based potentials. Many methods now consider correlated mutations to improve their predictions, namely the idea that residues that mutate in a correlated fashion present spatial proximity in a protein structure. Multiple sequence alignments and elegant statistical techniques can be used to identify these correlated mutations. There is a substantial amount of evidence that this correlated evolution can significantly improve the output of structure prediction, leading us one step closer to solving the protein structure prediction problem. Incorporating this evolution-based information into our routine assisted us in addressing the lack of precision of existing energy potentials.

Well, does it work? Surprisingly or not, in some cases it does! We have participated in a blind competition: the Critical Assessment for protein Structure Prediction (CASP). This event is rather unique and it brings together the whole structure prediction community. It also enables the community to gauge at how good we are at predicting protein structures. Working with completely blind predictions, we were able to produce one correct answer, which is a good thing (I guess).

All of this comes together nicely in our biologically inspired pipeline to predict protein structures. I like to think of our computational pipeline as a microscope. We can use it to prod and look at biology. We can tinker with hypotheses, implement potentials and test them, see what is useful for us and what isn’t. It may not be exactly what get the papers published, but the investigative character of our structure prediction pipeline is definitely the favourite aspect of my work. It is the aspect that makes me feel like a scientist.

Protein Structure Prediction, my own metaphorical microscope…

de novo Protein Structure Prediction software: an elegant “monkey with a typewriter”

In this week’s OPIG group meeting, I discussed the inner-works and the algorithm behind ROSETTA, one of the most well-known software for de novo protein structure prediction.

Before we even attempt to understand how ROSETTA works, let us start with a theorem.

Theorem: given an infinite number of monkeys with typewriters and an infinite amount of time, they are very likely to recreate the works of William Shakespeare.

Monkey with a typewriter… Time to write that Shakespeare!

Well, let us be a little more modest and attempt to recreate just a phrase of old Bill, instead of his whole works:

“The fool doth think he is wise, but the wise man knows himself to be a fool.”

Well, if we exclude spaces and punctuation marks, that leaves us 58 positions in our phrase (the length of the quote). Considering we have 26 possible letters for each position, we would expect to generate this phrase at random once every of 26^58 times. Wow!

That means that we need to evolve from monkeys (pun intended) and appeal to our over-developed encephalon!

In order to steer our Monkey typewriter, we can reduce this problem to a Global Optimisation problem. In a Global Optimisation problem, we define a function f (named an objective function) which we want to minimise for a given set of parameters x. Bare in mind that if we want to maximise a given function f, we can define g = -f

In a global optimisation problem, we are interested in finding the values of X that minimise the function f(X).

Now, all we need is to define an objective function in order to guide our Monkey typewriter towards the right answer.

Let us define the following objective function: given our Shakespearean phrase and a sequence of 58 letters, the value of the objective function equals the number of letters that are different between the phrase and the sequence of letters.

We can now proceed to define a slightly more refined Monkey Typewriter:

1- Start with a random sequence of letters.
2- WHILE sequence != shakespearean_phrase:
3-________ Select a random position in the sequence.
4-________ Assign a new letter to that position.
5-________ IF score of new sequence < score of old sequence:
6-__________________ Accept the change.
7-________ ELSE:
8-__________________ Discard the change.

This way we can steer our Monkeys and reduce the time it would take to generate our Shakespearean phrase to a more feasible time.

Now, let’s talk about protein structure prediction (PSP). More specifically, let us talk about de novo protein structure prediction (different flavours of protein structure prediction have been discussed previously here).

One of the great ideas behind the creators of ROSETTA, was to use a combination of two different techniques to address the big problems of protein structure prediction:

1- Problem number #1 of PSP is the size of the conformational space. A protein can be represented by it’s backbone atoms, which, in turn, can be reconstructed from a sequence of torsion angles. A set of 3 torsion angles can be used to represent every protein residue. Therefore, for a protein with 100 residues, we would have a total of 300 angles. If we approximate each angle to assume one of 360 values (degrees), that gives us 360^300 possible conformations (not huge at all, han?).

One of the main ideas behind ROSETTA was to reduce the search space by using fragments extracted from known structures. The use of fragments restricts the possible angles to a set of values that are known to occur in nature. Therefore, instead of looking at 360^300 possible angles, we deal with a much more feasible search space.

The name ROSETTA is based on the Rosetta Stone, an archaeological artefact that allowed modern civilisation to interpret and convert between different alphabets. In reality, ROSETTA can be seen as a very elegant monkey typewriter. ROSETTA uses sequence and structure similarity to define a structural alphabet. For every single position in our protein sequence, we have a set of fragments extracted from now protein structures to represent that position. Originally, each position would be represented by 25 fragments (letters?). If you combine the different pieces of known structures in the right order, you will get your Shakespearean Phrase in the end (the correct Protein Structure!).

2- Well, we still have a pretty big conformational space considering we have 25 fragments per position (approximately 25^100 possible conformations, for a protein with 100 residues). The second technique employed by ROSETTA is Simulated Annealing.

Simulated Annealing is a Global Optimisation heuristic. It attempts to find a good enough solution to the problem of minimising a given function f. It is very similar to our Monkey Typewriter algorithm. The main difference is that Simulated Annealing implements some tricks to avoid local minima entrapment. In simpler terms, if we ONLY accept favourable changes (Line 5 of Monkey Typewriter pseudo-code), once we reach a local minimum, we get trapped. No possible change would lead to an improvement, yet we are still far from finding the global minimum.

In order to mitigate that entrapment effect, Simulated Annealing defines a probability of accepting an unfavourable change. This probability is higher at the beginning of the simulation and it becomes lower and lower as the simulation progresses. This process is usually referred to as “cooling down”.

Ok! So we reduced our PSP problem to an elegant Monkey Typewriter. We have our Monkeys working to create the best possible Shakespeare, in a pretty clever and sophisticated manner. Well, we should be able to create some fine piece of literature, correct?

Not quite!

There are still several problems with this whole pipeline. I will mention a few:

When you define your structural alphabet, you may not have the right fragment to represent a certain position. This would be the same as trying to get to a Shakespearean phrase without using vowels for the first 10 letters or only using consonants in the middle of the sentence. It would never happen…
Despite the many efforts to define a very good objective function, no current software presents a function that truly mimics the behaviour of an energy function. This implies that we have a vague idea of how the Shakespearean phrase should look like, but we cannot precisely pinpoint where each letter goes.
No matter how elegant our Monkey typewriter becomes, the combinatorial problem still persists. We are still dealing with 25^100 possible conformations and it is impossible to try every single conformation.
The objective function, if plotted in a graph, would look completely hideous (unlike the picture above). We are talking about a gigantic multi-dimensional surface, filled with local minima that confuse and entrap our simulations. Combine that with the fact that our objective function is not accurate and you waste most of your computing power into generating solutions that are completely useless.
Another common technique to address the previous limitations is to increase the number of Monkeys in order to speed up the search process. If you use thousands and thousands of Monkeys (multiple runs of ROSETTA), each individual Monkey will get to a local minimum (decoy = something that looks like a phrase). In recent years, tens of thousands of decoys are generated in order to predict a single structure. A new problem arises, because out of these tens of thousands of phrases, we cannot tell apart Hamlet from Twilight. We don’t know which Monkeys got close to the right answer. All we know is that for some cases some of them did.

In conclusion, de novo Protein Structure Prediction still has a long way to go.

Kinetic Modelling of Co-translational Protein Folding (Journal Club)

Following up on last week’s entry, this post will explore the same topic: polypeptide chains assuming native-like conformations as they are extruded from the ribosome, or for the less intimate with the concept, co-translational protein folding.

Before addressing some important questions concerning co-translational protein folding, I would like to make a parenthesis: I want to dedicate a paragraph or two to talk about time.

Biological processes are dynamic. They are events that occur over a period of time. For instance, one can quantify the effect of mutations propagated and accumulated over millions of years of evolution. One can also quantify the femtoseconds in which subtle conformational changes occur in photoreceptor proteins like rhodopsin, when they respond to light. Time is fundamental to understand and model any sort of biological event.

Albeit it might seem obvious to the reader that time is so crucial to amass biological knowledge, those of us more theoretically inclined (bioinformaticians, computational biologists, biostatisticians, mathematical biologists and so on and so forth) are usually presented with models that tend to over-simplify reality. Surprisingly enough, there are many over-simplistic models that neglect the effect of time in order to “better” represent whatever they claim to model. Take Protein Docking for instance. The biological process at hand presents a complicated dynamic. There is a kinetic equilibrium, in which a vast amount of protein and ligand molecules interact, associating into complexes and dissociating. Nonetheless, Protein Docking is traditionally reduced to the binding affinity between a pair of molecules. As one might say, this is only a problem if I can present a solution… Luckily, Protein Docking is not my subject of expertise, so I will leave this question open to more tenacious minds than my own.

One of the areas in which I am truly interested in is the co-translational aspect of protein folding. If one performs a quick Google Images search, using the terms “Protein Synthesis” or “Protein Translation”, the results tell a very interesting story. The vast majority of nascent protein chains are represented as fully elongates peptide chains. In a majority of pictures, the growing peptides do not even present secondary structure. They are mostly represented by long, unfolded, almost linear polymers.

Now, any first year Biochemistry student learns about something called Hydrophobicity (or hydrophilicity depending on whether you are a glass half empty or half full type of person). It is biochemistry-introductory-text-book stuff that some residues are polar and some residues are apolar, and hence will hide from water, forming a hydrophobic core. That (hydrophobicity) is one of the main driving forces of protein folding.

Hence, most of the images that appear in our Google Images search are not very representative. They are plain wrong. It is simple physics that the growing peptide chains will form secondary and tertiary structures during the process of protein synthesis. One has to remember that this process is dynamic, it is happening over time. Under these circumstances, time should not be neglected. The time scale at which extrusion occurs is slow enough to allow the nascent chain to probe conformations and simply abide to the laws of physics. A fully elongated, completely unfolded and denatured peptide chain would not exist during protein synthesis. These nascent chains would adopt intermediate conformations simply as a result of apolar residues trying to hide from water.

Ok. Now, the BIG question that can be raised is whether those intermediate conformations actually resemble the native state of the fully elongated protein. I do not want to incur in Baby Kicking, but one thing that evolution has taught us is that cells have evolved to be highly efficient systems. There is no room for wasted energy. It makes sense to hypothesize that over millions of years, the cellular machinery has adapted to explore these intermediate conformations in order to make the process of protein folding more efficient.

Over the past couple of years, substantial evidence has been amassed that codon usage and the degeneracy of the genetic code could be exploited by cells to ensure that protein folding occurs accurately and efficiently. There are many theoretical ways that such exploitation could occur: the codon translation speed could facilitate the formation of certain intermediates that are beneficial for protein folding, that increase stability or that prevent protein aggregation. There is even a biomedical impact given that some observed pathologies have been associated with synonymous codon mutations that may lead to misfolded proteins.

In the paper I presented during this journal club [1], O’Brien and colleagues have devised and described a very interesting kinetic model for protein translation. Their model was used to describe possible scenarios in which both fast and slow translation speed codons are coordinators of co-translational protein folding. Please note that, in this context, co-translational protein folding is perceived as an enrichment of intermediate conformations of the nascent chains, which resemble the native structure of the fully elongated protein.

In the model described in the paper, they opted for a probabilistic approach instead of an analytical (differential equations) approach. The time is modelled by the use of probabilities. The authors derived a formula to quantify the expected proportion of nascent chains of a given length that would be in a Folded intermediate state (one that resembles the native structure). They have managed to express this in terms of a rate of codon translation. Therefore, they stablish a direct relationship between Co-Translational protein folding and codon translation speed.

Their analysis is robust as none of the constants and kinetic rates need to be experimentally derived in order to provide insights about the protein folding process. Overall, I think the way the model was built was quite ingenious and very interesting. I would suggest any interested reader to read the article if they want to understand how the whole modelling was carried out.

Overall, I think the authors present a compelling argument for how cells could explore codon degeneracy and co-translational aspects of protein folding to improve folding efficiency. One of their results present a scenario in which fast translation speed codons can be used to assist in the fold of unstable protein regions, preventing the formation of misfolded intermediates.

One of the many functions of mathematical models is to provide insights into the underlying biology of the phenomena they attempt to model. The lack of any experimental evidence to support this paper’s results does not make it any less interesting. The article presents to the readers a sound and solid mathematical argument as to how co-translational aspects of protein folding could be beneficial for cell efficiency. If anything, they provide interesting hypotheses that might drive experimentalists in the future.

[1] Kinetic modelling indicates that fast-translating codons can coordinate cotranslational protein folding by avoiding misfolded intermediates.

A very long introductory post about protein structure prediction

If you are a protein informatician, bioinformatician, biochemist, biologist or simply a person well informed about science, you probably heard about protein structure prediction. If that is the case, you might be wondering what all the fuss is about, right? If you never heard those terms before, don’t panic! You are about to find out what protein structure prediction is all about!

Based on my group meeting’s presentation last Wednesday, this blog entry will discuss why protein structure prediction is important and the potential limitations of existing methods. I will also discuss how the quality of input may be a potential source for lack of accuracy in existing software.

First, let us remember a little biology: our genetic code encrypts the inner-works of a complicated cellular machinery tightly regulated by other (macro)molecules such as proteins and RNAs. These two types of macromolecules are agents that perform the set of instructions codified by DNA. Basically, RNAs and proteins are involved in a series of processes that regulate cellular function and control how the genetic code is accessed and used.

For that reason, a huge chunk of genomic data can be ~~pretty useless~~ not that useful if considered on their own. Scientists around the globe have invested millions of moneys and a huge chunk of time in order to amass piles and piles of genome sequencing data. To be fair, this whole “gotta sequence ’em all” mania did not provide us with the fundamental answers everybody was hoping for. Cracking the genetic code was like watching an episode of Lost, in which we were left with more questions than answers. We got a very complicated map that we can’t really understand just yet.

For that reason, I feel obliged to justify myself: protein structures ARE useful. If we know a protein structure, we can formulate a very educated guess about that protein’s function. Combine that with empirical data (e.g. where and when the protein is expressed) and it can help us unveil a lot of info about the protein’s role in cellular processes. Basically, it can answer some of the questions about the (genomic) map. ~~If only we could do that with Lost…~~

There is also evidence that knowing a protein’s structure can help us design specific drugs to target and inhibit that protein. Although the evidence of such biomedical application is sparse, I believe that with development of the field, there is a trend for protein structures to become more and more important in drug discovery protocols.

Still, if we look at the number of known genome sequences and known protein structures and at the growth of those figures over the past decade, we look at a drastic scenario:

There is a tendency for the gap between the number of protein sequences and protein structures to increase. Hence, we are getting more and more questions and little to no answers. Observe how the green line (the protein sequences associated with a known or predicted function) is very close to the red line (the number of known protein structures). However, there is a growing gap between the red and the blue line (the number of protein sequences). Source: http://gorbi.irb.hr/en/method/growth-of-sequence-databases/

Well, gathering protein structure data is just as important, if not more important, than gathering sequence data. This motivated the creation of Structural Genomics Consortiums (SGC), facilities that specialize in solving protein structures.

I am sorry to tell you that this is all old news. We have known this for years. Nonetheless, the graph above hasn’t changed. Why? The cost limitations and the experimental difficulties associated with protein structure determination are holding us back. Solving protein structures in the lab is hard and time consuming and we are far from being as efficient at structure determination as we are at genome sequencing.

There is a possible solution to the problem: you start with a protein sequence (a sequential aminoacid list) and you try to predict its structure. This is known as protein structure prediction or protein structure modelling. Well, we have a limited number of building blocks (20) and a good understanding of their physicochemical properties, it shouldn’t be that hard right?

Unfortunately, modelling protein structure is not as simple as calculating how fast a block slides on an inclined plane. Predicting protein structure from sequence is a very hard problem indeed! It has troubled a plethora of minds throughout the past decades, making people lose many nights of sleep (I can vouch for that).

We can attribute that to two major limitations:

1- There are so many possible ways one can combine 20 “blocks” in a sequence of hundreds of aminoacids. Each aminoacid can also assume a limited range of conformations. We are looking at a massive combinatorial problem. The conformational space (the space of valid conformations a protein with a given sequence can assume) is so large that if you could check a single conformation every nanosecond, it would still take longer than the age of the universe to probe all possible conformations.

2- Our physics (and our statistics) are inaccurate. We perform so many approximations in order to make the calculations feasible with current computers that we end up with very inaccurate models.

Ok! So now you should know what protein structure prediction is, why it is important and, more importantly, why it is such a hard problem to solve. I am going to finish off by giving you a brief overview of the two most commons approaches to perform protein structure prediction: template-based modelling (also known as homology modelling) and de novo structure prediction.

There is a general understanding that if two proteins have very similar sequences (namely, if they are homologs), than they will have similar structures. So, we can use known structures of homologs as templates to predict other structures. This is known as homology modelling.

One can do a lot of fancy talk to justify why this works. There is the evolutionary argument: “selective pressure acts on the phenotype level (which can encompass a protein structure) rather than the genotype level. Hence protein structures tend to be more conserved than sequence. For that reason and considering that sequence alone is enough to determine structure, similar sequences will have even more similar structures.”

One can also formulate some sort of physics argument: “a similar aminoacid composition will lead to a similar behaviour of the interacting forces that keep the protein structure packed together. Furthermore, the energy minimum where a certain protein structure sits is so stable that it would take quite a lot of changes in the sequence to disturb that minimum energy conformation drastically.”

Probably the best argument in favour of homology modelling is that it works somewhat well. Of course, the accuracy of the models has a strong dependency on the sequence similarity, but for proteins with more than 40% identity, we can use this method in order to obtain good results.

This raises another issue: what if we can’t find a homolog with known structure? How can we model our templateless protein sequence then? Well, turns out that if we group proteins together into families based on their sequence similarity, more than half of the families would not have a member with known structure. [This data was obtained by looking at the representativeness of the Pfam (a protein family database) on the PDB (a protein structure database).]

Ergo, for a majority of cases we have to perform predictions from scratch (known as free modelling or de novo modelling).

Well, not necessarily from scratch. There is a specific approach to free modelling where we can build our models using existing knowledge. We can use chunks of protein, contiguous fragments extracted from known structures, to generate models. This is known as a fragment-based approach to de novo protein structure prediction. And that is one big name!

One can think of this as a small scale homology modelling, where both the physics and evolutionary arguments should still hold true to some degree. And how do we do? Can we generate good models? We perform appallingly! Accuracies are too low to generate any useful knowledge in a majority of cases. The problem with the rare cases when you get it right is that you have no means to know if you actually got the right answer.

The poor quality of the results can be justified by the 2 biggest limitations discussed above. Yet something else might be in play. In homology modelling, if you use a bad template, you will most certainly get a bad model. In a similar way, using a bad set of fragments will lead you to a very poor final model.

Considering we already have the other two big issues (size of conformational space and accuracy of current potentials) to worry about, we should aim to use the best fragment library we possibly can. This has been the recent focus of my work. An attempt to make a small contribution to solve such a hard problem.

I would love to detail my work on finding better fragments here, but I believe this post is already far too long for anyone to actually endure it and read it until the end. So, congratulations if you made it through!

Oxford Protein Informatics Group

or "OPIG" to friends

Tag Archives: Protein Structure Prediction

Slow and steady improvements in the prediction of one-dimensional protein features

Predicted protein contacts: is it the solution to (de novo) protein structure prediction?

Hypotheses and Perspectives onto de novo protein structure prediction

de novo Protein Structure Prediction software: an elegant “monkey with a typewriter”

“The fool doth think he is wise, but the wise man knows himself to be a fool.”

Kinetic Modelling of Co-translational Protein Folding (Journal Club)

A very long introductory post about protein structure prediction