Proteins are fascinating. They are ubiquitous in living organisms, carrying out all kinds of functions: from structural support to unbelievably powerful catalysis. And yet, despite their ubiquity, we are still bemused by their functioning, not to mention by how they came to be. As computational scientists, our research at OPIG is mostly about modelling proteins in different forms. We are a very heterogeneous group that leverages approaches of diverse scale: from modelling proteins as nodes in a complex interaction network, to full atomistic models that help us understand how they behave.
Given the enormous diversity of approaches, you may ask: how much can we simplify the model of a protein so that it is still useful? The standard computational chemistry answer would probably be “coarse-grained models”: approaches that treat entire aminoacids (with an average of 20 atoms per aminoacid) as one, or a few beads with an effective potential. Basically, this cuts down the number of variables manipulated by the computer, reducing the effort and achieving a sensible balance between accuracy and computational cost that is useful to study some long-timescale phenomena.
In this post I will briefly describe a clear contestant to the position of simplest conceivable model of a protein: the hydrophobic-polar lattice protein model (Lau and Dill, Macromolecules, 1989). This model manages to capture a lot of what we consider “protein-like”, while reducing the number of parameters to, well, only the primary structure. Although it is technically also a coarse-grained model, it is not what comp chem people usually refer to when mentioning “coarse-grained approaches”, often centered about the Martini force field and its variants!
Let’s try to present this method based on some reasoning. If you look at the 3D structure of a protein, you will soon realize that it is held by residues that seem to be next to each other. A typical example is the disulfide bridge, a S-S bond that forms between the thiol groups of two cysteines. Although the tertiary structure is kept by a large number of small interactions, these contacts are the most identifiable component of this stabilization energy. In fact, many methods for protein structure prediction are based in predicting the contacts from the primary sequence using statistical methods, and performing a constrained optimization to find an energy minimum.
The HP model takes this vision beyond. First, it assumes that there are only two possible aminoacids: hydrophobic (H) and polar (P). Second, it considers there is only one possible interaction: when two hydrophobic residues are contiguous in space, there is a stabilizing contact that lowers the energy by -1.0 arbitrary units. The native structure of a protein becomes the conformation of the protein that maximizes the contacts between hydrophobic residues. This captures the importance of contacts in stabilizing the tertiary structure.
If you think this is a very crude approximation, then you may be up for a surprise. In addition to the previous, the HP model also discretizes the space: residues are only allowed to occupy discrete positions on a lattice. A protein conformation becomes a self-avoiding walk on a lattice, arguably the most economical approximation for its structure. The simplest non-trivial example is the HPPH peptide, that is shown below:
This may seem very simple… but it quickly becomes tricky to fold structures of even a handful of residues. This brings up a very interesting question: how difficult is it to find the native structure of a HP peptide? It has been proved that the HP model belongs to the NP-complete class of complexity (Berger and Leighton, J. Comput. Biol., 1998). Plainly stated, this means that there is no efficient algorithm to answer the question “is there any conformation of the peptide X with energy lower than E?”. However, this problem can be verified very easily: any given solution, in the form of a conformation with energy allegedly lower than E, can be trivially checked just by counting contacts. A variant of this problem, “what is the lowest energy conformation of the peptide X?” can be seen to belong to the NP-hard class, where the problems are difficult to verify as well as to solve.
What does this mean? Say you come up with an efficient algorithm to find the lowest energy solution of a HP model – one that runs in polynomial time, as the computer scientists would say. This would be a counter-example to the single most important hypothesis in computer science, P≠NP, whose verification of refutation is one of the six unsolved Millennium Prize Problems. Besides the million dollar prize, and the almost certain Fields medal (provided you are under forty), this algorithm could be adapted to solve many extremely hard problems of enormous interest – scientific, but also economical. In other words, if you come up with such method, do let us know!
In conclusion, even such a primitive model of a protein is known to be among the most difficult classes of problems we know. And yet, most of the proteins synthesized within living organisms reach their native structures diligently in a matter of seconds. This is yet another reason why proteins are fascinating.