For my turn at journal club I decided to look at this paper by Liu et. al. from FSU.
The paper, efficiently titled ‘A mathematical framework for protein structure comparison’, is motivated by the famously challenging problem of structure comparison. Historically, and indeed presently, the most respected methods of structural classification (dividing up the protein universe by structural similarity) have required a significant amount of manual comparison. This is essentially because we have no formal consensus on what similarity between protein structures truly means and how it should be measured. There are computational challenges to efficient alignment as well but, without a formal framework and a well-defined metric, structure comparison remains an ill-posed problem.
The solution for the authors was to represent each protein structure as a continuous, elastic curve and define the deformation between any two curves as the distance between their structures. We consider, hypothetically at first, a geometric space where each point represents a 3D curve (or potential protein structure). One problem with studying protein structures as geometric objects in Euclidean 3D space is that we really want to ignore certain shape-preserving transformations, such as translations, rotations, scaling and re-parametrization. Ideally therefore, we’d like our geometric space to represent curves unique up to a combination of these transformations. That is, if a protein structure is picked up, turned upside down, then put down somewhere else, it should still be treated as the same shape as it was before. Let’s call this space the shape space S.
Key to the formulation of this space is the representation of a protein structure by its square root velocity function (SRVF). We can represent a protein structure by a continuous function β, which maps the unit interval onto 3D space: β: [0,1] → R3. The SRVF of this curve is then defined as:
where β'(t) is the derivative of β. q(t) contains both the speed and the direction of β but, since it is the derivative, it is invariant to any linear translation of β. So, simply by using the SRVF representation of a curve, we have eliminated one of the shape-preserving transformations. We can eliminate another, rescaling, by requiring β(t) to have unit length: ∫|β'(t)|2dt = ∫|q(t)|2dt = 1.
The space of q‘s is not the shape space S as we still have rotations and re-parametrizations to account for but it is a well-defined space (a unit sphere in the Hilbert space L2 if you were wondering). We call this space the preshape space C. The main advantage of the SRVF representation of curves is that the standard metric on this space explicitly measures the amount of deformation, in terms of stretching and bending, between two curves.
All that is left to do is to eliminate the effects of an arbitrary rotation and re-parametrization. In fact, instead of working directly in S, we choose to remain in the preshape space C as its geometry is so well-defined (well actually, it’s even easier on the tangent space of C but that’s a story for another time). To compare two curves in C we fix the first curve, then find the optimal rotation and parametrization of the second curve so as to minimise the distance between them. Computationally, this is the most expensive step (although almost every other global alignment will need this step and more) and consists of firstly applying singular value decomposition to find the rotation then a dynamic programming algorithm to find the parametrization (this is the matching function between the two backbone structures… you could add secondary structure/sequence constraints here to improve the relevance of the alignment).
Now we can compare q1 and q2 in C. Not only can we calculate the distance between them in terms of the deformation from one to the other, we can also explicitly visualise this optimal deformation as a series of curves between q1 and q2, calculated as the geodesic path between the curves in C:
where θ = cos-1( ∫ <q1,q2>dt) is the distance between q1 and q2 in S.
The geodesic path between protein 1MP6, the left-most structure, and protein 2K98, the right-most structure.
As a consequence of this formal framework for analysing protein structures it’s now possible, in a well-defined way, to treat structures as random variables. Practically, this means we can study populations of protein structures by calculating their mean structure and covariances.