Author Archives: Claire Marks

Protein loops – why do we care?

In my DPhil research, I work on the development of new methods for predicting protein loop structures. But what exactly are loops, and why should we care about their structures?

Many residues in a given protein will form regions of regular structure, in α-helices and β-sheets. The segments of the protein that join these secondary structure elements together, and that do not have easily observable regular patterns in their structure, are referred to as loops. This does not mean, though, that loops are only a minor component of a protein structure – on average, half of the residues in a protein are found in loops [1], and they are typically found on the surface of the protein, which is largely responsible for its shape, dynamics and physicochemical properties [2].

Connecting different secondary structures together is often not the only purpose of a loop – many are vitally important to a protein’s function. For example, they are known to play a role in protein-protein interactions, recognition sites, signalling cascades, ligand binding, DNA binding, and enzyme catalysis [3].

As regular readers of the blog are probably aware by now, one of the main areas of research for our group is antibodies. Loops are vital for an antibody’s function, since its ability to bind to an antigen is mainly determined by six hypervariable loops (the complementarity determining regions). The huge diversity in structure displayed by these loops is the key to how antibodies can bind to such different substances. Knowledge of loop structures is therefore extremely useful, enabling predictions to be made about the protein.

Loops involved in protein function: a methyltransferase binding to DNA (top left, PDB 1MHT); the active site of a triosephosphate isomerase enzyme (bottom left, PDB 1NEY); an antibody binding to its antigen (blue, surface representation) via its complementarity determining regions, shown as the coloured loops (centre, PDB 3NPS); the activation loop of a tyrosine kinase has a different conformation in the active (pink) and inactive (blue) forms (top right, PDBs 1IRK and 1IR3); a zinc finger, where the zinc ion is coordinated by the sidechain atoms of a loop (bottom right, PDB 4YH8).

More insertions, deletions and substitutions occur in loops than in the more conserved α-helices and β-sheets [4]. This means that, for a homologous set of proteins, the loop regions are the parts that vary the most between structures. While this variability often makes the protein’s function possible, as in the case of antibodies, it leads to unaligned regions in a sequence alignment, and standard homology modelling techniques therefore cannot be used. This makes prediction of loop structures difficult – it is frequently the loop regions that are the least accurate parts of a protein model.

There are two types of loop modelling algorithm: knowledge-based and ab initio. Knowledge-based methods look for appropriate loop structures from a database of previously observed fragments, while ab initio methods generate possible loop structures without prior knowledge. There is some debate about which approach is best. Knowledge-based methods can be very accurate when the target loop is close in structure to one seen before, but perform poorly when this is not the case; ab initio methods are able to access regions of the conformational space that have not been seen before, but fail to take advantage of any structural data that is available. For this reason, we are currently working on developing a new method that combines aspects of the two approaches, allowing us to take advantage of the available structural data whilst still being able to predict novel structures.

[1] L. Regad, J. Martin, G. Nuel and A. Camproux, Mining protein loops using a structural alphabet and statistical exceptionality. BMC Bioinformatics, 2010, 11, 75.

[2] A. Fiser and A. Sali, ModLoop: automated modeling of loops in protein structures. Bioinformatics, 2003, 19, 2500-2501.

[3] J. Espadaler, E. Querol, F. X. Aviles and B. Oliva, Identification of function-associated loop motifs and application to protein function prediction. Bioinformatics, 2006, 22, 2237-2243.

[4] A. R. Panchenko and T. Madej, Structural similarity of loops in protein families: toward the understanding of protein evolution. BMC Evolutionary Biology, 2005, 5, 10.

Improving the accuracy of CDR-H3 structure prediction

When designing an antibody for therapeutic use, knowledge of the structure (in particular the binding site) is a huge advantage. Unfortunately, obtaining even one of these structures experimentally, for example by X-ray crystallography, is very difficult and time-consuming – researchers have therefore been turning to models.

The ‘framework’ regions of antibodies are well conserved between structures, and therefore homology modelling can be used successfully. However, problems arise when modelling the six loops that make up the antigen binding site – called the complementarity determining regions, or CDRs. For five of these loops, only a small number of conformations have actually been observed, forming a set of structural classes – these are known as canonical structures. The class that a CDR loop belongs to can be predicted from its sequence, making the prediction of these five loops quite accurate. However, this is not the case for the H3 loop (the third CDR of the heavy chain) – there is a much larger structural diversity, making H3 structure prediction a challenging problem.

Antibody structure, showing the six CDR loops that make up the antigen binding site. The H3 loop is found in the centre of the binding site, shown in pink. PDB entry 1IGT.

H3 structure modelling can be considered as a specific case of general protein loop modelling. Starting with the sequence of the loop, and the structure of the remaining parts of the protein, there are three stages in a loop modelling algorithm: conformational sampling, the filtering out of physically unlikely structures, and ranking. There are two types of loop modelling algorithm, which differ in the way they perform the conformational sampling step: knowledge-based methods, and ab initio methods. Knowledge-based methods use databases of known structures to produce loop conformations, while ab initio methods do this computationally, without knowledge of existing structures. My research involves the testing and development of these loop modelling algorithms, with the aim of improving the standard of H3 structure prediction.

A knowledge-based method that I have tested is FREAD. FREAD uses a database of protein fragments that could possibly be used as loop structures. This database is searched, and possible structures are returned depending on the similarity of their sequence to the target sequence, and the similarity of the anchor structures (the two residues on either side of the loop). On a set of 55 unbound H3 loop targets, ranging between 8 and 18 residues long, FREAD (using a database of known H3 structures) produced an average best prediction RMSD of 2.7 Å (the ‘best’ prediction is the loop structure closest to the native of all those returned by FREAD). FREAD is obviously very sensitive to the availability of H3 structures: if no similar structure has been observed before, FREAD will either return a poor answer or fail to find any suitable fragments at all. For this reason there is huge variation in the FREAD results – for example, the best prediction for one target had an RMSD of 0.18 Å, while for another, the best RMSD was 10.69 Å. Fourteen of the targets were predicted with an RMSD of below 1 Å. The coverage for this particular set of targets was 80%, which means that FREAD failed to find an answer for one in five targets.
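
The ‘best prediction RMSD’ and coverage figures quoted above can be illustrated with a small sketch. This is not FREAD code: the function names are hypothetical, and it assumes that every decoy is already in the same coordinate frame as the native structure (for example, because the rest of the antibody is held fixed), so no superposition is performed before the RMSD is calculated.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) points.

    Assumes both sets are already in the same frame (no superposition)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum((a - b) ** 2
                for p, q in zip(coords_a, coords_b)
                for a, b in zip(p, q))
    return math.sqrt(total / len(coords_a))

def best_prediction_rmsd(native, decoys):
    """RMSD of the decoy closest to the native, or None if nothing was returned."""
    if not decoys:
        return None
    return min(rmsd(native, decoy) for decoy in decoys)

def summarise(best_rmsds):
    """Average best RMSD over predicted targets, and coverage over all targets.

    `best_rmsds` holds one entry per target; None marks a target with no prediction."""
    predicted = [r for r in best_rmsds if r is not None]
    coverage = len(predicted) / len(best_rmsds)
    mean_best = sum(predicted) / len(predicted) if predicted else None
    return mean_best, coverage
```

Note that the average is taken only over targets for which something was returned, which is why an average best RMSD should always be read alongside the coverage.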

MECHANO is an ab initio algorithm that we have developed specifically for H3 loop prediction. Loops are built computationally, by adding residues sequentially onto one of the anchors. For each residue, φ/ψ dihedral angles are chosen from a distribution at random – the distributions used by MECHANO are residue-specific, and are a combination of general loop data and H3 loop data. Loop conformations are closed using a modified cyclic coordinate descent algorithm (CCD), where the dihedrals of each residue are changed, one at a time, to minimise the distance between the free end of the loop and its anchor point, whilst keeping the dihedral angles in the allowed regions of the Ramachandran plot. I have tested MECHANO on the same set of targets as FREAD, generating 5000 loop conformations per target: the average best prediction RMSD was 2.1 Å, and the results showed a clear length dependence – this is expected, since the conformational space to explore becomes larger as the number of residues increases. Even though the average best prediction RMSD is better than that of FREAD, only one of the best RMSDs produced by MECHANO was sub-angstrom, compared to 14 for FREAD. Since the MECHANO algorithm does not depend on previously observed structures, predictions were made for all targets (i.e. coverage = 100%).
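
To give a feel for the sampling step, here is a minimal sketch of drawing residue-specific φ/ψ pairs from a mixture of two libraries. It is an illustration only: the way MECHANO actually combines general loop data with H3 data is not described here, so the `h3_weight` mixing parameter, the function name and the dictionary layout are all assumptions.

```python
import random

def sample_dihedrals(sequence, general_lib, h3_lib, h3_weight=0.5, rng=None):
    """Pick one (phi, psi) pair per residue.

    `general_lib` and `h3_lib` map a residue code to a list of observed
    (phi, psi) pairs. With probability `h3_weight` (a hypothetical mixing
    parameter) the H3 library is used, otherwise the general loop library."""
    rng = rng or random.Random()
    angles = []
    for res in sequence:
        # Fall back to the general library if no H3 data exist for this residue.
        use_h3 = bool(h3_lib.get(res)) and rng.random() < h3_weight
        library = h3_lib[res] if use_h3 else general_lib[res]
        angles.append(rng.choice(library))
    return angles
```

A loop built from these angles will almost never be closed, which is why a closure step such as CCD has to follow.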

My current work is focused upon developing a ‘hybrid’ method, which combines elements of the FREAD and MECHANO algorithms. In this way, we hope to make predictions with the accuracy that can be achieved by FREAD, whilst maintaining 100% coverage. In its current form, the hybrid method, when tested on the 55-loop dataset from before, produces an average best prediction RMSD of 1.68 Å, with 16 targets having a best RMSD of below 1 Å – a very promising result! However, possibly the most difficult part of loop prediction is the ranking of the generated loop structures; i.e. choosing the conformation that is closest to the native. This is therefore my next challenge!

Antibody CDR-H3 Modelling with Prime

In a blog post from last month, Konrad discussed the most recent Antibody Modelling Assessment (AMA-II), a CASP-like blind prediction study designed to test the current state-of-the-art in antibody modelling. In the second round of this assessment, participants were given the crystal structures of ten antibodies with their H3 loops missing. The H3 loop is usually found in the centre of the binding site and is largely responsible for the binding properties of the antibody. The groups of researchers were asked to model this loop in its native environment. Modelling this loop is challenging, since it is much more variable in sequence and structure than the other five loops in the binding site.

For eight out of the ten loops, the Prime software from Schrödinger (the non-commercial version of which is called PLOP) produced the most accurate predictions. Prime is an ab initio method, meaning that loop conformations are generated from scratch (unlike knowledge-based methods, which use databases of known loop structures). In this algorithm, described here, a ‘full’ prediction job is made up of consecutive ‘standard’ prediction jobs. A standard prediction job involves building loops from dihedral angle libraries – for each residue in the sequence, random phi/psi angles are chosen from the libraries. Loops are built in halves – lots of conformations of the first half are generated, along with many of the second half, and then all the first halves are cross-checked against the second halves to see whether any of them meet in the middle. If so, then the two halves are melded and a full loop structure is made. All loop structures are then clash-checked using an overlap factor (a cutoff on how close two atoms can get to each other). Finally, the loops are clustered, and a representative structure has its side chain conformations predicted and its energy minimised.
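
The ‘meet in the middle’ and clash-checking ideas can be sketched as follows. This is a toy illustration rather than Prime’s implementation: loops are reduced to lists of points, the `tol` closure tolerance is a made-up parameter, and the overlap factor is assumed to be the ratio of an interatomic distance to the sum of the two atomic radii.

```python
import math
from itertools import product

def join_halves(first_halves, second_halves, tol=0.5):
    """Pair up half-loops whose free ends lie within `tol` of each other.

    Each half is a list of (x, y, z) points; the two meeting points are
    melded into their midpoint to give one full loop."""
    loops = []
    for first, second in product(first_halves, second_halves):
        if math.dist(first[-1], second[0]) <= tol:
            mid = tuple((a + b) / 2 for a, b in zip(first[-1], second[0]))
            loops.append(first[:-1] + [mid] + second[1:])
    return loops

def has_clash(atoms, radii, overlap_factor=0.7, min_separation=2):
    """True if any two atoms at least `min_separation` apart in the list are
    closer than overlap_factor * (sum of their radii)."""
    for i in range(len(atoms)):
        for j in range(i + min_separation, len(atoms)):
            if math.dist(atoms[i], atoms[j]) < overlap_factor * (radii[i] + radii[j]):
                return True
    return False
```

Building the halves separately means that N first halves and M second halves give N × M candidate pairings for the cost of building only N + M fragments.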

A full loop prediction job is made up of a series of standard jobs, with the goal of guiding the conformational search to focus on structures with low energy. The steps are as follows:

Initial – five standard jobs are run, with slightly different overlap factors.
Ref1 – the first refinement stage. The conformational space around the top 10 loops from each standard job of the Initial stage is explored further by constraining the distance between Cα atoms.
Fixed – the top 10 loops of all those generated so far are passed to this series of stages. To begin with, the first and last residues of the loop are excluded from the prediction and the rest of the loop is re-modelled. The top 10 loops after this are then taken to the second Fixed stage, where two residues at each end of the loop are kept fixed. This is repeated five times, with the number of fixed residues at each end of the loop being increased by one each time.
Ref2 – a second refinement stage, which is the same as the first, except tighter distance constraints are used.
Final – all the loop structures generated are ranked according to their energy, and the lowest energy conformation is chosen as the final prediction.

In a recent paper, Prime was used to predict the structures of 53 antibody H3 loops (using the dataset of a previous RosettaAntibody paper). 91% of the targets were predicted with sub-2 Å accuracy, and 81% of predictions were sub-angstrom. Compared to RosettaAntibody, which achieved 53% and 17% for predictions below 2 Å and 1 Å respectively, this is very impressive. For AMA-II, however, where each group was required to give five predictions, and some poor models were included in each group’s top five, it is apparent that ranking loop conformations is still a major challenge in loop modelling.

Journal Club: Random Coordinate Descent

The paper I chose to present at last week’s group meeting was “Random Coordinate Descent with Spinor-Matrices and Geometric Filters for Efficient Loop Closure”, by Pieter Chys and Pablo Chacón.

Loop closure is an important step in the ab initio modelling of protein loops. After a loop is initially built, normally by randomly choosing φ/ψ (phi/psi) dihedral angles from a distribution (Step 1 in the figure below), it is probably not ‘closed’ – i.e. the end of the loop does not meet the rest of the protein structure on the other side of the gap. Waiting for the algorithm to produce closed initial conformations would be horribly inefficient, so it’s much better to have some method of closing the initial loop structures computationally.

The main steps in the ab initio prediction of protein loops.

Loop closure methods can be classified into three different types:

Analytical methods: the exact solution to the loop closure problem is calculated. The difficulty with this approach is that it becomes increasingly complicated the more degrees of freedom (i.e. dihedral angles) you have.
Build-up methods: the loop is built residue-by-residue to construct an approximately closed loop which can then be refined. Basically, the loop is guided to the closed position as it is being built.
Iterative methods: do just what they say on the tin – the loop is closed gradually through a series of iterations.

Of course, science is never simple, and loop closure algorithms often cannot be classified into just one of the above categories. Cyclic coordinate descent (CCD), the method on which the random coordinate descent algorithm introduced in this paper is based, is a mix of analytical and iterative methods. Starting from one anchor residue (the anchors being the residues on either side of the loop), the loop is initialised. The anchor residue from the other side is then added to the end of the ‘open’ loop structure. This residue is therefore present twice: the ‘fixed’ anchor residue (the true structure) and the ‘mobile’ anchor residue (the one added to the loop structure). Then, starting from the end of the loop that is attached to the rest of the protein, the dihedral angles are changed sequentially to try and minimise the distance between the fixed and mobile anchor residues. The angle change that would minimise this distance is calculated analytically. Once the distance is within a particular cut-off value, the loop is considered to be closed and this is then the final structure.
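
The analytical step can be sketched for a simplified chain of points, where every bond is treated as a rotatable axis. This is an assumption made for brevity; a real backbone rotates only about the φ and ψ bonds. For a rotation by θ about a unit axis u, the summed squared distance between mobile and fixed atoms has the form a − b cos θ − c sin θ, so the best angle is atan2(c, b). All function names below are hypothetical.

```python
import math

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def scale(a, s): return tuple(x * s for x in a)
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])

def rotate(point, pivot, axis, theta):
    """Rotate `point` by `theta` radians about the unit `axis` through `pivot` (Rodrigues)."""
    v = sub(point, pivot)
    c, s = math.cos(theta), math.sin(theta)
    k = dot(axis, v) * (1 - c)
    w = cross(axis, v)
    return tuple(p + vi * c + wi * s + ai * k for p, vi, wi, ai in zip(pivot, v, w, axis))

def optimal_angle(pivot, axis, mobile, fixed):
    """Angle about `axis` that minimises the summed squared mobile-fixed distances."""
    b = c = 0.0
    for m, f in zip(mobile, fixed):
        v = sub(m, pivot)
        r = sub(v, scale(axis, dot(v, axis)))  # component of v perpendicular to the axis
        w = sub(f, pivot)
        b += dot(w, r)
        c += dot(w, cross(axis, r))
    return math.atan2(c, b)

def closure_error(chain, targets):
    """RMS distance between the last len(targets) chain points and the targets."""
    n = len(targets)
    return math.sqrt(sum(math.dist(m, f) ** 2 for m, f in zip(chain[-n:], targets)) / n)

def ccd_close(chain, targets, tol=0.1, max_iter=500):
    """Cycle along the chain, rotating everything downstream of each bond by its optimal angle."""
    chain, n_end = list(chain), len(targets)
    for _ in range(max_iter):
        if closure_error(chain, targets) <= tol:
            return chain, True
        for i in range(len(chain) - n_end - 1):
            length = math.dist(chain[i], chain[i + 1])
            if length == 0.0:
                continue  # degenerate bond, no axis defined
            axis = scale(sub(chain[i + 1], chain[i]), 1.0 / length)
            theta = optimal_angle(chain[i], axis, chain[-n_end:], targets)
            chain[i + 2:] = [rotate(p, chain[i], axis, theta) for p in chain[i + 2:]]
    return chain, closure_error(chain, targets) <= tol
```

Because each rotation is the exact minimiser for its own axis, the closure error can never increase from one step to the next, and bond lengths are preserved throughout.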

Random coordinate descent (RCD) is based upon CCD, but with a number of alterations and additions:

Instead of iterating through each dihedral angle sequentially along the loop backbone, angles are chosen randomly
A spinor-matrix approach is used – this reduces loop closure times
Various geometric filters are added at various points in the algorithm – either before, during or after loop closure.
‘Switching’ – if loop building fails, then the direction of loop building is changed to the opposite – for example, if the structure is being grown from the N-anchor, but doesn’t pass through the filters, then the loop is discarded and the next loop will be grown from the C-anchor. This should mean that the favoured loop closure direction naturally dominates.
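
The switching rule in the last point can be sketched in a few lines. This is only an illustration of the control flow: `try_build` is a hypothetical callback standing in for the whole build, close and filter routine, returning a loop on success and None on failure.

```python
def generate_with_switching(n_loops, try_build, max_attempts=10_000):
    """Collect `n_loops` loops, flipping the build direction after every failure.

    `try_build(direction)` is a hypothetical callback taking "N" or "C"."""
    direction = "N"
    loops, counts = [], {"N": 0, "C": 0}
    for _ in range(max_attempts):
        if len(loops) == n_loops:
            break
        loop = try_build(direction)
        if loop is None:
            # Failed: discard and grow the next loop from the opposite anchor.
            direction = "C" if direction == "N" else "N"
            continue
        loops.append(loop)
        counts[direction] += 1
    return loops, counts
```

Since the direction only changes on failure, whichever anchor gives the higher success rate ends up being used for most of the loops.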

The different geometric filters are as follows:

A grid clash filter, which checks for clashes between the loop residues and the rest of the protein structure
A loop clash filter, which checks for internal clashes between loop residues
An adaptive Ramachandran filter, which restrains the dihedral angles to the allowed regions of the Ramachandran plot.
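
As an illustration of the first filter, a grid-based clash check might look like the sketch below. It is not the authors’ code: the class name, the 4 Å cell size and the 3 Å clash cutoff are all assumptions. The point of the grid is that each loop atom is only compared with protein atoms in its own and neighbouring cells, rather than with every atom in the structure.

```python
import math
from collections import defaultdict

class ClashGrid:
    """Bins fixed protein atoms into cubic cells for fast neighbour lookup."""

    def __init__(self, atoms, cell=4.0):
        self.cell = cell
        self.cells = defaultdict(list)
        for atom in atoms:
            self.cells[self._key(atom)].append(atom)

    def _key(self, point):
        return tuple(math.floor(c / self.cell) for c in point)

    def clashes(self, point, cutoff=3.0):
        """True if any stored atom lies within `cutoff` of `point`."""
        if cutoff > self.cell:
            raise ValueError("cutoff must not exceed the cell size")
        kx, ky, kz = self._key(point)
        # Only the 27 cells around the query point can contain a clashing atom.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    for atom in self.cells.get((kx + dx, ky + dy, kz + dz), ()):
                        if math.dist(atom, point) < cutoff:
                            return True
        return False
```

The same structure could be reused for the loop clash filter by inserting loop atoms into a second grid as they are built.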

The Ramachandran filter is a good idea, since loop closure can change the dihedral angles of a structure significantly, moving them into disallowed regions. φ (phi) angles are restricted to the range between -175° and -40°, and ψ (psi) angles are restricted between -60° and 175° – this is basically the top left part of the Ramachandran plot. There are two exceptions: the φ angle of proline is fixed, and the dihedral angles of glycine residues are not restricted at all. When placed inside the loop closure routine, the filter is ‘adaptive’ – if the calculated optimum angle is outside of the allowed region, the filter calculates the maximum possible rotation that would still be allowed. When these angle changes become too small, however, the restriction is removed entirely and the angle is allowed to change freely.
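
One possible reading of the adaptive behaviour is sketched below. The angle ranges and the proline and glycine exceptions come from the description above; the `min_step` threshold below which the restriction is lifted is a made-up value, and the sketch assumes the current angle already lies inside the allowed range.

```python
PHI_RANGE = (-175.0, -40.0)
PSI_RANGE = (-60.0, 175.0)

def wrap(angle):
    """Map an angle in degrees onto [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def filtered_rotation(residue, dihedral, current, delta, min_step=1.0):
    """Return the rotation (degrees) actually applied to a 'phi' or 'psi' angle.

    `delta` is the optimal rotation proposed by the closure step."""
    if residue == "GLY":
        return delta  # glycine is not restricted at all
    if residue == "PRO" and dihedral == "phi":
        return 0.0  # proline phi is kept fixed
    lo, hi = PHI_RANGE if dihedral == "phi" else PSI_RANGE
    if lo <= wrap(current + delta) <= hi:
        return delta
    # Largest rotation in the requested direction that stays inside the range.
    limited = (hi - current) if delta > 0 else (lo - current)
    if abs(limited) < min_step:
        return delta  # change too small to be useful: lift the restriction
    return limited
```

For example, an alanine φ of -60° asked to rotate by +50° would be limited to +20°, which takes it to the edge of the allowed range at -40°.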

By testing different combinations of filters in different places, the authors decided upon a final RCD algorithm. This version includes the grid clash filter during loop closure, and the Ramachandran filter applied both before and during loop closure. They then compare their method to some other loop closure algorithms – their method produces good results, outperforming all except a method called ‘direct tweak’ – the only other method tested that includes clash detection during loop closure. From this, the authors conclude that this is a key factor in generating accurate loop conformations. They also report that RCD is 6 to 17 times faster than direct tweak.

Overall, then, the authors of this paper have introduced an accurate and fast loop closure algorithm which outperforms most other methods. Currently, my research is focussed upon developing a new antibody-specific ab initio loop modelling method, and some of the concepts used in this paper would definitely be worth investigating further. Watch this space!

Antibody Modelling: CDR-H3 Structure Prediction

As regular readers of this blog will know (I know you’re out there somewhere!), one of the main focusses of OPIG at the moment is antibody structure. For the last ten weeks (as one of my short projects for the Systems Approaches to Biomedical Science program of the DTC) I have been working on predicting the structure of the CDR-H3 loop.

So, a quick reminder on antibody structure: antibodies, which have a characteristic shape reminiscent of the letter ‘Y’, consist of two identical halves, each containing a heavy and a light chain. Heavy chains are made up of four domains (three constant domains, CH1, CH2 and CH3; and one variable domain, VH), while light chains have two (one constant domain, CL; and one variable domain, VL). The variable domains of both the heavy and light chain together are known as the Fv region; most naturally occurring antibodies have two. At the ends of these Fv regions are six loops, known as the complementarity determining regions, or CDRs. There are three CDRs on each of the VH and VL domains; those located on the VL domain are labelled L1, L2 and L3, while those found on the VH domain are labelled H1, H2 and H3. It is these loops that form the most variable parts of the whole antibody structure, and so it is these CDRs that govern the binding properties of the antibody. Of the six CDRs, by far the most variable is the H3 loop, found in the centre of the antigen binding site. A huge range of H3 lengths has been observed, commonly between 3 and 25 residues but occasionally much longer. This creates a much larger structural diversity when compared to the other CDRs, each of which has at most 8 different lengths. It is the H3 loop that is thought to contribute the most to antigen binding properties. Being able to model this loop is therefore an important part of creating an accurate model, suitable for use in therapeutic antibody design.

Predicting the structure of the loop requires three steps: sampling, filtering and ranking. There are two types of loop modelling method, which differ in the way they perform the sampling step: knowledge-based methods, and ab initio methods. Knowledge-based methods, or database methods, rely upon databases of known loop structures that can be searched in order to find fragments that would form feasible structures when placed in the gap. Whilst predictions are made relatively quickly in this way, one disadvantage is that the database of fragments may not contain anything suitable, and in this situation no prediction would be made. Ab initio (or conformational searching) methods, on the other hand, do not rely upon a set of previously known loop structures – loop conformations are generated computationally, normally by sampling dihedral angles from distributions specific to each amino acid. The loops generated in this way, however, are not ‘closed’, i.e. the loop does not attach to both anchor regions, and therefore some sort of loop closure method must be implemented. The assumption is made that the native loop structure should represent the global minimum of the protein’s free energy. Ab initio methods are generally much slower than knowledge-based ones, and their accuracy is dependent on loop length (long loops are harder to predict using this method); however, unlike the database methods, an answer will always be produced.

3juyE

For my project, I have examined the performance of FREAD (a knowledge-based method) and MECHANO (an ab initio method) when predicting the structure of the H3 loop. At the moment, FREAD produces better results than MECHANO, although we hope to improve the predictions made by both. By optimising the performance of both methods, we hope to create a ‘hybrid’ loop modelling method, thereby exploiting the advantages of both approaches. Since I’ve decided that this is the project I want to continue with, this will be the aim of my DPhil!

Oxford Protein Informatics Group

or "OPIG" to friends