Author Archives: Eleanor Law

A brief history of usage of the word “decoy” in protein structure prediction

Some concepts in science are counter-intuitive, like the Monty Hall problem or the Mpemba effect. Occasionally, this is also true for terminology, despite the best efforts of scientists to ensure that their work can be explained unambiguously to newcomers. Specifically, in our field of protein structure prediction, the word “decoy” has been used to mean one of many conformations generated by a de novo modelling protocol such as Rosetta, or alternative conformations of loops produced by an ab initio program e.g. Sphinx. Though slightly baffled by this usage when I started working in the field, I have now become so familiar with its strange new meaning that I have to remind myself to explain it in talks to a more general audience, or simply aim to avoid the term altogether. Nonetheless, following a heated discussion over the term in a recent group meeting, I thought it would be interesting to trace the roots of the new meaning.

Let’s begin with a definition from Google:

decoy

noun

noun: decoy; plural noun: decoys

/ˈdiːkɔɪ,dɪˈkɔɪ/

a bird or mammal, or an imitation of one, used by hunters to attract other birds or mammals.

“a decoy duck”

a person or thing used to mislead or lure someone into a trap.

“we need a decoy to distract their attention”

So we start with the idea of something distracting, resembling the true thing but with the intent to deceive. So how has this sense of the word evolved into what we use now? I attempted to dig out the earliest mention of decoy for a computationally generated protein conformation with a Google Scholar search for “decoy protein”, which led to the work of Thomas and Dill published in 1996. Here the authors describe a method of distinguishing the native fold of a protein from the sequence threaded, without gaps, onto alternative structures from the PDB. Discrimination between native and non-native structures had been attempted previously, but Thomas and Dill chose to describe the alternatives as “decoy conformations” or just “decoys”.

A similar problem was commonly attempted over the following years, of separating native structures from sets of computationally generated conformations. Due to the demands of conformer generation at this time, some sets were published themselves in online databases to be used as a resource for training scoring functions.

When it comes to the problem of de novo protein structure prediction, unfortunately it isn’t as simple as picking out the correct answer from a population of incorrect answers. Even among hundreds of thousands of conformations generated by the best methods, the exact native crystal structure will not be found (though a complication here is that the protein is dynamic and will occupy an ensemble of native conformations). Therefore, the aim of any scoring function in structure prediction is instead to select which incorrect conformation is closest to the native structure, hoping to obtain at least the correct fold.

It is for this reason that we move towards the idea of choosing a model from a pool of decoys. Zhu et al. (2003) use “decoy” in precisely this way:

“One strategy for ab initio protein structure prediction is to generate a large number of possible structures (decoys) and select the most fitting ones based on a scoring or free energy function”

This seems to be where the idea of a decoy as incorrect and distracting is lost, and the word takes on its new meaning as one of a large and diverse set of protein-like conformations, a usage which has continued until now.

So is it ever helpful to refer to “decoys” as opposed to “models”? What is communicated by “decoy” that is not achieved by using the word “model”? I think this may come down to the impression which is given by talking about a pool of decoys. People would not generally assume that each decoy on its own has any effective use for prediction of function. There is a sense that this is not the final result of the structure prediction pipeline, and that there is work yet to be done in refining, clustering, and making human judgments on the suitability of the output. Only after these stages would I feel more comfortable using the word “model”, to express the greater confidence we have in the structure (small though that may be in the de novo structure prediction world). However, the inadequacy of “model” does not alone justify this tenuous usage of “decoy”. Perhaps we could speak more often about populations of “conformations”. In any case, “decoy” is widespread in the community, and easily understood by those who are most likely to be reading, reviewing and editing the literature, so I think we will be stuck with it for a while yet.

Biophysical Society 61st Annual Meeting – New Orleans, February 2017

As the sole representative of OPIG attending Biophys 2017 in New Orleans, I had to bear the heavy burden of a long and lonely flight and the fear of missing out on a week of the very grey Oxford winter. Having successfully crossed the border into the US, which was thankfully easier for me than it was for some of our scientific colleagues from around the world, I found my first time attending the conference to be full of very interesting and relevant science. Although the conference also covers a wide variety of experimental techniques and non-protein topics, it is so large and broad that there was more than enough to keep me busy over the five days, with sessions on folding, structure prediction, docking, networks, and molecular dynamics.

There were several excellent talks on the subject of folding pathways, misfolding and aggregation. A common theme was the importance of the kinetic stability of the native state, and the mechanisms by which it may be prevented from reaching a non-native global thermodynamic minimum. This is particularly important for serpins, large protease inhibitors which inactivate proteases by a suicide mechanism. The native and active state can be transformed into a lower energy conformation over long timescales. However, the same transformation also occurs on cleavage near the C-terminal end, which allows insertion of the C-terminal tail into a beta sheet and holds the cleaving protease inactive, so the stored energy is very important for function. Anne Gershenson described recent simulations and experiments to elucidate the order in which substructures of the complete fold assemble. There are many cooperative substructures in this case, and N-terminal helices form at an early stage. The overall topology appears to be consistent with a cotranslational folding mechanism inside the ER, but requires significant rearrangements after translation for adoption of the full native fold.

Cotranslational folding was also discussed by several others including the following: Patricia Clark is now using the YKB system of alternately folding fluorescent protein to find new translation stalling sequences; Anais Cassaignau described NMR experiments to show the interactions taking place between nascent chains and the ribosome at different stalled positions during translation; and Daniel Nissley presented a model to predict a shift in folding mechanism from post-translational to cotranslational due to specific designed synonymous codon changes, which agreed very well with experimental data.

To look more deeply into the evolution of folding mechanisms and protein stability, Susan Marqusee presented a study of the kinetics of folding of RNases, comparing the properties of inferred ancestral sequences to a present day thermophile and mesophilic E. coli. A number of reconstructed sequences were expressed, and it was found that moving along either evolutionary branch from the ancestor to modern day, folding and unfolding rates had both decreased, but the same three-state folding pathway via an intermediate was conserved for all ancestors. However, the energy transition between the intermediate and the unfolded state has evolved in opposite directions even while the kinetic stability remains similar. This has led to the greater thermodynamic stability seen in the modern day thermophile compared to the mesophile at higher temperatures and concentrations of denaturant.

Panel C shows that kinetic stability (low unfolding rate) seems to be selected for in both environments. Panel D shows that the thermodynamic stability of the intermediate (compared to the unfolded state) accounts for the differences in thermodynamic stability of the native state, when compared to the common ancestor (0,0). Link to paper

There were plenty of talks discussing the problems and mechanisms of protein aggregation, with two focussing on light chain amyloidosis. Marina Ramirez-Alvarado was investigating how fibrils begin to grow and showed using microscopy that both soluble light chains and fibrils (more slowly) are internalised by heart muscle cells. They can then be exposed at the cell surface and become a seed to recruit other soluble light chains to form fibrils. Shannon Esswein presented work on the enhancement of VL-VL dimerisation to prevent amyloid formation. The variable domain of the light chain (VL) can pair with itself in a similar orientation to its pairing with VH domains in normal antibodies, or in a non-canonical orientation. Adding disulphide bonds to stabilise these dimers prevented fibril formation, so they carried out a small scale screen of 27 aromatic and hydrophobic ligands to find those which would favour dimer formation by binding at the interface. Sulfasalazine was detected in this screen and was also shown to significantly reduce fibril formation, so it could be used as a template for future drug design.

A ligand stabilises the dimer, so fewer light chains are present as monomers, slowing the rate of the only route by which fibrils can be formed. Link to paper

Among the posters, Alan Perez-Rathke presented loop modelling by DiSGro in beta barrel membrane proteins, which showed that the population of structures generated and scored favourably after relaxation at pH 7 led to an open pore more often than at pH 5, consistent with experimental observations. There were two posters on the topic of prediction of membrane protein expression in bacteria and yeast presented by students of Bill Clemons, who also gave a great talk. Shyam Saladi has carefully curated datasets of successes and failures in expression in E. coli and trained a linear SVM on features such as RNA secondary structure and transmembrane segment hydrophobicity to predict the outcome for unknown proteins. This simple approach (preprint available here) achieved an area under the ROC curve of around 0.6 on a separate test set, and using more complex machine learning techniques is likely to improve this. Samuel Schulte is adapting the same method for prediction of expression in yeast.

Overall, it was a great conference and it was nice to hear about plenty of experimental work alongside the more familiar computational work. I would also highly recommend New Orleans as an excellent place to find great food, jazz and sunshine!

A beginner’s guide to Rosetta

Rosetta is a big software suite, and I mean really big. It includes applications for protein structure prediction, refinement, docking, and design, and specific adaptations of these applications (and others) to a particular case, for example protein-protein docking of membrane proteins to form membrane protein complexes. Some applications are available in one of the hassle-free servers online (e.g. ROSIE, Robetta, rosetta.design), which might work well if you’ve got just a few tests you would like to try using standard parameters and protocols. However, it’s likely that you will want to download and install a version if you’re interested in carrying out a large amount of modelling, or using an unusual combination of steps or scoring function. This is not a trivial task, as the source code is a 2.5G download, then your machine will be busy compiling for some time (around 5 hours on two cores on my old laptop). Alternatively, if the protocols and objects you’re interested in are part of PyRosetta, this is available in a pre-compiled package for most common operating systems and is less than 1G.

This brings me to the different ways to use Rosetta. Most applications come as an executable which you can find in Rosetta/main/source/bin/ after completing the build. There is documentation available on how to use most of these, and on the different flags which can be used to input PDB structures and parameters. Some applications can be run using RosettaScripts, which uses an XML file to define the protocol, including scoring functions, movers and other options. In this case, Rosetta/main/source/bin/rosetta_scripts.* is run, which will read the XML and execute the required protocol.

An example RosettaScript, used for the MPrelax protocol
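If you want to launch a RosettaScripts run from Python rather than typing the command by hand, a minimal sketch might look like the following. The function name, file names and the `linuxgccrelease` suffix in the usage note are only illustrative assumptions (the suffix depends on your platform and compiler); the flags `-parser:protocol`, `-in:file:s` and `-nstruct` are the usual ones for supplying the XML protocol, the input structure and the number of output structures.

```python
def build_rosetta_scripts_command(executable, xml_path, pdb_path, nstruct=1):
    """Return the argument list for one RosettaScripts run.

    executable: path to the rosetta_scripts binary from your own build
                (hypothetical here; the suffix varies between builds).
    xml_path:   XML file defining the protocol.
    pdb_path:   input structure in PDB format.
    nstruct:    number of output structures to generate.
    """
    if nstruct < 1:
        raise ValueError("nstruct must be at least 1")
    return [
        executable,
        "-parser:protocol", xml_path,  # the XML protocol to execute
        "-in:file:s", pdb_path,        # the input PDB structure
        "-nstruct", str(nstruct),      # how many structures to produce
    ]
```

The returned list could then be handed to `subprocess.run(cmd, check=True)`, for example with `executable` set to something like `Rosetta/main/source/bin/rosetta_scripts.linuxgccrelease` on a Linux build.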

PyRosetta is even more flexible, and relatively easy to use for anyone accustomed to programming in Python. There are Python bindings for the fast C++ objects and movers so that the increased usability is generally not greatly compromised by slower speeds. One of the really handy things about PyRosetta is the link to PyMOL which can be used to view the trajectory of your protein moving while a simulation is running. Just add the following to your .pymolrc file in your home directory to set up the link every time you open PyMOL:

run /PATH/TO/PYROSETTA/PyMOLPyRosettaServer.py

When it comes to finding your way around the Rosetta package, there are a few things it is very useful to know to start with. The demos directory contains plenty of useful example scripts and instructions for running your first jobs. In demos/tutorials you will find introductions to the main concepts. The demos/protocol_capture subdirectory is particularly helpful, as most papers which report a new Rosetta protocol will deposit here the scripts required to reproduce their results. These may not currently be the best methods to approach a problem, but if you have found a research article describing some results which would be useful to get for your system, they are a good starting point to learn how to make an application work. Then the world is your oyster as you explore the many possible options and inputs to change and add!

Tracked changes in LaTeX

Maybe people keep telling you Word is great but you are just too emotionally attached to LaTeX to consider using anything else. It just looks so beautiful. Besides, you would have to leave your beloved Linux environment (maybe that’s just me), so you stick with what you know. You work long and hard for many weeks, finally producing a draft of a paper that gets the all clear from your supervisor to submit to journal X. Eventually you hear back and the reviewers have responded with some good ideas and a few pedantic points. Apparently this time the journal wants a tracked changes version to go with your revised manuscript.

Highlighting every change sounds like a lot of bother, and besides, you’d have to process the highlighted version to generate the clean version they want you to submit alongside it. There must be a better way, and one that doesn’t involve converting your document to Word.

Thankfully, the internet has an answer! Check out this little package changes which will do just what you need. As long as you annotate using \deleted{}, \replaced{} and \added{} along the way, you will have to change just one word of your tex source file in order to produce the highlighted and final versions. It even comes with a handy bash script to get rid of the resulting mess when you’re happy with the result, leaving you with a clean final tex source file.
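The "one word" in question is the package option: as far as I know, loading the package with `final` suppresses the markup, while the default `draft` shows it. If you would rather generate both versions from a script than edit the preamble by hand, here is a rough sketch. The function name is made up for illustration, and it assumes the package is loaded with a plain `\usepackage{changes}` or `\usepackage[...]{changes}` line.

```python
import re

# Matches \usepackage{changes} or \usepackage[some,options]{changes}
_CHANGES_LINE = re.compile(r"\\usepackage(?:\[([^\]]*)\])?\{changes\}")


def set_changes_mode(tex_source, mode):
    """Return tex_source with the changes package set to 'draft' or 'final'.

    Any other options already given to the package are preserved.
    """
    if mode not in ("draft", "final"):
        raise ValueError("mode must be 'draft' or 'final'")

    def _rewrite(match):
        # Split existing options, dropping any previous draft/final setting.
        existing = (match.group(1) or "").split(",")
        options = [o.strip() for o in existing if o.strip()]
        options = [o for o in options if o not in ("draft", "final")]
        return "\\usepackage[" + ",".join([mode] + options) + "]{changes}"

    return _CHANGES_LINE.sub(_rewrite, tex_source)
```

Calling `set_changes_mode(source, "draft")` and `set_changes_mode(source, "final")` on the same file contents gives the highlighted and clean sources respectively, which you can write out under two names and compile as usual.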

The die-hard Word fans won’t be impressed, but you will be very satisfied that you have found a nice little solution that does just the job you want it to. It’s actually capable of much more, including comments by multiple authors, customisation of colours and styles, and an automatically generated summary of changes. I have heard good things about ShareLaTeX for collaboration, but this simple package will get you a long way if you are not keen to start paying money yet.

Co-translational insertion and folding of membrane proteins

The alpha-helical bundle is the most common type of fold for membrane proteins. Their diverse functions include transport, signalling, and catalysis. While structure determination is much more difficult for membrane proteins than it is for soluble proteins, it is accelerating and there are now 586 unique proteins in the database of Membrane Proteins of Known 3D Structure. However, we still have quite a poor understanding of how membrane proteins fold. There is increasing evidence that it is more complicated than the two-stage model proposed in 1990 by Popot and Engelman.

The machinery that inserts most alpha-helical membrane proteins is the Sec apparatus. In prokaryotes, it is located in the plasma membrane, while eukaryotic Sec is found in the ER. Sec itself is an alpha-helical bundle in the shape of a pore, and its structure is able both to allow peptides to pass fully across the membrane, and also to open laterally to insert transmembrane helices into the membrane. In both cases, this occurs co-translationally, with translation halted by the signal recognition particle until the ribosome is associated with the Sec complex.

Voorhees, R. M. et al. (2014) Cell, 157(7), 1632–43

If helices are inserted during the process of translation, does folding only begin after translation is finished? On what timescale are these folding processes occurring? There is evidence that a hairpin of two transmembrane helices forms on a timescale of milliseconds in vitro. Are helices already interacting during translation to form components of the native structure? It has also been suggested that helices may insert into the membrane in pairs, via the Sec apparatus.

There are still many aspects of the insertion process which are not fully understood, and even the topology of an alpha-helical membrane protein can be affected by the last part of the protein to be translated. I am starting to investigate some of these questions by using computational tools to learn more about the membrane proteins whose structures have already been solved.

Journal Club: Spontaneous transmembrane helix insertion thermodynamically mimics translocon-guided insertion

Many methods are available for prediction of topology of transmembrane helices, this being one of the success stories of protein structure prediction with accuracies over 90%. However, there is still disagreement in some areas about the partitioning between the states of being dissolved in water and being positioned across a lipid bilayer. Complications arise because there are so many methods of measuring the thermodynamics of this transition – experimental and theoretical, in vivo and in vitro. It is uncertain what difference the translocon makes to the energetics of insertion – is the topology and conformation of a membrane protein the global thermodynamic minimum or just a kinetic product?

This paper uses three approaches to measure partitioning to test the agreement between different methods. The authors aim to reconcile differences calculated so far for insertion of an arginine residue into the membrane (ranging from +2 to +15 kcal/mol). This is an important question, because many transmembrane helices are only marginally hydrophobic and it is not known how and when they insert in the folding process. Arginine is chosen here because the pKa of 12.5 of the side chain is very high so it will not deprotonate in the centre of a bilayer and complications of protonation and deprotonation do not need to be considered. The same peptide is used for each method, of the form L_nRL_n, and the ratio between the interface and transmembrane states is used to calculate estimates of ΔG. In order to make sure that there were helices with a ΔG close to zero for accurate estimates, they used a range of values of n from 5-8.
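To make the step from a population ratio to ΔG concrete, here is a minimal sketch assuming a simple two-state equilibrium between the interface and transmembrane states. The function name, the default temperature and the sign convention (negative values favour the transmembrane state) are my assumptions for illustration, not details taken from the paper.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def delta_g_from_tm_fraction(p_tm, temperature_k=298.15):
    """Estimate the free energy of moving from the interface to the TM state.

    p_tm is the fraction of peptide in the transmembrane state, assuming
    the remainder (1 - p_tm) is in the interface state. Returns kcal/mol.
    """
    if not 0.0 < p_tm < 1.0:
        raise ValueError("p_tm must be strictly between 0 and 1")
    equilibrium_constant = p_tm / (1.0 - p_tm)
    return -R_KCAL * temperature_k * math.log(equilibrium_constant)
```

This also shows why peptides with ΔG close to zero are wanted: when `p_tm` is near 0 or 1, a small error in the measured population becomes a large error in the logarithm, whereas around `p_tm = 0.5` the estimate is much less sensitive.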

The first method was an insertion assay using reconstituted microsomes, where this helix was inserted into the luminal domain of LepB. A glycosylation site was added at each end of the helix, but glycosylation takes place only on sites inside microsomes. Helices inserted into the membrane are only glycosylated once, whereas secreted helices are glycosylated twice and those which did not go through the translocon are not glycosylated. SDS-PAGE can separate these states by mass, and the ratio between single and double glycosylation gives the partitioning between inserted and interface helices out of those which entered the translocon. As expected, the trend is for longer helices with more leucine to favour the transmembrane state.

Adapted from Figure 4a: The helix, H, either passes through the translocon into the lumen (“S”) resulting in two glycosylations (green pentagons), or is inserted (TM) resulting in one glycosylation.

The second method was also experimental: oriented synchrotron radiation circular dichroism (OSRCD). Here they used just the peptide with one glycine at each end, as this would be able to equilibrate between the two states quickly. Theoretical spectra can be calculated for a helix in each of the two states, and therefore the ratio in which they must be combined to give the measured spectrum for a given peptide gives the ratio of transmembrane and interface states present.

Figure 2b: TM and IP are the theoretical spectra for the transmembrane and interface states, and the peptides fall somewhere in between.
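The idea of combining two reference spectra to match a measured one can be sketched as a one-parameter least-squares fit. This is only an illustration of the principle, not the authors' fitting procedure; the function and argument names are hypothetical, and it assumes all three spectra are sampled at the same wavelengths.

```python
def fit_tm_fraction(measured, tm_reference, ip_reference):
    """Least-squares estimate of the transmembrane fraction f such that

        measured ~ f * tm_reference + (1 - f) * ip_reference

    All three arguments are sequences of intensities at matching wavelengths.
    """
    if not (len(measured) == len(tm_reference) == len(ip_reference)):
        raise ValueError("spectra must have the same number of points")
    # Rearranged: (measured - ip) ~ f * (tm - ip), a line through the origin.
    diffs = [t - i for t, i in zip(tm_reference, ip_reference)]
    denominator = sum(d * d for d in diffs)
    if denominator == 0:
        raise ValueError("reference spectra are identical")
    numerator = sum((m - i) * d
                    for m, i, d in zip(measured, ip_reference, diffs))
    return numerator / denominator
```

The fitted fraction is not clamped to the range 0 to 1, so a value outside that range is a useful warning that the two references do not describe the measured spectrum well.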

Finally, the authors present 4 μs molecular dynamics simulations of the same peptides at 140°C, so that equilibration between the two states would be fast. The extended peptide at the start of the simulation quickly associates with the membrane and adopts a helical conformation. An important observation to note is that the transmembrane state is in fact at around 30° to the membrane normal, to allow the charged guanidinium group of the arginine to “snorkel” up to interact with charged phosphate groups of the lipids. Therefore this state is defined as transmembrane, in contrast to the OSRCD experiments where the theoretical TM spectrum was calculated for a perpendicular helix. This may be a source of some inaccuracy in the propensities calculated from OSRCD.

Figure 2c: Equilibration in the simulation for the L₇RL₇ peptide. Transmembrane and interface states are seen in the partitioning and equilibration phases after the helix has formed.

Figure 3c: As the simulations run, the proportion of helices in the transmembrane state (P_TM) converges to a different value for each peptide.

Overall, the ΔG values calculated from the experiments and from the molecular dynamics (MD) simulations agree very well. In fact, they agree better than those from previous studies of a similar format looking at polyleucine helices, where there was a consistent offset of 2 kcal/mol between the experiment and simulation derived values. The authors are unable to explain why the agreement for this study is better, but they indicate that it is unlikely to be related to any stabilisation by dimerisation in the experimental results, as a 4 μs MD simulation of two helices did not show them forming stable interactions. The difference in insertion energy (ΔΔG) on replacing a leucine with arginine is therefore calculated to be +2.4 to +4.3 kcal/mol by experiment and +5.4 to +6.8 kcal/mol by simulation, depending on the length of the peptide (it is a more costly substitution for longer peptides as the charge is buried deeper). The difference between the experimental and simulation results is accounted for by their disagreement in the polyleucine study.

We thought this paper was a great example of experimental design, where the system was carefully chosen so that different experimental and theoretical approaches would be directly comparable. The outcome is good agreement between the methods, demonstrating that the vastly different values recorded previously seem to be because very different questions were being asked.

Ten Simple Rules for a Successful Cross-Disciplinary Collaboration

The name of our research group (Oxford Protein Informatics Group) already indicates its cross-disciplinary character. In doing research of this type, we can acquire a lot of experience in working across the boundaries of research fields. Recently, one of our group members, Bernhard Knapp, became lead author of an article about guidelines for cross-disciplinary research. This article describes ten simple rules which you should consider if working across several disciplines. They include going to the other lab in person, understanding different reward models, having patience with the pace of other disciplines, and recognising the importance of synergy.

The ten rules article was even picked up by a journalist at “Times Higher Education” and further discussed in the newspaper.

Happy further interdisciplinary work!

Investigating GPCR kink variation

G-protein coupled receptors (GPCRs) are the target of 50-60% of drugs, including many of those involved in the treatment of cancer and cardiovascular disease. Over 100 GPCR crystal structures are now available, but these are for only around 30 different receptors, and there are still hundreds more receptors for which no structure exists. There is huge diversity in the ligands which bind to GPCRs, so it may often be difficult to predict the shape of a binding pocket for a specific receptor of interest, especially if no close relatives have a structure solved.

Helix kinks (see previous blog posts) are a structural feature of GPCRs which are thought to be important for function. An ability to predict their presence and the magnitude of helix direction change is important for obtaining an accurate structure. A kink prediction method has already been used in the context of GPCR structure prediction, which scored the overall structures after replacing kink segments with others from a database. This made it possible to predict the change in a kink angle based on the stability of the whole GPCR structure.

To better inform this kind of modelling, we wanted to investigate specifically how much variation there is in kink angles between GPCRs. To do this we used the tool Kink Finder to measure angles in all of the transmembrane helices of the GPCRs in the GPCRDB, and estimate a confidence interval on those angles. Then we could state whether the variation that we see in GPCR kink angles is greater than what we would expect from measurement error alone.
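One standard way to ask whether a set of measurements varies more than their errors would suggest is a reduced chi-square statistic. The sketch below is a generic illustration rather than the analysis we actually ran: it assumes each kink angle comes with a standard error (derived somehow from its confidence interval), and the function name is hypothetical.

```python
def reduced_chi_square(angles, std_errors):
    """Reduced chi-square of angles about their inverse-variance weighted mean.

    Values near 1 are what measurement error alone would be expected to give;
    values much larger than 1 suggest real variation between structures.
    """
    if len(angles) != len(std_errors) or len(angles) < 2:
        raise ValueError("need at least two angles, each with a standard error")
    if any(s <= 0 for s in std_errors):
        raise ValueError("standard errors must be positive")
    weights = [1.0 / (s * s) for s in std_errors]
    weighted_mean = sum(w * a for w, a in zip(weights, angles)) / sum(weights)
    chi_square = sum(w * (a - weighted_mean) ** 2
                     for w, a in zip(weights, angles))
    return chi_square / (len(angles) - 1)
```

For example, angles of 20.0, 20.5 and 19.5 degrees with standard errors of 2 degrees give a value well below 1, while 10, 40 and 70 degrees with the same errors give a value far above it.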

Each helix appears to show different behaviour. Some helices were very well conserved, but others showed a huge amount of variation. For these helices with very variable angles, it would be interesting to know if this is a change related to sequence differences, or conformational flexibility between more than one preferred conformation. We found an example where significantly different angles were found even in the same receptor. In this case, the kink angle size is related to whether the structure has an agonist or an antagonist bound, so we propose that this is a functionally relevant and flexible kink.

We also carried out the same analysis on helices from other families of membrane and soluble proteins, and found many more highly variable kinks (one example shown below). This shows that they should be a very important consideration when carrying out homology modelling, and that their conformational flexibility could also be important for function in many other contexts.

Sampling Conformations of Antibodies using MOSAICS

Much work has been done to study the conformational changes taking place in antibodies, particularly during the event of binding to an antigen. This has been done through comparison of crystal structures, circular dichroism, and recently with high resolution single particle electron microscopy. The ability to resolve domains within an antibody from single particles without any averaging made it possible to show distributions of properties such as the shape of a Fab domain, measured by the ratio of width to length. Some of the variation in structure seen involves very large scale motions, but it is not known how conformational changes may be transmitted from the antigen binding region to the Fc, and therefore influence effector function. Molecular dynamics simulations have been performed on some large antibody systems, however none have been possible on a time scale which would be able to provide information on the converged distributions of large scale properties such as the angle between the Fab and Fc fragments.

In my short project with Peter Minary, I used MOSAICS to investigate the dynamics of an antibody Fab fragment, using the coarse-grained natural move Monte Carlo approach described by Sam a few weeks ago. This makes it possible to split a structure into units which are believed to move in a correlated way, and propose moves for the components of each region together. The rate of sampling is accelerated in degrees of freedom which may have functional significance, for example the movement of the domains in a Fab fragment relative to one another (separate regions shown in the diagram below). I used ABangle to analyse the output of each sampling trajectory and observe any changes in the relative orientations of the VH and VL domains.

Fab region definitions for MOSAICS

Of particular interest would be any correlations between conformational changes in the variable and constant parts of the Fab fragment, as these could be involved in transmitting conformational changes between remote parts of the antibody. We also hoped to see in our model some effect of including the antigen in the simulation, bound to the antibody fragment as seen in the crystal structure. In the time available for the project, we were able to set up a model representing the Fab fragment and run some relatively short simulations to explore favoured conformational states and see how the setup of regions affects the distributions seen. In order to draw conclusions about the meaning of the results, a much greater number of simulations will need to be run to ensure sampling of the whole conformational space.