
CABS-flex

One of the still open questions in bioinformatics is that of protein flexibility, and it is one in which I am quite interested. Our main source of structural information is X-ray diffraction experiments, in which the crystallised protein is essentially rigid. While there is an ever increasing body of NMR data, and Molecular Dynamics simulations are becoming faster and cheaper, complete information about the dynamics of every PDB structure is a long way off. All-atom MD simulations of a complete protein still take days, a powerful computer, or both. So, naturally, any paper that talks about ways to get flexibility information catches my attention.

The paper that caught my attention was CABS-flex: Server for fast simulation of protein structure fluctuations. There is also a connected paper – Consistent View of Protein Fluctuations from All-Atom Molecular Dynamics and Coarse-Grained Dynamics with Knowledge-Based Force-Field – which benchmarks their method against MD simulations.

The CABS model pops up in a number of other places – there is a CABS-fold server, and the model itself was first described by Kolinski back in 2004.

What does the webserver do?

You give it a coordinate file, wait some hours, and it gives you back a series of (clustered) structures, a Cα trajectory, a residue flexibility profile, and a nice little video. Which is nice. It is, however, a little limited: you can only model a single chain (so no modelling of small-molecule or peptide interactions), there is no way of fixing part of the model so it does not flex, and it can be picky about your PDB files – no chain gaps, no strange amino acids. You can tell it to analyse a PDB chain by just entering the code, but these are frequently rejected for having unacceptable coordinate files.

How does it work?

The CABS model makes a reduced representation of the protein and explores its conformational space by making moves, which are accepted or rejected based on a (comparatively complex) force field. Many moves are attempted, and over a large number of iterations this builds up an ensemble of structures showing the possible conformations of the protein. So it’s a bit like MD, but rather than moving atoms based on calculated forces, you make random moves and accept the sensible ones.

[Figure: the CABS coarse-grained protein representation and its move set]

The model is shown above. The protein structure is reduced to 4 pseudo-atoms per residue. There are 5 types of move, and in each step a random move of each type is tried ~n times (n being the number of amino acids in the protein). These are accepted or rejected based on the force field, and the process is repeated several hundred times. The force field used to accept or reject moves is quite complex – there are short range (sequence dependent and sequence independent), hydrogen bond, and long range (again, sequence dependent and sequence independent) interactions. There are sub-categories within these, and the relative contributions of the interactions can be changed. There is no solvent, so the long range interactions have to provide the forces that keep the protein together, and the hydrogen bond interactions act to maintain the secondary structure elements. More detail can be found in the 2004 paper.
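To make the move/accept idea concrete, here is a minimal Metropolis-style sketch in R. It is emphatically not the CABS move set or force field – the energy function, move size and temperature factor below are invented stand-ins – but it shows how repeated random moves plus an acceptance criterion build up an ensemble.

  # Toy Metropolis-style sampler: random moves on a vector of "coordinates",
  # accepted or rejected according to a made-up energy function (NOT the CABS
  # force field), building up an ensemble of states.
  set.seed(1)
  energy  <- function(x) sum(x^2)       # stand-in for a real force field
  n_res   <- 50                         # number of residues (toy)
  n_steps <- 500                        # "several hundred" iterations
  kT      <- 1.0                        # temperature factor (arbitrary)

  x <- rnorm(n_res)                     # starting "structure"
  ensemble <- matrix(NA, nrow = n_steps, ncol = n_res)

  for (step in 1:n_steps) {
    for (attempt in 1:n_res) {          # ~n moves attempted per step
      i <- sample(n_res, 1)
      proposal    <- x
      proposal[i] <- proposal[i] + rnorm(1, sd = 0.3)   # a random local move
      dE <- energy(proposal) - energy(x)
      if (dE < 0 || runif(1) < exp(-dE / kT)) x <- proposal  # accept "sensible" moves
    }
    ensemble[step, ] <- x               # snapshot -> one member of the ensemble
  }

  fluct <- apply(ensemble, 2, sd)       # per-residue fluctuation profile

In CABS the proposal is one of the five lattice move types and the energy is the knowledge-based force field described above, but the accept/reject logic is the same in spirit.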

Is it true?

The authors justify the method by comparing it to MD simulations taken from a database, and conclude that CABS-flex gives answers as similar to MD simulations as MD simulations using different force fields are to each other. Simply put, the same residues move in their simulations as move in MD simulations. They move more, which backs up their claim that they demonstrate flexibility over a longer time-frame than MD simulations. They do admit that they get the best agreement when they heavily restrict their secondary structure – raising the question of how much of the agreement between all of the methods is down to all the simulations having the same secondary structure.
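As a rough sketch of that kind of comparison, each ensemble can be reduced to a per-residue fluctuation (RMSF) profile and the profiles correlated. The R snippet below does this for two randomly generated "ensembles" of Cα coordinates – placeholders only, since real CABS-flex or MD snapshots would first need to be parsed and superposed.

  # Compare two ensembles by correlating per-residue RMSF profiles.
  # The "ensembles" here are random numbers standing in for superposed
  # Calpha snapshots from CABS-flex and from an MD trajectory.
  set.seed(2)
  n_res <- 100; n_models <- 20
  ens_a <- array(rnorm(n_models * n_res * 3), c(n_models, n_res, 3))
  ens_b <- array(rnorm(n_models * n_res * 3), c(n_models, n_res, 3))

  rmsf <- function(ens) {
    mean_xyz <- apply(ens, c(2, 3), mean)          # mean position of each residue
    dev2     <- sweep(ens, c(2, 3), mean_xyz)^2    # squared deviations per model
    apply(dev2, 2, function(m) sqrt(mean(rowSums(m))))
  }

  cor(rmsf(ens_a), rmsf(ens_b), method = "spearman")  # do the same residues move?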

To conclude, this could be a useful method, particularly in that it can give long time-frame flexibility – provided you treat it, like any other measure of flexibility, with a pinch of salt. It is a bit of a shame that the interface is (perhaps frustratingly) simple, and that you need to pre-process your coordinate files, as many PDB coordinate files will not be accepted as they are.

Conservation of codon optimality

The unidirectional process of translation converts one-dimensional mRNA sequences into three-dimensional protein structures. The reactant, the mRNA, is built from 64 codon variants, while the product, the protein, is constructed from the 20 standard amino acids. This reduction from 64 codons to a mere 20 amino acids reflects the degeneracy of the genetic code: multiple codons encode the same amino acid, and hence any given protein can be encoded by a multitude of synonymous mRNA sequences. Current structural biology research often assumes that these synonymous mRNA sequences are equivalent; the ability of certain proteins to unfold and refold without detrimental effects has led to the assumption that the information pertaining to the folded structure is contained exclusively within the protein sequence. In actuality, mounting experimental evidence shows that synonymous sequences can in fact produce proteins with different properties: synonymous codon switches have been observed to cause a wide range of detrimental effects, such as decreases in enzyme activity, reductions in solubility, and even misfolding.

The ribosome (yellow) passes along the mRNA, turning the sequence of codons into a protein. Under our model, the speed of translation varies from codon to codon, in turn varying the time available to the nascent peptide chain to explore the fold space. In this way codon choice becomes an additional source of structural information.


For my 10-week DTC project within the Deane group, I was tasked with resurrecting the Coding Sequence and Structure database (CSandS; read: Sea Sands) and using it to test for the evolutionary conservation of codon choice over groups of proteins with similar structures. With experimental differences noted in protein product structure between synonymous sequences, we wished to investigate whether codon choice is an evolutionary constraint on protein structure, and if so, to what degree and in what manner. To test for evolutionary conservation we combined the theories of codon optimality and cotranslational folding, our hypothesis being that the choice of codon directly affects the translation efficiency of the ribosome; consequently, different codons give the nascent polypeptide emerging from the ribosome varying amounts of time to explore the fold space. By assigning optimality scores to each codon, we can search for regions of fast and slow translation, and then, by looking for correlations within aligned sets of structurally similar proteins, we can identify sites where codon choice is important. While the 10-week project focussed mainly on methodology and implementation, my preliminary results indicate that a large percentage of proteins are under selective pressure with regard to codon choice. I found that in most sets of proteins there is a greater amount of correlation than would be expected by chance; crucially, this suggests that there is additional structural information contained within the mRNA that is lost once translation occurs.
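The core calculation can be sketched in a few lines of R: assign every codon an optimality score, smooth the scores along the coding sequence to get a translation-speed profile, and correlate profiles between aligned sequences. The scores and sequences below are random placeholders – a real analysis would use measured codon usage or tRNA adaptation values and the CSandS alignments.

  # Toy codon "optimality" profile along a coding sequence.
  # The scores are random placeholders; a real analysis would use measured
  # codon usage or tRNA adaptation values, and real aligned coding sequences.
  set.seed(3)
  bases  <- c("A", "C", "G", "T")
  codons <- apply(expand.grid(bases, bases, bases), 1, paste, collapse = "")
  opt    <- setNames(runif(length(codons)), codons)   # hypothetical optimality scores

  profile <- function(cds, window = 5) {
    cd <- substring(cds, seq(1, nchar(cds) - 2, by = 3), seq(3, nchar(cds), by = 3))
    stats::filter(opt[cd], rep(1 / window, window), sides = 2)  # smoothed speed profile
  }

  cds1 <- paste(sample(codons, 200, replace = TRUE), collapse = "")  # placeholder gene 1
  cds2 <- paste(sample(codons, 200, replace = TRUE), collapse = "")  # placeholder gene 2
  cor(profile(cds1), profile(cds2), method = "spearman", use = "complete.obs")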

For additional information on the background, methodology and implementation of this research, please feel free to view the presentation slides at:  http://www.slideshare.net/AlistairMartin/evolut-26981404

Antibody Modelling: CDR-H3 Structure Prediction

As regular readers of this blog will know (I know you’re out there somewhere!), one of the main focusses of OPIG at the moment is antibody structure. For the last ten weeks (as one of my short projects for the Systems Approaches to Biomedical Science program of the DTC) I have been working on predicting the structure of the CDR-H3 loop.

[Figure: the six CDR loops labelled on an antibody variable region]

So, a quick reminder on antibody structure: antibodies, which have a characteristic shape reminiscent of the letter ‘Y’, consist of two identical halves, each containing a heavy and a light chain. Heavy chains are made up of four domains (three constant domains, CH1, CH2 and CH3, and one variable domain, VH), while light chains have two (one constant domain, CL, and one variable domain, VL). The variable domains of the heavy and light chains together are known as the Fv region; most naturally occurring antibodies have two. At the end of each Fv region are six loops, known as the complementarity determining regions, or CDRs. There are three CDRs on each of the VH and VL domains; those located on the VL domain are labelled L1, L2 and L3, while those found on the VH domain are labelled H1, H2 and H3. These loops form the most variable parts of the whole antibody structure, and so it is the CDRs that govern the binding properties of the antibody. Of the six CDRs, by far the most variable is the H3 loop, found in the centre of the antigen binding site. A huge range of H3 lengths has been observed, commonly between 3 and 25 residues but occasionally much longer. This creates far greater structural diversity than in the other CDRs, each of which has at most 8 different lengths. It is the H3 loop that is thought to contribute the most to antigen binding properties. Being able to model this loop is therefore an important part of creating an accurate model, suitable for use in therapeutic antibody design.

Predicting the structure of the loop requires three steps: sampling, filtering and ranking. There are two types of loop modelling method, which differ in the way they perform the sampling step: knowledge-based methods and ab initio methods. Knowledge-based methods, or database methods, rely upon databases of known loop structures that can be searched to find fragments that would form feasible structures when placed in the gap. Whilst predictions are made relatively quickly in this way, one disadvantage is that the database may not contain anything suitable, in which case no prediction is made. Ab initio (or conformational searching) methods, on the other hand, do not rely upon a set of previously known loop structures: loop conformations are generated computationally, normally by sampling dihedral angles from distributions specific to each amino acid. The loops generated in this way, however, are not ‘closed’, i.e. the loop does not attach to both anchor regions, and therefore some sort of loop closure method must be applied. The assumption is made that the native loop structure represents the global minimum of the protein’s free energy. Ab initio methods are generally much slower than knowledge-based ones, and their accuracy depends on loop length (long loops are harder to predict); however, unlike database methods, an answer will always be produced.
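As a toy illustration of the knowledge-based idea, the R sketch below filters an invented fragment "database" by loop length and anchor separation and ranks the survivors by a crude sequence identity score. Every column name, number and sequence here is made up for illustration – FREAD itself uses a real fragment library and much more careful anchor matching and scoring.

  # Toy knowledge-based loop search: filter an invented fragment "database"
  # by loop length and anchor separation, then rank by crude sequence identity.
  set.seed(4)
  aa <- c("A","C","D","E","F","G","H","I","K","L","M","N","P","Q","R","S","T","V","W","Y")
  frag_db <- data.frame(id          = sprintf("frag%04d", 1:2000),
                        length      = sample(3:25, 2000, replace = TRUE),
                        anchor_dist = runif(2000, 4, 40),   # anchor Calpha separation (toy)
                        stringsAsFactors = FALSE)
  frag_db$sequence <- sapply(frag_db$length, function(L)
    paste(sample(aa, L, replace = TRUE), collapse = ""))

  query_len    <- 10
  query_anchor <- 22.5                  # anchor separation measured on the model
  query_seq    <- "ARDYSSGWFD"          # hypothetical H3 sequence

  seq_identity <- function(a, b) mean(strsplit(a, "")[[1]] == strsplit(b, "")[[1]])

  hits <- subset(frag_db, length == query_len & abs(anchor_dist - query_anchor) < 2)
  hits$score <- sapply(hits$sequence, seq_identity, b = query_seq)
  head(hits[order(-hits$score), ])      # best candidate fragments first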

[Figure: example structure (PDB entry 3juy, chain E)]

For my project, I have examined the performance of FREAD (a knowledge-based method) and MECHANO (an ab initio method) when predicting the structure of the H3 loop. At the moment FREAD produces better results than MECHANO; however, we hope to improve the predictions made by both. By optimising the performance of both methods, we hope to create a ‘hybrid’ loop modelling method, thereby exploiting the advantages of both approaches. Since I’ve decided that this is the project I want to continue with, this will be the aim of my DPhil!

Network Analysis

Why networks?

Individual expression could be thought of as a phenomenon regulated mostly by the individual, but it is also modified by interactions with the surroundings. Can the response of the individual be predicted from the group? (See the following video of an experiment conducted by Asch: https://www.youtube.com/watch?v=FnT2FcuZaYI)


Most common types of network analysis

  • Basic network summary statistics (description)
  • Clustering methods (extracting information)
  • Random graphs (description, inference and modelling of network topology)
  • Machine learning methods (prediction)

Random graphs and network topology

Depending on the structure of the desired network, different random models may be appropriate. For example, if the goal is a sparse, not highly connected network, an Erdős–Rényi (ER) model could be used (this model randomly assigns edges between nodes); if the goal is exactly the opposite (a very highly connected network), a geometric graph could be used (this model randomly assigns node positions in an n-dimensional space and then places edges between nodes closer than a given distance).
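As a small illustration (using the igraph package for R), the two models can be generated and compared in a few lines; the node count, edge probability and radius below are arbitrary choices for the example.

  # A sparse Erdos-Renyi graph versus a denser random geometric graph (igraph).
  library(igraph)
  set.seed(5)
  n <- 200

  g_er  <- sample_gnp(n, p = 0.02)      # ER: each edge present with probability p
  g_geo <- sample_grg(n, radius = 0.15) # geometric: nodes in the unit square,
                                        # edges between nodes closer than the radius

  c(er_edges  = ecount(g_er),  er_density  = edge_density(g_er),
    geo_edges = ecount(g_geo), geo_density = edge_density(g_geo))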

Is there already a random model?

According to our recent results, we suspect there is no suitable null model yet for PPI networks, even though for some virus PPI networks some of the random models seem to fit very well; however, these virus networks are much smaller (around 300 nodes and up to 500 edges) than the networks of model organisms (usually more than 2,000 nodes and 5,000 edges) such as yeast, human, fruit fly and Escherichia coli.
We will soon be publishing our article with details about this.

Progress report: Molecular Dynamics simulations of TCRpMHC

Last time I gave a presentation about different papers describing Molecular Dynamics simulations of TCRpMHC (http://blopig.com/blog/?p=648&preview=true). This time I extended this to more of my own work. The focus of this talk was the motivation for choosing a certain system and the requirements it must fulfil for reliable conclusions. In short, they are:

1) A large number (>100) of experimental values (e.g. some kind of binding measurement, immunogenicity, etc.) for a certain TCRpMHC system. These values should come from:
– the same group
– the same HLA/MHC batches
– the same binding conditions
– in the ideal case in the same manuscript

The values should NOT come from a database, since different experimental groups arrive at extremely different results for the same complexes, as several examples show, e.g. the binding affinity (IC50) values of PKYVKQNTLKLAT to DRB1*01:01 range from 1 nM to 600 nM (IEDB entries).

2) Structural data exactly matching the systematic experimental data mentioned above. If multiple structures are involved, they should:

– be published by the same group (in the ideal case, together with the above-mentioned data)
– be contemporary

 

Data which fulfil the above-mentioned criteria are quite hard to find, since most biologists are mainly interested in fancy but rather anecdotal examples, and rarely in systematic data production.

Once one has found such a system, a large number of Molecular Dynamics simulations can be performed, which will yield systematic differences rather than just differences originating from random events or data bias.

 

 

What is a kink?

If you ask someone who works with membrane protein structures what they think a helix kink is, you will probably get a fairly vague answer. They would probably tell you something like: a kink is a point in the helix where there is a sharp change in the direction of the helix axis, like the helix on the right in the picture below.

Two example helices: one clearly straight and one clearly kinked (the helix on the right).


They might also tell you that kinks frequently (or always) involve a proline, and that they feature the absence of some of the backbone hydrogen bonds that you would otherwise expect to observe in a helix. But that is about as far as the consensus goes.

There are a number of published methods to identify kinks in helices (ProKink, Helanal, MC-Helan and TMKink, to name a few). The details of identifying kinks (and hence the definition of a kink) differ a lot between the methods. That would be fine if they all identified the same helices as kinked, and the same positions in each helix as kinked, but they do not. Even the fraction of helices identified as kinked differs a lot – anything from 6 to 60%. It would be easy if all α-helices were like the two in the figure – clearly kinked or straight. Generally the methods agree on these ‘easy’ helices, but there are lots of helices that are difficult to classify.

Why do we want to identify kinks computationally? Well, kinks are functionally important in membrane proteins. Membrane proteins are medicinally important, and experimentally determining their structures is hard, so modelling their structures is a useful thing to do – and those structures include kinks. Kinks are also a structural motif: if we can write down what we think a kink is, use that definition to identify kinks, and then use information from the kinks we have identified to (reliably) predict the existence of other kinks, we will know that we fully understand the motif.

There are a remarkably large number of perfectly sensible ways to identify a sharp change in the axis of an α-helix, given the coordinates of the backbone atoms. The published methods (mentioned above) are a bit cumbersome to use (they require you to create input files and parse at least one, if not many, output files), so in my experience people tend to make their own methods. That is not just me with Kink Finder – many of the people I have spoken to at conferences who need to identify kinks do not use a published method.

A short summary of ways to identify kinks:

  1. Fit axes to short sections of helix (4-8 residues). Give each residue an angle, calculated between the axis of the section before it and the axis of the section after it. If the largest angle in the helix is greater than a threshold, annotate the helix as kinked, at the residue with the largest angle
  2. Methods similar to (1), but using a measure other than an angle – for example, the curvature of a line fitted to the helix, or the distance between the ith and (i+4)th Cα atoms
  3. Identify sections of the helix that are ‘straight’. Where a helix contains more than one such section, it is kinked
  4. Look at the helices, and classify them by eye

Even for (1), there are lots of ways to calculate a helix axis. You can use least squares (very sensitive to the number of residues used in the fit), fit a cylinder (better, but slower), or use a vector method based on four Cαs. Similarly for (2), you could fit a spline, or just a polynomial. And this is all before you decide (for example) how many residues to fit axes to (6? 7? 8?), what smoothing parameter to use in the spline fit, or what order of polynomial.
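As an example of approach (1), the R sketch below fits a local axis to the six residues either side of each position (using the first principal component of the Cα coordinates as a least-squares line) and reports the angle between the two axes. The Cα trace is randomly generated filler – a real one would be read from a PDB file – and the window size and angle definition are just one of the many possible choices discussed above.

  # Approach (1) in miniature: fit an axis to the six residues before and after
  # each position and report the angle between the two axes.
  set.seed(6)
  ca <- matrix(cumsum(rnorm(3 * 30, sd = 0.5)), ncol = 3)   # fake 30-residue Calpha trace

  axis_of <- function(xyz) {
    v <- prcomp(xyz)$rotation[, 1]      # first principal axis = least-squares line
    v / sqrt(sum(v^2))
  }

  kink_angle <- function(ca, i, w = 6) {
    a <- axis_of(ca[(i - w):(i - 1), ])           # axis of the section before residue i
    b <- axis_of(ca[i:(i + w - 1), ])             # axis of the section after residue i
    acos(min(1, abs(sum(a * b)))) * 180 / pi      # angle in degrees, ignoring sign
  }

  angles <- sapply(7:25, function(i) kink_angle(ca, i))
  c(max_angle = max(angles), at_residue = which.max(angles) + 6)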

How should we classify this helix?

The point is that there are lots of ways to do it, but there is no consensus definition. You can use one method, or two, or more, but they will give you different answers. Because we have no concise definition, it is hard to say which helix classification is ‘right’ and which is ‘wrong’. Take the helix above: it is not perfectly straight, but is it bent enough to be called kinked? Or is the change in axis spread over a number of turns, making it curved rather than kinked?

There are a few ways around this problem of a computer program struggling to ‘correctly’ classify helices. Rather than using a computer, you could manually curate a set, which has been done. However, there are a few issues here. First, the definition of a kink used was a bit vague, and certainly difficult to turn into a computer program. Second, humans are very good at seeing patterns, even when they are not really there. Third, the end aim, at least for me, is to incorporate this into structure prediction, where a computational formulation of a kink is necessary. Another solution is to accept that your program is deficient, put the borderline (i.e. difficult) cases into a third group, and remove them from your analysis. This does help to improve your signal, but it is a little unsatisfactory.

To conclude, there are not two types of helix; there is more of a spectrum, from very straight, through a bit kinked, to very kinked. Classifying them is a hard problem, with lots of different approaches available. And while the methods give different answers (particularly in identifying the position of the kink within the helix) and do not agree on the causes of kinks, there is still work to be done.

Citing R packages in your Thesis/Paper/Assignments

If you need to cite R, there is a very useful function called citation().

> citation()

To cite R in publications use:

  R Core Team (2013). R: A language and environment for statistical
  computing. R Foundation for Statistical Computing, Vienna, Austria.
  URL http://www.R-project.org/.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {R: A Language and Environment for Statistical Computing},
    author = {{R Core Team}},
    organization = {R Foundation for Statistical Computing},
    address = {Vienna, Austria},
    year = {2013},
    url = {http://www.R-project.org/},
  }

We have invested a lot of time and effort in creating R, please cite it
when using it for data analysis. See also ‘citation("pkgname")’ for
citing R packages.

If you want to cite just a package, pass the package name as an argument, e.g.:

> citation(package = "cluster")

To cite the R package 'cluster' in publications use:

  Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik,
  K.(2013).  cluster: Cluster Analysis Basics and Extensions. R package
  version 1.14.4.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {cluster: Cluster Analysis Basics and Extensions},
    author = {Martin Maechler and Peter Rousseeuw and Anja Struyf and Mia Hubert and Kurt Hornik},
    year = {2013},
    note = {R package version 1.14.4 --- For new features, see the 'Changelog' file (in the package source)},
  }

This will give you BibTeX entries that you can copy and paste into your BibTeX reference file. If you are using M$ Word, good luck to you.
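If you want the BibTeX without the surrounding chatter, the toBibtex() function (in the utils package, loaded by default) extracts just the entry, and you can write it straight to a .bib file. The file name below is arbitrary, and the packages must be ones you have installed.

  # Write the BibTeX entries for one or more installed packages to a .bib file
  pkgs <- c("cluster")                      # add any other installed packages here
  bib  <- unlist(lapply(pkgs, function(p) c(toBibtex(citation(p)), "")))
  writeLines(bib, con = "r-packages.bib")   # the file name is arbitrary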

Inside Memoir: MP-T aligns membrane proteins

Although Memoir has received a lot of air-time on this blog, we haven’t gone into a great deal of detail about how it models membrane proteins. Memoir is a pipeline involving a series of programs iMembrane -> MP-T -> Medeller -> Fread, and in this post I’ll explain the MP-T step (I’ll briefly touch on Medeller too).

Let’s first look at the big picture. There are several ways of modelling a protein’s 3D structure. In an ideal world we could specify an extended polypeptide, teach a computer some physics, set it off simulating, and watch the exact folding pathway of a protein. This doesn’t work. A second method is to build up a protein from lots of fragments of unrelated proteins; this is usually what is meant by ‘ab initio’ modelling. The most accurate (and least sophisticated) approach is to find a protein of known structure with a similar sequence, align the sequences, and copy over the coordinates of the aligned residues to make a model for the query protein. This is the approach taken by Memoir and is called homology modelling or comparative modelling.
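As a toy illustration of the copy-the-coordinates idea (and nothing like a full pipeline), the R sketch below walks along a made-up pairwise alignment and copies the template's Cα coordinates wherever both sequences have an aligned residue; the remaining positions are left empty, to be built by something like a loop modelling step.

  # Toy homology modelling: copy template Calpha coordinates at aligned positions.
  # The alignment and coordinates are made up; real pipelines copy full backbones
  # and rebuild the unaligned regions with a loop modelling method.
  template_aln <- strsplit("MK-LVILGA", "")[[1]]
  query_aln    <- strsplit("MKQLV-LGA", "")[[1]]

  set.seed(7)
  template_ca <- matrix(rnorm(3 * sum(template_aln != "-")), ncol = 3)  # one row per template residue
  model_ca    <- matrix(NA_real_, nrow = sum(query_aln != "-"), ncol = 3)

  ti <- 0; qi <- 0
  for (k in seq_along(query_aln)) {
    if (template_aln[k] != "-") ti <- ti + 1
    if (query_aln[k]    != "-") qi <- qi + 1
    if (template_aln[k] != "-" && query_aln[k] != "-")
      model_ca[qi, ] <- template_ca[ti, ]         # copy coordinates for aligned residues
  }
  model_ca    # rows of NA are query residues with no template, still to be built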

The diagram below shows an example of how homology modelling might work. Four membrane protein sequences are aligned (left) and the alignment specifies a structural superposition (right). Assume now that the red structure is unknown: we could make a good model for it just by copying over the aligned parts of the blue, green and yellow structures.

[Figure: four membrane protein sequences aligned (left) and the corresponding structural superposition (right)]
The greatest difficulty in the modelling described above is making an accurate alignment. As sequences become more distantly related they share less and less sequence identity, and working out the optimum alignment becomes challenging. This problem is especially acute for membrane protein modelling: there are so few structures from which to copy coordinates that a randomly chosen query protein has a good chance of having <30% sequence identity to the nearest related structure.

Although alignment is the most important facet of homology modelling, it is not the only consideration. In the above diagram the centres of the proteins are structurally very conserved (so copying coordinates will lead to a good model in this region), but the tops of the proteins differ (the stringy loops don’t sit on top of one another). It is the role of coordinate generation software to decide which coordinates to copy. It turns out that the pattern of a conserved centre and a varying top and bottom is generally true for membrane proteins, and Memoir uses our Medeller coordinate generation software to take advantage of this pattern.

Back then to alignment. The aim of alignment is to work out which amino acids in one protein are related to amino acids in another. All alignment methods have at their heart a set of scores which encode the propensity for one amino acid to mutate to another, and for that mutation to become fixed in a population. These scores form a substitution table (here mutation + fixation = substitution). More sophisticated alignment methods augment these scores in different ways — for example by adding in scoring based on secondary structure, smoothing scores over a window, or estimating a statistical supplement to the score determined from a related set of pre-aligned sequences — but at some level a substitution table is always present. Using a substitution table, the most likely evolutionary relationship between two sequences can be detected and this is reported in the form of an alignment.

So that’s general alignment; now to apply it to membrane proteins. The cell membrane is composed of a lipid bilayer: a sandwich with a hydrophobic filling and hydrophilic crusts. The part of a membrane protein that touches the filling will have different preferences for amino acids (and, more importantly, for substitutions between these amino acids) than the part that touches the crust. Similarly, there are systematic preferences for amino acid substitutions depending on whether part of a protein is buried or exposed, and on which type of secondary structure it assumes. The figure below shows a membrane protein with the different regions of the membrane and the different types of secondary structure annotated.

[Figures: a membrane protein annotated with the different membrane regions and secondary structure types]

 

It is possible to make separate substitution tables for each environment within a membrane protein, where an environment specifies where that part of the protein sits in the membrane, what secondary structure it has, and whether it is accessible or buried. Below is a principal components analysis of the resulting set of tables: each table is represented by a single point, and the axes show the directions of greatest variation between the tables. The plot on the right shows a separation of the points based on whether they are buried (more hydrophobic) or accessible (more hydrophilic). The hydrophobic centre (red circles) and hydrophilic edges (green circles) of the membrane fall into this general pattern. The plot on the left shows that the tables further divide by secondary structure type. In summary, there are systematic substitution preferences in practice as well as in theory, and for membrane proteins it is most important to consider hydrophobicity when aligning two protein sequences.

[Figure: principal components analysis of the environment-specific substitution tables]
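The sort of analysis behind that plot is easy to sketch in R: flatten each substitution table into a long vector, treat each table as one observation, and run a principal components analysis. The environment names and tables below are random placeholders, whereas the real tables are estimated from aligned membrane protein structures.

  # PCA of a set of (toy) environment-specific substitution tables: flatten each
  # 20x20 table into a vector, treat each table as one observation, run prcomp.
  set.seed(8)
  environments <- c("head_buried", "head_exposed", "tail_buried",
                    "tail_exposed", "helix_buried", "helix_exposed")      # invented labels
  tables <- lapply(environments, function(e) matrix(runif(400), 20, 20))  # placeholder tables
  flat   <- t(sapply(tables, as.vector))          # one row of 400 values per environment
  rownames(flat) <- environments

  pca <- prcomp(flat)
  round(pca$x[, 1:2], 2)                          # each table's position on PC1 and PC2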

On then to modelling. The conventional approach to aligning a pair of sequences for homology modelling is to take a set of pre-aligned sequences (a sequence profile) and use them to estimate a supplement to the standard substitution score for aligning the two sequences. This is termed profile-profile alignment. Memoir takes a different approach, using the MP-T program to construct a multiple sequence alignment scored with environment-specific substitution tables. The alignment includes a set of sequences homologous to the pair of interest.

Profile-profile alignment methods and MP-T are very different. It is unclear whether the substitution preferences at a position are better estimated by MP-T’s tables or by the supplements derived from sequence profiles, and the answer probably depends on how well the profiles are made – garbage in, garbage out. Similarly, the MP-T algorithm only determines the upper limit of alignment accuracy; the actual accuracy depends on how the homologous sequences in the alignment are chosen.

In general we find little difference between the fraction of an alignment that MP-T and either HHsearch or Promals (profile-profile alignment methods) get right. However, we do find a difference in the fraction of the alignment that these methods get wrong (part of an alignment can be right, wrong or simply not aligned, so it is possible to get a lower fraction wrong whilst getting the same fraction right). It turns out that on average MP-T gets less of an alignment wrong for simple reasons of combinatorics: for a pair of proteins, the number of possible multiple sequence alignments is much greater than the number of possible profile-profile alignments. This means that, just by chance, the number of incorrectly aligned positions between the two sequences of interest will be lower for MP-T than for a conventional profile-profile alignment method.

Now for a little sales-pitch. The source code for MP-T is freely available and easy to expand (if you have a passing familiarity with Haskell). Only two or three lines of code need to be changed to define a new set of protein environments, and to feed it a substitution table for each environment. I’d be happy to help anyone who wants to try it out.

Viewing ligands in twilight electron density

In this week’s journal club we discussed an excellent review paper by E. Pozharski, C. X. Weichenberger and B. Rupp investigating crystallographic approaches to protein-ligand complex elucidation. The paper assessed and highlighted the shortcomings of deposited PDB structures containing ligand-protein complexes. It then made suggestions for the community as a whole and for researchers making use of ligand-protein complexes in their work.

The paper discussed:

  • The difficulties in protein-ligand complex elucidation
  • The tools, both qualitative and quantitative, for assessing the quality of protein-ligand structures
  • The methods used in their analysis of certain PDB structures
  • Some case studies visually demonstrating these issues
  • Some practical conclusions for the crystallographic community
  • Some practical conclusions for non-crystallographer users of protein-ligand complex structures from the PDB

The basic difficulties of ligand-protein complex elucidation

  • Ligands have less than 100% occupancy – sometimes significantly less – and thus will inherently show up less clearly in the overall electron density
  • Ligands make a small contribution to the overall structure, and thus global quality measures, such as R-factors, will be affected only minutely by the ligand portion of the structure being wrong
  • The original (apo) model needs to be used appropriately: the R-free reflection set from the original apo model should be reused to avoid model bias

The following are the tools available to inspect the quality of agreement between protein structures and their associated data.

  • Visual inspection of the Fo-Fc and 2Fo-Fc maps, using software such as COOT, is essential to assess qualitatively whether a structure is justified by the evidence
  • Use of local measures of quality, for example real-space correlation coefficients (RSCC)
  • Their own tool (Twilight), which makes use of the above as well as resolution as a global quality measure

Methods and results

In a separate publication they had analysed all entries in the PDB containing both ligands and published structure factors. In this sample they showed that 7.6% of ligands had RSCC values of less than 0.6, the arbitrary cut-off they use to determine whether the experimental evidence supports the modelled coordinates.
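For intuition, the RSCC is essentially a correlation coefficient between the observed and model-calculated electron density sampled on a grid around the ligand. The R sketch below fakes the two density vectors just to show the calculation and the 0.6 threshold; real values come from the deposited structure factors and model, via tools such as Twilight.

  # The idea behind RSCC in miniature: correlate observed and model-calculated
  # density over the grid points around a ligand. The values are invented.
  set.seed(9)
  rho_calc <- runif(500)                               # calculated density at 500 grid points
  rho_obs  <- 0.7 * rho_calc + rnorm(500, sd = 0.3)    # noisy "observed" density

  rscc <- cor(rho_obs, rho_calc)                       # a Pearson correlation coefficient
  rscc < 0.6                                           # would this ligand be flagged?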

An incorrectly oriented ligand (a) and its correction (b). In all of these figures, blue is the 2mFo-DFc map contoured at 1σ, and green and red are positive and negative contouring of the mFo-DFc map at 3σ.

In this publication they visually inspected a subset of structures to assess in more detail how effective that arbitrary cutoff is and ascertain the reason for poor correlation. They showed the following:

(i) Ligands incorrectly identified as questionable, i.e. false positives (7.4%)
(ii) Incorrectly modelled ligands (5.2%)
(iii) Ligands with partially missing density (29.2%)
(iv) Glycosylation sites (31.3%)
(v) Ligands placed into electron density that is likely to originate from mother-liquor components
(vi) Incorrect ligands (4.7%)
(vii) Ligands that are entirely unjustified by the electron density (11.9%)

The first point to note in the above data is that the false-positive rate using an RSCC cut-off of 0.6 is 7.4%, which demonstrates that this value alone is not sufficient to accurately identify incorrect ligand coordinates. Within the other categories, all errors can be attributed to one, or a combination, of the following two factors:

  • The inexperience of the crystallographer, who is unable to correctly interpret the data in front of them
  • Wilful denial of the data in front of the crystallographer, so that they can present the result they wanted to see

A ligand incorrectly placed in density for a sulphate ion from the mother liquor (a) and its correction (b).

The paper observed that a disproportionate number of the problematic cases came from glycosylation sites. In some instances these observations had been used to inform the biochemistry of the protein in question. Interestingly, this follows observations from almost a decade ago; however, many of the examples in the Twilight paper were taken from 2008 or later. This indicates that the community as a whole is not reacting to this problem and needs further prodding.

An incomplete glycosylation site inaccurately modelled.

Conclusions and suggestions

For inexperienced users looking at ligand-protein complexes from the PDB:

  • Inspect the electron density map using COOT, if it is available, to determine qualitatively whether there is evidence for the ligand being there
  • If using large numbers of ligand-protein complexes, use a script such as Twilight to find the RSCC value for each ligand, to give some confidence that the ligand is actually present as stated

For the crystallographic community:

  • Improved training of crystallographers, to ensure that errors due to genuine misinterpretation of the underlying data are minimised
  • More submission of electron-density maps; even if they are not made publicly available, they should form part of the initial structure validation
  • Software is easy to use, but its output remains difficult to analyse