Category Archives: Technical

Augmented Modelling with Natural Move Monte Carlo Simulations

In the last group meeting I reported on the progress I have made in developing a protocol for the systematic use of Natural Move Monte Carlo (NMMC) simulations.

Natural Move Monte Carlo simulations
Natural Moves are degrees of freedom that describe the collective motion of groups of residues. In DNA this might be the concerted motion of a double helix; in proteins it could be the movement of a stable secondary structure element such as a beta-sheet. These segments are joined by so-called melting areas. At each simulation step the segments are propagated independently in an MC fashion. The resulting chain breaks are resolved by a chain closure algorithm that acts on the melting areas. This reduces the number of degrees of freedom by several orders of magnitude, so large complexes and conformational changes can be sampled more effectively.

In order to get sensible results, however, the initial decomposition of the system is important. The challenge is to accurately represent the plasticity of the system, while keeping the number of degrees of freedom as small as possible. Detailed insight into the flexibility of the system might be gained from experimental sources such as NMR or computational methods such as MD simulations and Normal Mode Analysis. This can help with defining segments and melting areas. However, there are many systems for which this data is not available. Even if it is, there is no guarantee that the segmentation is correct.

Therefore, I am developing a protocol that allows for the evaluation of a range of different test cases that each reflect a unique set of segments and melting areas.

Augmented Modelling Protocol
This protocol is aimed at the systematic evaluation of NMMC segmentations. It allows researchers to feed experimental information, biological knowledge and educated guesses into molecular simulations and so provides a framework for testing competing hypotheses. The protocol has four steps.

Step 1: Segmentation of the system into low-level segments
The initial segmentation contains all possible areas of flexibility that may play a role in conformational changes in the system of interest. This decision may be influenced by many sources. For now, however, we only consider secondary structure information. Helices and beta strands are treated as potential segments. Unstructured regions such as kinks, loops and random coils are treated as melting areas. For a small fold with four helices we get the segmentation shown in figure 1a.
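As an illustration (and not part of our actual simulation code), a toy secondary-structure string can be split into potential segments and melting areas in a few lines of Python; the string and residue ranges below are made up:

from itertools import groupby

# toy secondary-structure string: H = helix, E = strand, C = coil/loop
ss = "CCHHHHHHCCCHHHHHHHHCCEEEEECC"

segments, melting_areas = [], []
pos = 0
for ss_type, run in groupby(ss):
    length = len(list(run))
    region = (pos + 1, pos + length)              # 1-based residue range
    if ss_type in "HE":
        segments.append((ss_type, region))        # helix/strand -> potential segment
    else:
        melting_areas.append(region)              # coil/loop -> melting area
    pos += length

print("segments:     ", segments)
print("melting areas:", melting_areas)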

Step 2: Formulate test cases
Generate multiple test cases that reflect hypotheses about the mechanism of interest. In this step we try to narrow down the degrees of freedom as much as possible in order to retain sampling efficiency. This is done by selectively deactivating some melting areas that were defined in step 1. For a system with three melting areas that can either be on or off, 2^3 = 8 different test cases may be generated (example shown in figure 1b).


Figure 1 a) Segmentation of a small α-fold. The blue rectangles represent α-helices. The dashed lines indicate the presence of melting areas I, II and III. Each melting area can be switched on or off (1/0). b) Example of a test case in which the first of the three melting areas is switched off. c) The six degrees of freedom along which a segment is propagated.
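The on/off enumeration described in step 2 can be written down explicitly. Here is a minimal Python sketch; the melting-area names are taken from figure 1, but nothing here is tied to our simulation code:

from itertools import product

melting_areas = ["I", "II", "III"]

# every on/off (1/0) combination of the melting areas defined in step 1
test_cases = [dict(zip(melting_areas, state))
              for state in product((1, 0), repeat=len(melting_areas))]

for i, case in enumerate(test_cases, 1):
    active = [name for name, on in case.items() if on]
    print("test case %d: active melting areas = %s" % (i, active or "none"))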

Step 3: Perform simulations
Sample the conformational space of all test cases that were generated in step 2. We generally use the Parallel Tempering or Simulated Tempering algorithms to accelerate the sampling process. These methods rely on the modulation of temperature to overcome energy barriers.
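For completeness, the temperature-swap step at the heart of these methods is the standard replica-exchange (Metropolis) criterion. The sketch below is a generic illustration rather than our simulation code, with arbitrary units and kB = 1:

import math, random

def accept_swap(energy_i, energy_j, temp_i, temp_j, kB=1.0):
    # Metropolis criterion for exchanging replicas i and j:
    # accept with probability min(1, exp[(1/kT_i - 1/kT_j) * (E_i - E_j)])
    delta = (1.0 / (kB * temp_i) - 1.0 / (kB * temp_j)) * (energy_i - energy_j)
    return delta >= 0 or random.random() < math.exp(delta)

# a cold replica stuck in a high-energy state readily swaps with a hot, low-energy one
print(accept_swap(energy_i=-120.0, energy_j=-150.0, temp_i=300.0, temp_j=450.0))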

Step 4: Evaluate results
Score the results against a given control and rank the test cases accordingly. The scoring might be done by comparing experimental distributions of observables with those generated by simulations (e.g. Kullback-Leibler divergence). A test case that reproduces desired expectation values of observables might then be considered as a candidate hypothesis for a certain structural mechanism.
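As an example of this scoring, a Kullback-Leibler divergence between a control distribution and a simulated one can be computed from histograms of any observable. The data below are random placeholders, not results:

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D(P || Q) for two discrete distributions given as histogram counts
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
control_values = rng.normal(25.0, 3.0, 5000)     # placeholder control observable
test_case_values = rng.normal(26.0, 4.0, 5000)   # placeholder output of one test case

bins = np.linspace(0.0, 50.0, 51)
p_control, _ = np.histogram(control_values, bins=bins)
p_test, _ = np.histogram(test_case_values, bins=bins)

print(kl_divergence(p_control, p_test))          # lower = closer to the control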

What’s next?
I am currently working on example uses for this protocol. These include questions regarding aspects of protein folding and the stability of the empty MHC II binding groove.

Remove all LaTeX generated files

I am going to leave this here and bookmark it, because I am fed up of looking this up every time, not finding it and having to `history | fgrep rm`.  To be used if you want to delete all LaTeX generated (and pdfLaTeX) files.

rm *.aux *.out *.toc *.log *.synctex.gz *.bbl *.blg *.pdf

Use at your own risk!
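If you want something slightly less risky, here is a small Python sketch (my own untested variant, not a standard tool) that only deletes files whose basename matches a .tex file in the current directory, so unrelated PDFs and logs survive:

import os

EXTS = (".aux", ".out", ".toc", ".log", ".synctex.gz", ".bbl", ".blg", ".pdf")

# basenames of all .tex files in the current directory
tex_bases = {os.path.splitext(f)[0] for f in os.listdir(".") if f.endswith(".tex")}

for f in os.listdir("."):
    for ext in EXTS:
        if f.endswith(ext) and f[:-len(ext)] in tex_bases:
            os.remove(f)
            break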


Inside Memoir: MP-T aligns membrane proteins

Although Memoir has received a lot of air-time on this blog, we haven’t gone into a great deal of detail about how it models membrane proteins. Memoir is a pipeline involving a series of programs (iMembrane -> MP-T -> Medeller -> Fread), and in this post I’ll explain the MP-T step (I’ll briefly touch on Medeller too).

Let’s first look at the big picture. There are several ways of modelling a protein’s 3D structure. In an ideal world we could specify an extended polypeptide, teach a computer some physics, set it off simulating, and watch the exact folding pathway of a protein. This doesn’t work. A second method would be to build up a protein from lots of fragments of unrelated proteins… this is usually what is meant by ‘ab initio’ modelling. The most accurate (and least sophisticated) approach is to find a protein of known structure with similar sequence, align the sequences, and copy over the coordinates of the aligned residues to make a model for the query protein. This is the approach taken by Memoir and is called homology modelling or comparative modelling.

The diagram below shows an example of how homology modelling might work. Four membrane protein sequences are aligned (left) and the alignment specifies a structural superposition (right). Assume now that the red structure is unknown: we could make a good model for it just by copying over the aligned parts of the blue, green and yellow structures.

[Figure: alignment of four membrane protein sequences (left) and the corresponding structural superposition (right)]
The greatest difficulty in the modelling described above is making an accurate alignment. As sequences become more distantly related they share less and less sequence identity, and working out the optimum alignment becomes challenging. This problem is especially acute for membrane protein modelling: there are so few structures from which to copy coordinates that a randomly chosen query protein has a good chance of having <30% sequence identity to the nearest related structure.

Although alignment is the most important facet of homology modelling it is not the only consideration. In the above diagram the centres of the proteins are structurally very conserved (so copying coordinates will lead to a good model in this region), but the top of the proteins differ (the stringy loops don’t sit on top of one another). It is the role of coordinate generation software to distinguish which coordinates to copy. It turns out that the pattern of a conserved centre and varying top/bottom is generally true for membrane proteins, and Memoir uses our Medeller coordinate generation software to take advantage of this pattern.

Back then to alignment. The aim of alignment is to work out which amino acids in one protein are related to amino acids in another. All alignment methods have at their heart a set of scores which encode the propensity for one amino acid to mutate to another, and for that mutation to become fixed in a population. These scores form a substitution table (here mutation + fixation = substitution). More sophisticated alignment methods augment these scores in different ways — for example by adding in scoring based on secondary structure, smoothing scores over a window, or estimating a statistical supplement to the score determined from a related set of pre-aligned sequences — but at some level a substitution table is always present. Using a substitution table, the most likely evolutionary relationship between two sequences can be detected and this is reported in the form of an alignment.
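To make the role of the substitution table concrete, here is a toy scoring sketch in Python. The table below contains made-up values for a handful of residue pairs; real methods use full 20x20 tables (BLOSUM and friends) plus carefully tuned gap penalties:

SUB = {("L", "L"): 4, ("L", "I"): 2, ("L", "D"): -4,
       ("D", "D"): 6, ("D", "E"): 2, ("I", "I"): 4}

def pair_score(a, b):
    return SUB.get((a, b), SUB.get((b, a), -1))   # symmetric lookup, crude default

def alignment_score(seq_a, seq_b, gap=-5):
    # score two already-aligned sequences of equal length ('-' marks a gap)
    total = 0
    for a, b in zip(seq_a, seq_b):
        total += gap if "-" in (a, b) else pair_score(a, b)
    return total

print(alignment_score("LID-", "LIDE"))   # 4 + 4 + 6 - 5 = 9

Alignment programs, of course, do not score a fixed alignment but search over all possible gap placements (dynamic programming) for the arrangement that maximises this kind of score.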

So that’s general alignment, now to apply this to membrane proteins. The cell membrane is composed of a lipid bilayer: a sandwich with a hydrophobic filling and hydrophilic crusts. The part of a membrane protein that touches the filling will have different preferences for amino acids (and, more importantly, substitutions between these amino acids) than the part of a membrane protein that touches the crust. Similarly there are systematic preferences for amino acid substitutions depending on whether part of a protein is buried or exposed, and on which type of secondary structure it assumes. The figure below shows a membrane protein with different regions of the membrane and different types of secondary structure annotated.

[Figures: a membrane protein annotated with membrane regions and secondary structure types]


It is possible to make separate substitution tables for each environment within a membrane protein, where an environment specifies where the protein sits in the membrane, what secondary structure it has, and whether it is accessible or buried. Below is a principal components analysis of the resulting set of tables: each table is represented by a single point and the axes show the directions of greatest variation between the tables. The plot on the right shows a separation of the points based on whether they are buried (more hydrophobic) or accessible (more hydrophilic). The hydrophobic centre (red circles) and hydrophilic edges (green circles) of the membrane fall into this general pattern. The plot on the left shows that the tables further divide by secondary structure type. In summary, there are systematic substitution preferences in practice as well as in theory, and for membrane proteins it is most important to consider hydrophobicity when aligning two protein sequences.

[Figure: principal components analysis of the environment-specific substitution tables]
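The PCA itself is nothing exotic: flatten each environment-specific table into a vector, centre, and project onto the leading principal components. The sketch below uses random placeholder tables and invented environment names rather than the real data:

import numpy as np

env_names = ["helix buried", "helix exposed", "strand buried",
             "strand exposed", "coil buried", "coil exposed"]

rng = np.random.default_rng(1)
tables = rng.normal(size=(len(env_names), 400))   # one flattened 20x20 table per environment

X = tables - tables.mean(axis=0)                  # centre
U, S, Vt = np.linalg.svd(X, full_matrices=False)  # PCA via SVD
pcs = X @ Vt[:2].T                                # coordinates on PC1 and PC2

for name, (pc1, pc2) in zip(env_names, pcs):
    print("%-15s PC1 = %7.2f  PC2 = %7.2f" % (name, pc1, pc2))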

On then to modelling. The conventional approach to aligning a pair of sequences for homology modelling is to take a set of pre-aligned sequences (a sequence profile), and use them to estimate a supplement to the standard substitution score for aligning two sequences. This is termed profile-profile alignment. Memoir takes a different approach, using the MP-T program to construct a multiple sequence alignment scored with environment-specific substitution tables. The alignment includes a set of sequences homologous to the pair of interest.

Profile-profile alignment methods and MP-T are very different. It is unclear whether the substitution preferences at a position are best estimated by MP-T’s tables or the supplements derived from sequence profiles, and the answer probably depends on how well the profiles are made — garbage in, garbage out. Similarly the MP-T algorithm only determines the upper limit of alignment accuracy, and the actual accuracy depends on how the homologous sequences in the alignment are chosen.

In general we find little difference between MP-T and the profile-profile methods HHsearch and Promals in the fraction of an alignment they get right. However, we do find a difference in the fraction of the alignment that these methods get wrong (part of an alignment can be right, wrong or simply not aligned, so it is possible to get a lower fraction wrong whilst getting the same fraction right). It turns out that on average MP-T gets less of an alignment wrong for simple reasons of combinatorics: for a pair of proteins, the number of possible multiple sequence alignments is much greater than the number of possible profile-profile alignments. This means that, just by chance, the number of incorrectly aligned positions between the two sequences of interest will be lower for MP-T than for a conventional profile-profile alignment method.

Now for a little sales-pitch. The source code for MP-T is freely available and easy to expand (if you have a passing familiarity with Haskell). Only two or three lines of code need to be changed to define a new set of protein environments, and to feed it a substitution table for each environment. I’d be happy to help anyone who wants to try it out.

Constrain a PDB to particular chains

In many applications you need to constrain PDB files to certain chains. You can do it using this program.

A. What does it do?

Given a pdb file, write out the ATOM and HETATM entries for the supplied chain(s).

PDB_constrain needs three arguments:

  1. PDB file to constrain.
  2. Chains from the pdb file to constrain.
  3. Output file.

B. Requirements:

Biopython – it should already be installed on your machines, but if you want to use it locally, download the latest version into the same directory as PDB_constrain.py (there is no need to build it).
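For reference, the core of such a chain filter is only a few lines with Biopython’s Bio.PDB module. This is not the PDB_constrain.py source, just a sketch of the idea:

from Bio.PDB import PDBParser, PDBIO, Select

class ChainSelect(Select):
    # keep only the requested chain identifiers
    def __init__(self, chains):
        self.chains = set(chains)
    def accept_chain(self, chain):
        return chain.id in self.chains

parser = PDBParser(QUIET=True)
structure = parser.get_structure("1A2Y", "1A2Y.pdb")

io = PDBIO()
io.set_structure(structure)
io.save("const.pdb", select=ChainSelect("AB"))   # chains A and B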

C. Example use:

C.1 Constrain 1A2Y.pdb to chains A and B – write results in const.pdb

python PDB_constrain.py -f 1A2Y.pdb -c AB -o const.pdb


C.2 Constrain 1ACY to chain L, write results in const.pdb – this example shows that the constrainer works well with ‘insertion’ residue numbering as in antibodies where you have 27A, 27B etc.

python PDB_constrain.py -f 1ACY.pdb -c L -o const.pdb


Making small molecules look good in PyMOL

Another largely plagiarized post for my “personal notes” (thanks Justin Lorieau!) and following on from the post about pretty-fication of macromolecules.  For my slowly-progressing confirmation report I needed some beautiful small molecule representation.  Here is some PyMOL code:

show sticks
set ray_opaque_background, off
set stick_radius, 0.1
show spheres
set sphere_scale, 0.15, all
set sphere_scale, 0.12, elem H
color gray40, elem C
set sphere_quality, 30
set stick_quality, 30
set sphere_transparency, 0.0
set stick_transparency, 0.0
set ray_shadow, off
set orthoscopic, 1
set antialias, 2
ray 1024,768

And the result:

[Figure: the rendered ligand]

Beautiful, no?

A javascript function to validate FASTA sequences

I was more than a bit annoyed at not finding this out there in the interwebs, being a strong encourager of googling (or in Jamie’s case duck-duck-going) and re-use.

So I proffer my very own FASTA validation JavaScript function.

/*
 * Validates (true/false) a single fasta sequence string
 * param   fasta    the string containing a putative single fasta sequence
 * returns boolean  true if string contains single fasta sequence, false 
 *                  otherwise 
 */
function validateFasta(fasta) {

	if (!fasta) { // check there is something first of all
		return false;
	}

	// immediately remove trailing spaces
	fasta = fasta.trim();

	// split on newlines... 
	var lines = fasta.split('\n');

	// check for header
	if (fasta[0] === '>') {
		// remove one line, starting at the first position
		lines.splice(0, 1);
	}

	// join the array back into a single string without newlines and 
	// trailing or leading spaces
	fasta = lines.join('').trim();

	if (!fasta) { // is it empty whatever we collected ? re-check not efficient 
		return false;
	}

	// note that the empty string is caught above
	// allow for Selenocysteine (U)
	return /^[ACDEFGHIKLMNPQRSTUVWY\s]+$/i.test(fasta);
}

Let me know, by comments below, if you spot a no-no.  Please link to this post if you find use for it.

p.s. I already noticed that this only validates one sequence. This is because this function is taken out of one of our web servers, Memoir, which specifically only requires one sequence. If there is interest in multi-sequence validation I will add it.


aRrrrgh! or how to apply a fitted model to new data

Recently I’ve been battling furiously with R while analysing some loop modelling accuracy data. The idea was simple:

  1. Fit a generalized linear model to some data
  2. Get out a formula to predict a variable (let’s call it “accuracy”) based on some input parameters
  3. Apply this formula to new data and see how well the predictor does

It turns out it’s not that simple to actually implement. Fitting a generalized linear model in R produces coefficients in a vector.

model <- glm(accuracy ~ param1 + param2 * param3, data=trainingset)
coef(model)
            (Intercept)                  param1                  param2 
            0.435395087            -0.093295388             0.148154339 
                 param3           param2:param3
            0.024399530             0.021100300

There seems to be no easy way to insert these coefficients into your formula and apply the resulting equation to new data. The only easy thing to do is to plot the fitted values against the variable we’re trying to predict, i.e. plot our predictions on the training set itself:

plot(model$fitted.values, trainingset$accuracy, xlab="score", ylab="accuracy", main="training set")

I’m sure there must be a better way of doing this, but many hours of Googling led me nowhere. So here is how I did it. I ended up writing my own parser function, which works only on very simple formulae using the + and * operators and without any R code inside the formula.

coefapply <- function(coefficients, row)
{
  result <- 0
  for (i in 1:length(coefficients))
  {
    subresult <- as.numeric(coefficients[i])
    if (!is.na(subresult))
    {
      name <- names(coefficients[i])
      if (name != "(Intercept)")
      {
        subnames <- strsplit(name, ":", fixed=TRUE)[[1]]
        for (n in subnames)
        {
          subresult <- subresult * as.numeric(row[n])
        }
      }
      result <- result + subresult
    }
  }
  return(result)
}

calculate_scores <- function(data, coefficients)
{
  scores <- vector(mode="numeric", length=nrow(data))
  for (i in 1:nrow(data))
  {
    row <- data[i,]
    scores[i] <- coefapply(coefficients, row)
  }
  return(scores)
}

Now we can apply our formula to a new dataset and plot the accuracy achieved on the new data:

model_coef <- coef(model)

# Test if our scores are the same values as the model's fitted values
training_scores <- calculate_scores(trainingset, model_coef)
sum(abs(training_scores - model$fitted.values) < 1e-12) / length(training_scores)

# Calculate scores for our test set and plot them
test_scores <- calculate_scores(testset, model_coef)
plot(test_scores, testset$accuracy, xlab="score", ylab="accuracy", main="test set")

It works for my purpose. Maybe one day someone will see this post, chuckle, and then enlighten me with their perfectly simple and elegant alternative.

On being cool: arrows on an R plot

Recently I needed a schematic graph of traditional vs on-demand computing (don’t ask) – and in this hand waving setting I just wanted the axes to show arrows and no labels.  So, here it is:

x <- c(1:5)
y <- rnorm(5)
plot(x, y, axes = FALSE)
u <- par("usr") 
arrows(u[1], u[3], u[2], u[3], code = 2, xpd = TRUE) 
arrows(u[1], u[3], u[1], u[4], code = 2, xpd = TRUE)

And here is the output

Arrowed plot

(I pinched this off a mailing list post, so this is my due reference)

The next thing I am toying with is these xkcd-like graphs in R, here.

A free, sweet, valid HTML4 “Site Maintenance” page

So today we moved our servers from the cloud to a physically local server and we needed a “Site Maintenance” page. A few google searches turned up a simple HTML5 template, which I converted to HTML4 and reproduce hereunder (I could not find the original source, aargh):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">

<html>

<head>
    <meta http-equiv="Content-type" content="text/html;charset=UTF-8">
    <title>Site Maintenance</title>
    <style type="text/css">
      body { text-align: center; padding: 150px; }
      h1 { font-size: 50px; }
      body { font: 20px Helvetica, sans-serif; color: #333; }
      #article { display: block; text-align: left; width: 650px; margin: 0 auto; }
      a { color: #dc8100; text-decoration: none; }
      a:hover { color: #333; text-decoration: none; }
    </style>

</head>
<body>
    <div id="article">
    <h1>We&rsquo;ll be back soon!</h1>
    <div>
        <p>Sorry for the inconvenience but we&rsquo;re performing some maintenance at the moment. If you need to you can always contact us on <b>opig AT stats.ox.ac.uk</b>, otherwise we&rsquo;ll be back online shortly!  Site should be back up on Friday 1st March 2013, 16:00 GMT.</p>
        <p>&mdash; OPIG</p>
    </div>
    </div>
</body>
</html>

And here is what it looks like… (nothing glamorous, you have been warned)

[Figure: screenshot of the maintenance page]

How to install RDKit on Ubuntu 12.04 / 12.10 / 13.04 / 13.10 / 14.04 / 14.10 (with InChI support)

I make extensive use of this brilliant piece of cheminformatics software, RDKit, and it has saved me writing my own Molecule, Atom, Bond, Conformer, Fingerprint, SimilarityMetric, Descriptor etc. classes time and time again.  It is really neat, and works with C++ and Python (and Java, I think).  Here are the instructions on how to install it on a relatively recent Ubuntu version (12.04 / 12.10 / 13.04 / 13.10 / 14.04 / 14.10).

Pre-requisite software (this is why I love Debian based systems)

sudo apt-get install flex bison build-essential python-numpy cmake python-dev sqlite3 libsqlite3-dev libboost-dev  libboost-python-dev libboost-regex-dev

Get the latest RDKit goodie from here (watch out – version number has been replaced by ‘X’ below)

wget http://downloads.sourceforge.net/project/rdkit/rdkit/QX_20XX/RDKit_20XX_XX_X.tgz

Unzip the beast, save it to /opt

sudo tar xzvf RDKit_20XX_XX_X.tgz -C /opt

Add some environment salt, vim ~/.bashrc

export RDBASE=/opt/RDKit_20XX_XX_X
export LD_LIBRARY_PATH=$RDBASE/lib:$LD_LIBRARY_PATH
export PYTHONPATH=$RDBASE:$PYTHONPATH

Re-source your .bashrc

. ~/.bashrc

If you want the InChI stuff (trust me, you do), first:

cd $RDBASE/External/INCHI-API/
./download-inchi.sh

Build (compile), install & test

cd $RDBASE
mkdir build
cd build
cmake .. # if you do not care for InChI support OR
cmake -DRDK_BUILD_INCHI_SUPPORT=ON .. # to install InChI generation code 
make # -j 4 to use multiple processors
make install
ctest

If all your tests passed successfully you are good to go.  Otherwise, get in touch via the post comments below.
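Once everything is in place, a quick smoke test from Python confirms both the core install and the InChI support. The molecule and the expected weight below are just an illustration (aspirin, roughly 180.16):

from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.Chem.inchi import MolToInchi

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin
print(Descriptors.MolWt(mol))                        # ~180.16
print(MolToInchi(mol))                               # needs -DRDK_BUILD_INCHI_SUPPORT=ON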