Category Archives: How To

How to do things. doh.

Viewing ligands in twilight electron density

In this week’s journal club we discussed an excellent review paper by E. Pozharski, C. X. Weichenberger and B. Rupp investigating crystallographic approaches to protein-ligand complex elucidation. The paper assessed and highlighted the shortcomings of deposited PDB structures containing ligand-protein complexes. It then made suggestions for the community as a whole and for researchers making use of ligand-protein complexes in their work.

The paper discussed:

  • The difficulties of protein-ligand complex elucidation
  • The qualitative and quantitative tools available to assess the quality of protein-ligand structures
  • The methods used in their analysis of a set of PDB structures
  • Some case studies visually demonstrating these issues
  • Some practical conclusions for the crystallographic community
  • Some practical conclusions for non-crystallographer users of protein-ligand complex structures from the PDB

The basic difficulties of ligand-protein complex elucidation

  • Ligands have less than 100% occupancy – sometimes significantly less – and thus inherently show up less clearly in the overall electron density.
  • Ligands make only a small contribution to the overall structure, so global quality measures, such as R-factors, are affected only minutely by the ligand portion of the structure being wrong.
  • The original basis model needs to be used appropriately: the R-free set from the original apo model should be used to avoid model bias.

The following are the tools available to inspect the quality of agreement between protein structures and their associated data.

  • Visual inspection of the Fo-Fc and 2Fo-Fc maps, using software such as COOT, is essential to assess qualitatively whether a structure is justified by the evidence.
  • Use of local measures of quality, for example the real-space correlation coefficient (RSCC).
  • Their own tool, Twilight, which makes use of the above as well as global quality measures such as resolution.

Methods and results

In a separate publication they had analysed all PDB entries containing both ligands and published structure factors. In this sample they showed that 7.6% of ligands had RSCC values of less than 0.6, the arbitrary cut-off they use to determine whether the experimental evidence supports the modelled coordinates.

An incorrectly oriented ligand (a) and its correction (b). In all of these figures, blue is the 2mFo-DFc map contoured at 1σ, and green and red are positive and negative contours of the mFo-DFc map at 3σ.

In this publication they visually inspected a subset of structures to assess in more detail how effective that arbitrary cut-off is, and to ascertain the reasons for poor correlation. They showed the following:

(i) Ligands incorrectly identified as questionable, i.e. false positives (7.4%)
(ii) Incorrectly modelled ligands (5.2%)
(iii) Ligands with partially missing density (29.2%)
(iv) Glycosylation sites (31.3%)
(v) Ligands placed into electron density that is likely to originate from mother-liquor components
(vi) Incorrect ligands (4.7%)
(vii) Ligands that are entirely unjustified by the electron density (11.9%)

The first point to note from the above data is that the false-positive rate using an RSCC cut-off of 0.6 is 7.4%, which demonstrates that this value alone is not sufficient to accurately identify incorrect ligand coordinates. Within the other categories, all errors can be attributed to one of, or a combination of, the following two factors:

  • Inexperience: the crystallographer was unable to interpret the data in front of them
  • Wilful denial: the crystallographer ignored the data in front of them in order to present the result they wanted to see

A ligand incorrectly placed in density for a sulphate ion from the mother liquor (a) and its correction (b)

The paper observed that a disproportionate number of the poorly modelled ligands were at glycosylation sites, and in some instances these observations had been used to inform the biochemistry of the protein in question. Interestingly, this follows observations from almost a decade ago; however, many of the examples in the Twilight paper date from 2008 or later, indicating that the community as a whole is not reacting to this problem and needs further prodding.

An incomplete glycosylation site inaccurately modelled

Conclusions and suggestions

For inexperienced users looking at ligand-protein complexes from the PDB:

  • Inspect the electron density map, using software such as COOT if it is available, to determine qualitatively whether there is evidence for the ligand being where it is modelled
  • If using large numbers of ligand-protein complexes, use a script such as Twilight to find the RSCC value for each ligand, to give some confidence that a ligand is actually present as stated

For the crystallographic community:

  • Improved training of crystallographers, to ensure errors due to genuine misinterpretation of the underlying data are minimised
  • More submission of electron-density maps; even if not made publicly available, they should form part of initial structure validation
  • Better tooling: the software is easy to use, but its output remains difficult to analyse

Good looking proteins for your publication(s)

Just came across a wonderful PyMOL gallery while creating some images for my (long overdue) confirmation report.  A fantastic resource to draw sexy proteins – especially useful for posters, talks and papers (unless you are paying extra for coloured figures!).

It would be great if we had our own OPIG “pymol gallery”.

An example of one of my proteins (1tgm) with aspirin bound to it:

Good looking protein
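
For anyone after a starting point, here is roughly the kind of PyMOL incantation that produces this sort of render – a generic sketch, not the exact settings used for the image above (AIN is the PDB chemical component code for aspirin):

# a generic starting point - tweak to taste
fetch 1tgm
hide everything
show cartoon
util.cbc                        # colour by chain
select aspirin, resn AIN        # AIN = acetylsalicylic acid
show sticks, aspirin
set ray_opaque_background, 0    # transparent background, handy for posters
ray 1200, 900
png good_looking_protein.png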


A javascript function to validate FASTA sequences

I was more than a bit annoyed at not finding this out there on the interwebs, being a strong encourager of googling (or in Jamie’s case duck-duck-going) and re-use.

So I proffer my very own fasta validation javascript function.

/*
 * Validates (true/false) a single fasta sequence string
 * param   fasta    the string containing a putative single fasta sequence
 * returns boolean  true if string contains single fasta sequence, false 
 *                  otherwise 
 */
function validateFasta(fasta) {

	if (!fasta) { // check there is something first of all
		return false;
	}

	// immediately remove trailing spaces
	fasta = fasta.trim();

	// split on newlines... 
	var lines = fasta.split('\n');

	// check for header
	if (fasta[0] == '>') {
		// remove one line, starting at the first position
		lines.splice(0, 1);
	}

	// join the array back into a single string without newlines and 
	// trailing or leading spaces
	fasta = lines.join('').trim();

	if (!fasta) { // reject if nothing remains once the header is removed
		return false;
	}

	// note that the empty string is caught above
	// allow for Selenocysteine (U)
	return /^[ACDEFGHIKLMNPQRSTUVWY\s]+$/i.test(fasta);
}
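
A quick usage check (assuming the function above is in scope):

// a few sanity checks
console.log(validateFasta(">seq1\nMKTAYIAKQR"));  // true
console.log(validateFasta("MKTAYIAKQR"));         // true - the header is optional
console.log(validateFasta(">seq1\nMKTA>IAKQR"));  // false - '>' mid-sequence
console.log(validateFasta(""));                    // false - empty input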

Let me know, by comments below, if you spot a no-no.  Please link to this post if you find use for it.

p.s. I have already noticed that this only validates one sequence. This is because the function is taken out of one of our web servers, Memoir, which specifically requires only one sequence. If there is interest in multi-sequence validation I will add it.
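
In the meantime, here is a minimal sketch of one way a multi-sequence version might look – a hypothetical helper, not part of Memoir, that splits the input on '>' headers and reuses validateFasta above:

/*
 * Validates (true/false) a multi-fasta string by splitting it into
 * records and validating each one with validateFasta above.
 */
function validateMultiFasta(fasta) {
	if (!fasta) {
		return false;
	}
	fasta = fasta.trim();
	if (fasta[0] != '>') { // in a multi-fasta every record needs a header
		return false;
	}
	// split on newline + '>' and restore the '>' that the split consumed
	var records = fasta.split('\n>').map(function(r, i) {
		return (i === 0) ? r : '>' + r;
	});
	return records.every(validateFasta);
}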


aRrrrgh! or how to apply a fitted model to new data

Recently I’ve been battling furiously with R while analysing some loop modelling accuracy data. The idea was simple:

  1. Fit a general linear model to some data
  2. Get out a formula to predict a variable (let’s call it “accuracy”) based on some input parameters
  3. Apply this formula to new data and see how well the predictor does

It turns out, it’s not that simple to actually implement. Fitting a general linear model in R produces coefficients in a vector.

model <- glm(accuracy ~ param1 + param2 * param3, data=trainingset)
coef(model)
            (Intercept)                  param1                  param2 
            0.435395087            -0.093295388             0.148154339 
                 param3           param2:param3
            0.024399530             0.021100300

There seems to be no easy way to insert these coefficients into your formula and apply the resulting equation to new data. The only easy thing to do is to plot the fitted values against the variable we’re trying to predict, i.e. plot our predictions on the training set itself:

plot(model$fitted.values, trainingset$accuracy, xlab="score", ylab="accuracy", main="training set")

I’m sure there must be a better way of doing this, but many hours of Googling led me nowhere. So here is how I did it. I ended up writing my own parser function, which works only on very simple formulae using the + and * operators and without any R code inside the formula.

# Evaluate a fitted model's coefficient vector for a single data row.
# Handles the intercept and simple interaction terms (names containing ":");
# assumes every non-intercept coefficient name matches a column in `row`.
coefapply <- function(coefficients, row)
{
  result <- 0
  for (i in 1:length(coefficients))
  {
    subresult <- as.numeric(coefficients[i])
    if (!is.na(subresult))
    {
      name <- names(coefficients[i])
      if (name != "(Intercept)")
      {
        # an interaction term "a:b" is the product of its component columns
        subnames <- strsplit(name, ":", fixed=TRUE)[[1]]
        for (n in subnames)
        {
          subresult <- subresult * as.numeric(row[n])
        }
      }
      result <- result + subresult
    }
  }
  return(result)
}

# Apply coefapply() to every row of a data frame, returning a numeric vector
calculate_scores <- function(data, coefficients)
{
  scores <- vector(mode="numeric", length=nrow(data))
  for (i in 1:nrow(data))
  {
    row <- data[i,]
    scores[i] <- coefapply(coefficients, row)
  }
  return(scores)
}

Now we can apply our formula to a new dataset and plot the accuracy achieved on the new data:

model_coef <- coef(model)

# Test that our scores match the model's fitted values
# (arguments: data first, then coefficients)
training_scores <- calculate_scores(trainingset, model_coef)
sum(abs(training_scores - model$fitted.values) < 1e-12) / length(training_scores)

# Calculate scores for our test set and plot them
test_scores <- calculate_scores(testset, model_coef)
plot(test_scores, testset$accuracy, xlab="score", ylab="accuracy", main="test set")

It works for my purpose. Maybe one day someone will see this post, chuckle, and then enlighten me with their perfectly simple and elegant alternative.
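
For what it’s worth – and in the spirit of that last sentence – R’s built-in predict() generic should do exactly this for glm objects. A minimal sketch, assuming the same model and testset as above:

# predict() evaluates the fitted formula on new data directly;
# type = "link" returns the linear predictor (what coefapply computes),
# type = "response" additionally applies the inverse link function
test_scores <- predict(model, newdata = testset, type = "link")
plot(test_scores, testset$accuracy, xlab = "score", ylab = "accuracy", main = "test set")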

On being cool: arrows on an R plot

Recently I needed a schematic graph of traditional vs on-demand computing (don’t ask) – and in this hand waving setting I just wanted the axes to show arrows and no labels.  So, here it is:

x <- c(1:5)
y <- rnorm(5)
plot(x, y, axes = FALSE)  # draw the points, but suppress the default axes
u <- par("usr")           # plot region extremes: c(x1, x2, y1, y2)
arrows(u[1], u[3], u[2], u[3], code = 2, xpd = TRUE)  # x-axis arrow
arrows(u[1], u[3], u[1], u[4], code = 2, xpd = TRUE)  # y-axis arrow

And here is the output

Arrowed plot

(I pinched this off a mailing list post, so this is my due reference)

The next thing I am toying with is these xkcd-like graphs in R here.

A free, sweet, valid HTML4 “Site Maintenance” page

So today we have moved servers from the cloud to a physically local server, and we needed a “Site Maintenance” page.  A few Google searches turned up a simple HTML5 template, which I converted to HTML4 and reproduce hereunder (I could not find the original source, aargh):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">

<html>

<head>
    <meta http-equiv="Content-type" content="text/html;charset=UTF-8">
    <title>Site Maintenance</title>
    <style type="text/css">
      body { text-align: center; padding: 150px; }
      h1 { font-size: 50px; }
      body { font: 20px Helvetica, sans-serif; color: #333; }
      #article { display: block; text-align: left; width: 650px; margin: 0 auto; }
      a { color: #dc8100; text-decoration: none; }
      a:hover { color: #333; text-decoration: none; }
    </style>

</head>
<body>
    <div id="article">
    <h1>We&rsquo;ll be back soon!</h1>
    <div>
        <p>Sorry for the inconvenience but we&rsquo;re performing some maintenance at the moment. If you need to you can always contact us on <b>opig AT stats.ox.ac.uk</b>, otherwise we&rsquo;ll be back online shortly!  Site should be back up on Friday 1st March 2013, 16:00 GMT.</p>
        <p>&mdash; OPIG</p>
    </div>
    </div>
</body>
</html>

And here is what it looks like… (nothing glamorous, you have been warned)

[Screenshot: the Site Maintenance page]
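
As an aside – this was not part of the original setup, so treat it as a sketch – one way to actually serve such a page during downtime is an Apache mod_rewrite rule that answers every request with a 503, assuming mod_rewrite is enabled and the page above is saved as /maintenance.html (in the vhost config or an .htaccess file):

# send every request to the maintenance page with a 503 status,
# so search engines know the outage is temporary
RewriteEngine On
RewriteCond %{REQUEST_URI} !^/maintenance\.html$
RewriteRule ^ - [R=503,L]
ErrorDocument 503 /maintenance.html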

How to install RDKit on Ubuntu 12.04 / 12.10 / 13.04 / 13.10 / 14.04 / 14.10 (with InChI support)

I make extensive use of this brilliant piece of cheminformatics software, RDKit, and it has saved me writing my own Molecule, Atom, Bond, Conformer, Fingerprint, SimilarityMetric, Descriptor etc. classes time and time again.  It is really neat, and works with C++ and Python (and Java I think).  Here are the instructions for installing it on a relatively recent Ubuntu version (12.04 / 12.10 / 13.04 / 13.10 / 14.04 / 14.10).

Prerequisite software (this is why I love Debian-based systems)

sudo apt-get install flex bison build-essential python-numpy cmake python-dev sqlite3 libsqlite3-dev libboost-dev libboost-python-dev libboost-regex-dev

Get the latest RDKit goodie from here (watch out – version number has been replaced by ‘X’ below)

wget http://downloads.sourceforge.net/project/rdkit/rdkit/QX_20XX/RDKit_20XX_XX_X.tgz

Unzip the beast, save it to /opt

sudo tar xzvf RDKit_20XX_XX_X.tgz -C /opt

Add some environment salt, vim ~/.bashrc

export RDBASE=/opt/RDKit_20XX_XX_X
export LD_LIBRARY_PATH=$RDBASE/lib:$LD_LIBRARY_PATH
export PYTHONPATH=$RDBASE:$PYTHONPATH

Re-source your .bashrc

. ~/.bashrc

If you want the InChI stuff (trust me, you do), first:

cd $RDBASE/External/INCHI-API/
./download-inchi.sh

Build (compile), install & test

cd $RDBASE
mkdir build
cd build
cmake .. # if you do not care for InChI support OR
cmake -DRDK_BUILD_INCHI_SUPPORT=ON .. # to install InChI generation code 
make # -j 4 to use multiple processors
make install
ctest

If all your tests passed successfully you are good to go.  Otherwise, get in touch via the post comments below.
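
As a final check, a quick InChI smoke test – a minimal sketch using phenol as an arbitrary example:

from rdkit import Chem
from rdkit.Chem import inchi

# round-trip a molecule through the freshly built InChI extension
m = Chem.MolFromSmiles('c1ccccc1O')  # phenol
print inchi.MolToInchi(m)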


How to make a custom latex bibliography style

Imagine you are writing up your latest thrilling piece of science in your favourite odt or docx format. Nothing comes from nothing so you need to cite the 50 or so people whose ideas you built on, or who came to conclusions that contradict yours. Suddenly you realize that your second sentence needs a reference… and this will require you to renumber the subsequent 50. What a drag! There goes 5 minutes of your life that could have been better spent drinking beer.

Had you written your research in latex instead, this drudgery would have been replaced by a range of much more interesting and intractable difficulties. In latex it is easy to automagically renumber everything just by recompiling the document. Unfortunately, it is often hard to direct latex’s magic… just try moving a picture an inch to the right, or reformatting a reference.

Moving figures around is still a black art as far as I’m concerned… but I’ve recently found out an easy way to reformat references. This might be especially handy when you find out that your sort of proteins fall out of the scope of the International Journal of Eating Disorders and you now want to submit to a journal that requires you to list authors in small-caps, and the dates of publication in seconds from the Unix epoch.

A good way of including references in latex is with a “.bib” file and a “.bst” file. An example of the end of a document is shown below.


\bibliographystyle{myfile}
\bibliography{mycollection}

\end{document}

What’s happening here? All my references are stored in bibtex format in a database file called “mycollection.bib”. A separate file “myfile.bst” says how the information in the database should be presented. For example, are references in the text of the form (Blogs et al 2005) or are they numbered (1)? At the end of the text are they grouped in order of appearance, by date of publication or alphabetically? If alphabetically does “de Ville” come under “d” or “v”? To reformat a reference, we simply need to change “myfile.bst”.
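
For concreteness, a minimal entry in “mycollection.bib” might look like the following (the names are illustrative, lifted from the examples above); it would then be cited in the text with \cite{blogs2005}:

@article{blogs2005,
  author  = {Blogs, Joe and de Ville, Jean},
  title   = {An Illustrative BibTeX Entry},
  journal = {International Journal of Eating Disorders},
  year    = {2005},
  volume  = {12},
  pages   = {345--356}
}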

Most latex distributions come with a set of bibliography styles. Some examples can be found here (a page which also explains all of the above much better than I have). However, it is very easy to produce a custom file using the custom-bib package. After a one-click download it is as simple as typing:


latex makebst.ins
latex makebst.tex

Here’s a screenshot to prove it. At the bottom is the first of thirty or forty multiple-choice questions about how you want your references to look. If in doubt, just close your eyes and press return to select the default.

[Screenshot: the first of the makebst multiple-choice questions]

The problem with a multiple-choice fest is that if you make a poor decision at question 28 you have to go through the whole process again. Fortunately, this can be circumvented — as well as generating a pretty “myfile.bst” file, the custom-bib package generates an intermediate file “myfile.dbj”. Changing your multiple-choice answers is just a matter of commenting out the relevant parts and typing “latex myfile.dbj”. A snippet of a “dbj” file is below:

[Screenshot: a snippet of the generated “myfile.dbj” file]

Selected options are those without a “%” sign on the left hand side. Who would have thought that Latex could be so cuddly?

How to make environment variables persist in your web app (Apache2)

Last Sunday, with the Memoir gang, we were talking about using this blog as a technical notebook. This post is in that spirit.

We have moved most of Memoir’s validation to before the job is submitted to the queue.  This has the advantage that the user knows immediately if the submission has failed (instead of waiting for the job to be processed in the queue).  The only issue with this is that the apache user (on most systems www-data) runs the validation pipeline, and Memoir requires a ton of additions to PATH.

In most cases you cannot add variables in .bashrc, as the apache user does not have a home directory.  The easiest way to add environment variables for the apache user is:

sudo vim /etc/apache2/envvars

And add the following line at the bottom (these have to be customized according to the server):

export PATH=$PATH:/opt/tmalign:/opt/muscle:/opt/joy-5.10:/opt/joy-5.10/psa:/opt/joy_related-1.06:/opt/joy_related-1.06/sstruc:/opt/joy_related-1.06/hbond:/opt/medeller-dev/bin:/opt/usearch6.0.307:/opt/ncbi-blast-2.2.27+:/opt/MPT/bin

After this restart apache – and you should be laughing. Or, as is often the case, apache will be laughing at you.

sudo service apache2 restart

The apache user should now “see” the amended path variable.
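
A quick way to confirm this, assuming CGI is enabled (an assumption on my part, not part of the original note), is a tiny script in cgi-bin that echoes the PATH apache actually sees:

#!/bin/sh
# save as /usr/lib/cgi-bin/showpath.sh, make it executable,
# then visit http://yourserver/cgi-bin/showpath.sh
echo "Content-type: text/plain"
echo ""
echo "PATH=$PATH"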

And here it is, this technical note – so in a few weeks’ time, when I have forgotten all of the above, I can just copy and paste it…

(This post is short, also because there is the superbowl!!)

Anatomy of a blog post

Now, shall I use the first person singular or plural to write this?  Active or passive voice? …

It doesn’t really matter.  This isn’t a formal article, and you can even use abbreviations.  This group blog, like anything else during our time in Oxford, is an experiment.  We will give it a few months and see what happens.  If it pans out, we will have a, more or less, detailed research journal for the group.  Not to mention a link with the outside world (prospective students? employers?) and proof that we can “communicate” with others. And since this is an exploratory exercise we should have freedom to explore what we want to write about.

We should have plenty of fodder.  Let us face it, if we do not do some mildly interesting science every week then we are probably not having enough fun.  But even if you are working on a hushed up, undercover project (e.g. the next blockbuster drug against Malaria) – there are still so many interesting bits of our D.Phil. which would otherwise never see the light of day.

For inspiration, have a look at other popular scientific blogs – the ChEMBL one is both educational and humorous in equal measure (Post Idea #1: a list of bio/cheminformatics blogs which every grad student should read).  Blogs are a great way to survey the literature without actually doing any reading (Post Idea #2: tricks to increase grad student productivity… what do you mean you don’t use Google Alerts to surprise your supervisor with a link to a paper published the day before?); and for a TL;DR version there is twitter (Post Idea #3: Idea #1 but for twitter instead).  I only found out about the four-stranded DNA in human cells by following @biomol_info.

And of course, we are mostly a computational group – so software is what we churn out on a daily basis.  So much of the software we write ends up resting forever on our disks, never to be used again.  The masses want splitchain!  (Idea #4: post software you wrote).  And there is benefit in not only giving out software, but also explaining its internals with snippets (Idea #5: a clever algorithm explained line-by-line).

And then there is the poster you hung up once (Idea #6), or the talk you gave and prepared for hours on your disposable, use-once-only slides (Idea #7).  There is the announcement of publishing a paper – that solemn moment in academia when someone else thinks what you have done is worthy (Idea #8 – btw, well done to our own Jamie Hill for his recent MP-T work).

And if you’re an athlete, like Anna (Dr. Lewis), who crossed the Atlantic in a rowing boat, or Eleanor, who used to row for the Blues – what can I say, this is how we roll, or row [feeble attempt at humour] – that’s a non-scientific but unique and interesting experience too (Idea #9).

If you’ve read a paper and you think it’s interesting, comment on it – people will follow your posts just because they act as a literature filter (Idea #10).  You can probably even have a rant (Idea #11); as long as it’s more positive and less bitter than Fred Ross’ Farewell to Bioinformatics.

Finally, this post is long and tedious for the reader.  But that is ok too – like everything else here, it is a learning experience and the more I write the more I will improve.  So hey, I’m also doing this to write a better thesis (i.e. to make the writing less painful).

An addendum: my initial intention was to discuss the bits which make a good blog post.  You can find lots of articles about this – so it is less interesting; but here are the main points:

[Image: cover_real_conformers]

If a picture is really worth a thousand words, 30 of these is all I need for my thesis.