Category Archives: Code

Colour wisely…

Colour – the attribute of an image that makes it acceptable or destined for the bin. Colour has a funny effect on us – it’s a double-edged sword that can greatly strengthen or weaken a data representation. No one really talks about what makes a good way to colour an image or a graph, yet most of us can agree on what looks pleasing and what looks disgusting. Colour brings two distinct advantages to a graph: it conveys both quantitative and categorical information very, very well. Thus, I will provide a brief overview (with code) of how colour can be used to display both quantitative and qualitative information. (*On the note of colours, Nick has previously discussed how colourblindness must be considered in visualising data…).

1. Colour conveys quantitative information.
A huge advantage of colour is that it can provide quantitative information, but this has to be done correctly. Here are three graphs showing exactly the same information (the joint density of two normal distributions), and we can see from the get-go which method represents that density best:

Colouring the same graph using three different colour maps.


If you thought the middle one was the best one, I’d agree too. Why would I say that, despite it being grayscale and seemingly being the least colourful of them all?

  • Colour is not limited to hues (i.e. whether it’s red/white/blue/green etc.); ‘colour’ is also achieved through saturation and brightness (i.e. how vivid a colour is, or how dark/light it is). In the case of the middle graph, we’re using brightness to indicate the variations in density, which is a more intuitive display. Another advantage of using shades is that the plot will most likely remain readable for colourblind users.
  • Why does the graph on the right not work for this example? This is a case where we use a “sequential” colour map to convey the differences in density. Although the colour legend clarifies which colour belongs to which density bin, without it, it’s very difficult to tell what “red” means with respect to “yellow”. With the colour bar we know that red means high density and yellow lower, but without the legend we can interpret the colours very differently, e.g. as categories rather than quantities. Basically, if you decide on a sequential colour map, its use must be handled well, and a colour map/legend is critical; otherwise we risk reading the colours as categories rather than as continuous values.
  • Why is the left graph not working well? This is an example of a “diverging” colourmap.
    It’s somewhat clear that blue and red define two distinct quantities. Despite this, a major flaw of this colour map is the white in the middle. If the white marked a “zero crossing” – basically, where white means the value is 0 – the diverging colour map would have been a far more effective tool. However, matplotlib has placed white at the median value (by default); this sadly creates the false illusion of a 0 value, as our eyes tend to associate white with missing data, or ‘blanks’. Even if this isn’t your biggest beef with the divergent colour map, we run into the same problem as with the sequential colour map: blue and red don’t convey quantity by themselves (unless specified), and the darkness/lightness of the blues and reds is not linked well across the white in the middle. Thus, it doesn’t do either job very well in this graph. Basically, avoid divergent colour maps unless your data genuinely spans two signed ranges around a meaningful midpoint (e.g. data from -1 to +1); a short sketch of anchoring white at zero follows below.
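
For completeness, here is a minimal sketch of how one might pin white to a zero crossing with a diverging map. This is my own toy example (made-up signed data), and it assumes a reasonably recent matplotlib that provides TwoSlopeNorm:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import TwoSlopeNorm

# Made-up signed field spanning negative and positive values
x = np.linspace(-3, 3, 200)
z = np.outer(np.sin(x), np.cos(x))

# Anchor white at zero so blue/red genuinely encode sign
norm = TwoSlopeNorm(vmin=z.min(), vcenter=0.0, vmax=z.max())
plt.contourf(x, x, z, cmap=plt.cm.bwr, norm=norm)
plt.colorbar()
plt.savefig('diverging_zero_crossing.png', dpi=300)
plt.close('all')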

2. Colour displays categorical information.
An obvious use of colour is the ability to categorise our data. Anything as simple as a line chart with multiple lines will tell you that colour is terrific at distinguishing groups. This time, notice that the different colour schemes have very different effects:

Colour schemes can instantly differentiate groups.


Notice how this time around, the greyscale method (right) is clearly the losing choice. To begin with, it’s hard to tell persons A, B and C apart, and there’s almost a temptation to think that person A morphs into person C! On the left, however, with a distinct set of colours, persons A, B and C are clearly separated. Although a set of three distinct colours is a good thing, bear in mind the following…

  • Make sure the colours don’t clash with respect to lightness! Try to pick something that’s distinct (blue/red/green), rather than two colours which can be interpreted as two shades of the same colour (red/pink, blue/cyan, etc.)
  • Pick a palette to choose from – a rainbow is typically the best choice just because it’s the most natural, but feel free to choose your own set of hues. Also include white and black as necessary, so long as it’s clear that they are also part of the palette. White in particular would only work if you have a black outline.
  • Keep in mind that colour blind readers can have trouble with certain colour combinations (red/yellow/green) and it’s best to steer toward colourblind-friendly palettes.
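
The full code used to generate the figures in this post follows: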
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as sp
from mpl_toolkits.axes_grid1 import make_axes_locatable

### Part 1
# Sample 250 points
np.random.seed(30)
x = np.random.normal(size = 250)
np.random.seed(71)
y = np.random.normal(size = 250)

# Assume the limits of a standard normal are at -3, 3
pmin, pmax = -3, 3

# Create a meshgrid that is 250x250
xgrid, ygrid = np.mgrid[pmin:pmax:250j, pmin:pmax:250j]
pts = np.vstack([xgrid.ravel(), ygrid.ravel()]) # ravel unwinds xgrid from a 250x250 matrix into a 62500x1 array

data = np.vstack([x,y])
kernel = sp.gaussian_kde(data)
density = np.reshape(kernel(pts).T, xgrid.shape) # Evaluate the density for each point in pts, then reshape back to a 250x250 matrix

greys = plt.cm.Greys
bwr = plt.cm.bwr
jet = plt.cm.jet

# Create 3 contour plots
fig, ax = plt.subplots(1,3)
g0 = ax[0].contourf(xgrid, ygrid, density, cmap = bwr)
c0 = ax[0].contour(xgrid, ygrid, density, colors = 'k') # Create contour lines, all black
g1 = ax[1].contourf(xgrid, ygrid, density, cmap = greys)
c1 = ax[1].contour(xgrid, ygrid, density, colors = 'k') # Create contour lines, all black
g2 = ax[2].contourf(xgrid, ygrid, density, cmap = jet)
c2 = ax[2].contour(xgrid, ygrid, density, colors = 'k') # Create contour lines, all black

# Divide each axis then place a colourbar next to it
div0 = make_axes_locatable(ax[0])
cax0 = div0.append_axes('right', size = '10%', pad = 0.1) # Append a new axes object
cb0  = plt.colorbar(g0, cax = cax0)

div1 = make_axes_locatable(ax[1])
cax1 = div1.append_axes('right', size = '10%', pad = 0.1)
cb1  = plt.colorbar(g1, cax = cax1)

div2 = make_axes_locatable(ax[2])
cax2 = div2.append_axes('right', size = '10%', pad = 0.1)
cb2  = plt.colorbar(g2, cax = cax2)

fig.set_size_inches((15,5))
plt.tight_layout()
plt.savefig('normals.png', dpi = 300)
plt.close('all')

### Part 2
years = np.arange(1999, 2017, 1)
np.random.seed(20)
progress1 = np.random.randint(low=500, high =600, size = len(years))
np.random.seed(30)
progress2 = np.random.randint(low=500, high =600, size = len(years))
np.random.seed(40)
progress3 = np.random.randint(low=500, high =600, size = len(years))

fig, ax = plt.subplots(1,2)
ax[0].plot(years, progress1, label = 'Person A', c = '#348ABD')
ax[0].plot(years, progress2, label = 'Person B', c = '#00de00')
ax[0].plot(years, progress3, label = 'Person C', c = '#A60628')
ax[0].set_xlabel("Years")
ax[0].set_ylabel("Progress")
ax[0].legend()

ax[1].plot(years, progress1, label = 'Person A', c = 'black')
ax[1].plot(years, progress2, label = 'Person B', c = 'gray')
ax[1].plot(years, progress3, label = 'Person C', c = '#3c3c3c')
ax[1].set_xlabel("Years")
ax[1].set_ylabel("Progress")
ax[1].legend()

fig.set_size_inches((10,5))
plt.tight_layout()
plt.savefig('colourgrps.png', dpi = 300)
plt.close('all')

Tracked changes in LaTeX

Maybe people keep telling you Word is great, but you are just too emotionally attached to LaTeX to consider using anything else. It just looks so beautiful. Besides, you would have to leave your beloved linux environment (maybe that’s just me), so you stick with what you know. You work long and hard for many weeks, finally producing a draft of a paper that gets the all-clear from your supervisor to submit to journal X. Eventually you hear back, and the reviewers have responded with some good ideas and a few pedantic points. Apparently this time the journal wants a tracked-changes version to go with your revised manuscript.

Highlighting every change sounds like a lot of bother, and besides, you’d have to process the highlighted version to generate the clean version they want you to submit alongside it. There must be a better way, and one that doesn’t involve converting your document to Word.

Thankfully, the internet has an answer! Check out this little package changes which will do just what you need. As long as you annotate using \deleted{}, \replaced{} and \added{} along the way, you will have to change just one word of your tex source file in order to produce the highlighted and final versions. It even comes with a handy bash script to get rid of the resulting mess when you’re happy with the result, leaving you with a clean final tex source file.
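
To make this concrete, here is a minimal toy example of the mark-up (my own sketch, not taken from the package documentation). Switching from the tracked-changes version to the clean final version is just a matter of adding the final option when loading the package:

\documentclass{article}
%\usepackage{changes}        % tracked version: edits are highlighted in the PDF
\usepackage[final]{changes}  % final version: only the accepted text is shown

\begin{document}
We measured the \replaced{binding affinity}{activity} of
\added{all} the mutants\deleted{ we could express}.
\end{document}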


The die-hard Word fans won’t be impressed, but you will be very satisfied that you have found a nice little solution that does just the job you want it to. It’s actually capable of much more, including comments by multiple authors, customisation of colours and styles, and an automatically generated summary of changes. I have heard good things about ShareLaTeX for collaboration, but this simple package will get you a long way if you are not keen to start paying money yet.

 

Drawing Custom Unrooted Trees from Sequence Alignments

Multiple Sequence Alignments can provide a lot of information relating to the relationships between proteins. One notable example was the map of the kinome space published in 2002 (Figure 1).

 

Figure 1. Kinase space as presented by Manning et al., 2002.

Such images organize our thinking about the possible space of such proteins/genes, going beyond long lists of multiple sequence alignments. The image in Figure 1 later got a revamp, which is now the popular ‘kinome poster’ (Figure 2).

Figure 2. Revamped dendrogram of the kinome from Fig. 1. Downloaded from http://i.imgur.com/BPLUvfc.png.

Here we have created a script to produce similar dendrograms straight from multiple sequence alignment files (although clearly not as pretty as Fig 2!). It is not difficult to find software that will produce ‘a dendrogram’ from an MSA, but making it do the simple thing of annotating the nodes with colors, shapes etc. according to the labels of the genes/sequences is slightly more problematic. Sizes might correspond to the importance of given nodes, and colors can organize nodes by their tree branches. The script uses the Biopython module Phylo to construct a tree from an arbitrary MSA and networkx to draw it:

The script, Treebeard.py:
import networkx, pylab
from networkx.drawing.nx_agraph import graphviz_layout
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor
from Bio import AlignIO

#What color to give to the edges?
e_color = '#ccccff'
#What colors to give to the nodes with similar labels?
color_scheme = {'RSK':'#e60000','SGK':'#ffff00','PKC':'#32cd32','DMPK':'#e600e6','NDR':'#3366ff','GRK':'#8080ff','PKA':'magenta','MAST':'green','YANK':'pink'}
#What sizes to give to the nodes with similar labels?
size_scheme = {'RSK':200,'SGK':150,'PKC':350,'DMPK':400,'NDR':280,'GRK':370,'PKA':325,'MAST':40,'YANK':200}

#Edit this to produce a custom label to color mapping
def label_colors(label):
	color_to_set = 'blue'
	for label_subname in color_scheme:
		if label_subname in label:
			color_to_set = color_scheme[label_subname]
	return color_to_set

#Edit this to produce a custom label to size mapping
def label_sizes(label):
	#Default size
	size_to_set = 20
	for label_subname in size_scheme:
		if label_subname in label:
			size_to_set = size_scheme[label_subname]
	return size_to_set

#Draw a tree whose alignment is stored in msa.phy
def draw_tree():
	
	#This loads the default kinase alignment that should be in the same directory as this script
	aln = AlignIO.read('agc.aln', 'clustal')
	#This will construct the unrooted tree.
	calculator = DistanceCalculator('identity')
	dm = calculator.get_distance(aln)
	constructor = DistanceTreeConstructor()
	tree = constructor.nj(dm)
	G = Phylo.to_networkx(tree)
	node_sizes = []
	labels = {}
	node_colors = []
	for n in G:
		label = str(n)
		if 'Inner' in label:
			#These are the inner tree nodes -- leave them blank and with very small sizes.
			node_sizes.append( 1 )
			labels[n] = ''
			node_colors.append(e_color)
		else:
			#Size of the node depends on the labels!
			node_sizes.append( label_sizes(label) )
			#Set colors depending on our color scheme and label names
			node_colors.append(label_colors(label))
			#set the label that will appear in each node			
			labels[n] = label
	#Draw the tree given the info we provided!
	pos = graphviz_layout(G)
	networkx.draw(G, pos,edge_color=e_color,node_size = node_sizes, labels=labels, with_labels=True,node_color=node_colors)
	#Showing	
	pylab.show()
	#Saving the image -- uncomment
	#pylab.savefig('example.png')

if __name__ == '__main__':
	
	draw_tree()

We are going to use the kinase alignment example to demonstrate how the script can be used. The kinase alignment we use can be found here on the kinase.com website. We load the alignment and construct the unrooted tree using the Bio.Phylo module. Note that on each line of the alignment there is a name. These names are the labels that we use to define the colors and sizes of the nodes. There are two dummy functions that achieve this, label_colors() and label_sizes() — if you look at them it should be clear how to define your own custom labeling.

If you download the code and the alignment and run it by:

python Treebeard.py

You should see an image similar to that in Fig 3.

Fig 3. Size-color-customized unrooted tree straight from a multiple sequence alignment file of protein kinases. Constructed using the script Treebeard.py.

From 300 superpositions/sec to 6,000,000 superpositions/sec. A Python to C story

Part of the work I do requires me to identify the uniqueness of structural shapes in specific proteins. People have done this in many ways, some more complex than others. I like simpler things, so I decided to use clustering on top of a superposition algorithm. I will not go into much detail about the clustering part of this algorithm, as it is not the focus of this article, but let’s assume that we need to make millions, possibly billions, of superpositions. In the following I will go through the steps I took in making a superposition library faster and faster, as the requirements demanded more and more. It will include information on how to profile your code, how to connect Python to C, and some idiosyncrasies of parallelising C/C++ code.

Version 1.0 Python 100%

In PyFread I found a superposition algorithm, coupled with a PDB parser. I extracted the code and created a wrapping function. The signature of it was:

def superpose(file1, file2):

The files had to contain only ATOM lines, with the same number of residues in each. It returned the RMSD of the best superposition, which was needed for the clustering. This could do a couple of hundred superpositions/second. Using the built-in Python profiler I found that both the file-reading part of the algorithm and the superposition itself were slowing down the program, so I decided to move the contents of the wrapping function into C.

You can easily profile your code by running the command below. It will tell you, for each method, how many times it is called and the total time spent in it. More information here: https://docs.python.org/2/library/profile.html

python -m cProfile your_script.py
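
If you dump the statistics to a file with the -o flag, you can also sort and inspect them afterwards with the pstats module. A small sketch (the file names here are arbitrary):

# First: python -m cProfile -o profile.out your_script.py
import pstats

stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative')  # sort by cumulative time spent in each call
stats.print_stats(10)           # print the 10 most expensive entries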

Version 2.0 C 99% – Python 1%

My script still needed to run from Python, but its functionality would be implemented in C. So I rewrote the whole thing in C (a PDB parser and superposition algorithm) and interfaced it to Python. The way to achieve this is using ctypes, which provides an interface for calling C/C++ functions from Python. The way you set this up is:

C code rmsd.cpp

extern "C" {
    void superpose_c(const char* file1, const char* file2, double* rmsd){
        //perform superposition
        *rmsd = compute_rmsd();
    }
}

You compile this code with: g++ -shared -Wl,-soname,rmsdlib -o rmsdlib.so -fPIC rmsd.cpp. This creates a shared object which can be loaded in Python.

Python Code

# load the ctypes library with C-like objects
from ctypes import *
# load the C++ shared object
lib = cdll.LoadLibrary('./rmsdlib.so')

# convert a python string into a C char array
def to_c_string(py_string):
    c_string = (c_char * (len(py_string) + 1))()
    for i in range(len(py_string)):
        c_string[i] = py_string[i]
    c_string[len(py_string)] = '\0'  # null-terminate the array
    return c_string

rmsd = c_double()
lib.superpose_c(to_c_string("file1.pdb"), to_c_string("file2.pdb"), byref(rmsd))
# byref is necessary so that the value set to rmsd inside the C function is visible back in Python

There are other ways of coding the Python–C bridge, I’m sure, and if you think you have something better and easier please comment.
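
For reference, a slightly leaner variant (my own sketch, assuming the same rmsdlib.so as above) is to declare the argument and return types and let ctypes convert the Python strings to char* for you:

from ctypes import cdll, c_char_p, c_double, POINTER, byref

lib = cdll.LoadLibrary('./rmsdlib.so')
lib.superpose_c.argtypes = [c_char_p, c_char_p, POINTER(c_double)]
lib.superpose_c.restype = None

rmsd = c_double()
lib.superpose_c("file1.pdb", "file2.pdb", byref(rmsd))  # ctypes handles the string conversion
print rmsd.value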

Version 2.1 Armadillo Library

After running another batch of profiling, this time in C, I found that the bottleneck was the superposition code.

An easy way to profile C code is to compile it with the -pg flag (gcc -pg …). You then run the executable, which will produce a file called gmon.out. Using that, you run the command gprof executable_name gmon.out > analysis.txt, which stores the profiling info in analysis.txt. You can read more on this here: http://www.thegeekstuff.com/2012/08/gprof-tutorial/

The superposition algorithm involved doing a Singular Value Decomposition, which in my implementation was slow. Jaroslaw Nowak found the Armadillo library, which does a lot of cool linear algebra very fast and is written in C++. He rewrote the superposition using Armadillo, which made it much faster. You can read about Armadillo here: http://arma.sourceforge.net/

Version 3.0 C 100%

So what is the bottleneck now? Every time I do a superposition the C function gets called from Python. Suppose I have the following files: file1.pdb, file2.pdb, file3.pdb, file4.pdb. I have to compare file1.pdb to every other file, which means I will be reading file1.pdb 3 times (or 1 million times, depending on what I had to do) in my original Python–C implementation. My clustering method is smarter than this, but similar redundancies did exist and they slowed down my program. I therefore decided to move everything into C, so that once I read a file I could keep it stored in memory.

Version 3.1 We have multiple cores, let’s use them

The superpositions can be parallelised, but how straightforward is doing this in C++? First of all you need to include the header: #include <thread>. A simple scenario would be that I have a vector of N filenames and the program has to compare everything with everything. The results are stored in the upper triangle of an NxN array. The way I usually approach such a situation is to send each thread a list of comparison indices (e.g. thread 1 compares 1-2, 1-3, 1-4; thread 2 compares 2-3, 2-4, …), together with the address of the results matrix (where the RMSD values will be stored). Because each thread writes to a different set of indices in the matrix, it should not be a problem that all the threads see the same results matrix (if you code it properly).

The way to start a thread is:

thread(par_method, …. , ref(results))

par_method is the method you are calling. When you pass an argument like results to a thread, unless it is wrapped with ref(..) it will be passed by value (it does not matter that, in a normal unthreaded call, it would be passed by reference). ref is a reference wrapper and will make sure that all the threads see the same results instance.

Other problems you can come across are wanting to write to the same file from multiple threads, or modifying vector objects which are passed by reference (with ref). Their operations are not thread-safe, and the program can crash if more than one thread calls a function on these objects at the same time. To make sure this does not happen you can use a lock. The way you achieve this in C++ is with a mutex.

#include <mutex>
// Declare a global variable mutex that every thread sees
mutex output_mtx;

void parallel_method(….,vector<string>& files){
    output_mtx.lock();
    // Unsafe without the lock: push_back on a shared vector is not thread-safe
    files.push_back(filename);
    output_mtx.unlock();
}

If one thread has locked output_mtx, another one won’t finish the execution of output_mtx.lock() until the first one unlocks it.

Version 3.2 12 million files in a folder is not a good idea

My original use case was that the method would receive two file names, the files would be read and the superposition done. This original use case stayed with the program, and even in version 3.1 you needed one file for each fragment. Because the library was becoming faster and faster, the things we wanted to try became more ambitious. You could hear me saying ‘Yeah, sure, we can cluster 12 million PDB fragments’. To do this I had to extract the 12 million fragments and store them in as many files. It wasn’t long until I received an angry/condescending email from the IT department. So what to do?

I decided to take each PDB file, remove everything except the ATOM fields and store it as a binary file. In this way, instead of providing two filenames to the superposition algorithm, you provide two byte ranges which are read from the pre-parsed PDB files. This is fast, and you also end up with only ~100k files (one for each PDB entry), which you can use over and over again.
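
To illustrate the idea (a minimal Python sketch of the concept, not the actual C implementation): if the pre-parsed fragments are concatenated into one binary file per PDB entry and you keep an index of (offset, length) pairs, a fragment can be pulled out with a single seek instead of opening one file per fragment. The index and file names below are hypothetical:

# Hypothetical index: fragment id -> (byte offset, number of bytes) in the pre-parsed file
index = {'frag_0001': (0, 320), 'frag_0002': (320, 320)}

def read_fragment(path, frag_id):
    offset, length = index[frag_id]
    with open(path, 'rb') as f:
        f.seek(offset)         # jump straight to the fragment
        return f.read(length)  # read only the bytes belonging to it

coords = read_fragment('1abc_parsed.bin', 'frag_0001')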

Version 3.2 is the latest version as of now, but it is worth pointing out that whatever you do to make your code faster, it will never be enough! The requirements become more ambitious as your code gets faster. I can now perform 6 million superpositions/second on fragments of 4 residues, and 2 million superpositions/second on fragments of 5 residues. Performance declines steeply as the fragments get longer, and I can foresee a requirement to cluster all the 10-residue fragments from the PDB appearing sometime in the future.

OPIG Algorithm Club Problem #1: The 3n+1 problem

In the first meeting of the OPIG Algorithm Club, we tackled the problem 3n+1 from the Sphere Online Judge (SPOJ).

— Background —

The problem relates to the Collatz conjecture. Before describing the conjecture, let us define a very simple recursive relation:

  • If a number is odd, multiply it by 3 and add 1 (hence 3n+1).
  • If a number is even, divide it by 2.

The Collatz conjecture states that for any given positive integer, you can repeat this recursion indefinitely and you will always reach the number 1.

The 3n+1 problem requires yet another concept, the cycle length: the number of terms in the sequence for a given number, counting both the starting number and the final 1. For example, starting from 22 the sequence is 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1, so the cycle length of 22 is 16.

— Goal —

The aim of the problem is: given 2 integers i and j, find the maximum cycle length over all numbers lying in the interval between the 2 integers (both endpoints included). For example, between 1 and 10 the maximum cycle length is 20, attained at n = 9.

— Important Issues —

Two things are worth mentioning before attempting to implement this problem:

  • i could be either greater than or less than j.
  • Despite the fact that the description of the problem states that all operations can be performed using int, there are certain inputs between 100,000 and 1,000,000 for which the intermediate values of the sequence overflow a signed int, violating that statement.

— Implementation —

A simple solution can be implemented by using a recursive function to compute the cycle length:

int Cycle_Length (unsigned int n) /* Note the unsigned precision */
{
        unsigned int aux;

        /* Stopping condition - we don't want the recursion to run forever */
        if (n == 1) return 1;

        /* Here is the recursion: one branch if n is odd, the other if n is even */
        aux = (n%2) ? (1 + Cycle_Length(3*n+1)) : (1 + Cycle_Length(n>>1));
        /* Note that division by two is performed by shifting the bits to the right */
        return aux;
}

Now, there are two ways we can optimise this function and make it more efficient.

The first one relates to a certain property of odd numbers: we know that if n is odd, then 3*n+1 is going to be even. Therefore we can simply skip one iteration of the cycle:

aux = (n%2) ? (2+Cycle_Length((3*n+1)>>1)) : (1+Cycle_Length(n>>1)) ;

We can further optimise our code by using Dynamic Programming. We can store the Cycle Lengths we already computed so we don’t need to compute them ever again.

Using dynamic programming and the tricks above, we can get to the final solution (which runs pretty fast on the Online Judge platform):

#include<stdio.h>
#include<stdlib.h>

int CycleLength[100001];

int Cycle_Length (unsigned int n)
{
	int aux;
	if(n>100000)
	{
		aux = (n%2) ? (2+Cycle_Length((3*n+1)>>1)) : (1+Cycle_Length(n>>1)) ;
		return aux;
	}
	if(!CycleLength[n]) 	
		CycleLength[n] = (n%2) ? (2+Cycle_Length((3*n+1)>>1)) : (1+Cycle_Length(n>>1)) ;
	return CycleLength[n];
}

int findmax(int a, int b)
{
	int max=0,aux,i;
	for(i=a;i<=b;i++)
	{
		aux=Cycle_Length(i);
		if(max<aux)
			max=aux;
	}
	return max;	
}

int main()
{
	int a,b;
	CycleLength[1]=1;
	while(scanf("%d %d",&a,&b)!=EOF)
		a<b?printf("%d %d %d\n",a,b,findmax(a,b)):printf("%d %d %d\n",a,b,findmax(b,a));
	return 0;
}

Ta daam!

Annotate Antibody CDR and Framework Residues

Intro

Antibodies have very well conserved structures and their binding site is chiefly comprised of the six CDRs. The great similarity between the 1700+ antibody structures that can be found in SAbDab/PDB prompted the introduction of numbering schemes which act as coordinates with respect to the sequence/structural features of antibodies. The earliest such numbering scheme was introduced by Wu and Kabat, followed by the structurally informed Chothia scheme, which was eventually amended by Abhinandan and Martin. Even though there are several such schemes, the one currently endorsed by the World Health Organization (WHO) is that of IMGT.

The Program Download.

It annotates the framework and CDR residues according to three definitions: Kabat, Chothia or Contact. You can download it here.

Possible issues?

You need an internet connection for this program to work, since it calls the Abnum service; you should therefore cite the following if you use this code:

Abhinandan, K.R. and Martin, A.C.R. (2008) Analysis and improvements to Kabat and structurally correct numbering of antibody variable domains Molecular Immunology, 45, 3832-3839.

How to use it?

As an example test case, type the following in the Framer directory:

python Framer.py --f 1A2Y.pdb --c AB --o my_first_output --d chothia

This should get the heavy and light chains of 1A2Y (A and B) and leave the output in a folder called my_first_output.

Options:

--f: Antibody file
--c: Antibody chains (you can submit just one or several)
--o: Output folder name – NB this is going to be created in the directory you call Framer from!
--d: CDR definition to be used; possible options are: chothia, kabat and contact.

Output files:

There are four output files:

red_blue.pdb: The pdb with b-factor colored CDRs. The CDRs have B-factor of 100.00 and the framework 0.00.
paratope.txt: The CDR residues, given in the format [id][whitespace][chain]
framework.txt: The Framework residues, given in the format [id][whitespace][chain]
full_info.txt: Full breakdown of the annotation given in the format:

[Original ID] [Original Chain] [AA] [Chothia ID] [CDR annotation (FR = framework, or the CDR id)]

 

Life in Colour – Vim

Among programmers, there are occasional debates on which editor is best — some love Eclipse, some are die-hard Emacs supporters, and some have no preference and use the default text editor(s) of their OS. Whatever your choice, you can never underestimate how useful Vim can be, e.g. if you SSH into another machine. And so, here is a vim config that I’ve been using (thanks to Ben Frot), which makes your vim environment very colourful and easy to read. Code available here.

Plus, you can do awesome things in vim:

Edit multiple files in Vim. Can get a little crazy but, hey, why not?


So, to do some of the crazier things (e.g. what I’ve shown in this blog post), try this:

# Open a file of choice
:e blah1.md

# First split to two screens; change between screens by Ctrl + ww
:split 

# Now open a second file
:e blah2.md

# Repeat for more screens & lines.

Happy vim-ing!

Get PDB intermolecular protein contacts and interface residues

Very often in Struc Bio it is necessary to determine the contacts between two molecules. Most of us in the group have written a snippet of code to compute precisely that, or have adapted the Biopython functionality, or used one of the tools in pdbtools. The piece of Python code presented here is a Biopython-based variant that gives you the intermolecular contacts and annotates the interface neighborhood. An example of the program output is given in the figure below:

The complex between an antibody and an antigen is shown on the left without the annotation. On the right it is shown with intermolecular contacts annotated in red (4.5A distance) and the interface neighborhood shown in green (10A away from any contact residue).


Download

You can download it from here. There are three files inside:

  1. GetInterfaces.py – the main source/runnable file
  2. README.txt – instructions, very similar to this post (quite a lot copy/pasted)
  3. 1A2Y.pdb – the PDB file used in the example to practice on.

Requirements:

You need Biopython (if you are from OPIG or any other bioinformatics group, it is most likely already installed on your machine). You can download it from here.

How to use it?

As a bare minimum, you need to provide the pdb structure(s) and the chains that you want to examine contacts between.

Input options:

  • --f1 : first pdb file [Required]
  • --f2 : second pdb file (if the contacts are to be calculated within the same molecule, just submit the same pdb in both cases) [Required]
  • --c1 : chains to be used for the first molecule [Required]
  • --c2 : chains to be used for the second molecule [Required]
  • --c : contact cutoff for intermolecular contacts (optional, set to 4.5A if not supplied on input)
  • --i : interface neighbor cutoff for the intramolecular neighborhood of the contacting interface (optional, set to 10.0A if not supplied on input). Set this option to zero (0.0) if you only want the intermolecular contacts in the interface, without the interface neighborhood.
  • --jobid : name for the output folder (set to out_<random number> if not supplied on input)

An example which you can just copy paste and run when in the same directory as the python script:

python GetInterfaces.py --f1 1A2Y.pdb --f2 1A2Y.pdb --c1 AB --c2 C --c 4.5 --i 10.0 --jobid example_output

The above command will calculate the contacts between the antibody in 1a2y (chains A and B) and the antigen (chain C). The contact distance is defined as 4.5A and the interface distance as 10A. All the output files are saved in the folder out_example_output.
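
Under the hood, finding intermolecular contacts with Biopython boils down to a neighbour search. Here is a minimal, self-contained sketch of that step (my own illustration using Bio.PDB's NeighborSearch, not the GetInterfaces.py code itself), using the same 1A2Y example:

from Bio.PDB import PDBParser, NeighborSearch

model = PDBParser(QUIET=True).get_structure('1A2Y', '1A2Y.pdb')[0]
antibody_atoms = [a for ch in ('A', 'B') for a in model[ch].get_atoms()]
antigen_atoms = model['C'].get_atoms()

ns = NeighborSearch(antibody_atoms)
contacts = set()
for atom in antigen_atoms:
    for nearby in ns.search(atom.coord, 4.5):  # 4.5A contact cutoff
        residue = nearby.get_parent()
        contacts.add((residue.get_parent().id, residue.id[1]))  # (chain, residue number)

for chain, resid in sorted(contacts):
    print chain, resid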

Output

The output folder is placed in the current directory. If you specify the output folder name (--jobid) it will be saved under the name ‘out_[whateveryoutyped]’, otherwise it will be ‘out_[randomgeneratednumber]’. The program tells you at the end where it saved all the files.

Output files:

  • molecule_1.pdb – the first supplied molecule with b-factor fields of contacts set to 100 and interface neighborhood set to 50
  • molecule_2.pdb – the second supplied molecule with b-factor fields of contacts set to 100 and interface neighborhood set to 50
  • molecule_1.txt – whitespace delimited file specifying the contacts and interface neighborhood in the first molecule in the format: [chain] [residue id] [contact ‘C’ or interface residue ‘I’]
  • molecule_2.txt – whitespace delimited file specifying the contacts and interface neighborhood in the second molecule in the format: [chain] [residue id] [contact ‘C’ or interface residues ‘I’]
  • molecule_1_constrained.pdb – the first molecule, which is constrained only to the residues in the interface.
  • molecule_2_constrained.pdb – the second molecule, which is constrained only to the residues in the interface.
  • parameters.txt – the contact distance and neighborhood distance used for the particular run.

Remove all LaTeX generated files

I am going to leave this here and bookmark it, because I am fed up with looking this up every time, not finding it, and having to `history | fgrep rm`. To be used if you want to delete all LaTeX-generated (and pdfLaTeX-generated) files.

rm *.aux *.out *.toc *.log *.synctex.gz *.bbl *.blg *.pdf

Use at your own risk!

 

Django for scientific applications

In my current work I am developing a cheminformatics tool that uses structural and activity data to investigate protein-ligand binding. I have only ever properly used (and love) python, and I listen to Saulo, so I decided to use Django to develop my application. I didn’t understand what it was or why it might be useful before I started using it, but below I discuss a few of the features that I think have been useful and might encourage others to use it.

Firstly I will outline how Django works. I wanted to download all the PDB structures for CDK2 and store the information in a data structure that is robust and easily used. We have a Target and a Protein. A Target is associated with a particular UniProt accession. Cyclin-dependent kinase 2 (CDK2) is a Target. A Protein is a set of 3D coordinates, so 1AQ1 is a Protein.

class Target(models.Model):
"""A Django model to define a given protein target"""
    UniProt = models.CharField(max_length=20,unique=True)
    InitDate = models.DateTimeField(auto_now_add=True)
    Title = models.CharField(max_length=10)

In the above Target model I have three different fields. The first field denotes the UniProt accession for the Target and is “unique”. This means that only one Target can have any given UniProt accession in my data structure. If I try to add another with the same value in the UniProt field it will throw an exception. The second field denotes the time and date that the model was created. This means I can check back to when the target was created. The third is the Title I would like to use for this, for example CDK2.

I can then make a new Target object by:

new_target = Target()
new_target.Title = "CDK2"
new_target.UniProt = "P24941"

and save it to the database by:

new_target.save() # Django takes care of the required SQL

The next model is for the Protein molecules:

class Protein(models.Model):
    """A Django model to define a given protein"""
    Code = models.CharField(max_length=6,unique=True)
    InitDate = models.DateTimeField(auto_now_add=True)
    TargetID = models.ForeignKey(Target)
    Apo = models.BooleanField()
    PDBInfo = models.FileField(upload_to='pdb')

The model contains the PDB code, e.g. 1AQ1, and the date the entry was added to the database. It also contains a foreign key relating it to its Target, and a boolean indicating whether the structure is apo or holo. Finally there is a file field relating this entry to the appropriate file path where the PDB information is stored.
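
A minimal sketch of adding a Protein entry and attaching its PDB file (assuming the new_target created above and a local file 1aq1.pdb; the FileField stores the file under the upload_to='pdb' location):

from django.core.files import File

new_prot = Protein()
new_prot.Code = "1AQ1"
new_prot.TargetID = new_target  # link back to the CDK2 Target created earlier
new_prot.Apo = False            # holo structure (ligand bound)
with open("1aq1.pdb") as pdb_file:
    # saves the file into the media folder and the database row in one go
    new_prot.PDBInfo.save("1aq1.pdb", File(pdb_file))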

Once the data has been added to the database, Django then deals with all SQL queries from the database:

my_prot = Protein.objects.get(Code="1aq1") # Gives me the Protein object "1aq1"
CDK2_prots = Protein.objects.filter(TargetID__Title="CDK2") # All PDB entries associated to CDK2, as a query set, behaving similarly to a list
CDK2_list = [x for x in CDK2_prots] # Now exactly like a list

The “__” in the above query allows one to span the foreign key relationship, so it is searching for the Title of the Target, not the Title of the Protein (another example of this kind of lookup is given after the code below). Finally I can then access the PDB files for each of these proteins.

my_prot = Protein.objects.get(Code="1aq1") # Gives me the Protein object "1aq1"
print my_prot.Code # prints "1aq1"
# my_prot.PDBInfo has the behaviour of a file handle
pdb_lines = my_prot.PDBInfo.readlines() # Reads the lines of the file
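
The same double-underscore lookups work with any field on the related model; for instance (assuming the fields defined above):

apo_structures = Protein.objects.filter(TargetID__UniProt="P24941", Apo=True)  # apo entries for the P24941 target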

There, you’ve made a queryable database where Django deals with all the hard stuff and everything is native to python. Obviously in this example it might not be so difficult to imagine alternative ways of creating the same thing using directory structures, but as the structure of your data becomes more complex, Django can be easily adapted, and as it grows it utilises the speed advantages of modern databases.