Category Archives: How To

How to do things. doh.

Viewing 3D molecules interactively in Jupyter iPython notebooks

Greg Landrum, curator of the invaluable open source cheminformatics API, RDKit, recently blogged about viewing molecules in a 3D window within a Jupyter-hosted iPython notebook (as long as your browser supports WebGL, that is).

The trick is to use py3Dmol. It’s easy to install:

pip install py3Dmol

This is built on the object-oriented, webGL based JavaScript library for online molecular visualization 3Dmol.js (Rego & Koes, 2015); here's a nice summary of the capabilities of 3Dmol.js. It's features include:

  • support for pdb, sdf, mol2, xyz, and cube formats
  • parallelized molecular surface computation
  • sphere, stick, line, cross, cartoon, and surface styles
  • atom property based selection and styling
  • labels
  • clickable interactivity with molecular data
  • geometric shapes including spheres and arrows

I tried a simple example and it worked beautifully:

import py3Dmol
view = py3Dmol.view(query='pdb:1hvr')
view.setStyle({'cartoon':{'color':'spectrum'}})
view

py3dmol_in_jupyter_ipython

The 3Dmol.js website summarizes how to view molecules, along with how to choose representations, how to embed it, and even how to develop with it.

References

Nicholas Rego & David Koes (2015). “3Dmol.js: molecular visualization with WebGL”.
Bioinformatics, 31 (8): 1322-1324. doi:10.1093/bioinformatics/btu829

Plotting and storing a 3D network in R

A simple toy example of a three layered network:

Note 1: In order to view the 3D plots, mac users will need Xquartz  installed (https://www.xquartz.org/).

 
require(igraph)
require(rgl)
#Another package that might be needed is "rglwidget". The function writeWebGL will show an error stating if rglwidget is required.
######################################//// 
######The basics######################////
######################################////
#1) Create a "food" network (three layers) 
set.seed(432)
g1<-watts.strogatz.game(dim = 1,size = 5,nei = 2,p = .5,loops = FALSE,multiple = FALSE)
g2<-watts.strogatz.game(dim = 1,size = 10,nei = 2,p = .2,loops = FALSE,multiple = FALSE)
g3<-watts.strogatz.game(dim = 1,size = 30,nei = 1,p = .5,loops = FALSE,multiple = FALSE)
g123=g1+g2+g3 


#Create more edges btw layers 
g123=rewire(g123,each_edge(prob=.4,loops = FALSE,multiple = FALSE)) 
ne=15;add_edges(g123,edges = cbind(sample(1:vcount(g1),size = ne,replace = TRUE), sample((vcount(g1)+1):vcount(g123),size = ne,replace = TRUE)))#top layer 
ne=30;add_edges(g123,edges = cbind(sample((vcount(g1)+1):(vcount(g1)+vcount(g2)),size = ne,replace = TRUE), sample((vcount(g1)+vcount(g2)+1):vcount(g123),size = ne,replace = TRUE)))#second layer 

#A quick plot of the graph
plot(g123,vertex.size=1,vertex.label.cex=0.02)


#Create 3d coordinates of the network layout
circpos=function(n,r=1){#Coordinates on a circle
 rad=seq(0,2*pi,length.out=n+1)[-1];x=cos(rad)*r;y=sin(rad)*r
 return(cbind(x,y))
}
#
lay=rbind(cbind(circpos(vcount(g1),r=1), runif(n = vcount(g1),-1,1)),
 cbind(circpos(vcount(g2),r=2), runif(n = vcount(g2),6,7)),
 cbind(circpos(vcount(g3),r=4), runif(n = vcount(g3),13,17))
)



#2d plot using the previous layout
plot(g123,vertex.size=5,vertex.label=NA,layout=lay[,c(1,3)])
plot(g123,vertex.size=1,vertex.label=NA,layout=lay[,c(1,2)])

layers

 
#3D graph plot
#Add some colour to nodes and edges
nodecols=c(rep("red",vcount(g1)),
 rep("blue",vcount(g2)),
 rep("yellow",vcount(g3)))

edgecols=function(elist,cols,grouplist){
 whatcol=rep(length(cols)+1,nrow(elist))
 finalcol=whatcol
 for(i in 1:nrow(elist)){
 for(k in length(cols):1){ 
 if( k * (length( intersect(elist[i,], grouplist[[k]]) ) > 0)){
 whatcol[i]=min(whatcol[i], k )
 }
 }
 finalcol[i]=cols[whatcol[i]]
 }
 return(finalcol)
}

#Open 3d viewer
rgl.open()
rglplot(g123, layout=lay,vertex.size=5,vertex.label=NA,vertex.color=nodecols,
 edge.color=edgecols(elist=get.edgelist(g123,names = FALSE),cols=c("orange","green","pink"),grouplist=list(1:vcount(g1), (vcount(g1)+1):(vcount(g1)+vcount(g2)), (vcount(g1)+vcount(g2)+1):vcount(g123)) )
)

3d_layers

###Storing the plot in an html file###

dirfolder="..." #your dir
#rgl.open()#instead of rgl.open use open3d, in order to save the plot. 
open3d()
rglplot(g123, layout=lay,vertex.size=5,vertex.label=NA,vertex.color=nodecols,
 edge.color=edgecols(elist=get.edgelist(g123,names = FALSE),cols=c("orange","green","pink"),grouplist=list(1:vcount(g1), (vcount(g1)+1):(vcount(g1)+vcount(g2)), (vcount(g1)+vcount(g2)+1):vcount(g123)) )
)
#Fix the view
rgl.viewpoint(theta=90, phi=0)

#Save a static 2d image:
rgl.snapshot(paste(dirfolder,"a_png_pic_0.png",sep=""), fmt="png", top=TRUE)

#Save the plot in a .htlm file:
rglfolder=writeWebGL(dir = paste(dirfolder,"first_net3d",sep=""), width=700)

#The previous function should create a file called index.htlm inside the folder "first_net3d". By opening this file in a browser (with javascript enabled) the 3d plot will be displayed again.
#Also the following command will open the plot in the browser:
browseURL(rglfolder)

Note 2: In order to view the .htlm file javascript should be enabled in the browser. (Here is an example on how to do this for safari ).

Although not covered in the previous script, further options are available such as edge/vertex size and the ability to control independently each of the nodes and edges in the graph. Here is an example that makes more use of these options:

clour_plot

3d network representing a T cell receptor. Edges are coloured according to a relevant path found between the bottom green node and the upper red node cluster.

T cell receptor (in blue), binding to a peptide (in red).

T cell receptor (in blue), binding to a peptide (in red).

Counting Threads

When someone talks about “counting threads” the first thing that you think of is probably shopping for bed sheets. But this post is not about the happy feeling of drifting off to sleep on smooth, comfortable Egyptian cotton. This post is about that much less happy feeling when you want to quickly run a bit of code on a couple of data sets to finish the results section of a thesis chapter, and you see this:

Luis blocking the server again....

Luis blocking the server again….

Obviously someone (*cough*Luis*cough*) is having some fun on the server without nice-ing their code to allow people who are much less organized than they should be (*cough*me*cough*) to do a quick last-minute data analysis run.

The solution: confront the culprit with their excessive server usage (Note: alternatives include manual server restart with a power cord to make the world your enemy – not recommended).

So… now we just need to find out how much of the server “ospina” is using. Screenshots won’t convince him… and we can’t take enough screenshots to show the extent of the server-hogging with his 1000s of processes anyway. We need to count…

Luckily there is a handy function to find out information about processes called pgrep. This is basically a ‘ps | grep’ function which has a bunch of options to reflect the many ways it can be used. We see opsina is running R, so here goes:

pgrep -c R

The -c flag counts processes and the pattern matches the command name that was run. But yeah, it turns out this wasn’t the best idea ever. A lot of people are running R (as might be expected in the Statistics Department), and you get a number that is really too high to be likely. We need to be more specific in our query, so let’s go back to the ps command. Second attempt:

ps -Af | grep ospina | wc -l

What we’re doing now is first showing all processes that are run on the server (ps -A) also showing details of the command run and who ran it (-f flag). Then we’re finding the ones that are labelled with our server culprit (grep ospina) and counting the lines we find. There are annoyingly still a few problems with this approach.

  1. We just ran this command on the server and thus will count a command like grep –color=auto ospina,
  2. User “ospina” is probably running a few more things than just his R command (like ssh-ing into the server and maybe a couple of screens)
  3. We get a number than looks far lower than what we expected just by visual inspection.

So… what happened? We can fix problems 1 and 2 by just piping to a further grep command. But problem 3 is different. As it turns out, our culprit is running multiple threads from the same process (which is also why you find so many chrome instances on htop for example). We just counted processes, when really the server is being occupied by his multi-threading exploits. So… if you want to back up your complaint with a nice number, here’s your baby:

ps -ALf | grep opsina | grep R-3.3 | wc -l

The -L flag displays all threads instead of only the processes. I further used R-3.3 as it turns out he is using a specific version of R, which I can use to specify this command. Otherwise it also helps to use inputs arguments to functions to search against. If your fingers get too tired to press the shift-key that often, ps -ALf is equivalent to ps -eLf.

For now: moan away, folks!

 

Disclaimer: Any scenarios alluded to in the above text are fictitious and do not represent the behaviour of the individuals mentioned. Read: obviously I do not do my analysis last-minute.

A beginner’s guide to Rosetta

Rosetta is a big software suite, and I mean really big. It includes applications for protein structure prediction, refinement, docking, and design, and specific adaptations of these applications (and others) to a particular case, for example protein-protein docking of membrane proteins to form membrane protein complexes. Some applications are available in one of the hassle-free servers online (e.g. ROSIE, Robetta, rosetta.design), which might work well if you’ve got just a few tests you would like to try using standard parameters and protocols. However, it’s likely that you will want to download and install a version if you’re interested in carrying out a large amount of modelling, or using an unusual combination of steps or scoring function. This is not a trivial task, as the source code is a 2.5G download, then your machine will be busy compiling for some time (around 5 hours on two cores on my old laptop). Alternatively, if the protocols and objects you’re interested in are part of PyRosetta, this is available in a pre-compiled package for most common operating systems and is less than 1G.

This brings me to the different ways to use Rosetta. Most applications come as an executable which you can find in Rosetta/main/source/bin/ after completing the build. There is documentation available on how to use most of these, and on the different flags which can be used to input PDB structures and parameters. Some applications can be run using RosettaScripts, which uses an xml file to define the protocol, including scoring functions, movers and other options. In this case, Rosetta/main/source/bin/rosetta_scripts.* is run, which will read the xml and execute the required protocol.

screenshot-from-2016-09-14-19-19-28

An example RosettaScript, used for the MPrelax protocol

PyRosetta is even more flexible, and relatively easy to use for anyone accustomed to programming in python. There are python bindings for the fast C++ objects and movers so that the increased usability is generally not greatly compromised by slower speeds. One of the really handy things about PyRosetta is the link to PyMOL which can be used to view the trajectory of your protein moving while a simulation is running. Just add the following to your .pymolrc file in your home directory to set up the link every time you open pymol:

run /PATH/TO/PYROSETTA/PyMOLPyRosettaServer.py

When it comes to finding your way around the Rosetta package, there are a few things it is very useful to know to start with. The demos directory contains plenty of useful example scripts and instructions for running your first jobs. In demos/tutorials you will find introductions to the main concepts. The demos/protocol_capture subdirectory is particularly helpful, as most papers which report a new Rosetta protocol will deposit here the scripts required to reproduce their results. These may not currently be the best methods to approach a problem, but if you have found a research article describing some results which would be useful to get for your system, they are a good starting point to learn how to make an application work. Then the world is your oyster as you explore the many possible options and inputs to change and add!

Processing large files using python: part duex

Last week I wrote a post on some of the methods I use in python to efficiently process very large datasets. You can find that here. Roughly it details how one can break a large file into chunks which then can be passed onto multiple cores to do the number crunching. Below I expand upon this, first creating a parent class which turns a given (large) file into chunks. I construct it in a manner which children classes can be easily created and tailored for specific file types, given some examples. Finally, I give some wrapping functions for use in conjunction with any of the chunkers so that the chunks can be processed using multiple cores.

First, and as an aside, I was asked after the previous post, at what scale these methods should be considered. A rough answer would be when the size of the data becomes comparable to the available RAM. A better answer would be, when the overhead of reading each individual line(/entry) is more than the operation on that entry. Here is an example of this case, though it isn’t really that fair a comparison:


>>> import timeit,os.path
>>> os.path.getsize("Saccharomyces_cerevisiae.R64-1-1.cds.all.fa")
10095955
>>> timeit.timeit('f = open("Saccharomyces_cerevisiae.R64-1-1.cds.all.fa");sum([l.count(">") for l in f])',number=10)
0.8403599262237549
>>> timeit.timeit('f = open("Saccharomyces_cerevisiae.R64-1-1.cds.all.fa");sum([c.count(">") for c in iter(lambda: f.read(1024*1024),"")])',number=10)
0.15671014785766602

For a small 10MB fasta file, we count the number of sequences present in a fifth of the time using chunks. I should be honest though, and state that the speedup is mostly due not having the identify newline characters in the chunking method; but nevertheless, it shows the power one can have using chunks. For a 14GB fasta file, the times for the chunking (1Mb chunks) and non-chunking methods are 55s and 130s respectively.

Getting back on track, let’s turn the chunking method into a parent class from which we can build on:


import os.path

class Chunker(object):

    #Iterator that yields start and end locations of a file chunk of default size 1MB.
    @classmethod
    def chunkify(cls,fname,size=1024*1024):
        fileEnd = os.path.getsize(fname)
        with open(fname,'r') as f:
            chunkEnd = f.tell()
            while True:
                chunkStart = chunkEnd
                f.seek(size,1)
                cls._EOC(f)
                chunkEnd = f.tell()
                yield chunkStart, chunkEnd - chunkStart
                if chunkEnd >= fileEnd:
                    break

    #Move file pointer to end of chunk
    @staticmethod
    def _EOC(f):
        f.readline()

    #read chunk
    @staticmethod
    def read(fname,chunk):
        with open(fname,'r') as f:
            f.seek(chunk[0])
            return f.read(chunk[1])

    #iterator that splits a chunk into units
    @staticmethod
    def parse(chunk):
        for line in chunk.splitlines():
            yield chunk

In the above, we create the class Chunker which has the class method chunkify as well as the static methods, _EOC, read, and parse. The method chunkify does the actual chunking of a given file, returning an iterator that yields tuples containing a chunk’s start and size. It’s a class method so that it can make use of _EOC (end of chunk) static method, to move the pointer to a suitable location to split the chunks. For the simplest case, this is just the end/start of a newline. The read and parse methods read a given chunk from a file and split it into units (single lines in the simplest case) respectively. We make the non-chunkify methods static so that they can be called without the overhead of creating an instance of the class.

Let’s now create some children of this class for specific types of files. First, one of the most well known file types in bioinformatics, FASTA. Below is an segment of a FASTA file. Each entry has a header line, which begins with a ‘>’, followed by a unique identifier for the sequence. After the header line, one or more lines follow giving the sequence. Sequences may be either protein or nucleic acid sequences, and they may contain gaps and/or alignment characters.


>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

And here is the file type specific chunker:


from Bio import SeqIO
from cStringIO import StringIO

class Chunker_FASTA(Chunker):

    @staticmethod
    def _EOC(f):
        l = f.readline() #incomplete line
        p = f.tell()
        l = f.readline()
        while l and l[0] != '>': #find the start of sequence
            p = f.tell()
            l = f.readline()
        f.seek(p) #revert one line

    @staticmethod
    def parse(chunk):
        fh = cStringIO.StringIO(chunk)
        for record in SeqIO.parse(fh,"fasta"):
            yield record
        fh.close()

We update the _EOC method to find when one entry finishes and the next begins by locating “>”, following which we rewind the file handle pointer to the start of that line. We also update the parse method to use fasta parser from the BioPython module, this yielding SeqRecord objects for each entry in the chunk.

For a second slightly harder example, here is one designed to work with output produced by bowtie, an aligner of short reads from NGS data. The format consists of of tab separated columns, with the id of each read located in the first column. Note that a single read can align to multiple locations (up to 8 as default!), hence why the same id appears in multiple lines. A small example section of the output is given below.


SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN25-2	2502 GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN37-2	4460	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN25-1	2502	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN37-1	4460	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN37-2	4460	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN25-1	2502	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN37-1	4460	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN25-2	2502	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.8 HWI-EAS355_3_Nick_1_1_187_1549 length=36	+	RDN25-2	2502	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.8 HWI-EAS355_3_Nick_1_1_187_1549 length=36	+	RDN37-2	4460	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3

with the corresponding chunker given by:


class Chunker_BWT(chunky.Chunker):

    @staticmethod
    def _EOC(f):
        l = f.readline()#incomplete line
        l = f.readline()
        if not l: return #EOF
        readID = l.split()[0]
        while l and (l.split()[0] != readID): #Keep reading lines until read IDs don't match
            p = f.tell()	
            l = f.readline()
        f.seek(p) #revert one line

    @staticmethod
    def parse(chunk):
        lines = chunk.splitlines()
        N = len(lines)
        i = 0 
        while i < N:
            readID = lines[i].split('\t')[0]
            j = i
            while lines[j].split('\t')[0] == readID:
                j += 1
                if j == N:
                    break
            yield lines[i:j]
            i = j

This time, the end of chunk is located by reading lines until there is a switch in the read id, whereupon we revert one line. For parsing, we yield all the different locations a given read aligns to as a single entry.

Hopefully these examples show you how the parent class can be expanded upon easily for most file types. Let’s now combine these various chunkers with the code from previous post to show how we can enable multicore parallel processing of the chunks they yield. The code below contains a few generalised wrapper functions which work in tandem with any of the above chunkers to allow most tasks to be parallelised .


import multiprocessing as mp, sys

def _printMP(text):
    print text
    sys.stdout.flush()

def _workerMP(chunk,inFile,jobID,worker,kwargs):
    _printMP("Processing chunk: "+str(jobID))
    output = worker(chunk,inFile,**kwargs)
    _printMP("Finished chunk: "+str(jobID))
    return output	

def main(inFile,worker,chunker=Chunker,cores=1,kwargs={}):
    pool = mp.Pool(cores)

    jobs = []
    for jobID,chunk in enumerate(chunker.chunkify(inFile)):
        job = pool.apply_async(_workerMP,(chunk,inFile,jobID,worker,kwargs))
        jobs.append(job)

    output = []
    for job in jobs:
        output.append( job.get() )

    pool.close()
    
    return output

The main function should be recognisable as the code from the previous post. It generates the pool of workers, ideally one for each core, before using the given chunker to split the corresponding file into a series of chunks for processing. Unlike before, we collect the output given by each job and return it after processing is finished. The main function acts as wrapper allowing us to specify different processing functions and different chunkers, as given by the variables worker and chunker respectively. We have wrapped the processing function call within the function _workerMP which prints to the terminal as tasks are completed. It uses the function _printMP to do this, as you need to flush the terminal after a print statement when using multi core processing, otherwise nothing appears until all tasks are completed.

Let’s finish by showing an example of how we would use the above to do the same task as we did at the start of this post, counting the sequences within a fasta file, using the base chunker:


def seq_counter(chunk,inFile):
    data = Chunker.read(inFile,chunk)
    return data.count('>')

and using the FASTA chunker:


def seq_counter_2(chunk,inFile):
    data = list(Chunker_FASTA.parse(Chunker_FASTA.read(inFile,chunk)))
    return len(data)

And time they take to count the sequences within the 14GB file from before:


>>> os.path.getsize("SRR951828.fa")
14944287128
>>> x=time.time();f = open("SRR951828.fa");sum([l.count(">") for l in f]);time.time()-x
136829250
190.05533599853516
>>> x=time.time();f = open("SRR951828.fa");sum([c.count(">") for c in iter(lambda: f.read(1024*1024),"")]);time.time()-x
136829250
26.343637943267822
>>> x=time.time();sum(main("SRR951828.fa",seq_counter,cores=8));time.time()-x
136829250
4.36846399307251
>>> x=time.time();main("SRR951828.fa",seq_counter_2,Chunker_FASTA,8);time.time()-x
136829250
398.94060492515564

Let’s ignore that last one, as the slowdown is due to turning the entries into BioPython SeqRecords. The prior one, which combines chunking and multicore processing, has roughly a factor of 50 speed up. I’m sure this could be further reduced using more cores and/or optimising the chunk size, however, this difference alone can change something from being computationally implausible, to plausible. Not bad for only a few lines of code.

Anyway, as before, I hope that some of the above was either new or even perhaps helpful to you. If you know of a better way to do things (in python), then I’d be very interested to hear about it. If I feel like it, I may follow this up with a post about how to integrate a queue into the above which outputs the result of each job as they are produced. In the above, we currently hold collate all the results in the memory, which has the potential to cause a memory overflow depending on what is being returned.

Processing large files using python

In the last year or so, and with my increased focus on ribo-seq data, I have come to fully appreciate what the term big data means. The ribo-seq studies in their raw forms can easily reach into hundreds of GBs, which means that processing them in both a timely and efficient manner requires some thought. In this blog post, and hopefully those following, I want to detail some of the methods I have come up (read: pieced together from multiple stack exchange posts), that help me take on data of this magnitude. Specifically I will be detailing methods for python and R, though some of the methods are transferrable to other languages.

My first big data tip for python is learning how to break your files into smaller units (or chunks) in a manner that you can make use of multiple processors. Let’s start with the simplest way to read a file in python.


with open("input.txt") as f:
    data = f.readlines()
    for line in data:
        process(line)

This mistake made above, with regards to big data, is that it reads all the data into RAM before attempting to process it line by line. This is likely the simplest way to cause the memory to overflow and an error raised. Let’s fix this by reading the data in line by line, so that only a single line is stored in the RAM at any given time.


with open("input.txt") as f:
    for line in f:
        process(line)

This is a big improvement, namely it doesn’t crash when fed a big file (though also it’s shorter!). Next we should attempt to speed this up a bit by making use of all these otherwise idle cores.


import multiprocessing as mp

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
with open("input.txt") as f:
    for line in f:
        jobs.append( pool.apply_async(process,(line)) )

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()

Provided the order of which you process the lines don’t matter, the above generates a set (pool) of workers, ideally one for each core, before creating a bunch of tasks (jobs), one for each line, for the workers to do. I tend to use the Pool object provided by the multiprocessing module due to ease of use, however, you can spawn and control individual workers using mp.Process if you want finer control. For mere number crunching, the Pool object is very good.

While the above is now making use of all those cores, it sadly runs into memory problems once again. We specifically use apply_async function so that the pool isn’t blocked while each line processes. However, in doing so, all the data is read into memory once again; this time stored as individual lines associated with each job, waiting inline to be processed. As such, the memory will again overflow. Ideally the method will only read the line into memory when it is its turn to be processed.


import multiprocessing as mp

def process_wrapper(lineID):
    with open("input.txt") as f:
        for i,line in enumerate(f):
            if i != lineID:
                continue
            else:
                process(line)
                break

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
with open("input.txt") as f:
    for ID,line in enumerate(f):
        jobs.append( pool.apply_async(process_wrapper,(ID)) )

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()

Above we’ve now changed the function fed to pool of workers to include opening the file, locating the specified line, reading it into memory, and then processing it. The only input now stored for each job spawned is the line number, thereby preventing the memory overflow. Sadly, the overhead involved in having to locate the line by reading iteratively through the file for each job is untenable, getting progressively more time consuming as you get further into the file. To avoid this we can use the seek function of file objects which skips you to a particular location within a file. Combining with the tell function, which returns the current location within a file, gives:


import multiprocessing as mp

def process_wrapper(lineByte):
    with open("input.txt") as f:
        f.seek(lineByte)
        line = f.readline()
        process(line)

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
with open("input.txt") as f:
    nextLineByte = f.tell()
    for line in f:
        jobs.append( pool.apply_async(process_wrapper,(nextLineByte)) )
        nextLineByte = f.tell()

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()

Using seek we can move directly to the correct part of the file, whereupon we read a line into the memory and process it. We have to be careful to correctly handle the first and last lines, but otherwise this does exactly what we set out, namely using all the cores to process a given file while not overflowing the memory.

I’ll finish this post with a slight upgrade to the above as there is a reasonable amount of overhead associated with opening and closing the file for each individual line. If we process multiple lines of the file at a time as a chunk, we can reduce these operations. The biggest technicality when doing this is noting that when you jump to a location in a file, you are likely not located at the start of a line. For a simple file, as in this example, this just means you need to call readline, which reads to next newline character. More complex file types likely require additional code to locate a suitable location to start/end a chunk.


import multiprocessing as mp,os

def process_wrapper(chunkStart, chunkSize):
    with open("input.txt") as f:
        f.seek(chunkStart)
        lines = f.read(chunkSize).splitlines()
        for line in lines:
            process(line)

def chunkify(fname,size=1024*1024):
    fileEnd = os.path.getsize(fname)
    with open(fname,'r') as f:
        chunkEnd = f.tell()
    while True:
        chunkStart = chunkEnd
        f.seek(size,1)
        f.readline()
        chunkEnd = f.tell()
        yield chunkStart, chunkEnd - chunkStart
        if chunkEnd > fileEnd:
            break

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
for chunkStart,chunkSize in chunkify("input.txt"):
    jobs.append( pool.apply_async(process_wrapper,(chunkStart,chunkSize)) )

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()

Anyway, I hope that some of the above was either new or even perhaps helpful to you. If you know of a better way to do things (in python), then I’d be very interested to hear about it. In another post coming in the near future, I will expanded on this code, turning it into a parent class from which create multiple children to use with various file types.

Colour wisely…

Colour – the attribute of an image that makes it acceptable or destined for the bin. Colour has a funny effect on us – it’s a double-edged sword that greatly strengthens, or weakens data representation in such a huge level. No one really talks about what’s a good way to colour an image or a graph, but it’s something that most can agree as being pleasing, or disgusting. There are two distinctive advantages to colouring a graph: it conveys both quantitative and categorical information very, very well. Thus, I will provide a brief overview (with code) on how colour can be used to display both quantitative and qualitative information. (*On the note of colours, Nick has previously discussed how colourblindness must be considered in visualising data…).

1. Colour conveys quantitative information.
A huge advantage of colour is that it can provide quantitative information, but this has to be done correctly. Here are three graphs showing the exact same information (the joint density of two normal distributions) and  we can see from the get-go which method is the best at representing the density of the two normal distributions:

Colouring the same graph using three different colour maps.

Colouring the same graph using three different colour maps.

If you thought the middle one was the best one, I’d agree too. Why would I say that, despite it being grayscale and seemingly being the least colourful of them all?

  • Colour is not limited to hues (i.e. whether it’s red/white/blue/green etc. etc.); ‘colour’ is also achieved by saturation and brightness (i.e., how vivid a colour is, or dark/light it is). In the case of the middle graph, we’re using brightness to indicate the variations in density and this is a more intuitive display of variations in density. Another advantage of using shades as the means to portray colour is that it will most likely be OK with colourblind users.
  • Why does the graph on the right not work for this example? This is a case where we use a “sequential” colour map to convey the differences in density. Although the colour legend clarifies what colour belongs to which density bin, without it, it’s very difficult to tell what “red” is with respect to “yellow”. Obviously by having a colour map we know that red means high density and yellow is lower, but without the legend, we can interpret the colours very differently, e.g. as categories, rather than quantities. Basically, when you decide on a sequential colour map, its use must be handled well, and a colour map/legend is critical. Otherwise, we risk putting colours as categories, rather than as continuous values.
  • Why is the left graph not working well? This is an example of a “diverging” colourmap.
    It’s somewhat clear that blue and red define two distinct quantities. Despite this, a major error of this colour map comes in the fact that there’s a white colour in the middle. If the white was used as a “zero crossing” — basically, where a white means the value is 0 — the diverging colour map would have been a more effective tool. However, we can see that matplotlib used white as the median value (by default); this sadly creates the false illusion of a 0 value, as our eyes tend to associate white with missing data, or ‘blanks’. Even if this isn’t your biggest beef with the divergent colour map, we run into the same colour as the sequential colour map, where blue and red don’t convey information (unless specified), and the darkness/lightness of the blue and red are not linked very well without the white in the middle. Thus, it doesn’t do either job very well in this graph. Basically, avoid using divergent colourmaps unless we have two different quantities of values (e.g. data spans from -1 to +1).

2. Colour displays categorical information.
An obvious use of colour is the ability to categorise our data. Anything as simple as a line chart with multiple lines will tell you that colour is terrific at distinguishing groups. This time, notice that the different colour schemes have very different effects:

Colour schemes can instantly differentiate groups.

Colour schemes can instantly differentiate groups.

Notice how this time around, the greyscale method (right) was clearly the losing choice. To begin with, it’s hard to pick out what’s the difference between persons A,B,C, but there’s almost a temptation to think that person A morphs into person C! However, on the left, with a distinct set of colours, there is a clear distinction of persons A, B, and C as the three separate colours. Although a set of distinct three colours is a good thing, bear in mind the following…

  • Make sure the colours don’t clash with respect to lightness! Try to pick something that’s distinct (blue/red/green), rather than two colours which can be interpreted as two shades of the same colour (red/pink, blue/cyan, etc.)
  • Pick a palette to choose from – a rainbow is typically the best choice just because it’s the most natural, but feel free to choose your own set of hues. Also include white and black as necessary, so long as it’s clear that they are also part of the palette. White in particular would only work if you have a black outline.
  • Keep in mind that colour blind readers can have trouble with certain colour combinations (red/yellow/green) and it’s best to steer toward colourblind-friendly palettes.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as sp
from mpl_toolkits.axes_grid1 import make_axes_locatable

### Part 1
# Sample 250 points
np.random.seed(30)
x = np.random.normal(size = 250)
np.random.seed(71)
y = np.random.normal(size = 250)

# Assume the limits of a standard normal are at -3, 3
pmin, pmax = -3, 3

# Create a meshgrid that is 250x250
xgrid, ygrid = np.mgrid[pmin:pmax:250j, pmin:pmax:250j]
pts = np.vstack([xgrid.ravel(), ygrid.ravel()]) # ravel unwinds xgrid from a 250x250 matrix into a 62500x1 array

data = np.vstack([x,y])
kernel = sp.gaussian_kde(data)
density = np.reshape(kernel(pts).T, xgrid.shape) # Evaluate the density for each point in pts, then reshape back to a 250x250 matrix

greys = plt.cm.Greys
bwr = plt.cm.bwr
jet = plt.cm.jet

# Create 3 contour plots
fig, ax = plt.subplots(1,3)
g0 = ax[0].contourf(xgrid, ygrid, density, cmap = bwr)
c0 = ax[0].contour(xgrid, ygrid, density, colors = 'k') # Create contour lines, all black
g1 = ax[1].contourf(xgrid, ygrid, density, cmap = greys)
c1 = ax[1].contour(xgrid, ygrid, density, colors = 'k') # Create contour lines, all black
g2 = ax[2].contourf(xgrid, ygrid, density, cmap = jet)
c2 = ax[2].contour(xgrid, ygrid, density, colors = 'k') # Create contour lines, all black

# Divide each axis then place a colourbar next to it
div0 = make_axes_locatable(ax[0])
cax0 = div0.append_axes('right', size = '10%', pad = 0.1) # Append a new axes object
cb0  = plt.colorbar(g0, cax = cax0)

div1 = make_axes_locatable(ax[1])
cax1 = div1.append_axes('right', size = '10%', pad = 0.1)
cb1  = plt.colorbar(g1, cax = cax1)

div2 = make_axes_locatable(ax[2])
cax2 = div2.append_axes('right', size = '10%', pad = 0.1)
cb2  = plt.colorbar(g2, cax = cax2)

fig.set_size_inches((15,5))
plt.tight_layout()
plt.savefig('normals.png', dpi = 300)
plt.close('all')

### Part 2
years = np.arange(1999, 2017, 1)
np.random.seed(20)
progress1 = np.random.randint(low=500, high =600, size = len(years))
np.random.seed(30)
progress2 = np.random.randint(low=500, high =600, size = len(years))
np.random.seed(40)
progress3 = np.random.randint(low=500, high =600, size = len(years))

fig, ax = plt.subplots(1,2)
ax[0].plot(years, progress1, label = 'Person A', c = '#348ABD')
ax[0].plot(years, progress2, label = 'Person B', c = '#00de00')
ax[0].plot(years, progress3, label = 'Person C', c = '#A60628')
ax[0].set_xlabel("Years")
ax[0].set_ylabel("Progress")
ax[0].legend()

ax[1].plot(years, progress1, label = 'Person A', c = 'black')
ax[1].plot(years, progress2, label = 'Person B', c = 'gray')
ax[1].plot(years, progress3, label = 'Person C', c = '#3c3c3c')
ax[1].set_xlabel("Years")
ax[1].set_ylabel("Progress")
ax[1].legend()

fig.set_size_inches((10,5))
plt.tight_layout()
plt.savefig('colourgrps.png', dpi = 300)
plt.close('all')

From 300 superpositions/sec to 6,000,000 superpositions/sec. A Python to C story

Part of the work I do requires me to identify uniqueness of structural shapes in specific proteins. People have done this in many ways, some more complex than others. I like more simple things so I decided to use clustering on top of a superposition algorithm. I will not detail too much the clustering part of this algorithm as this is not the focus of this article, but let’s assume that we need to make millions, possibly billions, of superpositions. In the following I will go through the steps I took in making a superposition library faster and faster, as the requirements demanded more and more. It will include information on how to profile your code, connect Python to C, and idiosyncrasies of parallelising C code.

Version 1.0 Python 100%

In PyFread I found a superposition algorithm, coupled with a PDB parser. I extracted the code and created a wrapping function. The signature of it was:

def superpose(file1, file2):

The files had to contain only ATOM lines with the same number of residues . It returned the RMSD of the best superposition, which was needed for the clustering. This could do a couple of a hundred  superpositions/second. Using the inbuilt python profiler I found out that both the reading in part of the algorithm and the superposition are slowing down the program so I decided to move the contents of the wrapping function in C.

You can easily profile your code by running the command below. It will tell you for each method how many times it’s called and what is the total time spent in it. More information here https://docs.python.org/2/library/profile.html

python -m cProfile your_script.py

Version 2.0 C 99% – Python 1%

My script still needed to run from python but it’s functionality would be implemented in C. So I rewrote the whole thing in C (a PDB parser and superposition algorithm)  and  interfaced it to Python. The way to achieve this is using ctypes. This provides an interface for calling C/C++ methods from Python.  The way you set this up is:

C code rmsd.cpp

extern “C” {
    void superpose_c(const char* file1, consta char* file2, double* rmsd){
        //perform superposition
        *rmsd = compute_rmsd();
    }
}

You compile this code with: g++ -shared -Wl,-soname,rmsdlib -o rmsdlib.so -fPIC rmsd.cpp . This creates a shared object which can be loaded in Python

Python Code

# load the ctypes library with C-like objects
from ctypes import *
# load the C++ shared objects
lib = cdll.LoadLibrary(‘./rmsdlib.so’)

#convert python string to c char array function
def to_c_string(py_string)
    c_string = c_char * (len(py_string)+1)()
    for i in range(len(py_string)):
        c_string[i] = py_string[i]
    c_string[i+1] = “”

rmsd = c_double()
lib.superpose_c(to_c_string(“file1.pdb”), to_c_string(“file2.pdb”), ref(rmsd))
# the use of ref is necessary so that the value set to rmsd inside the C function will be returned to Python

There are other ways of coding the Python – C Bridge I’m sure, and if you think you have something better and easier please comment.

Version 2.1 Armadillo Library

After running another batch of profiling, this time in C, I found that the bottleneck was the superposition code.

An easy way to profile C code is that when you compile it you use the -pg flag (gcc -pg …..). You then run the executable which will produce a file called gmon.out. Using that you run the command gprof executable_name gmon.out > analysis.txt . This will store the profiling info in analysis.txt. You can read more on this here http://www.thegeekstuff.com/2012/08/gprof-tutorial/

The superposition algorithm involved doing a Singular Value Decomposition, which in my implementation was slow. Jaroslaw Nowak found the Armadillo library which does a lot of cool linear algebra stuff very fast, and it is written in C. He rewrote the superposition using Armadillo which made it much faster. You can read about Armadillo here http://arma.sourceforge.net/

Version 3.0 C 100%

So what is the bottleneck now? Everytime I do a superposition the C function gets called from Python. Suppose I have the following files: file1.pdb, file2.pdb, file3.pdb, file4.pdb . I have to compare file1.pdb to every other file, which means I will be reading file1.pdb 3 times ( or 1 million times depending on what I had to do) in my original Python – C implementation. My clustering method is smarter than this, but similar redundancies did exists and they slowed down my program. I therefore decided to move everything in C, so that once I read a file I could keep it stored in memory.

Version 3.1 We have multiple cores, let’s use them

The superpositions can be parallelised, but how straightforward is doing this in C++? First of all you need to import the library #include <thread> . A simple scenario would be that I have a vector of N filenames and the program would have to compare everything with everything. The results would be stored in the upper triangle of an NxN array. The way I usually approach such a situation is that I send to each thread a list of comparison indeces (eg. Thread 1 compares 1-2, 1-3, 1-4; Thread 2-3, 2-4….), and the address of the results matrix (where the RMSD values would be stored). Because each thread will be writing to a different set of indeces in the matrix it should not be a problem that all the threads see the same results matrix (if you code it properly).

The way to start a thread is:

thread(par_method, …. , ref(results))

par_method is the method you are calling. When you pass an argument like results to a thread if it’s not specified with ref(..) it would be passed by value (it does not matter if normally in an unthreaded environment it would be passed by reference). ref is a reference wrapper and will make sure that all the threads see the same results instance.

Other problems you can come across is if you want to write to the same file from multiple threads, or modify vector objects which are passed by reference (with ref). They do not have thread safe operations, the program will crash if more than one thread will call a function on these objects. To make sure this does not happen you can use a lock. The way you achieve this in C is by using with mutex.

#include <mutex>
// Declare a global variable mutex that every thread sees
mutex output_mtx;

void parallel_method(….,vector<string>& files){
    output_mtx.lock();
    //Unsafe operation
    files.append(filename);
    output_mtx.unlock();
}

If one thread has locked output_mtx another one won’t finish the execution of output_mtx.lock() until the other first one unlocks it.

Version 3.2 12 million files in a folder is not a good idea

My original use case was that the method would receive two file names, the files would be read and the superposition done. This original use case stayed with the program, and even in version 3.1 you required one file for each fragment. Because the library was becoming faster and faster the things we wanted to try became more ambitious. You could hear me saying ‘Yeah sure we can cluster 12 million PDB fragments’. To do this I had to extract the 12 million fragments and store them in as many files. It wasn’t long until I received an angry/condecending email from the IT department. So what to do?

I decided to take each PDB file, remove everything except the ATOM fields and store them as binary files. In this way instead of providing two filenames to the superposition algorithm you provide two byte ranges which would be read from the pre-parsed PDB files. This is fast and you also have only 100k files(one for each PDB file), which you could use over and over again.

Version 3.2 is the latest version as of now but it is worth pointing out that whatever you do to make your code faster it will never be enough! The requirements become more ambitious as your code is getting faster. I can now perform 6 million superpositions/second on fragments of 4 residues, and 2 million superpositions/second on fragments of 5 residues. This declines exponentially and I can foresee a requirement to cluster all the 10 residue fragments from the PDB appearing sometime in the future.

 

 

 

 

 

 

Submitting your thesis!

Writing and submitting your thesis is (almost) the final stage of completing your PhD. It can be the most stressful and unpleasant part of the process… but it can also be rewarding to see the story of your last three years’ work fall into place.

 "Piled Higher and Deeper" by Jorge Cham (www.phdcomics.com)

All I want for christmas is… “Piled Higher and Deeper” by Jorge Cham (www.phdcomics.com)

This post is a miscellaneous collection of advice and resources about the submission process, most of which have been passed down from the very first members of OPIG. Hopefully it will be useful to have it all in the same place for present and future members. Feel free to comment here if you have any tips I have missed!

All information and links that I’ve included are correct at the time of writing (for Oxford University Statistics students) but you should always use the university’s guidelines as your primary resource.

The very beginning: the plan

Don’t spend too long on this! But you should have an idea of your planned chapter titles and an overall story for your book. Also useful is a timeline for when you will finish drafts of chapters by. Try to be realistic with this. If you decide to change your thesis title you should fill out an application for change of thesis title form (GSO.6). Make sure you look up any restrictions (word/page limits etc.) which may apply, and confirm your hand-in date.

Starting writing

It’s a good idea to decide what you will use to write your thesis. Most OPIG members use LaTeX. There are some great thesis templates out there but the one most people tend to use is one from Cambridge’s Engineering department. You can do a fair bit of customisation within that template… changing fonts, headers, titles and more, but it’s a great starting point.

When the finish line’s in sight: choosing examiners

A couple of months before you are planning to submit your thesis you should discuss with your supervisor(s) potential examiners. Your supervisor can informally check with them if they are happy to examine you and then you should fill out an appointment of examiners form (GSO.3). You can also change your thesis title on this form without filling in GSO.6.

Finishing writing

Your final document is likely to be over 100 pages with thousands of words (or potential typos as you might come to call them). A great LaTeX spell checker is aspell which should already be installed on your work machine. To spell check a .tex file (ignoring all TeX notation… apart from multiple citations I found!) using a British dictionary simply type:

aspell --lang=en_GB -t -c filename.tex

You’re absolutely guaranteed to still have typos floating around but it’s a decent start. You (and others if you can get them) should manually proof-read as well!

Final Formatting

Your thesis should be set out on numbered, portrait A4 pages. It should be double spaced and the inner (bound) margin should be 3-3.5cm. For more details on the formatting required check out the university’s regulations.

Printing and binding

When you’re happy with your proof-reading (you’re still almost guaranteed to have remaining typos) you’ll have to print and bind your finished book! To comply with university guidelines you will need to submit two copies, for each of your examiners, to the exam schools. You may also like to print a copy for yourself (you will need one to take with you into your viva). Before you start, if you are printing in colour at the department make sure you have enough printer credit by emailing IT (let them know the printer and your Bod card number and they will top you up if necessary).

If you are planning to print your copies double sided you may want to buy your own paper of higher quality than that provided by the department (at least 100gsm). As of October 2014, the Oxford Print Centre was selling the cheapest packs of 100gsm paper we could find but sold out close to deadline day! Also check out WHSmiths or Ryman’s.

You might want to do a test run of a few colour pages of your thesis before you send the whole file to be printed. Printing at 1200dpi (instead of the default 600dpi) can improve the appearance of your figures considerably. You may want to stay late at the office to print so you are not disturbed by other print jobs during office hours.

Your thesis should be securely bound in either hard or soft cover. Loose-leaf or spiral binding won’t be accepted. There are several binding facilities through Oxford but I used the Oxford Print Centre just down the road, which also guarantees a one hour service for soft binding even on submission days.

Submission

Submit your completed copies to the exam schools, noting their opening hours (08.30-17:00, Monday to Friday), take the traditional photo, and bask in your newly found FREEDOM (try to forget about the viva!).

Quick Standalone BLAST Setup for Ubuntu Linux

Some people run into trouble trying to setup a standalone version of BLAST using the NCBI instructions. Here a stremalined process will be presented, targeted at Ubuntu.

I assume that you are aware of the paradigms of blast, meaning that there are several executables for searching nucleic acids or proteins and there are different databases you can blast against. Sinon, you should read up on the available search tools  and databases before you attempt to install Blast. NB, throughout this document, I am using protein blast and protein input – changing to nucleotide sequences is trivial as you just change blastp to blastn and ‘prot’ to ‘nt’ in obvious places (and of course you use different queries and target databases).
Without further ado, Blast setup for UNIX.
There are two components for the installation:
  1. Executables (bastn, blastp etc.)
  2. Databases. (nr, nt etc.)
Both are described below with follow-up examples of usage.
Ad.1 The executables can be downloaded and compiled from here (download the source, run ./configure then make and finally make install in the directory of the untarred file). However a much easier way to do it under Ubuntu is:
sudo apt-get install ncbi-blast+
This automatically installs everything. In both cases to check if all went ok, type:
which blastp
If you get a directory such as /usr/local/bin than all went well and that’s where your executables are.
Ad.2 FIrst, you need to decide on where to store the databases. Do this by setting the environment variable:
export BLASTDB=/path/to/blastdbs/of/your/chosing
Now, we can either use one of the ncbi-curated databases or create our own. We will do both.
A) Downloading and using an ncbi-curated database.
The databases can be downloaded using the update_blastdb script. As an example I will download a non redundant protein database which is referred to as ‘nr’:
cd $BLASTDB
sudo update_blastdb --passive --timeout 300 --force --verbose nr
ls *.gz |xargs -n1 tar -xzvf
rm *.gz.*

The penultimate command extracts all the files you have downloaded and the last one removes the downloaded archives.

Now you should be able to use your new database by executing (where somesequence.fasta is your sample query):

blastp -db nr -query somesequence.fasta

Done.

B) Creating your own database.

Firstly, put a bunch of fasta protein sequences into a file called sample.fa

Next, execute the following

makeblastdb -in sample.fa -dbtype 'prot' -out NewDb
mv NewDB* $BLASTDB/

We have now created a blast protein database from your fasta file, called NewDB. The last line simply moves all the blast files to the database directory.

Now you should be able to use your new database by executing (where somesequence.fasta is your sample query):

blastp -db NewDb -query somesequence.fasta

Done.

Afterword

These instructions are the shortest way I could find to get a working stand-alone BLAST application. If you require more info, you can look here.