Category Archives: Code

Interesting Jupyter and IPython Notebooks

Here’s a treasure trove of interesting Jupyter and IPython notebooks, with lots of diverse examples relevant to OPIG, including an RDKit notebook, but also:

Entire books or other large collections of notebooks on a topic (covering Introductory Tutorials; Programming and Computer Science; Statistics, Machine Learning and Data Science; Mathematics, Physics, Chemistry, Biology; Linguistics and Text Mining; Signal Processing; Scientific computing and data analysis with the SciPy Stack; General topics in scientific computing; Machine Learning, Statistics and Probability; Physics, Chemistry and Biology; Data visualization and plotting; Mathematics; Signal, Sound and Image Processing; Natural Language Processing; Pandas for data analysis); General Python Programming; Notebooks in languages other than Python (Julia; Haskell; Ruby; Perl; F#; C#); Miscellaneous topics about doing various things with the Notebook itself; Reproducible academic publications; and lots more!  

 

Using RDKit to load ligand SDFs into Pandas DataFrames

If you have downloaded lots of ligand SDF files from the PDB, then a good way of viewing/comparing all their properties is to load them into a Pandas DataFrame.

RDKit has a very handy function just for this – it’s found in the PandasTools module.

I show an example below within a Jupyter notebook, in which I load in the SDF file, view the table of molecules and apply other RDKit functions to the molecules.

First import the PandasTools module:

from rdkit.Chem import PandasTools

Read in the SDF file:

SDFFile = "./Ligands_noHydrogens_noMissing_59_Instances.sdf"
# the smilesName argument adds a SMILES column, which is used further down
BRDLigs = PandasTools.LoadSDF(SDFFile, smilesName='SMILES')

You can see the whole table by calling the dataframe:

BRDLigs

The ligand properties in the SDF file are stored as columns. You can view what these properties are, and in my case I have loaded 59 ligands each having up to 26 properties:

BRDLigs.info()

It is also very easy to perform other RDKit functions on the dataframe. For instance, I noticed there is no heavy atom column, so I added my own called ‘NumHeavyAtoms’:

BRDLigs['NumHeavyAtoms']=BRDLigs.apply(lambda x: x['ROMol'].GetNumHeavyAtoms(), axis=1)

Here is the column added to the table, alongside columns containing the molecules’ SMILES and RDKit molecule:

BRDLigs[['NumHeavyAtoms','SMILES','ROMol']]
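
Other RDKit descriptors can be slotted into the table in exactly the same way. As a quick sketch (using the standard rdkit.Chem.Descriptors module; an illustration rather than part of the original workflow), a molecular weight column could be added like this:

from rdkit.Chem import Descriptors
# same apply pattern as for 'NumHeavyAtoms' above
BRDLigs['MolWt']=BRDLigs.apply(lambda x: Descriptors.MolWt(x['ROMol']), axis=1)
BRDLigs[['NumHeavyAtoms','MolWt','SMILES','ROMol']]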

R or Python for data vis?

Python users: ever wanted to learn R?
R users: ever wanted to learn Python?
Check out: http://mathesaurus.sourceforge.net/r-numpy.html

Both languages are incredibly powerful for doing large-scale data analyses. They both have amazing data visualisation platforms, allowing you to make custom graphs very easily (e.g. with your own set of fonts, color palette choices, etc.). Here is just a quick run-down of the good, the bad, and the ugly:

R

  • The good:
    • More established in statistical analyses; if you can’t find an R package for something, chances are it won’t be available in Python either.
    • Data frame parsing is fast and efficient, and incredibly easy to use (e.g. indexing specific rows, which is surprisingly hard in Pandas)
    • If GUIs are your thing, there are programs like Rstudio that mesh the console, plotting, and code.
  • The bad:
    • For loops are traditionally slow, meaning that you have to use lots of apply commands (e.g. tapply, sapply).
  • The ugly:
    • Help documentation can be challenging to read and follow, leading to (potentially) a steep learning curve.

Python

  • The good:
    • If you have existing code in Python (e.g. analysing protein sequences/structures), then you can plot straight away without having to save it as a separate CSV file for analysis, etc.
    • Lots of support for different packages such as NumPy, SciPy, Scikit Learn, etc., with good documentation and lots of help on forums (e.g. Stack Overflow)
    • It’s more useful for string manipulation (e.g. parsing out the ordering of IMGT numbering for antibodies, which goes from 111A->111B->112B->112A->112)
  • The bad:
    • Matplotlib, which is the go-to for data visualisation, has a pretty steep learning curve.
  • The ugly:
    • For statistical analyses, model building can have an unusual syntax. For example, building a linear model in R is incredibly easy (lm), whereas Python involves sklearn.linear_model.LinearRegression().fit (a minimal sketch of the Python version is shown just after this list). Otherwise you have to code up a lot of things yourself, which might not be practical.
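
For completeness, here is a minimal sketch of the scikit-learn route mentioned above, fitting a simple linear model to some made-up data (the data and variable names are purely illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression

# toy data: y = 2x + noise
x = np.random.normal(size = 100).reshape(-1, 1)   # sklearn expects a 2D array of features
y = 2 * x.ravel() + np.random.normal(scale = 0.5, size = 100)

model = LinearRegression().fit(x, y)
slope, intercept = model.coef_[0], model.intercept_  # compare with coef(lm(y ~ x)) in R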

For me, Python wins because I find it much easier to create an analysis pipeline where you can go from raw data (e.g. PDB structures) to analysing it (e.g. with BioPython) to plotting custom graphics. Another big selling point is that Python packages have great documentation. Of course, there are libraries to do the analyses in R, but I find the level of freedom is a bit more restricted, and with R's documentation you're often stuck interpreting what the package vignette is saying rather than doing actual coding.

As for plotting (because pretty graphs are where it’s at!), here’s a very simple implementation of plotting the densities of two normal distributions, along with their means and standard deviations.

import numpy as np
from matplotlib import rcParams

# plt.style.use('xkcd') # A cool feature of matplotlib is stylesheets, e.g. make your plots look XKCD-like

# change font to Arial
# you can change this to any TrueType font that you have in your machine
rcParams['font.family'] = 'sans-serif'
rcParams['font.sans-serif'] = ['Arial']

import matplotlib.pyplot as plt
# Generate two sets of numbers from a normal distribution
# one with mean = 4 sd = 0.5, another with mean (loc) = 1 and sd (scale) = 2
randomSet = np.random.normal(loc = 4, scale = 0.5, size = 1000)
anotherRandom = np.random.normal(loc = 1, scale = 2, size = 1000)

# Define a Figure and Axes object using plt.subplots
# Axes object is where we do the actual plotting (i.e. draw the histogram)
# Figure object is used to configure the actual figure (e.g. the dimensions of the figure)
fig, ax = plt.subplots()

# Plot a histogram with custom-defined bins, with a blue colour, transparency of 0.4
# Plot the density rather than the raw count using density = True
# (older matplotlib versions called this argument normed)
ax.hist(randomSet, bins = np.arange(-3, 6, 0.5), color = '#134a8e', alpha = 0.4, density = True)
ax.hist(anotherRandom, bins = np.arange(-3, 6, 0.5), color = '#e8291c', alpha = 0.4, density = True)

# Plot solid lines for the means
plt.axvline(np.mean(randomSet), color = 'blue')
plt.axvline(np.mean(anotherRandom), color = 'red')

# Plot dotted lines for the std devs
plt.axvline(np.mean(randomSet) - np.std(randomSet), linestyle = '--', color = 'blue')
plt.axvline(np.mean(randomSet) + np.std(randomSet), linestyle = '--', color = 'blue')

plt.axvline(np.mean(anotherRandom) - np.std(anotherRandom), linestyle = '--', color = 'red')
plt.axvline(np.mean(anotherRandom) + np.std(anotherRandom), linestyle = '--', color = 'red')

# Set the title, x- and y-axis labels
plt.title('A fancy plot')
ax.set_xlabel("Value of $x$") 
ax.set_ylabel("Density")

# Set the Figure's size as a 5in x 5in figure
fig.set_size_inches((5,5))

Figure made by matplotlib using the code above.

randomSet = rnorm(mean = 4, sd = 0.5, n = 1000)
anotherRandom = rnorm(mean = 1, sd = 2, n = 1000)

# Let's define a range to plot the histogram for binning;
limits = range(randomSet, anotherRandom)
lbound = limits[1] - (diff(limits) * 0.1)
ubound = limits[2] + (diff(limits) * 0.1)
# use freq = F to plot density
# in breaks, we define the bins of the histogram by providing a vector of values using seq
# xlab, ylab define axis labels; main sets the title
# rgb defines the colour in RGB values from 0-1, with the fourth digit setting transparency
# e.g. rgb(0,1,0,1) is R = 0, G = 1, B = 0, with an alpha of 1 (i.e. not transparent)
hist(randomSet, freq = F, breaks = seq(lbound, ubound, 0.5), col = rgb(0,0,1,0.4), xlab = 'Value of x', ylab = 'Density', main = 'A fancy plot')
# Use add = T to keep both histograms in one graph
# other parameters, such as breaks, etc., can be introduced here
hist(anotherRandom, freq = F, breaks = seq(lbound, ubound, 0.5), col = rgb(1,0,0,0.4), add = T)

# Plot vertical lines with v =
# lty = 2 generates a dashed line
abline(v = c(mean(randomSet), mean(anotherRandom)), col = c('blue', 'red'))

abline(v = c(mean(randomSet)-sd(randomSet), mean(randomSet)+sd(randomSet)), col = 'blue', lty = 2)
abline(v = c(mean(anotherRandom)-sd(anotherRandom), mean(anotherRandom)+sd(anotherRandom)), col = 'red', lty = 2)

Similar figure made using R code from above.

*Special thanks go out to Ali and Lyuba for helpful fixes to make the R code more efficient!

How to Calculate PLIFs Using RDKit and PLIP

Protein-Ligand interaction fingerprints (PLIFs) are becoming more widely used to compare small molecules in the context of a protein target. A fingerprint is a bit vector that is used to represent a small molecule. Fingerprints of molecules can then be compared to determine the similarity between two molecules. Rather than using the features of the ligand to build the fingerprint, a PLIF is based on the interactions between the protein and the small molecule. The conventional method of building a PLIF is that each bit of the bit vector represents a residue in the binding pocket of the protein. The bit is set to 1 if the molecule forms an interaction with the residue, whereas it is set to 0 if it does not.

Constructing a PLIF therefore consists of two parts:

  1. Calculating the interactions formed between the small molecule and the target
  2. Collating this information into a bit vector.

Step 1 can be achieved by using the Protein-Ligand Interaction Profiler (PLIP). PLIP is an easy-to-use tool that, given a PDB file, will calculate the interactions between the ligand and protein. This can be done using the online web tool or, alternatively, using the command-line tool. Six different interaction types are calculated: hydrophobic, hydrogen bonds, water-mediated hydrogen bonds, salt bridges, pi-pi and pi-cation. The command-line version outputs an XML report file containing all the information required to construct a PLIF.

Step 2 involves manipulating the output of the report file into a bit vector. RDKit is an amazingly useful cheminformatics toolkit with great documentation. Reading the PLIF into an RDKit bit vector allows it to be manipulated as an RDKit fingerprint. The fingerprints can then be compared very easily using RDKit functionality, for example using Tanimoto similarity.

EXAMPLE:

Let’s take three PDB files as an example. Fragment screening data from the SGC is a great source of data for this analysis, as it contains lots of PDB structures of small hits bound to the same target. The data can be found here. For this example I will use three protein-ligand complexes from the BRD1 dataset: BRD1A-m004.pdb, BRD1A-m006.pdb and BRD1A-m009.pdb.


1. PLIP: First we need to run PLIP to generate a report file for each protein-ligand complex. This is done using:


 

plipcmd -f BRD1A-m004.pdb -o m004 -x

plipcmd -f BRD1A-m006.pdb -o m006 -x

plipcmd -f BRD1A-m009.pdb -o m009 -x

 


A report file (‘report.xml’) is created for each PDB file within the directories m004, m006 and m009.

2. Get Interactions: Using a Python script, the results of each report file can be collated with the function “generate_plif_lists” (shown below). The function takes in the report file name and the residues already found to be in the binding site (residue_list). “residue_list” must be updated for each molecule to be compared, as the residues used to define the binding site can vary between report files. The function then returns the updated “residue_list”, as well as a list of residues found to interact with the ligand: “plif_list_all”.

 


import xml.etree.ElementTree as ET

################################################################################
def generate_plif_lists(report_file, residue_list, lig_ident):
    #uses report.xml from PLIP to return list of interacting residues and update list of residues in binding site
    plif_list_all = []
    tree = ET.parse(report_file)
    root = tree.getroot()
    #list of residue keys that form an interaction
    for binding_site in root.findall('bindingsite'):
        nest = binding_site.find('identifiers')
        lig_code = nest.find('hetid')
        if str(lig_code.text) == str(lig_ident):
            #get the plifs stuff here
            nest_residue = binding_site.find('bs_residues')
            residue_list_tree = nest_residue.findall('bs_residue')
            for residue in residue_list_tree:
                res_id = residue.text
                dict_res_temp = residue.attrib
                if res_id not in residue_list:
                    residue_list.append(res_id)
                if dict_res_temp['contact'] == 'True':
                    if res_id not in plif_list_all:
                        plif_list_all.append(res_id)
    return plif_list_all, residue_list
###############################################################################

residue_list = []  #start with an empty binding-site list; it is updated by each call below
plif_list_m006, residue_list = generate_plif_lists('m006/report.xml', residue_list, 'LIG')
plif_list_m009, residue_list = generate_plif_lists('m009/report.xml', residue_list, 'LIG')
plif_list_m004, residue_list = generate_plif_lists('m004/report.xml', residue_list, 'LIG')


3. Read Into RDKit: Now that we have the list of binding-site residues and know which residues interact with the ligand, a PLIF can be generated. This is done using the function shown below (“generate_rdkit_plif”):


from rdkit import Chem, DataStructs
from rdkit.DataStructs import cDataStructs

################################################################################
def generate_rdkit_plif(residue_list, plif_list_all):
    #generates RDKit plif given list of residues in binding site and list of interacting residues
    plif_rdkit = DataStructs.ExplicitBitVect(len(residue_list), False)
    for index, res in enumerate(residue_list):
        if res in plif_list_all:
            print 'here'
            plif_rdkit.SetBit(index)
        else:
            continue
    return plif_rdkit
#########################################################################

plif_m006 = generate_rdkit_plif(residue_list, plif_list_m006)
plif_m009 = generate_rdkit_plif(residue_list, plif_list_m009)
plif_m004 = generate_rdkit_plif(residue_list, plif_list_m004)


4. Play! These PLIFs can now be compared using RDKit functionality. For example, the Tanimoto similarity between the ligands can be computed:


def similarity_plifs(plif_1, plif_2):
    sim = DataStructs.TanimotoSimilarity(plif_1, plif_2)
    print sim
    return sim
###################################################################

print similarity_plifs(plif_m006, plif_m009)
print similarity_plifs(plif_m006, plif_m004)
print similarity_plifs(plif_m009, plif_m004)


The output is: 0.2, 0.5, 0.0.

All files used to generate the PLIFs can be found here. Happy PLIF-making!

Viewing 3D molecules interactively in Jupyter iPython notebooks

Greg Landrum, curator of the invaluable open-source cheminformatics API, RDKit, recently blogged about viewing molecules in a 3D window within a Jupyter-hosted IPython notebook (as long as your browser supports WebGL, that is).

The trick is to use py3Dmol. It’s easy to install:

pip install py3Dmol

This is built on 3Dmol.js, an object-oriented, WebGL-based JavaScript library for online molecular visualization (Rego & Koes, 2015); here's a nice summary of the capabilities of 3Dmol.js. Its features include:

  • support for pdb, sdf, mol2, xyz, and cube formats
  • parallelized molecular surface computation
  • sphere, stick, line, cross, cartoon, and surface styles
  • atom property based selection and styling
  • labels
  • clickable interactivity with molecular data
  • geometric shapes including spheres and arrows

I tried a simple example and it worked beautifully:

import py3Dmol
view = py3Dmol.view(query='pdb:1hvr')
view.setStyle({'cartoon':{'color':'spectrum'}})
view

py3Dmol rendering PDB entry 1HVR interactively inside the Jupyter notebook.

The 3Dmol.js website summarizes how to view molecules, along with how to choose representations, how to embed it, and even how to develop with it.
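
py3Dmol isn't limited to structures fetched by PDB code. As a rough sketch (the file name is purely a placeholder), a local SDF file can be loaded with addModel and drawn as sticks:

import py3Dmol
view = py3Dmol.view()
ligand_block = open('ligand.sdf').read()  # placeholder file name
view.addModel(ligand_block, 'sdf')        # load the SDF text into the viewer
view.setStyle({'stick': {}})              # stick representation
view.zoomTo()
view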

References

Nicholas Rego & David Koes (2015). “3Dmol.js: molecular visualization with WebGL”.
Bioinformatics, 31 (8): 1322-1324. doi:10.1093/bioinformatics/btu829

Plotting and storing a 3D network in R

A simple toy example of a three layered network:

Note 1: In order to view the 3D plots, Mac users will need XQuartz installed (https://www.xquartz.org/).

 
require(igraph)
require(rgl)
#Another package that might be needed is "rglwidget". The function writeWebGL will show an error stating if rglwidget is required.
######################################//// 
######The basics######################////
######################################////
#1) Create a "food" network (three layers) 
set.seed(432)
g1<-watts.strogatz.game(dim = 1,size = 5,nei = 2,p = .5,loops = FALSE,multiple = FALSE)
g2<-watts.strogatz.game(dim = 1,size = 10,nei = 2,p = .2,loops = FALSE,multiple = FALSE)
g3<-watts.strogatz.game(dim = 1,size = 30,nei = 1,p = .5,loops = FALSE,multiple = FALSE)
g123=g1+g2+g3 


#Create more edges btw layers 
g123=rewire(g123,each_edge(prob=.4,loops = FALSE,multiple = FALSE)) 
ne=15;g123=add_edges(g123,edges = cbind(sample(1:vcount(g1),size = ne,replace = TRUE), sample((vcount(g1)+1):vcount(g123),size = ne,replace = TRUE)))#top layer (assign the result, otherwise the new edges are discarded)
ne=30;g123=add_edges(g123,edges = cbind(sample((vcount(g1)+1):(vcount(g1)+vcount(g2)),size = ne,replace = TRUE), sample((vcount(g1)+vcount(g2)+1):vcount(g123),size = ne,replace = TRUE)))#second layer

#A quick plot of the graph
plot(g123,vertex.size=1,vertex.label.cex=0.02)


#Create 3d coordinates of the network layout
circpos=function(n,r=1){#Coordinates on a circle
 rad=seq(0,2*pi,length.out=n+1)[-1];x=cos(rad)*r;y=sin(rad)*r
 return(cbind(x,y))
}
#
lay=rbind(cbind(circpos(vcount(g1),r=1), runif(n = vcount(g1),-1,1)),
 cbind(circpos(vcount(g2),r=2), runif(n = vcount(g2),6,7)),
 cbind(circpos(vcount(g3),r=4), runif(n = vcount(g3),13,17))
)



#2d plot using the previous layout
plot(g123,vertex.size=5,vertex.label=NA,layout=lay[,c(1,3)])
plot(g123,vertex.size=1,vertex.label=NA,layout=lay[,c(1,2)])


 
#3D graph plot
#Add some colour to nodes and edges
nodecols=c(rep("red",vcount(g1)),
 rep("blue",vcount(g2)),
 rep("yellow",vcount(g3)))

edgecols=function(elist,cols,grouplist){
 whatcol=rep(length(cols)+1,nrow(elist))
 finalcol=whatcol
 for(i in 1:nrow(elist)){
 for(k in length(cols):1){ 
 if( k * (length( intersect(elist[i,], grouplist[[k]]) ) > 0)){
 whatcol[i]=min(whatcol[i], k )
 }
 }
 finalcol[i]=cols[whatcol[i]]
 }
 return(finalcol)
}

#Open 3d viewer
rgl.open()
rglplot(g123, layout=lay,vertex.size=5,vertex.label=NA,vertex.color=nodecols,
 edge.color=edgecols(elist=get.edgelist(g123,names = FALSE),cols=c("orange","green","pink"),grouplist=list(1:vcount(g1), (vcount(g1)+1):(vcount(g1)+vcount(g2)), (vcount(g1)+vcount(g2)+1):vcount(g123)) )
)


###Storing the plot in an html file###

dirfolder="..." #your dir
#rgl.open()#instead of rgl.open use open3d, in order to save the plot. 
open3d()
rglplot(g123, layout=lay,vertex.size=5,vertex.label=NA,vertex.color=nodecols,
 edge.color=edgecols(elist=get.edgelist(g123,names = FALSE),cols=c("orange","green","pink"),grouplist=list(1:vcount(g1), (vcount(g1)+1):(vcount(g1)+vcount(g2)), (vcount(g1)+vcount(g2)+1):vcount(g123)) )
)
#Fix the view
rgl.viewpoint(theta=90, phi=0)

#Save a static 2d image:
rgl.snapshot(paste(dirfolder,"a_png_pic_0.png",sep=""), fmt="png", top=TRUE)

#Save the plot in a .html file:
rglfolder=writeWebGL(dir = paste(dirfolder,"first_net3d",sep=""), width=700)

#The previous function should create a file called index.html inside the folder "first_net3d". By opening this file in a browser (with JavaScript enabled) the 3d plot will be displayed again.
#Also the following command will open the plot in the browser:
browseURL(rglfolder)

Note 2: In order to view the .html file, JavaScript should be enabled in the browser. (Here is an example of how to do this for Safari.)

Although not covered in the previous script, further options are available, such as edge/vertex size and the ability to control each of the nodes and edges in the graph independently. Here is an example that makes more use of these options:


3d network representing a T cell receptor. Edges are coloured according to a relevant path found between the bottom green node and the upper red node cluster.


T cell receptor (in blue), binding to a peptide (in red).

Physical-chemical property predictors as command line tools

The Instant JChem Suite, from ChemAxon, is a fantastic set of software designed for Chemists. It allows easy and simple database management to store both chemical and non-chemical data. It also contains a plethora of physical-chemical prediction and visualisation tools that can be utilised by chemists and computational based scientists alike.

I personally believe that the hidden gem within the suite is the availability of these predictive tools on the command line, found within the ChemAxon Calculator (cxcalc). In addition, calculator plug-ins have also been developed by external developers. This allows you to incorporate the powerful predictive tools of ChemAxon into your larger workflows with a little scripting.

For example, it can be used to predict the dominant protonation state of a ligand before use in MD or docking studies, with the majormicrospecies tool. You can input and output all major file types including SDF, PDB and MOL2 using commands such as:

cxcalc [Input_File].sdf -o [Output_File].sdf majormicrospecies -H [pH] -f [Output_File_Type]:H

You can easily find the calculator plug-ins available and how to construct input commands using the cxcalc -h command, or the available online information. I would thoroughly recommend looking at the tools available and how you could incorporate them into your workflows.
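
To illustrate the "little scripting" point, here is a rough Python sketch (file names and pH value are placeholders) that wraps the majormicrospecies command shown above using subprocess:

import subprocess

def major_microspecies(in_sdf, out_sdf, ph=7.4):
    # wraps the cxcalc command shown above; assumes cxcalc is on the PATH
    cmd = ['cxcalc', in_sdf, '-o', out_sdf,
           'majormicrospecies', '-H', str(ph), '-f', 'sdf:H']
    subprocess.check_call(cmd)

major_microspecies('ligand.sdf', 'ligand_pH7.4.sdf')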

A beginner’s guide to Rosetta

Rosetta is a big software suite, and I mean really big. It includes applications for protein structure prediction, refinement, docking, and design, and specific adaptations of these applications (and others) to a particular case, for example protein-protein docking of membrane proteins to form membrane protein complexes. Some applications are available in one of the hassle-free servers online (e.g. ROSIE, Robetta, rosetta.design), which might work well if you’ve got just a few tests you would like to try using standard parameters and protocols. However, it’s likely that you will want to download and install a version if you’re interested in carrying out a large amount of modelling, or using an unusual combination of steps or scoring functions. This is not a trivial task, as the source code is a 2.5 GB download, and then your machine will be busy compiling for some time (around 5 hours on two cores on my old laptop). Alternatively, if the protocols and objects you’re interested in are part of PyRosetta, this is available in a pre-compiled package for most common operating systems and is less than 1 GB.

This brings me to the different ways to use Rosetta. Most applications come as an executable which you can find in Rosetta/main/source/bin/ after completing the build. There is documentation available on how to use most of these, and on the different flags which can be used to input PDB structures and parameters. Some applications can be run using RosettaScripts, which uses an xml file to define the protocol, including scoring functions, movers and other options. In this case, Rosetta/main/source/bin/rosetta_scripts.* is run, which will read the xml and execute the required protocol.


An example RosettaScript, used for the MPrelax protocol

PyRosetta is even more flexible, and relatively easy to use for anyone accustomed to programming in Python. There are Python bindings for the fast C++ objects and movers, so the increased usability generally comes without a great cost in speed. One of the really handy things about PyRosetta is the link to PyMOL, which can be used to watch your protein move while a simulation is running. Just add the following to your .pymolrc file in your home directory to set up the link every time you open PyMOL:

run /PATH/TO/PYROSETTA/PyMOLPyRosettaServer.py
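
As a rough sketch of what this looks like from the Python side (exact module paths and names differ between PyRosetta releases, so treat this as illustrative only), a pose can be sent to the linked PyMOL session like so:

from pyrosetta import init, pose_from_pdb
from pyrosetta.rosetta.protocols.moves import PyMOLMover

init()                                    # start Rosetta
pose = pose_from_pdb("my_structure.pdb")  # placeholder PDB file
pmm = PyMOLMover()                        # talks to the PyMOLPyRosettaServer session
pmm.apply(pose)                           # the current pose appears in the open PyMOL window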

When it comes to finding your way around the Rosetta package, there are a few things it is very useful to know to start with. The demos directory contains plenty of useful example scripts and instructions for running your first jobs. In demos/tutorials you will find introductions to the main concepts. The demos/protocol_capture subdirectory is particularly helpful, as most papers which report a new Rosetta protocol will deposit here the scripts required to reproduce their results. These may not currently be the best methods to approach a problem, but if you have found a research article describing some results which would be useful to get for your system, they are a good starting point to learn how to make an application work. Then the world is your oyster as you explore the many possible options and inputs to change and add!

Processing large files using python: part deux

Last week I wrote a post on some of the methods I use in Python to efficiently process very large datasets. You can find that here. Roughly, it details how one can break a large file into chunks which can then be passed on to multiple cores to do the number crunching. Below I expand upon this, first creating a parent class which turns a given (large) file into chunks. I construct it in a manner in which child classes can easily be created and tailored for specific file types, and give some examples. Finally, I give some wrapper functions for use in conjunction with any of the chunkers so that the chunks can be processed using multiple cores.

First, as an aside, I was asked after the previous post at what scale these methods should be considered. A rough answer would be when the size of the data becomes comparable to the available RAM. A better answer would be when the overhead of reading each individual line (or entry) is more than the operation performed on that entry. Here is an example of this case, though it isn’t really that fair a comparison:


>>> import timeit,os.path
>>> os.path.getsize("Saccharomyces_cerevisiae.R64-1-1.cds.all.fa")
10095955
>>> timeit.timeit('f = open("Saccharomyces_cerevisiae.R64-1-1.cds.all.fa");sum([l.count(">") for l in f])',number=10)
0.8403599262237549
>>> timeit.timeit('f = open("Saccharomyces_cerevisiae.R64-1-1.cds.all.fa");sum([c.count(">") for c in iter(lambda: f.read(1024*1024),"")])',number=10)
0.15671014785766602

For a small 10MB FASTA file, we count the number of sequences present in a fifth of the time using chunks. I should be honest though, and state that the speedup is mostly due to not having to identify newline characters in the chunking method; but nevertheless, it shows the power one can have using chunks. For a 14GB FASTA file, the times for the chunking (1MB chunks) and non-chunking methods are 55s and 130s respectively.

Getting back on track, let’s turn the chunking method into a parent class which we can build on:


import os.path

class Chunker(object):

    #Iterator that yields start and end locations of a file chunk of default size 1MB.
    @classmethod
    def chunkify(cls,fname,size=1024*1024):
        fileEnd = os.path.getsize(fname)
        with open(fname,'r') as f:
            chunkEnd = f.tell()
            while True:
                chunkStart = chunkEnd
                f.seek(size,1)
                cls._EOC(f)
                chunkEnd = f.tell()
                yield chunkStart, chunkEnd - chunkStart
                if chunkEnd >= fileEnd:
                    break

    #Move file pointer to end of chunk
    @staticmethod
    def _EOC(f):
        f.readline()

    #read chunk
    @staticmethod
    def read(fname,chunk):
        with open(fname,'r') as f:
            f.seek(chunk[0])
            return f.read(chunk[1])

    #iterator that splits a chunk into units (here, single lines)
    @staticmethod
    def parse(chunk):
        for line in chunk.splitlines():
            yield line

In the above, we create the class Chunker which has the class method chunkify as well as the static methods _EOC, read, and parse. The method chunkify does the actual chunking of a given file, returning an iterator that yields tuples containing a chunk’s start and size. It’s a class method so that it can make use of the _EOC (end of chunk) static method to move the pointer to a suitable location to split the chunks. For the simplest case, this is just the end/start of a line. The read and parse methods read a given chunk from a file and split it into units (single lines in the simplest case) respectively. We make the non-chunkify methods static so that they can be called without the overhead of creating an instance of the class.

Let’s now create some children of this class for specific types of files. First, one of the most well-known file types in bioinformatics, FASTA. Below is a segment of a FASTA file. Each entry has a header line, which begins with a ‘>’, followed by a unique identifier for the sequence. After the header line, one or more lines follow giving the sequence. Sequences may be either protein or nucleic acid sequences, and they may contain gaps and/or alignment characters.


>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

And here is the file type specific chunker:


from Bio import SeqIO
from cStringIO import StringIO

class Chunker_FASTA(Chunker):

    @staticmethod
    def _EOC(f):
        l = f.readline() #incomplete line
        p = f.tell()
        l = f.readline()
        while l and l[0] != '>': #find the start of sequence
            p = f.tell()
            l = f.readline()
        f.seek(p) #revert one line

    @staticmethod
    def parse(chunk):
        fh = StringIO(chunk) #StringIO was imported from cStringIO above
        for record in SeqIO.parse(fh,"fasta"):
            yield record
        fh.close()

We update the _EOC method to find when one entry finishes and the next begins by locating “>”, following which we rewind the file handle pointer to the start of that line. We also update the parse method to use the FASTA parser from BioPython, yielding a SeqRecord object for each entry in the chunk.

For a second, slightly harder example, here is one designed to work with output produced by bowtie, an aligner of short reads from NGS data. The format consists of tab-separated columns, with the ID of each read located in the first column. Note that a single read can align to multiple locations (up to 8 by default!), hence the same ID appears on multiple lines. A small example section of the output is given below.


SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN25-2	2502 GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN37-2	4460	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN25-1	2502	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN37-1	4460	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN37-2	4460	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN25-1	2502	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN37-1	4460	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN25-2	2502	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.8 HWI-EAS355_3_Nick_1_1_187_1549 length=36	+	RDN25-2	2502	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.8 HWI-EAS355_3_Nick_1_1_187_1549 length=36	+	RDN37-2	4460	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3

with the corresponding chunker given by:


class Chunker_BWT(Chunker):

    @staticmethod
    def _EOC(f):
        l = f.readline() #incomplete line
        p = f.tell()
        l = f.readline()
        if not l: return #EOF
        readID = l.split()[0]
        while l and (l.split()[0] == readID): #keep reading lines while the read IDs match
            p = f.tell()
            l = f.readline()
        f.seek(p) #revert to the start of the first line with a new read ID

    @staticmethod
    def parse(chunk):
        lines = chunk.splitlines()
        N = len(lines)
        i = 0 
        while i < N:
            readID = lines[i].split('\t')[0]
            j = i
            while lines[j].split('\t')[0] == readID:
                j += 1
                if j == N:
                    break
            yield lines[i:j]
            i = j

This time, the end of the chunk is located by reading lines until the read ID changes, whereupon we rewind to the start of that line. For parsing, we yield all the different locations a given read aligns to as a single entry.

Hopefully these examples show you how the parent class can be expanded upon easily for most file types. Let’s now combine these various chunkers with the code from the previous post to show how we can enable multicore parallel processing of the chunks they yield. The code below contains a few generalised wrapper functions which work in tandem with any of the above chunkers to allow most tasks to be parallelised.


import multiprocessing as mp, sys

def _printMP(text):
    print text
    sys.stdout.flush()

def _workerMP(chunk,inFile,jobID,worker,kwargs):
    _printMP("Processing chunk: "+str(jobID))
    output = worker(chunk,inFile,**kwargs)
    _printMP("Finished chunk: "+str(jobID))
    return output	

def main(inFile,worker,chunker=Chunker,cores=1,kwargs={}):
    pool = mp.Pool(cores)

    jobs = []
    for jobID,chunk in enumerate(chunker.chunkify(inFile)):
        job = pool.apply_async(_workerMP,(chunk,inFile,jobID,worker,kwargs))
        jobs.append(job)

    output = []
    for job in jobs:
        output.append( job.get() )

    pool.close()
    
    return output

The main function should be recognisable as the code from the previous post. It generates the pool of workers, ideally one for each core, before using the given chunker to split the corresponding file into a series of chunks for processing. Unlike before, we collect the output given by each job and return it after processing is finished. The main function acts as a wrapper, allowing us to specify different processing functions and different chunkers, as given by the variables worker and chunker respectively. We have wrapped the processing function call within the function _workerMP, which prints to the terminal as tasks are completed. It uses the function _printMP to do this, as you need to flush the terminal after a print statement when using multicore processing; otherwise nothing appears until all tasks are completed.

Let’s finish by showing an example of how we would use the above to do the same task as we did at the start of this post, counting the sequences within a fasta file, using the base chunker:


def seq_counter(chunk,inFile):
    data = Chunker.read(inFile,chunk)
    return data.count('>')

and using the FASTA chunker:


def seq_counter_2(chunk,inFile):
    data = list(Chunker_FASTA.parse(Chunker_FASTA.read(inFile,chunk)))
    return len(data)

And the time they take to count the sequences within the 14GB file from before:


>>> os.path.getsize("SRR951828.fa")
14944287128
>>> x=time.time();f = open("SRR951828.fa");sum([l.count(">") for l in f]);time.time()-x
136829250
190.05533599853516
>>> x=time.time();f = open("SRR951828.fa");sum([c.count(">") for c in iter(lambda: f.read(1024*1024),"")]);time.time()-x
136829250
26.343637943267822
>>> x=time.time();sum(main("SRR951828.fa",seq_counter,cores=8));time.time()-x
136829250
4.36846399307251
>>> x=time.time();main("SRR951828.fa",seq_counter_2,Chunker_FASTA,8);time.time()-x
136829250
398.94060492515564

Let’s ignore that last one, as the slowdown is due to turning the entries into BioPython SeqRecords. The prior one, which combines chunking and multicore processing, gives roughly a factor of 50 speed-up. I’m sure this could be further reduced using more cores and/or optimising the chunk size; however, this difference alone can change something from being computationally implausible to plausible. Not bad for only a few lines of code.

Anyway, as before, I hope that some of the above was either new or even perhaps helpful to you. If you know of a better way to do things (in Python), then I’d be very interested to hear about it. If I feel like it, I may follow this up with a post about how to integrate a queue into the above which outputs the result of each job as it is produced. In the above, we currently collate all the results in memory, which has the potential to cause a memory overflow depending on what is being returned.

Processing large files using python

In the last year or so, and with my increased focus on ribo-seq data, I have come to fully appreciate what the term big data means. The ribo-seq studies in their raw forms can easily reach into hundreds of GBs, which means that processing them in both a timely and efficient manner requires some thought. In this blog post, and hopefully those following, I want to detail some of the methods I have come up with (read: pieced together from multiple Stack Exchange posts) that help me take on data of this magnitude. Specifically I will be detailing methods for Python and R, though some of the methods are transferable to other languages.

My first big data tip for python is learning how to break your files into smaller units (or chunks) in a manner that you can make use of multiple processors. Let’s start with the simplest way to read a file in python.


with open("input.txt") as f:
    data = f.readlines()
    for line in data:
        process(line)

The mistake made above, with regard to big data, is that it reads all the data into RAM before attempting to process it line by line. This is likely the simplest way to cause the memory to overflow and an error to be raised. Let’s fix this by reading the data in line by line, so that only a single line is stored in the RAM at any given time.


with open("input.txt") as f:
    for line in f:
        process(line)

This is a big improvement, namely that it doesn’t crash when fed a big file (and it’s also shorter!). Next we should attempt to speed this up a bit by making use of all these otherwise idle cores.


import multiprocessing as mp

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
with open("input.txt") as f:
    for line in f:
        jobs.append( pool.apply_async(process,(line,)) )

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()

Provided the order in which you process the lines doesn’t matter, the above generates a set (pool) of workers, ideally one for each core, before creating a bunch of tasks (jobs), one for each line, for the workers to do. I tend to use the Pool object provided by the multiprocessing module due to ease of use; however, you can spawn and control individual workers using mp.Process if you want finer control. For mere number crunching, the Pool object is very good.

While the above is now making use of all those cores, it sadly runs into memory problems once again. We specifically use the apply_async function so that the pool isn’t blocked while each line is processed. However, in doing so, all the data is read into memory once again; this time stored as individual lines associated with each job, waiting in line to be processed. As such, the memory will again overflow. Ideally the method would only read the line into memory when it is its turn to be processed.


import multiprocessing as mp

def process_wrapper(lineID):
    with open("input.txt") as f:
        for i,line in enumerate(f):
            if i != lineID:
                continue
            else:
                process(line)
                break

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
with open("input.txt") as f:
    for ID,line in enumerate(f):
        jobs.append( pool.apply_async(process_wrapper,(ID,)) )

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()

Above we’ve now changed the function fed to the pool of workers to include opening the file, locating the specified line, reading it into memory, and then processing it. The only input now stored for each job spawned is the line number, thereby preventing the memory overflow. Sadly, the overhead involved in having to locate the line by reading iteratively through the file for each job is untenable, getting progressively more time consuming as you get further into the file. To avoid this we can use the seek function of file objects, which skips you to a particular location within a file. Combining this with the tell function, which returns the current location within a file, gives:


import multiprocessing as mp

def process_wrapper(lineByte):
    with open("input.txt") as f:
        f.seek(lineByte)
        line = f.readline()
        process(line)

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
with open("input.txt") as f:
    nextLineByte = f.tell()
    for line in f:
        jobs.append( pool.apply_async(process_wrapper,(nextLineByte,)) )
        nextLineByte = f.tell()

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()

Using seek we can move directly to the correct part of the file, whereupon we read a line into memory and process it. We have to be careful to correctly handle the first and last lines, but otherwise this does exactly what we set out to do, namely using all the cores to process a given file while not overflowing the memory.

I’ll finish this post with a slight upgrade to the above, as there is a reasonable amount of overhead associated with opening and closing the file for each individual line. If we process multiple lines of the file at a time as a chunk, we can reduce these operations. The biggest technicality when doing this is noting that when you jump to a location in a file, you are likely not located at the start of a line. For a simple file, as in this example, this just means you need to call readline, which reads to the next newline character. More complex file types likely require additional code to locate a suitable location to start/end a chunk.


import multiprocessing as mp,os

def process_wrapper(chunkStart, chunkSize):
    with open("input.txt") as f:
        f.seek(chunkStart)
        lines = f.read(chunkSize).splitlines()
        for line in lines:
            process(line)

def chunkify(fname,size=1024*1024):
    fileEnd = os.path.getsize(fname)
    with open(fname,'r') as f:
        chunkEnd = f.tell()
        while True:
            chunkStart = chunkEnd
            f.seek(size,1)      #jump ahead by the chunk size
            f.readline()        #read to the end of the line so chunks end on line boundaries
            chunkEnd = f.tell()
            yield chunkStart, chunkEnd - chunkStart
            if chunkEnd >= fileEnd:
                break

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
for chunkStart,chunkSize in chunkify("input.txt"):
    jobs.append( pool.apply_async(process_wrapper,(chunkStart,chunkSize)) )

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()

Anyway, I hope that some of the above was either new or even perhaps helpful to you. If you know of a better way to do things (in Python), then I’d be very interested to hear about it. In another post coming in the near future, I will expand on this code, turning it into a parent class from which to create multiple children to use with various file types.