Category Archives: How To

How to do things. doh.

A very basic introduction to Random Forests using R

Random Forests is a powerful tool used extensively across a multitude of fields. As a matter of fact, it is hard to come across a data scientist who has never had to resort to this technique at some point. Motivated by the fact that I have been using Random Forests quite a lot recently, I decided to give a quick intro to Random Forests using R.

So what are Random Forests?  Well, I am probably not the most suited person to answer this question (a Google search will reveal much more interesting answers), still I shall give it a go. Random Forests is a learning method for classification (and other applications; see below). It is based on generating a large number of decision trees, each constructed using a different subset of your training set. These subsets are usually selected by sampling at random and with replacement from the original data set. The decision trees are then used to identify a classification consensus by selecting the most common output (the mode). While random forests can be used for other applications (e.g. regression), for the sake of keeping this post short, I shall focus solely on classification.

Why R? Well, the quick and easy answer to this is that I do all my plotting in R (mostly because I think ggplot2 looks very pretty). I decided to explore Random Forests in R and to assess their advantages and shortcomings. I am planning to compare Random Forests in R against the Python implementation in scikit-learn. Do expect a post about this in the near future!

The data: to keep things simple, I decided to use Edgar Anderson’s Iris data set. You can have a look at it by inspecting the contents of iris in R. This data set contains observations of four features (sepal length and width, and petal length and width – all in cm) for 150 flowers, split equally among three different iris species. This data set is fairly canonical in classification and data analysis. Let us take a look at it, shall we:
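
A quick pairs plot, coloured by species, gives a good overview (a minimal sketch using base R; ggplot2 would work just as well):

# A quick pairs plot of the four features, coloured by species
# (a simple stand-in for a fancier ggplot2 figure)
pairs(iris[, 1:4], col = iris$Species, pch = 19)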

As you can observe, there seems to be some separation between our three species of irises with respect to the different features [note: this set is not very representative of a real-world data set and results should be taken with a grain of salt].

Training and Validation sets: great care needs to be taken to ensure clear separation between training and validation sets. I tend to save the cases for which I am actually interested in performing predictions as a second validation set (Validation 2). Then I split the remaining data evenly into Training and Validation 1.

Let us split our data set then, shall we?

# Set random seed to make results reproducible:
set.seed(17)
# Calculate the size of each of the data sets:
data_set_size <- floor(nrow(iris)/2)
# Generate a random sample of "data_set_size" indexes
indexes <- sample(1:nrow(iris), size = data_set_size)

# Assign the data to the correct sets
training <- iris[indexes,]
validation1 <- iris[-indexes,]

Before we can move on, here are some things to consider:

1- The size of your data set usually imposes a hard limit on how many features you can consider. This occurs due to the curse of dimensionality, i.e. your data becomes sparser and sparser as you increase the number of features considered, which usually leads to overfitting. While there is no rule of thumb relating the number of features to the number of observations you should use, I try to keep e^Nf < No (Nf = number of features, No = number of observations) to minimise overfitting [this is not always possible and it does not ensure that we won’t overfit]. In this case, our training set has 75 observations, which suggests that using four features (e^4 ~ 54.6) is not entirely absurd (this check is included in the sketch after this list). Obviously, this depends on your data, so we will cover some further overfitting checks later on.

2- An important thing to consider when assembling training sets is the proportion of each class in your data. Think of an extreme scenario where you have many, many more observations for one class than for the others. How will this affect classification? This would make it more likely for the classifier to predict the dominant class when given new values. I mentioned before that the iris set is quite nice to play with: it comes with exactly 50 observations for each species of iris. What happens if you have a data set with a much higher number of observations for a particular class? You can bypass any imbalance in the representation of each class by carefully constructing your training set so that it does not favour any particular class. In this case, our randomly selected set has 21 observations for species setosa and 27 observations for each of species versicolor and virginica, so we are good to go (see the quick check in the sketch after this list).

3- Another common occurrence that is not represented by the iris data set is missing values (NAs) for observations. There are many ways of dealing with missing values, including assigning the median or the mode of that particular feature to the missing observation, or even disregarding some observations entirely, depending on how many observations you have. There are even ways to use random forests themselves to estimate a good value to assign to the missing observations (two helper functions for this appear in the sketch after this list), but for the sake of brevity, this will not be covered here.
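
A minimal sketch illustrating the three points above (the NA here is introduced artificially; rfImpute and na.roughfix are the imputation helpers shipped with the randomForest package, loaded early here just for this check):

library(randomForest)

# Point 1: crude feature-count check, e^Nf < No
exp(ncol(training) - 1) < nrow(training)

# Point 2: class balance of the training set
table(training$Species)

# Point 3: two ways of filling NAs, shown on a copy with one value removed
training_na <- training
training_na[1, "Sepal.Length"] <- NA
filled_rough <- na.roughfix(training_na)                   # median/mode fill
filled_rf <- rfImpute(Species ~ ., data = training_na)     # proximity-based imputation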

Right, data sets prepared and no missing values, so it is time to fire up our random forests algorithm. I am using the randomForest package; see its documentation for additional details. Here is the example usage code:

#import the package
library(randomForest)
# Perform training:
rf_classifier = randomForest(Species ~ ., data=training, ntree=100, mtry=2, importance=TRUE)

Note some important parameters:

-The first parameter specifies our formula: Species ~ . (we want to predict Species using each of the remaining columns of data).
-ntree defines the number of trees to be generated. It is typical to test a range of values for this parameter (e.g. 100, 200, 300, 400, 500) and choose the one that minimises the OOB estimate of error rate.
-mtry is the number of features used in the construction of each tree. These features are selected at random, which is where the “random” in “random forests” comes from. The default value for this parameter, when performing classification, is sqrt(number of features).
-importance enables the algorithm to calculate variable importance.

We can quickly look at the results of our classifier for our training set by printing the contents of rf_classifier:

> rf_classifier

Call:
 randomForest(formula = Species ~ ., data = training,ntree=100,mtry=2, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 2

        OOB estimate of  error rate: 5.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         21          0         0  0.00000000
versicolor      0         25         2  0.07407407
virginica       0          2        25  0.07407407


As you can see, it lists the call used to build the classifier, the number of trees (100), the number of variables tried at each split (2), and it outputs a very useful confusion matrix and OOB estimate of error rate. This estimate is calculated by counting how many points in the training set were misclassified by their out-of-bag predictions (2 versicolor and 2 virginica observations = 4) and dividing this number by the total number of observations (4/75 ~= 5.33%).

The OOB estimate of error rate is a useful measure to discriminate between different random forest classifiers. We could, for instance, vary the number of trees or the number of variables to be considered, and select the combination that produces the smallest value for this error rate. For more complicated data sets, i.e. when a higher number of features is present, a good idea is to use cross-validation to perform feature selection using the OOB error rate (see rfcv from randomForest for more details).
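
A rough sketch of what such a comparison could look like (the parameter grid here is purely illustrative; err.rate is the per-tree error matrix stored in every randomForest object):

# Compare the OOB error over a small grid of ntree and mtry values
for (n_trees in c(100, 200, 500)) {
  for (m in 1:4) {
    fit <- randomForest(Species ~ ., data = training, ntree = n_trees, mtry = m)
    cat("ntree =", n_trees, "mtry =", m,
        "OOB error =", round(fit$err.rate[n_trees, "OOB"], 4), "\n")
  }
}

# For data sets with many features, rfcv performs cross-validated feature selection:
# rfcv(trainx = training[, -5], trainy = training[, 5], cv.fold = 5)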

Remember the importance parameter? Let us take a look at the importance that our classifier has assigned to each variable:

varImpPlot(rf_classifier)

Each feature’s importance is assessed based on two criteria:

-MeanDecreaseAccuracy: gives a rough estimate of the loss in prediction performance when that particular variable is omitted from the training set. Caveat: if two variables are somewhat redundant, then omitting one of them may not lead to a large loss in prediction performance, but it would make the remaining variable appear more important.

-MeanDecreaseGini: Gini is a measure of node impurity. Think of it like this: if you use this feature to split the data, how pure will the nodes be? Highest purity means that each node contains only elements of a single class. MeanDecreaseGini adds up, over all trees, the decrease in Gini impurity achieved by splits on that feature, so it reflects how useful the feature is for splitting the data correctly.

Do note that these measures are used to rank variables in terms of importance and, thus, their absolute values can largely be disregarded.
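
If you want the numbers behind the plot (the same values varImpPlot draws from), the importance function from the same package returns them as a matrix:

# Raw importance values; useful for ranking the features, less so as absolute numbers
importance(rf_classifier)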

Ok, great. Looks like we have a classifier that was properly trained and is producing reasonably good predictions for our training set. Shall we evaluate what happens when we try to use this classifier to predict classes for our validation1 set?

# Validation set assessment #1: looking at confusion matrix
prediction_for_table <- predict(rf_classifier,validation1[,-5])
table(observed=validation1[,5],predicted=prediction_for_table)

            predicted
observed     setosa versicolor virginica
  setosa         29          0         0
  versicolor      0         20         3
  virginica       0          1        22

The confusion matrix is a good way of seeing how well our classifier performs when presented with new data.
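
If a single summary number is wanted alongside the matrix, the overall accuracy can be read off its diagonal (a small sketch):

# Overall accuracy on validation1: correct predictions / total observations
confusion <- table(observed = validation1[,5], predicted = prediction_for_table)
sum(diag(confusion)) / sum(confusion)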

Another way of assessing the performance of our classifier is to generate a ROC curve and compute the area under the curve:

 

# Validation set assessment #2: ROC curves and AUC

# Needs to import ROCR package for ROC curve plotting:
library(ROCR)

# Calculate the probability of new observations belonging to each class
# prediction_for_roc_curve will be a matrix with dimensions data_set_size x number_of_classes
prediction_for_roc_curve <- predict(rf_classifier,validation1[,-5],type="prob")

# Use pretty colours:
pretty_colours <- c("#F8766D","#00BA38","#619CFF")
# Specify the different classes 
classes <- levels(validation1$Species)
# For each class
for (i in 1:3)
{
 # Define which observations belong to class[i]
 true_values <- ifelse(validation1[,5]==classes[i],1,0)
 # Assess the performance of classifier for class[i]
 pred <- prediction(prediction_for_roc_curve[,i],true_values)
 perf <- performance(pred, "tpr", "fpr")
 if (i==1)
 {
     plot(perf,main="ROC Curve",col=pretty_colours[i]) 
 }
 else
 {
     plot(perf,main="ROC Curve",col=pretty_colours[i],add=TRUE) 
 }
 # Calculate the AUC and print it to screen
 auc.perf <- performance(pred, measure = "auc")
 print(auc.perf@y.values)
}

Here is the final product (ROC curve):

And here are the values for our AUCs:

Setosa
AUC = 1

Versicolor
AUC = 0.98

Virginica
AUC = 0.98

Voila! I hope this was somewhat useful!

Parallel Computing: GNU Parallel

Recently I started using the OPIG servers to run the algorithm I have developed (CRANkS) on datasets from DUDE (Database of Useful Decoys Enhanced).

This required learning how to run jobs in parallel. Previously I had been using computer clusters with their own queuing system (Torque/PBS), which allowed me to submit each molecule to be scored by the algorithm as a separate job. The queuing system would then automatically allocate nodes to jobs and execute them accordingly. On a side note, I learnt how to submit these jobs as an array, which was preferable to submitting ~150,000 separate jobs:

qsub -t 1:X array_submit.sh

where the contents of array_submit.sh would be:

#!/bin/bash
./$SGE_TASK_ID.sh

which would submit jobs 1.sh to X.sh, where X is the total number of jobs.

However, the OPIG servers do not have a global queuing system. I needed a way of running the code I already had in parallel with minimal changes to the workflow or the code itself. There are many ways to run jobs in parallel, but to minimise work for myself, I decided to use GNU Parallel [1].

This is an easy-to-use shell tool, which I found quick and easy to install onto my home server, allowing me to access it on each of the OPIG servers.

To use it I simply run the command:

cat submit.sh | parallel -j Y

where Y is the number of cores to run the jobs on, and submit.sh contains:

./1.sh
./2.sh
...
./X.sh

This executes each job, making use of up to Y cores at a time to run the jobs in parallel.

Quick, easy, simple and minimal modifications needed! Thanks to Jin for introducing me to GNU Parallel!

[1] O. Tange (2011): GNU Parallel – The Command-Line Power Tool, The USENIX Magazine, February 2011:42-47.

Using RDKit to load ligand SDFs into Pandas DataFrames

If you have downloaded lots of ligand SDF files from the PDB, then a good way of viewing/comparing all their properties is to load them into a Pandas DataFrame.

RDKit has a very handy function just for this – it’s found in the PandasTools module.

I show an example below within a Jupyter notebook, in which I load in the SDF file, view the table of molecules and apply other RDKit functions to the molecules.

First import the PandasTools module:

from rdkit.Chem import PandasTools

Read in the SDF file:

SDFFile = "./Ligands_noHydrogens_noMissing_59_Instances.sdf"
BRDLigs = PandasTools.LoadSDF(SDFFile)

You can see the whole table by calling the dataframe:

BRDLigs

The ligand properties in the SDF file are stored as columns. You can view what these properties are, and in my case I have loaded 59 ligands each having up to 26 properties:

BRDLigs.info()

It is also very easy to perform other RDKit functions on the dataframe. For instance, I noticed there is no heavy atom column, so I added my own called ‘NumHeavyAtoms’:

BRDLigs['NumHeavyAtoms']=BRDLigs.apply(lambda x: x['ROMol'].GetNumHeavyAtoms(), axis=1)

Here is the column added to the table, alongside columns containing the molecules’ SMILES and RDKit molecule:

BRDLigs[['NumHeavyAtoms','SMILES','ROMol']]

How to Calculate PLIFs Using RDKit and PLIP

Protein-Ligand interaction fingerprints (PLIFs) are becoming more widely used to compare small molecules in the context of a protein target. A fingerprint is a bit vector that is used to represent a small molecule. Fingerprints of molecules can then be compared to determine the similarity between two molecules. Rather than using the features of the ligand to build the fingerprint, a PLIF is based on the interactions between the protein and the small molecule. The conventional method of building a PLIF is that each bit of the bit vector represents a residue in the binding pocket of the protein. The bit is set to 1 if the molecule forms an interaction with the residue, whereas it is set to 0 if it does not.

Constructing a PLIF therefore consists of two parts:

  1. Calculating the interactions formed between the small molecule and the target.
  2. Collating this information into a bit vector.

Step 1 can be achieved by using the Protein-Ligand Interaction Profiler (PLIP). PLIP is an easy-to-use tool that, given a PDB file, will calculate the interactions between the ligand and the protein. This can be done using the online web tool or alternatively using the command-line tool. Six different interaction types are calculated: hydrophobic, hydrogen bonds, water-mediated hydrogen bonds, salt bridges, pi-pi and pi-cation. The command-line version outputs an XML report file containing all the information required to construct a PLIF.

Step 2 involves manipulating the output of the report file into a bit vector. RDKit is an amazingly useful cheminformatics toolkit with great documentation. Reading the PLIF into an RDKit bit vector allows it to be manipulated as an RDKit fingerprint. The fingerprints can then be compared very easily using RDKit functionality, for example using Tanimoto similarity.

EXAMPLE:

Let’s take 3 pdb files as an example. Fragment screening data from the SGC is a great source of data for this analysis, as it contains lots of pdb structures of small hits bound to the same target. The data can be found here. For this example I will use 3 protein-ligand complexes from the BRD1 dataset: BRD1A-m004.pdb, BRD1A-m006.pdb and BRD1A-m009.pdb.


1. PLIP: First we need to run PLIP to generate a report file for each protein-ligand complex. This is done using:


 

plipcmd -f BRD1A-m004.pdb -o m004 -x
plipcmd -f BRD1A-m006.pdb -o m006 -x
plipcmd -f BRD1A-m009.pdb -o m009 -x

 


A report file (‘report.xml’) is created for each pdb file within the directories m004, m006 and m009.

2. Get Interactions: Using a python script, the results of the report can be collated using the function “generate_plif_lists” (shown below) on each report file. The function takes in the report file name and the residues already found to be in the binding site (residue_list). “residue_list” must be updated for each molecule to be compared, as the residues used to define the binding site can vary between report files. The function then returns the updated “residue_list”, as well as a list of residues found to interact with the ligand: “plif_list_all”.

 


import xml.etree.ElementTree as ET

################################################################################
def generate_plif_lists(report_file, residue_list, lig_ident):
    #uses report.xml from PLIP to return a list of interacting residues and to update the list of residues in the binding site
    plif_list_all = []
    tree = ET.parse(report_file)
    root = tree.getroot()
    #list of residue keys that form an interaction
    for binding_site in root.findall('bindingsite'):
        nest = binding_site.find('identifiers')
        lig_code = nest.find('hetid')
        if str(lig_code.text) == str(lig_ident):
            #get the binding-site residues for this ligand
            nest_residue = binding_site.find('bs_residues')
            residue_list_tree = nest_residue.findall('bs_residue')
            for residue in residue_list_tree:
                res_id = residue.text
                dict_res_temp = residue.attrib
                if res_id not in residue_list:
                    residue_list.append(res_id)
                if dict_res_temp['contact'] == 'True':
                    if res_id not in plif_list_all:
                        plif_list_all.append(res_id)
    return plif_list_all, residue_list
################################################################################

residue_list = [] #start with an empty binding-site definition
plif_list_m006, residue_list = generate_plif_lists('m006/report.xml', residue_list, 'LIG')
plif_list_m009, residue_list = generate_plif_lists('m009/report.xml', residue_list, 'LIG')
plif_list_m004, residue_list = generate_plif_lists('m004/report.xml', residue_list, 'LIG')


3. Read Into RDKit: Now that we have the list of binding site residues and the residues interacting with the ligand, a PLIF can be generated. This is done using the function shown below (“generate_rdkit_plif”):


from rdkit import Chem, DataStructs
from rdkit.DataStructs import cDataStructs

################################################################################
def generate_rdkit_plif(residue_list, plif_list_all):
    #generates an RDKit plif given the list of residues in the binding site and the list of interacting residues
    plif_rdkit = DataStructs.ExplicitBitVect(len(residue_list), False)
    for index, res in enumerate(residue_list):
        if res in plif_list_all:
            print 'here'
            plif_rdkit.SetBit(index)
        else:
            continue
    return plif_rdkit
################################################################################

plif_m006 = generate_rdkit_plif(residue_list, plif_list_m006)
plif_m009 = generate_rdkit_plif(residue_list, plif_list_m009)
plif_m004 = generate_rdkit_plif(residue_list, plif_list_m004)


4. Play! These PLIFs can now be compared using RDKit functionality. For example the Tanimoto similarity between the ligands can be computed:


def similarity_plifs(plif_1, plif_2):
    sim = DataStructs.TanimotoSimilarity(plif_1, plif_2)
    print sim
    return sim
###################################################################

print similarity_plifs(plif_m006, plif_m009)
print similarity_plifs(plif_m006, plif_m004)
print similarity_plifs(plif_m009, plif_m004)


The output is: 0.2, 0.5, 0.0.

All files used to generate the PLIFs can be found here. Happy PLIF-making!

Viewing 3D molecules interactively in Jupyter iPython notebooks

Greg Landrum, curator of the invaluable open source cheminformatics API, RDKit, recently blogged about viewing molecules in a 3D window within a Jupyter-hosted iPython notebook (as long as your browser supports WebGL, that is).

The trick is to use py3Dmol. It’s easy to install:

pip install py3Dmol

This is built on 3Dmol.js, an object-oriented, WebGL-based JavaScript library for online molecular visualization (Rego & Koes, 2015); here's a nice summary of the capabilities of 3Dmol.js. Its features include:

  • support for pdb, sdf, mol2, xyz, and cube formats
  • parallelized molecular surface computation
  • sphere, stick, line, cross, cartoon, and surface styles
  • atom property based selection and styling
  • labels
  • clickable interactivity with molecular data
  • geometric shapes including spheres and arrows

I tried a simple example and it worked beautifully:

import py3Dmol
view = py3Dmol.view(query='pdb:1hvr')
view.setStyle({'cartoon':{'color':'spectrum'}})
view

(Screenshot: 1HVR rendered by py3Dmol inside the Jupyter notebook.)

The 3Dmol.js website summarizes how to view molecules, along with how to choose representations, how to embed it, and even how to develop with it.

References

Nicholas Rego & David Koes (2015). “3Dmol.js: molecular visualization with WebGL”.
Bioinformatics, 31 (8): 1322-1324. doi:10.1093/bioinformatics/btu829

Plotting and storing a 3D network in R

A simple toy example of a three-layered network:

Note 1: In order to view the 3D plots, Mac users will need XQuartz installed (https://www.xquartz.org/).

 
require(igraph)
require(rgl)
#Another package that might be needed is "rglwidget". The function writeWebGL will show an error if rglwidget is required but not installed.
######################################//// 
######The basics######################////
######################################////
#1) Create a "food" network (three layers) 
set.seed(432)
g1<-watts.strogatz.game(dim = 1,size = 5,nei = 2,p = .5,loops = FALSE,multiple = FALSE)
g2<-watts.strogatz.game(dim = 1,size = 10,nei = 2,p = .2,loops = FALSE,multiple = FALSE)
g3<-watts.strogatz.game(dim = 1,size = 30,nei = 1,p = .5,loops = FALSE,multiple = FALSE)
g123=g1+g2+g3 


#Create more edges btw layers 
g123=rewire(g123,each_edge(prob=.4,loops = FALSE,multiple = FALSE)) 
ne=15;g123=add_edges(g123,edges = as.vector(t(cbind(sample(1:vcount(g1),size = ne,replace = TRUE), sample((vcount(g1)+1):vcount(g123),size = ne,replace = TRUE)))))#top layer 
ne=30;g123=add_edges(g123,edges = as.vector(t(cbind(sample((vcount(g1)+1):(vcount(g1)+vcount(g2)),size = ne,replace = TRUE), sample((vcount(g1)+vcount(g2)+1):vcount(g123),size = ne,replace = TRUE)))))#second layer 

#A quick plot of the graph
plot(g123,vertex.size=1,vertex.label.cex=0.02)


#Create 3d coordinates of the network layout
circpos=function(n,r=1){#Coordinates on a circle
 rad=seq(0,2*pi,length.out=n+1)[-1];x=cos(rad)*r;y=sin(rad)*r
 return(cbind(x,y))
}
#
lay=rbind(cbind(circpos(vcount(g1),r=1), runif(n = vcount(g1),-1,1)),
 cbind(circpos(vcount(g2),r=2), runif(n = vcount(g2),6,7)),
 cbind(circpos(vcount(g3),r=4), runif(n = vcount(g3),13,17))
)



#2d plot using the previous layout
plot(g123,vertex.size=5,vertex.label=NA,layout=lay[,c(1,3)])
plot(g123,vertex.size=1,vertex.label=NA,layout=lay[,c(1,2)])

(Figure: 2D plots of the three-layer network using the layout above.)

 
#3D graph plot
#Add some colour to nodes and edges
nodecols=c(rep("red",vcount(g1)),
 rep("blue",vcount(g2)),
 rep("yellow",vcount(g3)))

edgecols=function(elist,cols,grouplist){
  whatcol=rep(length(cols)+1,nrow(elist))
  finalcol=whatcol
  for(i in 1:nrow(elist)){
    for(k in length(cols):1){
      if( k * (length( intersect(elist[i,], grouplist[[k]]) ) > 0)){
        whatcol[i]=min(whatcol[i], k )
      }
    }
    finalcol[i]=cols[whatcol[i]]
  }
  return(finalcol)
}

#Open 3d viewer
rgl.open()
rglplot(g123, layout=lay,vertex.size=5,vertex.label=NA,vertex.color=nodecols,
 edge.color=edgecols(elist=get.edgelist(g123,names = FALSE),cols=c("orange","green","pink"),grouplist=list(1:vcount(g1), (vcount(g1)+1):(vcount(g1)+vcount(g2)), (vcount(g1)+vcount(g2)+1):vcount(g123)) )
)

(Figure: 3D plot of the three-layer network.)

###Storing the plot in an html file###

dirfolder="..." #your dir
#rgl.open()#instead of rgl.open use open3d, in order to save the plot. 
open3d()
rglplot(g123, layout=lay,vertex.size=5,vertex.label=NA,vertex.color=nodecols,
 edge.color=edgecols(elist=get.edgelist(g123,names = FALSE),cols=c("orange","green","pink"),grouplist=list(1:vcount(g1), (vcount(g1)+1):(vcount(g1)+vcount(g2)), (vcount(g1)+vcount(g2)+1):vcount(g123)) )
)
#Fix the view
rgl.viewpoint(theta=90, phi=0)

#Save a static 2d image:
rgl.snapshot(paste(dirfolder,"a_png_pic_0.png",sep=""), fmt="png", top=TRUE)

#Save the plot in a .html file:
rglfolder=writeWebGL(dir = paste(dirfolder,"first_net3d",sep=""), width=700)

#The previous function should create a file called index.html inside the folder "first_net3d". By opening this file in a browser (with JavaScript enabled) the 3d plot will be displayed again.
#Also the following command will open the plot in the browser:
browseURL(rglfolder)

Note 2: In order to view the .html file, JavaScript should be enabled in the browser. (Here is an example of how to do this for Safari.)

Although not covered in the previous script, further options are available, such as edge/vertex size and the ability to control each of the nodes and edges in the graph independently. A small sketch of this is given below; the figures that follow come from an example that makes fuller use of these options.
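
The sketch (illustrative only; not the code behind the figures below) simply passes per-node and per-edge vectors to rglplot, just as the edge.color call above already does:

#Per-node and per-edge control: most plotting attributes accept one value per vertex/edge
rglplot(g123, layout=lay,
 vertex.size=3+degree(g123), #node size scales with degree (hypothetical choice)
 vertex.label=NA,
 vertex.color=nodecols,
 edge.color=edgecols(elist=get.edgelist(g123,names = FALSE),cols=c("orange","green","pink"),grouplist=list(1:vcount(g1), (vcount(g1)+1):(vcount(g1)+vcount(g2)), (vcount(g1)+vcount(g2)+1):vcount(g123)) )
)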

3D network representing a T cell receptor. Edges are coloured according to a relevant path found between the bottom green node and the upper red node cluster.

T cell receptor (in blue), binding to a peptide (in red).

Counting Threads

When someone talks about “counting threads” the first thing that you think of is probably shopping for bed sheets. But this post is not about the happy feeling of drifting off to sleep on smooth, comfortable Egyptian cotton. This post is about that much less happy feeling when you want to quickly run a bit of code on a couple of data sets to finish the results section of a thesis chapter, and you see this:

Luis blocking the server again…

Obviously someone (*cough*Luis*cough*) is having some fun on the server without nice-ing their code to allow people who are much less organized than they should be (*cough*me*cough*) to do a quick last-minute data analysis run.

The solution: confront the culprit with their excessive server usage (Note: alternatives include manual server restart with a power cord to make the world your enemy – not recommended).

So… now we just need to find out how much of the server “ospina” is using. Screenshots won’t convince him… and we can’t take enough screenshots to show the extent of the server-hogging with his 1000s of processes anyway. We need to count…

Luckily there is a handy function to find out information about processes called pgrep. This is basically a ‘ps | grep’ function which has a bunch of options to reflect the many ways it can be used. We see ospina is running R, so here goes:

pgrep -c R

The -c flag counts processes and the pattern matches the command name that was run. But yeah, it turns out this wasn’t the best idea ever. A lot of people are running R (as might be expected in the Statistics Department), and you get a number that is really too high to be likely. We need to be more specific in our query, so let’s go back to the ps command. Second attempt:

ps -Af | grep ospina | wc -l

What we’re doing now is first showing all processes that are run on the server (ps -A) also showing details of the command run and who ran it (-f flag). Then we’re finding the ones that are labelled with our server culprit (grep ospina) and counting the lines we find. There are annoyingly still a few problems with this approach.

  1. We just ran this command on the server and thus will count a command like grep --color=auto ospina,
  2. User “ospina” is probably running a few more things than just his R command (like ssh-ing into the server and maybe a couple of screens)
  3. We get a number that looks far lower than what we expected just by visual inspection.

So… what happened? We can fix problems 1 and 2 by just piping to a further grep command. But problem 3 is different. As it turns out, our culprit is running multiple threads from the same process (which is also why you find so many Chrome instances in htop, for example). We just counted processes, when really the server is being occupied by his multi-threading exploits. So… if you want to back up your complaint with a nice number, here’s your baby:

ps -ALf | grep ospina | grep R-3.3 | wc -l

The -L flag displays all threads instead of only the processes. I further grep for R-3.3, as it turns out he is using a specific version of R, which helps narrow the search down to his command. Otherwise it also helps to use input arguments to functions to search against. If your fingers get too tired to press the shift key that often, ps -ALf is equivalent to ps -eLf.

For now: moan away, folks!

 

Disclaimer: Any scenarios alluded to in the above text are fictitious and do not represent the behaviour of the individuals mentioned. Read: obviously I do not do my analysis last-minute.

A beginner’s guide to Rosetta

Rosetta is a big software suite, and I mean really big. It includes applications for protein structure prediction, refinement, docking, and design, and specific adaptations of these applications (and others) to particular cases, for example protein-protein docking of membrane proteins to form membrane protein complexes. Some applications are available in one of the hassle-free servers online (e.g. ROSIE, Robetta, rosetta.design), which might work well if you’ve got just a few tests you would like to try using standard parameters and protocols. However, it’s likely that you will want to download and install a version if you’re interested in carrying out a large amount of modelling, or using an unusual combination of steps or an unusual scoring function. This is not a trivial task, as the source code is a 2.5 GB download, and your machine will then be busy compiling for some time (around 5 hours on two cores on my old laptop). Alternatively, if the protocols and objects you’re interested in are part of PyRosetta, this is available as a pre-compiled package for most common operating systems and is less than 1 GB.

This brings me to the different ways to use Rosetta. Most applications come as an executable which you can find in Rosetta/main/source/bin/ after completing the build. There is documentation available on how to use most of these, and on the different flags which can be used to input PDB structures and parameters. Some applications can be run using RosettaScripts, which uses an xml file to define the protocol, including scoring functions, movers and other options. In this case, Rosetta/main/source/bin/rosetta_scripts.* is run, which will read the xml and execute the required protocol.

An example RosettaScript, used for the MPrelax protocol.

PyRosetta is even more flexible, and relatively easy to use for anyone accustomed to programming in python. There are python bindings for the fast C++ objects and movers, so the increased usability is generally not greatly compromised by slower speeds. One of the really handy things about PyRosetta is the link to PyMOL, which can be used to view the trajectory of your protein moving while a simulation is running. Just add the following to your .pymolrc file in your home directory to set up the link every time you open PyMOL:

run /PATH/TO/PYROSETTA/PyMOLPyRosettaServer.py

When it comes to finding your way around the Rosetta package, there are a few things it is very useful to know to start with. The demos directory contains plenty of useful example scripts and instructions for running your first jobs. In demos/tutorials you will find introductions to the main concepts. The demos/protocol_capture subdirectory is particularly helpful, as most papers which report a new Rosetta protocol will deposit here the scripts required to reproduce their results. These may not currently be the best methods to approach a problem, but if you have found a research article describing some results which would be useful to get for your system, they are a good starting point to learn how to make an application work. Then the world is your oyster as you explore the many possible options and inputs to change and add!

Processing large files using python: part deux

Last week I wrote a post on some of the methods I use in python to efficiently process very large datasets. You can find that here. Roughly, it details how one can break a large file into chunks which can then be passed onto multiple cores to do the number crunching. Below I expand upon this, first creating a parent class which turns a given (large) file into chunks. I construct it in a manner in which child classes can be easily created and tailored to specific file types, and give some examples. Finally, I give some wrapping functions for use in conjunction with any of the chunkers so that the chunks can be processed using multiple cores.

First, and as an aside, I was asked after the previous post at what scale these methods should be considered. A rough answer would be when the size of the data becomes comparable to the available RAM. A better answer would be when the overhead of reading each individual line (/entry) is more than the operation on that entry. Here is an example of this case, though it isn’t really that fair a comparison:


>>> import timeit,os.path
>>> os.path.getsize("Saccharomyces_cerevisiae.R64-1-1.cds.all.fa")
10095955
>>> timeit.timeit('f = open("Saccharomyces_cerevisiae.R64-1-1.cds.all.fa");sum([l.count(">") for l in f])',number=10)
0.8403599262237549
>>> timeit.timeit('f = open("Saccharomyces_cerevisiae.R64-1-1.cds.all.fa");sum([c.count(">") for c in iter(lambda: f.read(1024*1024),"")])',number=10)
0.15671014785766602

For a small 10MB fasta file, we count the number of sequences present in a fifth of the time using chunks. I should be honest though, and state that the speedup is mostly due to not having to identify newline characters in the chunking method; but nevertheless, it shows the power one can have using chunks. For a 14GB fasta file, the times for the chunking (1MB chunks) and non-chunking methods are 55s and 130s respectively.

Getting back on track, let’s turn the chunking method into a parent class from which we can build on:


import os.path

class Chunker(object):

    #Iterator that yields start and end locations of a file chunk of default size 1MB.
    @classmethod
    def chunkify(cls,fname,size=1024*1024):
        fileEnd = os.path.getsize(fname)
        with open(fname,'r') as f:
            chunkEnd = f.tell()
            while True:
                chunkStart = chunkEnd
                f.seek(size,1)
                cls._EOC(f)
                chunkEnd = f.tell()
                yield chunkStart, chunkEnd - chunkStart
                if chunkEnd >= fileEnd:
                    break

    #Move file pointer to end of chunk
    @staticmethod
    def _EOC(f):
        f.readline()

    #read chunk
    @staticmethod
    def read(fname,chunk):
        with open(fname,'r') as f:
            f.seek(chunk[0])
            return f.read(chunk[1])

    #iterator that splits a chunk into units
    @staticmethod
    def parse(chunk):
        for line in chunk.splitlines():
            yield line

In the above, we create the class Chunker, which has the class method chunkify as well as the static methods _EOC, read, and parse. The method chunkify does the actual chunking of a given file, returning an iterator that yields tuples containing a chunk’s start and size. It’s a class method so that it can make use of the _EOC (end of chunk) static method to move the pointer to a suitable location to split the chunks. For the simplest case, this is just the end/start of a newline. The read and parse methods read a given chunk from a file and split it into units (single lines in the simplest case) respectively. We make the non-chunkify methods static so that they can be called without the overhead of creating an instance of the class.

Let’s now create some children of this class for specific types of files. First, one of the most well known file types in bioinformatics, FASTA. Below is a segment of a FASTA file. Each entry has a header line, which begins with a ‘>’, followed by a unique identifier for the sequence. After the header line, one or more lines follow giving the sequence. Sequences may be either protein or nucleic acid sequences, and they may contain gaps and/or alignment characters.


>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

And here is the file type specific chunker:


from Bio import SeqIO
from cStringIO import StringIO

class Chunker_FASTA(Chunker):

    @staticmethod
    def _EOC(f):
        l = f.readline() #incomplete line
        p = f.tell()
        l = f.readline()
        while l and l[0] != '>': #find the start of sequence
            p = f.tell()
            l = f.readline()
        f.seek(p) #revert one line

    @staticmethod
    def parse(chunk):
        fh = StringIO(chunk)
        for record in SeqIO.parse(fh,"fasta"):
            yield record
        fh.close()

We update the _EOC method to find where one entry finishes and the next begins by locating the ‘>’, following which we rewind the file handle pointer to the start of that line. We also update the parse method to use the FASTA parser from the BioPython module, yielding SeqRecord objects for each entry in the chunk.

For a second, slightly harder example, here is one designed to work with output produced by bowtie, an aligner of short reads from NGS data. The format consists of tab-separated columns, with the id of each read located in the first column. Note that a single read can align to multiple locations (up to 8 by default!), hence why the same id appears on multiple lines. A small example section of the output is given below.


SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN25-2	2502 GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN37-2	4460	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN25-1	2502	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN37-1	4460	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN37-2	4460	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN25-1	2502	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN37-1	4460	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN25-2	2502	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.8 HWI-EAS355_3_Nick_1_1_187_1549 length=36	+	RDN25-2	2502	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.8 HWI-EAS355_3_Nick_1_1_187_1549 length=36	+	RDN37-2	4460	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3

with the corresponding chunker given by:


class Chunker_BWT(Chunker):

    @staticmethod
    def _EOC(f):
        l = f.readline() #incomplete line
        p = f.tell()
        l = f.readline()
        if not l: return #EOF
        readID = l.split()[0]
        while l and (l.split()[0] == readID): #keep reading lines while the read IDs match
            p = f.tell()
            l = f.readline()
        f.seek(p) #revert to the start of the first line with a new read ID

    @staticmethod
    def parse(chunk):
        lines = chunk.splitlines()
        N = len(lines)
        i = 0 
        while i < N:
            readID = lines[i].split('\t')[0]
            j = i
            while lines[j].split('\t')[0] == readID:
                j += 1
                if j == N:
                    break
            yield lines[i:j]
            i = j

This time, the end of chunk is located by reading lines until there is a switch in the read id, whereupon we revert one line. For parsing, we yield all the different locations a given read aligns to as a single entry.

Hopefully these examples show you how the parent class can be easily expanded upon for most file types. Let’s now combine these various chunkers with the code from the previous post to show how we can enable multicore parallel processing of the chunks they yield. The code below contains a few generalised wrapper functions which work in tandem with any of the above chunkers to allow most tasks to be parallelised.


import multiprocessing as mp, sys

def _printMP(text):
    print text
    sys.stdout.flush()

def _workerMP(chunk,inFile,jobID,worker,kwargs):
    _printMP("Processing chunk: "+str(jobID))
    output = worker(chunk,inFile,**kwargs)
    _printMP("Finished chunk: "+str(jobID))
    return output	

def main(inFile,worker,chunker=Chunker,cores=1,kwargs={}):
    pool = mp.Pool(cores)

    jobs = []
    for jobID,chunk in enumerate(chunker.chunkify(inFile)):
        job = pool.apply_async(_workerMP,(chunk,inFile,jobID,worker,kwargs))
        jobs.append(job)

    output = []
    for job in jobs:
        output.append( job.get() )

    pool.close()
    
    return output

The main function should be recognisable as the code from the previous post. It generates the pool of workers, ideally one for each core, before using the given chunker to split the corresponding file into a series of chunks for processing. Unlike before, we collect the output given by each job and return it after processing is finished. The main function acts as a wrapper allowing us to specify different processing functions and different chunkers, as given by the variables worker and chunker respectively. We have wrapped the processing function call within the function _workerMP, which prints to the terminal as tasks are completed. It uses the function _printMP to do this, as you need to flush the terminal after a print statement when using multicore processing, otherwise nothing appears until all tasks are completed.

Let’s finish by showing an example of how we would use the above to do the same task as we did at the start of this post, counting the sequences within a fasta file, using the base chunker:


def seq_counter(chunk,inFile):
    data = Chunker.read(inFile,chunk)
    return data.count('>')

and using the FASTA chunker:


def seq_counter_2(chunk,inFile):
    data = list(Chunker_FASTA.parse(Chunker_FASTA.read(inFile,chunk)))
    return len(data)

And here are the times they take to count the sequences within the 14GB file from before:


>>> os.path.getsize("SRR951828.fa")
14944287128
>>> x=time.time();f = open("SRR951828.fa");sum([l.count(">") for l in f]);time.time()-x
136829250
190.05533599853516
>>> x=time.time();f = open("SRR951828.fa");sum([c.count(">") for c in iter(lambda: f.read(1024*1024),"")]);time.time()-x
136829250
26.343637943267822
>>> x=time.time();sum(main("SRR951828.fa",seq_counter,cores=8));time.time()-x
136829250
4.36846399307251
>>> x=time.time();sum(main("SRR951828.fa",seq_counter_2,Chunker_FASTA,8));time.time()-x
136829250
398.94060492515564

Let’s ignore that last one, as the slowdown is due to turning the entries into BioPython SeqRecords. The prior one, which combines chunking and multicore processing, gives roughly a 50-fold speed-up over the naive line-by-line approach. I’m sure this could be reduced further using more cores and/or optimising the chunk size; however, this difference alone can change something from being computationally implausible to plausible. Not bad for only a few lines of code.

Anyway, as before, I hope that some of the above was either new or even perhaps helpful to you. If you know of a better way to do things (in python), then I’d be very interested to hear about it. If I feel like it, I may follow this up with a post about how to integrate a queue into the above which outputs the result of each job as it is produced. In the above, we currently collate all the results in memory, which has the potential to cause a memory overflow depending on what is being returned.

Processing large files using python

In the last year or so, and with my increased focus on ribo-seq data, I have come to fully appreciate what the term big data means. The ribo-seq studies in their raw forms can easily reach into hundreds of GBs, which means that processing them in both a timely and efficient manner requires some thought. In this blog post, and hopefully those following, I want to detail some of the methods I have come up with (read: pieced together from multiple stack exchange posts) that help me take on data of this magnitude. Specifically I will be detailing methods for python and R, though some of the methods are transferable to other languages.

My first big data tip for python is learning how to break your files into smaller units (or chunks) in a manner that lets you make use of multiple processors. Let’s start with the simplest way to read a file in python.


with open("input.txt") as f:
    data = f.readlines()
    for line in data:
        process(line)

The mistake made above, with regards to big data, is that it reads all the data into RAM before attempting to process it line by line. This is likely the simplest way to cause the memory to overflow and an error to be raised. Let’s fix this by reading the data in line by line, so that only a single line is stored in the RAM at any given time.


with open("input.txt") as f:
    for line in f:
        process(line)

This is a big improvement, namely it doesn’t crash when fed a big file (though also it’s shorter!). Next we should attempt to speed this up a bit by making use of all these otherwise idle cores.


import multiprocessing as mp

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
with open("input.txt") as f:
    for line in f:
        jobs.append( pool.apply_async(process,(line,)) )

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()

Provided the order in which you process the lines doesn’t matter, the above generates a set (pool) of workers, ideally one for each core, before creating a bunch of tasks (jobs), one for each line, for the workers to do. I tend to use the Pool object provided by the multiprocessing module due to its ease of use; however, you can spawn and control individual workers using mp.Process if you want finer control. For mere number crunching, the Pool object is very good.

While the above is now making use of all those cores, it sadly runs into memory problems once again. We specifically use the apply_async function so that the pool isn’t blocked while each line is processed. However, in doing so, all the data is read into memory once again; this time stored as individual lines associated with each job, waiting in line to be processed. As such, the memory will again overflow. Ideally the method would only read the line into memory when it is its turn to be processed.


import multiprocessing as mp

def process_wrapper(lineID):
    with open("input.txt") as f:
        for i,line in enumerate(f):
            if i != lineID:
                continue
            else:
                process(line)
                break

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
with open("input.txt") as f:
    for ID,line in enumerate(f):
        jobs.append( pool.apply_async(process_wrapper,(ID,)) )

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()

Above we’ve now changed the function fed to the pool of workers to include opening the file, locating the specified line, reading it into memory, and then processing it. The only input now stored for each job spawned is the line number, thereby preventing the memory overflow. Sadly, the overhead involved in having to locate the line by reading iteratively through the file for each job is untenable, getting progressively more time consuming as you get further into the file. To avoid this we can use the seek function of file objects, which skips you to a particular location within a file. Combining this with the tell function, which returns the current location within a file, gives:


import multiprocessing as mp

def process_wrapper(lineByte):
    with open("input.txt") as f:
        f.seek(lineByte)
        line = f.readline()
        process(line)

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
with open("input.txt") as f:
    nextLineByte = f.tell()
    for line in f:
        jobs.append( pool.apply_async(process_wrapper,(nextLineByte,)) )
        nextLineByte = f.tell()

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()

Using seek we can move directly to the correct part of the file, whereupon we read a line into memory and process it. We have to be careful to correctly handle the first and last lines, but otherwise this does exactly what we set out to do, namely using all the cores to process a given file while not overflowing the memory.

I’ll finish this post with a slight upgrade to the above, as there is a reasonable amount of overhead associated with opening and closing the file for each individual line. If we process multiple lines of the file at a time as a chunk, we can reduce these operations. The biggest technicality when doing this is noting that when you jump to a location in a file, you are likely not located at the start of a line. For a simple file, as in this example, this just means you need to call readline, which reads to the next newline character. More complex file types likely require additional code to locate a suitable location to start/end a chunk.


import multiprocessing as mp,os

def process_wrapper(chunkStart, chunkSize):
    with open("input.txt") as f:
        f.seek(chunkStart)
        lines = f.read(chunkSize).splitlines()
        for line in lines:
            process(line)

def chunkify(fname,size=1024*1024):
    fileEnd = os.path.getsize(fname)
    with open(fname,'r') as f:
        chunkEnd = f.tell()
        while True:
            chunkStart = chunkEnd
            f.seek(size,1)
            f.readline()
            chunkEnd = f.tell()
            yield chunkStart, chunkEnd - chunkStart
            if chunkEnd >= fileEnd:
                break

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
for chunkStart,chunkSize in chunkify("input.txt"):
    jobs.append( pool.apply_async(process_wrapper,(chunkStart,chunkSize)) )

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()

Anyway, I hope that some of the above was either new or even perhaps helpful to you. If you know of a better way to do things (in python), then I’d be very interested to hear about it. In another post coming in the near future, I will expand on this code, turning it into a parent class from which multiple children can be created for use with various file types.