Category Archives: Technical

Storing your stuff with clever filesystems: ZFS and tmpfs

The filesystem is a a critical component of just about any operating system, however it’s often overlooked. When setting up a new server, the default filesystem options are often ticked and never thought about again. However, there exist a couple of filesystems which can provide some extraordinary features and speed. I’m talking about ZFS and tmpfs.

ZFS was originally developed by Oracle for their Solaris operating system, but has now been open-sourced and is freely available on linux. Tmpfs is a temporary file system which uses system memory to provide fast temporary storage for files. Together, they can provide outstanding reliability and speed for not very much effort.

Hard disk capacity has increased exponentially over the last 50 years. In the 1960s, you could rent a 5MB hard disk from IBM for the equivalent of $130,000 per month. Today you can buy for less than $600 a 12TB disk – a 2,400,000 times increase.

As storage technology has moved on, the filesystems which sit on top of them ideally need to be able to access the full capacity of those ever increasing disks. Many relatively new, or at least in-use, filesystems have serious limitations. Akin to “640K ought to be enough for anybody”, the likes of the FAT32 filesystem supports files which are at most 4GB on a chunk of disk (a partition) which can be at most 16TB. Bear in mind that arrays of disks can provide a working capacity of many times that of a single disk. You can buy the likes of a supermicro sc946ed shelf which will add 90 disks to your server. In an ideal world, as you buy bigger disks you should be able to pile them into your computer and tell your existing filesystem to make use of them, your filesystem should grow and you won’t have to remember a different drive letter or path depending on the hardware you’re using.

ZFS is a 128-bit file system, which means a single installation maxes out at 256 quadrillion zettabytes. All metadata is allocated dynamically so there isn’t the need to pre-allocate inodes and directories can have up to 2^48 (256 trillion) entries. ZFS provides the concept of “vdevs” (virtual devices) which can be a single disk or redundant/striped collections of multiple disks. These can be dynamically added to a pool of vdevs of the same type and your storage will grow onto the fresh hardware.

A further consideration is that both disks of the “spinning rust” variety and SSDs are subject to silent data corruption, i.e. “bit rot”. This can be caused by a number of factors even including cosmic rays, but the consequence is read errors when it comes time to retrieve your data. Manufacturers are aware of this and buried in the small print for your hard disk will be values for “unrecoverable read errors” i.e. data loss. ZFS works around this by providing several mechanisms:

Checksums for each block of data written.
Checksums for each pointer to data.
Scrub – Automatically validates checksums when the system is idle.
Multiple copies – Even if you have a single disk, it’s possible to provide redundancy by setting a copies=n variable during filesystem creation.
Self-healing – When a bad data block is detected, ZFS fetches the correct data from a redundant copy and replaces it with the correct data.

An additional bonus to ZFS is its ability to de-duplicate data. Should you be working with a number of very similar files, on a normal filesystem, each file will take up space proportional to the amount of data that’s contained. As ZFS keeps checksums of each block of data, it’s able to determine if two blocks contain identical data. ZFS therefore provides the ability to have multiple pointers to the same file and only store the differences between them.

ZFS also provides the ability to take a point in time snapshot of the entire filesystem and roll it back to a previous time. If you’re a software developer, got a package that has 101 dependencies and you need to upgrade it? Afraid to upgrade it in case it breaks things horribly? Working on code and you want to roll back to a previous version? ZFS snapshots can be run with cron or manually and provide a version of the filesystem which can be used to extract previous versions of overwritten or deleted files or used to roll everything back to a point in time when it worked.

Similar to deduplication, a snapshot won’t take up any disk extra space until the data starts to change.

The other filesystem worth mentioning is tmpfs. Tmpfs takes part of the system memory and turns it into a usable filesystem. This is incredibly useful for systems which create huge numbers of temporary files and attempt to re-read them. Tmpfs is also just about as fast as a filesystem can be. Compared to a single SSD or a RAID array of six disks, tmpfs blows them out of the water speed wise.

Creating a tmpfs filesystem is simple:
First create your mountpoint for the disk:

mkdir /mnt/ramdisk

Then mount it. The options are saying make it 1GB in size, it’s of type tmpfs and to mount it at the previously created mount point:

mount –t tmpfs -o size=1024m tmpfs /mnt/ramdisk

At this point, you can use it like any other filesystem:

df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1  218G 128G   80G  62% /
/dev/sdb1  6.3T 2.4T  3.6T  40% /spinnyrust
tank       946G 3.5G  942G   1% /tank
tmpfs      1.0G 391M  634M  39% /mnt/ramdisk

Measuring correlation

Correlation is defined as how close two variables are to having a dependence relationship with each other. At first sight, it looks kind of simple, but there are two main problems:

Despite the obvious situations (i.e. correlation = 1), it is difficult to say whether 2 variables are correlated or not (i.e correlation = 0.7). For instance, would you be able to say if the variables X and Y from the following to plots are correlated?
There are different ways of measure of correlation that may not agree when comparing different distributions. As an example, which plot shows a higher correlation? The answer will depend on how you do measure the correlation since if you use Pearson correlation, you would pick A whereas if you choose Spearman correlation you will take B

Here, I will explain some of the different correlation measures you can use:

Pearson product-moment correlation coefficient

What does it measure? Only linear dependencies between the variables.
How it is obtained? By dividing the covariance of the two variables by the product of their standard deviations. (It is defined only if both of the standard deviations are finite and nonzero). $\rho _{X,Y}={\frac {\operatorname {cov} (X,Y)}{\sigma _{X}\sigma _{Y}}}$

Properties:

ρ (X,Y) = +1 : perfect direct (increasing) linear relationship (correlation).
ρ (X,Y) = -1 : perfect decreasing (inverse) linear relationship (anticorrelation).
In all other cases, ρ (X,Y) indicates the degree of linear dependence between the variables. As it approaches zero there is less of a relationship (closer to uncorrelated).
Only gives a perfect value when X and Y are related by a linear function.

When is it useful? For the case of a linear model with a single independent variable, the coefficient of determination (R squared) is the square of r, Pearson’s product-moment coefficient.

Spearman’s rank correlation coefficient:

What does it measure? How well the relationship between two variables can be described using a monotonic function (a function that only goes up or only goes down).
How it is obtained? Pearson correlation between the rank values of the two variables.

$r_{s}=\rho _{\operatorname {rg} _{X},\operatorname {rg} _{Y}}={\frac {\operatorname {cov} (\operatorname {rg} _{X},\operatorname {rg} _{Y})}{\sigma _{\operatorname {rg} _{X}}\sigma _{\operatorname {rg} _{Y}}}}$

Only if all n ranks are distinct integers, it can be computed using the popular formula.

$r_{s}={1-{\frac {6\sum d_{i}^{2}}{n(n^{2}-1)}}}.$

$W$ here di is the difference between the two ranks of each observation.

Properties:

rs (X,Y) = +1: X and Y are related by any increasing monotonic function.
rs (X,Y) = -1: X and Y are related by any decreasing monotonic function.
The Spearman correlation increases in magnitude as X and Y become closer to being perfect monotone functions of each other.

When is it useful? It is appropriate for both continuous and discrete ordinal variables. It can be use for looking for non-linear dependence relationships.

Kendall’s tau coefficient

What does it measure? The ordinal association between two measured quantities.
How it is obtained?

$\tau ={\frac {({\text{number of concordant pairs}})-({\text{number of discordant pairs}})}{n(n-1)/2}}.$

Any pair of observations (xi , yi) and (xj, yj) are said to be concordant if the ranks for both elements agree. That happens if xi-xj and yi-xj have the same sign. If their sign are different, they are considered as discordant pairs

Properties:

τ (X,Y) = +1: The agreement between the two rankings is perfect (i.e., the two rankings are the same)
τ (X,Y) = -1: The disagreement between the two rankings is perfect (i.e., one ranking is the reverse of the other)
If X and Y are independent, then we would expect the coefficient to be approximately zero.

When is it useful? It is appropriate for both continuous and discrete ordinal variables. It can be use for looking for non-linear dependence relationships.

Distance correlation:

What does it measure? Both linear and nonlinear association between two random variables or random vectors.
How is it obtained? By dividing the variable’s distance covariance by the product of their distance standard deviations:

$\operatorname {dCor}(X,Y)={\frac {\operatorname {dCov}(X,Y)}{{\sqrt {\operatorname {dVar}(X)\,\operatorname {dVar}(Y)}}}},$

The distance covariance is defined as:

$\operatorname {dCov} _{n}^{2}(X,Y):={\frac {1}{n^{2}}}\sum _{j=1}^{n}\sum _{k=1}^{n}A_{j,k}\,B_{j,k}.$

Where:

$A_{j,k}:=a_{j,k}-{\overline {a}}_{j\cdot }-{\overline {a}}_{\cdot k}+{\overline {a}}_{\cdot \cdot },\qquad B_{j,k}:=b_{j,k}-{\overline {b}}_{j\cdot }-{\overline {b}}_{\cdot k}+{\overline {b}}_{\cdot \cdot },$

${\begin{aligned}a_{{j,k}}&=\|X_{j}-X_{k}\|,\qquad j,k=1,2,\ldots ,n,\\b_{{j,k}}&=\|Y_{j}-Y_{k}\|,\qquad j,k=1,2,\ldots ,n,\end{aligned}}$

where || ⋅ || denotes Euclidean norm.

Properties:

dCor (X,Y) = 0 if and only if the random vectors are independent.
dCor (X,Y) = 1: Perfect dependence between the two distributions.
$dCor (X, Y)$ is defined for $X$ and $Y$ in arbitrary dimension.

When is it useful? It is appropriate to find any kind dependence relationships between the 2 variables. Also if X and Y have different dimensions.

Helpful resources for people studying therapeutic antibodies

My work within OPIG involves studying therapeutic antibodies. It can be tough to find information about these commercial molecules, often known by unintelligible developmental names until the later stages of clinical trials. Their structures are frequently absent, as one might expect, but even their sequences are sometimes a nightmare to get hold of! Below is a list of resources that I have found particularly helpful.

IDENTITIES OF RELEVANT ANTIBODIES

1. Wikipedia (don’t judge!) is an extremely helpful resource to get started. They have the following databases:

(a) A list of FDA-approved therapeutic monoclonal antibody therapies
(b) A more general list of therapeutic, diagnostic and preventive monoclonal antibodies (includes some things that have been withdrawn)

2. The Antibody Society has list of FDA/EU approved and antibodies to watch on their website. NB: This is only available to members of the society (free for students and other concessions, standard membership is $100pa).

3. The journal ‘mAbs’ also has a series of ‘Antibodies to Watch in [Year]’ papers. Here are the ones for 2016, 2017 and 2018.

SEQUENCES

4. 137 clinical-stage (post-phase I) mAb sequences can be found in the SI of this paper by Jain et al.

5. A slightly outdated (last updated Nov 2016), but still extremely useful, resource of antibody seqeunces is this FASTA list, written by Dr Martin’s Group at UCL.

SEQUENCES & STRUCTURES

6. The IMGT monoclonal antibody database (mAb-DB) has been possibly the most helpful resource. This includes 798 entries of both therapeutics and non-therapeutics, so it’s helpful to get a list of the antibodies you are interested in first. You can search it with a wide range of parameters, including antibody name. A typical antibody result will include its mAb-DB ID, INN details, common & developmental names, species, receptor type and isotype, sequence (via the “IMGT/2Dstructure-DB” link), target, clinical trials details and – if available – the 3D structure (via the “IMGT/3Dstructure-DB” link).

7. SAbDab has a continually-updated section for all therapeutic antibody structures deposited in the PDB.

CURRENT STATUS OF THE THERAPEUTIC

8. Search the therapeutic name on AdisInsight, or Pharmacodia to see its current clinical trial status, and whether or not it has been withdrawn.

Working with Jupyter notebook on a remote server

To celebrate the recent beta release of Jupyter Lab (try it out of you haven’t already), today we’re going to look at how to run a Jupyter session (Notebook or Lab) on a remote server.

Suppose you have lots of data which lives on a remote server and you want to play with it in a Jupyter notebook. You can’t copy the data to your local machine (well, you can, but you’re sensible so you won’t), but you can run your Jupyter session on the remote server. There’s just one problem – since Jupyter notebook is browser-based and works by connecting to the Jupyter session running locally, you can’t just run Jupyter remotely and forward X11 like you would a traditional graphical IDE. Fortunately, the solution is simple: we run Jupyter remotely, create an ssh tunnel connecting a local port to the one used by the Jupyter session, and connect directly to the Jupyter session using our local browser. The best part about this is that you can set up the Jupyter session once then connect to it from any browser on any machine once an ssh tunnel is created, without worrying about X11 forwarding.

Here’s how to do it.

1. First, connect to the remote server if you haven’t already

ssh fergus@funkyserver

1.5. Jupyter takes browser security very seriously, so in order to access a remote session from a local browser we need to set up a password associated with the remote Jupyter session. This is stored in jupyter_notebook_config.py which by default lives in ~/.jupyter. You can edit this manually, but the easiest option is to set the password by running Jupyter with the password argument:

jupyter notebook password
>>> Enter password:

This password will be used to access any Jupyter session running from this installation, so pick something sensible. You can set a new password at any time on the remote server in exactly the same way.

2: Launch a Jupyter session on the remote server. You can specify the access port using the --port option. This might be useful on a shared server where others might be doing the same thing. You’ll also want to run this without launching a browser on the remote server since this is of no use to you.

jupyter lab --port=9000 --no-browser &

Here I’m using Jupyter Lab, but this works in exactly the same way for Jupyter Notebook.

3: Now for the fun part. Jupyter is running on our remote server, but what we really want is to work in our favourite browser on our local machine. To do this we just need to create an ssh tunnel between a port on our machine and the port our Jupyter session is using on the remote server. On our local machine:

ssh -N -f -L 8888:localhost:9000 fergus@funkyserver

For those not familiar with ssh tunneling, we’ve just created a secure, encrypted connection between port 8888 on our local machine and port 9000 on our remote server.

-N tells ssh we won’t be running any remote processes using the connection. This is useful for situations like this where all we want to do is port forwarding.
-f runs ssh in the background, so we don’t need to keep a terminal session running just for the tunnel.
-L specifies that we’ll be forwarding a local port to a remote address and port. In this case, we’re forwarding port 8888 on our machine to port 9000 on the remote server. The name ‘localhost’ just means ‘this computer’. If you’re a Java programmer who lives for verbosity, you could equivalently pass -L localhost:8888:localhost:9000.

4: If you’ve done everything correctly, you should now be able to access your Jupyter session via port 8888 on your machine. Fire up your favourite browser and type localhost:8888 into the address bar. This should bring up a Jupyter session and prompt you for a password. Enter the password you specified for Jupyter on the remote server.

Congratulations! You now have a Jupyter session running remotely which you can connect to anytime, anywhere, from any machine.

Disclaimer: I haven’t tried this on Windows, nor do I intend to. I value my sanity.

A short intro to machine precision and how to beat it

Most people who’ve ever sat through a Rounding Error is Bad lecture will be familiar with the following example:

> (0.1+0.1+0.1) == 0.3
FALSE

The reason this is so unsettling is because most of the time we think about numbers in base-10. This means we use ten digits $\{0, 1, \dots, 9\}$ , and we perform arithmetic based on this ten digit notation. This doesn’t always matter much for pen and paper maths but it’s an integral part of how we think about more complex operations and in particular how we think about accuracy. We see $0.1$ as a finite decimal fraction, and so it’s only natural that we should be able to do accurate sums with it. And if we can do simple arithmetic, then surely computers can too? In this blog post I’m going to try and briefly explain what causes rounding errors such as the one above, and how we might get away with going beyond machine precision.

Take a number $x \in [0; 1)$ , say $x=1/3$ . The decimal representation of $x$ is of the form $x=\sum_{i=1}^{\infty} a_i \times 10^{-i}$ . The $a_i \in \{0, 1, \dots, 9\}$ here are the digits that go after the radix point. In the case of $x=1/3$ these are all equal $a_i=3$ , or $x=0.333\dots _{10}$ . Some numbers, such as our favourite $x$ , don’t have a finite decimal expansion. Others, such as $0.3$ , do, meaning that after some $i \in \mathbb{N}$ , all $a_{i+j}=0$ . When we talk about rounding errors and accuracy, what we actually mean is that we only care about the first few digits, say $i\leq 5$ , and we’re happy to approximate to $x\approx \sum_{i=1}^{5} a_i \times 10^{-i}=0.33333$ , potentially rounding up at the last digit.

Computers, on the other hand, store numbers in base-2 rather than base-10, which means that they use a different series expansion $x=\sum_{i=1}^{\infty} b_i \times 2^{-i}$ , $b_i \in \{0, 1\}$ to represent the same number. Our favourite number $x$ is actually stored as $0.1010101\dots _{2}$ rather than $0.3333333\dots _{10}$ , despite the fact it appears as the latter on a computer screen. Crucially, arithmetic is done in base-2 and, since only a finite number of binary digits are stored ( $i\leq 52$ for most purposes these days), rounding errors also occur in base-2.

All numbers with a finite binary expansion, such as $0.25_{10}=0\times 1/2+1\times 1/4=0.01_{2}$ also have a finite decimal expansion, meaning we can do accurate arithmetic with them in both systems. However, the reverse isn’t true, which is what causes the issue with $0.1+0.1+0.1\neq 0.3$ . In binary, the nice and tidy $0.1_{10} = 0.00011001100\dots _{2}$ . We observe the rounding error because unlike us, the computer is trying to sum over infinite series.

While it’s not possible to do infinite sums with finite resources, there is a way to go beyond machine precision if you wanted to, at least for rational $x=p/q$ , where $p, q \in \mathbb{N}$ . In the example above, the issue comes from dividing by 10 on each side of the (in)equality. Luckily for us, we can avoid doing so. Integer arithmetic is easy in any base, and so

> (1+1+1) == 3 
TRUE

Shocking, I know. On a more serious note, it is possible to write an algorithm which calculates the binary expansion of $x=p/q$ using only integer arithmetic. Usually binary expansion happens in this way:

set x, maxIter
initialise b, i=1
while x>0 AND i<=maxIter {
 if 2*x>=1
    b[i]=1
 else
    b[i]=0
 x = 2*x-b[i]
 i = i+1 
}
return b

Problems arise whenever we try to compute something non-integer (the highlighted lines 3, 4, and 8). However, we can rewrite these using $x = p/q$ and shifting the division by $q$ to the right-hand side of each inequality / assignment operator:

set p, q, maxIter
initialise b, i=1 
while p>0 AND i<=maxIter { 
 if 2*p>=q 
    b[i]=1 
 else 
    b[i]=0 
 p = 2*p-b[i]*q 
 i = i+1 
} 
return b

Provided we’re not dealing with monstrously large integers (i.e. as long as we can safely double $p$ ), implementing the above lets us compute $p/q$ with arbitrary precision given by maxIter. So we can beat machine precision for rationals! And the combination of arbitrarily accurate rationals and arbitrarily accurate series approximations (think Riemann zeta function, for example) means we can also get the occasional arbitrarily accurate irrational.

To sum up, rounding errors are annoying, partly because it’s not always intuitive when and how they happen. As a general rule the best way to avoid them is to make your computer do as little work as possible, and to avoid non-integer calculations whenever you can. But you already knew that, didn’t you?

This post was partially inspired by the undergraduate course on Simulation and Statistical Programming lectured by Prof Julien Berestycki and Prof Robin Evans. It was also inspired by my former maths teacher who used to mark us down for doing more work than necessary even when our solutions were correct. He had a point.

A very basic introduction to Random Forests using R

Random Forests is a powerful tool used extensively across a multitude of fields. As a matter of fact, it is hard to come upon a data scientist that never had to resort to this technique at some point. Motivated by the fact that I have been using Random Forests quite a lot recently, I decided to give a quick intro to Random Forests using R.

So what are Random Forests? Well, I am probably not the most suited person to answer this question (a google search will reveal much more interesting answers) , still I shall give it a go. Random Forests is a learning method for classification (and others applications — see below). It is based on generating a large number of decision trees, each constructed using a different subset of your training set. These subsets are usually selected by sampling at random and with replacement from the original data set. The decision trees are then used to identify a classification consensus by selecting the most common output (mode). While random forests can be used for other applications (i.e. regression), for the sake of keeping this post short, I shall focus solely on classification.

Why R? Well, the quick and easy question for this is that I do all my plotting in R (mostly because I think ggplot2 looks very pretty). I decided to explore Random Forests in R and to assess what are its advantages and shortcomings. I am planning to compare Random Forests in R against the python implementation in scikit-learn. Do expect a post about this in the near future!

The data: to keep things simple, I decided to use the Edgar Anderson’s Iris Data set. You can have a look at it by inspecting the contents of iris in R. This data set contains observations for four features (sepal length and width, and petal length and width – all in cm) of 150 flowers, equally split between three different iris species. This data set is fairly canon in classification and data analysis. Let us take a look at it, shall we:

As you can observe, there seems to be some separation in regards to the different features and our three species of irises [note: this set is not very representative of a real world data set and results should be taken with a grain of salt].

Training and Validation sets: great care needs to be taken to ensure clear separation between training and validation sets. I tend to save the cases for which I am actually interested in performing predictions as a second validation set (Validation 2). Then I split the remaining data evenly into Training and Validation 1.

Let us split our data set then, shall we?

# Set random seed to make results reproducible:
set.seed(17)
# Calculate the size of each of the data sets:
data_set_size <- floor(nrow(iris)/2)
# Generate a random sample of "data_set_size" indexes
indexes <- sample(1:nrow(iris), size = data_set_size)

# Assign the data to the correct sets
training <- iris[indexes,]
validation1 <- iris[-indexes,]

Before we can move on, here are some things to consider:

1- The size of your data set usually imposes a hard limit on how many features you can consider. This occurs due to the curse of dimensionality, i.e. your data becomes sparser and sparser as you increase the number of features considered, which usually leads to overfitting. While there is no rule of thumb relating to how many features vs. the number of observations you should use, I try to keep e^Nf < No (Nf = number of features, No = number of observations) to minimise overfitting [this is not always possible and it does not ensure that we won’t overfit]. In this case, our training set has 75 observations, which suggests that using four features (e^4 ~ 54.6) is not entirely absurd. Obviously, this depends on your data, so we will cover some further overfitting checks later on.

2- An important thing to consider when assembling training sets is the proportion of negatives vs. positives in your data. Think of an extreme scenario where you have many, many more observations for one class vs. the others. How will this affect classification? This would make it more likely for the classifier to predict the dominant class when given new values. I mentioned before that the iris set is quite nice to play with. It comes with exactly 50 observations for each species of irises. What happens if you have a data set with a much higher number of observations for a particular class? You can bypass any imbalance regarding the representation of each class by carefully constructing your training set in order not to favour any particular class. In this case, our randomly selected set has 21 observations for species setosa and 27 observations for each of species versicolor and virginica, so we are good to go.

3- Another common occurrence that is not represented by the iris data set is missing values (NAs) for observations. There are many ways of dealing with missing values, including assigning the median or the mode for that particular feature to the missing observation or even disregarding some observations entirely, depending on how many observations you have. There are even ways to use random forests to estimate a good value to assign to the missing observations, but for the sake of brevity, this will not be covered here.

Right, data sets prepared and no missing values, it is time to fire our random forests algorithm. I am using the randomForest package. You can click the link for additional documentation. Here is the example usage code:

#import the package
library(randomForest)
# Perform training:
rf_classifier = randomForest(Species ~ ., data=training, ntree=100, mtry=2, importance=TRUE)

Note some important parameters:

-The first parameter specifies our formula: Species ~ . (we want to predict Species using each of the remaining columns of data).
–ntree defines the number of trees to be generated. It is typical to test a range of values for this parameter (i.e. 100,200,300,400,500) and choose the one that minimises the OOB estimate of error rate.
–mtry is the number of features used in the construction of each tree. These features are selected at random, which is where the “random” in “random forests” comes from. The default value for this parameter, when performing classification, is sqrt(number of features).
–importance enables the algorithm to calculate variable importance.

We can quickly look at the results of our classifier for our training set by printing the contents of rf_classifier:

> rf_classifier

Call:
 randomForest(formula = Species ~ ., data = training,ntree=100,mtry=2, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 2

        OOB estimate of  error rate: 5.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         21          0         0  0.00000000
versicolor      0         25         2  0.07407407
virginica       0          2        25  0.07407407

As you can see, it lists the call used to build the classifier, the number of trees (100), the variables at each split (2), and it outputs a very useful confusion matrix and OOB estimate of error rate. This estimate is calculated by counting however many points in the training set were misclassified (2 versicolor and 2 virginica observations = 4) and dividing this number by the total number of observations (4/75 ~= 5.33%).

The OOB estimate of error rate is a useful measure to discriminate between different random forest classifiers. We could, for instance, vary the number of trees or the number of variables to be considered, and select the combination that produces the smallest value for this error rate. For more complicated data sets, i.e. when a higher number of features is present, a good idea is to use cross-validation to perform feature selection using the OOB error rate (see rfcv from randomForest for more details).

Remember the importance parameter? Let us take a look at the importance that our classifier has assigned to each variable:

varImpPlot(rf_classifier)

Each features’s importance is assessed based on two criteria:

-MeanDecreaseAccuracy: gives a rough estimate of the loss in prediction performance when that particular variable is omitted from the training set. Caveat: if two variables are somewhat redundant, then omitting one of them may not lead to massive gains in prediction performance, but would make the second variable more important.

-MeanDecreaseGini: GINI is a measure of node impurity. Think of it like this, if you use this feature to split the data, how pure will the nodes be? Highest purity means that each node contains only elements of a single class. Assessing the decrease in GINI when that feature is omitted leads to an understanding of how important that feature is to split the data correctly.

Do note that these measures are used to rank variables in terms of importance and, thus, their absolute values could be disregarded.

Ok, great. Looks like we have a classifier that was properly trained and is producing somewhat good predictions for our training set. Shall we evaluate what happens when we try to use this classifier to predict classes for our validation1 set?

# Validation set assessment #1: looking at confusion matrix
prediction_for_table <- predict(rf_classifier,validation1[,-5])
table(observed=validation1[,5],predicted=prediction_for_table)

            predicted
observed     setosa versicolor virginica
  setosa         29          0         0
  versicolor      0         20         3
  virginica       0          1        22

The confusion matrix is a good way of looking at how good our classifier is performing when presented with new data.

Another way of assessing the performance of our classifier is to generate a ROC curve and compute the area under the curve:

# Validation set assessment #2: ROC curves and AUC

# Needs to import ROCR package for ROC curve plotting:
library(ROCR)

# Calculate the probability of new observations belonging to each class
# prediction_for_roc_curve will be a matrix with dimensions data_set_size x number_of_classes
prediction_for_roc_curve <- predict(rf_classifier,validation1[,-5],type="prob")

# Use pretty colours:
pretty_colours <- c("#F8766D","#00BA38","#619CFF")
# Specify the different classes 
classes <- levels(validation1$Species)
# For each class
for (i in 1:3)
{
 # Define which observations belong to class[i]
 true_values <- ifelse(validation1[,5]==classes[i],1,0)
 # Assess the performance of classifier for class[i]
 pred <- prediction(prediction_for_roc_curve[,i],true_values)
 perf <- performance(pred, "tpr", "fpr")
 if (i==1)
 {
     plot(perf,main="ROC Curve",col=pretty_colours[i]) 
 }
 else
 {
     plot(perf,main="ROC Curve",col=pretty_colours[i],add=TRUE) 
 }
 # Calculate the AUC and print it to screen
 auc.perf <- performance(pred, measure = "auc")
 print(auc.perf@y.values)
}

Here is the final product (ROC curve):

And here are the values for our AUCs:

Setosa
AUC = 1

Versicolor
AUC = 0.98

Virginica
AUC = 0.98

Voila! I hope this was somewhat useful!

Interesting Jupyter and IPython Notebooks

Here’s a treasure trove of interesting Jupyter and iPython notebooks, with lots of diverse examples relevant to OPIG, including an RDKit notebook, but also:

Entire books or other large collections of notebooks on a topic (covering Introductory Tutorials; Programming and Computer Science; Statistics, Machine Learning and Data Science; Mathematics, Physics, Chemistry, Biology; Linguistics and Text Mining; Signal Processing; Scientific computing and data analysis with the SciPy Stack; General topics in scientific computing; Machine Learning, Statistics and Probability; Physics, Chemistry and Biology; Data visualization and plotting; Mathematics; Signal, Sound and Image Processing; Natural Language Processing; Pandas for data analysis); General Python Programming; Notebooks in languages other than Python (Julia; Haskell; Ruby; Perl; F#; C#); Miscellaneous topics about doing various things with the Notebook itself; Reproducible academic publications; and lots more!

Viewing 3D molecules interactively in Jupyter iPython notebooks

Greg Landrum, curator of the invaluable open source cheminformatics API, RDKit, recently blogged about viewing molecules in a 3D window within a Jupyter-hosted iPython notebook (as long as your browser supports WebGL, that is).

The trick is to use py3Dmol. It’s easy to install:

pip install py3Dmol

This is built on the object-oriented, webGL based JavaScript library for online molecular visualization 3Dmol.js (Rego & Koes, 2015); here's a nice summary of the capabilities of 3Dmol.js. It's features include:

support for pdb, sdf, mol2, xyz, and cube formats
parallelized molecular surface computation
sphere, stick, line, cross, cartoon, and surface styles
atom property based selection and styling
labels
clickable interactivity with molecular data
geometric shapes including spheres and arrows

I tried a simple example and it worked beautifully:

import py3Dmol
view = py3Dmol.view(query='pdb:1hvr')
view.setStyle({'cartoon':{'color':'spectrum'}})
view

The 3Dmol.js website summarizes how to view molecules, along with how to choose representations, how to embed it, and even how to develop with it.

References

Nicholas Rego & David Koes (2015). “3Dmol.js: molecular visualization with WebGL”.
Bioinformatics, 31 (8): 1322-1324. doi:10.1093/bioinformatics/btu829

Processing large files using python: part duex

Last week I wrote a post on some of the methods I use in python to efficiently process very large datasets. You can find that here. Roughly it details how one can break a large file into chunks which then can be passed onto multiple cores to do the number crunching. Below I expand upon this, first creating a parent class which turns a given (large) file into chunks. I construct it in a manner which children classes can be easily created and tailored for specific file types, given some examples. Finally, I give some wrapping functions for use in conjunction with any of the chunkers so that the chunks can be processed using multiple cores.

First, and as an aside, I was asked after the previous post, at what scale these methods should be considered. A rough answer would be when the size of the data becomes comparable to the available RAM. A better answer would be, when the overhead of reading each individual line(/entry) is more than the operation on that entry. Here is an example of this case, though it isn’t really that fair a comparison:


>>> import timeit,os.path
>>> os.path.getsize("Saccharomyces_cerevisiae.R64-1-1.cds.all.fa")
10095955
>>> timeit.timeit('f = open("Saccharomyces_cerevisiae.R64-1-1.cds.all.fa");sum([l.count(">") for l in f])',number=10)
0.8403599262237549
>>> timeit.timeit('f = open("Saccharomyces_cerevisiae.R64-1-1.cds.all.fa");sum([c.count(">") for c in iter(lambda: f.read(1024*1024),"")])',number=10)
0.15671014785766602

For a small 10MB fasta file, we count the number of sequences present in a fifth of the time using chunks. I should be honest though, and state that the speedup is mostly due not having the identify newline characters in the chunking method; but nevertheless, it shows the power one can have using chunks. For a 14GB fasta file, the times for the chunking (1Mb chunks) and non-chunking methods are 55s and 130s respectively.

Getting back on track, let’s turn the chunking method into a parent class from which we can build on:


import os.path

class Chunker(object):

    #Iterator that yields start and end locations of a file chunk of default size 1MB.
    @classmethod
    def chunkify(cls,fname,size=1024*1024):
        fileEnd = os.path.getsize(fname)
        with open(fname,'r') as f:
            chunkEnd = f.tell()
            while True:
                chunkStart = chunkEnd
                f.seek(size,1)
                cls._EOC(f)
                chunkEnd = f.tell()
                yield chunkStart, chunkEnd - chunkStart
                if chunkEnd >= fileEnd:
                    break

    #Move file pointer to end of chunk
    @staticmethod
    def _EOC(f):
        f.readline()

    #read chunk
    @staticmethod
    def read(fname,chunk):
        with open(fname,'r') as f:
            f.seek(chunk[0])
            return f.read(chunk[1])

    #iterator that splits a chunk into units
    @staticmethod
    def parse(chunk):
        for line in chunk.splitlines():
            yield chunk

In the above, we create the class Chunker which has the class method chunkify as well as the static methods, _EOC, read, and parse. The method chunkify does the actual chunking of a given file, returning an iterator that yields tuples containing a chunk’s start and size. It’s a class method so that it can make use of _EOC (end of chunk) static method, to move the pointer to a suitable location to split the chunks. For the simplest case, this is just the end/start of a newline. The read and parse methods read a given chunk from a file and split it into units (single lines in the simplest case) respectively. We make the non-chunkify methods static so that they can be called without the overhead of creating an instance of the class.

Let’s now create some children of this class for specific types of files. First, one of the most well known file types in bioinformatics, FASTA. Below is an segment of a FASTA file. Each entry has a header line, which begins with a ‘>’, followed by a unique identifier for the sequence. After the header line, one or more lines follow giving the sequence. Sequences may be either protein or nucleic acid sequences, and they may contain gaps and/or alignment characters.


>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

And here is the file type specific chunker:


from Bio import SeqIO
from cStringIO import StringIO

class Chunker_FASTA(Chunker):

    @staticmethod
    def _EOC(f):
        l = f.readline() #incomplete line
        p = f.tell()
        l = f.readline()
        while l and l[0] != '>': #find the start of sequence
            p = f.tell()
            l = f.readline()
        f.seek(p) #revert one line

    @staticmethod
    def parse(chunk):
        fh = cStringIO.StringIO(chunk)
        for record in SeqIO.parse(fh,"fasta"):
            yield record
        fh.close()

We update the _EOC method to find when one entry finishes and the next begins by locating “>”, following which we rewind the file handle pointer to the start of that line. We also update the parse method to use fasta parser from the BioPython module, this yielding SeqRecord objects for each entry in the chunk.

For a second slightly harder example, here is one designed to work with output produced by bowtie, an aligner of short reads from NGS data. The format consists of of tab separated columns, with the id of each read located in the first column. Note that a single read can align to multiple locations (up to 8 as default!), hence why the same id appears in multiple lines. A small example section of the output is given below.


SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN25-2	2502 GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN37-2	4460	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN25-1	2502	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN37-1	4460	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN37-2	4460	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN25-1	2502	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN37-1	4460	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN25-2	2502	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.8 HWI-EAS355_3_Nick_1_1_187_1549 length=36	+	RDN25-2	2502	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.8 HWI-EAS355_3_Nick_1_1_187_1549 length=36	+	RDN37-2	4460	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3

with the corresponding chunker given by:


class Chunker_BWT(chunky.Chunker):

    @staticmethod
    def _EOC(f):
        l = f.readline()#incomplete line
        l = f.readline()
        if not l: return #EOF
        readID = l.split()[0]
        while l and (l.split()[0] != readID): #Keep reading lines until read IDs don't match
            p = f.tell()	
            l = f.readline()
        f.seek(p) #revert one line

    @staticmethod
    def parse(chunk):
        lines = chunk.splitlines()
        N = len(lines)
        i = 0 
        while i < N:
            readID = lines[i].split('\t')[0]
            j = i
            while lines[j].split('\t')[0] == readID:
                j += 1
                if j == N:
                    break
            yield lines[i:j]
            i = j

This time, the end of chunk is located by reading lines until there is a switch in the read id, whereupon we revert one line. For parsing, we yield all the different locations a given read aligns to as a single entry.

Hopefully these examples show you how the parent class can be expanded upon easily for most file types. Let’s now combine these various chunkers with the code from previous post to show how we can enable multicore parallel processing of the chunks they yield. The code below contains a few generalised wrapper functions which work in tandem with any of the above chunkers to allow most tasks to be parallelised .


import multiprocessing as mp, sys

def _printMP(text):
    print text
    sys.stdout.flush()

def _workerMP(chunk,inFile,jobID,worker,kwargs):
    _printMP("Processing chunk: "+str(jobID))
    output = worker(chunk,inFile,**kwargs)
    _printMP("Finished chunk: "+str(jobID))
    return output	

def main(inFile,worker,chunker=Chunker,cores=1,kwargs={}):
    pool = mp.Pool(cores)

    jobs = []
    for jobID,chunk in enumerate(chunker.chunkify(inFile)):
        job = pool.apply_async(_workerMP,(chunk,inFile,jobID,worker,kwargs))
        jobs.append(job)

    output = []
    for job in jobs:
        output.append( job.get() )

    pool.close()
    
    return output

The main function should be recognisable as the code from the previous post. It generates the pool of workers, ideally one for each core, before using the given chunker to split the corresponding file into a series of chunks for processing. Unlike before, we collect the output given by each job and return it after processing is finished. The main function acts as wrapper allowing us to specify different processing functions and different chunkers, as given by the variables worker and chunker respectively. We have wrapped the processing function call within the function _workerMP which prints to the terminal as tasks are completed. It uses the function _printMP to do this, as you need to flush the terminal after a print statement when using multi core processing, otherwise nothing appears until all tasks are completed.

Let’s finish by showing an example of how we would use the above to do the same task as we did at the start of this post, counting the sequences within a fasta file, using the base chunker:


def seq_counter(chunk,inFile):
    data = Chunker.read(inFile,chunk)
    return data.count('>')

and using the FASTA chunker:


def seq_counter_2(chunk,inFile):
    data = list(Chunker_FASTA.parse(Chunker_FASTA.read(inFile,chunk)))
    return len(data)

And time they take to count the sequences within the 14GB file from before:


>>> os.path.getsize("SRR951828.fa")
14944287128
>>> x=time.time();f = open("SRR951828.fa");sum([l.count(">") for l in f]);time.time()-x
136829250
190.05533599853516
>>> x=time.time();f = open("SRR951828.fa");sum([c.count(">") for c in iter(lambda: f.read(1024*1024),"")]);time.time()-x
136829250
26.343637943267822
>>> x=time.time();sum(main("SRR951828.fa",seq_counter,cores=8));time.time()-x
136829250
4.36846399307251
>>> x=time.time();main("SRR951828.fa",seq_counter_2,Chunker_FASTA,8);time.time()-x
136829250
398.94060492515564

Let’s ignore that last one, as the slowdown is due to turning the entries into BioPython SeqRecords. The prior one, which combines chunking and multicore processing, has roughly a factor of 50 speed up. I’m sure this could be further reduced using more cores and/or optimising the chunk size, however, this difference alone can change something from being computationally implausible, to plausible. Not bad for only a few lines of code.

Anyway, as before, I hope that some of the above was either new or even perhaps helpful to you. If you know of a better way to do things (in python), then I’d be very interested to hear about it. If I feel like it, I may follow this up with a post about how to integrate a queue into the above which outputs the result of each job as they are produced. In the above, we currently hold collate all the results in the memory, which has the potential to cause a memory overflow depending on what is being returned.

Processing large files using python

In the last year or so, and with my increased focus on ribo-seq data, I have come to fully appreciate what the term big data means. The ribo-seq studies in their raw forms can easily reach into hundreds of GBs, which means that processing them in both a timely and efficient manner requires some thought. In this blog post, and hopefully those following, I want to detail some of the methods I have come up (read: pieced together from multiple stack exchange posts), that help me take on data of this magnitude. Specifically I will be detailing methods for python and R, though some of the methods are transferrable to other languages.

My first big data tip for python is learning how to break your files into smaller units (or chunks) in a manner that you can make use of multiple processors. Let’s start with the simplest way to read a file in python.


with open("input.txt") as f:
    data = f.readlines()
    for line in data:
        process(line)

This mistake made above, with regards to big data, is that it reads all the data into RAM before attempting to process it line by line. This is likely the simplest way to cause the memory to overflow and an error raised. Let’s fix this by reading the data in line by line, so that only a single line is stored in the RAM at any given time.


with open("input.txt") as f:
    for line in f:
        process(line)

This is a big improvement, namely it doesn’t crash when fed a big file (though also it’s shorter!). Next we should attempt to speed this up a bit by making use of all these otherwise idle cores.


import multiprocessing as mp

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
with open("input.txt") as f:
    for line in f:
        jobs.append( pool.apply_async(process,(line)) )

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()

Provided the order of which you process the lines don’t matter, the above generates a set (pool) of workers, ideally one for each core, before creating a bunch of tasks (jobs), one for each line, for the workers to do. I tend to use the Pool object provided by the multiprocessing module due to ease of use, however, you can spawn and control individual workers using mp.Process if you want finer control. For mere number crunching, the Pool object is very good.

While the above is now making use of all those cores, it sadly runs into memory problems once again. We specifically use apply_async function so that the pool isn’t blocked while each line processes. However, in doing so, all the data is read into memory once again; this time stored as individual lines associated with each job, waiting inline to be processed. As such, the memory will again overflow. Ideally the method will only read the line into memory when it is its turn to be processed.


import multiprocessing as mp

def process_wrapper(lineID):
    with open("input.txt") as f:
        for i,line in enumerate(f):
            if i != lineID:
                continue
            else:
                process(line)
                break

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
with open("input.txt") as f:
    for ID,line in enumerate(f):
        jobs.append( pool.apply_async(process_wrapper,(ID)) )

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()

Above we’ve now changed the function fed to pool of workers to include opening the file, locating the specified line, reading it into memory, and then processing it. The only input now stored for each job spawned is the line number, thereby preventing the memory overflow. Sadly, the overhead involved in having to locate the line by reading iteratively through the file for each job is untenable, getting progressively more time consuming as you get further into the file. To avoid this we can use the seek function of file objects which skips you to a particular location within a file. Combining with the tell function, which returns the current location within a file, gives:


import multiprocessing as mp

def process_wrapper(lineByte):
    with open("input.txt") as f:
        f.seek(lineByte)
        line = f.readline()
        process(line)

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
with open("input.txt") as f:
    nextLineByte = f.tell()
    for line in f:
        jobs.append( pool.apply_async(process_wrapper,(nextLineByte)) )
        nextLineByte = f.tell()

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()

Using seek we can move directly to the correct part of the file, whereupon we read a line into the memory and process it. We have to be careful to correctly handle the first and last lines, but otherwise this does exactly what we set out, namely using all the cores to process a given file while not overflowing the memory.

I’ll finish this post with a slight upgrade to the above as there is a reasonable amount of overhead associated with opening and closing the file for each individual line. If we process multiple lines of the file at a time as a chunk, we can reduce these operations. The biggest technicality when doing this is noting that when you jump to a location in a file, you are likely not located at the start of a line. For a simple file, as in this example, this just means you need to call readline, which reads to next newline character. More complex file types likely require additional code to locate a suitable location to start/end a chunk.


import multiprocessing as mp,os

def process_wrapper(chunkStart, chunkSize):
    with open("input.txt") as f:
        f.seek(chunkStart)
        lines = f.read(chunkSize).splitlines()
        for line in lines:
            process(line)

def chunkify(fname,size=1024*1024):
    fileEnd = os.path.getsize(fname)
    with open(fname,'r') as f:
        chunkEnd = f.tell()
    while True:
        chunkStart = chunkEnd
        f.seek(size,1)
        f.readline()
        chunkEnd = f.tell()
        yield chunkStart, chunkEnd - chunkStart
        if chunkEnd > fileEnd:
            break

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
for chunkStart,chunkSize in chunkify("input.txt"):
    jobs.append( pool.apply_async(process_wrapper,(chunkStart,chunkSize)) )

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()

Anyway, I hope that some of the above was either new or even perhaps helpful to you. If you know of a better way to do things (in python), then I’d be very interested to hear about it. In another post coming in the near future, I will expanded on this code, turning it into a parent class from which create multiple children to use with various file types.

Oxford Protein Informatics Group

or "OPIG" to friends