Category Archives: Code

Making the most of your CPUs when using python

Over the last decade, single-threaded CPU performance has begun to plateau, whilst the number of logical cores has been increasing exponentially.

Like it or loathe it, for the last few years, python has featured as one of the top ten most popular languages [tiobe / PYPL].   That being said however, python has an issue which makes life harder for the user wanting to take advantage of this parallelism windfall.  That issue is called the GIL (Global Interpreter Lock).  The GIL can be thought of as the conch shell from Lord of the Flies.  You have to hold the conch (GIL) for your thread to be computed.  With only one conch, no matter how beautifully written and multithreaded your code, there will still only be one thread will be executed at any point in time.

Continue reading

Automated testing with doctest

One of the ways to make your code more robust to unexpected input is to develop with boundary cases in your mind. Test-driven code development begins with writing a set of unit tests for each class. These tests often includes normal and extreme use cases. Thanks to packages like doctest for Python, Mocha and Jasmine for Javascript etc., we can write and test codes with an easy format. In this blog post, I will present a short example of how to get started with doctest in Python. N.B. doctest is best suited for small tests with a few scripts. If you would like to run a system testing, look for some other packages!

Continue reading

Vim and I

Vim is great. Despite its steep learning curve , it has many advantages and many loyal Vim followers will tell you that you should force yourself to use it.

Personally I started using Vim when I was ssh-ing into the group servers or into my computer in department. In such scenarios, I could not open the IDEs with the nice GUIs 🙁 However, as time passed, Vim started to grow on me. Now, I can list a few reasons why I think it is great, for example, it requires a small amount of memory to run, has a short start up time and can handle large files pretty well. 

Although, I am definitely not a Vim expert, I will tell you about some of the things I have added to my .vimrc. The .vimrc file is very handy for containing all your favourite settings, such as key mappings, custom commands, formatting and syntax highlighting. The file uses vimscript which is a programming language in itself. However, there is a lot of help online that tells you with what lines to add to your .vimrc. I would recommend installing Vundle which is a Vim plugin manager. 

Here I will list some cool things that I have discovered you can do with your .vimrc.   It has certainly made my life a bit nicer.

  1. Code Folding
    Most IDEs provide a way to collapse functions and classes that results in only seeing the function/class definition and hiding the code. To do this in Vim add the following lines to your .
    vimrc 

    " Enable folding
    set foldmethod=indent
    set foldlevel=99
    " Enable folding with the spacebar
    nnoremap <space> za


    Alternatively, you can install the Vim plugin SimpylFold.

  2. Python indentation
    Vim does not do auto indention like many IDEs. To automatically do PEP-8 indentation for Python, add the following to your .vimrc . 

    " PEP indentation
    au BufNewFile,BufRead *.py
    \ set tabstop=4    
    \ set softtabstop=4    
    \ set shiftwidth=4    
    \ set textwidth=79    
    \ set expandtab    
    \ set autoindent    
    \ set fileformat=unix
    

    You can also install the Vim plugin vim-flake8 which is a static syntax and style checker for Python source code. It shows errors in a quickfix window and lets you jump to their location inside your code.

  3. Turn line numbers on 
    Rather than typing in  
    :set nu 
    every time you open your files. You can always have them turned on by adding :set nu to your .vimrc
  4. Autocompletion 
    When I switch from PyCharm to Vim I feel a bit lost without the autocompletion however, after a quick search I found many are using the Vim package Youcompleteme and it is awesome. 

docopt for dummies

Parsing command line arguments is an annoying piece of boilerplate we all have to do. Documenting our code is either an absolutely essential part of software engineering, or a frivolous waste of research time, depending on who you ask. But what if I told you that we can combine the two? That you can handle your argument parsing simply by documenting how your code works? Well, the dream is now reality. Continue reading

Introduction to R Markdown

Two of our esteemed OPIGlets presented a workshop on collaborative research using Jupyter Notebook this week at ISMB in Chicago. Their workshop highlights the importance of finding ways to share your work conveniently and reproducibly. So on a related note, I thought I would share a brief introduction to another useful tool, R Markdown with RStudio, which I use to present updates to various supervisors and to remember what I did three months (or three days) ago. This method of sharing work is highly readable, reproducible, and narrative-driven.

I use R for much of my data analysis and all of my visualisation, and I count the tidyverse among my most beloved friends. If you’re so inclined, it’s easy to execute python, bash, and more from within R Markdown. You also don’t need to use RStudio to use R Markdown, but that’s a whole other story.

Starting a new markdown file in RStudio will generate a template script explaining most of what you need to know. If I showed you that then I’d be out of a blog post, but I will at least link to the R Markdown Reference Guide.

R Markdown files consist of text written in markdown, and code chunks that can be individually executed and displayed inline within RStudio. To “knit” the whole thing together, the knitr package is used to execute and combine code chunks, then pandoc converts the whole thing into an attractive document.

Here’s an example. The metadata at the top sets up the document. I’ll be generating an html document here, but notice some other tempting examples commented out. Yes, you can use it for Latex (swoon). You can even make a Word document, but really, why would you?

---
title: "Informative Title"
author: "Clare E. West"
date: "10/07/2018"
output: html_document
#output: beamer_presentation
#output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(knitr)
library(ggplot2)
library(tidyr)
library(dplyr)
```

## Big Title
### Smaller title

R Markdown scripts have the extension .Rmd

R Markdown is __so__ *fun*. You can read all about it [here](https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf).

```{r}
print("Hello world")
```

Notice that chunks are enclosed within three backticks, with the language and options in braces. Single commands can be executed inline using single backticks.

As highlighted in the example above, global options are set like this:

knitr::opts_chunk$set(echo = TRUE)

“echo=TRUE” means that the code in each chunk is displayed in the final product; this is useful to show collaborators (or your future self) exactly how you did something. Change this option (“echo = FALSE”) globally or in individual blocks to prevent code from printing. This is useful to hide uninteresting commands, or when presenting to people who don’t have the time or inclination to read your code (hard to imagine). Notice I’ve also used “include = FALSE”  for the library-loading code chunk, which means evaluate but don’t include in the output. Another useful option is “eval = FALSE”, which means don’t even run this chunk.

So let’s see what that looks like when we render it:

The above example output as HTML

The above example output as Latex

Plots generated in code chunks or images from other sources can be embedded. Set the width in the options. “fig.width” sets the width (in inches) of the figure generated, while “out.width” scales the image in the final documents, for which the units will depend on the document type. Within RStudio, these are previewed inline below the code chunk.

## Including plots/images
```{r fig.width = 4, fig.height = 3, out.width = "400px", echo=FALSE}
t  %>% group_by(Tour, Winner, N, Tournament) %>% filter(WRank <= 20) %>% summarise(WPts = max(WPts))  %>% ggplot(aes(x=N, y=WPts, group=Winner, colour=(Winner=="Murray A."))) + geom_point() + geom_line() + labs(x="Tournament Number",y="Ranking Points") + scale_colour_discrete("",labels=c("Not Andy Murray", "Andy Murray")) + theme_bw() + theme(legend.position = "bottom", legend.margin = margin(0, 0, 0, 0))
knitr::include_graphics("https://s.yimg.com/ny/api/res/1.2/69ZUzNSMYb09GKd8CNJeew--~A/YXBwaWQ9aGlnaGxhbmRlcjtzbT0xO3c9ODAwO2g9NjAw/http://media.zenfs.com/en_us/News/afp.com/0102e1f7d0d3c35303c8a62d56a5eb79c2c8b4d8.jpg")
```

Rather than just printing data R-style, you can nicely format it into a table using kable (part of knitr). I also style mine using kableExtra, which makes it look nice and gives you extra options. By default tables fill the full width, you can override this using e.g. kable_styling(full.width = FALSE, position = “left”). When making a latex document, use kable(table, booktabs = T, “latex”) to get a (reproducible) latex-style table.

Here’s how to use python and bash. Thanks to the package reticulate, you can even share objects between your R and Python chunks. Exclude reticulate (knitr::opts_chunk$set(python.reticulate=FALSE) if you prefer to keep your languages separate.

 

### Mix it up with python
```{python}
a='Wow python'
print(a.split()[0])
```

What a wild ride. 

### or bash

```{bash, echo=TRUE}
ls | head 
```

Oh look, there's our output, ready to share.

Finally, if you hate GUIs – and you know I do – you can ditch the interactive notebook part and just generate documents from R Markdown files like this:

rmarkdown::render("BlogExample.Rmd")

 

How to parse OAS data

We have recently released the Observed Antibody Space database – collection of cleaned and annotated antibody sequence (Ig-seq or AIRR-seq) data from 53 studies. We have formatted the data in the format that should facilitate data mining and since release we had several queries on how to parse the data out. Therefore here we give a small example of how to parse the data and make sense of it.

You should download the bulk data file from OAS, available here.

The datasets are separated into ‘data units‘  – collections of sequences that can be uniquely assigned to a range of metadata parameters such as study, organism etc. Our task therefore is to iterate through all those files and read sequences from each of these. Firstly we will attempt to iterate through the files and I will assume that you uncompressed the bulk data file into ../data/json folder. We will write a helper function that will simply list all files with its full paths in a directory and call it list_file_paths

import os

#Fetch all files in directory and subdirectories.
def list_file_paths(directory):
   for dirpath,_,filenames in os.walk(directory):
       for f in filenames:
           yield os.path.abspath(os.path.join(dirpath, f))

if __name__ == '__main__':
    #Replace this with the location of where you uncompressed the bulk data file.
    directory = '../data/json'

    for f in list_file_paths(directory):
        print f

The code above will list all the files in ../data/json which incidentally are all the ‘data units’. Now our task is to parse out the output from each of the data units. They are gzipped files with data element on each line. Therefore we will use gzip library to stream the contents of a gzipped file rather than uncompressing each one of them separately. This is achieved by function parse_single_file

import os,gzip

#Fetch all files in directory and subdirectories.
def list_file_paths(directory):
   for dirpath,_,filenames in os.walk(directory):
       for f in filenames:
           yield os.path.abspath(os.path.join(dirpath, f))

#Parse out the contents of a single file.
def parse_single_file(src):
    #The first line are the meta entries.
    meta_line = True
    for line in gzip.open(src,'rb'):
        print line
    

if __name__ == '__main__':
    #Replace this with the location of where you uncompressed the bulk data file.
    directory = '../data/json'

    for f in list_file_paths(directory):
        parse_single_file(f)

The code above will simply go through all the data unit files, stream the gzipped lines and print each one of them separately. Each line however is formatted as json – meaning it can be parsed using pythons json library and act as a pythonic dictionary. below we have parsed out the basic elements in the final incarnation of the code:

import os,gzip,json,pprint

#Fetch all files in directory and subdirectories.
def list_file_paths(directory):
   for dirpath,_,filenames in os.walk(directory):
       for f in filenames:
           yield os.path.abspath(os.path.join(dirpath, f))

#Parse out the contents of a single file.
def parse_single_file(src):
    #The first line are the meta entries.
    meta_line = True
    for line in gzip.open(src,'rb'):
        if meta_line == True:
                metadata = json.loads(line)
                meta_line=False
                print "Metadata:"
                pprint.pprint(metadata)
                continue
        #Parse actual sequence data.
        basic_data = json.loads(line)
        print "Basic data:"
        pprint.pprint(basic_data)

        #IMGT-Numbered sequence.
        print "IMGT-numbered sequence"
        d = json.loads(basic_data['data'])
        pprint.pprint(d)
        print "==========="
    
if __name__ == '__main__':
    #Replace this with the location of where you uncompressed the bulk data file.
    directory = '../data/json'

    for f in list_file_paths(directory):
        parse_single_file(f)

The first line of each data unit are meta entries. These look as follows:

{u'Age': u'22-70',
 u'Author': u'Halliley et al., (2015)',
 u'BSource': u'Bone-Marrow',
 u'BType': u'Plasma-B-Cells',
 u'Chain': u'Heavy',
 u'Disease': u'None',
 u'Isotype': u'IGHM',
 u'Link': u'https://doi.org/10.1016/j.immuni.2015.06.016',
 u'Longitudinal': u'no',
 u'Size': 934,
 u'Species': u'human',
 u'Subject': u'no',
 u'Vaccine': u'Tetanus/Flu'}

The attributes should be self-explanatory and the existence of this data on top of each file is supposed to streamline searching through data-units if you wish to parse sequences given a particular configuration of meta-data entries (e.g. organism).

Next, the code parses out data on each sequence that is associated with its genes, full sequence, CDR3 and numbered sequence. Therefore the output for this will look something like this:

{u'cdr3': u'ARHQGVYWVTTAGLSH',
 u'data': u'{"fwh1": {"11": "G", "24": "T", "13": "V", "12": "L", "15": "P", "14": "K", "17": "E", "16": "S", "19": "L", "18": "T", "22": "T", "26": "S", "25": "V", "21": "L", "20": "S", "23": "C"}, "fwh3": {"68": "N", "88": "S", "89": "L", "66": "Y", "67": "Y", "82": "T", "83": "S", "80": "V", "81": "D", "86": "Q", "87": "F", "84": "K", "85": "N", "92": "S", "79": "S", "69": "P", "104": "C", "78": "I", "77": "T", "76": "V", "75": "R", "74": "S", "72": "K", "71": "L", "70": "S", "102": "Y", "90": "K", "100": "A", "101": "V", "95": "T", "94": "V", "97": "A", "96": "A", "91": "L", "99": "T", "98": "D", "93": "S", "103": "Y"}, "fwh2": {"52": "W", "39": "W", "48": "Q", "49": "G", "46": "P", "47": "G", "44": "Q", "45": "P", "51": "E", "43": "R", "40": "G", "42": "I", "55": "S", "53": "I", "54": "G", "41": "W", "50": "L"}, "fwh4": {"120": "Q", "121": "G", "122": "T", "123": "L", "124": "V", "125": "P", "126": "V", "127": "S", "128": "S", "119": "G", "118": "W"}, "cdrh1": {"27": "G", "37": "Y", "31": "S", "30": "I", "28": "G", "29": "S", "35": "S", "34": "S", "38": "Y", "36": "S"}, "cdrh2": {"59": "S", "58": "Y", "57": "S", "56": "I", "63": "G", "64": "T", "65": "T"}, "cdrh3": {"111A": "W", "109": "G", "108": "Q", "115": "L", "114": "G", "117": "H", "116": "S", "111": "Y", "110": "V", "113": "A", "112": "T", "112A": "T", "112B": "V", "106": "R", "107": "H", "105": "A"}}',
 u'j': u'IGHJ1*01',
 u'name': 12,
 u'redundancy': 1,
 u'seq': u'GLVKPSETLSLTCTVSGGSISSSSYYWGWIRQPPGQGLEWIGSISYSGTTYYNPSLKSRVTISVDTSKNQFSLKLSSVTAADTAVYYCARHQGVYWVTTAGLSHWGQGTLVPVSS',
 u'v': u'IGHV4-39*07'}

Above, redundancy refers to how many times we see a given sequence (seq) in a particular study. We also store the IMGT-numbered data (the data attribute) which needs a second round of json parsing and its output is a dictionary of IMGT-number – amino acid associations grouped by the regions of an antibody (cdrs and framework regions):

{u'cdrh1': {u'27': u'G',
            u'28': u'G',
            u'29': u'S',
            u'30': u'I',
            u'31': u'S',
            u'34': u'S',
            u'35': u'S',
            u'36': u'S',
            u'37': u'Y',
            u'38': u'Y'},
 u'cdrh2': {u'56': u'I',
            u'57': u'S',
            u'58': u'Y',
            u'59': u'S',
            u'63': u'G',
            u'64': u'T',
            u'65': u'T'},
 u'cdrh3': {u'105': u'A',
            u'106': u'R',
            u'107': u'H',
            u'108': u'Q',
            u'109': u'G',
            u'110': u'V',
            u'111': u'Y',
            u'111A': u'W',
            u'112': u'T',
            u'112A': u'T',
            u'112B': u'V',
            u'113': u'A',
            u'114': u'G',
            u'115': u'L',
            u'116': u'S',
            u'117': u'H'},
 u'fwh1': {u'11': u'G',
           u'12': u'L',
           u'13': u'V',
           u'14': u'K',
           u'15': u'P',
           u'16': u'S',
           u'17': u'E',
           u'18': u'T',
           u'19': u'L',
           u'20': u'S',
           u'21': u'L',
           u'22': u'T',
           u'23': u'C',
           u'24': u'T',
           u'25': u'V',
           u'26': u'S'},
 u'fwh2': {u'39': u'W',
           u'40': u'G',
           u'41': u'W',
           u'42': u'I',
           u'43': u'R',
           u'44': u'Q',
           u'45': u'P',
           u'46': u'P',
           u'47': u'G',
           u'48': u'Q',
           u'49': u'G',
           u'50': u'L',
           u'51': u'E',
           u'52': u'W',
           u'53': u'I',
           u'54': u'G',
           u'55': u'S'},
 u'fwh3': {u'100': u'A',
           u'101': u'V',
           u'102': u'Y',
           u'103': u'Y',
           u'104': u'C',
           u'66': u'Y',
           u'67': u'Y',
           u'68': u'N',
           u'69': u'P',
           u'70': u'S',
           u'71': u'L',
           u'72': u'K',
           u'74': u'S',
           u'75': u'R',
           u'76': u'V',
           u'77': u'T',
           u'78': u'I',
           u'79': u'S',
           u'80': u'V',
           u'81': u'D',
           u'82': u'T',
           u'83': u'S',
           u'84': u'K',
           u'85': u'N',
           u'86': u'Q',
           u'87': u'F',
           u'88': u'S',
           u'89': u'L',
           u'90': u'K',
           u'91': u'L',
           u'92': u'S',
           u'93': u'S',
           u'94': u'V',
           u'95': u'T',
           u'96': u'A',
           u'97': u'A',
           u'98': u'D',
           u'99': u'T'},
 u'fwh4': {u'118': u'W',
           u'119': u'G',
           u'120': u'Q',
           u'121': u'G',
           u'122': u'T',
           u'123': u'L',
           u'124': u'V',
           u'125': u'P',
           u'126': u'V',
           u'127': u'S',
           u'128': u'S'}}

We hope this quick intro to our data format will allow you to do great science with this data.

Interactive Illustration of Collaboration Networks with D3

D3 is a JavaScript Library that allows the creation of interactive data visualisations. D3 stands for Data-Driven Documents and its advantage is that it is using internet standards as HTML, CSS, and SVG as the foundation. This gives maximal compatibility across all moderns browsers. It is widely used by journalists, data scientists, and starts to be used by academic scientists, too.

Here I want to present a simple way of creating interactive illustrations of collaboration networks, e.g., for your own website. Before going into details, have a look at the example below, it presents some of Rosalind Franklin’s papers and her co-authors.

[d3-source canvas=”wpd3-3903-2″]

The network illustration is a so-called bipartite network because it consists of two types of nodes, blue nodes represent scientists and orange nodes represent publications. These nodes are connected if a scientist is an author on a publication. This way of presenting is a so-called force layout, which simulates repelling Coulomb forces between all nodes and spring-like attractive forces act on nodes that are connected via links. You can drag the nodes around and explore the behaviour of the network.

You will have noticed that you can also see the name of the publications and co-authors when you hover over the nodes. For some of them, a representative figure or photograph is shown, too. You can also double-click the nodes and will be directed to the webpage of the publication or author.

For larger collaboration networks the visualisation is still possible but might get a bit messy. Thus, not showing any figures is advised. Furthermore, the name of the nodes should be shown either below the illustration or as a tooltip (the later is not possible in WordPress but on normal web pages as here). The illustration below shows all 124 publications written by OPIG, together with the 310 authors. Here, I had to remove the Head of the Group Charlotte Deane, as she is on almost all of these papers and would clutter the illustration, unfortunately. In network science, we call such a node an ego node.

[d3-source canvas=”wpd3-3903-0″]

If you want to play around with this, check out my Github repository. The creation of the network is fairly simple. You only need a Bibtex file and can then execute a Python script that creates a JSON file that is read into the JavaSript. This can then be incorporated in all web pages easily.

PS: To enable D3 in WordPress you need a special plugin.  Apparently, it is not up to date with the current stable D3 version 4 but you can load any missing functions manually.

Crystallographic programming: Super short tour of the cctbx

Two of the leading packages in crystallography are Phenix and CCP4. For most practicing crystallographers they will interact via with these to progress a single crystallographic data-set from diffraction images, through integration, merging, phasing, model building and hopefully deposition.

However, if you want to develop crystallographic software, you will likely need to decide on a framework to build upon. Phenix is built on the comprehensive cctbx library, whereas CCP4 programs are typically standlone, although common crystallographic libraries such as clipper and cctbx are utilised.

CCTBX is written mainly in python, with core crystallographic functionality written in C++. My usual starting place for understanding functionality is through the pdb parser tutorial. This introduces the concept of a hierarchy, a iterative way to represent a macromolecule:

from iotbx.pdb import hierarchy
pdb_in = hierarchy.input(file_name="model.pdb")
for chain in pdb_in.hierarchy.only_model().chains() :
  for residue_group in chain.residue_groups() :
    for atom_group in residue_group.atom_groups() :
      for atom in atom_group.atoms() :
        if (atom.element.strip().upper() == "ZN") :
          atom_group.remove_atom(atom)
      if (atom_group.atoms_size() == 0) :
        residue_group.remove_atom_group(atom_group)
    if (residue_group.atom_groups_size() == 0) :
      chain.remove_residue_group(residue_group)
f = open("model_Zn_free.pdb", "w")
f.write(pdb_in.hierarchy.as_pdb_string(
  crystal_symmetry=pdb_in.input.crystal_symmetry()))
f.close()

Although there are many ways to parse a pdb file, the introduction to iotbx.pdb, gives a view of how xray structure data can be associated to the model. The tour of the cctbx can be helpful starting place, especially for understanding how the python and c++ functionality interact through boost and the scitbx.array_family.flex. Unfortunately, documentation on cctbx tends to vary in quality and quantity throughout the modules:

Other components of the library include ways to simulate crystallographic data through simtbx,  and tools for processing xfel data.

As the library is open source, github hosted source code allows exploration of previously written routines, which can be very helpful for understanding the inner workings of the library. Note that there are also bulletin boards for users and developers of phenix and cctbx respectively. A few tutorials can also be found.

Hopefully this post will give someone other than me a reminder of where to find resources to get started developing within CCTBX.

Latexing with gvim

Here I’ll share my set-up for writing Latex with gvim instead of a separate Latex editor. If you are text-editor averse, this blog post is not for you. But if, like me, you love vim and hate useless GUIs, this might be helpful.

We’re lucky to have nice big screens in the Stats Department, but I tend to prefer writing on my MacBook (I find it’s easier to transport to e.g. a cafe, my home, etc). Until now, I’ve been happily using TexMaker for writing, but during a recent period of intense Latexing I started to find the useable screen space oppressively small. The unnecessary GUI had to go.   

No offence TexMaker but I don’t like you

One of our good friends in Statistical Genetics recommended some things to help me with the transition to just using good old (g)vim, which I will now recommend to you.

The key thing is the LaTex-Box plug-in for vim, which gives you the compilation commands, as well as the essentials such as smart indentation, highlight matching, command completion, etc. I used pathogen to install it (see the GitHub for instructions).

Of course, you can then customise your .vimrc file to add more helpful things. This can be the simple preferences, such as using a light background when using gvim:

 

if has(“gui_running”)

        set background=light

endif

You can also do more complicated magic like tabbing through available commands, and the ability to minimise sections, etc. Sidenote: to make working with paragraphs easier, I recommend setting the up/down arrows to move the cursor to the next line in the GUI rather than the next actual line. I prefer overriding this behaviour only in gvim, while leaving the normal behaviour in vim (for actual coding). But each to their own.

To get started, open a .tex file, then compile and view the document with the command Latexmk.

Command suggestions are an example of a magical feature added in .vimrc

The configurations for this command are set in the file .latexmkrc. Mine looks like this:

 

$recorder = 1;
$pdf_mode = 1;
$bibtex_use = 2;
$pdflatex = "pdflatex --shell-escape %O %S";
$pdf_previewer = "start open -a skim %O %S";

My pdf viewer of choice on Mac is Skim, which autoupdates. I view the source and preview at the same time using split view. Please admire the beauty below:

Wow what a beautiful screen

My favourite part is that whenever you save (w), it recompiles and updates the preview. As someone who accidentally types :w everywhere that isn’t vim, it’s nice that this is now productive. It also recompiles automatically if the .bib file is updated. Note that if you have errors at compilation (I’m sure you don’t), you can view them with the command LatexErrors.

Now you too can be a (nearly) GUI-free lightweight Latexer. Enjoy!

 

Using Random Forests in Python with Scikit-Learn

I spend a lot of time experimenting with machine learning tools in my research; in particular I seem to spend a lot of time chasing data into random forests and watching the other side to see what comes out. In my many hours of Googling “random forest foobar” a disproportionate number of hits offer solutions implemented in R. As a young Pythonista in the present year I find this a thoroughly unacceptable state of affairs, so I decided to write a crash course in how to build random forest models in Python using the machine learning library scikit-learn (or sklearn to friends). This is far from exhaustive, and I won’t be delving into the machinery of how and why we might want to use a random forest. Rather, the hope is that this will be useful to anyone looking for a hands-on introduction to random forests (or machine learning in general) in Python.

In the future I’ll write a more in-depth post on how a few libraries turn Python into a powerful environment for data handling and machine learning. Until then, though, let’s jump into random forests!

Toy datasets

Sklearn comes with several nicely formatted real-world toy data sets which we can use to experiment with the tools at our disposal. We’ll be using the venerable iris dataset for classification and the Boston housing set for regression. Sklearn comes with a nice selection of data sets and tools for generating synthetic data, all of which are well-documented. Now, let’s write some Python!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import datasets
iris = datasets.load_iris()

Classification using random forests

First we’ll look at how to do solve a simple classification problem using a random forest. The iris dataset is probably the most widely-used example for this problem and nicely illustrates the problem of classification when some classes are not linearly separable from the others.

First we’ll load the iris dataset into a pandas dataframe. Pandas is a nifty Python library which provides a data structure comparable to the dataframes found in R with database style querying. As an added bonus, the seaborn visualization library integrates nicely with pandas allowing us to generate a nice scatter matrix of our data with minimal fuss.

df = pd.DataFrame(iris.data, columns=iris.feature_names)

# sklearn provides the iris species as integer values since this is required for classification
# here we're just adding a column with the species names to the dataframe for visualisation
df['species'] = np.array([iris.target_names[i] for i in iris.target])

sns.pairplot(df, hue='species')

Neat. Notice that iris-setosa is easily identifiable by petal length and petal width, while the other two species are much more difficult to distinguish. We could do all sorts of pre-processing and exploratory analysis at this stage, but since this is such a simple dataset let’s just fire on. We’ll do a bit of pre-processing later when we come to the Boston data set.

First, let’s split the data into training and test sets. We’ll used stratified sampling by iris class to ensure both the training and test sets contain a balanced number of representatives of each of the three classes. Sklearn requires that all features and targets be numeric, so the three classes are represented as integers (0, 1, 2). Here we’re doing a simple 50/50 split because the data are so nicely behaved. Typically however we might use a 75/25 or even 80/20 training/test split to ensure we have enough training data. In true Python style this is a one-liner.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df[iris.feature_names], iris.target, test_size=0.5, stratify=iris.target, random_state=123456)

Now let’s fit a random forest classifier to our training set. For the most part we’ll use the default settings since they’re quite robust. One exception is the out-of-bag estimate: by default an out-of-bag error estimate is not computed, so we need to tell the classifier object that we want this.

If you’re used to the R implementation, or you ever find yourself having to compare results using the two, be aware that some parameter names and default settings are different between the two. Fortunately both have excellent documentation so it’s easy to ensure you’re using the right parameters if you ever need to compare models.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=123456)
rf.fit(X_train, y_train)

Let’s see how well our model performs when classifying our unseen test data. For a random forest classifier, the out-of-bag score computed by sklearn is an estimate of the classification accuracy we might expect to observe on new data. We’ll compare this to the actual score obtained on our test data.

from sklearn.metrics import accuracy_score

predicted = rf.predict(X_test)
accuracy = accuracy_score(y_test, predicted)

print(f'Out-of-bag score estimate: {rf.oob_score_:.3}')
print(f'Mean accuracy score: {accuracy:.3}')
Out-of-bag score estimate: 0.973
Mean accuracy score: 0.933

Not bad. However, this doesn’t really tell us anything about where we’re doing well. A useful technique for visualising performance is the confusion matrix. This is simply a matrix whose diagonal values are true positive counts, while off-diagonal values are false positive and false negative counts for each class against the other.

from sklearn.metrics import confusion_matrix

cm = pd.DataFrame(confusion_matrix(y_test, predicted), columns=iris.target_names, index=iris.target_names)
sns.heatmap(cm, annot=True)

This lets us know that our model correctly separates the setosa examples, but exhibits a small amount of confusion when attempting to distinguish between versicolor and virginica.

Random forest regression

Now let’s look at using a random forest to solve a regression problem. The Boston housing data set consists of census housing price data in the region of Boston, Massachusetts, together with a series of values quantifying various properties of the local area such as crime rate, air pollution, and student-teacher ratio in schools. The question for us is whether we can use these data to accurately predict median house prices. One caveat of this data set is that the median house price is truncated at $50,000 which suggests that there may be considerable noise in this region of the data. You might want to remove all data with a median house price of $50,000 from the set and see if the regression improves at all.

As before we’ll load the data into a pandas dataframe. This time, however, we’re going to do some pre-processing of our data by independently transforming each feature to have zero mean and unit variance. The values of different features vary greatly in order of magnitude. If we were to analyse the raw data as-is, we run the risk of our analysis being skewed by certain features dominating the variance. This isn’t strictly necessary for a random forest, but will enable us to perform a more meaningful principal component analysis later. Performing this transformation in sklearn is super simple using the StandardScaler class of the preprocessing module. This time we’re going to use an 80/20 split of our data. You could bin the house prices to perform stratified sampling, but we won’t worry about that for now.

boston = datasets.load_boston()

features = pd.DataFrame(boston.data, columns=boston.feature_names)
targets = boston.target

As before, we’ve loaded our data into a pandas dataframe. Notice how I have to construct new dataframes from the transformed data. This is because sklearn is built around numpy arrays. While it’s possible to return a view of a dataframe as an array, transforming the contents of a dataframe requires a little more work. Of course, there’s a library for that, but I’m lazy so I didn’t use it this time.

from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(features, targets, train_size=0.8, random_state=42)

scaler = StandardScaler().fit(X_train)
X_train_scaled = pd.DataFrame(scaler.transform(X_train), index=X_train.index.values, columns=X_train.columns.values)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), index=X_test.index.values, columns=X_test.columns.values)

With the data standardised, let’s do a quick principal-component analysis to see if we could reduce the dimensionality of the problem. This is quick and easy in sklearn using the PCA class of the decomposition module.

from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X_train)
cpts = pd.DataFrame(pca.transform(X_train))
x_axis = np.arange(1, pca.n_components_+1)
pca_scaled = PCA()
pca_scaled.fit(X_train_scaled)
cpts_scaled = pd.DataFrame(pca.transform(X_train_scaled))

# matplotlib boilerplate goes here

Notice how without data standardisation the variance is completely dominated by the first principal component. With standardisation, however, we see that in fact we must consider multiple features in order to explain a significant proportion of the variance. You might want to experiment with building regression models using the principal components (or indeed just combinations of the raw features) to see how well you can do with less information. For now though we’re going to use all of the (scaled) features as the regressors for our model. As with the classification problem fitting the random forest is simple using the RandomForestRegressor class.

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

Now let’s see how we do on our test set. As before we’ll compare the out-of-bag estimate (this time it’s an R-squared score) to the R-squared score for our predictions. We’ll also compute Spearman rank and Pearson correlation coefficients for our predictions to get a feel for how we’re doing.

from sklearn.metrics import r2_score
from scipy.stats import spearmanr, pearsonr

predicted_train = rf.predict(X_train)
predicted_test = rf.predict(X_test)

test_score = r2_score(y_test, predicted_test)
spearman = spearmanr(y_test, predicted_test)
pearson = pearsonr(y_test, predicted_test)

print(f'Out-of-bag R-2 score estimate: {rf.oob_score_:>5.3}')
print(f'Test data R-2 score: {test_score:>5.3}')
print(f'Test data Spearman correlation: {spearman[0]:.3}')
print(f'Test data Pearson correlation: {pearson[0]:.3}')
Out-of-bag R-2 score estimate: 0.841
Test data R-2 score: 0.886
Test data Spearman correlation: 0.904
Test data Pearson correlation: 0.942

Not too bad, though there are a few outliers that would be worth looking into. Your challenge, should you choose to accept it, is to see if removing the $50,000 data improves the regression.

Wrapping up

Congratulations on making it this far. Now you know how to pre-process your data and build random forest models all from the comfort of your iPython session. I plan on writing more in the future about how to use Python for machine learning, and in particular how to make use of some of the powerful tools available in sklearn (a pipeline for data preparation, model fitting, prediction, in one line of Python? Yes please!), and how to make sklearn and pandas play nicely with minimal hassle. If you’re lucky, and if I can bring myself to process the data nicely, I might include some fun examples from less well-behaved real-world data sets.

Until then, though, happy Pythoning!