Singularity: a guide for the bewildered bioinformatician

Have you ever worked with a piece of software that is awfully difficult to set up? That legacy code written in FORTRAN 77, that other one that requires significant modifications to compile, or any of those that need a long-winded bash script and a thousand dependencies (which you also have to install!)? Would it not be helpful if, when that red-eyed PhD student who has just spent three months writing up their thesis says that they absolutely must use the server where you have installed all your stuff, you could just relocate to another one without trouble? Well, you may be able to do that now. You just need to use containerization.

The idea behind containerization is rather simple. The best way to ensure anyone can reproduce your work is to, well, ship your entire system to whoever needs to use it. You could, for example, pack up your desktop in a box and ship it to your collaborators anywhere in the world. Unfortunately, this idea is quite impractical, not only because of tedious logistics (ever had to deal with customs?), but also because you suddenly won’t be able to run your own pipeline. However, it is a good enough thought that, at some point, it made a clever engineer wonder whether there was a way to ship an entire system without physically delivering the computer. And that’s exactly what they designed.

Best way to make sure your collaborators on the other side of the world can run your pipeline — just pack your desktop in one of these, and ship it away!

RNA-Seq for dummies

RNA sequencing (RNA-Seq) is a powerful technique to study the transcriptome of an organism at a given moment. As its name suggests, RNA-Seq involves sequencing the RNA molecules in a sample. But how are the samples prepared? Here I will present a summary of this process:

Disclaimer: This post is not a guide or protocol to perform RNA extraction for RNA-Seq. The objective is to give an overview of the process, highlighting the most important steps.


Antibody Binding is Mediated by a Compact Vocabulary of Paratope-Epitope Interactions

While my own research focuses mainly on what happens in an antibody before it binds its antigen, I recently came across a paper by Akbar et al. [1] that examines antibody-antigen interactions, using an elegant approach to identify a set of structural motifs that antibodies use to interact with their epitopes. Since I am interested in emergent properties that arise when a sequence is mapped onto an antibody structure, this paper was very exciting. I will also shamelessly admit that I’m a sucker for a pretty figure, and this paper has many! Regardless, on to the findings!

Example of identified interaction motifs. Figure from Akbar et al, 2021

Is bigger better?

Recent work in Natural Language Processing (NLP) indicates that the bigger your model is, the better it will perform. Kaplan et al. show that loss scales as a power law with model size, dataset size, and the amount of compute used for training.

Kaplan, Jared, et al. “Scaling laws for neural language models.” arXiv preprint arXiv:2001.08361 (2020).
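
To make the power-law claim a little more concrete, here is a minimal sketch in Python. It only covers the model-size case, and everything in it is an assumption for illustration: the constants `N_C` and `ALPHA` are made-up values (not the ones fitted in the paper), and `loss` and `fit_power_law` are hypothetical helpers. The point is simply that a power law L(N) = (N_c / N)^alpha is a straight line on log-log axes, so its exponent can be recovered with an ordinary least-squares fit on the logs.

```python
import math

# Made-up constants for illustration only (NOT values from the paper).
N_C = 1.0e13
ALPHA = 0.08


def loss(n_params):
    """Toy power law: loss falls as (N_C / N) ** ALPHA with model size N."""
    return (N_C / n_params) ** ALPHA


def fit_power_law(sizes, losses):
    """Fit a line to (log N, log L) and return (exponent, prefactor)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    slope = cov / var
    intercept = y_mean - slope * x_mean
    # log L = log(prefactor) - exponent * log N, so the slope is -exponent.
    return -slope, math.exp(intercept)


# Model sizes from one million to one hundred billion parameters.
sizes = [10 ** k for k in range(6, 12)]
losses = [loss(n) for n in sizes]
alpha_hat, prefactor = fit_power_law(sizes, losses)

# Under a power law, every 10x increase in size multiplies the loss
# by the same constant factor, 10 ** -ALPHA.
ratio = loss(10 * sizes[0]) / loss(sizes[0])
```

With an exponent this small, each tenfold increase in model size shaves only a fixed fraction off the loss, which is the sense in which "bigger is better" comes with diminishing returns per parameter.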

C++ python bindings in 5 minutes

You don’t even need to use CMake!

Most of the time, we can use libraries like numpy (which is largely written in C) to speed up our calculations, which works when we are dealing with matrices or vectors – but sometimes loops are unavoidable. In those instances, it would be nice if we could use a compiled language such as C++ to remove the bottleneck.

This is easy to achieve with pybind11, which lets us export C++ functions and classes as importable Python objects. We can do all of this without CMake by using pybind11’s Pybind11Extension class along with a modified setup.py. pybind11 can be compiled from source or installed using:

pip install pybind11
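
As a rough idea of what such a setup.py might look like, here is a minimal sketch. It is a build configuration rather than a finished recipe, and it rests on assumptions: the module name `example` and the source file `example.cpp` are hypothetical, and the C++ file is assumed to define its bindings with `PYBIND11_MODULE(example, m)`. It needs pybind11 and a C++ compiler to be available.

```python
# setup.py (sketch; "example" and "example.cpp" are hypothetical names)
from setuptools import setup
from pybind11.setup_helpers import Pybind11Extension, build_ext

ext_modules = [
    Pybind11Extension(
        "example",        # name of the importable Python module
        ["example.cpp"],  # C++ source(s) containing PYBIND11_MODULE(example, m)
    ),
]

setup(
    name="example",
    version="0.0.1",
    ext_modules=ext_modules,
    # pybind11's build_ext picks sensible compiler flags (e.g. the C++ standard).
    cmdclass={"build_ext": build_ext},
)
```

Running `pip install .` in the directory containing this file should compile the extension, after which `import example` would work from Python.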

Slippery slopes and slippery flats

In this episode of my decade-long quest to correct popular British misconceptions, I wish to turn to one of my most geeky obsessions: trains. In particular, I would like to address a particularly British obsession, which many take as a signal of the lapse of British know-how from its mid-Empire industrial-revolution heights.

This is, of course, ‘leaves on the line’. Why – demand the British public – must timetables run five to ten minutes slower when there are more leaves on the ground? Why do no other modern countries suffer from these ills? And why does the railway take no action over this commuting scourge?


IWD 2021 and the Gender Pay Gap

Throughout the pandemic, the statistics on division of childcare and home-schooling responsibilities have been shocking: mothers are taking on 150% more home-schooling than fathers (1), while 71% of working mothers’ furlough applications were rejected (2). A third of working mothers reported having lost some or all work due to a lack of childcare during the pandemic, with this figure rising to 44% for BAME mothers. On top of this, 90% of the UK’s 2 million single parents are women (3). These unequal divisions are threatening to undo decades of progress towards gender equality.

In April 2019, the pay gap between men and women in the UK was 17.3% (4), and at the current rate of gender pay gap reduction, the gap will not be closed until 2052 (5). The causes of this gap continue to be unequal caring responsibilities, more women in low-paid work and (illegal) discrimination. BAME women are also subject to the ethnicity pay gap. While this varies regionally and by ethnicity, in London in 2018 the overall figure was 23% (6).


Better understanding of correlation

Although correlation is often used to mean the linear relationship between two sets of points, in the following text I will use it more broadly to mean any relationship between two sets of points.

You have tasked yourself with finding the correlation between the different features in your dataset. Your purpose could be to remove highly correlated features or just to improve your understanding of your data. Either way, calculating the Pearson Correlation Coefficient (PCC) or Spearman’s rank Correlation Coefficient (SCC) to get an overview of the correlations might be the first thing that comes to mind.

Unfortunately, these coefficients only capture linear (PCC) or monotonic (SCC) relationships. In datasets with many complex features, many features will be strongly related, just not linearly (or monotonically). Such non-linear relationships, like those in the third row of the figure below, are not detected by PCC.

Figure: PCC of different sets of x and y points. https://en.wikipedia.org/wiki/Correlation_and_dependence
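
The limitation is easy to reproduce with a few lines of plain Python. The sketch below is only an illustration: `pearson`, `ranks` and `spearman` are hand-written helpers (not library functions), and the two toy datasets, y = x² and y = x³ on a symmetric range, are chosen to make the point as starkly as possible.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


xs = list(range(-5, 6))

# y = x^2: y is fully determined by x, but the relationship is neither
# linear nor monotonic, so both coefficients miss it.
squares = [x ** 2 for x in xs]
pcc_sq, scc_sq = pearson(xs, squares), spearman(xs, squares)

# y = x^3: monotonic but not linear, so SCC is perfect while PCC is not.
cubes = [x ** 3 for x in xs]
pcc_cu, scc_cu = pearson(xs, cubes), spearman(xs, cubes)
```

For the parabola both coefficients come out at essentially zero even though y depends entirely on x, while for the cubic only the rank-based coefficient reports a perfect relationship.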

ORDER!: Returning bond order information to your docked poses


Common docking software, such as AutoDock Vina or AutoDock 4, requires the ligand and receptor files to be converted into the PDBQT format. Once a correct pose has been identified, that pose is also written out as a .pdbqt file.


Commercialising your research: Where to start?

If you look at some of the biggest technology companies in the world, from Google and Facebook to hardware companies like Dell or even biotech unicorns like Oxford’s own Oxford Nanopore, all of them started on university campuses. If you are a researcher interested in finding out how to take the first steps to commercialise your research, here is a quick guide:
