Category Archives: Data Science

Singularity: a guide for the bewildered bioinformatician

Have you ever worked with a piece of software that is awfully difficult to set up? That legacy code written on FORTRAN 77, that other one that requires significant modifications to compile, or any of those that require a long-winded bash script with a thousand dependencies (which you also have to install!). Would it not be helpful if, when that red-eyed PhD student, that one that just spent three months writing up their thesis, says that they absolutely must use that server where you have installed all your stuff, you could just relocate to another one without trouble? Well, you may be able to do that now. You just need to use containerization.

The idea behind containerization is rather simple. The best way to ensure anyone can reproduce your work is to, well, ship your entire system to whomever needs to use it. You could, for example, pack up your desktop in a box, and ship it to your collaborators anywhere in the world. Unfortunately, this idea is quite unpractical, not only because of tedious logistics (ever had to deal with customs?), but also because suddenly you won’t be able to run your own pipeline. However, it is a good enough thought that at some point made a clever engineer wonder whether there was a way to ship an entire system without physically delivering the computer. And that’s exactly what they designed.

40ft x 8ft (9ft 6") One trip high cube shipping container bl
Best way to make sure your collaborators on the other side of the world can run your pipeline — just pack your desktop in one of these, and ship it away!
Continue reading

Better understanding of correlation

Although correlation is often used as the linear relationship between two sets of points, I will in the following text use it more broadly to mean any relationship between two sets of points.

You have tasked yourself with finding the correlation between the different features in your dataset. Your purpose could be to remove highly correlated features or just improve your understanding of your data. Nonetheless, calculating and using the Pearson Correlation Coefficient (PCC) or the Spearman’s rank Correlation Coefficient (SCC) to get an overview of the correlations might be the first thing that comes to your mind.

Unfortunately, both of these are limited to linear (PCC) or monotonic (SCC) relationships. In datasets with many and complex features, many of them will be highly correlated, just not linearly (or monotonic). Instead these correlations can be non-linear which, as seen in the third row in the below figure, does not get detected with PCC.

Figure: PCC of different sets of x and y points. https://en.wikipedia.org/wiki/Correlation_and_dependence
Continue reading

The Coronavirus Antibody Database: 10 months on, 10x the data!

Back in May 2020, we released the Coronavirus Antibody Database (‘CoV-AbDab’) to capture molecular information on existing coronavirus-binding antibodies, and to track what we anticipated would be a boon of data on antibodies able to bind SARS-CoV-2. At the time, we had found around 300 relevant antibody sequences and a handful of solved crystal structures, most of which were characterised shortly after the SARS-CoV epidemic of 2003. We had no idea just how many SARS-CoV-2 binding antibody sequences would come to be released into the public domain…

10 months later (2nd March 2021), we now have tracked 2,673 coronavirus-binding antibodies, ~95% with full Fv sequence information and ~5% with solved structures. These datapoints originate from 100s of independent studies reported in either the academic literature or patent filings.

The entire contents CoV-AbDab database as of 2nd March 2021.
Continue reading

Plotly for interactive 3D plotting

An recently wrote a post on how to use the seaborn library. I really like seaborn and use it a lot for 2D plots. However, recently I have been dealing with 3D data and have found plotly to be best. When used in a jupyter notebook, it allows you to easily generate 3D interactive plots. This is extremely useful to visualize structural data.

Continue reading