To address some of the karmic imbalance created by computational scientists complaining about other people’s code, I am listing here some (not all) of other people’s code that I love.
IgBLAST
IgBLAST is a sequence alignment tool for immunoglobulin sequences implemented in the NCBI C++ toolkit – it applies the classic BLAST algorithm to searching immunoglobulin germline gene databases. It always impresses me how quickly it works. The paper is here, and the authors are Jian Ye, Ning Ma, Thomas L. Madden and James M. Ostell.
In a recent OPIG antibody meeting, the topic of immune system differences between men and women came up. I thought this was cool and something I hadn’t read about, so what a brilliant topic for a blog post. This post is a high-level overview – I’ve listed the papers I’ve used at the bottom of this post so please consult them for more details!
Differences between males and females can lead to pretty big disparities in disease prevalence and outcomes. For example, non-reproductive cancers occur predominantly in males, whilst the majority of autoimmune disease occurs in females. Many factors may be impacting this, including environmental, genetic and hormonal influences, and much more research is required to fully understand these processes. Here I focus on sex-based biology, rather than gender, though both can influence the immune response.
A week ago I participated in Copenhagen Bioinformatics Hackathon 2021, a hackathon focusing on machine learning and proteins, as a mentor for a challenge proposed by our group. The whole experience was fun, but I am also sitting here contemplating a lot of things I wish I had done differently. For this blog post, I therefore want to highlight two changes which I believe would have greatly improved my challenge and which can hopefully also work as an inspiration for others presenting a hackathon challenge.
Going into this event I had some experience from a few hackathons I had previously attended. Based on this, I wanted to create a challenge containing two parts. First, a simple task which everyone would be able to create a solution for, and second, a more challenging addition to the first task for more experienced participants. I decided to go with the challenge of predicting which heavy and light chains can form a pair, where the additional challenge was to try to visualize which residues were relevant for this interaction. With OAS containing a really nice positive dataset of paired chains, I thought this was going to be an amazing challenge, but as soon as the event began I started seeing its flaws.
Despite spending only four weeks working in the department, this month roughly marks a year since I started my unlikely career as a statistician and was inaugurated into the hall of opiglets (if you account for my foray into the magic of quantum computing last summer). The past year has been filled with learning opportunities, some of which I ought to take note of and others which are probably worth forgetting. Nonetheless, here is a short list of things I’ve learned in my first year as a DPhil student, which you may find helpful in what I hope are more precedented times.
Simple and stupid first
When it comes to deciding how to tackle your next scientific problem or which lesson to start your blog post with, often the simplest and sometimes most ‘stupid’ idea is the way to go. Keeping things simple gives you the time to better understand your question without getting lost in the details of a complex solution. Plus, the results will inform your next steps.
Have you ever worked with a piece of software that is awfully difficult to set up? That legacy code written in FORTRAN 77, that other one that requires significant modifications to compile, or any of those that require a long-winded bash script with a thousand dependencies (which you also have to install!). Would it not be helpful if, when that red-eyed PhD student, that one that just spent three months writing up their thesis, says that they absolutely must use that server where you have installed all your stuff, you could just relocate to another one without trouble? Well, you may be able to do that now. You just need to use containerization.
The idea behind containerization is rather simple. The best way to ensure anyone can reproduce your work is to, well, ship your entire system to whoever needs to use it. You could, for example, pack up your desktop in a box, and ship it to your collaborators anywhere in the world. Unfortunately, this idea is quite impractical, not only because of tedious logistics (ever had to deal with customs?), but also because suddenly you won’t be able to run your own pipeline. However, it is a good enough thought that at some point it made a clever engineer wonder whether there was a way to ship an entire system without physically delivering the computer. And that’s exactly what they designed.
RNA sequencing (RNA-Seq) is a powerful technique to study the transcriptome of an organism at a given moment. As its name suggests, RNA-Seq is sequencing the RNA molecules from the sample. But how are the samples prepared? Here I will present a summary of this process:
Disclaimer: This post is not a guide or protocol to perform RNA extraction for RNA-Seq. The objective is to give an overview of the process, highlighting the most important steps.
While my own research focuses mainly on what happens in an antibody before it binds its antigen, I recently came across a paper by Akbar et al. [1] that examines antibody-antigen interactions using an elegant approach to identify a set of structural motifs that antibodies use to interact with their epitopes. Since I am interested in emergent properties that arise when a sequence is mapped onto an antibody structure, this paper was very exciting. I will also shamelessly admit that I’m a sucker for a pretty figure and this paper has many! Regardless, on to the findings!
Recent work in Natural Language Processing (NLP) indicates that the bigger your model is, the better performance you will get. A paper by Kaplan et al. shows that loss scales as a power law with model size, dataset size, and the amount of compute used for training.
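To make the power-law claim a bit more concrete, here is a minimal sketch of how one might check for that kind of scaling. It assumes a relationship of the form loss = a * size^(-alpha) and fits it by linear regression in log-log space. The function name fit_power_law and all of the numbers are made up for illustration; they are not taken from the paper.

```python
import numpy as np

def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-alpha) with a straight line in log-log space."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
    return np.exp(intercept), -slope

# Synthetic, noise-free data (illustrative values only, not from the paper).
sizes = np.array([1e6, 1e7, 1e8, 1e9])
true_a, true_alpha = 10.0, 0.08
losses = true_a * sizes ** (-true_alpha)

a, alpha = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

A power law shows up as a straight line on a log-log plot, which is why a simple linear fit on the logged values is enough to recover the exponent here. Real training curves are noisy, so the fit would be approximate.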
Most of the time, we can use libraries like numpy (which is largely written in C) to speed up our calculations, which works when we are dealing with matrices or vectors – but sometimes loops are unavoidable. In those instances, it would be nice if we could use a compiled language such as C++ to remove the bottleneck.
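As a rough illustration of the difference, the sketch below contrasts a calculation that numpy handles in one call with one where each step depends on the previous result. The logistic map is just an example I picked for a sequential calculation, and the function name is hypothetical.

```python
import numpy as np

values = np.arange(1_000, dtype=np.float64)

# Vectorised: the loop over elements runs inside numpy's compiled code.
sum_sq_numpy = float(np.sum(values ** 2))

# The same calculation as an explicit Python loop, shown for comparison.
sum_sq_loop = 0.0
for v in values:
    sum_sq_loop += v * v

def logistic_map(r, x0, steps):
    """Iterate x -> r * x * (1 - x).

    Each step needs the previous value, so this does not map onto
    a single numpy array operation and stays a Python loop.
    """
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
    return x

print(sum_sq_numpy == sum_sq_loop)
print(logistic_map(2.0, 0.5, 10))
```

Loops of the second kind are where moving the work into a compiled language starts to pay off.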
This can be achieved easily using pybind11, which enables us to export C++ functions and classes as importable Python objects. We can do all of this without CMake, using pybind11’s Pybind11Extension class along with a modified setup.py. Pybind11 can be compiled from source or installed using:
In this episode of my decade-long quest to correct popular British misconceptions, I wish to turn to one of my most geeky obsessions: trains. In particular, I would like to address a particularly British obsession, which many take as a signal of the lapse of British know-how from its mid-Empire industrial-revolution heights.
This is, of course, ‘leaves on the line’. Why – demand the British public – must timetables run five to ten minutes slower when there are more leaves on the ground? Why do no other modern countries suffer from these ills? And why does the railway take no action over this commuting scourge?