Bioinformatics Hackathon Reflection

A week ago I participated in the Copenhagen Bioinformatics Hackathon 2021, a hackathon focusing on machine learning and proteins, as a mentor for a challenge proposed by our group. The whole experience was fun, but I am also sitting here contemplating a lot of things I wish I had done differently. For this blog post, I therefore want to highlight two changes which I believe would have greatly improved my challenge, and which can hopefully also serve as inspiration for others presenting a hackathon challenge.

Going into this event I had some experience from a few hackathons I had previously attended. Based on this, I wanted to create a challenge containing two parts: first, a simple task which everyone would be able to create a solution for, and second, a more challenging addition to the first task for more experienced participants. I decided to go with the challenge of predicting which heavy and light chains can form a pair, where the additional challenge was to try to visualize which residues were relevant for this interaction. With OAS providing a really nice positive dataset of paired chains, I thought this was going to be an amazing challenge, but as soon as the event began I started seeing its flaws.

Continue reading →

6 things I’ve learnt in my first year as a PhD student

Despite spending only four weeks working in the department, this month roughly marks a year since I started my unlikely career as a statistician and was inaugurated into the hall of opiglets (if you account for my foray into the magic of quantum computing last summer). The past year has been filled with learning opportunities, some of which I ought to take note of and others which are probably worth forgetting. Nonetheless, here is a short list of things I’ve learned in my first year as a DPhil student, which you may find helpful in what I hope are more precedented times.

Simple and stupid first

When it comes to deciding how to tackle your next scientific problem or which lesson to start your blog post with, often the simplest and sometimes most ‘stupid’ idea is the way to go. Keeping things simple gives you the time to better understand your question without getting lost in the details of a complex solution. Plus, the results will inform your next steps.

Continue reading →

Singularity: a guide for the bewildered bioinformatician

Have you ever worked with a piece of software that is awfully difficult to set up? That legacy code written in FORTRAN 77, that other one that requires significant modifications to compile, or any of those that require a long-winded bash script with a thousand dependencies (which you also have to install!). Would it not be helpful if, when that red-eyed PhD student, the one who just spent three months writing up their thesis, says that they absolutely must use that server where you have installed all your stuff, you could just relocate to another one without trouble? Well, you may be able to do that now. You just need to use containerization.

The idea behind containerization is rather simple. The best way to ensure anyone can reproduce your work is to, well, ship your entire system to whoever needs to use it. You could, for example, pack up your desktop in a box, and ship it to your collaborators anywhere in the world. Unfortunately, this idea is quite impractical, not only because of tedious logistics (ever had to deal with customs?), but also because suddenly you won’t be able to run your own pipeline. However, it is a good enough thought that at some point it made a clever engineer wonder whether there was a way to ship an entire system without physically delivering the computer. And that’s exactly what they designed.

Figure: A high-cube shipping container. The best way to make sure your collaborators on the other side of the world can run your pipeline: just pack your desktop in one of these and ship it away!

Continue reading →

RNA-Seq for dummies

RNA sequencing (RNA-Seq) is a powerful technique to study the transcriptome of an organism at a given moment. As its name suggests, RNA-Seq is sequencing the RNA molecules from the sample. But how are the samples prepared? Here I will present a summary of this process:

Disclaimer: This post is not a guide or protocol for performing RNA extraction for RNA-Seq. The objective is to give an overview of the process, highlighting the most important steps.

Continue reading →

Antibody Binding is Mediated by a Compact Vocabulary of Paratope-Epitope Interactions

While my own research focuses mainly on what happens in an antibody before it binds its antigen, I recently came across a paper by Akbar et al. [1] that examines antibody-antigen interactions using an elegant approach to identify a set of structural motifs that antibodies use to interact with their epitopes. Since I am interested in emergent properties that arise when a sequence is mapped onto an antibody structure, this paper was very exciting. I will also shamelessly admit that I’m a sucker for a pretty figure and this paper has many! Regardless, on to the findings!

Figure: Example of identified interaction motifs. Figure from Akbar et al., 2021

Continue reading →

Is bigger better?

Recent work in Natural Language Processing (NLP) indicates that the bigger your model is, the better the performance you will get. Kaplan et al. show that the loss scales as a power law with model size, dataset size, and the amount of compute used for training.

Kaplan, Jared, et al. “Scaling laws for neural language models.” *arXiv preprint arXiv:2001.08361* (2020).
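To make "scales as a power law" concrete, here is a minimal sketch for the model-size case only. It assumes the single-variable form loss = (n_c / n_params) ** alpha; the function name and the constants ALPHA and N_C are hypothetical placeholders chosen for illustration, not the fitted values from the paper.

```python
import math


def power_law_loss(n_params, n_c, alpha):
    """Loss predicted by a pure power law in model size."""
    return (n_c / n_params) ** alpha


# Hypothetical constants, for illustration only (not taken from the paper).
ALPHA = 0.1
N_C = 1e13

small = power_law_loss(1e8, N_C, ALPHA)
doubled = power_law_loss(2e8, N_C, ALPHA)

# Under a power law, doubling the model multiplies the loss by the same
# factor, 2 ** -ALPHA, no matter how big the model already is.
ratio = doubled / small

# On a log-log plot the curve is a straight line with slope -ALPHA.
slope = (math.log(doubled) - math.log(small)) / (math.log(2e8) - math.log(1e8))

print(f"loss ratio after doubling: {ratio:.4f}")
print(f"log-log slope: {slope:.4f}")
```

The point of the sketch is the shape of the relationship: each doubling buys a constant multiplicative improvement, so gains keep coming with size but get more expensive in absolute terms.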

Continue reading →

C++ python bindings in 5 minutes

You don’t even need to use CMake!

Most of the time, we can use libraries like numpy (which is largely written in C) to speed up our calculations, which works when we are dealing with matrices or vectors – but sometimes loops are unavoidable. In those instances, it would be nice if we could use a compiled language such as C++ to remove the bottleneck.

This can be achieved easily using pybind11, which enables us to export C++ functions and classes as importable Python objects. We can do all of this without CMake, using pybind11’s Pybind11Extension class along with a modified setup.py. pybind11 can be compiled from source or installed using:

pip install pybind11

Continue reading →

Slippery slopes and slippery flats

In this episode of my decade-long quest to correct popular British misconceptions, I wish to turn to one of my most geeky obsessions: trains. Specifically, I would like to address a particularly British complaint, which many take as a signal of the lapse of British know-how from its mid-Empire industrial-revolution heights.

This is, of course, ‘leaves on the line’. Why – demand the British public – must timetables run five to ten minutes slower when there are more leaves on the ground? Why do no other modern countries suffer from these ills? And why does the railway take no action over this commuting scourge?

Continue reading →

IWD 2021 and the Gender Pay Gap

Throughout the pandemic, the statistics on division of childcare and home-schooling responsibilities have been shocking: mothers are taking on 150% more homeschooling than fathers (1), while 71% of working mothers’ furlough applications were rejected (2). A third of working mothers reported having lost some or all work due to a lack of childcare during the pandemic, with this figure rising to 44% for BAME mothers. On top of this, 90% of the UK’s 2 million single parents are women (3). These unequal divisions are threatening to undo decades of progress towards gender equality.

In April 2019, the pay gap between men and women in the UK was 17.3% (4), and at the current rate of gender pay gap reduction, the gap will not be closed until 2052 (5). The causes of this gap continue to be unequal caring responsibilities, more women in low-paid work and (illegal) discrimination. BAME women are also subject to the ethnicity pay gap. While this varies regionally and by ethnicity, in London in 2018 the overall figure was 23% (6).

Continue reading →

Better understanding of correlation

Although correlation is often used to mean the linear relationship between two sets of points, I will in the following text use it more broadly to mean any relationship between two sets of points.

You have tasked yourself with finding the correlation between the different features in your dataset. Your purpose could be to remove highly correlated features or just to improve your understanding of your data. Either way, calculating the Pearson correlation coefficient (PCC) or Spearman’s rank correlation coefficient (SCC) to get an overview of the correlations might be the first thing that comes to mind.

Unfortunately, these are limited to linear (PCC) or monotonic (SCC) relationships. In datasets with many complex features, many of the features will be highly correlated, just not linearly (or monotonically). Such non-linear correlations, as seen in the third row of the figure below, are not detected by PCC.

Figure: PCC of different sets of x and y points. https://en.wikipedia.org/wiki/Correlation_and_dependence
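The limitation can be reproduced in a few lines. The following is a minimal pure-Python sketch (the helper names pearson, ranks and spearman are my own, and the data are synthetic): it compares a linear, a monotonic but non-linear, and a symmetric non-monotonic relationship.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Synthetic data: x evenly spaced on [-1, 1].
xs = [i / 10 for i in range(-10, 11)]
linear = [2 * x + 1 for x in xs]   # linear: both coefficients see it
cubic = [x ** 3 for x in xs]       # monotonic: SCC is perfect, PCC is not
parabola = [x ** 2 for x in xs]    # y depends fully on x, yet both are near 0

for name, ys in [("linear", linear), ("cubic", cubic), ("parabola", parabola)]:
    print(f"{name:9s} PCC={pearson(xs, ys):+.3f}  SCC={spearman(xs, ys):+.3f}")
```

In the parabola case y is completely determined by x, yet both coefficients come out at essentially zero, which is exactly the kind of relationship that PCC and SCC miss.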

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends
