Lessons in Scientific Code Deployment

So, I recently deployed my first piece of scientific code. Well, sort of. I made a GitHub repository with instructions on how to download, install and run it.

And then everyone broke it.

So, now having been on tech support duty for a few weeks, it seemed like a good idea to have a think about what I’ve learned.

Now, there is a big preface to this: the first and most important thing I learned is that I should do some reading on how to do this well. I have not yet done that reading, so this post isn’t so much going to offer any advice as catalogue my mistakes. Mistakes that will probably look extremely silly to anyone who has any familiarity with deployment, but might be interesting to anyone who doesn’t.

A surprising number of people really don’t want to touch the command line

As a programmer who spends the vast majority of their time on the command line, I find invoking programs from there very natural. As such, I very much underestimated the obstacle that even installing Anaconda, adding a few packages, and cloning the source code would be, even with instructions to copy and paste.

The issue is, if anything goes wrong, there is a good chance they don’t know whether it is my code or their environment breaking, which probably means they need to contact me about it (more on environments later).

Really, I probably could have saved myself an awful lot of support by shipping the code as an installer, and more still with a GUI to guide people through using the program.

Python is a pain

So, the first thing I learned was something I’d kind of been warned about: deploying Python code is a pain in the butt. Especially for people who aren’t familiar with Python, managing environments is both tricky and an overwhelmingly easy way to break code. Run a Python script from the wrong environment and it is going to fail: if you are lucky, with a failure to import a module; if you are unlucky, with a cryptic error due to, say, changes between Python versions.
Speaking of Python versions, developing in 3.9, never testing in 3.7, and then telling people to install 3.7 can result in a surprising number of surprisingly difficult bugs.
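
One cheap guard against the wrong-environment failure is to check the interpreter and the imports up front and fail with a readable message instead of a cryptic one. The following is only a minimal sketch: the 3.7 minimum is taken from the version mentioned above, while the function name and the module names passed to it are hypothetical placeholders.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 7)  # assumption: the version the install instructions name


def environment_problems(required_modules, min_python=MIN_PYTHON, current=None):
    """Return a list of human-readable problems with the current environment."""
    # `current` can be passed in explicitly, which makes the check easy to test.
    current = tuple(sys.version_info[:2]) if current is None else tuple(current)
    problems = []
    if current < min_python:
        problems.append(
            "Python {}.{} found, but {}.{} or newer is required".format(
                *current, *min_python
            )
        )
    for name in required_modules:
        # find_spec returns None when a top-level module cannot be found
        if importlib.util.find_spec(name) is None:
            problems.append("Missing package: {}".format(name))
    return problems
```

Calling something like this at the top of the entry script and printing the returned list would at least tell a user whether the problem is their environment rather than the code.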

The instructions weren’t clear enough

Scientific code, I think, generally caters an awful lot to expert users: people who really understand the model and are even willing to open the source code to figure out the implementation.

My first stab at documentation managed to be unclear both to the people who didn’t want to touch the command line and to those who were willing to open the source code because they wanted to do something spicy.

So yeah, good documentation is an acquired skill.

Distributed computing is a nightmare

In principle, distribution is terrific: get a library that reduces running arbitrary Python code on multiple nodes to a simple map-like interface. On big clusters, like the ones a lot of scientists use, this can mean speed-ups of anywhere from 10 to even 1000 times.
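
To make "map-like" concrete, here is a minimal single-machine sketch using the standard library's concurrent.futures as a stand-in. It is not a cluster library and not the setup described in this post; `simulate` is a hypothetical placeholder for one independent unit of work.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(parameter):
    """Hypothetical placeholder for one independent unit of scientific work."""
    return parameter ** 2


def run_all(parameters, max_workers=4):
    # executor.map has the same shape as the built-in map, but the calls
    # are handed out to a pool of workers; results come back in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(simulate, parameters))
```

The appeal of distribution libraries is that the calling code keeps this shape while the pool of workers is spread across nodes; the hard part, as described below, is everything around it.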

The only problem is, everyone’s cluster is a special snowflake, and you can’t access most of them to fix things. This can make iteration with a non-programmer painfully slow.

Libraries don’t help as much as I’d have thought either: indeed, my experience of Dask and Dask Jobqueue has been a consistently uphill battle. Between the fact that my workload likes individual nodes sharing lots of memory and a few CPUs and some truly arcane errors (one that broke in the msgpack code), I have often considered (and even started) writing my own code to do this.

Active development doesn’t reach people

Code that is being updated several times a day with bugfixes can be great – but if people aren’t pulling and installing it, no one is going to benefit. I’m seriously tempted to write some code to either auto-update on running or at least let folk know it has been updated.
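
The "let folk know" half of that idea could be as small as a version comparison at start-up. This is a rough sketch under two assumptions: that releases carry plain dotted numeric versions such as 1.4.2, and that the latest version string is obtained somewhere else (how it is fetched is deliberately left out). Both function names are hypothetical.

```python
def parse_version(text):
    """Turn a dotted version such as '1.4.2' into a comparable tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def update_notice(installed, latest):
    """Return a message if `latest` is newer than `installed`, else None."""
    # Comparing tuples of ints avoids the string-comparison trap
    # where '1.10.0' would sort before '1.2.0'.
    if parse_version(latest) > parse_version(installed):
        return (
            "A newer version ({}) is available; you are running {}. "
            "Please pull and reinstall.".format(latest, installed)
        )
    return None
```

Printing the notice, when there is one, before the program does anything else would not force anyone to update, but it would make stale installs visible.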

Summary

In summary, a lot went wrong in my first stab at this. I have very much come to appreciate that a good deployment is an art form, and I’ve got an awful lot of reading to do. In particular, the problem areas above really have eaten a lot of time that probably could have been spent doing actual science with the code, so there is a good incentive to get it right.

New search features for the Structural Antibody Database (SAbDab)

Since its original publication in 2013, we have added several advanced search features to the Structural Antibody Database. This post aims to give an overview of some of these features.

Continue reading →

Antibodies for gut or bad

Over the last two decades, there has been mounting evidence of the role of the gut microbiome (the collection of microorganisms in the GI tract) in metabolic disorder (Fan and Pedersen 2021) and more recently, in psychiatric illness (Morais, Schreiber, and Mazmanian 2021). The maintenance of the equilibrium of commensal bacteria and their proper compartmentalization and stratification in the gut is critical for health.

There are diverse factors regulating microbiota composition (microbiota homeostasis) (Macpherson and McCoy 2013). I am principally interested in the role of antibodies. The idea that antibodies participate in this process is controversial (Kubinak and Round 2016) because of the difficulty of controlling for the multiple confounding environmental variables that influence the microbiome, but there are theories as to how this happens. The process of the shaping of the microbiota by antibodies was dubbed “antibody-mediated immunoselection” (AMIS) by Kubinak and Round (2016).

Continue reading →

Former OPIGlets – where are they now?

Since OPIG began in 2003, 53 students* have managed to escape. But where are these glorious people now? I decided to find out, using my best detective skills (aka LinkedIn, Google and Twitter).

* I’m only including full members who have left the group, as per the former members list on the OPIG website

Where are they?

Firstly, the countries. OPIGlets are mostly still residing in the UK, primarily in the ‘golden triangle’ of London, Oxford and Cambridge. The US comes in second, followed closely by Germany (Note: one former OPIGlet is in Malta, which is too small to be recognised in Geopandas so just imagine it is shown on the world map below)

Continue reading →

2021 likely to be a bumper year for therapeutic antibodies entering clinical trials; massive increase in new targets

Earlier this month the World Health Organisation (WHO) released Proposed International Nonproprietary Name List 125 (PL125), comprising the therapeutics entering clinical trials during the first half of 2021. We have just added this data to our Therapeutic Structural Antibody Database (Thera-SAbDab), bringing the total number of therapeutic antibodies recognised by the WHO to 711.

This is up from 651 at the end of 2020, a year which saw 89 new therapeutic antibodies introduced to the clinic. This rise of 60 in just the first half of 2021 bodes well for a record-breaking year of therapeutics entering trials.

Continue reading →

A Smattering of Olympic Trivia!

Tokyo 2020 is now firmly in our rearview mirror, and I for one will be sad to be deprived of the opportunity to wake up at 4AM to passionately cheer on someone I’ve never heard of in an event I know nothing about as they go for Gold. The heyday of amateurism in the Olympics may be long gone, but it’s never been better for the amateur fan, with 24/7 on-demand coverage, unprecedented access to the athletes via social media, and remote working offering the opportunity to watch the games on a second screen without worrying about one’s boss noticing (not that I would ever engage in such an irresponsible practice, in case my Supervisor is reading this…).

To indulge both my post-Olympics melancholy and my addiction to sports trivia, I’ve trawled the internet to find some interesting factoids related to the Summer Games and present them below for your mild enjoyment:

Continue reading →

A handful of lesser known python libraries

There are more python libraries than you can shake a stick at, but here are a handful that don’t get much love and may save you some brain power, compute time or both.

Fire is a library which turns your normal python functions into command-line utilities without requiring more than a couple of additional lines of copy-and-paste code. Being able to immediately access your functions from the command line is amazingly helpful when you’re making quick and dirty utilities and saves needing to reach for the nuclear approach of using getopt.

Continue reading →

Writing Papers in OPIG

I’m dedicating this blog post to something I spend a great deal of my time doing – reading the manuscripts that members of OPIG produce.

As every member of OPIG knows, we often go through a very large number of drafts as I inexpertly attempt to pull the paper into a shape that I think is acceptable.

When I was a student I was not known for my ability to write; in fact, I would say the opposite was probably true. Writing a paper is a skill that needs to be learnt and, just like giving talks, everyone needs to find their own style.

Before you write or type anything, remember that a good paper starts with researching how your work fits into existing literature. The next step is to craft a compelling story, whilst remembering to tailor your message to your intended audience.

There are many excellent websites/blogs/articles/books advising how to write a good paper, so I am not going to attempt a full guide. Instead, here are a few things to keep in mind.

Have one story, no more and no less – when you write the paper, look at every word/image to see how it helps to deliver your main message.
Once you know your key message it is often easiest to not write the paper in the order the sections appear! Creating the figures from the results first helps to structure the whole paper, then you can move on to methods, then write the results and discussion, then the conclusion, followed by the introduction, finishing up with the abstract and title.
Always place your work in the context of what has already been done, and make clear what makes your work significant or original.
Keep a consistent order – the order in which ideas come in the abstract should also be the same in the introduction, the methods, the results, the discussion etc.
A paper should have a logical flow. In each paragraph, the first sentence defines context, the body is the new information, the last sentence is the take-home message/conclusion. The whole paper builds in the same way from the introduction setting the context, through the results which give the content, to the discussion’s conclusion.
Papers don’t need cliff hangers – main results/conclusions should be clear in the abstract.
State your case with confidence.
Papers don’t need to be written in a dry/technical style…
…but remove the hyperbole. Any claims should be backed up by the evidence in the paper.
Get other people to read your work – their comments will help you (and unless it’s me you can always ignore their suggestions!)

ISMB 2021: epitope prediction tools

I recently had the opportunity to present my work on antibody virtual screening at the 2021 ISMB/ECCB virtual conference. In this blogpost, I want to summarise two research projects presented in the 3DSIG immunoinformatics session (in which I also presented my work) highlighting two different avenues of approaching epitope prediction (and immunoinformatics questions in general): Structure-based (Epitope3D) and sequence-based (SeRenDIP-CE).

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends
