Category Archives: Code

Reproducible publishing

I’m a big fan of Jupyter Notebooks. They’re a great way to document and explain code, and even better, you can run this code when connected to an appropriate kernel.

What if you want to work on something larger than a notebook? Say a chapter or even a whole book, with Python, R, Observable JS, or Julia code? Enter Quarto. You can combine Jupyter notebooks and/or plain text markdown to publish production quality articles, presentations, dashboards, website, blogs and books in HTML, PDF, Microsoft Word, ePub, and other formats. Quarto can also connect to publishing platforms like Posit Connect, Confluence Cloud, and others.

Continue reading

The Tale of the Undead Logger

A picture of a scary-looking zombie in a lumberjack outfit holding an axe, in the middle of a forest at night, staring menacingly at the viewer.
Fear the Undead Logger all ye who enter here.
For he may strike, and drain the life out nodes that you hold dear.
Among the smouldering embers of jobs you thought long dead,
he lingers on, to terrorise, and cause you frightful dread.
But hark ye all my tale to save you from much pain,
and fight ye not anew the battles I have fought in vain.

Or simply…

… Tips and Tricks to Use When wandb Logger Just. Won’t. DIE.

The Weights and Biases Logger (illustrated above by DALL-E; admittedly with some artistic license) hardly requires introduction. It’s something of an industry standard at this point, well-regarded for the extensive (and extensible) functionality of its interactive dashboard; for advanced features like checkpointing model weights in the cloud and automating hyperparameter sweeps; and for integrating painlessly with frameworks like PyTorch and PyTorch Lightning. It simplifies your life as an ML researcher enormously by making it easy to track and compare experiments, monitor system resource usage, all while giving you very fun interactive graphs to play with.
Plot arbitrary quantities you may be logging against each other, interactively, on the fly, however you like. In Dark Mode, of course (you’re a professional, after all). Here’s a less artistic impression to give you an idea, should you have been living under a rock:

Continue reading

Easy Python job queues with RQ

Job queueing is an important consideration for a web application, especially one that needs to play nice and share resources with other web applications. There are lots of options out there with varying levels of complexity and power, but for a simple pure Python job queue that just works, RQ is quick and easy to get up and running.

RQ is a Python job queueing package designed to work out of the box, using a Redis database as a message broker (the bit that allows the app and workers to exchange information about jobs). To use it, you just need a redis-server installation and the rq module in your python environment.

Continue reading

Interactive visualization of protein–ligand complexes with Py3Dmol

I recently had a problem where I wanted to provide an interactive visualization of multiple different protein–ligand complexes, requiring minimal setup by the user, allowing them to zoom in and out and change the visualization style, without just providing multiple PDB files or a PyMOL session.

Continue reading

Comparing pose and affinity prediction methods for follow-up designs from fragments

In any task in the realm of virtual screening, there need to be many filters applied to a dataset of ligands to downselect the ‘best’ ones on a number of parameters to produce a manageable size. One popular filter is if a compound has a physical pose and good affinity as predicted by tools such as docking or energy minimisation. In my pipeline for downselecting elaborations of compounds proposed as fragment follow-ups, I calculate the pose and ΔΔG by energy minimizing the ligand with atom restraints to matching atoms in the fragment inspiration. I either use RDKit using its MMFF94 forcefield or PyRosetta using its ref2015 scorefunction, all made possible by the lovely tool Fragmenstein.

With RDKit as the minimizer the protein neighborhood around the ligand is fixed and placements take on average 21s whereas with PyRosetta placements, they take on average 238s (and I can run placements in parallel luckily). I would ideally like to use RDKit as the placement method since it is so fast and I would like to perform 500K within a few days but, I wanted to confirm that RDKit is ‘good enough’ compared to the slightly more rigorous tool PyRosetta (it allows residues to relax and samples more conformations with the longer runtime I think).

Continue reading

Fine-tune generated molecular poses with a force field

Some molecular pose generation methods benefit from an energy relaxation post-processing step.

Predicted pose before energy minimization
Example of a small molecule pose before and after energy minimization. The pose before minimization is shown in white, the optimized prediction is shown in pink, and a crystal pose is shown as reference in light blue. Note how the aromatic rings are flattened and the leftmost bond is shortened by the optimization.

Here is a quick way to do this using OpenMM via a short script I prepared:

Continue reading

Organise Your ML Projects With Hydra

One of the most annoying parts of ML research is keeping track of all the various different experiments you’re running – quickly changing and keeping track of changes to your model, data or hyper-parameters can turn into an organisational nightmare. I’m normally a fan of avoiding too many different libraries/frameworks as they often break down if you to do anything even a little bit custom and days are often wasted trying to adapt yourself to a new framework or adapt the framework to you. However, my last codebase ended up straying pretty far into the chaotic side of things so I thought it might be worth trying something else out for my next project. In my quest to instil a bit more order, I’ve started using Hydra, which strikes a nice balance between giving you more structure to organise a project, while not rigidly insisting on it, and I’d highly recommend checking it out yourself.

Continue reading

Environmentally sustainable computing 

Did you know that it is approximated that you, a scientist, have a carbon footprint which is between 2 and 12 times higher than the set carbon budget per person to keep global warming below 1.5 °C [1]? 

Background

Global temperatures are rising. This has direct effects on the planet and contributes to increasing humanitarian emergencies. These include more frequent and intense heatwaves, wildfires, and floods [2]. The impact of climate change is already severe, with around 20 million internal displaced persons in 2023 alone due to those disasters [3]. 

Global warming and climate change are caused by the emissions of carbon dioxide and methane, known as carbon emissions. There are different ways in which you could minimise your carbon footprint. For example, I try to reduce the energy usage in the house, try eating mainly plant-based, and travel by train instead of by plane to family and for holidays and conferences. However, up until organising a Green Lecture with the Department of Statistics Green Team I never thought of my computational PhD as a major contributor to my carbon footprint. That doesn’t mean the work I, and all other scientists, do is not important and necessary. But the lecture on principles for environmentally sustainable research given by Loic Lannelongue made me aware of carbon costs of computing, which I would like to share with you. 

Continue reading

The War of the Roses: Tea Edition

Picture the following: the year is 1923, and it’s a sunny afternoon at a posh garden party in Cambridge. Among the polite chatter, one Muriel Bristol—a psychologist studying the mechanisms by which algae acquire nutrients—mentions she has a preference for tea poured over milk, as opposed to milk poured over tea. In a classic example of women not being able to express even the most insignificant preference without an opinionated man telling them they’re wrong, Ronald A. Fisher, a local statistician (later turned eugenicist who dismissed the notion of smoking cigarettes being dangerous as ‘propaganda’, mind you) decides to put her claim to the test with an experiment. Bristol is given eight cups of tea and asked to classify them as milk first or tea first. Luckily, she correctly identifies all eight of them, and gets to happily continue about her life (presumably until the next time she dares mention a similarly outrageous and consequential opinion like a preferred toothpaste brand or a favourite method for filing papers). Fisher, on the other hand, is incentivized to develop Fisher’s exact test, a statistical significance test used in the analysis of contingency tables.

Continue reading

Pyrosetta for RFdiffusion

I will not lie: I often struggle to find a snippet of code that did something in PyRosetta or I spend hours facing a problem caused by something not working as I expect it to. I recently did a tricky project involving RFdiffusion and I kept slipping on the PyRosetta side. So to make future me, others, and ChatGTP5 happy, here are some common operations to make working with PyRosetta for RFdiffusion easier.

Continue reading