Over the past few years I have explored different data visualization strategies with the goal of rapidly communicating information to medicinal chemists. I have recently fallen in love with “molecule networks” as an intuitive and interactive data visualization strategy. This blog gives a brief tutorial on how to start generating your own molecule networks.
Continue readingCategory Archives: Code
Making pretty, interactive graphs the simple way – Use Plotly.
Using an ESP8266 and some DS18B20 one-wire temperature sensors, I have been automatically recording temperature data from various parts of my pond, to see how it fluctuated with air temperature, depth and filter configuration.
Despite the help I was receiving from the feline fish monitor, I was getting a bit irked at the quality of the graphs I was getting using matplotlib.
Matplotlib has been around since 2003, more than 20 years now. It’s arguably the defacto method of producing graphs in python and it’s not going away. However, it’s also a pain to use and by default produces some quite ugly plots unless you put in the mileage. In fact, when attempting to quickly explore data, Michael L. Waskom’s frustrations with matplotlib were directly related to the production of the seaborn library. “By producing complete graphics from a single function call with minimal
arguments, seaborn facilitates rapid prototyping and exploratory data analysis.”
Seaborn makes use of matplotlib and integrates tightly with pandas provide a neat wrapper for matplotlib functions, allowing you to avoid a lot of the data herding needed to view a graph.
You may think “OK, so seaborn finally tames matplotlib, why should I use anything else?” In short, interactivity. Seaborn and Matplotlib may produce graphs, but a graph alone doesn’t really let you explore the data. If you look at a graph you’re limited to the scale the author thought made sense, you can’t zoom in or out and if one line is behind another, you’re kind of stuck.
Where plotly really shines is with just two lines you can generate your figure and then either save it as the image below, or as an interactive HTML graph such as this.
Visualising and validating differences between machine learning models on small benchmark datasets
Introduction
An epidemic is sweeping through cheminformatics (and machine learning) research: ugly results tables. These tables are typically bloated with metrics (such as regression and classification metrics next to each other), vastly differing tasks, erratic bold text, and many models. As a consequence, results become difficult to analyse and interpret. Additionally, it is rare to see convincing evidence, such as statistical tests, for whether one model is ‘better’ than another (something Pat Walters has previously discussed). Tables are a practical way to present results and are appropriate in many cases; however, this practicality should not come at the cost of clarity.
The terror of ugly tables extends to benchmark leaderboards, such as Therapeutic Data Commons (TDC). These leaderboard tables do not show:
- whether differences in metrics between methods are statistically significant,
- whether methods use ensembles or single models,
- whether methods use classical (such as Morgan fingerprints) or learned (such as Graph Neural Networks) representations,
- whether methods are pre-trained or not,
- whether pre-trained models are supervised, self-supervised, or both,
- the data and tasks that pre-trained models are pre-trained on.
This lack of context makes meaningful comparisons between approaches challenging, obscuring whether performance discrepancies are due to variance, ensembling, overfitting, exposure to more data, or novelties in model architecture and molecular featurisation. Confirming the statistical significance of performance differences (under consistent experimental conditions!) is crucial in constructing a more lucid picture of machine learning in drug discovery. Using figures to share results in a clear, non-tabular format would also help.
Statistical validation is particularly relevant in domains with small datasets, such as drug discovery, as the small number of test samples leads to high variance in performance between different splits. Recent work by Ash et al. (2024) sought to alleviate the lack of statistical validation in cheminformatics by sharing a helpful set of guidelines for researchers. Here, we explore implementing some of the methods they suggest (plus some others) in Python.
Continue readingControlling PyMol from afar
Do you keep downloading .pdb
and .sdf
files and loading them into PyMol repeatedly?
If yes, then PyMol remote might be just for you. With PyMol remote, you can control a PyMol session running on your laptop from any other machine. For example, from a Jupyter Notebook running on your HPC cluster.
Continue readingBuilding CLI Applications with Typer
Remember the last time you had to build a command-line tool? If you’re like me, you probably started with argparse
or click
, wrote boilerplate code, and still ended up with something that felt clunky. That’s where typer comes in – it’s a game-changer that lets you build CLI apps with minimal code. Although there are several other options, typer stands out because it leverages Python’s type hints to do the heavy lifting. No more manual argument parsing! The following snippet shows how to use typer in its simplest form:
import typer app = typer.Typer() @app.command() def hello(name: str): typer.echo(f"Hello {name}!") if __name__ == "__main__": app()
And you will be able to execute it with just:
$ python hello.py Pedro Hello Pedro!
In this simple example, we were only defining positional arguments, but having optional arguments is as easy as setting default values in the function signature.
Continue readingCross referencing across LaTeX documents in one project
A common scenario we come across is that we have a main manuscript document and a supplementary information document, each of which have their own sections, tables and figures. The question then becomes – how do we effectively cross-reference between the documents without having to tediously count all the numbers ourselves every time we make a change and recompile the documents?
The answer: cross referencing!
Continue readingUsing Jupyter on a remote slurm cluster
Suppose we need to do some interactive analysis in a Jupyter notebook, but our local machine lacks the power. We have access to a slurm cluster, but we can’t SSH from the head node to the worker node; we can only SSH from the worker node to the head node. Can we still interact with a Jupyter notebook running on the worker node? As it happens, the answer is “yes” – we just need to do some reverse SSH tunnelling.
Continue readingTesting python (or any!) command line applications
Through our work in OPIG, many of our projects come in the form of code bases written in Python. These can be many different things like databases, machine learning models, and other software tools. Often, the user interface for these tools is developed as both a web app and a command line application. Here, I will discuss one of my favourite tools for testing command-line applications: prysk!
Continue readingAider and Cheap, Free, and Local LLMs
Aider and the Future of Coding: Open-Source, Affordable, and Local LLMs
The landscape of AI coding is rapidly evolving, with tools like Cursor gaining popularity for multi-file editing and copilot for AI-assisted autocomplete. However, these solutions are both closed-source and require a subscription.
This blog post will explore Aider, an open-source AI coding tool that offers flexibility, cost-effectiveness, and impressive performance, especially when paired with affordable, free, and local LLMs like DeepSeek, Google Gemini, and Ollama.
Continue readingMDAnalysis: Work with dynamics trajectories of proteins
For a long time crystallographers and subsequently the authors of AlphaFold2 had you believe that proteins are a static group of atoms written to a .pdb file. Turns out this was a HOAX. If you don’t want to miss out on the latest trend of working with dynamic structural ensembles of proteins this blog post is exactly right for you. MDAnalysis is a python package which as the name says was designed to analyse molecular dyanmics simulation and lets you work with trajectories of protein structures easily.
Continue reading