Over the past few years I have explored different data visualization strategies with the goal of rapidly communicating information to medicinal chemists. I have recently fallen in love with “molecule networks” as an intuitive and interactive data visualization strategy. This blog post gives a brief tutorial on how to start generating your own molecule networks.
Baby’s First NeurIPS: A Survival Guide for Conference Newbies
There’s something very surreal about stepping into your first major machine learning conference: suddenly, all those GitHub usernames, paper authors, and protagonists of heated Twitter spats become real people, the hallways are buzzing with discussions of papers you’ve been meaning to read, and somehow there are 17,000 other people trying to navigate it all alongside you. That was my experience at NeurIPS this year, and despite feeling like a microplankton in an ocean of ML research, I had a grand time. While some of this success was pure luck, much of it came down to excellent advice from the group’s ML conference veterans and lessons learned through trial and error. So, before the details fade into a blur of posters and coffee breaks, here’s my guide to making the most of your first major ML conference.
Diagnostics on the Cutting Edge, Software in the Stone Age: A Microbiology Story
The need to treat and control infectious diseases has challenged humanity for millennia, driving a series of remarkable advancements in diagnostic tools and techniques. One of the earliest known legal texts, the Code of Hammurabi, references the visual and tactile diagnosis of leprosy. For centuries, the distinct smell of infected wounds was used to identify gangrene, and in Ancient Greece and Rome, the balance of the four humors (blood, phlegm, black bile, and yellow bile) was a central theory in diagnosing infections.
The invention of the compound microscope in 1590 by Hans and Zacharias Janssen, and its refinements by Robert Hooke and Antonie van Leeuwenhoek, marked a turning point as it enabled the direct observation of microorganisms, thereby linking diseases to their microbial origins. Louis Pasteur’s introduction of liquid media aided Joseph Lister in identifying microbes as the source of surgical infections, whilst Robert Koch’s experiments with Bacillus anthracis firmly established the connection between specific microbes and diseases.
Making pretty, interactive graphs the simple way – Use Plotly.
Using an ESP8266 and some DS18B20 one-wire temperature sensors, I have been automatically recording temperature data from various parts of my pond, to see how it fluctuated with air temperature, depth and filter configuration.
Despite the help I was receiving from the feline fish monitor, I was getting a bit irked at the quality of the graphs I was getting using matplotlib.
Matplotlib has been around since 2003, more than 20 years now. It’s arguably the de facto method of producing graphs in Python and it’s not going away. However, it’s also a pain to use and by default produces some quite ugly plots unless you put in the mileage. In fact, Michael L. Waskom’s frustrations with matplotlib when attempting to quickly explore data led directly to the creation of the seaborn library. “By producing complete graphics from a single function call with minimal arguments, seaborn facilitates rapid prototyping and exploratory data analysis.”
Seaborn builds on matplotlib and integrates tightly with pandas to provide a neat wrapper around matplotlib functions, allowing you to avoid a lot of the data herding needed to view a graph.
You may think “OK, so seaborn finally tames matplotlib, why should I use anything else?” In short, interactivity. Seaborn and Matplotlib may produce graphs, but a graph alone doesn’t really let you explore the data. If you look at a graph you’re limited to the scale the author thought made sense: you can’t zoom in or out, and if one line is behind another, you’re kind of stuck.
Where plotly really shines is that, with just two lines, you can generate your figure and then save it either as the image below or as an interactive HTML graph such as this.
A tougher molecular data split – spectral split
Scaffold splits have been widely used in molecular machine learning. They involve identifying the chemical scaffolds in the data set and ensuring that the scaffolds present in the train and test sets do not overlap. However, two very similar molecules can have different scaffolds. In his article on splitting chemical data last month, Pat Walters gives an example where two molecules differ by just a single atom and thus have a very high Tanimoto similarity score of 0.66, yet have different scaffolds (figure below).
In this case, if one of the molecules were in the train set and the other in the test set, predicting the test molecule would be quite trivial because of data leakage. Therefore, we need a better splitting method such that there is minimal overlap between the train and test sets. In this blog post, I will be discussing spectral split, a splitting method introduced by fellow OPIG member Klarner et al. (2023).
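For readers unfamiliar with the Tanimoto similarity mentioned above, here is a minimal sketch using only the standard library. It treats each fingerprint as the set of its "on" bit indices; the two fingerprints are made up for illustration and do not correspond to the molecules in the example.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    union = bits_a | bits_b
    if not union:
        # Both fingerprints are empty; returning 0.0 is a choice made for this sketch.
        return 0.0
    return len(bits_a & bits_b) / len(union)


# Hypothetical fingerprints that share four of their six bits each.
fp_a = {1, 4, 7, 9, 12, 15}
fp_b = {1, 4, 7, 9, 13, 16}

similarity = tanimoto(fp_a, fp_b)  # 4 shared bits / 8 distinct bits
```

In practice the bit sets would come from a cheminformatics toolkit rather than being typed by hand.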
Spectral split
Spectral split, or spectral clustering, is based on the spectral graph partitioning algorithm. The basic idea is as follows. The dataset is represented as points in R^n. An affinity matrix is defined between the points using a kernel, which could be domain-specific. The graph Laplacian is then computed from the affinity matrix and eigendecomposed. Next, the k eigenvectors corresponding to the k lowest (or highest, depending on how the Laplacian is defined) eigenvalues are selected. Finally, the clusters are formed by running k-means on those eigenvectors.
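The steps above can be sketched for the simplest case of two clusters, assuming numpy is available. This is an illustration and not the implementation from Klarner et al.: it uses the unnormalised Laplacian and, in place of k-means, splits on the sign of the eigenvector for the second-smallest eigenvalue (the Fiedler vector). The affinity values are hypothetical.

```python
import numpy as np

# Hypothetical pairwise similarities for six molecules: two groups of three,
# similar within a group (0.8) and dissimilar between groups (0.1).
affinity = np.full((6, 6), 0.1)
affinity[:3, :3] = 0.8
affinity[3:, 3:] = 0.8
np.fill_diagonal(affinity, 1.0)


def spectral_bipartition(affinity):
    """Split items into two clusters using the sign of the Fiedler vector."""
    weights = np.asarray(affinity, dtype=float).copy()
    np.fill_diagonal(weights, 0.0)          # drop self-similarity
    degree = np.diag(weights.sum(axis=1))   # degree matrix D
    laplacian = degree - weights            # unnormalised Laplacian L = D - W
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, eigenvectors = np.linalg.eigh(laplacian)
    fiedler = eigenvectors[:, 1]            # second-smallest eigenvalue
    return (fiedler > 0).astype(int)


labels = spectral_bipartition(affinity)

# Use the cluster containing molecule 0 for training and the other for testing.
train_idx = np.flatnonzero(labels == labels[0])
test_idx = np.flatnonzero(labels != labels[0])
```

For more than two clusters, one would keep k eigenvectors and cluster their rows with k-means as described above; scikit-learn's `SpectralClustering` with `affinity="precomputed"` is one existing implementation of that general recipe.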
In the context of molecular data splitting, one could use the Tanimoto similarity metric to construct a similarity matrix between all the molecules in the dataset. Then, a spectral clustering method could be used to partition the similarity matrix such that the similarity within each cluster is maximized whereas the similarity between clusters is minimized. Spectral split showed the least overlap between train (blue) and test (red) set molecules compared to scaffold splits (figure from Klarner et al. (2024) below).
In addition to spectral splits, one could attempt other tougher splits, such as the UMAP splits suggested by Guo et al. (2024). For a detailed comparison between UMAP splits and other commonly used splits, please refer to Pat Walters’ article on splitting chemical data.
Generating Haikus with Llama 3.2
At the recent OPIG retreat, I was tasked with writing the pub quiz. The quiz included five rounds, and it’s always fun to do a couple “how well do you know your group?” style rounds. Since I work with Transformers, I thought it would be fun to get AI to create Haiku summaries of OPIGlet research descriptions from the website.
AI isn’t as funny as it used to be, but it’s a lot easier to get it to write something coherent. There are also lots of knobs you can turn, like temperature, top_p, and the details of the prompt. I decided to use Meta’s new Llama 3.2-3B-Instruct model, which is publicly available on Hugging Face. I ran it locally using vllm, and instructed it to write a haiku for each member’s description using a short script which parses the HTML from the website.
Visualising and validating differences between machine learning models on small benchmark datasets
Introduction
An epidemic is sweeping through cheminformatics (and machine learning) research: ugly results tables. These tables are typically bloated with metrics (such as regression and classification metrics next to each other), vastly differing tasks, erratic bold text, and many models. As a consequence, results become difficult to analyse and interpret. Additionally, it is rare to see convincing evidence, such as statistical tests, for whether one model is ‘better’ than another (something Pat Walters has previously discussed). Tables are a practical way to present results and are appropriate in many cases; however, this practicality should not come at the cost of clarity.
The terror of ugly tables extends to benchmark leaderboards, such as Therapeutics Data Commons (TDC). These leaderboard tables do not show:
- whether differences in metrics between methods are statistically significant,
- whether methods use ensembles or single models,
- whether methods use classical (such as Morgan fingerprints) or learned (such as Graph Neural Networks) representations,
- whether methods are pre-trained or not,
- whether pre-trained models are supervised, self-supervised, or both,
- the data and tasks that pre-trained models are pre-trained on.
This lack of context makes meaningful comparisons between approaches challenging, obscuring whether performance discrepancies are due to variance, ensembling, overfitting, exposure to more data, or novelties in model architecture and molecular featurisation. Confirming the statistical significance of performance differences (under consistent experimental conditions!) is crucial in constructing a more lucid picture of machine learning in drug discovery. Using figures to share results in a clear, non-tabular format would also help.
Statistical validation is particularly relevant in domains with small datasets, such as drug discovery, as the small number of test samples leads to high variance in performance between different splits. Recent work by Ash et al. (2024) sought to alleviate the lack of statistical validation in cheminformatics by sharing a helpful set of guidelines for researchers. Here, we explore implementing some of the methods they suggest (plus some others) in Python.
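As one illustration of the kind of check meant here, a paired sign-flip permutation test on per-split scores can be written with the standard library alone. This is a sketch under stated assumptions and not necessarily one of the methods recommended by Ash et al.: the function name and the scores for the two models are made up, and both models are assumed to have been evaluated on the same splits.

```python
from itertools import product


def paired_permutation_test(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired scores.

    Under the null hypothesis that the two models are interchangeable, the
    sign of each paired difference is arbitrary, so all 2^n sign assignments
    are enumerated (feasible for a small number of splits).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        permuted = abs(sum(s * d for s, d in zip(signs, diffs)))
        if permuted >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
        total += 1
    return at_least_as_extreme / total


# Hypothetical scores for two models on the same eight splits.
model_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.82, 0.78, 0.85]
model_b = [0.78, 0.77, 0.80, 0.79, 0.80, 0.79, 0.77, 0.81]

p_value = paired_permutation_test(model_a, model_b)
```

Because model A is ahead on every split in this toy example, only the all-positive and all-negative sign assignments are as extreme as the observed difference, giving a p-value of 2/256.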
Making Pretty Pictures in PyMOL v2
Throughout my PhD I’ve needed nice PyMOL visualizations, but struggled to quickly and easily make the pictures I wanted. I’ve used Claire Marks’ blopig post, Making Pretty Pictures in PyMOL, many times and wanted to expand it with what I’ve learned to make satisfying visualizations quickly!
Controlling PyMol from afar
Do you keep downloading .pdb and .sdf files and loading them into PyMol repeatedly?
If yes, then PyMol remote might be just for you. With PyMol remote, you can control a PyMol session running on your laptop from any other machine. For example, from a Jupyter Notebook running on your HPC cluster.
Building CLI Applications with Typer
Remember the last time you had to build a command-line tool? If you’re like me, you probably started with argparse or click, wrote boilerplate code, and still ended up with something that felt clunky. That’s where typer comes in – it’s a game-changer that lets you build CLI apps with minimal code. Although there are several other options, typer stands out because it leverages Python’s type hints to do the heavy lifting. No more manual argument parsing! The following snippet shows how to use typer in its simplest form:
    import typer

    app = typer.Typer()


    @app.command()
    def hello(name: str):
        # The type hint tells typer that NAME is a required string argument.
        typer.echo(f"Hello {name}!")


    if __name__ == "__main__":
        app()
And you will be able to execute it with just:
    $ python hello.py Pedro
    Hello Pedro!
In this simple example, we were only defining positional arguments, but having optional arguments is as easy as setting default values in the function signature.
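To make that concrete, here is a small sketch in which the `greeting` parameter (a name invented for this example) has a default value and therefore becomes a `--greeting` option. So that the snippet can be run as-is, it calls the app through typer's `CliRunner` test helper instead of the command line.

```python
import typer
from typer.testing import CliRunner

app = typer.Typer()


@app.command()
def hello(name: str, greeting: str = "Hello"):
    # `name` has no default, so it stays a required positional argument.
    # `greeting` has a default, so typer exposes it as the option --greeting.
    typer.echo(f"{greeting} {name}!")


# Invoke the app as if from the shell, once with and once without the option.
runner = CliRunner()
default_run = runner.invoke(app, ["Pedro"])
custom_run = runner.invoke(app, ["Pedro", "--greeting", "Hi"])
print(default_run.output, custom_run.output, sep="")
```

Saved as a script with the same `if __name__ == "__main__": app()` footer as the first example, the second call corresponds to running `python hello.py Pedro --greeting Hi`.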