A tougher molecular data split – spectral split

Scaffold splits have been widely used in molecular machine learning which involves identifying chemical scaffolds in the data set and ensuring scaffolds present in the train and test sets do not overlap. However, two very similar molecules can have differing scaffolds. In an example provided by Pat Walters in his article on splitting chemical data last month, he provides an example where two molecules just differ by a single atom and thus have a very high Tanimoto similarity score of 0.66. However, they have different scaffolds (figure below).

In this case, if one of the molecules were in the train set and the other in the test set, predicting the test molecule would be quite trivial as there is data leakage. Therefore, we need a better splitting method such that there is minimal overlap between the train and test set. In this blogpost, I will be discussing spectral split, a splitting method introduced by our fellow OPIG member, Klarner et. al (2023).

Spectral split

Spectral split or clustering is based on the spectral graph partitioning algorithm. The basic idea of spectral clustering is as follows: The dataset is projected on a R^n matrix. An affinity matrix using a kernel that could be domain-specific is defined. Following that, the graph Laplacian is computed from the affinity matrix, followed by its eigendecomposition. Then,  k eigenvectors corresponding to the k lowest/highest eigenvalues are selected. Finally, the clusters are formed using k-means.

In the context of molecular data splitting, one could use the Tanimoto similarity metric to construct a similarity matrix between all the molecules in the dataset. Then, a spectral clustering method could be used to partition the similarity matrix such that the similarity within the cluster is maximized whereas the similarity between the clusters is minimized. Spectral split showed the least overlap between train (blue) and test (red) set molecules compared to scaffold splits (figure from Klarner at. al. (2024) below)

In addition to spectral splits, one could attempt other tougher splits one could attempt such as UMAP splits suggested by Guo et. al. (2024). For a detailed comparison between UMAP splits and other commonly used splits please refer to Pat Walters’ article on splitting chemical data.

Generating Haikus with Llama 3.2

At the recent OPIG retreat, I was tasked with writing the pub quiz. The quiz included five rounds, and it’s always fun to do a couple “how well do you know your group?” style rounds. Since I work with Transformers, I thought it would be fun to get AI to create Haiku summaries of OPIGlet research descriptions from the website.

AI isn’t as funny as it used to be, but it’s a lot easier to get it to write something coherent. There are also lots of knobs you can turn like temperature, top_p, and the details of the prompt. I decided to use Meta’s new Llama 3.2-3B-Instruct model which is publicly available on Hugging Face. I ran it locally using vllm, and instructed it to write a haiku for each member’s description using a short script which parses the html from the website.

Continue reading

Visualising and validating differences between machine learning models on small benchmark datasets

Introduction
Author

Sam Money-Kyrle

Introduction

An epidemic is sweeping through cheminformatics (and machine learning) research: ugly results tables. These tables are typically bloated with metrics (such as regression and classification metrics next to each other), vastly differing tasks, erratic bold text, and many models. As a consequence, results become difficult to analyse and interpret. Additionally, it is rare to see convincing evidence, such as statistical tests, for whether one model is ‘better’ than another (something Pat Walters has previously discussed). Tables are a practical way to present results and are appropriate in many cases; however, this practicality should not come at the cost of clarity.

The terror of ugly tables extends to benchmark leaderboards, such as Therapeutic Data Commons (TDC). These leaderboard tables do not show:

  1. whether differences in metrics between methods are statistically significant,
  2. whether methods use ensembles or single models,
  3. whether methods use classical (such as Morgan fingerprints) or learned (such as Graph Neural Networks) representations,
  4. whether methods are pre-trained or not,
  5. whether pre-trained models are supervised, self-supervised, or both,
  6. the data and tasks that pre-trained models are pre-trained on.

This lack of context makes meaningful comparisons between approaches challenging, obscuring whether performance discrepancies are due to variance, ensembling, overfitting, exposure to more data, or novelties in model architecture and molecular featurisation. Confirming the statistical significance of performance differences (under consistent experimental conditions!) is crucial in constructing a more lucid picture of machine learning in drug discovery. Using figures to share results in a clear, non-tabular format would also help.

Statistical validation is particularly relevant in domains with small datasets, such as drug discovery, as the small number of test samples leads to high variance in performance between different splits. Recent work by Ash et al. (2024) sought to alleviate the lack of statistical validation in cheminformatics by sharing a helpful set of guidelines for researchers. Here, we explore implementing some of the methods they suggest (plus some others) in Python.

Continue reading

Making Pretty Pictures in PyMOL v2

Throughout my PhD I’ve needed nice PyMOL visualizations, but struggled to quickly and easily make the pictures I wanted. I’ve used Claire Marks‘ blopig post, Making Pretty Pictures in PyMOL, many times and wanted to expand it with what I’ve learned to make satisfying visualizations quickly!

Continue reading

Controlling PyMol from afar

Do you keep downloading .pdb and .sdf files and loading them into PyMol repeatedly?

If yes, then PyMol remote might be just for you. With PyMol remote, you can control a PyMol session running on your laptop from any other machine. For example, from a Jupyter Notebook running on your HPC cluster.

Continue reading

Building CLI Applications with Typer

Remember the last time you had to build a command-line tool? If you’re like me, you probably started with argparse or click, wrote boilerplate code, and still ended up with something that felt clunky. That’s where typer comes in – it’s a game-changer that lets you build CLI apps with minimal code. Although there are several other options, typer stands out because it leverages Python’s type hints to do the heavy lifting. No more manual argument parsing! The following snippet shows how to use typer in its simplest form:

import typer

app = typer.Typer()

@app.command()
def hello(name: str):
    typer.echo(f"Hello {name}!")

if __name__ == "__main__":
    app()

And you will be able to execute it with just:

$ python hello.py Pedro
Hello Pedro!

In this simple example, we were only defining positional arguments, but having optional arguments is as easy as setting default values in the function signature.

Continue reading

Cross referencing across LaTeX documents in one project

A common scenario we come across is that we have a main manuscript document and a supplementary information document, each of which have their own sections, tables and figures. The question then becomes – how do we effectively cross-reference between the documents without having to tediously count all the numbers ourselves every time we make a change and recompile the documents?

The answer: cross referencing!

Continue reading

Using Jupyter on a remote slurm cluster

Suppose we need to do some interactive analysis in a Jupyter notebook, but our local machine lacks the power. We have access to a slurm cluster, but we can’t SSH from the head node to the worker node; we can only SSH from the worker node to the head node. Can we still interact with a Jupyter notebook running on the worker node? As it happens, the answer is “yes” – we just need to do some reverse SSH tunnelling.

Continue reading

Big Compute doesnt want you to know this! Maximising GPU Usage with CUDA MPS

Accelerating Simulations with CUDA MPS: An OpenMM Implementation Guide

Introduction

High-performance molecular dynamics simulations often require running concurrent simulations on GPUs. However, traditional GPU resource allocation can lead to inefficient utilization when running multiple processes, with users often resorting to using multiple GPUs to achieve this. While parrallelsing across nodes can improve time to solution, many processes require coordination and hence communication which quickly becomes a bottleneck. This is exacerbated with more powerful hardware as internal node communication for a single simulation on a single GPU can also become a bottleneck. This problem has been addressed for CPU parrallelism with multiprocessing and multithreading but previously this was challenging to do this efficiently on GPUs.

NVIDIA’s Multi-Process Service (MPS) offers a solution by enabling efficient and easy sharing of GPU resources among multiple processes with just a few commands. In this blog post, we’ll explore how to implement CUDA MPS with Python multiprocessing and OpenMM to accelerate molecular dynamics simulations.

Continue reading

Cream, Compression, and Complexity: Notes from a Coffee-Induced Rabbit Hole

I have recently stumbled upon this paper which, quite unexpectedly, sent me down a rabbit hole reading about compression, generalisation, algorithmic information theory and looking at gifs of milk mixing with coffee on the internet. Here are some half-processed takeaways from this weird journey.

Complexity of a cup of Coffee

First, check out this cool video.

Continue reading