Category Archives: Data Visualization

Visualising and validating differences between machine learning models on small benchmark datasets

Introduction
Author

Sam Money-Kyrle

Introduction

An epidemic is sweeping through cheminformatics (and machine learning) research: ugly results tables. These tables are typically bloated with metrics (such as regression and classification metrics next to each other), vastly differing tasks, erratic bold text, and many models. As a consequence, results become difficult to analyse and interpret. Additionally, it is rare to see convincing evidence, such as statistical tests, for whether one model is ‘better’ than another (something Pat Walters has previously discussed). Tables are a practical way to present results and are appropriate in many cases; however, this practicality should not come at the cost of clarity.

The terror of ugly tables extends to benchmark leaderboards, such as Therapeutic Data Commons (TDC). These leaderboard tables do not show:

  1. whether differences in metrics between methods are statistically significant,
  2. whether methods use ensembles or single models,
  3. whether methods use classical (such as Morgan fingerprints) or learned (such as Graph Neural Networks) representations,
  4. whether methods are pre-trained or not,
  5. whether pre-trained models are supervised, self-supervised, or both,
  6. the data and tasks that pre-trained models are pre-trained on.

This lack of context makes meaningful comparisons between approaches challenging, obscuring whether performance discrepancies are due to variance, ensembling, overfitting, exposure to more data, or novelties in model architecture and molecular featurisation. Confirming the statistical significance of performance differences (under consistent experimental conditions!) is crucial in constructing a more lucid picture of machine learning in drug discovery. Using figures to share results in a clear, non-tabular format would also help.

Statistical validation is particularly relevant in domains with small datasets, such as drug discovery, as the small number of test samples leads to high variance in performance between different splits. Recent work by Ash et al. (2024) sought to alleviate the lack of statistical validation in cheminformatics by sharing a helpful set of guidelines for researchers. Here, we explore implementing some of the methods they suggest (plus some others) in Python.

Continue reading

Controlling PyMol from afar

Do you keep downloading .pdb and .sdf files and loading them into PyMol repeatedly?

If yes, then PyMol remote might be just for you. With PyMol remote, you can control a PyMol session running on your laptop from any other machine. For example, from a Jupyter Notebook running on your HPC cluster.

Continue reading

Walk through a cell

In 2022, Maritan et al. released the first ever macromolecular model of an entire cell. The cell in question is a bacterial cell from the genus Mycoplasma. If you’re a biologist, you likely know Mycoplasma as a common cell culture contaminant.

Now, through the work of app developer Timothy Davison, you can interactively explore this cell model from the comfort of your iPhone or Apple Vision Pro. Here are three reasons why I like CellWalk:

1. It’s pretty

The visuals of CellWalk are striking. The app offers a rich depiction of the cell, allowing the user to zoom from the whole cell to individual atoms. I spent a while clicking through each protein I could see to see if I could guess what it was or what it did. Zooming out, CellWalk offers a beautiful tripartite cross section of the cell, showing first the lipid membrane, then a colourful jumble-bag of all its cellular proteins, and then finally the spaghetti-like polynucleic acids.

Tripartite cross section of a Mycoplasma cell. Screengrab taken from the CellWalk app on my phone.
Continue reading

ggPlotting tips with OPIG data

Ever wondered whether opiglets keep their ketchup in the fridge or cupboard? Perhaps you’ve wanted to know how to create nice figure to display lots of information simulataniously. Publication quality figures are easy in R with the ggplot package. We may also learn some good visualisation.

Continue reading

The Tale of the Undead Logger

A picture of a scary-looking zombie in a lumberjack outfit holding an axe, in the middle of a forest at night, staring menacingly at the viewer.
Fear the Undead Logger all ye who enter here.
For he may strike, and drain the life out nodes that you hold dear.
Among the smouldering embers of jobs you thought long dead,
he lingers on, to terrorise, and cause you frightful dread.
But hark ye all my tale to save you from much pain,
and fight ye not anew the battles I have fought in vain.

Or simply…

… Tips and Tricks to Use When wandb Logger Just. Won’t. DIE.

The Weights and Biases Logger (illustrated above by DALL-E; admittedly with some artistic license) hardly requires introduction. It’s something of an industry standard at this point, well-regarded for the extensive (and extensible) functionality of its interactive dashboard; for advanced features like checkpointing model weights in the cloud and automating hyperparameter sweeps; and for integrating painlessly with frameworks like PyTorch and PyTorch Lightning. It simplifies your life as an ML researcher enormously by making it easy to track and compare experiments, monitor system resource usage, all while giving you very fun interactive graphs to play with.
Plot arbitrary quantities you may be logging against each other, interactively, on the fly, however you like. In Dark Mode, of course (you’re a professional, after all). Here’s a less artistic impression to give you an idea, should you have been living under a rock:

Continue reading

Comparing pose and affinity prediction methods for follow-up designs from fragments

In any task in the realm of virtual screening, there need to be many filters applied to a dataset of ligands to downselect the ‘best’ ones on a number of parameters to produce a manageable size. One popular filter is if a compound has a physical pose and good affinity as predicted by tools such as docking or energy minimisation. In my pipeline for downselecting elaborations of compounds proposed as fragment follow-ups, I calculate the pose and ΔΔG by energy minimizing the ligand with atom restraints to matching atoms in the fragment inspiration. I either use RDKit using its MMFF94 forcefield or PyRosetta using its ref2015 scorefunction, all made possible by the lovely tool Fragmenstein.

With RDKit as the minimizer the protein neighborhood around the ligand is fixed and placements take on average 21s whereas with PyRosetta placements, they take on average 238s (and I can run placements in parallel luckily). I would ideally like to use RDKit as the placement method since it is so fast and I would like to perform 500K within a few days but, I wanted to confirm that RDKit is ‘good enough’ compared to the slightly more rigorous tool PyRosetta (it allows residues to relax and samples more conformations with the longer runtime I think).

Continue reading

Pitfalls of using Pearson’s correlation for comparing model performance

Pearson’s R (correlation coefficient) is a measure of the linear correlation between two variables, giving a value between -1 and 1, where 1 is total positive linear correlation, 0 is no linear correlation, and -1 is total negative linear correlation. While it’s a useful statistic for understanding the relationship between two variables, it is often used to compare the performance of two or more models. For example, imagine we had experimental values that we are predicting and several models’ predictions. Obviously, we would prefer the model with the highest Pearson’s R … or perhaps not?

Continue reading

Making your figures more accessible

You might have created the most esthetic figures for your last presentation with a beautiful colour scheme, but have you considered how these might look to someone with colourblindness? Around 5% of the gerneral population suffer from some kind of color vision deficiency, so making your figures more accessible is actually quite important! There are a range of online tools that can help you create figures that look great to everyone.

Continue reading

Plotext: The Matplotlib Lookalike That Breaks Free from X Servers

Imagine this: you’ve spent days computing intricate analyses, and now it’s time to bring your findings to life with a nice plot. You fire up your cluster job, scripts hum along, and… matplotlib throws an error, demanding an X server it can’t find. Frustration sets in. What a waste of computation! What happened? You just forgot to add the -X to your ssh command, or it may be just that X forwarding is not allowed in your cluster. So you will need to rerun your scripts, once you have modified them to generate a file that you can copy to your local machine rather than plotting it directly.

But wait! Plotext to the rescue! This Python package provides an interface nearly identical to matplotlib, allowing you to seamlessly transition your plotting code without sacrificing functionality. But why choose Plotext over the familiar matplotlib? The key lies in its text-based backend. This means it is just printing characters in your console to generate the plots, making it ideal for cluster environments where X servers are often absent or restricted. What do those plots look like? Here is an example:

Continue reading

Some useful pandas functions

Pandas is one of the most used packages for data analysis in python. The library provides functionalities that allow to perfrom complex data manipulation operations in a few lines of code. However, as the number of functions provided is huge, it is impossible to keep track of all of them. More often than we’d like to admit we end up wiriting lines and lines of code only to later on discover that the same operation can be performed with a single pandas function.

To help avoiding this problem in the future, I will run through some of my favourite pandas functions and demonstrate their use on an example data set containing information of crystal structures in the PDB.

Continue reading