Category Archives: Python

Visualising and validating differences between machine learning models on small benchmark datasets

Introduction
Author

Sam Money-Kyrle

Introduction

An epidemic is sweeping through cheminformatics (and machine learning) research: ugly results tables. These tables are typically bloated with metrics (such as regression and classification metrics next to each other), vastly differing tasks, erratic bold text, and many models. As a consequence, results become difficult to analyse and interpret. Additionally, it is rare to see convincing evidence, such as statistical tests, for whether one model is ‘better’ than another (something Pat Walters has previously discussed). Tables are a practical way to present results and are appropriate in many cases; however, this practicality should not come at the cost of clarity.

The terror of ugly tables extends to benchmark leaderboards, such as Therapeutic Data Commons (TDC). These leaderboard tables do not show:

  1. whether differences in metrics between methods are statistically significant,
  2. whether methods use ensembles or single models,
  3. whether methods use classical (such as Morgan fingerprints) or learned (such as Graph Neural Networks) representations,
  4. whether methods are pre-trained or not,
  5. whether pre-trained models are supervised, self-supervised, or both,
  6. the data and tasks that pre-trained models are pre-trained on.

This lack of context makes meaningful comparisons between approaches challenging, obscuring whether performance discrepancies are due to variance, ensembling, overfitting, exposure to more data, or novelties in model architecture and molecular featurisation. Confirming the statistical significance of performance differences (under consistent experimental conditions!) is crucial in constructing a more lucid picture of machine learning in drug discovery. Using figures to share results in a clear, non-tabular format would also help.

Statistical validation is particularly relevant in domains with small datasets, such as drug discovery, as the small number of test samples leads to high variance in performance between different splits. Recent work by Ash et al. (2024) sought to alleviate the lack of statistical validation in cheminformatics by sharing a helpful set of guidelines for researchers. Here, we explore implementing some of the methods they suggest (plus some others) in Python.

Continue reading

Controlling PyMol from afar

Do you keep downloading .pdb and .sdf files and loading them into PyMol repeatedly?

If yes, then PyMol remote might be just for you. With PyMol remote, you can control a PyMol session running on your laptop from any other machine. For example, from a Jupyter Notebook running on your HPC cluster.

Continue reading

Building CLI Applications with Typer

Remember the last time you had to build a command-line tool? If you’re like me, you probably started with argparse or click, wrote boilerplate code, and still ended up with something that felt clunky. That’s where typer comes in – it’s a game-changer that lets you build CLI apps with minimal code. Although there are several other options, typer stands out because it leverages Python’s type hints to do the heavy lifting. No more manual argument parsing! The following snippet shows how to use typer in its simplest form:

import typer

app = typer.Typer()

@app.command()
def hello(name: str):
    typer.echo(f"Hello {name}!")

if __name__ == "__main__":
    app()

And you will be able to execute it with just:

$ python hello.py Pedro
Hello Pedro!

In this simple example, we were only defining positional arguments, but having optional arguments is as easy as setting default values in the function signature.

Continue reading

Testing python (or any!) command line applications

Through our work in OPIG, many of our projects come in the form of code bases written in Python. These can be many different things like databases, machine learning models, and other software tools. Often, the user interface for these tools is developed as both a web app and a command line application. Here, I will discuss one of my favourite tools for testing command-line applications: prysk!

Continue reading

MDAnalysis: Work with dynamics trajectories of proteins

For a long time crystallographers and subsequently the authors of AlphaFold2 had you believe that proteins are a static group of atoms written to a .pdb file. Turns out this was a HOAX. If you don’t want to miss out on the latest trend of working with dynamic structural ensembles of proteins this blog post is exactly right for you. MDAnalysis is a python package which as the name says was designed to analyse molecular dyanmics simulation and lets you work with trajectories of protein structures easily.

Continue reading

Out of the box RDKit-valid is an imperfect metric: a review of the KekulizeException and nitrogen protonation to correct this

In deep learning based compound generation models the metric of fraction of RDKit-valid compounds is ubiquitous, but is problematic from the cheminformatics viewpoint as a large fraction may be driven by pyrrolic nitrogens (see below) rather than Texas carbons (carbon with 5 bonds like the Star of Texas). In RDKit, no error is more irksome that the KekulizeException or ValenceException from RDKit sanitisation. These are raised when the molecule is not correct. This would make the RDKit-valid a good metric, except for a small detail: the validity is as interpreted from the the stated implicit and explicit hydrogens and formal charges on the atoms, which most models do not assign. Therefore, a compound may not be RDKit-valid because it is actually impossible, like a Texas carbon, but in many cases it is because the formal charge or implicit hydrogen numbers of some atoms are incorrect. In both case, the major culprit is nitrogen. Herein I go through what they are and how to fix them, with a focus on aromatic nitrogens.

Continue reading

Making your code pip installable

aka when to use a CutomBuildCommand or a CustomInstallCommand when building python packages with setup.py

Bioinformatics software is complicated, and often a little bit messy. Recently I found myself wading through a python package building quagmire and thought I could share something I learnt about when to use a custom build command and when to use a custom install command. I have also provided some information about how to copy executables to your package installation bin. **ChatGPT wrote the initial skeleton draft of this post, and I have corrected and edited.

Next time you need to create a pip installable package yourself, hopefully this can save you some time!

Continue reading

Memory-mapped files for efficient data processing

Memory management is a key concern when working with large datasets. Many researchers and developers will load entire datasets into memory for processing. Although this is a straightforward approach that allows for quick access and manipulation of data, it has its drawbacks. When the dataset size approaches or exceeds the available physical memory, performance degrades rapidly due to excessive swapping, leading to increased latency and reduced throughput. Memory-mapped files are an alternative strategy to access and manipulate large datasets without the need to load them fully into memory.


A background on memory-mapped Files

Memory mapping is the process of mapping a file or a portion of a file directly into virtual memory. This mapping establishes a one-to-one correspondence between the file’s contents on disk and specific addresses in the process’s memory space. Instead of relying on traditional I/O operations, such as read() an write(), which involve copying data between kernel space and user space, the process can access the file’s contents directly through memory addresses. Then, page faults are used to determine which chunks to load into physical memory. However, this chunks are significantly smaller than the whole file contents. This direct access reduces overhead and can significantly speed up data processing, especially for large files or applications that require high-throughput I/O operations.

Continue reading

Sort and Slice Tutorial – An alternative to extended connectivity fingerprints

Reproducible publishing

I’m a big fan of Jupyter Notebooks. They’re a great way to document and explain code, and even better, you can run this code when connected to an appropriate kernel.

What if you want to work on something larger than a notebook? Say a chapter or even a whole book, with Python, R, Observable JS, or Julia code? Enter Quarto. You can combine Jupyter notebooks and/or plain text markdown to publish production quality articles, presentations, dashboards, website, blogs and books in HTML, PDF, Microsoft Word, ePub, and other formats. Quarto can also connect to publishing platforms like Posit Connect, Confluence Cloud, and others.

Continue reading