In deep learning based compound generation models the metric of fraction of RDKit-valid compounds is ubiquitous, but is problematic from the cheminformatics viewpoint as a large fraction may be driven by pyrrolic nitrogens (see below) rather than Texas carbons (carbon with 5 bonds like the Star of Texas). In RDKit, no error is more irksome that the KekulizeException
or ValenceException
from RDKit sanitisation. These are raised when the molecule is not correct. This would make the RDKit-valid a good metric, except for a small detail: the validity is as interpreted from the the stated implicit and explicit hydrogens and formal charges on the atoms, which most models do not assign. Therefore, a compound may not be RDKit-valid because it is actually impossible, like a Texas carbon, but in many cases it is because the formal charge or implicit hydrogen numbers of some atoms are incorrect. In both case, the major culprit is nitrogen. Herein I go through what they are and how to fix them, with a focus on aromatic nitrogens.
Category Archives: Hints and Tips
Do not forget to add your data folder to .gitignore
It is good practice not to commit a data folder to version control if the data is available elsewhere and you do not want to track changes of the data. But do not forget to also add an entry for this folder to .gitignore
because otherwise git iterates over all the files in the folder when checking for file changes, which may take a long time if there are many files.
Converting or renaming files, whilst still maintaining the directory structure
For various reasons we might need to convert files from one format to another, for instance from lossless FLAC to MP3. For example:
ffmpeg -i lossless-audio.flac -acodec libmp3lame -ab 128k compressed-audio.mp3
This could be any conversion, but it implies that the input file and the output file are in the same directory. What if we have a carefully curated directory structure and we want to convert (or rename) every file within that structure?
find . -name “*.whateveryouneed” -exec somecommand {} \; is the tool for you.
Continue readingReproducible publishing
I’m a big fan of Jupyter Notebooks. They’re a great way to document and explain code, and even better, you can run this code when connected to an appropriate kernel.
What if you want to work on something larger than a notebook? Say a chapter or even a whole book, with Python, R, Observable JS, or Julia code? Enter Quarto. You can combine Jupyter notebooks and/or plain text markdown to publish production quality articles, presentations, dashboards, website, blogs and books in HTML, PDF, Microsoft Word, ePub, and other formats. Quarto can also connect to publishing platforms like Posit Connect, Confluence Cloud, and others.
Continue readingInteractive visualization of protein–ligand complexes with Py3Dmol
I recently had a problem where I wanted to provide an interactive visualization of multiple different protein–ligand complexes, requiring minimal setup by the user, allowing them to zoom in and out and change the visualization style, without just providing multiple PDB files or a PyMOL session.
Continue readingFine-tune generated molecular poses with a force field
Some molecular pose generation methods benefit from an energy relaxation post-processing step.

Here is a quick way to do this using OpenMM via a short script I prepared:
Continue readingOrganise Your ML Projects With Hydra
One of the most annoying parts of ML research is keeping track of all the various different experiments you’re running – quickly changing and keeping track of changes to your model, data or hyper-parameters can turn into an organisational nightmare. I’m normally a fan of avoiding too many different libraries/frameworks as they often break down if you to do anything even a little bit custom and days are often wasted trying to adapt yourself to a new framework or adapt the framework to you. However, my last codebase ended up straying pretty far into the chaotic side of things so I thought it might be worth trying something else out for my next project. In my quest to instil a bit more order, I’ve started using Hydra, which strikes a nice balance between giving you more structure to organise a project, while not rigidly insisting on it, and I’d highly recommend checking it out yourself.
Continue readingQuickly (and lazily) scale your data processing in Python
Do you use pandas for your data processing/wrangling? If you do, and your code involves any data-heavy steps such as data generation, exploding operations, featurization, etc, then it can quickly become inconvenient to test your code.
- Inconvenient compute times (>tens of minutes). Perhaps fine for a one-off, but over repeated test iterations your efficiency and focus will take a hit.
- Inconvenient memory usage. Perhaps your dataset is too large for memory, or loads in but then causes an OOM error during a mid-operation memory spike.
Conference Summary: MGMS Adaptive Immune Receptors Meeting 2024
On 5th April 2024, over 60 researchers braved the train strikes and gusty weather to gather at Lady Margaret Hall in Oxford and engage in a day full of scientific talks, posters and discussions on the topic of adaptive immune receptor (AIR) analysis!

Optimising for PR AUC vs ROC AUC – an intuitive understanding
When training a machine learning (ML) model, our main aim is usually to get the ‘best’ model out the other end in an unbiased manner. Of course, there are other considerations such as quick training and inference, but mostly we want to be good at predicting the right answer.
A number of factors will affect the quality of our final model, including the chosen architecture, optimiser, and – importantly – the metric we are optimising for. So, how should we pick this metric?
Continue reading