As a long-standing social sec of the OPIG research group, I have found that the key part of the job is ensuring that the event name includes some kind of pun on OPIG (and then maybe organising something). Notable examples include OPIGmas, the annual Christmas party that I am neglecting to organise at the moment, and O’Punting, our annual punting trip. For the next generation of social secs, I have proposed some new activities with even worse names.
Maybe we should train on our test set?
One of the fundamental rules of machine learning is that you should never train on your test set, but what if I told you that you could?

The “AI-ntibody” Competition: benchmarking in silico antibody screening/design
We recently contributed to a communication in Nature Biotechnology detailing an upcoming competition coordinated by Specifica to evaluate the relative performance of in vitro display and in silico methods at identifying target-specific antibody binders and performing downstream antibody candidate optimisation.
Following in the footsteps of tournaments such as the Critical Assessment of Structure Prediction (CASP), which have led to substantial breakthroughs in computational methods for biomolecular structure prediction, the AI-ntibody initiative seeks to establish a periodic benchmarking exercise for in silico antibody discovery/design methods. It should help to identify the most significant breakthroughs in the space and guide the development of future methods.
Making Peace with Molecular Entropy
I first stumbled upon OPIG blogs through a post on ligand-binding thermodynamics, which refreshed my understanding of some thermodynamics concepts from undergrad, bringing me face-to-face with the concept that made most molecular physics students break out in cold sweats: entropy. Entropy is that perplexing measure of disorder and randomness in a system. In the context of molecular dynamics (MD) simulations, it quantifies the conformational freedom and disorder within protein molecules, which becomes particularly relevant when calculating binding free energies.
In MD, MM/GBSA and MM/PBSA are fancy terms for trying to predict how strongly molecules stick together, and they are the go-to methods for binding free energy calculations. MM/PBSA uses the Poisson–Boltzmann (PB) equation to account for solvent polarisation and ionic effects accurately, but at a high computational cost. MM/GBSA, meanwhile, approximates PB using the Generalised Born (GB) model, offering faster calculations suitable for large systems, though with reduced accuracy. Consider MM/PBSA the careful accountant who considers every detail but takes forever, while MM/GBSA is its faster, slightly less accurate coworker who gets the job done when you’re in a hurry.
Like many before me, I made the classic error of ignoring entropy, assuming that the entropy changes would be similar across the systems being compared, so their terms would cancel out and could be neglected. This simplifies the calculations and eases computational constraints (in other words, it was too complicated, and I had deadlines breathing down my neck). This worked fine… until it didn’t. The wake-up call came during a project studying metal–isocitrate complexes in IDH1. For context, IDH1 is a homodimer with a flexible ‘hinge’ region that becomes unstable without its corresponding subunit, giving rise to very high fluctuations. By ignoring entropy in this unstable system, I managed to generate binding free energy results that violated several laws of thermodynamics and would make Clausius roll in his grave.
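To make the bookkeeping concrete, here is a toy sketch of how the MM/GB(P)SA terms are assembled and how dropping the entropy term changes the answer. All numbers are invented for illustration; this is not data from the IDH1 study.

```python
# Toy MM/GB(P)SA bookkeeping: dG_bind = <dE_MM> + <dG_solv> - T*dS,
# where each delta is complex - (receptor + ligand), averaged over
# MD snapshots. All numbers below are invented for illustration.

T = 298.15  # temperature in kelvin

def delta(complex_term, receptor_term, ligand_term):
    """Complex minus unbound components (kcal/mol)."""
    return complex_term - (receptor_term + ligand_term)

dE_MM   = delta(-5120.3, -3204.8, -1890.1)  # gas-phase molecular mechanics energy
dG_solv = delta(-812.4, -640.2, -190.7)     # GB (or PB) polar + nonpolar solvation
dS      = -0.05  # kcal/(mol*K); binding costs entropy, e.g. from normal-mode analysis

print(f"without -T*dS: {dE_MM + dG_solv:8.1f} kcal/mol")          # looks favourable
print(f"with    -T*dS: {dE_MM + dG_solv - T * dS:8.1f} kcal/mol")  # suddenly not
```

In a floppy system like a lone IDH1 subunit, that entropic term is anything but constant between the systems being compared, which is exactly where the “it cancels out” shortcut quietly breaks.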
Navigating Hallucinations in Large Language Models: A Simple Guide
AI is moving fast, and large language models (LLMs) are at the centre of it all, doing everything from generating coherent, human-like text to tackling complex coding challenges. And this is just scratching the surface—LLMs are popping up everywhere, and their list of talents keeps growing by the day.
However, these models aren’t infallible. One of their most intriguing and concerning quirks is the phenomenon known as “hallucination”: instances where the AI confidently produces information that is fabricated or factually incorrect. As we increasingly rely on AI-powered systems in our daily lives, understanding hallucinations is crucial. This post briefly explores what they are, why they occur, and how we can navigate them to get the most out of our new favourite tools.
Protein Property Prediction Using Graph Neural Networks
Proteins are fundamental biological molecules whose structure and interactions underpin a wide array of biological functions. To better understand and predict protein properties, scientists leverage graph neural networks (GNNs), which are particularly well-suited for modelling the complex relationships between protein structure and sequence. This post will explore how GNNs provide a natural representation of proteins, how protein language models (PLMs) like ESM can be incorporated, and how techniques like residual layers improve training efficiency.
Why Graph Neural Networks are Ideal for Representing Proteins
Graph neural networks have emerged as a promising framework for fusing a protein’s sequence with its three-dimensional structure in a single representation. They are uniquely suited to the task: atoms or residues become nodes, and their spatial contacts become edges. Moreover, GNNs operate hierarchically, propagating information through the graph over multiple layers and learning representations of the protein at different levels of granularity. In the context of protein property prediction, this hierarchical learning can reveal important structural motifs, local interactions, and global patterns that contribute to biochemical properties.
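To make this concrete, here is a minimal sketch of such a network in PyTorch Geometric. It is an illustrative model, not one from a particular paper: residues are nodes carrying protein language model embeddings (e.g. from ESM), edges connect spatially close residues, and residual connections keep the deeper message-passing layers trainable.

```python
# A minimal protein property predictor: per-residue PLM embeddings in,
# one scalar property per protein out. Illustrative, not a published model.
import torch
from torch import nn
from torch_geometric.nn import GCNConv, global_mean_pool

class ProteinGNN(nn.Module):
    def __init__(self, in_dim=1280, hidden_dim=128, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(in_dim, hidden_dim)  # project PLM features down
        self.convs = nn.ModuleList(
            GCNConv(hidden_dim, hidden_dim) for _ in range(n_layers)
        )
        self.readout = nn.Linear(hidden_dim, 1)  # single scalar property

    def forward(self, x, edge_index, batch):
        # x: [n_residues, in_dim] node features (e.g. ESM embeddings)
        # edge_index: [2, n_edges] residue pairs within a distance cutoff
        # batch: [n_residues] maps each residue to its protein in the minibatch
        x = self.embed(x)
        for conv in self.convs:
            # residual connection: each layer refines rather than replaces
            x = x + torch.relu(conv(x, edge_index))
        x = global_mean_pool(x, batch)  # aggregate residues into one vector
        return self.readout(x)
```

Here `batch` is the standard PyTorch Geometric vector assigning each node to its graph, so a single forward pass can score a whole minibatch of proteins, and the residual update `x + relu(conv(x))` is one common way to stabilise training as the message-passing stack gets deeper.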
Testing Python (or any!) command line applications
Much of our work in OPIG takes the form of code bases written in Python: databases, machine learning models, and other software tools. Often, the user interface for these tools is developed as both a web app and a command line application. Here, I will discuss one of my favourite tools for testing command-line applications: prysk!
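To give a flavour of it, here is a minimal prysk test (prysk inherits cram’s format; the `greet` command is made up for illustration). Unindented lines are documentation, indented `$` lines are executed in a shell, and the indented lines that follow are the expected output:

```
Check the hypothetical greet command:

  $ greet --name World
  Hello, World!

Version strings change, so match them with a regular expression:

  $ greet --version
  greet \d+\.\d+\.\d+ (re)
```

Running `prysk greet.t` executes each command and shows a diff wherever the real output deviates from the expectations, which means the tests double as living documentation of your CLI.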
The XChem trove of protein–small-molecule structures not in the PDB
The XChem facility at Diamond Light Source is a truly impressive feat of automation in fragment-based drug discovery. Visitors come clutching a styrofoam ice box teeming with apo-form protein crystals, which the shifter soaks with compounds from one or more fragment libraries; a robot at the i04-1 beamline then kindly processes each of the thousands of crystal-laden pins while the visitor enjoys the excellent food in the Diamond canteen (R22). I would especially recommend the jambalaya. Following data collection, the magic of data processing happens: the PanDDA method is used to find partial occupancy in the density, the hits are processed semi-automatedly, and most open targets are uploaded to the Fragalysis web app, where ligand binding can be studied and further compounds elaborated. This collection of targets bound to hundreds of small molecules is a true treasure trove of data, as many structures have yet to be deposited in the PDB, making it a perfect test set for algorithm design: fragments are notoriously fickle to model, and deep learning models cannot cheat by remembering them from the PDB.
Why the vegans will say “I told you so…”
I am writing this on Wednesday 2nd October 2024. The news has all eyes on the Middle Eastern skies. Yesterday a story was circulating on BBC News warning of a drop in uptake of the seasonal flu jab.
https://www.bbc.co.uk/news/articles/c62d8r0nnl6o
Four days ago, on Friday 27th September, several news outlets reported that a number of healthcare workers had shown flu-like symptoms following exposure to the first patient known to have contracted avian flu (H5N1) without any animal contact. PCR testing has so far been inconclusive, with none of these workers testing positive for the virus.
https://www.bbc.co.uk/news/articles/czd1v3vn6ero
Aider and Cheap, Free, and Local LLMs
Aider and the Future of Coding: Open-Source, Affordable, and Local LLMs
The landscape of AI coding is rapidly evolving, with tools like Cursor gaining popularity for multi-file editing and GitHub Copilot for AI-assisted autocomplete. However, these solutions are closed-source and require a subscription.
This blog post will explore Aider, an open-source AI coding tool that offers flexibility, cost-effectiveness, and impressive performance, especially when paired with cheap or free hosted models like DeepSeek and Google Gemini, or with local models served through Ollama.
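As a taste of how this looks in practice (the model names below are illustrative and will date quickly; aider picks the backend from the prefix of the `--model` argument):

```bash
# Install the tool (the PyPI package is aider-chat):
pip install aider-chat

# A cheap hosted model; needs the provider's API key in your environment:
aider --model gemini/gemini-1.5-flash

# A fully local model served by Ollama; assumes `ollama serve` is running
# and that you have pulled the model (e.g. `ollama pull llama3`):
aider --model ollama/llama3
```

Switching between a frontier hosted model and a free local one is then just a matter of changing that one flag, which makes it easy to match the model to the task and the budget.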