Category Archives: AI

The Good (and limitations) of using a Local CoPilot with Ollama

Interactive code editors have been around for a while now, and tools like GitHub Copilot have woven their way into most development pipelines for good reason. They’re easy to use, exceptionally helpful (at certain tasks), and have undeniably made life as a developer smoother. Recently, I decided to switch away from relying on GitHub Copilot in favour of a local model for a few key reasons. While I don’t use the local model all the time, it has proven to be a useful option in many situations. In this blog post, I’ll go over why I made the switch, how I set it up, and share a bit about my experience so far.

Continue reading

Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data

I’m delighted to report that our collaboration (Ísak Valsson, Matthew Warren, Aniket Magarkar, Phil Biggin, & Charlotte Deane) on “Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data” has been published in Nature’s Communications Chemistry (https://doi.org/10.1038/s42004-025-01428-y).


During his MSc dissertation project in the Department of Statistics, University of Oxford, OPIG member Ísak Valsson developed an attention-based GNN to predict protein-ligand binding affinity called “AEV-PLIG”. It featurizes a ligand’s atoms using Atomic Environment Vectors to describe the Protein-Ligand Interactions found in a 3D protein-ligand complex. AEV-PLIG is free and open source (BSD 3-Clause), available from GitHub at https://github.com/oxpig/AEV-PLIG, and forked at https://github.com/bigginlab/AEV-PLIG.

Continue reading

LLM Coding Tools – An Overview

We’ve come a long way since GitHub Copilot first showed us what AI-assisted coding could look like. These days, there’s a whole ecosystem of LLM coding tools out there, each with their own strengths and approaches. In this blog, I’ll give you a quick overview to help you figure out which one might work best for your workflow.

Level 1: Interactive Code Assistance

Continue reading

Baby’s First NeurIPS: A Survival Guide for Conference Newbies

There’s something very surreal about stepping into your first major machine learning conference: suddenly, all those GitHub usernames, paper authors, and protagonists of heated twitter spats become real people, the hallways are buzzing with discussions of papers you’ve been meaning to read, and somehow there are 17,000 other people trying to navigate it all alongside you. That was my experience at NeurIPS this year, and despite feeling like a microplankton in an ocean of ML research, I had a grand time. While some of this success was pure luck, much of it came down to excellent advice from the group’s ML conference veterans and lessons learned through trial and error. So, before the details fade into a blur of posters and coffee breaks, here’s my guide to making the most of your first major ML conference.

Continue reading

A tougher molecular data split – spectral split

Scaffold splits are widely used in molecular machine learning. They involve identifying the chemical scaffolds in a data set and ensuring that the scaffolds present in the train and test sets do not overlap. However, two very similar molecules can have different scaffolds. In his article on splitting chemical data last month, Pat Walters gives an example of two molecules that differ by just a single atom and thus have a high Tanimoto similarity score of 0.66, yet have different scaffolds (figure below).

In this case, if one of the molecules were in the train set and the other in the test set, predicting the test molecule would be quite trivial because of data leakage. We therefore need a better splitting method that minimises the overlap between the train and test sets. In this blog post, I will discuss spectral split, a splitting method introduced by our fellow OPIG member Klarner and co-authors (Klarner et al., 2023).

Spectral split

Spectral split, or spectral clustering, is based on the spectral graph partitioning algorithm. The basic idea is as follows. First, each data point is represented as a vector in R^n. An affinity matrix is then defined using a kernel, which can be domain-specific. Next, the graph Laplacian is computed from the affinity matrix and eigendecomposed. The k eigenvectors corresponding to the k lowest or highest eigenvalues (depending on the formulation) are then selected. Finally, the clusters are formed by running k-means on these eigenvectors.
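To make these steps a little more concrete, here is a rough, standard-library-only sketch of the simplest case: splitting a graph into two clusters. It is a simplification rather than the general algorithm. It uses the unnormalised Laplacian, takes only the eigenvector of the second-smallest eigenvalue (the Fiedler vector, found here by power iteration instead of a full eigendecomposition), and uses the sign of its entries in place of the k-means step. The function names and the toy affinity matrix are made up for illustration.

```python
import math


def laplacian(affinity):
    """Unnormalised graph Laplacian L = D - W for a symmetric affinity matrix W."""
    n = len(affinity)
    degrees = [sum(row) for row in affinity]
    return [
        [(degrees[i] if i == j else 0.0) - affinity[i][j] for j in range(n)]
        for i in range(n)
    ]


def fiedler_vector(lap, iterations=500):
    """Approximate the eigenvector of the second-smallest eigenvalue of L.

    Power iteration on (shift * I - L) finds the eigenvectors of L with the
    smallest eigenvalues. Subtracting the mean at every step removes the
    constant vector, which is the trivial eigenvector with eigenvalue 0.
    """
    n = len(lap)
    # Twice the largest diagonal entry bounds the largest eigenvalue of L.
    shift = 2.0 * max(lap[i][i] for i in range(n)) + 1.0
    v = [i - (n - 1) / 2 for i in range(n)]  # deterministic, non-constant start
    for _ in range(iterations):
        lv = [sum(lap[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [shift * v[i] - lv[i] for i in range(n)]
        mean = sum(v) / n
        v = [x - mean for x in v]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v


def spectral_bipartition(affinity):
    """Assign each node to cluster 0 or 1 by the sign of its Fiedler entry."""
    v = fiedler_vector(laplacian(affinity))
    return [0 if x >= 0 else 1 for x in v]


# Toy affinity matrix (hypothetical): two tight groups of three nodes,
# with only weak affinity (0.01) between the groups.
toy_affinity = [
    [0.0 if i == j else (1.0 if i // 3 == j // 3 else 0.01) for j in range(6)]
    for i in range(6)
]
labels = spectral_bipartition(toy_affinity)
print(labels)
```

For more than two clusters one would keep k eigenvectors and run k-means on them, as described above, and in practice an existing library implementation is the more sensible choice.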

In the context of molecular data splitting, one could use the Tanimoto similarity metric to construct a similarity matrix between all the molecules in the dataset. Then, a spectral clustering method could be used to partition the similarity matrix such that the similarity within each cluster is maximized whereas the similarity between clusters is minimized. Spectral split showed the least overlap between train (blue) and test (red) set molecules compared to scaffold splits (figure from Klarner et al. (2024) below).
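As a small illustration of the first step, the sketch below builds a Tanimoto similarity matrix in plain Python. It assumes each fingerprint is given as a set of on-bit indices. The fingerprints here are invented toy values; in practice they would come from a cheminformatics toolkit. A matrix like this is what would then be handed to a spectral clustering routine as a precomputed affinity.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union


def similarity_matrix(fingerprints):
    """Pairwise Tanimoto similarities between all fingerprints."""
    return [[tanimoto(a, b) for b in fingerprints] for a in fingerprints]


# Hypothetical toy fingerprints: the first two share most of their bits,
# as do the last two, but the two pairs share nothing with each other.
toy_fingerprints = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {10, 11, 12},
    {10, 11, 13},
]
matrix = similarity_matrix(toy_fingerprints)
for row in matrix:
    print([round(value, 2) for value in row])
```

The matrix is symmetric with ones on the diagonal, and similar molecules show up as blocks of high values, which is the structure a spectral method tries to separate.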

In addition to spectral splits, one could try other tougher splits, such as the UMAP splits suggested by Guo et al. (2024). For a detailed comparison between UMAP splits and other commonly used splits, please refer to Pat Walters’ article on splitting chemical data.

Generating Haikus with Llama 3.2

At the recent OPIG retreat, I was tasked with writing the pub quiz. The quiz included five rounds, and it’s always fun to do a couple “how well do you know your group?” style rounds. Since I work with Transformers, I thought it would be fun to get AI to create Haiku summaries of OPIGlet research descriptions from the website.

AI isn’t as funny as it used to be, but it’s a lot easier to get it to write something coherent. There are also lots of knobs you can turn like temperature, top_p, and the details of the prompt. I decided to use Meta’s new Llama-3.2-3B-Instruct model which is publicly available on Hugging Face. I ran it locally using vLLM, and instructed it to write a haiku for each member’s description using a short script which parses the HTML from the website.
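For anyone wondering what the temperature and top_p knobs actually do, here is a toy sketch in plain Python. It is a conceptual illustration only, not how vLLM implements sampling, and the logits are made-up numbers over a three-token vocabulary. Temperature rescales the logits before the softmax, and top_p keeps only the most likely tokens whose probabilities add up to at least the chosen threshold.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, zero out the rest, and renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
cold = apply_temperature(toy_logits, 0.5)  # sharper: favours the top token
hot = apply_temperature(toy_logits, 2.0)   # flatter: more variety
filtered = top_p_filter([0.5, 0.3, 0.2], 0.7)  # drops the least likely token
print(cold, hot, filtered)
```

Lower temperatures and smaller top_p values make the output more predictable, while higher values allow more surprising word choices.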

Continue reading

Visualising and validating differences between machine learning models on small benchmark datasets

Author

Sam Money-Kyrle

Introduction

An epidemic is sweeping through cheminformatics (and machine learning) research: ugly results tables. These tables are typically bloated with metrics (such as regression and classification metrics next to each other), vastly differing tasks, erratic bold text, and many models. As a consequence, results become difficult to analyse and interpret. Additionally, it is rare to see convincing evidence, such as statistical tests, for whether one model is ‘better’ than another (something Pat Walters has previously discussed). Tables are a practical way to present results and are appropriate in many cases; however, this practicality should not come at the cost of clarity.

The terror of ugly tables extends to benchmark leaderboards, such as Therapeutic Data Commons (TDC). These leaderboard tables do not show:

  1. whether differences in metrics between methods are statistically significant,
  2. whether methods use ensembles or single models,
  3. whether methods use classical (such as Morgan fingerprints) or learned (such as Graph Neural Networks) representations,
  4. whether methods are pre-trained or not,
  5. whether pre-trained models are supervised, self-supervised, or both,
  6. the data and tasks that pre-trained models are pre-trained on.

This lack of context makes meaningful comparisons between approaches challenging, obscuring whether performance discrepancies are due to variance, ensembling, overfitting, exposure to more data, or novelties in model architecture and molecular featurisation. Confirming the statistical significance of performance differences (under consistent experimental conditions!) is crucial in constructing a more lucid picture of machine learning in drug discovery. Using figures to share results in a clear, non-tabular format would also help.

Statistical validation is particularly relevant in domains with small datasets, such as drug discovery, as the small number of test samples leads to high variance in performance between different splits. Recent work by Ash et al. (2024) sought to alleviate the lack of statistical validation in cheminformatics by sharing a helpful set of guidelines for researchers. Here, we explore implementing some of the methods they suggest (plus some others) in Python.
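As a flavour of what such a check can look like, here is a minimal sketch of an exact paired permutation (sign-flip) test using only the standard library. It is one simple option for comparing two models evaluated on the same splits, and is not necessarily one of the methods recommended by Ash et al. The scores below are invented for illustration.

```python
from itertools import product


def paired_permutation_test(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired per-split scores.

    Under the null hypothesis that the two models perform equally well, the
    sign of each paired difference is arbitrary, so every sign pattern is
    enumerated and the fraction at least as extreme as observed is returned.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for rounding
            extreme += 1
    return extreme / total


# Hypothetical scores for two models on the same eight splits.
model_a = [0.71, 0.68, 0.74, 0.70, 0.69, 0.73, 0.72, 0.70]
model_b = [0.66, 0.65, 0.70, 0.69, 0.64, 0.70, 0.69, 0.66]
p_value = paired_permutation_test(model_a, model_b)
print(f"p = {p_value:.4f}")
```

The enumeration grows as 2^n, so this exact version only suits a small number of splits; with more splits one would sample sign patterns at random instead.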

Continue reading

The “AI-ntibody” Competition: benchmarking in silico antibody screening/design

We recently contributed to a communication in Nature Biotechnology detailing an upcoming competition coordinated by Specifica to evaluate the relative performance of in vitro display and in silico methods at identifying target-specific antibody binders and performing downstream antibody candidate optimisation.

Following in the footsteps of tournaments such as the Critical Assessment of Structure Prediction (CASP), which have led to substantial breakthroughs in computational methods for biomolecular structure prediction, the AI-ntibody initiative seeks to establish a periodic benchmarking exercise for in silico antibody discovery/design methods. It should help to identify the most significant breakthroughs in the space and orient future methods’ development.

Continue reading

Navigating Hallucinations in Large Language Models: A Simple Guide

AI is moving fast, and large language models (LLMs) are at the centre of it all, doing everything from generating coherent, human-like text to tackling complex coding challenges. And this is just scratching the surface—LLMs are popping up everywhere, and their list of talents keeps growing by the day.

However, these models aren’t infallible. One of their most intriguing and concerning quirks is the phenomenon known as “hallucination” – instances where the AI confidently produces information that is fabricated or factually incorrect. As we increasingly rely on AI-powered systems in our daily lives, understanding what hallucinations are is crucial. This post briefly explores LLM hallucinations: what they are, why they occur, and how we can navigate them and get the most out of our new favourite tools.

Continue reading

Protein Property Prediction Using Graph Neural Networks

Proteins are fundamental biological molecules whose structure and interactions underpin a wide array of biological functions. To better understand and predict protein properties, scientists leverage graph neural networks (GNNs), which are particularly well-suited for modeling the complex relationships between protein structure and sequence. This post will explore how GNNs provide a natural representation of proteins, the incorporation of protein language models (PLMs) like ESM, and the use of techniques like residual layers to improve training efficiency.

Why Graph Neural Networks are Ideal for Representing Proteins

Graph Neural Networks (GNNs) have emerged as a promising framework to fuse primary and secondary structure representation of proteins. GNNs are uniquely suited to represent proteins by modeling atoms or residues as nodes and their spatial connections as edges. Moreover, GNNs operate hierarchically, propagating information through the graph in multiple layers and learning representations of the protein at different levels of granularity. In the context of protein property prediction, this hierarchical learning can reveal important structural motifs, local interactions, and global patterns that contribute to biochemical properties.
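To give a feel for the idea, here is a toy sketch in plain Python of residues as nodes, spatial contacts as edges, and a single round of message passing. It is a bare illustration with no learned weights, not a real GNN layer. The 8 Å distance cutoff, the coordinates, and the two-dimensional node features are all assumptions chosen for the example.

```python
import math


def build_contact_graph(coords, cutoff=8.0):
    """Connect residues whose representative atoms lie within `cutoff` angstroms."""
    n = len(coords)
    neighbours = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                neighbours[i].append(j)
                neighbours[j].append(i)
    return neighbours


def message_passing_layer(features, neighbours):
    """One round of mean aggregation over neighbours, added back onto each
    node's own features through a residual (skip) connection."""
    updated = []
    for i, own in enumerate(features):
        nbrs = neighbours[i]
        if nbrs:
            mean = [
                sum(features[j][d] for j in nbrs) / len(nbrs)
                for d in range(len(own))
            ]
        else:
            mean = [0.0] * len(own)  # isolated node receives no messages
        updated.append([o + m for o, m in zip(own, mean)])
    return updated


# Hypothetical coordinates (in angstroms) for four residues: three close
# together along a line and one far away.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

graph = build_contact_graph(toy_coords)
hidden = message_passing_layer(toy_features, graph)
print(graph)
print(hidden)
```

Applying the layer repeatedly lets information travel further across the graph, which is the sense in which stacked layers capture local interactions first and more global patterns later.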

Continue reading