Aider and Cheap, Free, and Local LLMs

Aider and the Future of Coding: Open-Source, Affordable, and Local LLMs

The landscape of AI coding is rapidly evolving, with tools like Cursor gaining popularity for multi-file editing and Copilot for AI-assisted autocomplete. However, these solutions are closed-source and require a subscription.

This blog post will explore Aider, an open-source AI coding tool that offers flexibility, cost-effectiveness, and impressive performance, especially when paired with affordable, free, and local LLMs like DeepSeek, Google Gemini, and Ollama.

Continue reading →

MDAnalysis: Work with dynamics trajectories of proteins

For a long time crystallographers and subsequently the authors of AlphaFold2 had you believe that proteins are a static group of atoms written to a .pdb file. Turns out this was a HOAX. If you don’t want to miss out on the latest trend of working with dynamic structural ensembles of proteins, this blog post is exactly right for you. MDAnalysis is a Python package which, as the name says, was designed to analyse molecular dynamics simulations and lets you work with trajectories of protein structures easily.

Continue reading →

Our future health: A new UK health research programme

Last week I walked into Boots and, after giving some physical measurements, including my blood pressure and cholesterol levels, I gave a blood sample to be part of the Our Future Health initiative. Our Future Health (https://ourfuturehealth.org.uk/) is set to become the UK’s largest health research programme ever. With the aim of recruiting five million volunteers across the country, it aims to revolutionise the way we detect, prevent and treat disease.

The breadth, depth and detail of Our Future Health makes it a world-leading resource. The data collected could hold the key to a wide range of health discoveries, such as:

Identifying early signals to detect disease much earlier.
Accurately predicting who is at higher risk of disease.
Developing better interventions and more effective treatments and technologies.

How’s it going so far?

Since the start of recruitment in July 2022 (delayed because of Covid), the programme has recruited over one million participants where:

Continue reading →

Walk through a cell

In 2022, Maritan et al. released the first ever macromolecular model of an entire cell. The cell in question is a bacterial cell from the genus Mycoplasma. If you’re a biologist, you likely know Mycoplasma as a common cell culture contaminant.

Now, through the work of app developer Timothy Davison, you can interactively explore this cell model from the comfort of your iPhone or Apple Vision Pro. Here are three reasons why I like CellWalk:

1. It’s pretty

The visuals of CellWalk are striking. The app offers a rich depiction of the cell, allowing the user to zoom from the whole cell to individual atoms. I spent a while clicking through each protein I could see to see if I could guess what it was or what it did. Zooming out, CellWalk offers a beautiful tripartite cross section of the cell, showing first the lipid membrane, then a colourful jumble-bag of all its cellular proteins, and then finally the spaghetti-like polynucleic acids.

Tripartite cross section of a *Mycoplasma* cell. Screengrab taken from the CellWalk app on my phone.

Continue reading →

Out of the box RDKit-valid is an imperfect metric: a review of the KekulizeException and nitrogen protonation to correct this

In deep learning based compound generation models the metric of fraction of RDKit-valid compounds is ubiquitous, but is problematic from the cheminformatics viewpoint as a large fraction may be driven by pyrrolic nitrogens (see below) rather than Texas carbons (carbon with 5 bonds like the Star of Texas). In RDKit, no error is more irksome than the KekulizeException or ValenceException from RDKit sanitisation. These are raised when the molecule is not correct. This would make RDKit-valid a good metric, except for a small detail: the validity is as interpreted from the stated implicit and explicit hydrogens and formal charges on the atoms, which most models do not assign. Therefore, a compound may not be RDKit-valid because it is actually impossible, like a Texas carbon, but in many cases it is because the formal charge or implicit hydrogen numbers of some atoms are incorrect. In both cases, the major culprit is nitrogen. Herein I go through what they are and how to fix them, with a focus on aromatic nitrogens.

Continue reading →

The Patterns that Escape Us

Part The First: An Outrageous Claim

Reproduced below is the introductory passage from a psycholinguistics paper, published in the mid-nineties. Riveted, as I’m sure you are, having just read that banger opening line to my blog post, humour me and read on; I promise it gets interesting.

Continue reading →

Drug Discovery Tools, but they’re Olympic sports…

The Olympic Games may have come and gone, but like me, I’m sure you’re all wondering which Olympic sport your favourite drug discovery tool would compete in. Fear not, I have taken it upon myself to answer this pressing question. In this blog post, we’ll match some of the most popular tools in our field with their Olympic counterparts. Before we begin, let me clarify that I’m using the term ‘tool’ rather loosely here; I’ve included a variety of resources. I don’t claim these to be the most popular, just the ones I thought were most sport-like.

RDKit: Athletics. I’m biased, but we must start with the big one. Like track and field events at the heart of the Olympics, RDKit is at the centre of many other tools in our field. It’s versatile, essential, and it’s hard to imagine our work without it. RDKit does it all.

Continue reading →

Do not forget to add your data folder to .gitignore

It is good practice not to commit a data folder to version control if the data is available elsewhere and you do not want to track changes of the data. But do not forget to also add an entry for this folder to .gitignore because otherwise git iterates over all the files in the folder when checking for file changes, which may take a long time if there are many files.
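As a minimal sketch of that advice, the snippet below appends a `data/` entry to a repository's .gitignore if it is not already listed. The helper name `ensure_ignored` and the folder name `data` are assumptions made for illustration, not something from the original post, and the check only looks for an exact matching line rather than interpreting gitignore patterns.

```python
import tempfile
from pathlib import Path


def ensure_ignored(repo_root, folder="data"):
    """Append `folder/` to .gitignore under repo_root unless it is already listed.

    Hypothetical helper for illustration. Returns True if an entry was added.
    """
    gitignore = Path(repo_root) / ".gitignore"
    entry = folder.strip("/") + "/"
    text = gitignore.read_text() if gitignore.exists() else ""
    lines = [line.strip() for line in text.splitlines()]
    # Treat "data" and "data/" as the same entry; other patterns are not parsed.
    if entry in lines or entry.rstrip("/") in lines:
        return False
    # Make sure the new entry starts on its own line.
    separator = "" if text == "" or text.endswith("\n") else "\n"
    gitignore.write_text(text + separator + entry + "\n")
    return True


# Demonstration in a throwaway directory standing in for a repository root.
with tempfile.TemporaryDirectory() as tmp:
    first = ensure_ignored(tmp)   # adds the entry
    second = ensure_ignored(tmp)  # entry already present, nothing to do
    contents = (Path(tmp) / ".gitignore").read_text()
```

Note that an ignore entry only affects untracked files; if the folder was already committed, it would additionally need to be removed from the index (for example with `git rm -r --cached data`).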

Continue reading →

Tanimoto similarity of ECFPs with RDKit: Common pitfalls

A common measure for the similarity of two molecules is the Tanimoto similarity of their ECFPs (Extended Connectivity FingerPrint). However, there is no clear standard in literature for what kind of ECFPs should be used when calculating the Tanimoto similarity, and that choice can lead to substantially different results. In this post I wish to shed light on some results you should know about before you jump into your calculations.
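To make the point concrete, here is a small pure-Python sketch (deliberately not using RDKit) of the Tanimoto similarity, computed as the number of shared on-bits divided by the number of on-bits set in either fingerprint. The substructure identifiers are made up, and the modulo-based `fold` function is an assumed stand-in for how identifiers might be mapped onto a fixed-length fingerprint; it is only meant to show that collisions after folding can change the similarity.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # assumed convention for two empty fingerprints
    return len(a & b) / len(a | b)


def fold(bits: set, length: int) -> set:
    """Fold identifiers into a fixed-length fingerprint by taking them modulo `length`.

    Illustrative assumption, not necessarily what any particular library does.
    """
    return {bit % length for bit in bits}


# Hypothetical substructure identifiers for two molecules.
mol_a = {3, 17, 42, 1029, 2050}
mol_b = {3, 17, 1066, 4100}

unfolded = tanimoto(mol_a, mol_b)                        # 2 shared / 7 total
folded = tanimoto(fold(mol_a, 1024), fold(mol_b, 1024))  # 3 shared / 6 total

print(f"unfolded: {unfolded:.3f}, folded to 1024 bits: {folded:.3f}")
```

In this toy case the identifiers 42 and 1066 land on the same bit after folding, so the two molecules appear more similar (0.5) than they do with unfolded identifiers (about 0.286).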

A blog post on how ECFPs are generated was written by Marcus Dablander in 2022 so please take a look at that. In short, ECFPs have a hyperparameter called the radius r, and sometimes a fingerprint length L. Each entry in the fingerprint indicates the presence or absence of a particular substructure in the molecule of interest, and the radius r defines how large the substructures that you consider are. If you have r=3 then you consider substructures made by going up to three hops away from each atom in your molecule. This is best explained by this figure from Marcus’ post:

Continue reading →

I really hope my compounds get the green light

As a cheminformatician in a drug discovery campaign or an algorithm developer making the perfect Figure 1, when one generates a list of compounds for a given target there is a deep desire that the compounds are well received by the reviewer, be it a med chemist on the team or a peer reviewer. This is despite scientific rigour and training and is due to the time invested. So to avoid the slightest shadow of med chem grey zone, here is a hopefully handy filter against common medchem grey-zone groups.

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends
