Category Archives: AI

The “AI-ntibody” Competition: benchmarking in silico antibody screening/design

We recently contributed to a communication in Nature Biotechnology detailing an upcoming competition coordinated by Specifica to evaluate the relative performance of in vitro display and in silico methods at identifying target-specific antibody binders and performing downstream antibody candidate optimisation.

Following in the footsteps of tournaments such as the Critical Assessment of Structure Prediction (CASP), which have led to substantial breakthroughs in computational methods for biomolecular structure prediction, the AI-ntibody initiative seeks to establish a periodic benchmarking exercise for in silico antibody discovery/design methods. It should help to identify the most significant breakthroughs in the space and orient future methods’ development.

Continue reading →

Navigating Hallucinations in Large Language Models: A Simple Guide

AI is moving fast, and large language models (LLMs) are at the centre of it all, doing everything from generating coherent, human-like text to tackling complex coding challenges. And this is just scratching the surface—LLMs are popping up everywhere, and their list of talents keeps growing by the day.

However, these models aren’t infallible. One of their most intriguing and concerning quirks is the phenomenon known as “hallucination” – instances where the AI confidently produces information that is fabricated or factually incorrect. As we increasingly rely on AI-powered systems in our daily lives, understanding what hallucinations are is crucial. This post briefly explores LLM hallucinations, exploring what they are, why they occur, and how we can navigate them and get the most out of our new favourite tools.

Continue reading →

Protein Property Prediction Using Graph Neural Networks

Proteins are fundamental biological molecules whose structure and interactions underpin a wide array of biological functions. To better understand and predict protein properties, scientists leverage graph neural networks (GNNs), which are particularly well-suited for modeling the complex relationships between protein structure and sequence. This post will explore how GNNs provide a natural representation of proteins, the incorporation of protein language models (PLLMs) like ESM, and the use of techniques like residual layers to improve training efficiency.

Why Graph Neural Networks are Ideal for Representing Proteins

Graph Neural Networks (GNNs) have emerged as a promising framework to fuse primary and secondary structure representation of proteins. GNNs are uniquely suited to represent proteins by modeling atoms or residues as nodes and their spatial connections as edges. Moreover, GNNs operate hierarchically, propagating information through the graph in multiple layers and learning representations of the protein at different levels of granularity. In the context of protein property prediction, this hierarchical learning can reveal important structural motifs, local interactions, and global patterns that contribute to biochemical properties.

Continue reading →

Aider and Cheap, Free, and Local LLMs

Aider and the Future of Coding: Open-Source, Affordable, and Local LLMs

The landscape of AI coding is rapidly evolving, with tools like Cursor gaining popularity for multi-file editing and copilot for AI-assisted autocomplete. However, these solutions are both closed-source and require a subscription.

This blog post will explore Aider, an open-source AI coding tool that offers flexibility, cost-effectiveness, and impressive performance, especially when paired with affordable, free, and local LLMs like DeepSeek, Google Gemini, and Ollama.

Continue reading →

Out of the box RDKit-valid is an imperfect metric: a review of the KekulizeException and nitrogen protonation to correct this

In deep learning based compound generation models the metric of fraction of RDKit-valid compounds is ubiquitous, but is problematic from the cheminformatics viewpoint as a large fraction may be driven by pyrrolic nitrogens (see below) rather than Texas carbons (carbon with 5 bonds like the Star of Texas). In RDKit, no error is more irksome that the KekulizeException or ValenceException from RDKit sanitisation. These are raised when the molecule is not correct. This would make the RDKit-valid a good metric, except for a small detail: the validity is as interpreted from the the stated implicit and explicit hydrogens and formal charges on the atoms, which most models do not assign. Therefore, a compound may not be RDKit-valid because it is actually impossible, like a Texas carbon, but in many cases it is because the formal charge or implicit hydrogen numbers of some atoms are incorrect. In both case, the major culprit is nitrogen. Herein I go through what they are and how to fix them, with a focus on aromatic nitrogens.

Continue reading →

Five-word stories about a world where AI dominates the world

“For sale: baby shoes, never worn.” ~ Ernest Hemingway??

This is a six-word story famously misattributed to Ernest Hemingway. According to Wikipedia, this story first appeared in 1906, when Hemingway was 7 years old, and later attributed to him in 1991, 30 years after his death. So, no chance it was his.

Regardless of its origin, I found this type of story very creative.

In this blog post, as the title says, I will dare to push the boundary to present 5-word stories on the topic of AI taking over the world, BUT with a humorous spin.

Continue reading →

Incorporating conformer ensembles for better molecular representation learning

Conformer ensemble of tryptophan from Seibert et. al.

The spatial or 3D structure of a molecule is particularly relevant to modeling its activity in QSAR. The 3D structural information affects molecular properties and chemical reactivities and thus it is important to incorporate them in deep learning models built for molecules. A key aspect of the spatial structure of molecules is the flexible distribution of their constituent atoms known as conformation. Given the temperature of a molecular system, the probability of each of its possible conformation is defined by its formation energy and this follows a Boltzmann distribution [McQuarrie and Simon, 1997]. The Boltzmann distribution tells us the probability of a certain confirmation given its potential energy. The different conformations of a molecule could result in different properties and activity. Therefore, it is imperative to consider multiple conformers in molecular deep learning to ensure that the notion of conformational flexibility is embedded in the model developed. The model should also be able to capture the Boltzmann distribution of the potential energy related to the conformers.

Continue reading →

Architectural highlights of AlphaFold3

DeepMind and Isomophic Labs recently published the methods behind AlphaFold3, the sequel to the famous AlphaFold2. The involvement of Isomorphic Labs signifies a shift that Alphabet is getting serious about drug design. To this end, AlphaFold3 provides a substantial improvement in the field of complex prediction, a major piece in the computational drug design pipeline.

Continue reading →

Conference summary: Generative AI in Life Science

This year I attended the second edition of Generative AI in Life Science (GenLife – https://genlife.dk/) and it was an enriching experience that I thoroughly enjoyed. Held in Copenhagen, the event brought together researchers from different areas of AI applied to the life sciences and provided a fantastic platform for networking, learning and sharing ideas. The programme included a mix of long and short talks from experts in the field, but also had a significant presence of emerging PIs, making the conference a perfect place to discover emerging groups in the field. Here I have collected some highlights of the talks I have enjoyed the most at the conference.

Continue reading →

My take on the Collaborations Workshop (CW) 2024

At the end of April, I attended the CW 2024. This yearly hybrid event organised by the Software Sustainability Institute (SSI) has been running since 2011! The event brings people together to discuss best practices and the future of software in research. This year’s event themes were (1) AI/ML tools for Science, (2) Citizen Science and (3) Environmental sustainability.

As a Research Software Engineer (RSE) working with OPIG, I felt a great curiosity to attend and find out what I could bring of use to the group, as most people work on AI/ML applications. In this blog post, I share a few bits of the event which resonated with me and I found most interesting and relevant to share with my group.

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends

Category Archives: AI

The “AI-ntibody” Competition: benchmarking in silico antibody screening/design

Navigating Hallucinations in Large Language Models: A Simple Guide

Protein Property Prediction Using Graph Neural Networks

Why Graph Neural Networks are Ideal for Representing Proteins

Aider and Cheap, Free, and Local LLMs

Aider and the Future of Coding: Open-Source, Affordable, and Local LLMs

Out of the box RDKit-valid is an imperfect metric: a review of the KekulizeException and nitrogen protonation to correct this

Five-word stories about a world where AI dominates the world

Incorporating conformer ensembles for better molecular representation learning

Architectural highlights of AlphaFold3

Conference summary: Generative AI in Life Science

My take on the Collaborations Workshop (CW) 2024