Category Archives: Proteins

Visualising and validating differences between machine learning models on small benchmark datasets

Introduction
Author

Sam Money-Kyrle

Introduction

An epidemic is sweeping through cheminformatics (and machine learning) research: ugly results tables. These tables are typically bloated with metrics (such as regression and classification metrics next to each other), vastly differing tasks, erratic bold text, and many models. As a consequence, results become difficult to analyse and interpret. Additionally, it is rare to see convincing evidence, such as statistical tests, for whether one model is ‘better’ than another (something Pat Walters has previously discussed). Tables are a practical way to present results and are appropriate in many cases; however, this practicality should not come at the cost of clarity.

The terror of ugly tables extends to benchmark leaderboards, such as Therapeutic Data Commons (TDC). These leaderboard tables do not show:

  1. whether differences in metrics between methods are statistically significant,
  2. whether methods use ensembles or single models,
  3. whether methods use classical (such as Morgan fingerprints) or learned (such as Graph Neural Networks) representations,
  4. whether methods are pre-trained or not,
  5. whether pre-trained models are supervised, self-supervised, or both,
  6. the data and tasks that pre-trained models are pre-trained on.

This lack of context makes meaningful comparisons between approaches challenging, obscuring whether performance discrepancies are due to variance, ensembling, overfitting, exposure to more data, or novelties in model architecture and molecular featurisation. Confirming the statistical significance of performance differences (under consistent experimental conditions!) is crucial in constructing a more lucid picture of machine learning in drug discovery. Using figures to share results in a clear, non-tabular format would also help.

Statistical validation is particularly relevant in domains with small datasets, such as drug discovery, as the small number of test samples leads to high variance in performance between different splits. Recent work by Ash et al. (2024) sought to alleviate the lack of statistical validation in cheminformatics by sharing a helpful set of guidelines for researchers. Here, we explore implementing some of the methods they suggest (plus some others) in Python.

Continue reading

Protein Property Prediction Using Graph Neural Networks

Proteins are fundamental biological molecules whose structure and interactions underpin a wide array of biological functions. To better understand and predict protein properties, scientists leverage graph neural networks (GNNs), which are particularly well-suited for modeling the complex relationships between protein structure and sequence. This post will explore how GNNs provide a natural representation of proteins, the incorporation of protein language models (PLLMs) like ESM, and the use of techniques like residual layers to improve training efficiency.

Why Graph Neural Networks are Ideal for Representing Proteins

Graph Neural Networks (GNNs) have emerged as a promising framework to fuse primary and secondary structure representation of proteins. GNNs are uniquely suited to represent proteins by modeling atoms or residues as nodes and their spatial connections as edges. Moreover, GNNs operate hierarchically, propagating information through the graph in multiple layers and learning representations of the protein at different levels of granularity. In the context of protein property prediction, this hierarchical learning can reveal important structural motifs, local interactions, and global patterns that contribute to biochemical properties.

Continue reading

Walk through a cell

In 2022, Maritan et al. released the first ever macromolecular model of an entire cell. The cell in question is a bacterial cell from the genus Mycoplasma. If you’re a biologist, you likely know Mycoplasma as a common cell culture contaminant.

Now, through the work of app developer Timothy Davison, you can interactively explore this cell model from the comfort of your iPhone or Apple Vision Pro. Here are three reasons why I like CellWalk:

1. It’s pretty

The visuals of CellWalk are striking. The app offers a rich depiction of the cell, allowing the user to zoom from the whole cell to individual atoms. I spent a while clicking through each protein I could see to see if I could guess what it was or what it did. Zooming out, CellWalk offers a beautiful tripartite cross section of the cell, showing first the lipid membrane, then a colourful jumble-bag of all its cellular proteins, and then finally the spaghetti-like polynucleic acids.

Tripartite cross section of a Mycoplasma cell. Screengrab taken from the CellWalk app on my phone.
Continue reading

Deliberately misfolding prions to find the golden thread.

Prion are both fascinating and terrifying. They occur naturally and have a purpose, but what that purpose is we’re still not entirely sure. Gene-knockout mice which no longer code for the prion protein do live, but they ain’t born typical.

The endogenous form of the prion protein (PrPC) can, through currently unknown mechanisms, take a different conformation, the pathogenic PrPSc. PrPSc is responsible for fatal, rapidly progressing neurodegenerative disorders which in many cases can jump species.

At OPIG, we recently discussed a remarkably rigorous series of experiments outlined in the paper “A Protein Misfolding Shaking Amplification-based method for the spontaneous generation of hundreds of bona fide prions” Whilst deliberately creating new pathogenic prions may seem and odd thing to wish to achieve, the authors aimed to determine if there was a golden thread linking “infectivity determinants, interspecies transmission barriers or the structural influence of specific amino acids”.

Continue reading

Conference summary: Generative AI in Life Science

This year I attended the second edition of Generative AI in Life Science (GenLife – https://genlife.dk/) and it was an enriching experience that I thoroughly enjoyed. Held in Copenhagen, the event brought together researchers from different areas of AI applied to the life sciences and provided a fantastic platform for networking, learning and sharing ideas. The programme included a mix of long and short talks from experts in the field, but also had a significant presence of emerging PIs, making the conference a perfect place to discover emerging groups in the field. Here I have collected some highlights of the talks I have enjoyed the most at the conference.

Continue reading

Inverse Vaccines

One of the nice things about OPIG, is that you can talk about something which is outside of your wheelhouse without feeling that the specialists in the group are going to eat your lunch. Last week, I gave an overview of the Hubbell group‘s Nature paper Synthetically glycosylated antigens for the antigen-specific suppression of established immune responses. I am not an immunologist by any stretch of the imagination, but sometimes you come across a piece of really interesting science and just want to say to people: Have you seen this, look at this, it’s really clever!

Continue reading

The stuff MDAnalysis didn’t implement: CPU Parallel HOLE conductance analysis

Some time ago, I needed to find a way to computationally estimate conductance values for every protein frame from several molecular dynamics (MD) trajectories.

In a previous post, I wrote about how to clean the resulting instant conductance timeseries from outliers. But, I never described how I generated these timeseries.

In this post, I will show how you can parallelise the computation of instant conductance given an MD trajectory. I will touch on the difficulties of this process. And why I had to implement a custom tool for it given that MDAnalysis seems to already have implemented a routine of this sort. Finally, I will provide two Python scripts that you can easily adapt to run your parallel calculations – for which I’ll provide some important notes you don’t wanna skip.

Violin plots of conductance distributions from 64 molecular dynamic trajectories with 1000 frames each.
Continue reading

What can you do with the OPIG Immunoinformatics Suite? v3.0

OPIG’s growing immunoinformatics team continues to develop and openly distribute a wide variety of databases and software packages for antibody/nanobody/T-cell receptor analysis. Below is a summary of all the latest updates (follows on from v1.0 and v2.0).

Continue reading

9th Joint Sheffield Conference on Cheminformatics

Over the next few days, researchers from around the world will be gathering in Sheffield for the 9th Joint Sheffield Conference on Cheminformatics. As one of the organizers (wearing my Molecular Graphics and Modeling Society ‘hat’), I can say we have an exciting array of speakers and sessions:

  • De Novo Design
  • Open Science
  • Chemical Space
  • Physics-based Modelling
  • Machine Learning
  • Property Prediction
  • Virtual Screening
  • Case Studies
  • Molecular Representations

It has traditionally taken place every three years, but despite the global pandemic it is returning this year, once again in person in the excellent conference facilities at The Edge. You can download the full programme in iCal format, and here is the conference calendar:

Continue reading