Category Archives: Deep Learning

Protein Property Prediction Using Graph Neural Networks

Proteins are fundamental biological molecules whose structure and interactions underpin a wide array of biological functions. To better understand and predict protein properties, scientists leverage graph neural networks (GNNs), which are particularly well-suited for modeling the complex relationships between protein structure and sequence. This post will explore how GNNs provide a natural representation of proteins, the incorporation of protein language models (PLLMs) like ESM, and the use of techniques like residual layers to improve training efficiency.

Why Graph Neural Networks are Ideal for Representing Proteins

Graph Neural Networks (GNNs) have emerged as a promising framework to fuse primary and secondary structure representation of proteins. GNNs are uniquely suited to represent proteins by modeling atoms or residues as nodes and their spatial connections as edges. Moreover, GNNs operate hierarchically, propagating information through the graph in multiple layers and learning representations of the protein at different levels of granularity. In the context of protein property prediction, this hierarchical learning can reveal important structural motifs, local interactions, and global patterns that contribute to biochemical properties.

Continue reading

Incorporating conformer ensembles for better molecular representation learning

Conformer ensemble of tryptophan from Seibert et. al.

The spatial or 3D structure of a molecule is particularly relevant to modeling its activity in QSAR. The 3D structural information affects molecular properties and chemical reactivities and thus it is important to incorporate them in deep learning models built for molecules. A key aspect of the spatial structure of molecules is the flexible distribution of their constituent atoms known as conformation. Given the temperature of a molecular system, the probability of each of its possible conformation is defined by its formation energy and this follows a Boltzmann distribution [McQuarrie and Simon, 1997]. The Boltzmann distribution tells us the probability of a certain confirmation given its potential energy. The different conformations of a molecule could result in different properties and activity. Therefore, it is imperative to consider multiple conformers in molecular deep learning to ensure that the notion of conformational flexibility is embedded in the model developed. The model should also be able to capture the Boltzmann distribution of the potential energy related to the conformers.

Continue reading

Architectural highlights of AlphaFold3

DeepMind and Isomophic Labs recently published the methods behind AlphaFold3, the sequel to the famous AlphaFold2. The involvement of Isomorphic Labs signifies a shift that Alphabet is getting serious about drug design. To this end, AlphaFold3 provides a substantial improvement in the field of complex prediction, a major piece in the computational drug design pipeline.

Continue reading

The Tale of the Undead Logger

A picture of a scary-looking zombie in a lumberjack outfit holding an axe, in the middle of a forest at night, staring menacingly at the viewer.
Fear the Undead Logger all ye who enter here.
For he may strike, and drain the life out nodes that you hold dear.
Among the smouldering embers of jobs you thought long dead,
he lingers on, to terrorise, and cause you frightful dread.
But hark ye all my tale to save you from much pain,
and fight ye not anew the battles I have fought in vain.

Or simply…

… Tips and Tricks to Use When wandb Logger Just. Won’t. DIE.

The Weights and Biases Logger (illustrated above by DALL-E; admittedly with some artistic license) hardly requires introduction. It’s something of an industry standard at this point, well-regarded for the extensive (and extensible) functionality of its interactive dashboard; for advanced features like checkpointing model weights in the cloud and automating hyperparameter sweeps; and for integrating painlessly with frameworks like PyTorch and PyTorch Lightning. It simplifies your life as an ML researcher enormously by making it easy to track and compare experiments, monitor system resource usage, all while giving you very fun interactive graphs to play with.
Plot arbitrary quantities you may be logging against each other, interactively, on the fly, however you like. In Dark Mode, of course (you’re a professional, after all). Here’s a less artistic impression to give you an idea, should you have been living under a rock:

Continue reading

Dockerized Colabfold for large-scale batch predictions

Alphafold is great, however it’s not suited for large batch predictions for 2 main reasons. Firstly, there is no native functionality for predicting structures off multiple fasta sequences (although a custom batch prediction script can be written pretty easily). Secondly, the multiple sequence alignment (MSA) step is heavy and running MSAs for, say, 10,000 sequences at a tractable speed requires some serious hardware.

Fortunately, an alternative to Alphafold has been released and is now widely used; Colabfold. For many, Colabfold’s primary strength is being cloud-based and that prediction requests can be submitted on Google Colab, thereby being extremely user-friendly by avoiding local installations. However, I would argue the greatest value Colabfold brings is a massive MSA speed up (40-60 fold) by replacing HHBlits and BLAST with MMseq2. This, and the fact batches of sequences can be natively processed facilitates a realistic option for predicting thousands of structures (this could still take days on a pair of v100s depending on sequence length etc, but its workable).

In my opinion the cleanest local installation and simplest usage of Colabfold is via Docker containers, for which both a Dockerfile and pre-built docker image have been released. Unfortunately, the Docker image does not come packaged with the necessary setup_databases.sh script, which is required to build a local sequence database. By default the MSAs are run on the Colabfold public server, which is a shared resource and can only process a total of a few thousand MSAs per day.

The following accordingly outlines preparatory steps for 100% local, batch predictions (setting up the database can in theory be done in 1 line via a mount, but I was getting a weird wget permissions error so have broken it up to first fetch the file on the local):

Pull the relevant colabfold docker image (container registry):

docker pull ghcr.io/sokrypton/colabfold:1.5.5-cuda12.2.2

Create a cache to store weights:

mkdir cache

Download the model weights:

docker run -ti --rm -v path/to/cache:/cache ghcr.io/sokrypton/colabfold:1.5.5-cuda12.2.2 python -m colabfold.download

Fetch the setup_databases.sh script

wget https://github.com/sokrypton/ColabFold/blob/main/setup_databases.sh 

Spin up a container. The container will exit as soon as the first command is run, so we need to be a bit hacky by running an infinite command in the background:

CONTAINER_ID=$(docker run -d ghcr.io/sokrypton/colabfold:1.5.5 cuda12.2.2 /bin/bash -c "tail -f /dev/null")

Copy the setup_databases.sh script to the relevant path in the container and create a databases directory:

docker cp ./setup_databases.sh $CONTAINER_ID:/usr/local/envs/colabfold/bin/ 
docker exec $CONTAINER_ID mkdir /databases

Run the setup script. This will download and prepare the databases (~2TB once extracted):

docker exec $CONTAINER_ID /usr/local/envs/colabfold/bin/setup_databases.sh /databases/ 

Copy the databases back to the host and clean up:

docker cp $CONTAINER_ID:/databases ./ 
docker stop $CONTAINER_ID
docker rm $CONTAINER_ID

You should now be at a stage where batch predictions can be run, for which I have provided a template script (uses a fasta file with multiple sequences) below. It’s worth noting that maximum search speeds can be achieved by loading the database into memory and pre-indexing, but this requires about 1TB of RAM, which I don’t have.

There are 2 key processes that I prefer to log separately, colabfold_search and colabfold_batch:

#!/bin/bash

# Define the paths for database, input FASTA, and outputs

db_path="path/to/database"
input_fasta="path/to/fasta/file.fasta"
output_path="path/to/output/directory"
log_path="path/to/logs/directory"
cache_path="path/to/weights/cache"

# Run Docker container to execute colabfold_search and colabfold_batch 

time docker run --gpus all -v "${db_path}:/database" -v "${input_fasta}:/input.fasta" -v "${output_path}:/predictions" -v "${log_path}:/logs" -v "${cache_path}:/cache"
 ghcr.io/sokrypton/colabfold:1.5.5-cuda12.2.2 /bin/bash -c "colabfold_search --mmseqs /usr/local/envs/colabfold/bin/mmseqs /input.fasta /database msas > /logs/search.log 2>&1 && colabfold_batch msas /predictions > /logs/batch.log 2>&1"

Pitfalls of using Pearson’s correlation for comparing model performance

Pearson’s R (correlation coefficient) is a measure of the linear correlation between two variables, giving a value between -1 and 1, where 1 is total positive linear correlation, 0 is no linear correlation, and -1 is total negative linear correlation. While it’s a useful statistic for understanding the relationship between two variables, it is often used to compare the performance of two or more models. For example, imagine we had experimental values that we are predicting and several models’ predictions. Obviously, we would prefer the model with the highest Pearson’s R … or perhaps not?

Continue reading

3 approaches to linear-memory Transformers

Transformers are a very popular architecture for processing sequential data, notably text and (our interest) proteins. Transformers learn more complex patterns with larger models on more data, as demonstrated by models like GPT-4 and ESM-2. Transformers work by updating tokens according to an attention value computed as a weighted sum of all other tokens. In standard implentations this requires computing the product of a query and key matrix which requires O(N2d) computations and, problematically, O(N2) memory for a sequence of length N and an embedding size of d. To speed up Transformers, and to analyze longer sequences, several variants have been proposed which require only O(N) memory. Broadly, these can be divided into sparse methods, softmax-approximators, and memory-efficient Transformers.

Continue reading

Navigating the world of GNN layers with PyTorch Geometric

Data can often naturally be represented in a graph format and being able to directly employ a deep learning architecture on that data without finding a different representation is an appealing idea. Graph neural networks (GNNs) have become a standard part of the ML toolbox but navigating the world of different architectures available out-of-the-box can be a daunting task. A great place to start looking for architectures is with PyTorch Geometric, which provides an extensive list of readily available GNN layers and tutorials on how to use them in your standard PyTorch models. There are many things to consider when choosing a GNN layer, but the two considerations that I think are a great place to start are expressiveness and edge feature handling. In general, it is hard to predict what will work best for the task at hand and hence it’s optimal to try a wide range of different layers. This blogpost is meant as a brief introduction for what I would find useful to know before I started using GNNs, and a starting point for exploring the GNN literature.

Continue reading

Optimising Transformer Training

Training a large transformer model can be a multi-day, if not multi-week, ordeal. Especially if you’re using cloud compute, this can be a very expensive affair, not to mention the environmental impact. It’s therefore worth spending a couple days trying to optimise your training efficiency before embarking on a large scale training run. Here, I’ll run through three strategies you can take which (hopefully) shouldn’t degrade performance, while giving you some free speed. These strategies will also work for any other models using linear layers.

I wont go into too much of the technical detail of any of the techniques, but if you’d like to dig into any of them further I’d highly recommend the Nvidia Deep Learning Performance Guide.

Training With Mixed Precision

Training with mixed precision can be as simple as adding a few lines of code, depending on your deep learning framework. It also potentially provides the biggest boost to performance of any of these techniques. Training throughput can be increase by up to three-fold with little degradation in performance – and who doesn’t like free speed?

Continue reading

A Seq2Seq model for ETF forecasting

Owing to the misguided belief that I can achieve the impossible, I decided to build a model with the goal of beating the stock market.

Strap in, we’re about to get rich.

Machine learning is increasingly being employed by hedge funds to help mitigate risk and identify patterns and opportunities, whether this is for optimisation of algo trading strategies, fraud detection, high-frequency trading, or sentiment analysis. Arguably the most obvious, difficult, and naïve application of fintech ML is direct stock market forecasting – sounds like the perfect place to start.

Target

First things first, we need to decide on a stock to forecast. Volatility provides opportunities, but predictable volatility is even better. We need a security that swings in response to actual, reported events, and one whose trends roughly move somehow with other stocks – our hypothesis being that wider events in the market can be used to forecast a single security. SPDR GLD seems like a reasonable option – gold is such a popular hedge against global instability it’s price usually moves in the opposite direction to stocks such as DJIA or SP500 and moves with global disaster.

Gold price (/oz) in Pounds from 1980-2024

Continue reading