Category Archives: Code

SSH, the boss-fight level: Jupyter notebooks from compute nodes

Secure shell (SSH) is an essential tool for remote operations. However, not everything with it is smooth-sailing. Especially, when you want to do things like reverse–port-forwarding via a proxy-hump or two a Jupyter notebook to your local machine from a compute node on a no-home container . Even if it sounds less plausible than the exploits on Mr Robot, it actually can work and requires zero social-engineering or sneaking in server rooms to install Raspberry Pis while using a baseball cap as a disguise.

Continue reading →

Deploying a Flask app part I: the gunicorn WSGI server

Last year I wrote a post about deploying Flask apps with Apache/mod_wsgi when your app’s dependencies are installed in a conda environment. The year before, in the dark times, I wrote a post about the black magic invocations required to get multiple apps running stably using mod_wsgi. I’ve since moved away from mod_wsgi entirely and switched to running Flask apps from containers using the gunicorn WSGI server behind an Apache reverse proxy, which has made life immeasurably easier. In this post we’ll cover running a Flask app on localhost using gunicorn; in Part II we’ll run our app as a service using Singularity and deploy it to production using Apache as a HTTP proxy server.

Continue reading →

GitHub.dev: Just press “.”

GitHub.dev is an incredibly useful feature in GitHub which allows you to view and edit code directly on the browser as a remote VSCode session.

To access this remote VSCode session, either:

Press “.”
Change “.com” to “.dev” in the URL

This is a great way to quickly explore someone’s code without having to clone it onto your machine or go through the GitHub UI.

LightningCLI, my new best friend

If you’ve ever worked on machine learning projects, you’ll know that training models is just one aspect of the process. Code setup, configuration management, and ensuring reproducibility can also take up a lot of time. I’m a big fan of PyTorch Lightning primarily because it hides most of the boilerplate code you usually need, making your code more modular and readable. It even allows you to train your models on multiple GPUs with ease. All of this comes with the minor trade-off of learning an intuitive API, which can be easily extended to tweak any low-level details for those rare cases where the standard API falls short.

However, despite finding PyTorch Lightning incredibly useful, there’s one aspect that has always bothered me: the configuration of the model and training hyperparameters in a flexible and reproducible manner. In my view, the best approach to address this is to use configuration files for the various modules involved. These files can be easily overridden at runtime using command-line arguments or environment variables. To achieve this, I developed my own packages, configfile and argParseFromDoc, which facilitates this process.

But now, there’s a tool within the Lightning suite that offers all these features in a seamlessly integrated package. Allow me to introduce you to LightningCLI. This tool streamlines the process of hyperparameter configuration, making it both flexible and reproducible. With LightningCLI, you get the best of both worlds: the power of PyTorch Lightning and a hassle-free setup.

The core idea here is to write a config file (or several) that contains the required parameters for the trainer, the model and the dataset. This is done as yaml files with the following structure.

trainer:
  logger: true
  ...
model:
  out_dim: 10
  learning_rate: 0.02
data:
  data_dir: ./
  image_size: 256
ckpt_path: null

…

Where the yaml fields should correspond to the parameters of the PytorchLightning Trainer, and your custom Model and Data classes, that inherit from LightningModule and LightningDataModule. So a full self-contained example could be

import lightning.pytorch as pl
from lightning.pytorch.cli import LightningCLI
class MyModel(pl.LightningModule):
    def __init__(self, out_dim: int, learning_rate: float):
        super().__init__()
        self.save_hyperparameters()
        self.out_dim = out_dim
        self.learning_rate = learning_rate
        self.model = create_my_model(out_dim)
    def training_step(self, batch, batch_idx):
        out = self.model(batch.x)
        loss = self.compute_loss(out, batch.y)
        return loss
class MyDataModule(pl.LightningDataModule):
    def __init__(self, data_dir: str, image_size: int):
        super().__init__()
        self.data_dir = data_dir
        self.image_size = image_size
    def train_dataloader(self):
        return create_dataloader(self.image_size, self.data_dir)

def main():
    cli = LightningCLI(model_class=MyModel, datamodule_class=MyDataModule)
if __name__ == "__main__":
    main()

That can be run easily as

python scrip.py --config config.yaml fit

What is even better is that you can split the configuration into several config files and that the configuration files can refer to Python classes to be instantiated, making this configuration system so flexible that you can literally configure everything you can imagine.

model:
  class_path: model.MyModel2
  init_args:
    learning_rate: 0.2
    loss:
      class_path: torch.nn.CrossEntropyLoss
      init_args:
        reduction: mean

In conclusion, LightningCLI brings the convenience of configuration management, command-line flexibility, and reproducibility to your PyTorch Lightning projects. With simple yet powerful features, it’s a tool that should be part of any machine learning engineer’s toolkit.

AI Can’t Believe It’s Not Butter

Recently, I’ve been using a Convolutional Neural Network (CNN), and other methods, to predict the binding affinity of antibodies from their sequence. However, nine months ago, I applied a CNN to a far more important task – distinguishing images of butter from margarine. Please check out the GitHub link below to learn moo-re.

https://github.com/lewis-chinery/AI_cant_believe_its_not_butter

The dangers of Conda-Pack and OpenMM

If you are running lots of little jobs in SLURM and want to make use of free nodes that suddenly become available, it is helpful to have a way of rapidly shipping your environments that does not rely on installing conda or rebuilding the environment from scratch every time. This is useful with complex rebuilds where exported .yml files do not always work as expected, even when specifying exact versions and source locations.

In these situations a tool such a conda-pack becomes incredibly useful. Once you have perfected the house of cards that is your conda environment, you can use conda-pack to save that exact state as a tar.gz file.

conda-pack -n my_precious_env -o my_precious_env.tar.gz

This can provide you with a backup to be used when you accidentally delete conda from your system, or if you irreparable corrupt the environment and cannot roll back to the point in time when everything worked. These tar.gz files can also be copied to distant locations by the use of rsync or scp, unpacked, sourced and used without installing conda…

Continue reading →

Customising MCS mapping in RDKit

Finding the parts in common between two molecules appears to be a straightforward, but actually is a maze of layers. The task, maximum common substructure (MCS) searching, in RDKit is done by Chem.rdFMCS.FindMCS, which is highly customisable with lots of presets. What if one wanted to control in minute detail if a given atom X and is a match for atom Y? There is a way and this is how.

Continue reading →

Exploring the Observed Antibody Space (OAS)

The Observed Antibody Space (OAS) [1,2] is an amazing resource for investigating observed antibodies or as a resource for training antibody specific models, however; its size (over 2.4 billion unpaired and 1.5 million paired antibody sequences as of June 2023) can make it painful to work with. Additionally, OAS is extremely information rich, having nearly 100 columns for each antibody heavy or light chain, further complicating how to handle the data.

From spending a lot of time working with OAS, I wanted to share a few tricks and insights, which I hope will reduce the pain and increase the joy of working with OAS!

Continue reading →

Pairwise sequence identity and Tanimoto similarity in PDBbind

In this post I will cover how to calculate sequence identity and Tanimoto similarity between any pairs of complexes in PDBbind 2020. I used RDKit in python for Tanimoto similarity and the MMseqs2 software for sequence identity calculations.

A few weeks back I wanted to cluster the protein-ligand complexes in PDBbind 2020, but to achieve this I first needed to precompute the sequence identity between all pairs sequences in PDBbind, and Tanimoto similarity between all pairs of ligands. PDBbind 2020 includes 19.443 complexes but there are much fewer distinct ligands and proteins than that. However, I kept things simple and calculated the similarities for all 19.443*19.443 pairs. Calculating the Tanimoto similarity is relatively easy thanks to the BulkTanimotoSimilarity function in RDKit. The following code should do the trick:

from rdkit.Chem import AllChem, MolFromMol2File
from rdkit.DataStructs import BulkTanimotoSimilarity
import numpy as np
import os

fps = []
for pdb in pdbs:
    mol = MolFromMol2File(os.path.join('data', pdb, f'{pdb}_ligand.mol2'))
    fps.append(AllChem.GetMorganFingerprint(mol, 3))

sims = []
for i in range(len(fps)):
    sims.append(BulkTanimotoSimilarity(fps[i],fps))

arr = np.array(sims)
np.savez_compressed('data/tanimoto_similarity.npz', arr)

Sequence identity calculations in python with Biopandas turned out to be too slow for this amount of data so I used the ultra fast MMseqs2. The first step to running MMseqs2 is to create a .fasta file of all the sequences, which I call QUERY.fasta. This is what the first few lines look like:

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends

Category Archives: Code

SSH, the boss-fight level: Jupyter notebooks from compute nodes

Deploying a Flask app part I: the gunicorn WSGI server

GitHub.dev: Just press “.”

LightningCLI, my new best friend

AI Can’t Believe It’s Not Butter

The dangers of Conda-Pack and OpenMM

Customising MCS mapping in RDKit

Exploring the Observed Antibody Space (OAS)

Pairwise sequence identity and Tanimoto similarity in PDBbind