Category Archives: Code

Some useful pandas functions

Pandas is one of the most used packages for data analysis in Python. The library provides functionality that lets you perform complex data manipulation operations in a few lines of code. However, as the number of functions provided is huge, it is impossible to keep track of all of them. More often than we’d like to admit, we end up writing lines and lines of code, only to discover later that the same operation can be performed with a single pandas function.

To help avoid this problem in the future, I will run through some of my favourite pandas functions and demonstrate their use on an example data set containing information on crystal structures in the PDB.


Let your library design blosum

During the lead optimisation stage of the drug discovery pipeline, we might wish to make mutations to an initially identified binding antibody to improve properties such as developability, immunogenicity, and affinity.

There are many ways we could go about suggesting these mutations, including Large Language Models (e.g. ESM and AbLang) or Inverse Folding methods (e.g. ProteinMPNN and AntiFold). However, some of our recent work (soon to be pre-printed) has shown that classical non-Machine Learning approaches, such as BLOSUM, could also be worth considering at this stage.


Converting pandas DataFrames into Publication-Ready Tables

Analysing, comparing and communicating the predictive performance of machine learning models is a crucial component of any empirical research effort. Pandas, a staple in the Python data analysis stack, not only helps with the data wrangling itself, but also provides efficient solutions for data presentation. Two of its lesser-known yet incredibly useful features are df.to_markdown() and df.to_latex(), which allow for a seamless transition from DataFrames to publication-ready tables. Here’s how you can use them!


What the heck are TPUs?

I recently became curious about TPUs (Tensor Processing Units), specialised hardware for training Machine- and Deep-Learning models. This fancy chip can provide very high gains for anyone aiming to perform really massive parallelisation of AI tasks such as training, fine-tuning, and inference.

In this blog post, I will touch on what a TPU is, why it could be useful for AI applications when compared to GPUs and briefly discuss associated opportunity costs.

What’s a TPU?


Deploying a Flask app part II: using an Apache reverse proxy

I recently wrote about serving a Flask web application on localhost using gunicorn. This is sufficient to get an app up and running locally using a production-ready WSGI server, but we still need to add an HTTP proxy server in front to securely handle HTTP requests coming from external clients. Here we’ll cover configuring a simple reverse proxy using the Apache web server, though of course you could do the same with another HTTP server such as nginx.


Understanding GPU parallelization in deep learning

Deep learning has proven to be the season’s favourite for biology: every other week, an interesting biological problem is solved by clever application of neural networks. Yet, as more challenges get cracked, modern research shifts more and more in the direction of larger models — meaning that increasing computational resources are required for training. Unsurprisingly, NVIDIA, the main manufacturer of GPUs, experienced a significant jump in their stock price earlier this year.

Access to compute is not enough to train good neural networks. As soon as multiple cards come into play, researchers need to use a completely different paradigm in which data and model weights are distributed across different devices, and sometimes even different computers. Although these tools are becoming crucial for successful computational biology research, they remain unfamiliar to many researchers. Hence, in this blog post, I would like to provide a really brief introduction to multi-GPU training.


SSH, the boss-fight level: Jupyter notebooks from compute nodes

Secure shell (SSH) is an essential tool for remote operations. However, not everything about it is smooth sailing, especially when you want to do things like reverse port-forwarding a Jupyter notebook to your local machine, via a proxy jump or two, from a compute node in a no-home container. Even if it sounds less plausible than the exploits on Mr Robot, it actually can work, and it requires zero social engineering or sneaking into server rooms to install Raspberry Pis while using a baseball cap as a disguise.


Deploying a Flask app part I: the gunicorn WSGI server

Last year I wrote a post about deploying Flask apps with Apache/mod_wsgi when your app’s dependencies are installed in a conda environment. The year before, in the dark times, I wrote a post about the black magic invocations required to get multiple apps running stably using mod_wsgi. I’ve since moved away from mod_wsgi entirely and switched to running Flask apps from containers using the gunicorn WSGI server behind an Apache reverse proxy, which has made life immeasurably easier. In this post we’ll cover running a Flask app on localhost using gunicorn; in Part II we’ll run our app as a service using Singularity and deploy it to production using Apache as an HTTP proxy server.


GitHub.dev: Just press “.”

GitHub.dev is an incredibly useful GitHub feature that lets you view and edit code directly in the browser, in a remote VSCode session.

To access this remote VSCode session, either:

  1. Press “.”
  2. Change “.com” to “.dev” in the URL

This is a great way to quickly explore someone’s code without having to clone it onto your machine or go through the GitHub UI.
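
The second option is simple enough to script. Below is a minimal sketch, assuming you have a github.com URL as a string; `to_github_dev` is a hypothetical helper name, not part of any GitHub tooling.

```python
from urllib.parse import urlsplit, urlunsplit


def to_github_dev(url: str) -> str:
    """Return the github.dev equivalent of a github.com URL (hypothetical helper)."""
    parts = urlsplit(url)
    # Only rewrite URLs that really point at github.com.
    if parts.netloc.lower() != "github.com":
        raise ValueError(f"not a github.com URL: {url!r}")
    # Swap the host and keep the scheme, path, query and fragment untouched.
    return urlunsplit(parts._replace(netloc="github.dev"))


print(to_github_dev("https://github.com/user/repo"))
```

Here `user/repo` is a placeholder; any repository path is carried over unchanged.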

LightningCLI, my new best friend

If you’ve ever worked on machine learning projects, you’ll know that training models is just one aspect of the process. Code setup, configuration management, and ensuring reproducibility can also take up a lot of time. I’m a big fan of PyTorch Lightning primarily because it hides most of the boilerplate code you usually need, making your code more modular and readable. It even allows you to train your models on multiple GPUs with ease. All of this comes with the minor trade-off of learning an intuitive API, which can be easily extended to tweak any low-level details for those rare cases where the standard API falls short.

However, despite finding PyTorch Lightning incredibly useful, there’s one aspect that has always bothered me: the configuration of the model and training hyperparameters in a flexible and reproducible manner. In my view, the best approach to address this is to use configuration files for the various modules involved. These files can be easily overridden at runtime using command-line arguments or environment variables. To achieve this, I developed my own packages, configfile and argParseFromDoc, which facilitate this process.
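
To make the layering idea concrete, here is a minimal standard-library sketch of values from a config file being overridden by environment variables and then by command-line arguments. It does not show how configfile or argParseFromDoc work internally; the `MYAPP_` prefix, the `resolve_config` function and the precedence order (command line over environment over file) are all assumptions made for illustration.

```python
import argparse

# Hypothetical defaults standing in for values read from a config file.
FILE_CONFIG = {"learning_rate": 0.02, "out_dim": 10}


def resolve_config(file_config, environ, argv):
    """Merge settings with precedence: command line > environment > config file."""
    config = dict(file_config)
    # Environment variables such as MYAPP_OUT_DIM override the file values.
    for key, default in file_config.items():
        env_name = f"MYAPP_{key.upper()}"
        if env_name in environ:
            config[key] = type(default)(environ[env_name])
    # Command-line flags such as --learning_rate override everything else.
    parser = argparse.ArgumentParser()
    for key, default in file_config.items():
        parser.add_argument(f"--{key}", type=type(default), default=None)
    cli_args = parser.parse_args(argv)
    for key, value in vars(cli_args).items():
        if value is not None:
            config[key] = value
    return config


config = resolve_config(
    FILE_CONFIG,
    environ={"MYAPP_OUT_DIM": "5"},
    argv=["--learning_rate", "0.1"],
)
print(config)  # {'learning_rate': 0.1, 'out_dim': 5}
```

In a real script you would pass `os.environ` and `sys.argv[1:]` instead of the literals above. Converting with `type(default)` is enough for the int and float values here, but would need more care for booleans or nested settings.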

But now, there’s a tool within the Lightning suite that offers all these features in a seamlessly integrated package. Allow me to introduce you to LightningCLI. This tool streamlines the process of hyperparameter configuration, making it both flexible and reproducible. With LightningCLI, you get the best of both worlds: the power of PyTorch Lightning and a hassle-free setup.

The core idea here is to write a config file (or several) that contains the required parameters for the trainer, the model and the dataset. This is done in YAML, with the following structure:

trainer:
  logger: true
  ...
model:
  out_dim: 10
  learning_rate: 0.02
data:
  data_dir: ./
  image_size: 256
ckpt_path: null

The YAML fields correspond to the parameters of the PyTorch Lightning Trainer and of your custom model and data classes, which inherit from LightningModule and LightningDataModule respectively. A full example, in which create_my_model, compute_loss and create_dataloader stand in for your own code, could be:

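import lightning.pytorch as pl
import torch
from lightning.pytorch.cli import LightningCLI


class MyModel(pl.LightningModule):
    def __init__(self, out_dim: int, learning_rate: float):
        super().__init__()
        # Record the constructor arguments in self.hparams and in checkpoints.
        self.save_hyperparameters()
        self.out_dim = out_dim
        self.learning_rate = learning_rate
        # create_my_model is a placeholder for your own network factory.
        self.model = create_my_model(out_dim)

    def training_step(self, batch, batch_idx):
        out = self.model(batch.x)
        # compute_loss is a placeholder for a loss method you define on this class.
        loss = self.compute_loss(out, batch.y)
        return loss

    def configure_optimizers(self):
        # An optimizer is needed for fit; Adam is only an illustrative choice.
        return torch.optim.Adam(self.parameters(), lr=self.learning_rate)


class MyDataModule(pl.LightningDataModule):
    def __init__(self, data_dir: str, image_size: int):
        super().__init__()
        self.data_dir = data_dir
        self.image_size = image_size

    def train_dataloader(self):
        # create_dataloader is a placeholder for your own DataLoader factory.
        return create_dataloader(self.image_size, self.data_dir)


def main():
    # LightningCLI parses the command line and config files, then runs the subcommand.
    LightningCLI(model_class=MyModel, datamodule_class=MyDataModule)


if __name__ == "__main__":
    main()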

That can be run easily as

python script.py fit --config config.yaml
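
To illustrate how a config section lines up with a constructor signature, here is a minimal standard-library sketch. It is not LightningCLI's actual mechanism, which relies on the jsonargparse package and the type hints of your classes; `ToyModel` and `build_from_section` are hypothetical names used only for this example.

```python
import inspect


class ToyModel:
    """Stand-in for a LightningModule subclass (hypothetical)."""

    def __init__(self, out_dim: int, learning_rate: float):
        self.out_dim = out_dim
        self.learning_rate = learning_rate


def build_from_section(cls, section):
    """Instantiate cls from a config section, rejecting unknown keys."""
    params = inspect.signature(cls.__init__).parameters
    accepted = {name for name in params if name != "self"}
    unknown = set(section) - accepted
    if unknown:
        raise TypeError(
            f"unexpected config keys for {cls.__name__}: {sorted(unknown)}"
        )
    # Each remaining key is passed as the keyword argument of the same name.
    return cls(**section)


# The "model" section mirrors the YAML structure shown earlier.
config = {"model": {"out_dim": 10, "learning_rate": 0.02}}
model = build_from_section(ToyModel, config["model"])
print(model.out_dim, model.learning_rate)  # 10 0.02
```

The point of the sketch is that a misspelled or unexpected key in a config section is caught before the object is built, rather than being silently ignored.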

What is even better is that you can split the configuration into several config files and that the configuration files can refer to Python classes to be instantiated, making this configuration system so flexible that you can literally configure everything you can imagine.

model:
  class_path: model.MyModel2
  init_args:
    learning_rate: 0.2
    loss:
      class_path: torch.nn.CrossEntropyLoss
      init_args:
        reduction: mean
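
The two ideas in this section, combining several config files and instantiating classes named by class_path, can be sketched with the standard library alone. This is an illustration of the concept rather than LightningCLI's implementation; `deep_merge` and `instantiate` are hypothetical helpers, the sketch assumes that later files take precedence over earlier ones, and `fractions.Fraction` is used only because it is importable everywhere.

```python
import importlib


def deep_merge(base, override):
    """Return base updated with override, merging nested dicts recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def instantiate(spec):
    """Build an object from a {'class_path': ..., 'init_args': ...} mapping."""
    if not (isinstance(spec, dict) and "class_path" in spec):
        return spec  # plain value, nothing to instantiate
    module_name, _, class_name = spec["class_path"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    # Nested specs inside init_args are instantiated recursively.
    init_args = {k: instantiate(v) for k, v in spec.get("init_args", {}).items()}
    return cls(**init_args)


# Two hypothetical config fragments, as if loaded from separate YAML files.
base = {
    "value": {
        "class_path": "fractions.Fraction",
        "init_args": {"numerator": 1, "denominator": 2},
    }
}
override = {"value": {"init_args": {"denominator": 4}}}

obj = instantiate(deep_merge(base, override)["value"])
print(obj)  # 1/4
```

The second fragment changes a single argument without repeating the class_path, which is the kind of reuse that splitting a configuration across files makes possible.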

In conclusion, LightningCLI brings the convenience of configuration management, command-line flexibility, and reproducibility to your PyTorch Lightning projects. With simple yet powerful features, it’s a tool that should be part of any machine learning engineer’s toolkit.