Category Archives: Python

Deploying a Flask app part II: using an Apache reverse proxy

I recently wrote about serving a Flask web application on localhost using gunicorn. This is sufficient to get an app up and running locally using a production-ready WSGI server, but we still need to add a HTTP proxy server in front to securely handle HTTP requests coming from external clients. Here we’ll cover configuring a simple reverse proxy using the Apache web server, though of course you could do the same with another HTTP server such as nginx.

Continue reading

Understanding GPU parallelization in deep learning

Deep learning has proven to be the season’s favourite for biology: every other week, an interesting biological problem is solved by clever application of neural networks. Yet, as more challenges get cracked, modern research shifts more and more in the direction of larger models — meaning that increasing computational resources are required for training. Unsurprisingly, NVIDIA, the main manufacturer of GPUs, experienced a significant jump in their stock price earlier this year.

Access to compute is not enough to train good neural networks. As soon as multiple cards enter into play, researchers need to use a completely different paradigm where data and model weights are distributed across different devices — and sometimes even different computers. Though these tools start to be crucial for successful computational biology research, they are generally unknown to researchers. Hence, in this blogpost, I would like to provide a really brief introduction to multi-GPU training.

Continue reading

Deploying a Flask app part I: the gunicorn WSGI server

Last year I wrote a post about deploying Flask apps with Apache/mod_wsgi when your app’s dependencies are installed in a conda environment. The year before, in the dark times, I wrote a post about the black magic invocations required to get multiple apps running stably using mod_wsgi. I’ve since moved away from mod_wsgi entirely and switched to running Flask apps from containers using the gunicorn WSGI server behind an Apache reverse proxy, which has made life immeasurably easier. In this post we’ll cover running a Flask app on localhost using gunicorn; in Part II we’ll run our app as a service using Singularity and deploy it to production using Apache as a HTTP proxy server.

Continue reading

AI Can’t Believe It’s Not Butter

Recently, I’ve been using a Convolutional Neural Network (CNN), and other methods, to predict the binding affinity of antibodies from their sequence. However, nine months ago, I applied a CNN to a far more important task – distinguishing images of butter from margarine. Please check out the GitHub link below to learn moo-re.

https://github.com/lewis-chinery/AI_cant_believe_its_not_butter

Customising MCS mapping in RDKit

Finding the parts in common between two molecules appears to be a straightforward, but actually is a maze of layers. The task, maximum common substructure (MCS) searching, in RDKit is done by Chem.rdFMCS.FindMCS, which is highly customisable with lots of presets. What if one wanted to control in minute detail if a given atom X and is a match for atom Y? There is a way and this is how.

Continue reading

Exploring the Observed Antibody Space (OAS)

The Observed Antibody Space (OAS) [1,2] is an amazing resource for investigating observed antibodies or as a resource for training antibody specific models, however; its size (over 2.4 billion unpaired and 1.5 million paired antibody sequences as of June 2023) can make it painful to work with. Additionally, OAS is extremely information rich, having nearly 100 columns for each antibody heavy or light chain, further complicating how to handle the data. 

From spending a lot of time working with OAS, I wanted to share a few tricks and insights, which I hope will reduce the pain and increase the joy of working with OAS!

Continue reading

Pairwise sequence identity and Tanimoto similarity in PDBbind

In this post I will cover how to calculate sequence identity and Tanimoto similarity between any pairs of complexes in PDBbind 2020. I used RDKit in python for Tanimoto similarity and the MMseqs2 software for sequence identity calculations.

A few weeks back I wanted to cluster the protein-ligand complexes in PDBbind 2020, but to achieve this I first needed to precompute the sequence identity between all pairs sequences in PDBbind, and Tanimoto similarity between all pairs of ligands. PDBbind 2020 includes 19.443 complexes but there are much fewer distinct ligands and proteins than that. However, I kept things simple and calculated the similarities for all 19.443*19.443 pairs. Calculating the Tanimoto similarity is relatively easy thanks to the BulkTanimotoSimilarity function in RDKit. The following code should do the trick:

from rdkit.Chem import AllChem, MolFromMol2File
from rdkit.DataStructs import BulkTanimotoSimilarity
import numpy as np
import os

fps = []
for pdb in pdbs:
    mol = MolFromMol2File(os.path.join('data', pdb, f'{pdb}_ligand.mol2'))
    fps.append(AllChem.GetMorganFingerprint(mol, 3))

sims = []
for i in range(len(fps)):
    sims.append(BulkTanimotoSimilarity(fps[i],fps))

arr = np.array(sims)
np.savez_compressed('data/tanimoto_similarity.npz', arr)

Sequence identity calculations in python with Biopandas turned out to be too slow for this amount of data so I used the ultra fast MMseqs2. The first step to running MMseqs2 is to create a .fasta file of all the sequences, which I call QUERY.fasta. This is what the first few lines look like:

Continue reading

Checking your PDB file for clashing atoms

Detecting atom clashes in protein structures can be useful in a number of scenarios. For example if you are just about to start some molecular dynamics simulation, or if you want to check that a structure generated by a deep learning model is reasonable. It is quite straightforward to code, but I get the feeling that these sort of functions have been written from scratch hundreds of times. So to save you the effort, here is my implementation!!!

Continue reading

Unclear documentation? ChatGPT can help!

The PyMOL Python API is a useful resource for most people doing research in OPIG, whether focussed on antibodies, small molecule drug design or protein folding. However, the documentation is poorly structured and difficult to interpret without first having understood the structure of the module. In particular, the differences between use of the PyMOL command line and the API can be unclear, leading to a much longer debugging process for code than you’d like.

While I’m reluctant to continue the recent theme of ChatGPT-related posts, this is a use for ChatGPT that would have been incredibly useful to me when I was first getting to grips with the PyMOL API.

Continue reading

BRICS Decomposition in 3D

Inspired by this blog post by the lovely Kate, I’ve been doing some BRICS decomposing of molecules myself. Like the structure-based goblin that I am, though, I’ve been applying it to 3D structures of molecules, rather than using the smiles approach she detailed. I thought it may be helpful to share the code snippets I’ve been using for this: unsurprisingly, it can also be done with RDKit!

I’ll use the same example as in the original blog post, propranolol.

1DY4: CBH1 IN COMPLEX WITH S-PROPRANOLOL

First, I import RDKit and load the ligand in question:

Continue reading