Category Archives: Data Science

Working with PDB Structures in Pandas

Pandas is one of my favourite data analysis tools working in Python! The data frames offer a lot of power and organization to any data analysis task. Here at OPIG we work with a lot of protein structure data coming from PDB files. In the following article I will go through an example of how I use pandas data frames to analyze PDB data.

Continue reading

A Seq2Seq model for ETF forecasting

Owing to the misguided belief that I can achieve the impossible, I decided to build a model with the goal of beating the stock market.

Strap in, we’re about to get rich.

Machine learning is increasingly being employed by hedge funds to help mitigate risk and identify patterns and opportunities, whether this is for optimisation of algo trading strategies, fraud detection, high-frequency trading, or sentiment analysis. Arguably the most obvious, difficult, and naïve application of fintech ML is direct stock market forecasting – sounds like the perfect place to start.

Target

First things first, we need to decide on a stock to forecast. Volatility provides opportunities, but predictable volatility is even better. We need a security that swings in response to actual, reported events, and one whose trends roughly move somehow with other stocks – our hypothesis being that wider events in the market can be used to forecast a single security. SPDR GLD seems like a reasonable option – gold is such a popular hedge against global instability it’s price usually moves in the opposite direction to stocks such as DJIA or SP500 and moves with global disaster.

Gold price (/oz) in Pounds from 1980-2024

Continue reading

Some useful pandas functions

Pandas is one of the most used packages for data analysis in python. The library provides functionalities that allow to perfrom complex data manipulation operations in a few lines of code. However, as the number of functions provided is huge, it is impossible to keep track of all of them. More often than we’d like to admit we end up wiriting lines and lines of code only to later on discover that the same operation can be performed with a single pandas function.

To help avoiding this problem in the future, I will run through some of my favourite pandas functions and demonstrate their use on an example data set containing information of crystal structures in the PDB.

Continue reading

Converting pandas DataFrames into Publication-Ready Tables

Analysing, comparing and communicating the predictive performance of machine learning models is a crucial component of any empirical research effort. Pandas, a staple in the Python data analysis stack, not only helps with the data wrangling itself, but also provides efficient solutions for data presentation. Two of its lesser-known yet incredibly useful features are df.to_markdown() and df.to_latex(), which allow for a seamless transition from DataFrames to publication-ready tables. Here’s how you can use them!

Continue reading

A Simple Way to Quantify the Similarity Between Two Sets of Molecules

When designing machine learning algorithms with the aim of accelerating the discovery of novel and more effective therapeutics, we often care deeply about their ability to generalise to new regions of chemical space and accurately predict the properties of molecules that are structurally or functionally dissimilar to the ones we have already explored. To evaluate the performance of algorithms in such an out-of-distribution setting, it is essential that we are able to quantify the data shift that is induced by the train-test splits that we rely on to decide which model to deploy in production.

For our recent ICML 2023 paper Drug Discovery under Covariate Shift with Domain-Informed Prior Distributions over Functions, we chose to quantify the distributional similarity between two sets of molecules through the Maximum Mean Discrepancy (MMD).

Continue reading

What can you do with the OPIG Immunoinformatics Suite? v3.0

OPIG’s growing immunoinformatics team continues to develop and openly distribute a wide variety of databases and software packages for antibody/nanobody/T-cell receptor analysis. Below is a summary of all the latest updates (follows on from v1.0 and v2.0).

Continue reading

PHinally PHunctionalising my PHigures with PHATE feat. Plotly Express.

After being recommended by a friend, I really wanted to try plotly express but I never had the inclination to read more documentation when matplotlib gives me enough grief. While experimenting with ChatGPT I finally decided to functionalise my figure making scripts. With these scripts I manage to produce figures that made people question what I had actually been doing with my time – but I promise this will be worth your time.

I have been using with dimensionality reducition techniques recently and I came across this paper by Moon et al. PHATE is a technique that represents high dimensional (ie biological) data in a way that aims to preserve connections over preserving distance and I knew I wanted to try this as soon as I saw it. Why should you care? PHATE in 3D is faster that t-SNE in 2D. It would almost be rude to not try it out.

PHATE

In my opinion PHATE (or potential of heat diffusion for affinity-based transition embedding) does have a lot going on but that the choices at each stage feel quite sensisble. It might not come as a surprise this was primarily designed to make visual inspection of data easier on the eyes.

Continue reading

9th Joint Sheffield Conference on Cheminformatics

Over the next few days, researchers from around the world will be gathering in Sheffield for the 9th Joint Sheffield Conference on Cheminformatics. As one of the organizers (wearing my Molecular Graphics and Modeling Society ‘hat’), I can say we have an exciting array of speakers and sessions:

  • De Novo Design
  • Open Science
  • Chemical Space
  • Physics-based Modelling
  • Machine Learning
  • Property Prediction
  • Virtual Screening
  • Case Studies
  • Molecular Representations

It has traditionally taken place every three years, but despite the global pandemic it is returning this year, once again in person in the excellent conference facilities at The Edge. You can download the full programme in iCal format, and here is the conference calendar:

Continue reading

Machine learning strategies to overcome limited data availability

Machine learning (ML) for biological/biomedical applications is very challenging – in large part due to limitations in publicly available data (something we recently published about [1]). Substantial amounts of time and resources may be required to generate the types of data (eg protein structures, protein-protein binding affinity, microscopy images, gene expression values) required to train ML models, however.

In cases where there is sufficient data available to provide signal, but not enough for the desired performance, ML strategies can be employed:

Continue reading

Pairwise sequence identity and Tanimoto similarity in PDBbind

In this post I will cover how to calculate sequence identity and Tanimoto similarity between any pairs of complexes in PDBbind 2020. I used RDKit in python for Tanimoto similarity and the MMseqs2 software for sequence identity calculations.

A few weeks back I wanted to cluster the protein-ligand complexes in PDBbind 2020, but to achieve this I first needed to precompute the sequence identity between all pairs sequences in PDBbind, and Tanimoto similarity between all pairs of ligands. PDBbind 2020 includes 19.443 complexes but there are much fewer distinct ligands and proteins than that. However, I kept things simple and calculated the similarities for all 19.443*19.443 pairs. Calculating the Tanimoto similarity is relatively easy thanks to the BulkTanimotoSimilarity function in RDKit. The following code should do the trick:

from rdkit.Chem import AllChem, MolFromMol2File
from rdkit.DataStructs import BulkTanimotoSimilarity
import numpy as np
import os

fps = []
for pdb in pdbs:
    mol = MolFromMol2File(os.path.join('data', pdb, f'{pdb}_ligand.mol2'))
    fps.append(AllChem.GetMorganFingerprint(mol, 3))

sims = []
for i in range(len(fps)):
    sims.append(BulkTanimotoSimilarity(fps[i],fps))

arr = np.array(sims)
np.savez_compressed('data/tanimoto_similarity.npz', arr)

Sequence identity calculations in python with Biopandas turned out to be too slow for this amount of data so I used the ultra fast MMseqs2. The first step to running MMseqs2 is to create a .fasta file of all the sequences, which I call QUERY.fasta. This is what the first few lines look like:

Continue reading