Pandas is one of my favourite data analysis tools for working in Python! Data frames offer a lot of power and organization for any data analysis task. Here at OPIG we work with a lot of protein structure data coming from PDB files. In the following article I will go through an example of how I use pandas data frames to analyze PDB data.
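As a taste of the kind of thing this involves, here is a minimal sketch of pulling PDB atom records into a pandas DataFrame. It assumes the biopandas package and a hypothetical file path – neither is necessarily what the article itself uses:

from biopandas.pdb import PandasPdb

# Load a (hypothetical) PDB file; ppdb.df['ATOM'] is a DataFrame
# with one row per ATOM record.
ppdb = PandasPdb().read_pdb('1abc.pdb')
atoms = ppdb.df['ATOM']

# Typical pandas-style analysis: mean B-factor per chain.
print(atoms.groupby('chain_id')['b_factor'].mean())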
A Seq2Seq model for ETF forecasting
Owing to the misguided belief that I can achieve the impossible, I decided to build a model with the goal of beating the stock market.
Strap in, we’re about to get rich.
Machine learning is increasingly being employed by hedge funds to help mitigate risk and identify patterns and opportunities, whether this is for optimisation of algo trading strategies, fraud detection, high-frequency trading, or sentiment analysis. Arguably the most obvious, difficult, and naïve application of fintech ML is direct stock market forecasting – sounds like the perfect place to start.
Target
First things first, we need to decide on a security to forecast. Volatility provides opportunities, but predictable volatility is even better. We need a security that swings in response to actual, reported events, and one whose trends move, at least roughly, with the wider market – our hypothesis being that broader market events can be used to forecast a single security. SPDR GLD seems like a reasonable option – gold is such a popular hedge against global instability that its price usually moves in the opposite direction to indices such as the DJIA or S&P 500, and tends to rise in times of global disaster.
[Figure: Gold price (£/oz), 1980–2024]
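As a rough sanity check of that hypothesis, one can compare daily returns of GLD against an S&P 500 tracker. The snippet below is a sketch that assumes the yfinance package and uses SPY as the S&P 500 proxy – neither is specified in the post:

import yfinance as yf

# Download daily prices for GLD and SPY (SPY standing in for the S&P 500).
prices = yf.download(['GLD', 'SPY'], start='2010-01-01')['Close']

# Correlation of daily returns: weak or negative supports the hedge story.
returns = prices.pct_change().dropna()
print(returns['GLD'].corr(returns['SPY']))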
Some useful pandas functions
Pandas is one of the most widely used packages for data analysis in Python. The library provides functionality that lets you perform complex data manipulation operations in a few lines of code. However, the number of functions provided is huge, making it impossible to keep track of all of them. More often than we'd like to admit, we end up writing lines and lines of code only to later discover that the same operation can be performed with a single pandas function.
To help avoid this problem in the future, I will run through some of my favourite pandas functions and demonstrate their use on an example data set containing information on crystal structures in the PDB.
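As a flavour of what's to come, here is a tiny illustration on a made-up table of PDB entries (the column names are hypothetical); value_counts, query and groupby are exactly the kind of one-liners that replace hand-written loops:

import pandas as pd

# Invented example data, loosely shaped like a PDB metadata table.
df = pd.DataFrame({
    'pdb_id':     ['1abc', '2def', '3ghi', '4jkl'],
    'method':     ['X-RAY', 'X-RAY', 'NMR', 'EM'],
    'resolution': [1.8, 2.3, None, 3.1],
})

print(df['method'].value_counts())                 # counts per category
print(df.query('resolution < 2.5'))                # concise row filtering
print(df.groupby('method')['resolution'].mean())   # aggregate per group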
Converting pandas DataFrames into Publication-Ready Tables
Analysing, comparing and communicating the predictive performance of machine learning models is a crucial component of any empirical research effort. Pandas, a staple in the Python data analysis stack, not only helps with the data wrangling itself, but also provides efficient solutions for data presentation. Two of its lesser-known yet incredibly useful features are df.to_markdown() and df.to_latex(), which allow for a seamless transition from DataFrames to publication-ready tables. Here's how you can use them!
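For example (a minimal sketch with made-up numbers; note that df.to_markdown() requires the tabulate package to be installed):

import pandas as pd

# A small, invented results table of the kind you might want to publish.
df = pd.DataFrame(
    {'AUROC': [0.91, 0.87], 'RMSE': [0.42, 0.55]},
    index=['Model A', 'Model B'],
)

print(df.to_markdown())                  # GitHub-flavoured markdown table
print(df.to_latex(float_format='%.2f'))  # LaTeX tabular environment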
A Simple Way to Quantify the Similarity Between Two Sets of Molecules
When designing machine learning algorithms with the aim of accelerating the discovery of novel and more effective therapeutics, we often care deeply about their ability to generalise to new regions of chemical space and accurately predict the properties of molecules that are structurally or functionally dissimilar to the ones we have already explored. To evaluate the performance of algorithms in such an out-of-distribution setting, it is essential that we are able to quantify the data shift that is induced by the train-test splits that we rely on to decide which model to deploy in production.
For our recent ICML 2023 paper Drug Discovery under Covariate Shift with Domain-Informed Prior Distributions over Functions, we chose to quantify the distributional similarity between two sets of molecules through the Maximum Mean Discrepancy (MMD).
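As a rough sketch of the idea, the MMD can be estimated from a kernel between molecules; below we use the Tanimoto kernel on Morgan fingerprints, which is an illustrative choice rather than necessarily the kernel used in the paper:

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.DataStructs import BulkTanimotoSimilarity

def fingerprints(smiles_list, radius=2, n_bits=2048):
    # Morgan bit-vector fingerprints for a list of SMILES strings.
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    return [AllChem.GetMorganFingerprintAsBitVect(m, radius, nBits=n_bits)
            for m in mols]

def mean_kernel(fps_a, fps_b):
    # Average Tanimoto similarity over all pairs (a biased estimator).
    return np.mean([BulkTanimotoSimilarity(fp, fps_b) for fp in fps_a])

def mmd(fps_x, fps_y):
    # MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    mmd2 = (mean_kernel(fps_x, fps_x) + mean_kernel(fps_y, fps_y)
            - 2 * mean_kernel(fps_x, fps_y))
    return np.sqrt(max(mmd2, 0.0))

A larger MMD between, say, fingerprints(train_smiles) and fingerprints(test_smiles) indicates a stronger distribution shift induced by the split.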
What can you do with the OPIG Immunoinformatics Suite? v3.0
OPIG’s growing immunoinformatics team continues to develop and openly distribute a wide variety of databases and software packages for antibody/nanobody/T-cell receptor analysis. Below is a summary of all the latest updates (follows on from v1.0 and v2.0).
PHinally PHunctionalising my PHigures with PHATE feat. Plotly Express.
After it was recommended by a friend, I really wanted to try Plotly Express, but I never had the inclination to read more documentation when matplotlib already gives me enough grief. While experimenting with ChatGPT I finally decided to functionalise my figure-making scripts. With these scripts I managed to produce figures that made people question what I had actually been doing with my time – but I promise this will be worth your time.
I have been working with dimensionality reduction techniques recently, and I came across this paper by Moon et al. PHATE is a technique that represents high-dimensional (i.e. biological) data in a way that aims to preserve connections over preserving distances, and I knew I wanted to try it as soon as I saw it. Why should you care? PHATE in 3D is faster than t-SNE in 2D. It would almost be rude not to try it out.
PHATE
In my opinion, PHATE (or potential of heat diffusion for affinity-based transition embedding) does have a lot going on, but the choices at each stage feel quite sensible. It might not come as a surprise that it was primarily designed to make visual inspection of data easier on the eyes.
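Getting a first embedding out of it takes only a few lines. The sketch below assumes the phate package (pip install phate) and Plotly Express, with random numbers standing in for real biological data:

import numpy as np
import phate
import plotly.express as px

# Placeholder data: 500 samples with 50 features each.
X = np.random.rand(500, 50)

# Embed into three dimensions with PHATE, then plot interactively.
Y = phate.PHATE(n_components=3).fit_transform(X)
fig = px.scatter_3d(x=Y[:, 0], y=Y[:, 1], z=Y[:, 2],
                    title='PHATE embedding')
fig.show()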
9th Joint Sheffield Conference on Cheminformatics
Over the next few days, researchers from around the world will be gathering in Sheffield for the 9th Joint Sheffield Conference on Cheminformatics. As one of the organizers (wearing my Molecular Graphics and Modelling Society ‘hat’), I can say we have an exciting array of speakers and sessions:
- De Novo Design
- Open Science
- Chemical Space
- Physics-based Modelling
- Machine Learning
- Property Prediction
- Virtual Screening
- Case Studies
- Molecular Representations
The conference has traditionally taken place every three years and, despite the global pandemic, it is returning this year, once again in person in the excellent conference facilities at The Edge. You can download the full programme in iCal format, and here is the conference calendar:
Machine learning strategies to overcome limited data availability
Machine learning (ML) for biological/biomedical applications is very challenging – in large part due to limitations in publicly available data (something we recently published about [1]). Substantial amounts of time and resources may be required to generate the types of data (e.g. protein structures, protein–protein binding affinities, microscopy images, gene expression values) needed to train ML models.
In cases where there is enough data to provide signal, but not enough to reach the desired performance, a number of ML strategies can be employed:
Pairwise sequence identity and Tanimoto similarity in PDBbind
In this post I will cover how to calculate the sequence identity and Tanimoto similarity between any pair of complexes in PDBbind 2020. I used RDKit in Python for the Tanimoto similarity and the MMseqs2 software for the sequence identity calculations.
A few weeks back I wanted to cluster the protein–ligand complexes in PDBbind 2020, but to achieve this I first needed to precompute the sequence identity between all pairs of sequences in PDBbind, and the Tanimoto similarity between all pairs of ligands. PDBbind 2020 includes 19,443 complexes, but there are far fewer distinct ligands and proteins than that. However, I kept things simple and calculated the similarities for all 19,443 × 19,443 pairs. Calculating the Tanimoto similarity is relatively easy thanks to the BulkTanimotoSimilarity function in RDKit. The following code should do the trick:
from rdkit.Chem import AllChem, MolFromMol2File
from rdkit.DataStructs import BulkTanimotoSimilarity
import numpy as np
import os

# Compute a Morgan fingerprint (radius 3) for every ligand;
# `pdbs` is the list of PDBbind entry codes.
fps = []
for pdb in pdbs:
    mol = MolFromMol2File(os.path.join('data', pdb, f'{pdb}_ligand.mol2'))
    fps.append(AllChem.GetMorganFingerprint(mol, 3))

# All-vs-all Tanimoto similarity, one row of the matrix at a time.
sims = []
for i in range(len(fps)):
    sims.append(BulkTanimotoSimilarity(fps[i], fps))

# Store the full similarity matrix compactly.
arr = np.array(sims)
np.savez_compressed('data/tanimoto_similarity.npz', arr)
Sequence identity calculations in Python with Biopandas turned out to be too slow for this amount of data, so I used the ultra-fast MMseqs2. The first step to running MMseqs2 is to create a .fasta file of all the sequences, which I call QUERY.fasta – one header line per PDB code, each followed by its protein sequence.
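For illustration, here is a minimal sketch of generating such a file and running the all-vs-all search. It assumes a sequences dict mapping PDB codes to protein sequences (the placeholder entries below are invented), and requires MMseqs2 to be installed and on the PATH; the pident column of the resulting result.m8 file then holds the pairwise sequence identities:

import subprocess

# Invented placeholder sequences; in practice these would come from PDBbind.
sequences = {'1a2b': 'MKTAYIAKQR', '3c4d': 'GSHMTEYKLV'}

# Write the query file: a '>PDB_CODE' header line, then the sequence.
with open('QUERY.fasta', 'w') as f:
    for pdb, seq in sequences.items():
        f.write(f'>{pdb}\n{seq}\n')

# All-vs-all search with MMseqs2's easy-search workflow.
subprocess.run(['mmseqs', 'easy-search', 'QUERY.fasta', 'QUERY.fasta',
                'result.m8', 'tmp'], check=True)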