Category Archives: Data Science

Making your code pip installable

aka when to use a CustomBuildCommand or a CustomInstallCommand when building Python packages with setup.py

Bioinformatics software is complicated, and often a little bit messy. Recently I found myself wading through a Python package building quagmire and thought I could share something I learnt about when to use a custom build command and when to use a custom install command. I have also provided some information about how to copy executables to your package installation bin. **ChatGPT wrote the initial skeleton draft of this post, and I have corrected and edited it.

Next time you need to create a pip installable package yourself, hopefully this can save you some time!

Continue reading

ggPlotting tips with OPIG data

Ever wondered whether opiglets keep their ketchup in the fridge or cupboard? Perhaps you’ve wanted to know how to create a nice figure that displays lots of information simultaneously. Publication-quality figures are easy in R with the ggplot2 package. We may also learn some good visualisation practice along the way.

Continue reading

Incorporating conformer ensembles for better molecular representation learning

Conformer ensemble of tryptophan from Seibert et al.

The spatial or 3D structure of a molecule is particularly relevant to modeling its activity in QSAR. The 3D structural information affects molecular properties and chemical reactivities, and thus it is important to incorporate it in deep learning models built for molecules. A key aspect of the spatial structure of molecules is the flexible distribution of their constituent atoms, known as conformation. Given the temperature of a molecular system, the probability of each of its possible conformations is defined by its formation energy, and this follows a Boltzmann distribution [McQuarrie and Simon, 1997]. The Boltzmann distribution tells us the probability of a certain conformation given its potential energy. The different conformations of a molecule could result in different properties and activity. Therefore, it is imperative to consider multiple conformers in molecular deep learning to ensure that the notion of conformational flexibility is embedded in the model developed. The model should also be able to capture the Boltzmann distribution of the potential energy related to the conformers.
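To make the Boltzmann weighting concrete, here is a minimal sketch of how conformer probabilities could be computed from relative energies, using p_i = exp(-E_i / kT) / sum_j exp(-E_j / kT). The function name, the example energies and the choice of kcal/mol at 298.15 K are assumptions made for illustration; they are not taken from the post.

```python
import math

# Gas constant in kcal/(mol*K), i.e. Boltzmann constant per mole.
R_KCAL_PER_MOL_K = 0.0019872041


def boltzmann_weights(energies_kcal_per_mol, temperature_k=298.15):
    """Return the Boltzmann probability of each conformer.

    Energies are shifted by the minimum before exponentiating; this does
    not change the probabilities but avoids overflow for large values.
    """
    kt = R_KCAL_PER_MOL_K * temperature_k
    e_min = min(energies_kcal_per_mol)
    factors = [math.exp(-(e - e_min) / kt) for e in energies_kcal_per_mol]
    partition = sum(factors)
    return [f / partition for f in factors]


# Hypothetical relative energies (kcal/mol) for three conformers.
example_energies = [0.0, 0.5, 1.5]
example_weights = boltzmann_weights(example_energies)
for energy, weight in zip(example_energies, example_weights):
    print(f"E = {energy:.1f} kcal/mol -> p = {weight:.3f}")
```

Lower-energy conformers receive larger weights, and raising the temperature flattens the distribution, which is the behaviour a model that "captures the Boltzmann distribution" would need to reflect.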

Continue reading

The Tale of the Undead Logger

A picture of a scary-looking zombie in a lumberjack outfit holding an axe, in the middle of a forest at night, staring menacingly at the viewer.
Fear the Undead Logger all ye who enter here.
For he may strike, and drain the life out nodes that you hold dear.
Among the smouldering embers of jobs you thought long dead,
he lingers on, to terrorise, and cause you frightful dread.
But hark ye all my tale to save you from much pain,
and fight ye not anew the battles I have fought in vain.

Or simply…

… Tips and Tricks to Use When wandb Logger Just. Won’t. DIE.

The Weights and Biases Logger (illustrated above by DALL-E; admittedly with some artistic license) hardly requires introduction. It’s something of an industry standard at this point, well-regarded for the extensive (and extensible) functionality of its interactive dashboard; for advanced features like checkpointing model weights in the cloud and automating hyperparameter sweeps; and for integrating painlessly with frameworks like PyTorch and PyTorch Lightning. It simplifies your life as an ML researcher enormously by making it easy to track and compare experiments and monitor system resource usage, all while giving you very fun interactive graphs to play with.
Plot arbitrary quantities you may be logging against each other, interactively, on the fly, however you like. In Dark Mode, of course (you’re a professional, after all). Here’s a less artistic impression to give you an idea, should you have been living under a rock:

Continue reading

Comparing pose and affinity prediction methods for follow-up designs from fragments

In any virtual screening task, many filters need to be applied to a dataset of ligands to downselect the ‘best’ ones on a number of parameters and bring the set down to a manageable size. One popular filter is whether a compound has a physical pose and good affinity as predicted by tools such as docking or energy minimisation. In my pipeline for downselecting elaborations of compounds proposed as fragment follow-ups, I calculate the pose and ΔΔG by energy minimizing the ligand with atom restraints to matching atoms in the fragment inspiration. I use either RDKit with its MMFF94 forcefield or PyRosetta with its ref2015 scorefunction, all made possible by the lovely tool Fragmenstein.

With RDKit as the minimizer, the protein neighborhood around the ligand is fixed and placements take on average 21s, whereas PyRosetta placements take on average 238s (luckily, I can run placements in parallel). I would ideally like to use RDKit as the placement method since it is so fast and I would like to perform 500K placements within a few days, but I wanted to confirm that RDKit is ‘good enough’ compared to the slightly more rigorous tool PyRosetta (I think it allows residues to relax and samples more conformations with the longer runtime).
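To put those timings in perspective, here is a rough back-of-the-envelope sketch of the compute needed for 500K placements. It assumes that "a few days" means a 3-day deadline and that placements scale perfectly across parallel workers; both are assumptions for illustration, and the function names are hypothetical.

```python
import math

SECONDS_PER_DAY = 86_400


def cpu_days(n_placements, seconds_per_placement):
    """Total single-worker compute time, in days."""
    return n_placements * seconds_per_placement / SECONDS_PER_DAY


def workers_needed(n_placements, seconds_per_placement, deadline_days):
    """Parallel workers required to finish by the deadline (ideal scaling)."""
    return math.ceil(cpu_days(n_placements, seconds_per_placement) / deadline_days)


N_PLACEMENTS = 500_000
DEADLINE_DAYS = 3  # assumed reading of "a few days"

# Average per-placement timings quoted in the text.
for method, seconds in [("RDKit", 21), ("PyRosetta", 238)]:
    total = cpu_days(N_PLACEMENTS, seconds)
    workers = workers_needed(N_PLACEMENTS, seconds, DEADLINE_DAYS)
    print(f"{method}: {total:.1f} CPU-days, {workers} workers for {DEADLINE_DAYS} days")
```

Under these assumptions RDKit needs roughly 120 CPU-days (about 41 workers), while PyRosetta needs well over 1,000 CPU-days (about 460 workers), which is why the faster method is so attractive if its results hold up.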

Continue reading

Conference Summary: MGMS Adaptive Immune Receptors Meeting 2024

On 5th April 2024, over 60 researchers braved the train strikes and gusty weather to gather at Lady Margaret Hall in Oxford and engage in a day full of scientific talks, posters and discussions on the topic of adaptive immune receptor (AIR) analysis!

Continue reading

How can FemTech help close the gender health gap?

An excellent previous blog post from Sarah [1] describes the gender data gap and touches on the fact that women experience poorer healthcare outcomes. This arises from, amongst other things, the historical exclusion of women from clinical trials and this idea of the ‘male default’, where, for example, drug dosages and diagnostic thresholds are benchmarked against men, or even surgical instruments are designed to fit male hands [2]. I thought I would follow up on Sarah’s blog post and discuss how FemTech can help to close this gender health gap.

Continue reading

Working with PDB Structures in Pandas

Pandas is one of my favourite data analysis tools for working in Python! The data frames offer a lot of power and organization to any data analysis task. Here at OPIG we work with a lot of protein structure data coming from PDB files. In the following article I will go through an example of how I use pandas data frames to analyze PDB data.
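As a flavour of what this can look like, here is a minimal sketch that reads the fixed-width ATOM/HETATM records of a PDB file into a list of dictionaries, one per atom. The function name and the tiny sample are made up for illustration, and the full post may well take a different route (for example, a dedicated parsing library).

```python
# Hypothetical three-atom sample in PDB fixed-column format.
SAMPLE_PDB = """\
ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N
ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C
HETATM    3  O   HOH A 101      10.000  11.500  -3.250  1.00 20.00           O
"""


def parse_pdb_atoms(pdb_text):
    """Parse ATOM/HETATM lines into a list of dicts (one dict per atom).

    Column positions follow the PDB format: serial 7-11, atom name 13-16,
    residue name 18-20, chain 22, residue number 23-26, x/y/z 31-54.
    """
    records = []
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")):
            continue  # skip headers, remarks, TER, END, etc.
        records.append({
            "record": line[0:6].strip(),
            "serial": int(line[6:11]),
            "atom_name": line[12:16].strip(),
            "res_name": line[17:20].strip(),
            "chain_id": line[21],
            "res_seq": int(line[22:26]),
            "x": float(line[30:38]),
            "y": float(line[38:46]),
            "z": float(line[46:54]),
        })
    return records


atoms = parse_pdb_atoms(SAMPLE_PDB)
print(len(atoms), "atoms parsed; first atom:", atoms[0])
```

If pandas is installed, passing the result to `pd.DataFrame(atoms)` gives one row per atom, after which the usual filtering and `groupby` operations (per chain, per residue) apply.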

Continue reading

A Seq2Seq model for ETF forecasting

Owing to the misguided belief that I can achieve the impossible, I decided to build a model with the goal of beating the stock market.

Strap in, we’re about to get rich.

Machine learning is increasingly being employed by hedge funds to help mitigate risk and identify patterns and opportunities, whether this is for optimisation of algo trading strategies, fraud detection, high-frequency trading, or sentiment analysis. Arguably the most obvious, difficult, and naïve application of fintech ML is direct stock market forecasting – sounds like the perfect place to start.

Target

First things first, we need to decide on a stock to forecast. Volatility provides opportunities, but predictable volatility is even better. We need a security that swings in response to actual, reported events, and one whose trends roughly move with other stocks – our hypothesis being that wider events in the market can be used to forecast a single security. SPDR GLD seems like a reasonable option – gold is such a popular hedge against global instability that its price usually moves in the opposite direction to stock indices such as the DJIA or SP500 and rises with global disaster.

Gold price (/oz) in Pounds from 1980-2024

Continue reading

Some useful pandas functions

Pandas is one of the most used packages for data analysis in Python. The library provides functionality that allows you to perform complex data manipulation operations in a few lines of code. However, as the number of functions provided is huge, it is impossible to keep track of all of them. More often than we’d like to admit, we end up writing lines and lines of code only to discover later that the same operation can be performed with a single pandas function.

To help avoid this problem in the future, I will run through some of my favourite pandas functions and demonstrate their use on an example data set containing information on crystal structures in the PDB.

Continue reading