Tag Archives: Python

Sort and Slice Tutorial – An alternative to extended connectivity fingerprints

Background

Sort and Slice (SNS) was developed by a former OPIGlet, Markus, as a method for improving Extended Connectivity Fingerprints (ECFPs) by overcoming bit collisions. ECFPs are a form of topological fingerprint that encodes the presence or absence of circular substructures in a molecule. The steps for deriving an ECFP from a molecule are as follows:

Identifier assignment:

Each atom in the molecule is assigned an initial numerical identifier; this is typically generated by hashing a tuple of atomic properties called Daylight atomic invariants into a 32-bit integer. These properties are:
1. Number of non-hydrogen neighbours.
2. Valence minus the number of neighbouring hydrogens.
3. Atomic number.
4. Atomic mass.
5. Atomic charge.
6. Number of hydrogen neighbours.
7. Ring membership.*
*Ring membership is an additional property that is often used but is not one of the original Daylight atomic invariants.
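
As a rough illustration of the identifier assignment step, the sketch below packs the seven properties listed above into a tuple and hashes it into an unsigned 32-bit integer. The `AtomInvariants` container, the `initial_identifier` helper, the integer-rounded masses and the use of truncated SHA-256 are all assumptions made for this example; cheminformatics toolkits use their own hash functions, so the numbers produced here will not match theirs.

```python
import hashlib
from typing import NamedTuple


class AtomInvariants(NamedTuple):
    """Hypothetical container for the seven atomic properties listed above."""

    heavy_neighbours: int     # 1. number of non-hydrogen neighbours
    valence_minus_h: int      # 2. valence minus the number of hydrogens
    atomic_number: int        # 3. atomic number
    atomic_mass: int          # 4. atomic mass (rounded to an integer here)
    charge: int               # 5. atomic charge
    hydrogen_neighbours: int  # 6. number of hydrogen neighbours
    in_ring: bool             # 7. ring membership (the optional extra)


def initial_identifier(inv: AtomInvariants) -> int:
    """Hash the invariant tuple into an unsigned 32-bit integer.

    Truncated SHA-256 is an illustrative choice, not any toolkit's hash.
    """
    payload = ",".join(str(int(value)) for value in inv).encode("ascii")
    digest = hashlib.sha256(payload).digest()
    return int.from_bytes(digest[:4], "big")  # keep the first 32 bits


# Two atoms of ethanol (CH3-CH2-OH), written out by hand.
methyl_carbon = AtomInvariants(1, 1, 6, 12, 0, 3, False)
hydroxyl_oxygen = AtomInvariants(1, 1, 8, 16, 0, 1, False)

for name, inv in [("methyl C", methyl_carbon), ("hydroxyl O", hydroxyl_oxygen)]:
    print(f"{name}: {initial_identifier(inv)}")
```

Atoms with identical invariants receive identical identifiers, which is what lets the same substructure be recognised across different molecules.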

Continue reading →

Easy Python job queues with RQ

Job queueing is an important consideration for a web application, especially one that needs to play nice and share resources with other web applications. There are lots of options out there with varying levels of complexity and power, but for a simple pure Python job queue that just works, RQ is quick and easy to get up and running.

RQ is a Python job queueing package designed to work out of the box, using a Redis database as a message broker (the bit that allows the app and workers to exchange information about jobs). To use it, you just need a redis-server installation and the rq module in your Python environment.

Continue reading →

Comparing pose and affinity prediction methods for follow-up designs from fragments

Any virtual screening task needs several filters applied to a dataset of ligands to downselect the ‘best’ ones on a number of parameters and reduce the set to a manageable size. One popular filter is whether a compound has a physically plausible pose and good affinity, as predicted by tools such as docking or energy minimisation. In my pipeline for downselecting elaborations of compounds proposed as fragment follow-ups, I calculate the pose and ΔΔG by energy minimizing the ligand with atom restraints to matching atoms in the fragment inspiration. I use either RDKit with its MMFF94 forcefield or PyRosetta with its ref2015 scorefunction, all made possible by the lovely tool Fragmenstein.

With RDKit as the minimizer, the protein neighborhood around the ligand is fixed and placements take on average 21s, whereas PyRosetta placements take on average 238s (luckily, I can run placements in parallel). I would ideally like to use RDKit as the placement method since it is so fast and I would like to perform 500K placements within a few days, but I wanted to confirm that RDKit is ‘good enough’ compared to the slightly more rigorous PyRosetta (it allows residues to relax and, I think, samples more conformations given the longer runtime).
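
To put those timings in perspective, here is a back-of-the-envelope sketch of what 500K placements would cost. The three-day deadline is an assumed reading of "a few days", the helper names are made up for this example, and the calculation assumes perfect parallel scaling with no scheduling overhead.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def serial_days(n_placements: int, seconds_each: float) -> float:
    """Wall-clock days needed if placements run one after another."""
    return n_placements * seconds_each / SECONDS_PER_DAY


def workers_needed(n_placements: int, seconds_each: float, deadline_days: float) -> int:
    """Parallel workers needed to finish by the deadline (ideal scaling)."""
    total_seconds = n_placements * seconds_each
    return math.ceil(total_seconds / (deadline_days * SECONDS_PER_DAY))


N_PLACEMENTS = 500_000
DEADLINE_DAYS = 3  # assumption: "a few days" taken as three

for tool, seconds in [("RDKit", 21), ("PyRosetta", 238)]:
    days = serial_days(N_PLACEMENTS, seconds)
    workers = workers_needed(N_PLACEMENTS, seconds, DEADLINE_DAYS)
    print(f"{tool}: {days:.1f} days serially, {workers} workers for {DEADLINE_DAYS} days")
```

Under these assumptions RDKit needs roughly 121.5 serial days (41 workers), while PyRosetta needs roughly 1377.3 serial days (460 workers), which is why the cheaper method is so attractive if its results hold up.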

Continue reading →

Pyrosetta for RFdiffusion

I will not lie: I often struggle to find a snippet of code that did something in PyRosetta or I spend hours facing a problem caused by something not working as I expect it to. I recently did a tricky project involving RFdiffusion and I kept slipping on the PyRosetta side. So to make future me, others, and ChatGPT5 happy, here are some common operations to make working with PyRosetta for RFdiffusion easier.

Continue reading →

Quickly (and lazily) scale your data processing in Python

Do you use pandas for your data processing/wrangling? If you do, and your code involves any data-heavy steps such as data generation, exploding operations, featurization, etc., then it can quickly become inconvenient to test your code:

1. Inconvenient compute times (>tens of minutes). Perhaps fine for a one-off, but over repeated test iterations your efficiency and focus will take a hit.
2. Inconvenient memory usage. Perhaps your dataset is too large for memory, or loads in but then causes an OOM error during a mid-operation memory spike.

Continue reading →

Mapping derivative compounds to parent hits

Whereas it is easy to say in a paper “Given the HT-Sequential-ITC results, 42 led to 113, a substituted decahydro-2,6-methanocyclopropa[f]indene”, it is frequently rather trickier to figure out algorithmically which atoms map to which. In Fragmenstein, for the placement route, for example, a lot goes on behind the scenes, yet for some cases a human-provided mapping may be required. Here I discuss how to get the mapping from Fragmenstein and what goes on behind the scenes.

Continue reading →

Using JAX and Haiku to build a Graph Neural Network

JAX

Last year, I had an opportunity to delve into the world of JAX whilst working at InstaDeep. My first blopig post seems like an ideal time to share some of that knowledge. JAX is an experimental Python library created by Google for accelerated differentiation. JAX can be used to differentiate functions written in NumPy or native Python, just-in-time compile and execute functions on GPUs and TPUs with XLA, and mini-batch repetitious functions with vectorization. Collectively, these qualities place JAX as an ideal candidate for accelerated deep learning research [1].

JAX is inspired by the NumPy API, making usage very familiar for any Python user who has already worked with NumPy [2]. However, unlike NumPy, JAX arrays are immutable; once they are assigned in memory they cannot be changed. As such, JAX includes specific syntax for index manipulation. In the code below, we create a JAX array and change the $1^{st}$ element to a $4$:
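
A minimal sketch of that index update, assuming the "1st element" means index 0 and using arbitrary example values:

```python
import jax.numpy as jnp

# Create a small JAX array (the values are arbitrary, chosen for illustration).
x = jnp.array([1, 2, 3])

# In-place assignment such as `x[0] = 4` raises an error, because JAX arrays
# are immutable. The `.at[...].set(...)` syntax returns an updated copy instead.
y = x.at[0].set(4)

print(x)  # the original array is unchanged: [1 2 3]
print(y)  # the new array holds the update: [4 2 3]
```

Because the update returns a new array rather than mutating the old one, the result has to be bound to a name (here `y`) to be used.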

Continue reading →

Under-rated or overlooked, these libraries might be helpful.

Discovering a library that massively simplifies the exact thing you just did, right after you’ve finished doing it, has to be one of the top 14 worst things about writing code. You might think it’s a part of the life we’ve all chosen, but it doesn’t have to be. Beyond the popular libraries you already know lies a treasure trove of underappreciated packages waiting to be wielded. Being the saint I am, I’ve scoured the depths of pypi.org to find some underrated and hopefully useful packages to make your life a little easier.

Continue reading →

Working with PDB Structures in Pandas

Pandas is one of my favourite data analysis tools for working in Python! The data frames offer a lot of power and organization to any data analysis task. Here at OPIG we work with a lot of protein structure data coming from PDB files. In the following article I will go through an example of how I use pandas data frames to analyze PDB data.

Continue reading →

Taking Equivariance in deep learning for a spin?

I recently went to Sheh Zaidi’s brilliant introduction to Equivariance and Spherical Harmonics and I thought it would be useful to cement my understanding of it with a practical example. In this blog post I’m going to start with serotonin in two coordinate frames, and build a small equivariant neural network that featurises it.
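
As a warm-up for the idea of "two coordinate frames", here is a small sketch of the difference between an invariant feature, where f(Rx) = f(x), and an equivariant one, where f(Rx) = R f(x). The three points are made-up toy coordinates rather than real serotonin atoms, and the helper names are hypothetical; no neural network is involved.

```python
import math

Point = tuple[float, float, float]


def rotate_z(p: Point, angle: float) -> Point:
    """Rotate a point about the z axis by `angle` radians."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def pairwise_distances(points: list[Point]) -> list[float]:
    """Invariant feature: distances do not change when the frame rotates."""
    return [math.dist(a, b) for i, a in enumerate(points) for b in points[i + 1:]]


def centroid(points: list[Point]) -> Point:
    """Equivariant feature: the centroid rotates along with the points."""
    n = len(points)
    return (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )


# Toy "molecule" in frame A, and the same points rotated into frame B.
frame_a = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (2.1, 1.2, 0.5)]
angle = math.pi / 3
frame_b = [rotate_z(p, angle) for p in frame_a]

# Invariance: the distances agree in both frames.
print(pairwise_distances(frame_a))
print(pairwise_distances(frame_b))

# Equivariance: the centroid of the rotated points equals the rotated centroid.
print(centroid(frame_b))
print(rotate_z(centroid(frame_a), angle))
```

Each pair of printed lines should agree up to floating-point error. An equivariant network is built so that its internal features behave like the centroid here, transforming predictably when the input frame changes.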

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends
