Category Archives: Python

Pyrosetta for RFdiffusion

I will not lie: I often struggle to find a snippet of code that did something in PyRosetta or I spend hours facing a problem caused by something not working as I expect it to. I recently did a tricky project involving RFdiffusion and I kept slipping on the PyRosetta side. So to make future me, others, and ChatGTP5 happy, here are some common operations to make working with PyRosetta for RFdiffusion easier.

Continue reading

Exploring multilingual programming

Python is a prominent language in the ML and scientific computing space, and for good reason. Python is easy-to-learn and readable, and it offers a vast selection of libraries such as NumPy for numerical computation, Pandas for data manipulation, SciPy for scientific computing, TensorFlow, and PyTorch for deep learning, along with RDKit and Open Babel for cheminformatics. It is understandably an appealing choice for developers and researchers alike. However, a closer look at many common Python libraries reveals their foundations in C++. 

Revisiting C++ Advantages

Many of Python libraries including TensorFlow, PyTorch, and RDKit are all heavily-reliant on C++. C++ allows developers to manage memory and CPU resources more effectively than Python, making it a good choice when handling large volumes of data at a fast pace. A previous post on this blog discusses C++’s speed, its utility in GPU programming through CUDA, and the complexities of managing its libraries. Despite the steeper learning curve and verbosity compared to Python, the performance benefits of C++ are undeniable, especially in contexts where execution speed and memory management are critical.

Rust: A New Contender for High-Performance Computing

Continue reading

Quickly (and lazily) scale your data processing in Python

Do you use pandas for your data processing/wrangling? If you do, and your code involves any data-heavy steps such as data generation, exploding operations, featurization, etc, then it can quickly become inconvenient to test your code.

  • Inconvenient compute times (>tens of minutes). Perhaps fine for a one-off, but over repeated test iterations your efficiency and focus will take a hit.
  • Inconvenient memory usage. Perhaps your dataset is too large for memory, or loads in but then causes an OOM error during a mid-operation memory spike.
Continue reading

Mapping derivative compounds to parent hits

Whereas it is easy to say in a paper “Given the HT-Sequential-ITC results, 42 led to 113, a substituted decahydro-2,6-methanocyclopropa[f]indene”, it is frequently rather trickier algorithmically figure out which atoms map to which. In Fragmenstein, for the placement route, for example, a lot goes on behind the scenes, yet for some cases human provided mapping may be required. Here I discuss how to get the mapping from Fragmenstein and what goes on behind the scenes.

Continue reading

Using JAX and Haiku to build a Graph Neural Network


JAX

Last year, I had an opportunity to delve into the world of JAX whilst working at InstaDeep. My first blopig post seems like an ideal time to share some of that knowledge. JAX is an experimental Python library created by Google’s DeepMind for applying accelerated differentiation. JAX can be used to differentiate functions written in NumPy or native Python, just-in-time compile and execute functions on GPUs and TPUs with XLA, and mini-batch repetitious functions with vectorization. Collectively, these qualities place JAX as an ideal candidate for accelerated deep learning research [1].

JAX is inspired by the NumPy API, making usage very familiar for any Python user who has already worked with NumPy [2]. However, unlike NumPy, JAX arrays are immutable; once they are assigned in memory they cannot be changed. As such, JAX includes specific syntax for index manipulation. In the code below, we create a JAX array and change the 1^{st} element to a 4:

Continue reading

Under-rated or overlooked, these libraries might be helpful.

Discovering a library that massively simplifies the exact thing you just did right after you’ve finished doing the thing you needed to do has to be one of the top 14 worst things about writing code. You might think it’s a part of the life we’ve all chosen, but it doesn’t have to be. Beyond the popular libraries you already know lies a treasure trove of under appreciated packages waiting to be wielded. Being the saint I am, I’ve scoured the depths of pypi.org to find some underrated and hopefully useful packages to make your life a little easier.

Continue reading

Plotext: The Matplotlib Lookalike That Breaks Free from X Servers

Imagine this: you’ve spent days computing intricate analyses, and now it’s time to bring your findings to life with a nice plot. You fire up your cluster job, scripts hum along, and… matplotlib throws an error, demanding an X server it can’t find. Frustration sets in. What a waste of computation! What happened? You just forgot to add the -X to your ssh command, or it may be just that X forwarding is not allowed in your cluster. So you will need to rerun your scripts, once you have modified them to generate a file that you can copy to your local machine rather than plotting it directly.

But wait! Plotext to the rescue! This Python package provides an interface nearly identical to matplotlib, allowing you to seamlessly transition your plotting code without sacrificing functionality. But why choose Plotext over the familiar matplotlib? The key lies in its text-based backend. This means it is just printing characters in your console to generate the plots, making it ideal for cluster environments where X servers are often absent or restricted. What do those plots look like? Here is an example:

Continue reading

Working with PDB Structures in Pandas

Pandas is one of my favourite data analysis tools working in Python! The data frames offer a lot of power and organization to any data analysis task. Here at OPIG we work with a lot of protein structure data coming from PDB files. In the following article I will go through an example of how I use pandas data frames to analyze PDB data.

Continue reading

The stuff MDAnalysis didn’t implement: CPU Parallel HOLE conductance analysis

Some time ago, I needed to find a way to computationally estimate conductance values for every protein frame from several molecular dynamics (MD) trajectories.

In a previous post, I wrote about how to clean the resulting instant conductance timeseries from outliers. But, I never described how I generated these timeseries.

In this post, I will show how you can parallelise the computation of instant conductance given an MD trajectory. I will touch on the difficulties of this process. And why I had to implement a custom tool for it given that MDAnalysis seems to already have implemented a routine of this sort. Finally, I will provide two Python scripts that you can easily adapt to run your parallel calculations – for which I’ll provide some important notes you don’t wanna skip.

Violin plots of conductance distributions from 64 molecular dynamic trajectories with 1000 frames each.
Continue reading

Taking Equivariance in deep learning for a spin?

I recently went to Sheh Zaidi‘s brilliant introduction to Equivariance and Spherical Harmonics and I thought it would be useful to cement my understanding of it with a practical example. In this blog post I’m going to start with serotonin in two coordinate frames, and build a small equivariant neural network that featurises it.

Continue reading