Through our work in OPIG, many of our projects come in the form of code bases written in Python. These can be many different things like databases, machine learning models, and other software tools. Often, the user interface for these tools is developed as both a web app and a command line application. Here, I will discuss one of my favourite tools for testing command-line applications: prysk!
Continue readingCategory Archives: Python
MDAnalysis: Work with dynamics trajectories of proteins
For a long time crystallographers and subsequently the authors of AlphaFold2 had you believe that proteins are a static group of atoms written to a .pdb file. Turns out this was a HOAX. If you don’t want to miss out on the latest trend of working with dynamic structural ensembles of proteins this blog post is exactly right for you. MDAnalysis is a python package which as the name says was designed to analyse molecular dyanmics simulation and lets you work with trajectories of protein structures easily.
Continue readingOut of the box RDKit-valid is an imperfect metric: a review of the KekulizeException and nitrogen protonation to correct this
In deep learning based compound generation models the metric of fraction of RDKit-valid compounds is ubiquitous, but is problematic from the cheminformatics viewpoint as a large fraction may be driven by pyrrolic nitrogens (see below) rather than Texas carbons (carbon with 5 bonds like the Star of Texas). In RDKit, no error is more irksome that the KekulizeException
or ValenceException
from RDKit sanitisation. These are raised when the molecule is not correct. This would make the RDKit-valid a good metric, except for a small detail: the validity is as interpreted from the the stated implicit and explicit hydrogens and formal charges on the atoms, which most models do not assign. Therefore, a compound may not be RDKit-valid because it is actually impossible, like a Texas carbon, but in many cases it is because the formal charge or implicit hydrogen numbers of some atoms are incorrect. In both case, the major culprit is nitrogen. Herein I go through what they are and how to fix them, with a focus on aromatic nitrogens.
Making your code pip installable
aka when to use a CutomBuildCommand or a CustomInstallCommand when building python packages with setup.py
Bioinformatics software is complicated, and often a little bit messy. Recently I found myself wading through a python package building quagmire and thought I could share something I learnt about when to use a custom build command and when to use a custom install command. I have also provided some information about how to copy executables to your package installation bin. **ChatGPT wrote the initial skeleton draft of this post, and I have corrected and edited.
Next time you need to create a pip installable package yourself, hopefully this can save you some time!
Continue readingMemory-mapped files for efficient data processing
Memory management is a key concern when working with large datasets. Many researchers and developers will load entire datasets into memory for processing. Although this is a straightforward approach that allows for quick access and manipulation of data, it has its drawbacks. When the dataset size approaches or exceeds the available physical memory, performance degrades rapidly due to excessive swapping, leading to increased latency and reduced throughput. Memory-mapped files are an alternative strategy to access and manipulate large datasets without the need to load them fully into memory.
A background on memory-mapped Files
Memory mapping is the process of mapping a file or a portion of a file directly into virtual memory. This mapping establishes a one-to-one correspondence between the file’s contents on disk and specific addresses in the process’s memory space. Instead of relying on traditional I/O operations, such as read()
an write()
, which involve copying data between kernel space and user space, the process can access the file’s contents directly through memory addresses. Then, page faults are used to determine which chunks to load into physical memory. However, this chunks are significantly smaller than the whole file contents. This direct access reduces overhead and can significantly speed up data processing, especially for large files or applications that require high-throughput I/O operations.
Sort and Slice Tutorial – An alternative to extended connectivity fingerprints
Background¶
Sort and Slice (SNS) was developed by a former OPIGlet, Markus, as a method for improving Extended Connectivity Fingerprints (ECFPs) by overcoming bit collisions. ECFPs are a form of topological fingerprint which denote the absence and presence of circular substructures in a molecule. The steps for deriving an ECFP from a molecule are as follows:
Identifier assignment:
Each atom in the molecule is assigned an initial numerical identifier; this is typically generated by hashing a tuple of atomic properties called Daylight atomic invariants into a 32-bit integer. These properties are:
- Number of non-hydrogen neighbours.
- Valence – number of neighbouring hydrogens.
- Atomic number.
- Atomic mass.
- Atomic charge.
- Number of hydrogen neighbours.
- Ring membership.*
*Ring membership is an additional property that is often used but is not one of the original Daylight atomic invariants.
Continue reading
Reproducible publishing
I’m a big fan of Jupyter Notebooks. They’re a great way to document and explain code, and even better, you can run this code when connected to an appropriate kernel.
What if you want to work on something larger than a notebook? Say a chapter or even a whole book, with Python, R, Observable JS, or Julia code? Enter Quarto. You can combine Jupyter notebooks and/or plain text markdown to publish production quality articles, presentations, dashboards, website, blogs and books in HTML, PDF, Microsoft Word, ePub, and other formats. Quarto can also connect to publishing platforms like Posit Connect, Confluence Cloud, and others.
Continue readingThe Tale of the Undead Logger
Or simply…
… Tips and Tricks to Use When wandb
Logger Just. Won’t. DIE.
The Weights and Biases Logger (illustrated above by DALL-E; admittedly with some artistic license) hardly requires introduction. It’s something of an industry standard at this point, well-regarded for the extensive (and extensible) functionality of its interactive dashboard; for advanced features like checkpointing model weights in the cloud and automating hyperparameter sweeps; and for integrating painlessly with frameworks like PyTorch and PyTorch Lightning. It simplifies your life as an ML researcher enormously by making it easy to track and compare experiments, monitor system resource usage, all while giving you very fun interactive graphs to play with.
Plot arbitrary quantities you may be logging against each other, interactively, on the fly, however you like. In Dark Mode, of course (you’re a professional, after all). Here’s a less artistic impression to give you an idea, should you have been living under a rock:
Easy Python job queues with RQ
Job queueing is an important consideration for a web application, especially one that needs to play nice and share resources with other web applications. There are lots of options out there with varying levels of complexity and power, but for a simple pure Python job queue that just works, RQ is quick and easy to get up and running.
RQ is a Python job queueing package designed to work out of the box, using a Redis database as a message broker (the bit that allows the app and workers to exchange information about jobs). To use it, you just need a redis-server installation and the rq module in your python environment.
Continue readingInteractive visualization of protein–ligand complexes with Py3Dmol
I recently had a problem where I wanted to provide an interactive visualization of multiple different protein–ligand complexes, requiring minimal setup by the user, allowing them to zoom in and out and change the visualization style, without just providing multiple PDB files or a PyMOL session.
Continue reading