Category Archives: Technical

OpenMM Setup: Start Simulating Proteins in 5 Minutes

Molecular dynamics (MD) simulations are a good way to explore the dynamical behaviour of a protein you might be interested in. One common problem is that most MD engines have a relatively steep learning curve.

What if you just want to run a simple, one-off simulation with no fancy enhanced sampling methods? OpenMM Setup is a useful tool for exactly this. It is built on the open-source OpenMM engine and provides an easy-to-install (via conda) GUI that can have you running a simulation in less than 5 minutes. Of course, running a simulation requires setting parameters carefully and being familiar with best practices. That is beyond the scope of this post, but there are many guides out there that can easily be found. Now on to the good stuff: using OpenMM Setup!

When you first run OpenMM Setup, you’ll be greeted by a browser window asking you to choose a structure to use. This can be a crystal structure or a model. Remember, sometimes these will have problems that need fixing like missing density or charged, non-physiological termini that would lead to artefacts, so visual inspection of the input is key! You can then choose the force field and water model you want to use, and tell OpenMM to do some cleaning up of the structure. Here I am running the simulation on hen egg-white lysozyme:

Continue reading →

How to prepare a molecule for RDKit

RDKit is very fussy when it comes to inputs in SDF format. Using the SDMolSupplier, we get a significant rate of failure even on curated datasets such as the PDBBind refined set. PyMOL has no such scruples, and with that, I present a function which has proved invaluable to me over the course of my DPhil. For reasons I have never bothered to explore, using PyMOL to convert from SDF into mol2 and back to SDF again (adding in missing hydrogens along the way) will almost always make a molecule safe to import using RDKit:

from pathlib import Path
from pymol import cmd

def py_mollify(sdf, overwrite=False):
    """Use pymol to sanitise an SDF file for use in RDKit.

    Arguments:
        sdf: location of faulty sdf file
        overwrite: whether or not to overwrite the original sdf. If False,
            a new file will be written in the form <sdf_fname>_pymol.sdf
            
    Returns:
        Filename of the sanitised output: the original sdf filename if
        overwrite == True, else <sdf_fname>_pymol.sdf.
    """
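    sdf = Path(sdf).expanduser().resolve()
    # Build output names from the stem so that only the final extension
    # changes, even if '.sdf' also appears elsewhere in the path
    mol2_fname = str(sdf.with_name(sdf.stem + '_pymol.mol2'))
    new_sdf_fname = str(sdf) if overwrite else str(sdf.with_name(sdf.stem + '_pymol.sdf'))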
    cmd.load(str(sdf))
    cmd.h_add('all')
    cmd.save(mol2_fname)
    cmd.reinitialize()
    cmd.load(mol2_fname)
    cmd.save(str(new_sdf_fname))
    return new_sdf_fname
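
Since the round trip rewrites the file, it may be preferable to apply it only to molecules that fail to load in the first place. The following is a minimal sketch of that idea, not part of the original function: `load_with_fallback`, `read_mol` and `sanitise` are hypothetical names, and the reader and sanitiser are passed in as callables so the logic does not depend on any particular RDKit or PyMOL call.

```python
from pathlib import Path


def load_with_fallback(sdf, read_mol, sanitise):
    """Try to read `sdf`; if that fails, sanitise the file and try once more.

    Arguments:
        sdf: location of the sdf file
        read_mol: callable taking a filename, returning a molecule or None
        sanitise: callable taking a filename, returning a cleaned filename

    Returns:
        (molecule or None, filename that was finally read)
    """
    sdf = str(Path(sdf).expanduser())
    mol = read_mol(sdf)
    if mol is not None:
        # The original file was readable as-is, so leave it untouched
        return mol, sdf
    # First attempt failed: clean the file and retry exactly once
    cleaned = str(sanitise(sdf))
    return read_mol(cleaned), cleaned
```

With RDKit and PyMOL installed, `sanitise` would be `py_mollify` from above, and `read_mol` could be something along the lines of `lambda f: Chem.SDMolSupplier(f)[0]`, which, as far as I know, gives `None` for a molecule RDKit cannot parse.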

Making pwd redundant

I’m going to keep this one brief, because I am mid-confirmation-and-paper-writing madness. I have seen too many people – both beginners and seasoned veterans – wandering around their Linux filesystem blindfolded:

Isn’t it hideous?

Whenever you want to see where you are, you have to execute pwd (print working directory), which will print your absolute location to stdout. If you have many terminals open at the same time, it is easy to lose track of where you are, and every other command becomes pwd; surely, I hear you cry, there has to be a better way!

Well, fear not! With a little tinkering with ~/.bashrc, we can display the working directory as part of the special PS1 environment variable, responsible for how your username and computer are displayed above. Putting the following at the top of ~/.bashrc

me=`id | awk -F\( '{print $2}' | awk -F\) '{print $1}'`
export PS1="`uname -n |  /bin/sed 's/\..*//'`{$me}:\$PWD$ "

… saving, and starting a new terminal window results in:

Much better!

I haven’t used pwd in 3 years.

How to estimate the inestimable

Back-of-the-envelope calculations are one of our chief tools as scientists. When you spend most of your time wondering if your latest measurement is correct, having a tool to check if the numbers make sense is simply priceless. If you are lucky, a good estimate might just avoid a costly or laborious measurement — this is very common in disciplines like chemical engineering, which a friend described as “the art of estimating numbers and plugging them into some variation of Bernoulli’s continuity equation”. Unsurprisingly, these Fermi problems are now common interview questions at major consultancy and tech companies, and have even started to go viral.

Last week, I thought I would ask my biochemistry students to solve a back-of-the-envelope problem as part of their tutorial work. Disguised as an enzyme catalysis problem, I asked them to estimate the energy of a single hydrogen bond. Needless to say, they were puzzled. Some of them asked if I had forgotten to include some information in the problem sheet. For some reason, Fermi problems seem to be less common in chemistry and biology than they are in physics or engineering. Of course, estimating the energy of a hydrogen bond is in many ways much harder than guessing the number of ping pong balls that fit in a Boeing 747. Nobody has seen a hydrogen bond in the flesh. And our minds struggle to grasp the vast numbers present at the molecular level. Nevertheless, guesstimates are incredibly useful

Continue reading →

New Antibody Therapeutic INNs will no longer end in “-mab”!

Happy 2022, Blopiggers!

My first post of the year is about another major change to the way the World Health Organisation will be assigning International Nonproprietary Names (INNs) to antibody-based therapeutics. I haven’t seen this publicised widely, so I thought I’d share it here as it is an important consideration for anyone mining or exploiting this data.

Continue reading →

A logical brain teaser to derail your afternoon

Brain teasers have a strange power. For many they evoke nothing more than a mild and transient sense of curiosity. But for a certain subset of people they create an irresistible intellectual temptation, one which at times even needs to be actively avoided so as not to completely derail conversations and take over whole afternoons.

For better or worse, I am in the camp of people who are highly susceptible to brain teasers. I just love them too much. More than once in my lifetime I have had to ask a friend not to tell me about a particular brain teaser they had heard about, because I knew it would inevitably take over my mind and send me down an almost hypnotic spiral of thoughts whose only escape would be finding the solution.

While brain teasers can admittedly turn into ridiculously powerful distractions for some of us, they are not necessarily a waste of time. They have high recreational value and help the mind to enter a playful and creative state. They serve as mental gymnastics to directly train logical thinking skills, and logical thinking is arguably one of the most powerful transferable skills that exists. And last but not least, brain teasers are nowadays commonly used in job interviews at some of the world’s top employers (Google, Facebook, Microsoft, prestigious hedge funds, …).

In this post, I will present one of my favourite brain teasers to see if I can get you hooked. It is a slightly modified and self-contained version of the so-called pirate game. You can find the solution at the end of the page. Enjoy responsibly!

Continue reading →

Getting the PDB structures of compounds in ChEMBL

Recently I was dealing with a set of compounds with known target activities from the ChEMBL database, and I wanted to find out which of them also had PDB crystal structures in complex with that target.

Cross-referencing this manually is very easy for cases where we are interested in 2-3 compounds, but for any larger number, using the ChEMBL and PDB web services greatly reduces the number of clicks.

Continue reading →

A handful of lesser known python libraries

There are more Python libraries than you can shake a stick at, but here are a handful that don’t get much love and may save you some brain power, compute time or both.

Fire is a library which turns your normal Python functions into command-line utilities without requiring more than a couple of additional lines of copy-and-paste code. Being able to immediately access your functions from the command line is amazingly helpful when you’re making quick and dirty utilities, and saves you from reaching for the nuclear approach of using getopt.

Continue reading →

Uniformly sampled 3D rotation matrices

It’s not as simple as you’d think.

If you want to skip the small talk, the code is at the bottom. Sampling 2D rotations uniformly is simple: rotate by an angle from the uniform distribution $\theta \sim U(0, 2\pi)$. Extending this idea to 3D rotations, we could sample each of the three Euler angles from the same uniform distribution $\phi, \theta, \psi \sim U(0, 2\pi)$. This, however, gives more probability density to transformations which are clustered towards the poles:

Sampling Euler angles uniformly does not give an even distribution across the sphere.
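
The clustering can be checked numerically. The sketch below is a minimal illustration rather than the method from the paper discussed in this post: it assumes the convention $R = R_z(\phi) R_y(\theta) R_x(\psi)$, tracks where each rotation sends the x-axis, and compares this against rotations built from normalised 4D Gaussian samples (unit quaternions), which is a different, well-known route to uniform rotations. All function names are made up for this example.

```python
import math
import random


def euler_image_of_x_axis(rng):
    """Image of the x-axis under Rz(phi) @ Ry(theta) @ Rx(psi), angles ~ U(0, 2*pi)."""
    phi = rng.uniform(0.0, 2.0 * math.pi)
    theta = rng.uniform(0.0, 2.0 * math.pi)
    # Rx(psi) leaves the x-axis fixed, so psi does not affect this vector.
    # Ry(theta) maps (1, 0, 0) to (cos(theta), 0, -sin(theta)).
    x, z = math.cos(theta), -math.sin(theta)
    # Rz(phi) then rotates the (x, y) components about the z-axis.
    return (x * math.cos(phi), x * math.sin(phi), z)


def quaternion_rotation_matrix(rng):
    """Rotation matrix from a unit quaternion made by normalising 4 Gaussians."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def polar_fraction(z_values, cutoff=0.9):
    """Fraction of points lying near either pole, i.e. with |z| > cutoff."""
    z_values = list(z_values)
    return sum(abs(z) > cutoff for z in z_values) / len(z_values)


rng = random.Random(0)  # fixed seed so the comparison is repeatable
n = 20000
euler_z = [euler_image_of_x_axis(rng)[2] for _ in range(n)]
# The first column of a rotation matrix is the image of the x-axis,
# so its z-coordinate is the entry in row 2, column 0.
quat_z = [quaternion_rotation_matrix(rng)[2][0] for _ in range(n)]

print("Euler angles:", polar_fraction(euler_z))
print("Unit quaternions:", polar_fraction(quat_z))
```

For points spread evenly over the sphere, the z-coordinate is uniform on $[-1, 1]$, so about 10% of them should have $|z| > 0.9$. With uniform Euler angles in this convention, $z = -\sin\theta$, and the expected fraction is $1 - \frac{2}{\pi}\arcsin(0.9) \approx 0.29$, nearly three times as many points near the poles.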

In Fast Random Rotation Matrices (James Arvo, 1992), a method for sampling uniform random 3D rotation matrices is outlined, the main steps being:

Continue reading →

AlphaFold 2 is here: what’s behind the structure prediction miracle

Nature has now released the AlphaFold 2 paper, after eight long months of waiting. The main text reports more or less what we have known for nearly a year, with some added tidbits, although it is accompanied by a painstaking description of the architecture in the supplementary information. Perhaps more importantly, the authors have released the entirety of the code, including all details to run the pipeline, on GitHub. And there is no small print this time: you can run inference on any protein (I’ve checked!).

Have you not heard the news? Let me refresh your memory. In November 2020, a team of AI scientists from Google DeepMind indisputably won the 14th Critical Assessment of Structure Prediction (CASP14) competition, a biennial blind test where computational biologists try to predict the structure of several proteins whose structure has been determined experimentally but not publicly released. Their results were so astounding, and the problem so central to biology, that it took the entire world by surprise and left an entire discipline, computational biology, wondering what had just happened.

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends
