Category Archives: Code

Packaging with Conda

If you are as happy for the big snake as I am, you have probably wondered how you can create a Conda package with your amazing code. Fear not, in the following text you will learn how to make others go;

conda install -c coolperson amazingcode

Roughly, the only thing needed to create a Conda package, is a ‘meta.yaml’ file specific for your code. This file contains all the metadata needed to create your package and is highly customizable. While this means the meta.yaml can be written to allow your Conda package to work on any operating system and with any dependencies (doesn’t have to be python) it can be annoying to write from scratch (here is a guide for manually writing this file). Since we just want to create a simple Conda package, we will in this guide avoid fiddling around with the meta.yaml file and instead create the file based on a PyPI package. This will also give you a nice template, if you later need to adapt your meta.yaml file.
Note: Conda packages can also be made from GitHub repositories, which is likely favorable in most cases, but it also requires some manual work on the meta.yaml.

1. Create a PyPI package of your code

Continue reading

Using normalized SuCOS scores.

If you are working in cheminformatics or utilise protein-ligand docking, then you should be aware of the SuCOS score, an open-source shape and chemical feature overlap metric designed by a former member of OPIG: Susan Leung.

The metric compares the 3D conformers of two ligands based on their shape overlap as well as their chemical feature overlap using the RDKit toolkit. Leung et al. show that SuCOS is able to select fewer false positives and false negatives when doing re-docking studies than other scoring metrics such as RMSD or Protein Ligand Interaction Fingerprints (PLIF) similarity scores and performs better at differentiating actives from decoys when tested on the DUD-E dataset.

Most importantly, SuCOS was designed with fragment based drug discovery in focus, where a smaller fragment ligand is elaborated or combined with other fragments to create a larger molecule, with hopefully stronger binding affinity. Unlike for example RMSD, SuCOS is able to quickly calculate an overlap score between a small fragment and a larger molecule, giving chemists an idea on how the fragment elaboration might interact with the protein. However, the original SuCOS algorithm was not normalized and could create scores of > 1 for some cases.

I’ve uploaded a normalised version of the original SuCOS algorithm as a GitHub fork of Susan’s original repository. You can find the normalised SuCOS algorithm here.

Hopefully this is helpful for anyone using the SuCOS algorithm and for all docking enthusiasts who are interested in an alternative way to evaluate their docked poses.

Monty Python

Every now and then I decide to overthink a problem I thought I understood and get confused – last week, it was the Monty Hall problem. 

For those unfamiliar with the thought experiment, the basic premise is that you are on a game show and are presented with three doors. Behind one of the doors is a car, while behind the other two are goats. 

With zero initial information, you make a guess as to which door you think the car is behind (we assume you have enough goats already). Before looking behind your chosen door, the host opens one of the remaining two doors and reveals a goat. The host then asks you if you would like to change your guess. What should you do? 

Continue reading

Getting the PDB structures of compounds in ChEMBL

Recently I was dealing with a set of compounds with known target activities from the ChEMBL database, and I wanted to find out which of them also had PDB  crystal structures in complex with that target.

Referencing this manually is very easy for cases where we are interested in 2-3 compounds, but for any larger number, using the ChEMBL and PDB web services greatly reduces the number of clicks.

Continue reading

Command-Line Interfaces (CLIs), argparse.ArgumentParser and some of my tricks.

Command-Line Interfaces (CLIs) are one of the best ways of providing your programs with useful parameters to customize their execution. If you are not familiar with CLI, in this blog post we will introduce them. Let’s say that you have a program that reads a file, computes something, and then, writes the results into another file. The simplest way of providing those arguments would be:

$ python mycode.py my/inputFile my/outputFile
### mycode.py ###
def doSomething(inputFilename):
    with open(inputFilename) as f:
        return len(f.readlines())

if __name__ == "__main__":
    #Notice that the order of the arguments is important
    inputFilename = sys.argv[1]
    outputFilename = sys.argv[2]

    with open(outputFilename, "w") as f:
        f.write( doSomething(inputFilename))
Continue reading

Multiple Testing: What is it, why is it bad and how can we avoid it?

P-values play a central role in the analysis of many scientific experiments. But, in 2015, the editors of the Journal of Basic and Applied Social Psychology prohibited the usage of p-values in their journal. The primary reason for the ban was the proliferation of results obtained by so-called ‘p-hacking’, where a researcher tests a range of different hypotheses and publishes the ones which attain statistical significance while discarding the others. In this blog post, we’ll show how this can lead to spurious results and discuss a few things you can do to avoid engaging in this nefarious practice.

The Basics: What IS a p-value?

Under a Hypothesis Testing framework, a p-value associated with a dataset is defined as the probability of observing a result that is at least as extreme as the observed one, assuming that the null hypothesis is true. If the probability of observing such an event is extremely small, we conclude that it is unlikely the null hypothesis is true and reject it.

But therein lies the problem. Just because the probability of something is small, that doesn’t make it impossible. Using the standard significance test threshold of 0.05, even if the null hypothesis is true, there is a 5% chance of obtaining a p-value below the significance threshold and therefore rejecting it. Such false positives are an inescapable part of research; there’s always a possibility that the subset you were working with isn’t representative of the global data and sometimes we take the wrong decision even though we analysed the data in a perfectly rigorous fashion.

Continue reading

How to interact with small molecules in Jupyter Notebooks

The combination of Python and the cheminformatics toolkit RDKit has opened up so many ways to explore chemistry on a computer. Jupyter — named for the three languages, Julia, Python, and R — ties interactivity and visualization together, creating wonderful environments (Notebooks and JupyterLab) to carry out, share and reproduce research, including:

“data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.”

—https://jupyter.org

At this year’s annual RDKit UGM (User Group Meeting), Cédric Bouysset shared a tutorial explaining how to create a grid of molecules that you can interact with, using his “mols2grid“:

Continue reading

List comprehension: an elegant Python feature inspired by mathematical set theory

Even though I have now deeply entered into the fascinating world of statistical machine learning and computational chemistry, my original background is very much in pure mathematics. Having spent some of my intellectually formative years in this highly purified and abstract universe, I still love to think in terms of sets, ordered tuples and well-defined functions whenever I have the luxury of being able to do so. This might be why list comprehension is one of my favourite features in Python.

List comprehension allows you to efficiently map a function over a list using elegant notation inspired by mathematical set theory. Let us first consider a (mathematical) set

A := \{1, 3, 7 \}.

Continue reading

Watch out when using PDBbind!

Now that PDBbind 2020 has been released, I want to draw some attention to an issue with using the SDF files that are supplied in the PDBbind refined set 2020.

Normally, SDF files save the chirality information of compounds in the atom block of the file which is shown belowas a snipped of the full sdf file for the ligand of PDB entry 4qsv. The column that defines chirality is marked in red.

As you can see, all columns shown here are 0. The SDF files supplied by PDBbind for some reason do NOT encode chirality information explicitly. This will be a problem when using RDKit to read the molecule and transform it into a smiles string. By using the following commands to read the ligand for 4qsv from PDBBind 2020 and write a SMILES string, we get:

Continue reading

Lessons in Scientific Code Deployment

So, I recently deployed my first piece of scientific code. Well, sort of. I made a github with instructions on how to download, install and run it.

And then everyone broke it.

So, now having been on tech support duty for a few weeks, it seemed like a good idea to have a think about what I’ve learned.

Now, there is a big preface to this: the first and most important thing I learned is that I should do some reading on how to do this well. I have not yet done that reading, so this post isn’t so much going to offer any advice as catalogue my mistakes. Mistakes that will probably look extremely silly to anyone who has any familiarity with deployment, but might be interesting to anyone who doesn’t.

A surprising number of people really don’t want to touch the command line

Being a programmer who spends the vast majority of their time on the command line, invoking programs from there is very natural. As such, I very much underestimated the obstacle that even installing anaconda, a few packages, and cloning the source code would be. Even with instructions to copy and paste. 

The issue is, if anything goes wrong, there is a good chance they don’t know whether it is my code or their environment breaking, which probably means they need to contact me about it (more on environments later). 

Really, I probably could have saved myself an awful lot of support by making it an installable, and more with a gui to guide people through using the program.

Python is a pain

So, the first thing I learned was something I’d kind of been warned about: deploying python code is a pain in the butt. Especially to people who aren’t familiar with python, managing python environments is both tricky and overwhelming easy to break code with. Run a python script from the wrong environment and it is going to fail: if you are lucky with a failure to import a module, if you are unlucky with a cryptic error due to say changes between various python versions.
Speaking of python versions, developing in 3.9 and not testing in 3.7 then telling people to install that can result in a surprising number of surprisingly difficult bugs.

The instructions weren’t clear enough

Scientific code I think generally caters an awful lot to expert users, people who really understand the model and even are willing to open the source code to figure out the implementation.

My first stab at documentation managed to not be clear enough to the people who didn’t want to touch the command line and those who were willing to open the source code because they wanted to do something spicy.

So yeah, good documentation is an acquired skill.

Distributed computing is a nightmare

In principle, distribution is terrific: get a library that will allow you to reduce running arbitrary python code on multiple nodes to a simple map-like interface. On big clusters, like a lot of scientists use, this can mean speed ups from 10 to even 1000 times.

The only problem is, everyone’s cluster is a special snowflake, and you can’t access most of them to fix things. This can make iteration with a non-programmer painfully slow. 

Libraries don’t help as much as I’d have thought either: indeed, my experience of Dask and Dask Jobqueue has been a consistently uphill battle. From the fact that my workload likes individual nodes sharing lots of memory and a few cpus to some truly arcane errors (one that broke in the msgpack code), I have generally considered (and even started) writing my own code to do this.

Active development doesn’t reach people

Code that is being updated several times a day in response to bugfixes can be great – but if people aren’t pulling and installing it, no-one is going to benefit. I’m seriously tempted to write some code to either auto-update on running or at least let folk know it has been updated.

Summary

In summary, a lot went wrong in my first stab at this. Very much come to appreciate a good deployment is an artform, and I’ve got an awful lot of reading to do. In particular, the above problem areas really have eaten a lot of time that probably could have been used doing actual science with the code, so there is a good incentive to get it right.