Category Archives: Code

Python’s Data Classes

When writing code, you have inevitably needed to store data throughout your pipeline. In these cases you store your value, list or data frame as a variable to easily use it elsewhere in your code. However, sometimes your data has an awkward form, consisting of a number of different length lists or data of different types and sizes. While it is still doable to work with, and using tuples or dictionaries can help, accessing different elements in your data quickly becomes messy and it is less intuitive what your code is actually doing.

To solve the above stated problem, data classes were introduced as a new feature in Python 3.7. A data class is a regular Python class, but with certain methods already implemented for you. This makes them easy to create and removes a lot of boilerplate (repeated code) making them simpler, more intuitive and pretty. Further, as data classes are part of the standard library, you can directly import it without needing to install any external dependencies (noice).

With the sales pitch out of the way, let us look at how we can use data classes.

from dataclasses import dataclass
from typing import Any

@dataclass
class Antibody:
    vgene: str
    jgene: None
    sequence: Any = 'EVQ'

Continue reading →

Tracking Changes in LaTeX

Tracking changes in Microsoft Word is easy – we just click the ‘Track Changes’ button. It requires a little more work in LaTeX but here is a quick guide to doing it as painlessly as possible!

First, we want our original document to be stored in a *.tex file in an Overleaf project (as shown below)

Continue reading →

Solving WORDLE with grep

People seem to have become obsessed with wordle, just like they became obsessed with sudoku. After my initial burst of “oh a new game!” had waned, I was left thinking “my time is precious and this is exactly what we have computers for”. With this in mind, below is my quick and dirty way of solving these. I’m sure the regexp gurus amongst you will have a more elegant solution.

Step 1: Make sure you’ve got /usr/share/dict/words installed. This is just a huge list of words in a specific language and for me, this required installing the British words list.

sudo apt-get install wbritish

Step 2: Go to wordle

Step 3: Pick a random 5-letter word as your starting point. This is where grep and /usr/share/dict/words comes in:

Continue reading →

snakeMAKE better workflows with your code

When developing your pipeline for processing, annotating and/or analyzing data, you will probably find yourself needing to continuously re-run it, as you play around with your code. This can become a problem when working with long pipelines, large datasets and cpu’s begging you not to run some pieces of code again.

Luckily, you are not the first one to have been annoyed by this and other related struggles. Some people were actually so annoyed that they created Snakemake. Snakemake can be used to create workflows and help solve problems, such as the one mentioned above. This is done using a Snakefile, which helps you split your pipeline into “rules”. To illustrate how this helps you create a better workflow, we will be looking at the example below.

Continue reading →

Packaging with Conda

If you are as happy for the big snake as I am, you have probably wondered how you can create a Conda package with your amazing code. Fear not, in the following text you will learn how to make others go;

conda install -c coolperson amazingcode

Roughly, the only thing needed to create a Conda package, is a ‘meta.yaml’ file specific for your code. This file contains all the metadata needed to create your package and is highly customizable. While this means the meta.yaml can be written to allow your Conda package to work on any operating system and with any dependencies (doesn’t have to be python) it can be annoying to write from scratch (here is a guide for manually writing this file). Since we just want to create a simple Conda package, we will in this guide avoid fiddling around with the meta.yaml file and instead create the file based on a PyPI package. This will also give you a nice template, if you later need to adapt your meta.yaml file.
Note: Conda packages can also be made from GitHub repositories, which is likely favorable in most cases, but it also requires some manual work on the meta.yaml.

1. Create a PyPI package of your code

Continue reading →

Using normalized SuCOS scores.

If you are working in cheminformatics or utilise protein-ligand docking, then you should be aware of the SuCOS score, an open-source shape and chemical feature overlap metric designed by a former member of OPIG: Susan Leung.

The metric compares the 3D conformers of two ligands based on their shape overlap as well as their chemical feature overlap using the RDKit toolkit. Leung et al. show that SuCOS is able to select fewer false positives and false negatives when doing re-docking studies than other scoring metrics such as RMSD or Protein Ligand Interaction Fingerprints (PLIF) similarity scores and performs better at differentiating actives from decoys when tested on the DUD-E dataset.

Most importantly, SuCOS was designed with fragment based drug discovery in focus, where a smaller fragment ligand is elaborated or combined with other fragments to create a larger molecule, with hopefully stronger binding affinity. Unlike for example RMSD, SuCOS is able to quickly calculate an overlap score between a small fragment and a larger molecule, giving chemists an idea on how the fragment elaboration might interact with the protein. However, the original SuCOS algorithm was not normalized and could create scores of > 1 for some cases.

I’ve uploaded a normalised version of the original SuCOS algorithm as a GitHub fork of Susan’s original repository. You can find the normalised SuCOS algorithm here.

Hopefully this is helpful for anyone using the SuCOS algorithm and for all docking enthusiasts who are interested in an alternative way to evaluate their docked poses.

Monty Python

Every now and then I decide to overthink a problem I thought I understood and get confused – last week, it was the Monty Hall problem.

For those unfamiliar with the thought experiment, the basic premise is that you are on a game show and are presented with three doors. Behind one of the doors is a car, while behind the other two are goats.

With zero initial information, you make a guess as to which door you think the car is behind (we assume you have enough goats already). Before looking behind your chosen door, the host opens one of the remaining two doors and reveals a goat. The host then asks you if you would like to change your guess. What should you do?

Continue reading →

Getting the PDB structures of compounds in ChEMBL

Recently I was dealing with a set of compounds with known target activities from the ChEMBL database, and I wanted to find out which of them also had PDB crystal structures in complex with that target.

Referencing this manually is very easy for cases where we are interested in 2-3 compounds, but for any larger number, using the ChEMBL and PDB web services greatly reduces the number of clicks.

Continue reading →

Command-Line Interfaces (CLIs), argparse.ArgumentParser and some of my tricks.

Command-Line Interfaces (CLIs) are one of the best ways of providing your programs with useful parameters to customize their execution. If you are not familiar with CLI, in this blog post we will introduce them. Let’s say that you have a program that reads a file, computes something, and then, writes the results into another file. The simplest way of providing those arguments would be:

$ python mycode.py my/inputFile my/outputFile

### mycode.py ###
def doSomething(inputFilename):
    with open(inputFilename) as f:
        return len(f.readlines())

if __name__ == "__main__":
    #Notice that the order of the arguments is important
    inputFilename = sys.argv[1]
    outputFilename = sys.argv[2]

    with open(outputFilename, "w") as f:
        f.write( doSomething(inputFilename))

Continue reading →

Multiple Testing: What is it, why is it bad and how can we avoid it?

P-values play a central role in the analysis of many scientific experiments. But, in 2015, the editors of the Journal of Basic and Applied Social Psychology prohibited the usage of p-values in their journal. The primary reason for the ban was the proliferation of results obtained by so-called ‘p-hacking’, where a researcher tests a range of different hypotheses and publishes the ones which attain statistical significance while discarding the others. In this blog post, we’ll show how this can lead to spurious results and discuss a few things you can do to avoid engaging in this nefarious practice.

The Basics: What IS a p-value?

Under a Hypothesis Testing framework, a p-value associated with a dataset is defined as the probability of observing a result that is at least as extreme as the observed one, assuming that the null hypothesis is true. If the probability of observing such an event is extremely small, we conclude that it is unlikely the null hypothesis is true and reject it.

But therein lies the problem. Just because the probability of something is small, that doesn’t make it impossible. Using the standard significance test threshold of 0.05, even if the null hypothesis is true, there is a 5% chance of obtaining a p-value below the significance threshold and therefore rejecting it. Such false positives are an inescapable part of research; there’s always a possibility that the subset you were working with isn’t representative of the global data and sometimes we take the wrong decision even though we analysed the data in a perfectly rigorous fashion.

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends

Category Archives: Code

Python’s Data Classes

Tracking Changes in LaTeX

Solving WORDLE with grep

snakeMAKE better workflows with your code

Packaging with Conda

1. Create a PyPI package of your code

Using normalized SuCOS scores.

Monty Python

Getting the PDB structures of compounds in ChEMBL

Command-Line Interfaces (CLIs), argparse.ArgumentParser and some of my tricks.

Multiple Testing: What is it, why is it bad and how can we avoid it?

The Basics: What IS a p-value?