Category Archives: Hints and Tips

Sharing Data Responsibly: The FAIR Principles

So you’ve submitted your paper, made your code publicly available, and maybe even provided documentation to ensure somebody can reproduce your work. But what about the data your work is based on? Is that readily available to your readers, too?

Maybe it’s too large to put on GitHub alongside your code. Maybe it’s sensitive, or subject to GDPR restrictions, so you can’t just stick a download link on your website. Maybe it’s in a proprietary format that needs non-open software to read. There are many reasons sharing data can be less straightforward than sharing code, and often it’s not entirely clear what ‘best practices’ are for a given situation. Data management is a complicated topic, and to do it justice would require far more than a quick blog post. Instead, I’d like to focus on a single source of guidance that serves as a useful starting point for thinking about responsible data management: the FAIR principles.

Continue reading →

How to prepare a molecule for RDKit

RDKit is very fussy when it comes to inputs in SDF format. Using the SDMolSupplier, we get a significant rate of failure even on curated datasets such as the PDBBind refined set. Pymol has no such scruples, and with that, I present a function which has proved invaluable to me over the course of my DPhil. For reasons I have never bothered to explore, using pymol to convert from sdf, into mol2 and back to sdf format again (adding in missing hydrogens along the way) will almost always make a molecule safe to import using RDKit:

from pathlib import Path
from pymol import cmd

def py_mollify(sdf, overwrite=False):
    """Use pymol to sanitise an SDF file for use in RDKit.

    Arguments:
        sdf: location of faulty sdf file
        overwrite: whether or not to overwrite the original sdf. If False,
            a new file will be written in the form <sdf_fname>_pymol.sdf
            
    Returns:
        Original sdf filename if overwrite == False, else the filename of the
        sanitised output.
    """
    sdf = Path(sdf).expanduser().resolve()
    mol2_fname = str(sdf).replace('.sdf', '_pymol.mol2')
    new_sdf_fname = sdf if overwrite else str(sdf).replace('.sdf', '_pymol.sdf')
    cmd.load(str(sdf))
    cmd.h_add('all')
    cmd.save(mol2_fname)
    cmd.reinitialize()
    cmd.load(mol2_fname)
    cmd.save(str(new_sdf_fname))
    return new_sdf_fname

Making pwd redundant

I’m going to keep this one brief, because I am mid-confirmation-and-paper-writing madness. I have seen too many people – both beginners and seasoned veterans – wandering around their Linux filesystem blindfolded:

Isn’t it hideous?

Whenever you want to see where you are, you have to execute pwd (present working directory), which will print your absolute location to stdout. If you have many terminals open at the same time, it is easy to lose track of where you are, and every other command becomes pwd; surely, I hear you cry, there has to be a better way!

Well, fear not! With a little tinkering with ~/.bashrc, we can display the working directory as part of the special PS1 environment variable, responsible for how your username and computer are displayed above. Putting the following at the top of ~/.bashrc

me=`id | awk -F\( '{print $2}' | awk -F\) '{print $1}'`
export PS1="`uname -n |  /bin/sed 's/\..*//'`{$me}:\$PWD$ "

… saving, and starting a new termanal window results in:

Much better!

I haven’t used pwd in 3 years.

Non-linear Dependence? Mutual Information to the Rescue!

We are all familiar with the idea of a correlation. In the broadest sense of the word, a correlation can refer to any kind of dependence between two variables. There are three widely used tests for correlation:

Spearman’s r: Used to measure a linear relationship between two variables. Requires linear dependence and each marginal distribution to be normal.
Pearson’s ρ: Used to measure rank correlations. Requires the dependence structure to be described by a monotonic relationship
Kendall’s 𝛕: Used to measure ordinal association between variables.

While these three measures give us plenty of options to work with, they do not work in all cases. Take for example the following variables, Y1 and Y2. These might be two variables that vary in a concerted manner.

Perhaps we suspect that a state change in Y1 leads to a state change in Y2 or vice versa and we want to measure the association between these variables. Using the three measures of correlation, we get the following results:

Continue reading →

How to Blopig

Blopig has a wealth of knowledge, with everything from a Bayesian answer to the question “should ketchup be stored in the fridge?“* to the Nobel-Prize-Winner-approved analysis of AlphaFold2. Blopig runs on WordPress and uses blocks, components for adding different types of content to a post. These are blocks like paragraphs, headers, images, image galleries, and videos. Here are some hints and tips for getting the most out of WordPress.

One of the first blocks worth mentioning is the “Read More…” block…

Continue reading →

Python’s Data Classes

When writing code, you have inevitably needed to store data throughout your pipeline. In these cases you store your value, list or data frame as a variable to easily use it elsewhere in your code. However, sometimes your data has an awkward form, consisting of a number of different length lists or data of different types and sizes. While it is still doable to work with, and using tuples or dictionaries can help, accessing different elements in your data quickly becomes messy and it is less intuitive what your code is actually doing.

To solve the above stated problem, data classes were introduced as a new feature in Python 3.7. A data class is a regular Python class, but with certain methods already implemented for you. This makes them easy to create and removes a lot of boilerplate (repeated code) making them simpler, more intuitive and pretty. Further, as data classes are part of the standard library, you can directly import it without needing to install any external dependencies (noice).

With the sales pitch out of the way, let us look at how we can use data classes.

from dataclasses import dataclass
from typing import Any

@dataclass
class Antibody:
    vgene: str
    jgene: None
    sequence: Any = 'EVQ'

Continue reading →

Tracking Changes in LaTeX

Tracking changes in Microsoft Word is easy – we just click the ‘Track Changes’ button. It requires a little more work in LaTeX but here is a quick guide to doing it as painlessly as possible!

First, we want our original document to be stored in a *.tex file in an Overleaf project (as shown below)

Continue reading →

Solving WORDLE with grep

People seem to have become obsessed with wordle, just like they became obsessed with sudoku. After my initial burst of “oh a new game!” had waned, I was left thinking “my time is precious and this is exactly what we have computers for”. With this in mind, below is my quick and dirty way of solving these. I’m sure the regexp gurus amongst you will have a more elegant solution.

Step 1: Make sure you’ve got /usr/share/dict/words installed. This is just a huge list of words in a specific language and for me, this required installing the British words list.

sudo apt-get install wbritish

Step 2: Go to wordle

Step 3: Pick a random 5-letter word as your starting point. This is where grep and /usr/share/dict/words comes in:

Continue reading →

Simplify your life with SLURM and sync

For my first blog post of the year, we’re talking about SLURM, everyone’s favorite job manager. If like me, you have the joy of running a literal boat-load of jobs with all kinds of parameters and command-line arguments you’ll know there are a few tips and tricks that make the process of managing these tasks and results as painless as possible. Now, I do expect most people reading this will already be aware of these tricks but for those who don’t, I hope this is helpful. After all, it’s impossible to know what you don’t know you need to know, you know? Any alternatives, improvements, or suggestions are welcome!

Array Jobs

Job arrays are perfect for the times you want to run the same job several times with slight differences each time. Imagine you need to repeat a job 10 times with slightly different arguments with each run. Rather than submit 10 (slightly different) batch scripts you can submit 1 script with all the information needed to complete all 10 jobs.

Continue reading →

snakeMAKE better workflows with your code

When developing your pipeline for processing, annotating and/or analyzing data, you will probably find yourself needing to continuously re-run it, as you play around with your code. This can become a problem when working with long pipelines, large datasets and cpu’s begging you not to run some pieces of code again.

Luckily, you are not the first one to have been annoyed by this and other related struggles. Some people were actually so annoyed that they created Snakemake. Snakemake can be used to create workflows and help solve problems, such as the one mentioned above. This is done using a Snakefile, which helps you split your pipeline into “rules”. To illustrate how this helps you create a better workflow, we will be looking at the example below.

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends

Category Archives: Hints and Tips

Sharing Data Responsibly: The FAIR Principles

How to prepare a molecule for RDKit

Making pwd redundant

Non-linear Dependence? Mutual Information to the Rescue!

How to Blopig

Python’s Data Classes

Tracking Changes in LaTeX

Solving WORDLE with grep

Simplify your life with SLURM and sync

Array Jobs

snakeMAKE better workflows with your code