Category Archives: Code

Hidden Markov Models in Python: A simple Hidden Markov Model with Known Emission Matrix fitted with hmmlearn

The Hidden Markov Model

Consider a sensor which tells you whether it is cloudy or clear, but is wrong with some probability. Now, the weather really *is* either cloudy or clear (we could go outside and check), so there is a “true” state, but we only have noisy observations from which to attempt to infer it.

We might model this process with a Hidden Markov Model (assuming the weather itself evolves as a Markov chain), and attempt to make inferences about the true state of the weather over time, the rate at which the weather changes, and how noisy our sensor is.

The Hidden Markov Model describes a hidden Markov Chain which at each step emits an observation with a probability that depends on the current state. In general both the hidden state and the observations may be discrete or continuous.

But for simplicity’s sake, let’s consider the case where both the hidden and observed spaces are discrete. Then, the Hidden Markov Model is parameterised by two matrices: the transition matrix and the emission matrix.
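
For a flavour of how this looks in practice, here is a minimal sketch (assuming hmmlearn’s CategoricalHMM, called MultinomialHMM in older hmmlearn releases) of fitting a two-state weather model while keeping a known emission matrix fixed; the observation sequence below is a made-up toy example.

import numpy as np
from hmmlearn import hmm

# Known sensor noise: P(observation | hidden state), e.g. a 10% error rate.
emission_matrix = np.array([[0.9, 0.1],
                            [0.1, 0.9]])

model = hmm.CategoricalHMM(
    n_components=2,
    init_params="st",  # only initialise the start probabilities and transition matrix
    params="st",       # ...and only re-estimate those two during fitting
    random_state=0,
)
model.emissionprob_ = emission_matrix  # fix the known emission matrix

# Toy sequence of sensor readings (0 = "clear", 1 = "cloudy").
observations = np.array([0, 0, 1, 1, 1, 0, 1, 1, 0, 0]).reshape(-1, 1)
model.fit(observations)

print(model.transmat_)              # estimated weather transition matrix
print(model.predict(observations))  # most likely sequence of hidden weather states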

Continue reading

Code that I am grateful for

To address some of the karmic imbalance created by computational scientists complaining about other people’s code, I am listing here some (not all) of other people’s code that I love.

IgBLAST

IgBLAST is a sequence alignment tool for immunoglobulin sequences implemented in the NCBI C++ toolkit – it applies the classic BLAST algorithm to searching immunoglobulin germline gene databases. It always impresses me how quickly it works. The paper is here, and the authors are Jian Ye, Ning Ma, Thomas L. Madden and James M. Ostell.

Continue reading

Singularity: a guide for the bewildered bioinformatician

Have you ever worked with a piece of software that is awfully difficult to set up? That legacy code written in FORTRAN 77, that other one that requires significant modifications to compile, or any of those that require a long-winded bash script with a thousand dependencies (which you also have to install!). Would it not be helpful if, when that red-eyed PhD student, the one who just spent three months writing up their thesis, says that they absolutely must use that server where you have installed all your stuff, you could just relocate to another one without trouble? Well, you may be able to do that now. You just need to use containerization.

The idea behind containerization is rather simple. The best way to ensure anyone can reproduce your work is to, well, ship your entire system to whomever needs to use it. You could, for example, pack up your desktop in a box and ship it to your collaborators anywhere in the world. Unfortunately, this idea is quite impractical, not only because of tedious logistics (ever had to deal with customs?), but also because you would suddenly be unable to run your own pipeline. However, it is a good enough idea that at some point it made a clever engineer wonder whether there was a way to ship an entire system without physically delivering the computer. And that’s exactly what they designed.

Figure: a 40ft high cube shipping container. Best way to make sure your collaborators on the other side of the world can run your pipeline — just pack your desktop in one of these, and ship it away!
Continue reading

C++ python bindings in 5 minutes

You don’t even need to use CMake!

Most of the time, we can use libraries like numpy (which is largely written in C) to speed up our calculations, which works when we are dealing with matrices or vectors – but sometimes loops are unavoidable. In those instances, it would be nice if we could use a compiled language such as C++ to remove the bottleneck.

This can be achieved very easily with pybind11, which lets us expose C++ functions and classes as importable python objects. We can do all of this without CMake, using pybind11’s Pybind11Extension class together with a modified setup.py. pybind11 can be compiled from source or installed with:

pip install pybind11
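
As a rough sketch of what this looks like (assuming a hypothetical C++ source file example.cpp that defines a module called "example" via PYBIND11_MODULE), the whole setup.py can be as short as:

from setuptools import setup
from pybind11.setup_helpers import Pybind11Extension, build_ext

ext_modules = [
    Pybind11Extension(
        "example",        # name of the importable python module
        ["example.cpp"],  # C++ sources to compile
        cxx_std=17,       # compile as C++17
    ),
]

setup(
    name="example",
    ext_modules=ext_modules,
    cmdclass={"build_ext": build_ext},  # pybind11's build_ext sets sensible compiler flags
)

Running pip install . (or python setup.py build_ext --inplace) should then produce an importable example module.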
Continue reading

Better understanding of correlation

Although “correlation” is often used to mean the linear relationship between two sets of points, in the following text I will use it more broadly to mean any relationship between two sets of points.

You have tasked yourself with finding the correlations between the different features in your dataset. Your purpose could be to remove highly correlated features or simply to improve your understanding of your data. Either way, calculating the Pearson Correlation Coefficient (PCC) or Spearman’s rank Correlation Coefficient (SCC) to get an overview of the correlations might be the first thing that comes to mind.

Unfortunately, both of these are limited to linear (PCC) or monotonic (SCC) relationships. In datasets with many complex features, many of them will be highly correlated, just not linearly (or monotonically). Instead, these correlations can be non-linear and, as seen in the third row of the figure below, go undetected by PCC.

Figure: PCC of different sets of x and y points. https://en.wikipedia.org/wiki/Correlation_and_dependence
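
The effect in that third row is easy to reproduce yourself. A minimal sketch (assuming scipy and a made-up y = x**2 relationship) might look like this:

import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.linspace(-1, 1, 1000)
y = x ** 2  # a perfectly deterministic relationship, but neither linear nor monotonic

print(pearsonr(x, y))   # PCC close to zero: no linear relationship detected
print(spearmanr(x, y))  # SCC close to zero: no monotonic relationship detected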
Continue reading

ORDER!: Returning bond order information to your docked poses


Common docking software, such as AutoDock Vina or AutoDock 4, requires the ligand and receptor files to be converted into the PDBQT format. Once a correct pose has been identified, that pose is also output as a .pdbqt file.
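
One common way to put the bond orders back (not necessarily the approach taken in the post) is to match the docked pose against a template with known chemistry, for example with RDKit’s AssignBondOrdersFromTemplate. The file names and SMILES below are hypothetical placeholders, and the .pdbqt pose is assumed to have been converted to PDB first (e.g. with Open Babel).

from rdkit import Chem
from rdkit.Chem import AllChem

# Template with correct bond orders, built from the ligand's known SMILES (aspirin here).
template = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# Docked pose read from PDB: the coordinates are right, but all bonds are single.
pose = Chem.MolFromPDBFile("docked_pose.pdb")

# Transfer the template's bond orders onto the pose and write it back out.
fixed = AllChem.AssignBondOrdersFromTemplate(template, pose)
Chem.MolToMolFile(fixed, "docked_pose_with_bond_orders.mol")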

Continue reading

Plotly for interactive 3D plotting

I recently wrote a post on how to use the seaborn library. I really like seaborn and use it a lot for 2D plots. However, recently I have been dealing with 3D data and have found plotly to be the best option. When used in a jupyter notebook, it allows you to easily generate interactive 3D plots, which is extremely useful for visualizing structural data.
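
As a minimal sketch (with a random point cloud standing in for, say, atomic coordinates), an interactive 3D scatter plot only takes a few lines:

import numpy as np
import plotly.graph_objects as go

rng = np.random.default_rng(42)
points = rng.normal(size=(200, 3))  # placeholder x, y, z coordinates

fig = go.Figure(
    data=go.Scatter3d(
        x=points[:, 0], y=points[:, 1], z=points[:, 2],
        mode="markers",
        marker=dict(size=3, color=points[:, 2], colorscale="Viridis"),
    )
)
fig.show()  # in a jupyter notebook this renders a plot you can rotate, zoom and hover over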

Continue reading

Calculating symmetrised small molecule RMSDs using graph automorphisms in python with GEMMI and NetworkX

When a ring flips, how do we calculate RMSD?

This surprisingly simple question leads to a very interesting problem! If we take a benzene molecule, say, and rotate it by 180 degrees, we have the exact same molecule. But if our data structure labels the atoms, and we apply the same transformation to the atomic positions, the numbering does not reflect that symmetry. If we were then to calculate the RMSD naively, it would be huge, despite the fact that the molecule is, chemically speaking, identical.

How can we make our RMSD calculations reflect these symmetries?
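
One way to do it (a sketch of the general idea rather than the post’s exact code, with hypothetical inputs; in practice the coordinates and bonds might come from GEMMI) is to treat the molecule as a graph, enumerate its automorphisms with NetworkX, and take the minimum RMSD over all symmetry-equivalent relabellings:

import networkx as nx
import numpy as np


def symmetrised_rmsd(coords_a, coords_b, elements, bonds):
    """Minimum RMSD over all graph automorphisms of the molecule.

    coords_a, coords_b: (n_atoms, 3) coordinate arrays of the two poses (same atom order).
    elements: element symbol for each atom.
    bonds: list of (i, j) atom-index pairs.
    """
    graph = nx.Graph()
    graph.add_nodes_from((i, {"element": e}) for i, e in enumerate(elements))
    graph.add_edges_from(bonds)

    # Match the graph against itself, only allowing atoms of the same element to swap.
    matcher = nx.isomorphism.GraphMatcher(
        graph, graph,
        node_match=nx.isomorphism.categorical_node_match("element", None),
    )

    best = np.inf
    for mapping in matcher.isomorphisms_iter():  # one mapping per automorphism
        perm = [mapping[i] for i in range(len(elements))]
        diff = coords_a - coords_b[perm]
        best = min(best, float(np.sqrt((diff ** 2).sum() / len(elements))))
    return best

For the benzene example above, this enumerates the twelve symmetry-equivalent relabellings of the ring, so the flipped pose no longer gets an inflated RMSD.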

Continue reading