Automated intermolecular interaction detection using the ODDT Python Module

Detecting intermolecular interactions is often one of the first steps when assessing the binding mode of a ligand. This usually involves a researcher opening a molecular viewer and checking the orientations of the ligand and protein functional groups, sometimes aided by the viewer’s own interaction-detection functionality. For single-digit numbers of structures, this approach works fairly well, especially as more experienced researchers can spot cases where the automated interaction detection has failed. When analysing tens or hundreds of binding sites, however, an automated way of detecting and recording interaction information for downstream processing is needed. When I had to do this recently, I used an open-source Python module called ODDT (Open Drug Discovery Toolkit; its full documentation can be found here).

My use case was fairly standard: starting with a list of holo protein structures as .pdb files and their corresponding ligands in .sdf format, I wanted to detect any hydrogen bonds between a ligand and its native protein crystal structure. Specifically, I needed the number and name of the interacting residue, its chain ID, and the name of the protein atom involved in the interaction. A general example of how to do this can be found in the ODDT documentation. Below, I show how I have used the code on PDB structure 1a9u.

Continue reading →

The Smallest Allosteric System

Allostery is still a poorly understood but very general mechanism in the protein world. In principle, an allosteric event occurs when a ligand (small or big) binds to a certain site of a protein and something (activity or function) changes at a different, distant site. A well-known example is G-protein-coupled receptors, which transmit such an allosteric signal even across a membrane. But the two sites do not have to be that far apart. As part of the Protein Folding and Dynamics series, I recently watched a talk by Peter Hamm (Zurich), who presented work on an allosteric system that I thought was very interesting because it was small and, most importantly, controllable.

PDZ domains are peptide-binding domains, often part of multi-domain proteins. For the work presented, the researchers used the PDZ3 domain, which is a bit special: it has an additional (third) C-terminal α-helix (α3-helix) that packs against the side of the domain opposite the binding pocket. Previous work (Petit et al. 2009) had shown that removal of the α3-helix changed ligand affinity but not PDZ structure; the major changes were of an entropic nature instead. Peter Hamm’s group linked an azobenzene-derived photoswitch to that α3-helix. In its cis configuration the photoswitch stabilises the α3-helix, and in its trans configuration it destabilises it (see Figure 1).

Figure 1: PDZ3 domain (purple) and photoswitch (red) have different affinities for the peptide ligand (green), depending on the photoswitch’s isomerisation state (and temperature). From Bozovic, O., Jankovic, B. & Hamm, P. Sensing the allosteric force. *Nat Commun* **11,** 5841 (2020). https://doi.org/10.1038/s41467-020-19689-7

Continue reading →

How do I do regression when my predictors have multicollinearity?

A quick summary of the key idea of principal components regression (PCR), its advantages and extensions.

Sometimes we find ourselves in a dire situation. We have measured some response y and a set of predictors W. Unfortunately, W is a wide but short matrix, say 10×100 or, worse, 10×100000. We’ve made only 10 observations. Standard regression is simply not going to work, because W has more columns than rows, so WᵀW is singular and the least-squares solution is not unique. Some would say p is bigger than n.
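To make the problem concrete, here is a minimal sketch in NumPy, not the post's own code. It builds a random 10×100 matrix W, checks that WᵀW is rank-deficient, and then fits a principal components regression on the first few components. The function name `pcr_fit`, the choice of `k=3` and the simulated data are all assumptions made purely for illustration.

```python
import numpy as np


def pcr_fit(W, y, k):
    """Principal components regression keeping the first k components.

    Hypothetical helper for illustration; returns coefficients on the
    original predictors plus an intercept.
    """
    w_mean = W.mean(axis=0)
    y_mean = y.mean()
    Wc = W - w_mean  # centre the predictors

    # SVD of the centred predictors: rows of Vt are principal directions.
    _, _, Vt = np.linalg.svd(Wc, full_matrices=False)
    V_k = Vt[:k].T   # (p, k) loadings of the first k components
    Z = Wc @ V_k     # (n, k) component scores

    # Ordinary least squares is well posed on the k scores (k < n).
    gamma, *_ = np.linalg.lstsq(Z, y - y_mean, rcond=None)

    beta = V_k @ gamma  # map back to the original p predictors
    intercept = y_mean - w_mean @ beta
    return beta, intercept


# Simulated data (made up): n = 10 observations, p = 100 predictors.
rng = np.random.default_rng(0)
n, p = 10, 100
W = rng.normal(size=(n, p))
y = W[:, 0] + 0.1 * rng.normal(size=n)

# W'W is p x p but its rank is at most n in exact arithmetic, so it
# cannot be inverted.
gram_rank = np.linalg.matrix_rank(W.T @ W)

beta, intercept = pcr_fit(W, y, k=3)
y_hat = W @ beta + intercept
```

The design choice here is to regress on a handful of orthogonal score columns instead of on all p predictors, which turns an ill-posed problem into an ordinary low-dimensional one. How many components to keep is a tuning decision that this sketch does not address.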

So what can we do? Many of us would jump to LASSO or ridge regression. However, there is another way that is often overlooked.

Continue reading →

Safety and sexism: the heroic stubbornness of Frances Oldham Kelsey

With covid-19 vaccine rollouts well underway the world over, the subject of clinical trials has been a focal point of discussion lately. Of course clinical trials are applicable to every drug, not just vaccines, and the class of molecules on which my own work focuses includes perhaps one of the most famous case studies of why clinical trials are necessary: thalidomide.

The teratogenic effects in unborn infants of this seemingly innocuous small molecule are well documented and infamous. But at the time of its initial use as a treatment for morning sickness in the mid-twentieth century, little was known about its mechanism of action. Only within the last 20 years has the molecular glue-type nature of thalidomide and its analogues (collectively known as immunomodulatory imide drugs, or IMiDs) become apparent. Armed with this knowledge, we now understand not only how thalidomide works in useful situations (such as treating cancer), but also how it exhibits its less desirable effects (recruiting SALL4 to the E3 ligase cereblon, leading to SALL4’s degradation and subsequent embryogenesis havoc).

Continue reading →

Hosting multiple Flask apps using Apache/mod_wsgi

A common way of deploying a Flask web application in a production environment is to use an Apache server with the mod_wsgi module, which allows Apache to host any application that supports Python’s Web Server Gateway Interface (WSGI), making it quick and easy to get an application up and running. In this post, we’ll go through configuring your Apache server to host multiple Python apps in a stable manner, including how to run apps in daemon mode and how to avoid hanging processes caused by Python C extensions not working well with Python sub-interpreters (I’m looking at you, numpy).

Continue reading →

Hidden Markov Models in Python: A simple Hidden Markov Model with Known Emission Matrix fitted with hmmlearn

The Hidden Markov Model

Consider a sensor which tells you whether it is cloudy or clear, but is wrong with some probability. Now, the weather *is* cloudy or clear, and we could go and see which it was, so there is a “true” state. However, we only have noisy observations from which to infer it.

We might model this process (with the assumption of sufficiently precious weather), and attempt to make inferences about the true state of the weather over time, the rate of change of the weather and how noisy our sensor is by using a Hidden Markov Model.

The Hidden Markov Model describes a hidden Markov Chain which at each step emits an observation with a probability that depends on the current state. In general both the hidden state and the observations may be discrete or continuous.

But for simplicity’s sake let’s consider the case where both the hidden and observed spaces are discrete. Then, the Hidden Markov Model is parameterised by two matrices:
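For a discrete model like this, the two matrices are conventionally a transition matrix for the hidden chain and an emission matrix for the observations. The sketch below simulates the cloudy/clear sensor under that assumption, using plain NumPy and not hmmlearn. Every probability value, the initial distribution `start` and the helper name `simulate_hmm` are made up for illustration.

```python
import numpy as np

# Hidden states and observations share the coding 0 = clear, 1 = cloudy.
# All probabilities below are illustrative, made-up values.

# transition[i, j] = P(next state = j | current state = i)
transition = np.array([[0.9, 0.1],
                       [0.2, 0.8]])

# emission[i, k] = P(sensor reports k | true state = i)
emission = np.array([[0.85, 0.15],
                     [0.15, 0.85]])

# Initial state distribution (an extra assumption needed to start the chain).
start = np.array([0.5, 0.5])


def simulate_hmm(n_steps, rng):
    """Sample a hidden state path and the noisy observations it emits."""
    states = np.empty(n_steps, dtype=int)
    observations = np.empty(n_steps, dtype=int)

    states[0] = rng.choice(2, p=start)
    observations[0] = rng.choice(2, p=emission[states[0]])

    for t in range(1, n_steps):
        # The hidden chain depends only on the previous hidden state.
        states[t] = rng.choice(2, p=transition[states[t - 1]])
        # The observation depends only on the current hidden state.
        observations[t] = rng.choice(2, p=emission[states[t]])

    return states, observations


states, observations = simulate_hmm(200, np.random.default_rng(0))
```

Comparing `states` with `observations` shows where the simulated sensor misreports the weather, which is exactly the noise that fitting a Hidden Markov Model tries to see through.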

Continue reading →

CAML: Courses in Applied Machine Learning

*Shameless self-promotion klaxon!! Have a look at my new website!*

I’m excited to share a project I’ve been working on for the past few months! One of the biggest challenges of working on an interdisciplinary research project is getting to grips with the core principles of the disciplines which you don’t have much formal training in. For me, that means learning the basics of Medicinal Chemistry and Structural Biology so that when someone mentions pi-stacking I don’t think they’re talking about the logistics of managing a bakery; for people coming from Bio/Chem backgrounds it can mean understanding the Maths and Statistics necessary to make sense of the different algorithms which are central to their work.

Continue reading →

Can few-shot language models perform bioinformatics tasks?

In 2019, I tried my hand at using large language models, specifically GPT-2, for text generation. In that blogpost, I used Hansard files to fine-tune the public release of GPT-2 to generate speeches by several speakers in the House of Commons (link).

In 2020, OpenAI released GPT-3, their new and improved text generation model (paper), which uses a whopping 175 billion parameters (as opposed to its predecessor’s 1.5 billion). It not only proved capable of state-of-the-art performance on common text prediction benchmarks, but also generated a considerable amount of interest in the news media:

Continue reading →

Code that I am grateful for

To address some of the karmic imbalance created by computational scientists complaining about other people’s code, I am listing here some (not all) of other people’s code that I love.

IgBLAST

IgBLAST is a sequence alignment tool for immunoglobulin sequences implemented in the NCBI C++ toolkit – it applies the classic BLAST algorithm to searching immunoglobulin germline gene databases. It always impresses me how quickly it works. The paper is here, and the authors are Jian Ye, Ning Ma, Thomas L. Madden and James M. Ostell.

Continue reading →

Do antibodies care about sex?

In a recent OPIG antibody meeting, the topic of immune system differences between men and women came up. I thought this was cool and something I hadn’t read about, so what a brilliant topic for a blog post. This post is a high-level overview; I’ve listed the papers I’ve used at the bottom of this post, so please consult them for more details!

Differences between males and females can lead to pretty big disparities in disease prevalence and outcomes. For example, non-reproductive cancers occur predominantly in males, whilst the majority of autoimmune disease occurs in females. Many factors may be impacting this, including environmental, genetic and hormonal influences, and much more research is required to fully understand these processes. Here I focus on sex-based biology, rather than gender, though both can influence the immune response.

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends
