Command-Line Interfaces (CLIs), argparse.ArgumentParser and some of my tricks.

Command-Line Interfaces (CLIs) are one of the best ways of providing your programs with useful parameters to customize their execution. If you are not familiar with CLI, in this blog post we will introduce them. Let’s say that you have a program that reads a file, computes something, and then, writes the results into another file. The simplest way of providing those arguments would be:

$ python mycode.py my/inputFile my/outputFile

### mycode.py ###
def doSomething(inputFilename):
    with open(inputFilename) as f:
        return len(f.readlines())

if __name__ == "__main__":
    #Notice that the order of the arguments is important
    inputFilename = sys.argv[1]
    outputFilename = sys.argv[2]

    with open(outputFilename, "w") as f:
        f.write( doSomething(inputFilename))

Continue reading →

Multiple Testing: What is it, why is it bad and how can we avoid it?

P-values play a central role in the analysis of many scientific experiments. But, in 2015, the editors of the Journal of Basic and Applied Social Psychology prohibited the usage of p-values in their journal. The primary reason for the ban was the proliferation of results obtained by so-called ‘p-hacking’, where a researcher tests a range of different hypotheses and publishes the ones which attain statistical significance while discarding the others. In this blog post, we’ll show how this can lead to spurious results and discuss a few things you can do to avoid engaging in this nefarious practice.

The Basics: What IS a p-value?

Under a Hypothesis Testing framework, a p-value associated with a dataset is defined as the probability of observing a result that is at least as extreme as the observed one, assuming that the null hypothesis is true. If the probability of observing such an event is extremely small, we conclude that it is unlikely the null hypothesis is true and reject it.

But therein lies the problem. Just because the probability of something is small, that doesn’t make it impossible. Using the standard significance test threshold of 0.05, even if the null hypothesis is true, there is a 5% chance of obtaining a p-value below the significance threshold and therefore rejecting it. Such false positives are an inescapable part of research; there’s always a possibility that the subset you were working with isn’t representative of the global data and sometimes we take the wrong decision even though we analysed the data in a perfectly rigorous fashion.

Continue reading →

Unraveling the role of entanglement in protein misfolding

Proteins that fail to fold correctly may populate misfolded conformations with disparate structure and function. Misfolding is the focus of intense research interest due to its putative and confirmed role in various diseases, including neurodegenerative diseases such as Parkinson’s and Alzheimer’s Diseases as well as cystic fibrosis (PMID: 16689923).

Many open questions about protein misfolding remain to be answered. For example, how do misfolded proteins evade cellular quality control mechanisms like chaperones to remain soluble but non-functional for long timescales? How long do misfolded states persist on average? How widespread is misfolding? Experiments indicate that misfolding can even be caused by synonymous mutations that alter the speed of protein translation but not the sequence of the protein produced (PMID: 23417067), introducing the additional puzzle of how the protein maintains a “memory” of its translation kinetics after synthesis is complete.

A series of four recent preprints (Preprints 1, 2, 3, and 4, see below) suggests that these questions can be answered by the partitioning of proteins into long-lived self-entangled conformations that are structurally similar to the native state but with perturbed function. Simulation of the synthesis, termination, and post-translational dynamics of a large dataset of E. coli proteins suggests that misfolding and entanglement are widespread, with two thirds of proteins misfolding some of the time (Preprint 1). Many misfolded conformations may bypass proteostasis machinery to remain soluble but non-functional due to their structural similarity to the native state. Critically, entanglement is associated with particularly long-lived misfolded states based on simulated folding kinetics.

Coarse-grain and all-atom simulation results indicate that these misfolded conformations interact with chaperones like GroEL and HtpG to a similar extent as does the native state (Preprint 2). These results suggest an explanation for why some protein always fails to refold while remaining soluble, even in the presence of multiple folding chaperones – it remains trapped in entangled conformations that resemble the native state and thus fail to recruit chaperones.

Finally, simulations indicate that changes to the translation kinetics of oligoribonuclease introduced by synonymous mutations cause a large change in its probability of entanglement at the dimerization interface (Preprint 3). These entanglements localized at the interface alter its ability to dimerize even after synthesis is complete. These simulations provide a structural explanation for how translation kinetics can have a long-timescale influence on protein behavior.

Together, these preprints suggest that misfolding into entangled conformations is a widespread phenomenon that may provide a consistent explanation for many unanswered question in molecular biology. It should be noted that entanglement is not exclusive to other types of misfolding, such as domain swapping, that may contribute to misfolding in cells. Experimental validation of the existence of entangled conformations is a critical aspect of testing this hypothesis; for comparisons between simulation and experiment, see Preprint 4.

Preprint 1: https://www.biorxiv.org/content/10.1101/2021.08.18.456613v1

Preprint 2: https://www.biorxiv.org/content/10.1101/2021.08.18.456736v1

Preprint 3: https://www.biorxiv.org/content/10.1101/2021.10.26.465867v1

Preprint 4: https://www.biorxiv.org/content/10.1101/2021.08.18.456802v1

How to interact with small molecules in Jupyter Notebooks

The combination of Python and the cheminformatics toolkit RDKit has opened up so many ways to explore chemistry on a computer. Jupyter — named for the three languages, Julia, Python, and R — ties interactivity and visualization together, creating wonderful environments (Notebooks and JupyterLab) to carry out, share and reproduce research, including:

“data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.”
—https://jupyter.org

At this year’s annual RDKit UGM (User Group Meeting), Cédric Bouysset shared a tutorial explaining how to create a grid of molecules that you can interact with, using his “mols2grid“:

Continue reading →

List comprehension: an elegant Python feature inspired by mathematical set theory

Even though I have now deeply entered into the fascinating world of statistical machine learning and computational chemistry, my original background is very much in pure mathematics. Having spent some of my intellectually formative years in this highly purified and abstract universe, I still love to think in terms of sets, ordered tuples and well-defined functions whenever I have the luxury of being able to do so. This might be why list comprehension is one of my favourite features in Python.

List comprehension allows you to efficiently map a function over a list using elegant notation inspired by mathematical set theory. Let us first consider a (mathematical) set

$A := \{1, 3, 7 \}.$

Continue reading →

Watch out when using PDBbind!

Now that PDBbind 2020 has been released, I want to draw some attention to an issue with using the SDF files that are supplied in the PDBbind refined set 2020.

Normally, SDF files save the chirality information of compounds in the atom block of the file which is shown belowas a snipped of the full sdf file for the ligand of PDB entry 4qsv. The column that defines chirality is marked in red.

As you can see, all columns shown here are 0. The SDF files supplied by PDBbind for some reason do NOT encode chirality information explicitly. This will be a problem when using RDKit to read the molecule and transform it into a smiles string. By using the following commands to read the ligand for 4qsv from PDBBind 2020 and write a SMILES string, we get:

Continue reading →

Chained or Unchained: Markov, Nekrasov and Free Will

A Markov Chain moving between two states A and B. Animation by Devin Soni

Markov chains are simple probabilistic models which model sequences of related events through time. In a Markov chain, events at the present time depend on the previous event in the sequence. The example above shows a model of a dynamical system with two states A and B and the events are either moving between states A and B, or staying put.

More formally, a Markov chain is a model of any sequence of events with the following relationship

$P(X_t=x|X_{t-1}=x_{t-1},X_{t-2}=x_{t-2},..,X_1=x_1)=P(X_t|X_{t-1})$ .

That is, the event that the sequence $\{X_t\}_{t}$ is in state $x$ at time $t$ is conditionally independent of all of its past states given its immediate past. This simple relationship between past and present provides a useful simplifying assumption to model, to a surprising degree of accuracy, many real world systems. These range from air particles diffusing through a room, to the migration patterns of insects, to the evolution of your genome, and even your web browser activity. Given their broad use in describing natural phenomena, it is very curious that Markov first invented the Markov chain to settle a dispute in Mathematical Theology, one in which the atheist Markov was pitted against the devoutly Orthodox Pavel Nekrasov.

Continue reading →

Why all academics should be on TikTok

Recently I have had the opportunity to get a closer look at the submission, review and promotion cycle for a typical academic paper. It was a great learning experience and led to an increase in the number and of research papers, news articles, and reviews I read in preparation. However, on multiple occasions, I did think “I wish I could watch a 2 min video to explain this”. That got me thinking, why couldn’t I and should I be able to?

Continue reading →

Being Brief.

This is a blog post about using fewer words.

Continue reading →

Using Singularity on Windows with WSL2

Previously on this blog, my colleagues Carlos and Eoin have extolled the many virtues of Singularity, which I will not repeat here. Instead, I’d like to talk about a rather interesting subject that was unexpectedly thrust upon me when my faithful Linux laptop started to show the early warning signs of critical existence failure: is there a good way to run a Singularity container on a pure Windows machine? It turns out that, with version 2 of the Windows Subsystem for Linux (WSL), there is.

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends

Command-Line Interfaces (CLIs), argparse.ArgumentParser and some of my tricks.

Multiple Testing: What is it, why is it bad and how can we avoid it?

The Basics: What IS a p-value?

Unraveling the role of entanglement in protein misfolding

How to interact with small molecules in Jupyter Notebooks

List comprehension: an elegant Python feature inspired by mathematical set theory

Watch out when using PDBbind!

Chained or Unchained: Markov, Nekrasov and Free Will

Why all academics should be on TikTok

Being Brief.

Using Singularity on Windows with WSL2