Histograms are frequently used to visualize the distribution of a data set or to compare multiple distributions. Python, via matplotlib.pyplot, provides convenient functions for plotting histograms; the default plots it generates, however, leave much to be desired in terms of visual appeal and clarity.
The two code blocks below generate histograms of two normally distributed data sets. The first uses the default matplotlib.pyplot.hist settings; in the second, I add some lines to improve the data presentation. See the comments for what each individual line does.
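As a minimal sketch of the kind of comparison described above (not the post's original code), the following plots two synthetic normal samples side by side, once with the defaults and once with a few adjustments. The sample parameters, the function names `plot_default` and `plot_improved`, and the specific styling choices (shared bin edges, transparency, density scaling, a legend) are assumptions picked for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Two synthetic, normally distributed samples (parameters chosen for illustration).
rng = np.random.default_rng(0)
set_a = rng.normal(loc=0.0, scale=1.0, size=1000)
set_b = rng.normal(loc=1.5, scale=1.0, size=1000)


def plot_default(ax, a, b):
    """Plot both samples with the default hist settings."""
    ax.hist(a)
    ax.hist(b)
    ax.set_title("Default settings")


def plot_improved(ax, a, b, n_bins=30):
    """Plot both samples with a few presentation tweaks."""
    # Use the same bin edges for both samples so the bars are comparable.
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    bins = np.linspace(lo, hi, n_bins + 1)
    # alpha makes the overlap visible; density=True normalises each histogram
    # so that samples of different sizes can be compared.
    ax.hist(a, bins=bins, alpha=0.5, density=True, edgecolor="black", label="Set A")
    ax.hist(b, bins=bins, alpha=0.5, density=True, edgecolor="black", label="Set B")
    # Label the axes and identify the two samples.
    ax.set_xlabel("Value")
    ax.set_ylabel("Density")
    ax.legend(frameon=False)
    # Remove the top and right spines to reduce visual clutter.
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)
    ax.set_title("Adjusted settings")
    return bins


fig, (ax_default, ax_improved) = plt.subplots(1, 2, figsize=(10, 4))
plot_default(ax_default, set_a, set_b)
bins = plot_improved(ax_improved, set_a, set_b)
fig.tight_layout()
```

Call `plt.show()` or `fig.savefig(...)` afterwards to view or save the figure. Sharing the bin edges between the two samples is probably the single most useful change here, since the defaults bin each sample independently.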
The ultimate modulefile for conda
Environment Modules is a great tool for high-performance computing: it is a modular system for quickly and painlessly enabling preset configurations of environment variables. For example, a user may be provided with a modulefile for an antiquated version of a tool and another for a bleeding-edge alpha version of that same tool, and can easily load whichever they wish. In many clusters the modules are created with a tool called EasyBuild, which delivers an out-of-the-box installation. This works for things like a single binary, but it falls severely short for conda, which needs many configuration changes.
On The Logic of GOing with Weisfeiler-Lehman
Recently, I was able to attend Martin Grohe’s talk on The Logic of Graph Neural Networks. Professor Grohe, of RWTH Aachen University, is a titan of the fields of logic and complexity theory. Even so, he is modest about his achievements, and I was tickled when it was pointed out to me that the theorem he refers to as “a little complex”, one of his crowning achievements, has a proof that fills a four-hundred-page book.
The theorem relates to the Weisfeiler-Lehman (WL) algorithm, an algorithm for determining whether two graphs are equivalent (i.e. isomorphic). The algorithm has deep connections with combinatorics, complexity theory and first-order logic, a system of logic that is remarkably similar to the relations present in ontologies such as the Gene Ontology (GO), which is commonly used to compare and predict protein function. Kernelised methods and other WL-based metrics offer a new and possibly logically “complete” way to compare the functions of proteins and infer their similarity.
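To make the idea concrete, here is a minimal sketch of the simplest variant, 1-dimensional WL (colour refinement), assuming graphs are given as adjacency dictionaries. The function names `refine` and `wl_distinguishes` are hypothetical, chosen for this illustration. Note that this variant is only a one-sided test: if the colour histograms differ, the graphs are certainly not isomorphic, but identical histograms do not prove isomorphism.

```python
from collections import Counter


def refine(adjacency, colours):
    """One refinement step: pair each node's colour with the sorted
    multiset of its neighbours' colours."""
    return {
        node: (colours[node], tuple(sorted(colours[n] for n in adjacency[node])))
        for node in adjacency
    }


def wl_distinguishes(g1, g2, rounds=None):
    """Return True if 1-WL colour refinement tells g1 and g2 apart.

    A False result is inconclusive: the graphs may or may not be isomorphic.
    Graphs are dicts mapping each node to a list of its neighbours.
    """
    if len(g1) != len(g2):
        return True
    rounds = len(g1) if rounds is None else rounds
    # Start with every node in the same colour class.
    c1 = {v: 0 for v in g1}
    c2 = {v: 0 for v in g2}
    for _ in range(rounds):
        s1, s2 = refine(g1, c1), refine(g2, c2)
        # Compress signatures to small integers using a palette shared by
        # both graphs, so that colours stay comparable between them.
        palette = {
            sig: i
            for i, sig in enumerate(sorted(set(s1.values()) | set(s2.values())))
        }
        c1 = {v: palette[s] for v, s in s1.items()}
        c2 = {v: palette[s] for v, s in s2.items()}
        # Differing colour histograms prove the graphs are not isomorphic.
        if Counter(c1.values()) != Counter(c2.values()):
            return True
    return False


# A path and a star on four nodes have different degree sequences.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

# A 6-cycle and two disjoint triangles are both 2-regular, a classic case
# that 1-WL cannot separate even though they are not isomorphic.
six_cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}

path_vs_star = wl_distinguishes(path, star)
cycle_vs_triangles = wl_distinguishes(six_cycle, two_triangles)
```

Here `path_vs_star` is `True`, while `cycle_vs_triangles` is `False`: the second pair shows why a failure to distinguish cannot be read as a proof of isomorphism.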

COSTNET19 Conference
Last month, I attended the COSTNET19 Conference in Bilbao (Spain). This conference is organised by COSTNET, a COST Action which aims to foster international European collaboration on the emerging field of statistics of network data science. COSTNET facilitates interaction and collaboration between diverse groups of statistical network modellers, establishing a large and vibrant interconnected and inclusive community of network scientists.
Why you should care about startups as a researcher
I was recently awarded the EIT Health Translational Fellowship, which aims to fund DPhil projects with the goal of commercializing the research and addressing the funding gap between research and seed funding. In order to win, I had to deliver a short 5-minute startup pitch in front of a panel of investors and scientific experts to convince them that my DPhil project has impact as well as commercial viability. Besides the £5000 prize, the fellowship included a week-long training course on how to improve your pitch, address pain points in your business strategy, and so on. I found the whole experience incredibly rewarding and the skills I picked up very important, even as a researcher. In summary, this is why I think you should care about the startup world as a researcher.
Some more Python tips and tricks
There are a few useful but often underutilised Python 3 syntactic tricks that I have picked up over the last few years; I have chosen to continue in the spirit of Mark and share them here.
A new way of eating too much
Fresh off the pages of Therapeutic Advances in Endocrinology and Metabolism comes a warning no self-respecting sweet tooth should ignore.
“Liquorice is not just a candy,” write a team of ten from Chicago. “Life-threatening complications can occur with excess use.” Hold on to your teabags. Liquorice – the Marmite of sweets – is about to become a lot more sinister.
Two Tools for Systematically Compiling Ensembles of Protein Structures
In order to know how a protein works, we generally want to know its 3-dimensional structure. We can then either try to solve it ourselves (which requires considerable time, skill, and resources), or look for it in the Protein Data Bank, in case it has already been solved. The vast majority of structures in the Protein Data Bank (PDB) are solved through protein crystallography, and represent a “snapshot” of the conformational space available to our protein of interest.
Which fragment first?
Crystallographic fragment-based lead discovery is now a routine technique, which can sample thousands of compounds per week. But how do we identify the most appropriate compounds to screen against our target of interest?
AIRR community meeting
Hi everyone,
Today is the day for another blog post from me. Last month I attended an AIRR conference in Genoa, Italy (https://www.antibodysociety.org/airrc/meetings/communityiv/). It was the fourth AIRR conference, and it was nice to see lots of field-leading people participating. Compared to the last AIRR meeting almost 2 years ago, the agenda of the conference was dominated by machine learning and big data topics. In this short blog post, I will discuss two talks that covered these two exciting topics.