Tag Archives: Python

List comprehension: an elegant Python feature inspired by mathematical set theory

Even though I have now deeply entered into the fascinating world of statistical machine learning and computational chemistry, my original background is very much in pure mathematics. Having spent some of my intellectually formative years in this highly purified and abstract universe, I still love to think in terms of sets, ordered tuples and well-defined functions whenever I have the luxury of being able to do so. This might be why list comprehension is one of my favourite features in Python.

List comprehension allows you to efficiently map a function over a list using elegant notation inspired by mathematical set theory. Let us first consider a (mathematical) set

A := \{1, 3, 7\}.
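For example, the set-builder expression \{x^2 : x \in A\} translates almost directly into Python. A minimal sketch (not taken from the original post):

A = [1, 3, 7]

# The set-builder expression {x**2 : x in A} becomes a list comprehension:
B = [x**2 for x in A]
print(B)  # [1, 9, 49]

# An optional membership condition mirrors {x**2 : x in A, x > 2}:
C = [x**2 for x in A if x > 2]
print(C)  # [9, 49]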

Continue reading

Making your Python tool as easy to install as possible

Have you ever tried to use someone else’s code and spent a whole day trying to install it? Have you ever decided not to use a tool because installing it was a massive pain? Both have happened to me and, to be honest, it is a massive shame. Authors may spend large amounts of time developing these tools, and in the end no one uses them because they can’t get them to work. So I have decided to try to make all the code I develop as easy and painless to install and use as possible.

Continue reading

Out-of-distribution generalisation and scaffold splitting in molecular property prediction

The ability to successfully apply previously acquired knowledge to novel and unfamiliar situations is one of the main hallmarks of successful learning and general intelligence. This capability to effectively generalise is amongst the most desirable properties a prediction model (or a mind, for that matter) can have.

In supervised machine learning, the standard way to evaluate the generalisation power of a prediction model for a given task is to randomly split the whole available data set X into two sets: a training set X_{\text{train}} and a test set X_{\text{test}}. The model is then trained on the examples in the training set X_{\text{train}}, and its prediction abilities are afterwards measured on the untouched examples in the test set X_{\text{test}} via a suitable performance metric.

Since in this scenario the model has never seen any of the examples in X_{\text{test}} during training, its performance on X_{\text{test}} must be indicative of its performance on novel data X_{\text{new}} which it will encounter in the future. Right?
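For concreteness, a random split of this kind might look as follows; this is a minimal sketch using scikit-learn's train_test_split, which is an assumption on my part rather than anything prescribed by the post:

# Randomly hold out 20% of the examples as the test set.
from sklearn.model_selection import train_test_split

X = list(range(100))      # placeholder data set
y = [x % 2 for x in X]    # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)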

Continue reading

Hosting multiple Flask apps using Apache/mod_wsgi

A common way of deploying a Flask web application in a production environment is to use an Apache server with the mod_wsgi module, which allows Apache to host any application that supports Python’s Web Server Gateway Interface (WSGI). This makes it quick and easy to get an application up and running. In this post, we’ll go through configuring your Apache server to host multiple Python apps in a stable manner, including how to run apps in daemon mode and how to avoid hanging processes caused by Python C extensions that do not work well with Python sub-interpreters (I’m looking at you, numpy).
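As a flavour of the setup, mod_wsgi ultimately just needs a module-level WSGI callable to load. A minimal entry-point file might look like this (the paths and package names are purely illustrative, not taken from the post):

# app.wsgi -- loaded by Apache via a WSGIScriptAlias directive; mod_wsgi
# expects this file to expose a callable named "application".
import sys

# Make the application package importable inside the mod_wsgi process.
sys.path.insert(0, "/var/www/myapp")

# The Flask app object itself is a valid WSGI callable.
from myapp import app as application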

Continue reading

C++ Python bindings in 5 minutes

You don’t even need to use CMake!

Most of the time, we can use libraries like numpy (which is largely written in C) to speed up our calculations; this works well when we are dealing with matrices or vectors, but sometimes loops are unavoidable. In those instances, it would be nice if we could use a compiled language such as C++ to remove the bottleneck.

This can be achieved very easily using pybind11, which enables us to export C++ functions and classes as importable Python objects. We can do all of this without CMake, using pybind11’s Pybind11Extension class along with a modified setup.py. Pybind11 can be compiled from source or installed using:

pip install pybind11
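
A minimal setup.py along these lines might look as follows; the module name and the C++ source file are hypothetical, but the pattern uses pybind11's own setup_helpers:

# setup.py -- compiles a hypothetical example.cpp (which would define a
# PYBIND11_MODULE) into an importable Python extension, without CMake.
from setuptools import setup
from pybind11.setup_helpers import Pybind11Extension, build_ext

ext_modules = [
    Pybind11Extension(
        "fastloops",        # import name (hypothetical)
        ["example.cpp"],    # C++ sources (hypothetical)
    ),
]

setup(
    name="fastloops",
    version="0.1",
    ext_modules=ext_modules,
    cmdclass={"build_ext": build_ext},
)
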
Continue reading

GEMMI: A Python Cookbook

General MacroMolecular I/O, or GEMMI, is a C++11 header-only library for low-level macromolecular crystallography.

Because it is header-only, it is certainly the easiest low-level crystallographic C++ library to access and use. What’s more, GEMMI also comes with Python bindings via pybind11, making it arguably the easiest low-level crystallographic library to access and use in Python as well!

What follows is a cookbook of useful Python code that uses GEMMI to accomplish macromolecular crystallographic tasks.
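As a flavour of what such recipes look like, here is a small sketch (not taken from the cookbook itself) that reads a structure and walks its model/chain hierarchy; the file path is a placeholder:

import gemmi

structure = gemmi.read_structure("input.pdb")  # placeholder path
print(structure.name)

# Iterate over the model -> chain -> residue hierarchy and count
# residues per chain.
for model in structure:
    for chain in model:
        n_residues = sum(1 for _residue in chain)
        print(model.name, chain.name, n_residues)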

Continue reading

Should scientists learn C++?

Conventional wisdom dictates that compiled languages are slow to develop, can be slow to compile, but are fast to run. Interpreted languages are easy to use and do not require compilation but have sluggish performance. Like most people in scientific computing, the first two languages I learned were C++ and Python; I use Python every day but when, if ever, would I use C++?

Continue reading

docopt for dummies

Parsing command-line arguments is an annoying piece of boilerplate we all have to write. Documenting our code is either an absolutely essential part of software engineering or a frivolous waste of research time, depending on who you ask. But what if I told you that you can combine the two? That you can handle your argument parsing simply by documenting how your code works? Well, the dream is now reality.

Continue reading
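To give a flavour, a tiny script in this style might look as follows; the script and its options are hypothetical, and only docopt(__doc__) is the library's actual entry point:

"""Greet someone from the command line (hypothetical example).

Usage:
  greet.py <name> [--shout]
  greet.py (-h | --help)

Options:
  -h --help  Show this help message.
  --shout    Print the greeting in upper case.
"""
from docopt import docopt

if __name__ == "__main__":
    # docopt parses sys.argv against the usage pattern in the docstring
    # and returns a dict mapping argument names to values.
    args = docopt(__doc__)
    greeting = "Hello, {}!".format(args["<name>"])
    print(greeting.upper() if args["--shout"] else greeting)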

Mol2vec: Finding Chemical Meaning in 300 Dimensions

[Figure: Embeddings of amino acids. 2D projections (t-SNE) of Mol2vec vectors of amino acids (bold arrows). These vectors were obtained by summing the vectors of the Morgan substructures (small arrows) present in the respective molecules (amino acids in this example). The directions of the vectors provide a visual representation of similarities; the magnitudes reflect importance, i.e. more meaningful words. Figure from Ref. 1]

Natural Language Processing (NLP) algorithms are usually used for analyzing human communication, often in the form of textual information such as scientific papers and Tweets. One aspect of this, learning a representation that clusters words with similar meanings, has been achieved very successfully with the word2vec approach. This involves training a shallow, two-layer artificial neural network on a very large body of words and sentences (the so-called corpus) to generate “embeddings” of the constituent words into a high-dimensional space. By taking the vector from “woman” to “queen” and adding it to the position of “man” in this high-dimensional space, one arrives near “king”, the word that completes the analogy.
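That word arithmetic can be reproduced directly with pretrained vectors; the sketch below uses gensim's downloader, and the choice of model is an assumption rather than anything from the post:

import gensim.downloader as api

# Load a small set of pretrained word vectors (downloads on first use).
vectors = api.load("glove-wiki-gigaword-50")

# queen - woman + man should land near "king".
print(vectors.most_similar(positive=["queen", "man"], negative=["woman"], topn=3))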

A recent publication by one of my former InhibOx colleagues, Simone Fulle, and her co-workers Sabrina Jaeger and Samo Turk shows how molecular substructures and chemical compounds can be embedded into a similarly high-dimensional, continuous vectorial representation, which they dubbed “mol2vec”.1 They also released a Python implementation, available on Samo Turk’s GitHub repository.


Continue reading