Category Archives: Technical

Python Handout

Many OPIGlets extensively use Jupyter (in either Notebook or Lab flavour) to prototype and present their work. However, as projects progress, notebooks are frequently converted into regular Python files for a number of reasons, losing the notebook functionality in the process.

Wouldn’t it be nice if we could combine some of the benefits of Jupyter notebooks (not least the ability to present both code and results naturally) with regular Python files?

Enter Python Handout.

Python Handout was recently released (5th August 2019) by Danijar Hafner. It allows Python scripts to be converted into handouts with Markdown comments and inline figures.

Installation is via pip (pip3 install -U handout), and Python Handout supports Python 3 scripts.
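To give a flavour of the API — a minimal sketch based on the project's README at release, so the exact calls may have changed since — a script might look like this, with triple-quoted strings rendered as Markdown text and figures embedded inline:

    """Triple-quoted strings like this one are rendered as Markdown text."""

    import handout
    import matplotlib.pyplot as plt
    import numpy as np

    # All output (an HTML page plus its assets) is written to this directory.
    doc = handout.Handout('output')

    doc.add_text('A logged value:', 42)

    # Figures appear inline in the handout, as they would in a notebook.
    fig, ax = plt.subplots()
    x = np.linspace(0, 10, 100)
    ax.plot(x, np.sin(x))
    doc.add_figure(fig)

    doc.show()  # generate output/index.html

Running the script with plain python3 then produces a shareable handout alongside the usual console output.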

While I’ve not used Handout much (yet), I will definitely be experimenting more in the coming weeks.

A gentle primer on quantum annealing

If you have done any computational work, you must have spent some time waiting for your program to run. As an undergraduate, I expected computational biology to be all fun and games: idyllic hours passing the time while the computer works hard to deliver results… a far cry from the more typical reality of frenetically staring at the screen, wishing the program would run faster. But there is more — some problems are so intrinsically expensive that, even with access to all the computers on Earth, solving even a slightly non-trivial instance would take more than your lifetime. Some examples are full configuration interaction calculations in quantum chemistry, factorisation of large numbers into primes, optimal planning, and a long, long etcetera. Continue reading

Some useful tools

For my blog post this week, I thought I would share, as the title suggests, a small collection of tools and packages that I found to make my work a bit easier over the last few months (mainly Python-based). I might add to this list as I find new tools that I think deserve a shout-out.

Biopandas

Writing your own parser to read in .pdb files for processing (while being a good exercise to familiarize yourself with the format) is a pain and clutters your code with boilerplate.

Luckily for us, Sebastian Raschka has written a neat package called biopandas [1] which enables quick I/O of .pdb files via the pandas DataFrame class.
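As a quick illustration — a minimal sketch using the documented biopandas API, with 3eiy as an arbitrary example structure — loading a .pdb file gives you ordinary pandas DataFrames to work with:

    from biopandas.pdb import PandasPdb

    # Fetch a structure from the PDB; read_pdb('file.pdb') works for local files.
    ppdb = PandasPdb().fetch_pdb('3eiy')

    # Records are exposed as pandas DataFrames, keyed by record type.
    atoms = ppdb.df['ATOM']
    print(atoms[['atom_name', 'residue_name',
                 'x_coord', 'y_coord', 'z_coord']].head())

    # From here it is standard pandas, e.g. selecting all C-alpha atoms:
    c_alphas = atoms[atoms['atom_name'] == 'CA']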

Continue reading

Making the most of your CPUs when using Python

Over the last decade, single-threaded CPU performance has begun to plateau, whilst the number of logical cores has been increasing exponentially.

Like it or loathe it, for the last few years Python has featured as one of the top ten most popular languages [TIOBE / PYPL]. That being said, however, Python has an issue which makes life harder for anyone wanting to take advantage of this parallelism windfall: the GIL (Global Interpreter Lock). The GIL can be thought of as the conch shell from Lord of the Flies: you have to hold the conch (GIL) for your thread to be executed. With only one conch, no matter how beautifully written and multithreaded your code, only one thread will be executing at any point in time.
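Because only the thread holding the GIL executes Python bytecode, CPU-bound work is usually parallelised with processes rather than threads: each process gets its own interpreter, and therefore its own GIL. A minimal sketch using the standard library's multiprocessing module (the function and workload here are purely illustrative):

    import math
    from multiprocessing import Pool

    def cpu_bound(n):
        # Deliberately CPU-heavy; threads would serialise on the GIL here.
        return sum(math.sqrt(i) for i in range(n))

    if __name__ == '__main__':
        inputs = [5_000_000] * 8
        with Pool() as pool:                        # one worker per logical core by default
            results = pool.map(cpu_bound, inputs)   # each worker has its own GIL
        print(results)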

Continue reading

Graph-based Methods for Cheminformatics

In cheminformatics, there are many possible ways to encode chemical data represented by small molecules and proteins, such as SMILES, fingerprints, chemical descriptors, etc. Recently, graph-based methods for machine learning have become more prominent. In this post, we will explore why representing molecules as graphs is a natural and suitable encoding.
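As a taste of the graph view — a minimal sketch of my own using RDKit, not code from the post itself — a SMILES string maps directly onto a graph whose nodes are atoms and whose edges are bonds:

    from rdkit import Chem

    mol = Chem.MolFromSmiles('c1ccccc1O')  # phenol

    # Nodes: one per atom, here labelled by element symbol.
    nodes = [atom.GetSymbol() for atom in mol.GetAtoms()]

    # Edges: one per bond, as (begin atom, end atom, bond order) triples.
    edges = [(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(),
              bond.GetBondTypeAsDouble()) for bond in mol.GetBonds()]

    print(nodes)  # ['C', 'C', 'C', 'C', 'C', 'C', 'O']
    print(edges)

Continue reading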

Automated testing with doctest

One of the ways to make your code more robust to unexpected input is to develop with boundary cases in mind. Test-driven development begins with writing a set of unit tests for each class. These tests often include normal and extreme use cases. Thanks to packages like doctest for Python, or Mocha and Jasmine for JavaScript, we can write and run tests in an easy format. In this blog post, I will present a short example of how to get started with doctest in Python. N.B. doctest is best suited for small tests over a few scripts. If you would like to run system tests, look for some other packages!
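As a minimal sketch of the idea (my own toy example, not the one from the post), the tests live in the docstring itself, written as interactive-session snippets:

    def is_palindrome(word):
        """Return True if word reads the same forwards and backwards.

        >>> is_palindrome('level')
        True
        >>> is_palindrome('python')
        False
        >>> is_palindrome('')
        True
        """
        return word == word[::-1]

    if __name__ == '__main__':
        import doctest
        doctest.testmod()

Running the file directly (or via python -m doctest file.py -v) executes every >>> example and reports any output that does not match.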

Continue reading

So, you are interested in compound selectivity and machine learning papers?

At the last OPIG meeting, I gave a talk about compound selectivity and machine learning approaches to predicting whether a compound might be selective. As promised, I hereby provide a list of publications I would hand to a beginner in the field of compound selectivity and machine learning. Continue reading

Maps are useful. But first, you need to build, store and read a map.

Recently we embarked on a project that required the storage of a relatively big dictionary with 10M+ key-value pairs. Unsurprisingly, Python took over two hours to build such a dictionary, taking into account all the time spent extending, accessing and writing to the dictionary, AND it eventually crashed. So I turned to C++ for help.

In C++, map is one of the containers you can use to store string keys with integer values. Since we are concerned about data storage and access, I compared map and unordered_map.

An unordered_map stores a hash table of the keys along with the mapped values, while a map is ordered. The important considerations here include:
  • Memory: map does not keep a hash table and is therefore smaller than an unordered_map.
  • Access: accessing an unordered_map takes O(1) on average, while accessing a map takes O(log n).
I eventually chose to go with map, because it is more memory-efficient given the small RAM size I have access to. However, each object still takes up about 8GB of RAM during its runtime (and I have 1800 objects to run through, each building a different dictionary). Saving these opened another can of worms.
In Python, we could easily use pickle or JSON to serialise the dictionary. In C++, it’s common to use the Boost library, which offers two kinds of archive: text and binary. Text archives are human-readable, but since I am not really going to open and read 10M+ lines of key-value pairs, I opted for binary archives, which are machine-readable and smaller. (Read more: https://stackoverflow.com/questions/1058051/boost-serialization-performance-text-vs-binary-format .)
To further compress the files when saving the maps, I used zlib compression. Helpfully, there is ready-to-use code out there for combining Boost serialisation with zlib, which saved me some debugging.
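For illustration, here is a minimal sketch of that pattern — my own reconstruction rather than the exact code from the post: a std::map serialised to a zlib-compressed binary archive and read back (link against boost_serialization, boost_iostreams and zlib):

    #include <fstream>
    #include <map>
    #include <string>
    #include <boost/archive/binary_iarchive.hpp>
    #include <boost/archive/binary_oarchive.hpp>
    #include <boost/iostreams/filter/zlib.hpp>
    #include <boost/iostreams/filtering_stream.hpp>
    #include <boost/serialization/map.hpp>
    #include <boost/serialization/string.hpp>

    namespace io = boost::iostreams;

    // Save the map as a zlib-compressed binary Boost archive.
    void save(const std::map<std::string, int>& m, const std::string& path) {
        std::ofstream ofs(path, std::ios::binary);
        io::filtering_ostream out;
        out.push(io::zlib_compressor());
        out.push(ofs);
        boost::archive::binary_oarchive oa(out);
        oa << m;
    }   // archive and filters flush and close here

    // Read the map back from disk.
    std::map<std::string, int> load(const std::string& path) {
        std::ifstream ifs(path, std::ios::binary);
        io::filtering_istream in;
        in.push(io::zlib_decompressor());
        in.push(ifs);
        boost::archive::binary_iarchive ia(in);
        std::map<std::string, int> m;
        ia >> m;
        return m;
    }

    int main() {
        save({{"AAAGT", 42}}, "dict.bin");  // illustrative key-value pair
        return load("dict.bin").at("AAAGT") == 42 ? 0 : 1;
    }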
Ultimately the output came down to 96GB across the 1800 files, all done within 6 hours.