Do not forget to add your data folder to .gitignore

It is good practice not to commit a data folder to version control if the data is available elsewhere and you do not want to track changes of the data. But do not forget to also add an entry for this folder to .gitignore because otherwise git iterates over all the files in the folder when checking for file changes, which may take a long time if there are many files.

Continue reading

Tanimoto similarity of ECFPs with RDKit: Common pitfalls

A common measure for the similarity of two molecules is the Tanimoto similarity of their ECFPs (Extended Connectivity FingerPrint). However, there is no clear standard in literature for what kind of ECFPs should be used when calculating the Tanimoto similarity, and that choice can lead to substantially different results. In this post I wish to shed light on some results you should know about before you jump into your calculations.

A blog post on how ECFPs are generated was written by Marcus Dablander in 2022 so please take a look at that. In short, ECFPs have a hyperparameter called the radius r, and sometimes a fingerprint length L. Each entry in the fingerprint indicates the presence or absence of a particular substructure in the molecule of interest, and the radius r defines how large the substructures that you consider are. If you have r=3 then you consider substructures made by going up to three hops away from each atom in your molecule. This is best explained by this figure from Marcus’ post:

Continue reading

I really hope my compounds get the green light

As a cheminformatician in a drug discovery campaign or an algorithm developer making the perfect Figure 1, when one generates a list of compounds for a given target there is a deep desire that the compounds are well received by the reviewer, be it a med chemist on the team or a peer reviewer. This is despite scientific rigour and training and is due to the time invested. So to avoid the slightest shadow of med chem grey zone, here is a hopefully handy filter against common medchem grey-zone groups.

Continue reading

Roche Continents 2024

This July I had the opportunity to be part of the Roche Continents programme [1]. The programme was organised by Roche and LUMA Arles and took place in the beautiful city of Arles in the south of France. Together with 40 students from various disciplines and European universities we discussed and explored the connection between arts, science, and sustainability. The theme of the week was resourcefulness.  

For students considering applying to Roche Continents next year, I’d like to offer some insights on what to expect, as well as share a few of my personal highlights from the experience. 

Continue reading

The wider applications of nanobodies

This week, it was my turn to give the short talk at our group meeting. I chose to present a recently published paper on thermostability prediction for nanobodies. The motivation for this work, at least in part, is the need for thermostability in the diverse applications of nanobodies. At OPIG, our research primarily revolves around the therapeutic uses of nanobodies, but their potential extends beyond this. I thought it would be interesting to highlight some of these broader applications here:

Continue reading

Making your code pip installable

aka when to use a CutomBuildCommand or a CustomInstallCommand when building python packages with setup.py

Bioinformatics software is complicated, and often a little bit messy. Recently I found myself wading through a python package building quagmire and thought I could share something I learnt about when to use a custom build command and when to use a custom install command. I have also provided some information about how to copy executables to your package installation bin. **ChatGPT wrote the initial skeleton draft of this post, and I have corrected and edited.

Next time you need to create a pip installable package yourself, hopefully this can save you some time!

Continue reading

Five-word stories about a world where AI dominates the world

Creative AI writing 🤖🖊️

For sale: baby shoes, never worn.” ~ Ernest Hemingway??

This is a six-word story famously misattributed to Ernest Hemingway. According to Wikipedia, this story first appeared in 1906, when Hemingway was 7 years old, and later attributed to him in 1991, 30 years after his death. So, no chance it was his.

Regardless of its origin, I found this type of story very creative.

In this blog post, as the title says, I will dare to push the boundary to present 5-word stories on the topic of AI taking over the world, BUT with a humorous spin.

Continue reading

My CCDC Science Day Experience

In June, I had the opportunity to visit the Cambridge Crystallographic Data Centre (CCDC) for Science Day to give a lightning talk on my rotation project with OPIG. The day was packed with presentations from researchers and PhD students collaborating with the CCDC, offering a great opportunity to hear about some of the fascinating work happening there in the fields of Structural and Computational Chemistry.

We kicked off with a dinner at the University Arms in Cambridge. This was a great opportunity to meet people who were attending Science Day in a relaxed environment, complemented by the lovely food and drink.

The next day was all about the talks. The first part of the day was filled with longer talks by more senior PhD students and CCDC researchers, followed by lightning talks from first-year PhD or master’s students. These shorter presentations provided a fast-paced overview of each project.

Continue reading

Happily hallucinating (for humans)

Many of us in academia face worries about an uncertain future. As an undergraduate, exams, assignments, exchanging information via auditory and visual cues with other members of the species1, then as one moves through the pipeline there’s funding, publications, the expectation that you know something about something, what will I be when I eventually grow up2, and I haven’t even mentioned the perennial question that is, what am I going to cook tonight?!

I have faced all of these worries and more, and will no doubt continue to, but through talking to my peers, mentors and family, I’ve learnt a few lessons that have proved invaluable for me, and perhaps will be for you as well.

Continue reading

Memory-mapped files for efficient data processing

Memory management is a key concern when working with large datasets. Many researchers and developers will load entire datasets into memory for processing. Although this is a straightforward approach that allows for quick access and manipulation of data, it has its drawbacks. When the dataset size approaches or exceeds the available physical memory, performance degrades rapidly due to excessive swapping, leading to increased latency and reduced throughput. Memory-mapped files are an alternative strategy to access and manipulate large datasets without the need to load them fully into memory.


A background on memory-mapped Files

Memory mapping is the process of mapping a file or a portion of a file directly into virtual memory. This mapping establishes a one-to-one correspondence between the file’s contents on disk and specific addresses in the process’s memory space. Instead of relying on traditional I/O operations, such as read() an write(), which involve copying data between kernel space and user space, the process can access the file’s contents directly through memory addresses. Then, page faults are used to determine which chunks to load into physical memory. However, this chunks are significantly smaller than the whole file contents. This direct access reduces overhead and can significantly speed up data processing, especially for large files or applications that require high-throughput I/O operations.

Continue reading