Monthly Archives: August 2024

Happily hallucinating (for humans)

Many of us in academia face worries about an uncertain future. As an undergraduate, there are exams, assignments, and exchanging information via auditory and visual cues with other members of the species¹. Then, as one moves through the pipeline, there’s funding, publications, the expectation that you know something about something, and the question of what I will be when I eventually grow up². And I haven’t even mentioned the perennial question: what am I going to cook tonight?!

I have faced all of these worries and more, and will no doubt continue to, but through talking to my peers, mentors and family, I’ve learnt a few lessons that have proved invaluable for me, and perhaps will be for you as well.

Continue reading →

Memory-mapped files for efficient data processing

Memory management is a key concern when working with large datasets. Many researchers and developers will load entire datasets into memory for processing. Although this is a straightforward approach that allows for quick access and manipulation of data, it has its drawbacks. When the dataset size approaches or exceeds the available physical memory, performance degrades rapidly due to excessive swapping, leading to increased latency and reduced throughput. Memory-mapped files are an alternative strategy to access and manipulate large datasets without the need to load them fully into memory.

A background on memory-mapped files

Memory mapping is the process of mapping a file or a portion of a file directly into virtual memory. This mapping establishes a one-to-one correspondence between the file’s contents on disk and specific addresses in the process’s memory space. Instead of relying on traditional I/O operations, such as read() and write(), which involve copying data between kernel space and user space, the process can access the file’s contents directly through memory addresses. Page faults then determine which chunks (pages) are loaded into physical memory, and these chunks are significantly smaller than the whole file. This direct access reduces overhead and can significantly speed up data processing, especially for large files or applications that require high-throughput I/O operations.
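As a minimal sketch of the idea, the snippet below uses Python's standard-library mmap module to read a small slice from the middle of a file without reading the rest. The helper name read_slice and the temporary demo file are assumptions made for illustration, not part of the original post.

```python
import mmap
import os
import tempfile


def read_slice(path, start, length):
    """Return `length` bytes starting at byte offset `start`, via a read-only memory map."""
    with open(path, "rb") as f:
        # A length of 0 maps the whole file; pages are only loaded when they are touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing a mmap object returns a bytes copy of just that region.
            return mm[start:start + length]


# Build a small demo file (hypothetical data) so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789" * 1000)
    demo_path = tmp.name

chunk = read_slice(demo_path, 5000, 10)
os.remove(demo_path)
```

Only the pages backing the requested slice need to be resident in physical memory, which is what makes the same pattern attractive for files much larger than RAM.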

Continue reading →

Converting or renaming files, whilst still maintaining the directory structure

For various reasons we might need to convert files from one format to another, for instance from lossless FLAC to MP3. For example:

ffmpeg -i lossless-audio.flac -acodec libmp3lame -ab 128k compressed-audio.mp3

This could be any conversion, but it implies that the input file and the output file are in the same directory. What if we have a carefully curated directory structure and we want to convert (or rename) every file within that structure?

find . -name "*.whateveryouneed" -exec somecommand {} \; is the tool for you.
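For readers who prefer to script this, here is a rough Python sketch of the "keep the directory structure" part: it maps each input file to the same relative location under a separate output root. The function names mirrored_output_path and plan_conversions are hypothetical, and the sketch only plans the paths rather than running any converter.

```python
from pathlib import Path


def mirrored_output_path(src, src_root, dst_root, new_suffix):
    """Map `src` (inside `src_root`) to the same relative location under `dst_root`."""
    relative = Path(src).relative_to(src_root)
    return (Path(dst_root) / relative).with_suffix(new_suffix)


def plan_conversions(src_root, dst_root, old_suffix, new_suffix):
    """List (input, output) path pairs for every matching file below `src_root`."""
    src_root = Path(src_root)
    return [
        (src, mirrored_output_path(src, src_root, dst_root, new_suffix))
        for src in sorted(src_root.rglob(f"*{old_suffix}"))
    ]


# Example with made-up paths: the album folder layout is preserved under "mp3/".
example = mirrored_output_path(
    Path("music/artist/album/track.flac"), Path("music"), Path("mp3"), ".mp3"
)
```

Each pair could then be handed to the conversion command of your choice, after creating the output's parent directory with mkdir(parents=True, exist_ok=True).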

Continue reading →

Experience at the Computational Chemistry Gordon Research Conference

This past July I had the absolute delight of attending the Computational Chemistry Gordon Research Seminar and Conference all the way in Portland, Maine. It was my first Gordon experience, and it was an invigorating seven days of great science and meeting great people!

Since GRCs support the presentation of unpublished results, pictures and videos are not allowed, so I’ll talk more generally about the conference as a whole and the general science themes related to my work.

Continue reading →

Sort and Slice Tutorial – An alternative to extended connectivity fingerprints

Background

Sort and Slice (SNS) was developed by a former OPIGlet, Markus, as a method for improving Extended Connectivity Fingerprints (ECFPs) by overcoming bit collisions. ECFPs are a form of topological fingerprint that encodes the presence or absence of circular substructures in a molecule. The steps for deriving an ECFP from a molecule are as follows:

Identifier assignment:

Each atom in the molecule is assigned an initial numerical identifier; this is typically generated by hashing a tuple of atomic properties called Daylight atomic invariants into a 32-bit integer. These properties are:
1. Number of non-hydrogen neighbours.
2. Valence minus the number of neighbouring hydrogens.
3. Atomic number.
4. Atomic mass.
5. Atomic charge.
6. Number of hydrogen neighbours.
7. Ring membership.*
*Ring membership is an additional property that is often used but is not one of the original Daylight atomic invariants.
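
The identifier assignment step can be sketched roughly as follows. This is an illustration only: the AtomInvariants tuple, the initial_identifier function and the use of a truncated SHA-256 digest are all assumptions made for the example, and cheminformatics toolkits use their own hash functions, so the numbers produced here will not match theirs.

```python
import hashlib
from typing import NamedTuple


class AtomInvariants(NamedTuple):
    """The atomic properties listed above (field names are hypothetical)."""
    heavy_neighbours: int          # number of non-hydrogen neighbours
    valence_minus_hydrogens: int   # valence minus number of attached hydrogens
    atomic_number: int
    atomic_mass: int
    charge: int
    hydrogen_neighbours: int
    in_ring: bool                  # the optional ring-membership flag


def initial_identifier(inv):
    """Hash the invariant tuple into an unsigned 32-bit integer.

    Illustrative hash only (first 4 bytes of SHA-256), not the hash used by any toolkit.
    """
    payload = ",".join(str(int(value)) for value in inv).encode("ascii")
    digest = hashlib.sha256(payload).digest()
    return int.from_bytes(digest[:4], "big")


# A methyl carbon in ethane: 1 heavy neighbour, valence 4 minus 3 hydrogens,
# atomic number 6, mass 12, neutral, 3 hydrogens, not in a ring.
methyl_carbon = AtomInvariants(1, 1, 6, 12, 0, 3, False)
identifier = initial_identifier(methyl_carbon)
```

The property that matters is that identical atomic environments always map to the same integer, while the output stays within the 32-bit range.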

Continue reading →

ggPlotting tips with OPIG data

Ever wondered whether OPIGlets keep their ketchup in the fridge or cupboard? Perhaps you’ve wanted to know how to create a nice figure that displays lots of information simultaneously. Publication-quality figures are easy in R with the ggplot2 package. We may also pick up some good visualisation practice along the way.

Continue reading →

Conference Summary: AIRR Community Meeting VII – Learnings and Perspectives

At the start of June, we (Lewis and Benjie) attended the AIRR Community meeting in beautiful and sunny Porto, Portugal. This meeting was focused on collecting and analysing adaptive immune receptor repertoires. The attendees comprised two rival factions: the antibody (Ab) people and the T cell antigen receptor (TCR) people. The split was nearly fifty-fifty between these two topics throughout the conference. Overall, the conference was a fairly comfortable size, with approximately a hundred people in attendance, making it easy to visit all of the posters and talk with many people in your area, without feeling too niche. There was a wide variety of content formats throughout the conference, including posters, scientific talks, lightning talks, software demos, and hands-on tutorials. In the following section, we highlight some of our favourite sessions to give a flavour of what this meeting entails.

Continue reading →

Incorporating conformer ensembles for better molecular representation learning

Conformer ensemble of tryptophan from Seibert et al.

The spatial or 3D structure of a molecule is particularly relevant to modeling its activity in QSAR. 3D structural information affects molecular properties and chemical reactivity, so it is important to incorporate it in deep learning models built for molecules. A key aspect of the spatial structure of molecules is the flexible arrangement of their constituent atoms, known as conformation. Given the temperature of a molecular system, the probability of each of its possible conformations is defined by its formation energy and follows a Boltzmann distribution [McQuarrie and Simon, 1997]. The Boltzmann distribution tells us the probability of a certain conformation given its potential energy. The different conformations of a molecule can result in different properties and activity. Therefore, it is imperative to consider multiple conformers in molecular deep learning to ensure that the notion of conformational flexibility is embedded in the model developed. The model should also be able to capture the Boltzmann distribution of the potential energy related to the conformers.
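
To make the Boltzmann weighting concrete, here is a small sketch that turns conformer energies into probabilities, with each weight proportional to exp(-E / kT). The function name boltzmann_weights, the choice of kcal/mol as the energy unit and the example energies are assumptions for illustration.

```python
import math

# Molar Boltzmann constant (gas constant) in kcal/(mol*K).
K_B_KCAL_PER_MOL_K = 0.0019872041


def boltzmann_weights(energies_kcal_per_mol, temperature_k=298.15):
    """Return the Boltzmann probability of each conformer from its energy."""
    kt = K_B_KCAL_PER_MOL_K * temperature_k
    # Shift by the lowest energy for numerical stability; the ratios are unchanged.
    e_min = min(energies_kcal_per_mol)
    factors = [math.exp(-(e - e_min) / kt) for e in energies_kcal_per_mol]
    total = sum(factors)
    return [f / total for f in factors]


# Three hypothetical conformers with relative energies in kcal/mol.
weights = boltzmann_weights([0.0, 0.5, 2.0])
```

Lower-energy conformers receive larger weights, and the weights sum to one, so they can be used directly to average per-conformer predictions or embeddings.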

Continue reading →

Reproducible publishing

I’m a big fan of Jupyter Notebooks. They’re a great way to document and explain code, and even better, you can run this code when connected to an appropriate kernel.

What if you want to work on something larger than a notebook? Say a chapter or even a whole book, with Python, R, Observable JS, or Julia code? Enter Quarto. You can combine Jupyter notebooks and/or plain text markdown to publish production-quality articles, presentations, dashboards, websites, blogs and books in HTML, PDF, Microsoft Word, ePub, and other formats. Quarto can also connect to publishing platforms like Posit Connect, Confluence Cloud, and others.

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends
