Monthly Archives: August 2024

Happily hallucinating (for humans)

Many of us in academia face worries about an uncertain future. As an undergraduate, there are exams, assignments, and exchanging information via auditory and visual cues with other members of the species. Then, as one moves through the pipeline, there’s funding, publications, the expectation that you know something about something, and the question of what will I be when I eventually grow up. And I haven’t even mentioned the perennial question: what am I going to cook tonight?!

I have faced all of these worries and more, and will no doubt continue to, but through talking to my peers, mentors and family, I’ve learnt a few lessons that have proved invaluable for me, and perhaps will be for you as well.

Continue reading

Memory-mapped files for efficient data processing

Memory management is a key concern when working with large datasets. Many researchers and developers will load entire datasets into memory for processing. Although this is a straightforward approach that allows for quick access and manipulation of data, it has its drawbacks. When the dataset size approaches or exceeds the available physical memory, performance degrades rapidly due to excessive swapping, leading to increased latency and reduced throughput. Memory-mapped files are an alternative strategy to access and manipulate large datasets without the need to load them fully into memory.


A background on memory-mapped files

Memory mapping is the process of mapping a file or a portion of a file directly into virtual memory. This mapping establishes a one-to-one correspondence between the file’s contents on disk and specific addresses in the process’s memory space. Instead of relying on traditional I/O operations, such as read() and write(), which involve copying data between kernel space and user space, the process can access the file’s contents directly through memory addresses. Page faults then determine which chunks (pages) are loaded into physical memory on demand, and these chunks are much smaller than the whole file. This direct access reduces overhead and can significantly speed up data processing, especially for large files or applications that require high-throughput I/O operations.
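As a rough illustration of the idea, here is a minimal sketch using Python's standard-library mmap module. The helper names read_slice and overwrite_in_place are hypothetical and chosen only for this example. The sketch maps a whole file, reads a small slice, and edits a few bytes in place, so only the pages that are actually touched need to be brought into physical memory.

```python
import mmap
import os
import tempfile


def read_slice(path, start, length):
    """Return `length` bytes starting at byte offset `start`, via a read-only map."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are only fetched when accessed.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[start:start + length]


def overwrite_in_place(path, start, data):
    """Overwrite len(data) bytes at offset `start` without rewriting the file."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
            # Slice assignment must keep the same length: a map cannot grow here.
            mm[start:start + len(data)] = data
            mm.flush()  # ask the OS to write the modified pages back to disk


if __name__ == "__main__":
    # Build a small throwaway file (1 MiB) purely for demonstration.
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
    with open(demo_path, "wb") as f:
        f.write(bytes(range(256)) * 4096)

    print(read_slice(demo_path, 256, 4))
    overwrite_in_place(demo_path, 10, b"OPIG")
    print(read_slice(demo_path, 10, 4))
```

The same pattern applies to files far larger than RAM, since slicing the map touches only the pages covering the requested byte range.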

Continue reading

OPunting 2024

This week (2024-08-07) instead of our usual group meeting, OPIG took to the high seas. The OPIGlets pooled our resources and procured punts from many different berths. Organised by Admiral Nele, we departed from the Cherwell Boathouse and shipped out the 0.5 nautical miles (roughly 3.09e-6 light seconds for those playing along in metric) upriver to the Vicky Arms.

Despite visiting the odd bush on the way, scurvy scallywags one and all were herded in a generally upstream direction, with Matt and Eoin leading the way. With the first two punts having safely reached dry land and refuelled their ethanol fuel cells, the question remained where on earth everyone else had got to. Sagely concluding they’d probably all sunk, another pint was had in their honour.

Continue reading

Converting or renaming files, whilst still maintaining the directory structure

For various reasons we might need to convert files from one format to another, for instance from lossless FLAC to MP3. For example:

ffmpeg -i lossless-audio.flac -acodec libmp3lame -ab 128k compressed-audio.mp3

This could be any conversion, but the command above assumes that the input file and the output file are in the same directory. What if we have a carefully curated directory structure and we want to convert (or rename) every file within that structure?

find . -name "*.whateveryouneed" -exec somecommand {} \; is the tool for you.
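For readers who prefer scripting this, the same idea can be sketched in Python. This is an illustrative analogue of the find approach, not the method from the post: the function names plan_conversions and convert_tree are hypothetical, and the sketch assumes ffmpeg is on the PATH and reuses the FLAC-to-MP3 options shown earlier.

```python
import subprocess
from pathlib import Path


def plan_conversions(src_root, dst_root, src_ext, dst_ext):
    """Pair every source file under src_root with a mirrored output path."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    pairs = []
    for src in sorted(src_root.rglob(f"*{src_ext}")):
        if src.is_file():
            # Keep the path relative to the root so the layout is preserved.
            rel = src.relative_to(src_root)
            pairs.append((src, (dst_root / rel).with_suffix(dst_ext)))
    return pairs


def convert_tree(src_root, dst_root, src_ext=".flac", dst_ext=".mp3"):
    """Run ffmpeg once per file, creating output directories as needed."""
    for src, dst in plan_conversions(src_root, dst_root, src_ext, dst_ext):
        dst.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ["ffmpeg", "-i", str(src), "-acodec", "libmp3lame",
             "-ab", "128k", str(dst)],
            check=True,  # stop at the first failed conversion
        )
```

Calling convert_tree("music", "music-mp3") would, under these assumptions, write music-mp3/artist/album/track.mp3 for each music/artist/album/track.flac. Separating the planning step from the execution step makes it easy to inspect the list of pairs before anything is converted.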

Continue reading

Experience at the Computational Chemistry Gordon Research Conference

This past July I had the absolute delight of attending the Computational Chemistry Gordon Research Seminar and Conference all the way in Portland, Maine. It was my first Gordon experience, and it was an invigorating seven days with lots of great science and great people to meet!

Since pictures and videos are not allowed at GRCs, which encourage the presentation of unpublished results, I’ll talk more generally about the conference as a whole and the general science themes related to my work.

Continue reading

Sort and Slice Tutorial – An alternative to extended connectivity fingerprints

ggPlotting tips with OPIG data

Ever wondered whether OPIGlets keep their ketchup in the fridge or the cupboard? Perhaps you’ve wanted to know how to create a nice figure that displays lots of information simultaneously. Publication-quality figures are easy in R with the ggplot2 package. We may also learn a bit about good visualisation along the way.

Continue reading

Conference Summary: AIRR Community Meeting VII – Learnings and Perspectives

At the start of June, we (Lewis and Benjie) attended the AIRR Community meeting in beautiful and sunny Porto, Portugal. This meeting was focused on collecting and analysing adaptive immune receptor repertoires. The attendees comprised two rival factions: the antibody (Ab) people and the T cell antigen receptor (TCR) people. The split was nearly fifty-fifty between these two topics throughout the conference. Overall, the conference was a fairly comfortable size, with approximately a hundred people in attendance, making it easy to visit all of the posters and talk with many people in your area, without feeling too niche. There was a wide variety of content formats throughout the conference, including posters, scientific talks, lightning talks, software demos, and hands-on tutorials. In the following section, we highlight some of our favourite sessions to give a flavour of what this meeting entails.

Continue reading

Incorporating conformer ensembles for better molecular representation learning

Conformer ensemble of tryptophan from Seibert et al.

The spatial or 3D structure of a molecule is particularly relevant to modeling its activity in QSAR. This 3D structural information affects molecular properties and chemical reactivity, and thus it is important to incorporate it in deep learning models built for molecules. A key aspect of the spatial structure of molecules is the flexible arrangement of their constituent atoms, known as conformation. Given the temperature of a molecular system, the probability of each of its possible conformations is determined by its formation energy and follows a Boltzmann distribution [McQuarrie and Simon, 1997]. The Boltzmann distribution tells us the probability of a certain conformation given its potential energy. The different conformations of a molecule could result in different properties and activity. Therefore, it is imperative to consider multiple conformers in molecular deep learning to ensure that the notion of conformational flexibility is embedded in the model developed. The model should also be able to capture the Boltzmann distribution of the potential energy related to the conformers.
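To make the Boltzmann weighting concrete: for conformer energies E_i at temperature T, the population of conformer i is p_i = exp(-E_i / RT) / sum_j exp(-E_j / RT). The sketch below computes these weights. It assumes energies are given in kcal/mol, ignores degeneracy, and uses a hypothetical function name, boltzmann_weights, that is not taken from the post.

```python
import math

# Gas constant in kcal/(mol*K); assumes energies are supplied in kcal/mol.
R_KCAL_PER_MOL_K = 0.0019872041


def boltzmann_weights(energies_kcal_per_mol, temperature_k=298.15):
    """Return normalised Boltzmann populations for a list of conformer energies."""
    if not energies_kcal_per_mol:
        raise ValueError("at least one conformer energy is required")
    beta = 1.0 / (R_KCAL_PER_MOL_K * temperature_k)
    # Shift by the minimum energy for numerical stability; the weights depend
    # only on energy differences, so the result is unchanged.
    e_min = min(energies_kcal_per_mol)
    factors = [math.exp(-beta * (e - e_min)) for e in energies_kcal_per_mol]
    total = sum(factors)
    return [f / total for f in factors]
```

Weights like these could, for instance, be used to average per-conformer predictions or embeddings, so that low-energy conformers contribute more to the ensemble representation.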

Continue reading

Reproducible publishing

I’m a big fan of Jupyter Notebooks. They’re a great way to document and explain code, and even better, you can run this code when connected to an appropriate kernel.

What if you want to work on something larger than a notebook? Say a chapter or even a whole book, with Python, R, Observable JS, or Julia code? Enter Quarto. You can combine Jupyter notebooks and/or plain text markdown to publish production-quality articles, presentations, dashboards, websites, blogs and books in HTML, PDF, Microsoft Word, ePub, and other formats. Quarto can also connect to publishing platforms like Posit Connect, Confluence Cloud, and others.

Continue reading