Author Archives: Adelaide Punt

Memory-mapped files for efficient data processing

Memory management is a key concern when working with large datasets. Many researchers and developers will load entire datasets into memory for processing. Although this is a straightforward approach that allows for quick access and manipulation of data, it has its drawbacks. When the dataset size approaches or exceeds the available physical memory, performance degrades rapidly due to excessive swapping, leading to increased latency and reduced throughput. Memory-mapped files are an alternative strategy to access and manipulate large datasets without the need to load them fully into memory.


A background on memory-mapped Files

Memory mapping is the process of mapping a file or a portion of a file directly into virtual memory. This mapping establishes a one-to-one correspondence between the file’s contents on disk and specific addresses in the process’s memory space. Instead of relying on traditional I/O operations, such as read() an write(), which involve copying data between kernel space and user space, the process can access the file’s contents directly through memory addresses. Then, page faults are used to determine which chunks to load into physical memory. However, this chunks are significantly smaller than the whole file contents. This direct access reduces overhead and can significantly speed up data processing, especially for large files or applications that require high-throughput I/O operations.

Continue reading

Exploring multilingual programming

Python is a prominent language in the ML and scientific computing space, and for good reason. Python is easy-to-learn and readable, and it offers a vast selection of libraries such as NumPy for numerical computation, Pandas for data manipulation, SciPy for scientific computing, TensorFlow, and PyTorch for deep learning, along with RDKit and Open Babel for cheminformatics. It is understandably an appealing choice for developers and researchers alike. However, a closer look at many common Python libraries reveals their foundations in C++. 

Revisiting C++ Advantages

Many of Python libraries including TensorFlow, PyTorch, and RDKit are all heavily-reliant on C++. C++ allows developers to manage memory and CPU resources more effectively than Python, making it a good choice when handling large volumes of data at a fast pace. A previous post on this blog discusses C++’s speed, its utility in GPU programming through CUDA, and the complexities of managing its libraries. Despite the steeper learning curve and verbosity compared to Python, the performance benefits of C++ are undeniable, especially in contexts where execution speed and memory management are critical.

Rust: A New Contender for High-Performance Computing

Continue reading