Memory-mapped files for efficient data processing

Memory management is a key concern when working with large datasets. Many researchers and developers will load entire datasets into memory for processing. Although this is a straightforward approach that allows for quick access and manipulation of data, it has its drawbacks. When the dataset size approaches or exceeds the available physical memory, performance degrades rapidly due to excessive swapping, leading to increased latency and reduced throughput. Memory-mapped files are an alternative strategy to access and manipulate large datasets without the need to load them fully into memory.


A background on memory-mapped files

Memory mapping is the process of mapping a file, or a portion of a file, directly into virtual memory. This mapping establishes a one-to-one correspondence between the file’s contents on disk and specific addresses in the process’s memory space. Instead of relying on traditional I/O operations, such as read() and write(), which involve copying data between kernel space and user space, the process can access the file’s contents directly through memory addresses. Page faults then determine which chunks of the file are loaded into physical memory, and these chunks (pages) are significantly smaller than the file as a whole. This direct access reduces overhead and can significantly speed up data processing, especially for large files or applications that require high-throughput I/O operations.

Under the hood, Python’s memory-mapped files rely on the mmap() system call, which is exposed through the standard mmap module. In C, mmap() is often used in place of malloc() for allocating memory. mmap() requests that the operating system map a file or device into the process’s memory, and it supports lazy loading, meaning that the file’s pages are only loaded into physical memory when the process accesses them.
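To make lazy loading concrete, here is a minimal sketch using Python’s mmap module, assuming some existing file named large_dataset.bin (a placeholder name). Only the pages backing the slice that is actually read are faulted into physical memory; the rest of the file stays on disk until touched.

import mmap

with open('large_dataset.bin', 'rb') as f:
    # length=0 maps the entire file; ACCESS_READ keeps the mapping read-only
    mm = mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)

    # Only the pages backing this slice are loaded into physical memory;
    # the remainder of the file is not read until it is accessed.
    header = mm[:4096]
    print(header[:16])

    mm.close()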

How memory-mapped files work in Python

The mmap() system call requires several parameters to establish a memory-mapped file:

  • addr – A hint to the operating system indicating where to start the virtual mapping. If set to NULL, the kernel chooses an appropriate address.
  • length – Specifies the length of the mapping in bytes.
  • prot – Defines the protection level of the mapped memory, such as read, write, or execute permissions.
  • flags – Determines various options for the mapping, such as whether it is backed by a file or anonymous (not backed by a file).
  • fd – The file descriptor of the file to be mapped. For anonymous mappings, this is set to -1.
  • offset – Indicates the starting point within the file for the mapping, which must be aligned with the system’s page size.

mmap() returns a pointer to the mapped area, allowing the process to access the file as if it were part of its own memory space. The figure above, from materials provided in the COMP 321 course on Operating Systems at Rice University, demonstrates the mmap() system call in action. It shows how a file on disk is mapped into the process’s virtual memory space, establishing a direct correspondence between the file’s bytes and specific memory addresses.
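Python’s mmap module exposes essentially the same parameters. The sketch below is illustrative only: example.bin is a placeholder file name, and the flags/prot form shown is the Unix signature of the constructor (on Windows, a tagname argument takes their place).

import mmap

# File-backed mapping: fileno, length, flags, prot, and offset mirror the
# fd, length, flags, prot, and offset arguments of the mmap() system call.
with open('example.bin', 'r+b') as f:
    file_backed = mmap.mmap(
        f.fileno(),                              # fd
        length=0,                                # 0 means "map the whole file"
        flags=mmap.MAP_SHARED,                   # changes are written back to the file
        prot=mmap.PROT_READ | mmap.PROT_WRITE,   # read/write protection
        offset=0,                                # must be a multiple of mmap.ALLOCATIONGRANULARITY
    )
    file_backed.close()

# Anonymous mapping: fd is -1 and the memory is not backed by a file.
anonymous = mmap.mmap(-1, length=4096)
anonymous[:5] = b'hello'
anonymous.close()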

Pros and cons of memory mapping

Memory-mapped files offer several advantages, the first being lazy loading. Pages of the file are only loaded into memory when accessed, which conserves both memory and processing time, especially in applications where only a portion of the file is needed at any given time. By avoiding the overhead associated with multiple system calls and the need to copy data between kernel space and user space, memory mapping can lead to substantial performance improvements. This is particularly beneficial for large files or files that are frequently accessed, as the process interacts with the file’s content as though it were in memory.

When multiple processes need to access the same data, memory-mapped files let them share a single copy of it rather than each loading its own, which is particularly advantageous in server environments. Finally, memory mapping provides more versatile memory allocation than alternatives like malloc(). It can be used in signal handlers and allows for dynamic memory management, such as allocating memory in the middle of the heap or freeing memory at any point.
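As a rough illustration of the multi-process case mentioned above, the following sketch (using a placeholder file name, shared.dat) has a parent and a child process map the same file. The kernel backs both mappings with the same page-cache pages, so the data is not duplicated in memory.

import mmap
import multiprocessing

def read_shared(path):
    # Child process: map the same file read-only and inspect its contents
    with open(path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)
        print('child sees: ', mm[:11])
        mm.close()

if __name__ == '__main__':
    path = 'shared.dat'
    with open(path, 'wb') as f:
        f.write(b'hello mmap!')

    # Parent and child each create their own mapping of the same file; the
    # kernel serves both from the same underlying pages, so nothing is copied.
    child = multiprocessing.Process(target=read_shared, args=(path,))
    child.start()
    child.join()

    with open(path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)
        print('parent sees:', mm[:11])
        mm.close()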


However, memory-mapping also has some disadvantages. Memory mappings must align with the system’s page boundaries, which can lead to wasted space if the file size is not a multiple of the page size. Additionally, extensive use of memory-mapped files in systems with limited address space can lead to fragmentation, complicating memory management. There is also some overhead associated with maintaining these mappings within the kernel, though this is often outweighed by the performance benefits.
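The alignment constraint is easy to inspect from Python: offsets passed to mmap() must be multiples of the allocation granularity, and mappings are rounded up to whole pages. A small illustrative sketch:

import mmap

print('page size:              ', mmap.PAGESIZE)
print('allocation granularity: ', mmap.ALLOCATIONGRANULARITY)

# To map a region that starts mid-file, the offset has to be rounded down to
# an aligned boundary and the remainder skipped inside the mapping.
desired_offset = 10_000
aligned_offset = (desired_offset // mmap.ALLOCATIONGRANULARITY) * mmap.ALLOCATIONGRANULARITY
skip = desired_offset - aligned_offset
print(f'map from offset {aligned_offset}, then skip {skip} bytes inside the mapping')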

Implementing memory-mapped files in Python

Python’s mmap module provides a straightforward interface for working with memory-mapped files, which is particularly useful when handling large datasets such as the QM9 dataset from PyTorch Geometric. The example below demonstrates how to map a large file into memory, read SMILES strings and their corresponding property values, modify them, and then update the file, all without loading the entire dataset into RAM.

import mmap
import os
import struct
from torch_geometric.datasets import QM9

# Load the QM9 dataset
dataset = QM9(root=os.path.join(os.path.dirname(os.path.realpath(__file__)), '..', 'data', 'QM9'))

# Define the file name and size
filename = 'qm9_smiles_properties.dat'
filesize = 1024 * 1024 * 100  # 100 MB for example

# Create a file of the given size
with open(filename, 'wb') as f:
    f.seek(filesize - 1)
    f.write(b'\x00')

# Open and memory-map the file
with open(filename, 'r+b') as f:
    mmapped_file = mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_WRITE)

    # Write SMILES string and property value to the memory-mapped file
    for index, data in enumerate(dataset):
        smiles_string = data.smiles  # SMILES string from QM9 dataset
        property_value = data.y[0, 0].item()  # One target column; data.y has shape [1, num_targets], so .item() needs a single element

        smiles_binary = smiles_string.encode('utf-8')
        property_value_binary = struct.pack('d', property_value)

        entry = (
            struct.pack('I', len(smiles_binary)) + smiles_binary +
            property_value_binary
        )

        # Write the entry to the memory-mapped file
        mmapped_file.write(entry)
        mmapped_file.flush()

    # Retrieve and update an entry
    mmapped_file.seek(0)  # Reset pointer to the beginning

    while mmapped_file.tell() < filesize:
        # Read the length of the SMILES string
        smiles_len = struct.unpack('I', mmapped_file.read(4))[0]
        if smiles_len == 0:
            # Reached the zero-filled padding past the last written entry
            break
        smiles_binary = mmapped_file.read(smiles_len)
        property_value_binary = mmapped_file.read(8)

        # Decode SMILES string and unpack property value
        smiles_string = smiles_binary.decode('utf-8')
        property_value = struct.unpack('d', property_value_binary)[0]

        # Adjust property value (multiply by 2)
        modified_property_value = property_value * 2
        modified_property_value_binary = struct.pack('d', modified_property_value)

        # Seek back to where the property value was stored
        mmapped_file.seek(-8, os.SEEK_CUR)
        mmapped_file.write(modified_property_value_binary)
        mmapped_file.flush()

    # Close the memory-mapped file
    mmapped_file.close()

In this example, the PyTorch Geometric QM9 dataset is loaded, and a 100 MB file is created as a placeholder for storing the data. This file is then memory-mapped, allowing it to be accessed as if it were part of the process’s memory. For each molecule, the SMILES string and a corresponding property value (e.g., the U0 energy from the QM9 dataset) are encoded and written to the memory-mapped file; each entry consists of the length of the SMILES string, the SMILES string itself, and the property value packed as a double. After writing, the code scans the entries from the beginning, decodes each SMILES string and property value, doubles the property value, and writes the modified value back in place, stopping once it reaches the zero-filled padding that follows the last entry. Changes are flushed to disk after each write operation, and finally the memory-mapped file is closed.
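As a short follow-up sketch (not part of the original example), the same file can later be re-opened read-only and a single record decoded without scanning the rest; only the first page of the 100 MB file is faulted into memory.

import mmap
import struct

with open('qm9_smiles_properties.dat', 'rb') as f:
    mm = mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)

    # Decode only the first entry: 4-byte length prefix, SMILES string, 8-byte double
    smiles_len = struct.unpack('I', mm[:4])[0]
    smiles_string = mm[4:4 + smiles_len].decode('utf-8')
    property_value = struct.unpack('d', mm[4 + smiles_len:12 + smiles_len])[0]

    print(smiles_string, property_value)
    mm.close()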

