Big Compute doesn't want you to know this! Maximising GPU Usage with CUDA MPS

Accelerating Simulations with CUDA MPS: An OpenMM Implementation Guide

Introduction

High-performance molecular dynamics workflows often require running many simulations concurrently on GPUs. However, traditional GPU resource allocation can lead to poor utilization when running multiple processes, and users often resort to spreading the work across multiple GPUs instead. While parallelising across nodes can improve time to solution, many processes require coordination, and the resulting communication quickly becomes a bottleneck. This is exacerbated on more powerful hardware, where even intra-node communication for a single simulation on a single GPU can become limiting. The equivalent problem has long been addressed for CPU parallelism with multiprocessing and multithreading, but doing the same efficiently on GPUs was previously challenging.

NVIDIA’s Multi-Process Service (MPS) offers a solution by enabling efficient and easy sharing of GPU resources among multiple processes with just a few commands. In this blog post, we’ll explore how to implement CUDA MPS with Python multiprocessing and OpenMM to accelerate molecular dynamics simulations.

What is CUDA MPS?

CUDA Multi-Process Service (MPS) is a feature that allows multiple CUDA processes to share a single GPU efficiently. Instead of the default time-slicing approach, MPS enables concurrent execution of multiple CUDA kernels from different processes, leading to improved GPU utilization and higher throughput.

Key benefits of MPS include:
– Better GPU utilization through concurrent kernel execution
– Reduced kernel launch latency
– Improved performance for multi-process applications
– Efficient memory management across processes

This is in contrast to NVIDIA Multi-Instance GPU (MIG), which also allows concurrent GPU access. However, MIG places restrictions on the size and number of slices, which may limit performance where processes are compute-limited rather than communication-limited.

What do you need?

  • An MPS-capable NVIDIA GPU (any NVIDIA GPU with compute capability 7.0 / Volta or newer for the modern MPS implementation; a quick check is sketched below)
  • Permission to launch the MPS daemon yourself
  • A CUDA- and multiprocessing-compatible program (such as OpenMM)
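If you are unsure whether your GPU qualifies, a minimal sketch like the following can query the compute capability through nvidia-smi. Note this is an assumption-laden helper: the compute_cap query field requires a reasonably recent driver, and the 7.0 threshold reflects the Volta-era MPS described above.

import subprocess

def gpu_supports_mps(min_capability=7.0):
    """Return True if the first visible GPU reports a compute capability
    of at least `min_capability` (7.0 = Volta)."""
    try:
        out = subprocess.check_output([
            'nvidia-smi',
            '--query-gpu=compute_cap',
            '--format=csv,noheader'
        ])
        capability = float(out.decode().strip().splitlines()[0])
        return capability >= min_capability
    except (subprocess.CalledProcessError, FileNotFoundError, ValueError):
        return False

print(gpu_supports_mps())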

Performance Impact (from NVIDIA GROMACS Benchmarks)


Based on the benchmark data shown in the GROMACS tests:
– For a smaller system (RNAse, 24K atoms), MPS improved throughput by up to 3.5x compared to default scheduling
– For a larger system (ADH Dodec, 96K atoms), MPS showed approximately 1.8x improvement
– The performance gains were most significant when running 16-32 simulations per GPU

Figure 1. Scaling total throughput on a DGX A100 server with the number of simulations per GPU for the RNAse Cubic (left) and ADH Dodec (right) test cases. Shown are results using MPS (open triangles), MIG combined with MPS (closed triangles), and neither (open circles).

Implementation Guide


1. Setting Up CUDA MPS

First, we need to configure the CUDA MPS environment. Here’s a Python implementation for checking the MPS context:

Caution! This does not activate MPS; it simply checks that the MPS environment is configured on the system. Also check the output of nvidia-smi in your job logs to confirm that MPS is actually running.

import os
import subprocess

def check_cuda_mps_status():
    """Check if CUDA MPS is properly configured and running."""
    try:
        # Verify CUDA availability
        subprocess.run(["nvidia-smi"], check=True, capture_output=True)

        # Check MPS environment variable
        mps_pipe = os.getenv('CUDA_MPS_PIPE_DIRECTORY')
        if not mps_pipe:
            print("Warning: CUDA_MPS_PIPE_DIRECTORY not set")
            return False

        # Verify MPS control file
        control_file = os.path.join(mps_pipe, "control")
        if not os.path.exists(control_file):
            print("Warning: MPS control file not found")
            return False

        return True

    except subprocess.CalledProcessError:
        print("Warning: Unable to verify CUDA setup")
        return False



2. Initializing CUDA for OpenMM

We need to properly initialize CUDA separately in each OpenMM process. Here we are looking at a single-GPU system, but the same approach can be used across multiple GPUs too.

import openmm as omm

def initialize_cuda():
    """Initialize the CUDA platform for OpenMM in this process."""
    try:
        platform = omm.Platform.getPlatformByName('CUDA')
        properties = {}
        # Optionally set a specific GPU device
        # properties['CudaDeviceIndex'] = '0'
        return platform, properties
    except Exception as e:
        print(f"Error initializing CUDA: {e}")
        return None, None
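If more than one GPU is visible to the job, one option is to spread workers across devices. A minimal sketch follows; the worker_index argument, num_gpus parameter, and round-robin assignment are illustrative assumptions rather than part of the code above.

def initialize_cuda_for_worker(worker_index, num_gpus=1):
    """Variant of initialize_cuda() that pins each worker to a GPU, round-robin
    across `num_gpus` devices assumed to be visible to this job."""
    try:
        platform = omm.Platform.getPlatformByName('CUDA')
        # OpenMM's CUDA platform accepts the device index as a string property
        properties = {'CudaDeviceIndex': str(worker_index % num_gpus)}
        return platform, properties
    except Exception as e:
        print(f"Error initializing CUDA: {e}")
        return None, None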



3. Implementing Parallel Simulations

The key to efficient MPS utilization is proper process management. Here’s how to manage multiple simulation contexts and correctly spawn processes:

import multiprocessing

def simulation_worker(task):
    """Worker function for individual simulation processes."""
    try:
        # Initialize CUDA for this process
        platform, properties = initialize_cuda()
        if platform is None:
            raise RuntimeError("Failed to initialize CUDA platform")

        # Create simulation instance
        simulation = build_system_with_platform(
            task.prmtop_file, 
            platform, 
            properties
        )
        
        # Run simulation
        run_simulation(simulation, task)
        
        # Clean up
        del simulation.context
        del simulation
        
        return True
    except Exception as e:
        print(f"Error in simulation: {str(e)}")
        return False

def run_parallel_simulations(simulation_tasks, max_concurrent=4):
    """Run multiple simulations in parallel with MPS."""
    # Use 'spawn' method for clean CUDA state
    ctx = multiprocessing.get_context('spawn')
    
    with ctx.Pool(processes=max_concurrent) as pool:
        results = pool.map(simulation_worker, simulation_tasks)
    return results
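The worker above relies on a task object and two helpers (build_system_with_platform and run_simulation) that are not shown. Their exact form depends on your system setup; here is a minimal sketch assuming Amber prmtop/inpcrd inputs, with illustrative field names, defaults, and file paths.

from dataclasses import dataclass

import openmm as omm
import openmm.app as app
import openmm.unit as unit

@dataclass
class SimulationTask:
    prmtop_file: str
    inpcrd_file: str
    n_steps: int = 100_000

def build_system_with_platform(prmtop_file, platform, properties):
    """Build an OpenMM Simulation from an Amber prmtop on the given platform."""
    prmtop = app.AmberPrmtopFile(prmtop_file)
    system = prmtop.createSystem(nonbondedMethod=app.PME,
                                 nonbondedCutoff=1.0 * unit.nanometer,
                                 constraints=app.HBonds)
    integrator = omm.LangevinMiddleIntegrator(300 * unit.kelvin,
                                              1.0 / unit.picosecond,
                                              0.002 * unit.picoseconds)
    return app.Simulation(prmtop.topology, system, integrator,
                          platform, properties)

def run_simulation(simulation, task):
    """Set coordinates, minimise, and run the requested number of steps."""
    inpcrd = app.AmberInpcrdFile(task.inpcrd_file)
    simulation.context.setPositions(inpcrd.positions)
    if inpcrd.boxVectors is not None:
        simulation.context.setPeriodicBoxVectors(*inpcrd.boxVectors)
    simulation.minimizeEnergy()
    simulation.step(task.n_steps)

if __name__ == "__main__":
    tasks = [SimulationTask("system.prmtop", "system.inpcrd") for _ in range(8)]
    run_parallel_simulations(tasks, max_concurrent=4)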



4. Memory Management and Optimization

When running multiple simulations with MPS, careful memory management is crucial. It is important to check both GPU memory and CPU memory constraints!

A) Determine Optimal Concurrency:

def get_optimal_concurrent_simulations():
    """Estimate how many simulations can run concurrently on the GPU."""
    try:
        gpu_info = subprocess.check_output([
            'nvidia-smi', 
            '--query-gpu=memory.total', 
            '--format=csv,noheader,nounits'
        ])
        # Total memory of the first GPU, in MiB
        gpu_memory = int(gpu_info.decode().strip().splitlines()[0])
        # Conservative estimate: roughly 1 GiB per simulation, capped at 4
        max_concurrent = min(4, gpu_memory // 1000)
        return max_concurrent
    except (subprocess.CalledProcessError, ValueError, IndexError):
        return 2  # Default fallback
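GPU memory is only half the picture; each worker also needs CPU cores and host RAM. A small sketch of how you might additionally cap concurrency by the CPUs allocated to the job (os.sched_getaffinity is Linux-specific, and the one-core-per-worker assumption is illustrative):

import os

def cap_by_cpu_allocation(max_concurrent):
    """Never spawn more workers than there are CPU cores allocated to this job."""
    try:
        allocated_cpus = len(os.sched_getaffinity(0))  # Respects Slurm/cgroup limits
    except AttributeError:
        allocated_cpus = os.cpu_count() or 1           # Fallback on non-Linux systems
    return min(max_concurrent, allocated_cpus)

max_concurrent = cap_by_cpu_allocation(get_optimal_concurrent_simulations())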


B) Clean Up Resources:
– Always delete OpenMM contexts after use
– Implement proper error handling
– Monitor GPU memory usage (a small monitoring sketch follows below)
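As a rough way to keep an eye on GPU memory while the pool is running, something like the following can be polled from the parent process. Note that under MPS, client allocations may be attributed to the MPS server process rather than each worker, so treat the per-process numbers as approximate.

import subprocess

def report_gpu_memory():
    """Print total GPU memory in use and the per-process breakdown."""
    used = subprocess.check_output([
        'nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader'
    ]).decode().strip()
    per_process = subprocess.check_output([
        'nvidia-smi', '--query-compute-apps=pid,used_memory', '--format=csv,noheader'
    ]).decode().strip()
    print(f"GPU memory used: {used}")
    print(f"Per-process usage:\n{per_process}")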

Example Slurm Script for HPC Environments

This is the most important part! The same steps apply when running on a personal machine, but here's a sample Slurm script for setting up CUDA MPS on an HPC cluster.
Please note that the number of CPUs you request should be at least as large as the number of processes you wish to spawn.

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=16

# Set up CUDA MPS
export CUDA_MPS_PIPE_DIRECTORY=${TMPDIR}/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=${TMPDIR}/nvidia-log
mkdir -p ${CUDA_MPS_PIPE_DIRECTORY}
mkdir -p ${CUDA_MPS_LOG_DIRECTORY}

# Start MPS daemon
nvidia-cuda-mps-control -d

# Run your Python script
python your_simulation_script.py

# Clean up MPS
echo quit | nvidia-cuda-mps-control

Best Practices and Considerations


1. Process Creation:
– Use the ‘spawn’ method for multiprocessing to ensure clean CUDA states
– Avoid fork() on Unix systems when using CUDA
– Explicitly delete the OpenMM simulation's CUDA context when each worker finishes!


2. Error Handling:
– Implement robust error handling for GPU operations
– Monitor and log GPU errors
– Handle process failures gracefully


3. Resource Management:
– Monitor GPU memory usage
– Implement proper cleanup procedures
– Use context managers when possible (see the sketch after this list)


4. Performance Optimization:
– Adjust the number of concurrent processes based on system capabilities
– Monitor DRAM utilization!
– Balance CPU and GPU workloads
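Putting the cleanup and context-manager advice together, here is one possible shape for a context manager around the worker's simulation. This is a sketch only; it assumes the build_system_with_platform helper introduced earlier.

from contextlib import contextmanager

@contextmanager
def managed_simulation(prmtop_file, platform, properties):
    """Yield an OpenMM Simulation and guarantee its CUDA context is released."""
    simulation = build_system_with_platform(prmtop_file, platform, properties)
    try:
        yield simulation
    finally:
        # Explicitly drop the CUDA context so MPS resources are freed promptly
        del simulation.context
        del simulation

# Usage inside simulation_worker:
# with managed_simulation(task.prmtop_file, platform, properties) as simulation:
#     run_simulation(simulation, task)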

Conclusion


CUDA MPS provides an easy way to maximise use of modern NVIDIA GPUs by allowing a card to be shared efficiently among multiple processes. When properly combined with Python multiprocessing and OpenMM, it can lead to significant performance improvements, and the approach is not limited to molecular dynamics!

The key to success is careful implementation of process management, proper CUDA initialization, and robust error handling. While the setup requires some additional complexity, the performance benefits make it very worthwhile for many molecular dynamics applications.

Remember to always monitor system resources and adjust parameters based on your specific hardware and simulation requirements, specifically GPU and CPU memory. With proper implementation, CUDA MPS can significantly improve the throughput of workflows by maximising usage of GPU resources.

References


– NVIDIA CUDA MPS documentation and the NVIDIA GROMACS benchmark example
– OpenMM documentation and the Python multiprocessing documentation
– Claude was used to help design the context manager

Author