According to GitHub’s 2024 Octoverse report, Python has officially become the world’s most popular programming language, surpassing JavaScript. Due to this exceptional popularity, NVIDIA announced Python support for its CUDA toolkit at last year’s GTC conference, marking a major leap in the accessibility of GPU computing. With the latest update (https://nvidia.github.io/cuda-python/latest/), developers can, for the first time, write Python code that runs directly on NVIDIA GPUs without the need for intermediate C or C++ code.
Historically tied to C and C++, CUDA has found its way into Python code through third-party wrappers and libraries. Now, the arrival of native support means a smoother, more intuitive experience.
This paradigm shift opens the door for millions of Python programmers – including our scientific community – to build powerful AI and scientific tools without having to switch languages or learn legacy syntax.
Bringing Python natively into the CUDA ecosystem wasn’t just a matter of wrapping GPU kernels in Python syntax. NVIDIA took a full-stack approach – rethinking everything from the compiler layer to high-level libraries – to make CUDA feel like a natural extension of Python.
The CUDA toolkit includes a wide range of components: compilers, host runtimes, SDKs, libraries, developer tools, and pre-packaged algorithms. In other words, to make CUDA truly Pythonic, NVIDIA had to touch every part of this stack.
This means developers can now write a kernel, integrate it into PyTorch or NumPy workflows, and seamlessly call Pythonic libraries, all from within Python, without switching contexts or invoking external compilers. To accomplish this, NVIDIA relies on just-in-time (JIT) compilation: there is no traditional compile step, and everything happens in-process, reducing dependencies and simplifying the development flow.
At the heart of the redesign is CUDA Core, which preserves Python’s execution flow while incorporating JIT compilation under the hood, allowing developers to stay fully in-process with their code – no command-line compilers or context switching required.
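To make this concrete, here is a minimal sketch of in-process JIT compilation, modeled on the examples in the cuda.core documentation. Note that the package lives in an explicitly experimental namespace (cuda.core.experimental), so the exact signatures of Program.compile and launch may differ between releases; the kernel, sizes, and launch parameters below are purely illustrative assumptions.

import cupy as cp
from cuda.core.experimental import Device, LaunchConfig, Program, launch

# A simple SAXPY kernel, written as CUDA C++ source and compiled at runtime.
code = """
extern "C" __global__
void saxpy(const float a, const float* x, const float* y, float* out, size_t n) {
    const size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) out[tid] = a * x[tid] + y[tid];
}
"""

dev = Device()
dev.set_current()
stream = dev.create_stream()

# Compile the source in-process: no nvcc and no command-line step involved.
arch = "sm_" + "".join(str(i) for i in dev.compute_capability)
program = Program(code, code_type="c++")
module = program.compile("cubin", options=(f"-arch={arch}",))
kernel = module.get_kernel("saxpy")

# Prepare inputs with CuPy and launch the JIT-compiled kernel.
n = 1 << 20
a = cp.float32(2.0)
x = cp.random.rand(n, dtype=cp.float32)
y = cp.random.rand(n, dtype=cp.float32)
out = cp.empty_like(x)

block = 256
grid = (n + block - 1) // block
config = LaunchConfig(grid=grid, block=block, stream=stream)
launch(kernel, config, a, x.data.ptr, y.data.ptr, out.data.ptr, cp.uint64(n))
stream.sync()

The entire compile-and-launch cycle happens inside the Python process: the CUDA C++ source is compiled at runtime and the resulting kernel is launched on CuPy arrays without ever leaving Python.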
A new programming model, the CuTile interface, is being developed for Pythonic CUDA and is slated for release later this year. In the meantime, NVIDIA has also started building basic Python bindings, including a runtime compiler and a set of core libraries such as cuPyNumeric – a drop-in replacement for NumPy. With a single import change, existing NumPy code can run on a GPU instead of a CPU.
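As a sketch of that single import change (assuming the cupynumeric package name used in NVIDIA’s documentation; the array sizes are chosen purely for illustration):

import cupynumeric as np  # the only change: this line was "import numpy as np"

# Identical NumPy code, now executed on the GPU by cuPyNumeric.
a = np.random.rand(2048, 2048)
b = np.random.rand(2048, 2048)
c = np.matmul(a, b)
print(c.mean())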
A key component that demonstrates the power of this shift is nvmath-python (https://github.com/NVIDIA/nvmath-python), a newly released library that provides unified high-level interfaces for both host-side and device-side computations. By allowing library calls to be fused together, it delivers significant performance gains while keeping the code readable and Pythonic. More details can be found on the main documentation page, but here is a short example showing how easy it is to use:
import cupy as cp  # CUDA-accelerated NumPy/SciPy-compatible library
import nvmath  # new native Python CUDA algebra implementation

# Prepare sample input data.
n, m, k = 123, 456, 789
a = cp.random.rand(n, k)
b = cp.random.rand(k, m)

# Perform the multiplication.
result = nvmath.linalg.advanced.matmul(a, b)

# Synchronize the default stream, since by default GPU execution is non-blocking.
cp.cuda.get_current_stream().synchronize()

# Print the result and check that it is a CuPy array as well.
print(f"multiplication result: {result}")
print(f"Inputs types: {type(a)} and {type(b)} and result type: {type(result)}.")
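The fused library calls mentioned above can be sketched with the epilog mechanism described in the nvmath-python documentation, which attaches an extra operation – here a ReLU – directly to the matmul so that both run as a single fused GPU operation. The enum name MatmulEpilog.RELU follows the current documentation and may evolve between releases:

import cupy as cp
import nvmath

a = cp.random.rand(123, 789).astype(cp.float32)
b = cp.random.rand(789, 456).astype(cp.float32)

# Fuse a ReLU activation into the matmul itself: one fused GPU operation
# instead of a matmul kernel followed by a separate elementwise pass.
result = nvmath.linalg.advanced.matmul(
    a, b, epilog=nvmath.linalg.advanced.MatmulEpilog.RELU
)
cp.cuda.get_current_stream().synchronize()
print(f"fused matmul+ReLU result type: {type(result)}")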
Another advantage of nvmath-python is its flexibility: it works not only with GPU array libraries but also with libraries designed to run on the CPU, and it always returns a result of the same type as its inputs. If the inputs are NumPy arrays, the result will be a NumPy array as well, as in the following example:
import numpy as np
import nvmath
# Prepare sample input data.
m, n, k = 123, 456, 789
a = np.random.rand(m, k)
b = np.random.rand(k, n)
# Perform the multiplication.
result = nvmath.linalg.advanced.matmul(a, b)
print(f"multiplication result: {result}")
print(f"Inputs types: {type(a)} and {type(b)} and result type: {type(result)}.")
More information about nvmath-python can be found on GitHub and in its official documentation: https://docs.nvidia.com/cuda/nvmath-python/
To conclude, the new native Pythonic implementation will make it easy to bring CUDA into protein models and to create new architectures for biochemistry applications. By embracing Python from the ground up, NVIDIA is opening the door to a new and easier generation of GPU-powered innovation. The future of GPU computing isn’t just fast, it’s Pythonic, and frankly, I can’t wait to use it!