On the western side of Oxfordshire are the Cotswolds, a pleasant hill range with a curious etymology: the hills of the goddess Cuda (maybe, see footnote). Cuda is a powerful yet wrathful goddess, and staying on her good side does feel like druidry. The first druidic test is getting software to work: the wild magic makes the rules of this test change continually. Therefore, I am writing a summary of what works as of late 2023.
Expired documentation
When searching online documentation for Cuda-enabled software, check the publication date. It is a rapidly changing field.
ChatGPT gives outdated information.
The top example of this is pip install tensorflow-gpu, which is already included with pip install tensorflow>=2.12. In fact, the major change is a recent trend to include everything in the pip installer, CUDA toolkit included.
Check what the newest official documentation says first, then when stuff is failing, check newer blogs, then older blogs.
The Nvidia documentation can be tricky to navigate. Say I am “interested” in learning more about lifecycle cadence. Google gives me https://docs.nvidia.com/datacenter/tesla/drivers/#lifecycle. I notice the Tesla in the URL and try ampere or hopper instead: 404s. I look at the top right for the date: June 2023. So the page is current; it is just a misleading URL to confuse novice druids.
Drivers
This blog post discusses software that uses the Cuda toolkit, not driver installation. But a few maxims need airing.
As with everything in software, do not install the latest version; install the version that will have the longest lifetime.
Parenthetically, always double-check compatibilities when installing drivers/Cuda (the high druid council of Cuda calls it a “support matrix”). Ignoring the complication of the video driver, an Nvidia GPU requires a driver, which has a three-digit release number, e.g. R520, and the Cuda platform, which has a version and subversion (currently 12.3), is paired with the driver, e.g. R520 goes with Cuda 11.8. Traditionally there was a ±3 release/subversion wiggle room: this lifecycle cadence gets quoted dogmatically. As far as I can tell it is no longer really true, and it is more fruitful to always double-check.
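To make that double-checking quick, here is a minimal sketch of my own (not an Nvidia recipe) that assumes nvidia-smi and nvcc are on the PATH and simply prints the two numbers to compare against the support matrix:

```python
# Minimal sketch: print the driver release and the Cuda toolkit version, to be compared
# against the support matrix. Assumes nvidia-smi and nvcc are on the PATH.
import subprocess

driver = subprocess.run(
    ['nvidia-smi', '--query-gpu=driver_version', '--format=csv,noheader'],
    capture_output=True, text=True,
).stdout.strip()
print('Driver release:', driver)   # e.g. 520.61.05, i.e. an R520 driver

nvcc = subprocess.run(['nvcc', '--version'], capture_output=True, text=True).stdout
print([line for line in nvcc.splitlines() if 'release' in line])   # e.g. release 11.8, V11.8.89
```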
About the lifetime. A driver Release has a Branch Designation. These can be Long Term Support Branch, New Feature Branch, Evaluation/Developer Branch and Production Branch. Do check what your chosen branch designation is.
This sounds obvious, but it does happen: unfortunately for me, on a cluster I use, the sysadmin had to use an image created by a central sysadmin who had chosen a production branch, which was then reclassed as a development branch, which means Cuda issues galore. I am hoping this will go away when CentOS 7 is buried. Hoping.
Cuda Compat
At the end of 2022, Nvidia introduced Cuda 12.0 and with it the cuda-compat package. This has its own support matrix: https://docs.nvidia.com/deploy/cuda-compatibility/#use-the-right-compat-package.
This (conda install conda-forge::cuda-compat) might be the trick you need.
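To see whether a compat library is actually being picked up, a sketch along the following lines (assuming Linux and that libcuda.so.1 is visible to the loader) asks the driver API which version it exposes; with cuda-compat in play it should be the newer one:

```python
# Minimal sketch: load libcuda and ask which driver API version it exposes.
# cuda-compat works by shipping a newer user-space libcuda, so if the compat path is
# being picked up, the version reported here should be the newer one.
import ctypes

libcuda = ctypes.CDLL('libcuda.so.1')   # OSError here means no driver library is visible at all
version = ctypes.c_int()
libcuda.cuDriverGetVersion(ctypes.byref(version))
print(f'Driver API version: {version.value // 1000}.{(version.value % 1000) // 10}')   # e.g. 11080 -> 11.8
```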
Clean-up in aisle five
Often when you install something you get an incompatible version issue. This is generally caused by conda/pip packages that were made lazily and simply use a frozen dependency list: numpy, OpenSSL and six are three common packages that end up yo-yoing in versions and causing damage. In the case of OpenSSL one has to rm -rf the package, as it will have broken pip, while six, the Python 2-to-3 compatibility module, indicates the offending code is proper ancient. Conda has a handy --freeze-installed flag to stop this; pip does not. With Cuda-dependent installations, some developers are lazy or want to make your life easier by installing the Cuda toolkit (and CuDNN) for you. This will undoubtedly cause issues.
For example, say you conda installed your ideal version of the Cuda toolkit, and then a conda package for a piece of software you want to use. Pytorch then tells you that you have incompatible driver and CuDNN versions…
RuntimeError: cuDNN version incompatibility: PyTorch was compiled against (8, 4, 4) but found runtime version (8, 0, 5). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN. one possibility is that there is a conflicting cuDNN in LD_LIBRARY_PATH.
The $LD_LIBRARY_PATH option actually works in PyTorch and openMM, but not with tensorflow or jax, unless you import pytorch first and then tensorflow —I will admit I used this horrible trick a few times.
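For completeness, the trick looks like this (a sketch, not an endorsement; the presumed mechanism is that torch drags its bundled Cuda/CuDNN shared libraries into the process, and tensorflow then reuses them):

```python
# The horrible trick, sketched: import torch first (presumably so its bundled CUDA/cuDNN
# shared libraries get loaded into the process), then import tensorflow.
import torch           # must come first
import tensorflow as tf

print('torch sees CUDA:', torch.cuda.is_available())
print('tensorflow sees GPUs:', tf.config.list_physical_devices('GPU'))
```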
The first step is to establish what is going wrong. OpenMM works if pytorch works, so I have used that as my verbose proxy.
```bash
# What cudatoolkit is installed?
conda list cudatoolkit
# Nvidia version?
conda list cuda-toolkit
# CuDNN?
conda list cudnn
# Where are the .so files?
find $CONDA_PREFIX -name "libdevice.*"
```
The latter library is a key Cuda toolkit library, which basically allows code to be compiled to GPU bytecode: there should really be only one copy, as the CuDNN etc. issues are due to different versions being found.
Here is what I got from a nasty installation:
```
$CONDA_PREFIX/lib/python3.10/site-packages/triton/third_party/cuda/lib/libdevice.10.bc
$CONDA_PREFIX/lib/python3.10/site-packages/jaxlib/cuda/nvvm/libdevice/libdevice.10.bc
$CONDA_PREFIX/lib/libdevice.10.bc
$CONDA_PREFIX/nvvm/libdevice/libdevice.10.bc
```
Triton is what pytorch uses, and the last two are installed by Conda and by Pip.
There are three ways of dealing with this:
- the LD_LIBRARY_PATH way,
- the sledgehammer way,
- the arduous way.
Sledgehammer
The most common suggestion is to remove the troublesome environment and start again. I agree this is elegant and keeps the installation instructions replicable, but I am meant to be doing science, not installing software elegantly. Heck, I want to write about chemistry and not Cuda.
Simply take note of the libdevice versions and remove the relevant parent folder. Don’t just rename it (say to nvvm-bk) as that will not fool anyone.
In the above example, the folder nvvm appears twice: NVIDIA Virtual Machine is a bridge of sorts to the GPU, and in the nvvm folder there will likely be a nvvm/bin/cicc, which is an internal low-level compiler, not to be confused with the high-level (= normal usage) one, nvcc, which is commonly used to see the installed version of the Cuda Toolkit via nvcc --version (the apt/dnf installation will be in /usr/local/cuda/bin/nvcc). The other file of note in the folder, nvvm/lib64/libdevice.so, is more low-level still and I do not believe it has caused me incompatibility issues… but to be safe the whole repeated folder can go.
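Before deleting anything, a small sketch like this lists the candidates without touching them:

```python
# Minimal sketch: list every libdevice copy under the active conda env, so you can decide
# which duplicated parent folder to remove. Nothing is deleted automatically.
import os
from pathlib import Path

prefix = Path(os.environ['CONDA_PREFIX'])
copies = sorted(prefix.rglob('libdevice*'))
for path in copies:
    print(path)
if len(copies) > 1:
    print('More than one libdevice found: consider removing the redundant parent folder(s).')
```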
Environment variables
The above is a bit brutal, as one cannot go back. The $LD_LIBRARY_PATH environment variable most likely has, or ought to have, the following colon-separated paths:
- $CONDA_PREFIX/lib
- $CONDA_PREFIX
- /usr/local/cuda/compat —discussed above
- /usr/local/nvidia/lib
- /usr/local/nvidia/lib64
- /.singularity.d/libs —if you are in a singularity container with the --nv flag
The virtual directory /proc/driver/nvidia holds the drivers, not shared library files. The same obviously goes for /dev/nvidia0 and friends.
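As a quick sanity check that the driver side is present at all (a sketch; if these files are missing, no amount of conda-ing will help):

```python
# Minimal sketch: confirm the driver's virtual files are visible from where you are running.
from pathlib import Path

version_file = Path('/proc/driver/nvidia/version')
print(version_file.read_text() if version_file.exists() else 'no /proc/driver/nvidia')
print('device nodes:', sorted(str(p) for p in Path('/dev').glob('nvidia*')))
```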
Prepending a new location will give it priority. Therefore, add $CONDA_PREFIX/lib or a specific folder in /lib/python👾.👾/site-packages and try out the offending module.
For most cases, this will fix all issues:
```bash
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$CONDA_PREFIX
# or more permanently:
conda env config vars set LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$CONDA_PREFIX
```
Note that setting LD_LIBRARY_PATH after a module has been imported will not do anything, as the library is already loaded. So if this is done in, say, a Jupyter notebook, do it first (also, what happens in a Jupyter inline magic (!export FOO=foo) is not kept, as it is not the same process, hence the tedious os.environ).
```python
import os, sys

os.environ['LD_LIBRARY_PATH'] = ':'.join([
    os.environ['CONDA_PREFIX'] + f'/lib/python{sys.version_info.major}.{sys.version_info.minor}/site-packages',
    os.environ['CONDA_PREFIX'] + '/lib',
])
import torch

assert torch.cuda.is_available()
print(torch.cuda.device_count(), torch.cuda.get_device_name(0))
print(f'Using CuDNN: {torch.backends.cudnn.enabled} ({torch.backends.cudnn.version()})')
device = torch.device("cuda")
# Create a random tensor and transfer it to the GPU
x = torch.rand(5, 3).to(device)
print("A random tensor:", x)
y = x * x
print("After calculation:", y)
print("Calculated on:", y.device)
```
For tensorflow this environment variable may be ignored, so also set $CUDA_HOME or add to $XLA_FLAGS an entry --xla_gpu_cuda_data_dir=👾👾👾.
```python
import os, sys

os.environ['LD_LIBRARY_PATH'] = ':'.join([
    os.environ['CONDA_PREFIX'] + f'/lib/python{sys.version_info.major}.{sys.version_info.minor}/site-packages',
    os.environ['CONDA_PREFIX'] + '/lib',
])
import tensorflow as tf

print(sys.version_info)
print(tf.config.list_physical_devices('GPU'))
print('CUDA build:', tf.test.is_built_with_cuda())
print("CUDA version:", tf.sysconfig.get_build_info()["cuda_version"])
print("cuDNN version:", tf.sysconfig.get_build_info()["cudnn_version"])
print("CUDA library paths:", tf.sysconfig.get_lib())
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
c = tf.matmul(a, b)
print(c)
```
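A hedged variant of the same test, setting those variables in-process before the import; I am assuming here that the folder containing nvvm/libdevice is $CONDA_PREFIX, as in the find output above, so adjust to wherever yours lives:

```python
# Hedged sketch: point XLA straight at the folder that contains nvvm/libdevice.
# The path is an assumption ($CONDA_PREFIX, as in the find output above); it must be set
# before tensorflow is imported or compiles anything.
import os

cuda_dir = os.environ['CONDA_PREFIX']
os.environ['CUDA_HOME'] = cuda_dir
os.environ['XLA_FLAGS'] = (os.environ.get('XLA_FLAGS', '') + f' --xla_gpu_cuda_data_dir={cuda_dir}').strip()

import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))
```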
For Jax it can be $CUDA_PATH.
```python
from jax.lib import xla_bridge
assert xla_bridge.get_backend().platform != 'cpu'

import jax.numpy as jnp
from jax import random

key = random.PRNGKey(0)
x = random.normal(key, (5000, 5000), dtype=jnp.float32)
print(jnp.dot(x, x.T))
```
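And the equivalent hedged variant for jax, again assuming the toolkit lives under $CONDA_PREFIX:

```python
# Hedged sketch: set CUDA_PATH before importing jax; some installs consult it instead of
# (or as well as) LD_LIBRARY_PATH. The conda prefix is an assumption, adjust as needed.
import os
os.environ.setdefault('CUDA_PATH', os.environ['CONDA_PREFIX'])

import jax
print(jax.devices())   # should list GPU devices, not just CpuDevice
```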
For OpenMM, if torch is happy so is OpenMM. To test the latter:
```python
import openmm as mm
import openmm.app as mma   # assumed alias for openmm.app, needed for Simulation below

plat = mm.Platform.getPlatformByName('CUDA')
print(plat.getOpenMMVersion())
# this will fail with no failures:
assert plat.supportsKernels('CUDA'), f'failures: {plat.getPluginLoadFailures()}'
# ditto:
mm.Platform.findPlatform('CUDA')  # No Platform supports all the requested kernels
# however... this will show Reference, CPU and CUDA as valid?!
print([mm.Platform.getPlatform(index).getName() for index in range(mm.Platform.getNumPlatforms())])
# everything is fine:
print(mm.version.openmm_library_path)
print(mm.pluginLoadedLibNames)
# so to test one has to make an OpenMM simulation object ...
simulation = mma.Simulation(modeller.topology, system, integrator)  # modeller/system/integrator from your own setup
platform: mm.Platform = simulation.context.getPlatform()
assert platform.getName() == 'CUDA', platform.getPluginLoadFailures()
```
Once a solution is found, make sure to store the environment variable for next time the conda environment is activated via conda env config vars set LD_LIBRARY_PATH=👾👾👾
Arduous way
As mentioned, the issue is the dependencies of certain packages. These can be inspected on the terminal via conda search (with --info) or online at https://anaconda.org/ by searching for the name, picking one by paying attention to the version, then files and the green ℹ️ symbol. To install a package without dependencies use 👾👾👾 --no-deps.
There is an environment variable for conda, $CONDA_OVERRIDE_CUDA, which can be used to specify the Cuda version, e.g. 11.8. This will have no effect on, say, jax installed by pip.
Singularity
A different problem entirely is Singularity. There is a flag, --nv, which adds the relevant virtual folders. It will however not add /usr/bin/nvidia-smi; you can bind this or copy the binary over into the container, assuming it has later kernel drivers. The files in /usr/local/cuda are from the Cuda toolkit, so it is fine if that is missing, as Conda can add them.
Walkthrough
Say I want to install a package which has a conda yaml provided —yay, how nice!
```bash
# common fluff
unset LD_LIBRARY_PATH
unset CUDA_HOME
unset CUDA_DIR
unset XLA_FLAGS
# install
conda env create -f 👾👾.yml
conda activate 👾👾
python 👾👾.py
```
Worst case scenario, this fails to install because of the python version (hello PyMOL…):
```bash
unset LD_LIBRARY_PATH
unset CUDA_HOME
unset CUDA_DIR
unset XLA_FLAGS
export CONDA_PREFIX
conda create --name RFdiffusion -y python=3.👾
conda activate RFdiffusion
conda env config vars set LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$CONDA_PREFIX:👾👾
# 👾👾 = either /.singularity.d/libs or /usr/local/cuda/compat:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
conda env config vars set PYTHONUSERBASE=$CONDA_PREFIX  # this stops pip installs landing outside the PATH (a yellow warning you sometimes get)
conda deactivate
conda activate 👾👾
conda env update --name 👾👾 --file 👾👾
```
Say it installed fine but pytorch is unhappy.
```bash
find $CONDA_PREFIX -name "libdevice.*"
find $CONDA_PREFIX -name "libnvidia*"
conda list cudatoolkit
conda list cuda-toolkit
conda list cudnn
```
Say there are multiple versions of libdevice. Anecdotally, the lib/nvidia version is the problem, so I delete the whole folder. Same with Jax’s nvidia folder.
The latter commands will give version numbers; the guilty parties will be obvious.
However, conda uninstallation does not accept --no-deps, so, say for openMM, I reinstall the Cuda Toolkit in the correct version that does not glitch with my drivers or with CuDNN etc. Generally, just to add to the mix, I run:
```bash
export CONDA_CHANNELS="nvidia/label/cuda-11.8.0"
conda install -y nvidia/label/cuda-11.8.0::cuda
conda install -y nvidia/label/cuda-11.8.0::cuda-toolkit
conda install -y nvidia/label/cuda-11.8.0::cuda-nvrtc
conda install -y nvidia/label/cuda-11.8.0::libcufile
conda install -y nvidia/label/cuda-11.8.0::cuda-tools
# these will be in `$CONDA_PREFIX/lib` so make sure this is set:
conda env config vars set LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$CONDA_PREFIX:👾👾
conda deactivate
conda activate 👾👾
```
A lot of those are redundant, but you might prefer a different version (e.g. 11.6.2 is nice).
Conclusion
I am sorry you are having Cuda issues. I hope this helped.
Footnote
A ‘wold’ is a forested hill (cf. ‘Wald’ in German, forest), while Cot’s is a name and might come from Cuda, a Brittonic mother goddess, equivalent to Danu in Gaelic mythology, the other branch of Celtic culture.