If you are running lots of little jobs in SLURM and want to make use of free nodes that suddenly become available, it is helpful to have a way of rapidly shipping your environments that does not rely on installing conda or rebuilding the environment from scratch every time. This is especially useful for complex environments, where exported .yml files do not always rebuild as expected, even when exact versions and source locations are specified.
In these situations a tool such as conda-pack becomes incredibly useful. Once you have perfected the house of cards that is your conda environment, you can use conda-pack to save that exact state as a tar.gz file.
conda-pack -n my_precious_env -o my_precious_env.tar.gz
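As an aside, conda-pack is not part of a default conda install; a minimal sketch of getting hold of it, assuming you want it in your base environment (it is also available on PyPI):

# install conda-pack into the currently active (e.g. base) environment
conda install -c conda-forge conda-pack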
This can provide you with a backup to be used when you accidentally delete conda from your system, or if you irreparably corrupt the environment and cannot roll back to a point in time when everything worked. These tar.gz files can also be copied to distant locations with rsync or scp, then unpacked, sourced and used without installing conda…
# SBATCH Script
cd my_location_on_slurm_node
rsync -rav leobloom@pegasus:/data/bloom/my_envs_archive/my_precious_env.tar.gz .
mkdir my_env
tar -xzf my_precious_env.tar.gz -C my_env
source my_env/bin/activate
# run conda-unpack to ensure that hard coded paths etc do not cause problems
conda-unpack
# Ready to run code here
python go.py
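The snippet above leaves out the scheduler directives that would sit at the top of a real submission script. A minimal sketch of such a header, with a placeholder job name and illustrative resource requests that you would adjust for your own cluster:

#!/bin/bash
#SBATCH --job-name=unpack_env        # placeholder job name
#SBATCH --ntasks=1                   # a single task for this example
#SBATCH --cpus-per-task=4            # illustrative core request, match your workload
#SBATCH --time=02:00:00              # illustrative wall-time limit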
Now this workflow has worked really well for me on many occasions, particularly with some complex packages that just would not build from the .yml file (builds that hit a timeout or return a conflict error the solver cannot resolve). These troublesome environments were likely undone by too many interdependent libraries that seem to be in a constant state of change.
However, when it comes to libraries that parallelise or are otherwise complex, best practice is always to rebuild from scratch on the machine where the code will be run. I found this out the hard way…
For months I was copying my working environment over to various locations and running my code. I repeatedly encountered the issue of some tasks inexplicably hanging, which at the time I attributed to the complexity of the code and the way I was running many jobs in parallel. Also, most of the runtime estimates I had been given for this code were based on using a GPU, the highly in-demand resource I was deliberately avoiding. Instead, I maximised the occupation of many less favoured CPU resources and used a brute-force approach to kill anything that inexplicably hung. In the end I got to where I needed to be…
Some months later, after I had been issued with a nice new desktop with 12 cores and a decent processor, I began running some smaller jobs on this system. These were of the same ilk, but ran sequentially rather than in parallel.
However, something was different. No hanging, and what took 30 seconds to 2 minutes on SLURM was taking 5 seconds on my desktop… Furthermore, all available cores were being maxed out, without the dips in occupancy I had observed when I first tried to run on SLURM in a sequential fashion (with lots of cores per single task).
Now, I had identified this problem some months earlier and simply solved it with a one-task-per-core approach on SLURM. This meant I could essentially occupy all cores with many tasks, without worrying about why part of each task failed to parallelise when given the option. However, the intermittent hanging issue remained.
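For illustration, a one-task-per-core setup can be expressed as a SLURM job array of single-core tasks; the array size, environment name and the --chunk flag on go.py below are purely hypothetical:

#!/bin/bash
#SBATCH --job-name=one_per_core      # placeholder job name
#SBATCH --array=0-95                 # illustrative: 96 independent single-core tasks
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1            # one core per task, no reliance on in-task parallelism

source my_env/bin/activate           # environment unpacked beforehand, as above
python go.py --chunk ${SLURM_ARRAY_TASK_ID}   # hypothetical flag to select this task's share of the work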
After a lot of print-statement debugging and asking others for advice, the source of the problem was identified as OpenMM. Finally, after giving up and walking home in the English summer rain, a much cleverer person than myself asked me whether I was sure that my conda environments were exactly the same…
To which I replied, “Ha!”
“My dearest fellow. My conda environments could not be more the same! For I use conda-pack! The very purpose of which is to allay such fears, and head off lingering doubts about environmental differences uttered by the likes of you, Sir! Have some of that my son and sit down!”
And you can probably imagine where things went after that…
Cue a rebuild from scratch on the chosen SLURM node. Total core occupancy. Five seconds or less per job. No hangs in a 48-hour period. Problem identified and definitively solved. Finish things off with an evening staring into the abyss, wondering what might have been, how much compute time I had wasted and why I had only discovered this now.
So please, do not make my mistake. If you are using tools such as OpenMM, and possibly PyTorch (although this has not caused me any issues), and there is any doubt about whether they need to be built on the hardware on which they will run, then please code the conda build into a SLURM script and prepare some performance benchmarks, such as unit tests and timed runs. Do your checks before you launch your jobs, then sleep soundly knowing your code is running to its true potential.
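A minimal sketch of what that might look like, assuming an exported environment.yml, a hypothetical benchmark.py for the timed run, and placeholder resource requests:

#!/bin/bash
#SBATCH --job-name=env_rebuild_check   # placeholder job name
#SBATCH --cpus-per-task=12             # illustrative core count
#SBATCH --time=04:00:00                # illustrative wall-time limit

# rebuild the environment from scratch on the node that will run the real jobs
conda env create -f environment.yml -n my_env_rebuilt

# make 'conda activate' available inside a batch shell, then activate
eval "$(conda shell.bash hook)"
conda activate my_env_rebuilt

# sanity checks before committing to long runs (tests/ and benchmark.py are assumed to exist)
python -m pytest tests/
time python benchmark.py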
Happy coding!