Suppose we need to do some interactive analysis in a Jupyter notebook, but our local machine lacks the power. We have access to a Slurm cluster, but we can’t SSH from the head node to the worker node; we can only SSH from the worker node to the head node. Can we still interact with a Jupyter notebook running on the worker node? As it happens, the answer is “yes”: we just need to do some reverse SSH tunnelling.
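For concreteness, here is a minimal sketch of the tunnelling setup, assuming the notebook listens on port 8888 and that user@headnode is your head-node login (both placeholders; adjust to your cluster):

# On the worker node, start Jupyter without opening a browser:
jupyter notebook --no-browser --port=8888

# Also on the worker node, open a reverse tunnel so that port 8888 on the
# head node forwards to port 8888 on the worker:
ssh -N -R 8888:localhost:8888 user@headnode

# On your local machine, forward a local port to the head node's end of the tunnel:
ssh -N -L 8888:localhost:8888 user@headnode
# Now point your local browser at http://localhost:8888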
Naga101: A Guide to Getting Started with (OPIG) Slurm Servers
Over the past few months, I’ve been working with a few new members of OPIG, which has left me answering (and asking) lots of questions about working with Slurm. In this blog post, I will try to cover the key, practical basics of interacting with servers that are set up on Slurm.
Slurm is a workload manager, or job scheduler, for Linux: it allocates resources (e.g. CPUs and GPUs) on a server to users’ jobs.
Note that all of the commands and files shown here are run from a so-called ‘head’ node, from which you access the Slurm servers.
1. Entering an interactive session
Unlike many other servers, you cannot access a Slurm server via ‘ssh’. Instead, you enter an interactive (or ‘debug’) session, which in OPIG is limited to 30 minutes, via the srun command. This is incredibly useful for copying files, setting up environments and checking that your code runs.
srun -p servername-debug --pty --nodes=1 --ntasks-per-node=1 -t 00:30:00 --wait=0 /bin/bash
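Here, -p selects the partition (the server’s debug queue), --pty attaches a pseudo-terminal so the session is interactive, -t 00:30:00 sets the 30-minute time limit, and /bin/bash is the command that srun launches for you.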
2. Submitting jobs
While the srun command is easy and helpful, many of the jobs we want to run on a server will take longer than the debug queue’s time limit. Via sbatch, you can submit a job that can run for longer (although typically still capped) but is not interactive.
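As a minimal sketch, an sbatch script might look like the following (the partition name, time limit and run_analysis.py script are placeholders; substitute your own):

#!/bin/bash
#SBATCH --partition=servername
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=12:00:00
#SBATCH --job-name=my_job
#SBATCH --output=my_job_%j.out

# The actual work the job should do
python run_analysis.py

Save this as, say, my_job.sh and submit it with ‘sbatch my_job.sh’; you can then monitor it with squeue and cancel it with scancel <jobid>.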
Using SLURM a little bit more efficiently
Has your research group slurmified their servers? You basically have two options now.
Either you install everything you need on one of the Slurm nodes within an interactive session, e.g.:
srun -p funkyserver-debug --pty --nodes=1 --ntasks-per-node=1 -t 00:10:00 --wait=0 /bin/bash
and always specify this node by adding the line ‘#SBATCH --nodelist=funkyserver.cpu.do.work’ to your sbatch scripts, or you set up some template scripts that help you install all your requirements on multiple nodes, so you can enjoy the full benefits of the Slurm system.
Here is how I did it; comments and suggestions welcome!
Step 1: Create an sbatch template file (e.g. sbatch_job_on_server.template_sh) on the submission node that does what you want. In the ‘#SBATCH --partition’ or ‘--nodelist’ lines, use a placeholder, e.g. ‘<server>’, instead of funkyserver.
For example, for installing the same conda environment on all nodes that you want to work on:
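A minimal sketch of such a template, assuming each node has conda available and that your requirements live in an environment.yml file (both assumptions; adapt to your setup):

#!/bin/bash
#SBATCH --partition=<server>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=01:00:00
#SBATCH --job-name=setup_env
#SBATCH --output=setup_env_%j.out

# Create the conda environment on whichever node this job lands on,
# or update it if it already exists
conda env create -f environment.yml || conda env update -f environment.yml

From there, a small loop on the submission node can stamp out one script per server and submit each, e.g. (the server names are placeholders):

for server in funkyserver otherserver; do
    sed "s/<server>/$server/g" sbatch_job_on_server.template_sh > sbatch_job_on_$server.sh
    sbatch sbatch_job_on_$server.sh
done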