Using Jupyter on a remote slurm cluster

Suppose we need to do some interactive analysis in a Jupyter notebook, but our local machine lacks the power. We have access to a slurm cluster, but we can’t SSH from the head node to the worker node; we can only SSH from the worker node to the head node. Can we still interact with a Jupyter notebook running on the worker node? As it happens, the answer is “yes” – we just need to do some reverse SSH tunnelling.

A long, long time ago, I wrote a blog post about connecting to a Jupyter session running remotely. Conveniently, this all still applies, so we’re most of the way there. The difference we have now is that we might have both a gateway server and a head node to jump through, and a worker node that we can only SSH from and not to. We also need to write a script telling sbatch how to run everything for us as the sysadmins likely have sensible limits on the resources available for interactive sessions.

There are three problems we need to solve to start working with Jupyter in such a setup:

  1. SSH port forwarding through multiple hosts
  2. SSH port forwarding from the slurm worker node
  3. Writing an sbatch script to start the Jupyter server and set up the port forwarding without manual intervention

Port forwarding through multiple hosts

This is the simplest part. If your Jupyter session is running on port 8888 on a remote host that you can directly SSH to, you would use a single SSH tunnel:

ssh -N -f -L localhost:8888:localhost:8888 username@remotehost
  • -N tells SSH not to run any remote processes. This is useful when all we’re doing is forwarding ports
  • -f tells SSH to run in the background, so we don’t need to keep the terminal session alive
  • -L tells SSH to forward a port on a local address to a port on an address reachable from the remote machine. In this case, port 8888 on localhost on our local machine is forwarded to port 8888 on localhost on the remote machine.
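
Once the tunnel is up, the Jupyter session should be reachable from our local machine by pointing a browser at http://localhost:8888. If we want a quick sanity check from the terminal first (just a sketch; any HTTP client will do), something like this should return an HTTP status line:

# From the local machine: confirm something is answering on the forwarded port
curl -sI http://localhost:8888 | head -n 1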

Suppose now that we can’t directly SSH to the remote host, but instead must go via a gateway server. The SSH connection is established through every intermediate host, and the ports are then forwarded over that end-to-end tunnel, so we don’t need to set up any forwarding on the jump host itself. Instead, using the -J (ProxyJump) option, we can forward ports through a jump host as follows:

ssh -J user@gateway -N -f -L localhost:8888:localhost:8888 user@remotehost

In fact we can tunnel through any number of intermediate servers by giving SSH a list of jump hosts:

ssh -J user@jumphost1,user@jumphost2,user@jumphost3 -N -f -L localhost:8888:localhost:8888 user@remotehost

We can also define jump hosts in ~/.ssh/config to simplify things:

Host jump1
    Hostname jumphost1
    User user

Host jump2
    Hostname jumphost2
    User user
    ProxyJump jump1

Host jump3
    Hostname jumphost3
    User user
    ProxyJump jump2

Host remote
    Hostname remotehost
    User user
    ProxyJump jump3

We can also add the forwarded port to the config file with the LocalForward option if we know we’ll always use the same port. This, however, may not be a good idea in a multi-user environment where somebody else might be using the same port for something else. We can also add the -N and -f options to the config file using the SessionType and ForkAfterAuthentication options respectively.

Host remote
    Hostname remotehost
    User user
    ProxyJump jump3
    LocalForward 8888 localhost:8888
    SessionType none
    ForkAfterAuthentication yes
    

With this config file, our entire port forwarding process is reduced to simply running

ssh remote

Very convenient.
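
One thing to bear in mind: because -f (or ForkAfterAuthentication yes) puts the SSH session in the background, the tunnel will happily outlive the terminal we started it from. When we’re finished, we need to find and kill it ourselves. A rough sketch (the pattern should be adjusted to match the command actually used, so we don’t kill unrelated SSH sessions):

# Find backgrounded SSH processes forwarding port 8888
pgrep -af "ssh.*8888"

# Kill the tunnel once we're done with it
pkill -f "ssh.*8888"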

Reverse port forwarding from the worker node

Of course, this SSH tunnel only gets us as far as we are allowed to SSH. If the worker node we actually want to run Jupyter on is not accessible from the head node via SSH, we need to think outside the box. Suppose the sysadmins have very kindly allowed us to SSH from the worker node to the head node for the purpose of, for example, copying data to and from the worker during a job. This allows us to use SSH on the worker node to set up a tunnel and forward ports to (or in this case from) the head node. To reverse port forward, we can run the following on the worker node:

ssh user@headnode -N -f -R localhost:8888:localhost:8888

This is exactly the same as regular port forwarding, except that this time the -R option tells SSH to forward port 8888 on the remote host (here, the head node) to port 8888 on the local host (the worker node). The net result of these two port forwarding steps is that we will have forwarded port 8888 on our local machine to port 8888 on the worker node, allowing us to interact with (for example) a Jupyter session running there.
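
If things don’t seem to be working, it’s worth confirming on the head node that the reverse-forwarded port is actually listening. Assuming the ss utility is available there, a quick check might look like:

# On the head node: look for a listener on the reverse-forwarded port
ss -tln | grep 8888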

Running Jupyter with sbatch

This raises one final question: how do we start a Jupyter session on the worker node, set up the SSH tunnel, and keep it all running while we work? Assuming that we can’t simply request whatever resources we like in an interactive session, we’ll need an sbatch script that does all of this for us. Writing the script itself is straightforward, but we also need to think about SSH authentication: when slurm runs the job for us, we won’t be around to enter a password when prompted by SSH, and the job will hang until it times out.

As long as our friendly neighbourhood sysadmin permits it, we can use SSH keys to set up passwordless authentication when connecting to the head node from the worker node. Setting up SSH keys is straightforward, so we’ll just outline the process here. If in doubt, ask the sysadmin.

First, create a key pair on the worker node.

ssh-keygen -t ed25519

For passwordless authentication, leave the passphrase empty when prompted.
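
As an aside, if we ever want to script this step, ssh-keygen can be told up front to use an empty passphrase and a specific key file (the path below is just the default ed25519 location):

# Non-interactive key generation with an empty passphrase
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519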

Next, copy the public key to the head node. SSH can do this for us:

ssh-copy-id user@headnode

This will prompt us for our password. Once complete, our public key will be stored on the head node, and future SSH connections from the worker node will authenticate with our private key instead of prompting for a password. To test this, we simply attempt to SSH to the head node:

ssh user@headnode

If our SSH keys have been set up correctly, SSH will not prompt us for our password and will automatically connect to the head node. With this taken care of, we can finally write an sbatch script to start up a Jupyter session and port forward from the head node to the worker node. The exact configuration will depend on the resources needed, but here’s a minimal working example:

#!/bin/bash
#SBATCH --cluster=cpu_cluster
#SBATCH --partition=high_priority
#SBATCH --time=08:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=128G
#SBATCH --nodes=1
#SBATCH --gres=gpu:GeForce_GTX_1080Ti:1
#SBATCH --output=/homes/user/slurm_%j.out
#SBATCH --error=/homes/user/slurm_%j.err

# Set up project directory
mkdir -p /homes/user/my_project
cd /homes/user/my_project

# Reverse port forward
ssh -N -f -R 8888:localhost:8888 user@headnode

# Start jupyter
jupyter lab --no-browser --port=8888

Submitting this script to sbatch will allocate us eight cores, one GPU, and 128GB of memory on a worker node, set up a project directory, forward port 8888 from the head node to the worker node, and finally start Jupyter Lab on port 8888 on the worker node. If we want to use a specific worker node, we can specify this with the --nodelist option. If we have Jupyter and SSH keys set up on all nodes, we can leave it to slurm to find us a suitable node to run our job.
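
To tie everything together, here is roughly what using the script looks like in practice (the script name and job ID below are placeholders):

# Submit the job and note the job ID that sbatch reports
sbatch jupyter.sbatch

# Check that the job is running, and which node it landed on
squeue -u user

# Jupyter logs its URL, including the login token, to stderr
grep -m 1 "token=" /homes/user/slurm_<jobid>.err

With the SSH tunnel from our local machine to the head node already running, pasting that token into a browser pointed at http://localhost:8888 gets us into the notebook.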

Wrapping up

Running Jupyter on a worker node and interacting with it from a laptop might sound like a lot of work, but with a little planning and some sensible configuration we can streamline the process and make it almost as easy as running it directly on our laptop. This was written with our own HPC infrastructure in mind, but the principles are applicable anywhere you have multiple SSH hosts and a job scheduler in between you and your data analysis.
