Setup Slurm across Multiple Machines

To install Slurm, we need admin access to the machines. This post explains how I got Slurm running on multiple Linux servers. All servers are running Ubuntu 18.04.5 LTS.

Setup Munge

First, we need to make sure the clocks, users, and groups (UIDs and GIDs) are synchronized across the cluster. We create two users, slurm and munge, with the same UID and GID on every server:

$ export MUNGEUSER=3456
$ groupadd -g $MUNGEUSER munge
$ useradd  -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge  -s /sbin/nologin munge
$ export SLURMUSER=3457
$ groupadd -g $SLURMUSER slurm
$ useradd  -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm  -s /bin/bash slurm
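
Since the IDs have to match on every server, it can help to verify them after creating the users. The following is a minimal sketch, not part of the original setup: the helper names uid_gid_of and check_user are made up for illustration, and the expected values assume the UIDs and GIDs used in the commands above.

```shell
#!/bin/sh
# Print "uid:gid" for a user, reading passwd-format lines on stdin.
uid_gid_of() {
    awk -F: -v user="$1" '$1 == user { print $3 ":" $4 }'
}

# Compare a user's uid:gid on this machine against the expected value.
check_user() {
    local user="$1" expected="$2" actual
    actual="$(getent passwd "$user" | uid_gid_of "$user")"
    if [ "$actual" = "$expected" ]; then
        echo "$user ok ($actual)"
    else
        echo "$user MISMATCH: expected $expected, got ${actual:-none}"
        return 1
    fi
}

# Usage on each server (values taken from the commands above):
#   check_user munge 3456:3456
#   check_user slurm 3457:3457
```

Running the two check_user calls on each server should report ok everywhere before moving on.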

Then, we install Munge for authentication:

$ apt install munge libmunge2 libmunge-dev

To test if munge is installed successfully:

$ munge -n | unmunge | grep STATUS
STATUS:           Success (0)

Next, we create a munge authentication key on one of the servers:

$ /usr/sbin/create-munge-key

After generating the munge authentication key, we copy /etc/munge/munge.key from that server to all other servers, overwriting any existing /etc/munge/munge.key there.
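
One way to do the copy and confirm that every server ended up with the same key is sketched below. This is only an illustration: the function names are hypothetical, the hostnames passed to push_key are placeholders, and it assumes root can log in to the other servers over SSH, which may not hold on your cluster.

```shell
#!/bin/sh
# Print only the SHA-256 digest of a file.
digest_of() {
    sha256sum "$1" | awk '{ print $1 }'
}

# Copy the local key to each host given as an argument, then compare digests.
# Assumes root SSH access to every host (an assumption, not from the post).
push_key() {
    local key=/etc/munge/munge.key host want got
    want="$(digest_of "$key")"
    for host in "$@"; do
        scp -p "$key" "root@$host:$key" || return 1
        got="$(ssh "root@$host" sha256sum "$key" | awk '{ print $1 }')"
        if [ "$got" = "$want" ]; then
            echo "$host: key matches"
        else
            echo "$host: key differs" >&2
            return 1
        fi
    done
}

# Usage (placeholder hostnames):
#   push_key node2 node3
```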

We need to set up the ownership and permissions for munge on every server:

$ chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
$ chmod 0700 /etc/munge/ /var/log/munge/ /var/lib/munge/
$ chmod 0755 /run/munge/
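
To confirm the result, the mode and owner of each directory can be read back with stat. This is a small sketch with made-up helper names (mode_owner, check_perm); it relies on the GNU stat that ships with Ubuntu.

```shell
#!/bin/sh
# Print the octal mode and owner of a path, e.g. "700 munge" (GNU stat).
mode_owner() {
    stat -c '%a %U' "$1"
}

# Compare a path's mode and owner against the expected "MODE OWNER" string.
check_perm() {
    local path="$1" expected="$2" actual
    actual="$(mode_owner "$path")" || return 1
    if [ "$actual" != "$expected" ]; then
        echo "$path: expected '$expected', got '$actual'"
        return 1
    fi
}

# Usage, matching the chown/chmod commands above:
#   check_perm /etc/munge     "700 munge"
#   check_perm /var/log/munge "700 munge"
#   check_perm /var/lib/munge "700 munge"
#   check_perm /run/munge     "755 munge"
```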

Then, we enable and start the munge service (remember not to use sudo when running munge):

$ systemctl enable munge
$ systemctl start munge

You can then test whether munge works properly by executing:

$ munge -n                         # Generate a credential on stdout
$ munge -n | unmunge               # Check that a credential can be decoded locally
$ munge -n | ssh somehost unmunge  # Check that a credential can be decoded on another host

If everything is set up properly, you shouldn’t see any error messages.
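
With more than a couple of servers, the remote check can be looped. The sketch below is an illustration only: decode_ok and check_hosts are hypothetical names, the hostnames are placeholders, and it assumes the success line looks like the STATUS output shown earlier.

```shell
#!/bin/sh
# Succeed when unmunge output (on stdin) reports a successful decode.
decode_ok() {
    grep -q '^STATUS:[[:space:]]*Success (0)'
}

# Encode a credential locally and try to decode it on each host given.
check_hosts() {
    local host
    for host in "$@"; do
        if munge -n | ssh "$host" unmunge | decode_ok; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
        fi
    done
}

# Usage (placeholder hostnames):
#   check_hosts node2 node3
```

A FAILED line usually points back at one of the earlier steps: a key that differs, wrong permissions, or clocks that have drifted apart.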

Setup Slurm

Use apt to install slurm on Ubuntu systems (make sure all nodes run the same Slurm version):

$ apt install slurm-wlm
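
A quick way to compare versions across nodes is to collect the output of slurmd -V from each one and check that all lines agree. The helper all_same below is a made-up name, and the hostnames in the usage comment are placeholders.

```shell
#!/bin/sh
# Succeed when every non-empty line on stdin is identical.
all_same() {
    [ "$(grep -v '^$' | sort -u | wc -l)" -eq 1 ]
}

# Usage (placeholder hostnames):
#   for h in node1 node2 node3; do ssh "$h" slurmd -V; done | all_same \
#       && echo "versions match" || echo "versions differ"
```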

Next, we need to configure slurm. Since we used a package manager to install slurm, the installed version is older than the latest release. Thus, it’s preferable not to use the official Slurm Configuration Tool. Instead, we can find the configuration tool for the installed version at /usr/share/doc/slurmctld/slurm-wlm-configurator.html (we just need the slurm-wlm-configurator.html file, which might reside in a different directory).

We then run python3 -m http.server in the directory that contains the file and open the local web server’s address in the browser. Opening slurm-wlm-configurator.html there shows the configuration generator for the installed version. More explanations can be found here.
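
Because the file's location can vary, the two steps can be combined into a small helper that finds the file and serves its directory. This is a sketch under assumptions: the function names are made up, port 8000 is just http.server's usual default, and the trailing wildcard in the pattern is there in case the package ships a compressed copy, which would need to be unpacked before a browser can render it.

```shell
#!/bin/sh
# List configurator files under a directory (default: /usr/share/doc).
find_configurator() {
    find "${1:-/usr/share/doc}" -type f -name 'slurm-wlm-configurator*.html*' 2>/dev/null
}

# Serve the directory of the first match on port 8000.
serve_configurator() {
    local file dir
    file="$(find_configurator "$@" | head -n 1)"
    if [ -z "$file" ]; then
        echo "configurator not found" >&2
        return 1
    fi
    dir="$(dirname "$file")"
    echo "Open http://localhost:8000/$(basename "$file")"
    (cd "$dir" && python3 -m http.server 8000)
}

# Usage:
#   serve_configurator
```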

After filling in the required fields in the form, we copy the generated file to /etc/slurm-llnl/slurm.conf on all nodes. Then, you can execute sinfo to check the status of all nodes. You can also launch a job to see if it actually works, for example:

$ srun -N2 -l /bin/hostname

This should print the hostname of each of the two allocated nodes, with each line prefixed by its task number.
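
On a larger cluster it can be easier to list only the nodes that are not in a usable state. The filter below is a rough sketch: unhealthy_nodes is a made-up name, and the set of states treated as healthy is my own choice, so adjust it to your needs. A trailing * on a state (for example idle*) marks a node that is not responding, so it is reported as well.

```shell
#!/bin/sh
# Read "node state" pairs on stdin and print those that are not usable.
unhealthy_nodes() {
    awk '$2 !~ /^(idle|allocated|mixed|completing)$/ { print $1 " " $2 }'
}

# Usage: one line per node, no header, node name and state only.
#   sinfo -N -h -o '%N %T' | unhealthy_nodes
```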

Add GPU support

To add GPU support, we first create a file gres.conf in /etc/slurm-llnl/ on each GPU node. Here is an example for a node with three GPUs:

Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
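
Instead of typing these lines by hand, they can be generated from the device files present on the node. This is a minimal sketch with a hypothetical function name; it assumes the GPUs show up as /dev/nvidia0, /dev/nvidia1, and so on, as in the example above.

```shell
#!/bin/sh
# Print one gres.conf line per /dev/nvidiaN device found in the given
# directory (default: /dev). Other files such as nvidiactl are skipped
# because the pattern requires a digit right after "nvidia".
gen_gres() {
    local dev_dir="${1:-/dev}" dev
    for dev in "$dev_dir"/nvidia[0-9]*; do
        [ -e "$dev" ] || continue
        echo "Name=gpu File=/dev/$(basename "$dev")"
    done
}

# Usage (as root on the GPU node):
#   gen_gres > /etc/slurm-llnl/gres.conf
```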

Then, we add GresTypes=gpu to /etc/slurm-llnl/slurm.conf. Next, we add the GPU information to the node’s definition in slurm.conf:

NodeName=node1 Gres=gpu:3 State=UNKNOWN
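
The GPU count in this line has to agree with the number of entries in that node's gres.conf. A small sketch that derives the line from the file is shown below; the function names are made up for illustration, and the output only covers the fields shown above, so any other node parameters from your generated slurm.conf still need to be kept.

```shell
#!/bin/sh
# Count the GPU entries in a gres.conf file.
gpu_count() {
    grep -c '^Name=gpu' "$1"
}

# Print a NodeName line for the given node name and gres.conf path.
node_line() {
    echo "NodeName=$1 Gres=gpu:$(gpu_count "$2") State=UNKNOWN"
}

# Usage:
#   node_line node1 /etc/slurm-llnl/gres.conf
```

Once the configuration is in place on all nodes, a job can request a GPU through srun's --gres option, for example --gres=gpu:1.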