Set up Slurm across Multiple Machines
To install Slurm, you need admin access to the machines. This post explains how I got Slurm running on multiple Linux servers, all running Ubuntu 18.04 LTS.
Setup Munge
First, we need to make sure the clocks, users, and groups (UIDs and GIDs) are synchronized across the cluster. In particular, we need to create two users, slurm and munge, with the same UID and GID on every server.
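One way to keep the UIDs and GIDs consistent is to create both accounts with explicit, identical IDs on every node. A sketch (the IDs 1111 and 2222 are arbitrary examples; pick values unused on all of your machines):

```shell
# Run as root on every node. UID/GID values below are arbitrary examples;
# the only requirement is that they are identical across all servers.
export MUNGEUSER=1111
groupadd -g $MUNGEUSER munge
useradd -m -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge

export SLURMUSER=2222
groupadd -g $SLURMUSER slurm
useradd -m -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
```

You can verify the IDs match across nodes with `id munge` and `id slurm` on each server.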
Then, we install Munge for authentication:
$ apt install munge libmunge2 libmunge-dev
To test whether Munge was installed successfully:
$ munge -n | unmunge | grep STATUS
STATUS: Success (0)
Next, we create a munge authentication key on one of the servers:
$ /usr/sbin/create-munge-key
After generating the key, we copy /etc/munge/munge.key from that server to all the other servers, overwriting the existing /etc/munge/munge.key on each of them.
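Distributing the key can be done with scp, for example (the hostnames node2 and node3 are hypothetical; substitute your own):

```shell
# Run from the server where the key was generated.
# Hostnames are placeholders for the other nodes in your cluster.
for host in node2 node3; do
    scp /etc/munge/munge.key root@$host:/etc/munge/munge.key
done
```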
We need to set the ownership and permissions for Munge accordingly on every server:
$ chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
$ chmod 0700 /etc/munge/ /var/log/munge/ /var/lib/munge/
$ chmod 0755 /run/munge/
Then, we enable and start the munge service (remember not to use sudo when running the munge command):
$ systemctl enable munge
$ systemctl start munge
You can then test whether munge works properly by executing:
munge -n # Generate a credential on stdout
munge -n | unmunge # Displays information about the MUNGE key
munge -n | ssh somehost unmunge # Test decoding a credential on a remote host
If everything is set up properly, you shouldn’t see any error messages.
Setup Slurm
Use apt to install Slurm on Ubuntu systems (make sure all nodes have the same Slurm version):
$ apt install slurm-wlm
Next, we need to configure Slurm. Since we installed Slurm through the package manager, the version is older than the latest release, so it’s preferable not to use the official online Slurm Configuration Tool. Instead, we can find the configuration tool matching the installed version at /usr/share/doc/slurmctld/slurm-wlm-configurator.html.
After filling in the required fields in the form, we copy the generated file to /etc/slurm-llnl/slurm.conf on all nodes. Then, you can run sinfo to check the status of all nodes. You can also launch a job to see if it actually works, for example:
srun -N2 -l /bin/hostname
This should print out the hostname for all the nodes in the cluster.
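For reference, the relevant parts of a generated slurm.conf for a small two-node cluster might look like the following. All names, addresses, and counts here are placeholders for illustration; use the values the configurator produced for your cluster (note that older Slurm versions such as the one shipped with Ubuntu 18.04 use ControlMachine, while newer releases use SlurmctldHost):

```
# /etc/slurm-llnl/slurm.conf (illustrative excerpt; values are placeholders)
ClusterName=mycluster
ControlMachine=node1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core

# Node and partition definitions
NodeName=node[1-2] CPUs=8 State=UNKNOWN
PartitionName=main Nodes=node[1-2] Default=YES MaxTime=INFINITE State=UP
```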
Add GPU support
To add GPU support, we first create a file gres.conf in /etc/slurm-llnl/. Here is an example on a node with three GPUs:
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Then, we add GresTypes=gpu to /etc/slurm-llnl/slurm.conf, along with the GPU information for each node:
NodeName=node1 Gres=gpu:3 State=UNKNOWN
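Once the nodes advertise their GPUs, jobs can request them through the --gres flag. For example (train.py is a hypothetical script):

```shell
$ srun --gres=gpu:1 nvidia-smi        # run on one allocated GPU
$ srun -N1 --gres=gpu:2 python train.py  # request two GPUs on a single node
```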