Simple Linux Utility for Resource Management (Slurm) is an open-source job scheduler used by high-performance computing (HPC) clusters to manage and allocate resources for jobs submitted by users. It provides features such as job queuing, resource allocation, and job execution according to a defined scheduling policy. The official documentation is a good starting point for getting hands-on with Slurm, and the official cheat sheet summarizes the available commands.
All of the following commands should be run from the master node, diufrd200, which is the entry point for the faith cluster.
Command | Description |
---|---|
`sbatch <script_name>` | Submit a job script to the queue, possibly for later execution |
`squeue` | Show the job queue |
`scancel <job_id>` | Cancel a queued or running job |
`sinfo` | Show the status of the nodes in the cluster |
It is a good idea to run `sinfo` first to check the available nodes and their partitions.
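A minimal first look at the cluster might therefore be something like the sketch below (the `GPU` partition name is only taken from the example script further down this page; adapt it to what `sinfo` actually reports):

```bash
# List all nodes with their partition and state (idle, mixed, alloc, down, ...)
sinfo

# Show only the nodes of one partition (partition name assumed, check the sinfo output)
sinfo --partition=GPU

# List your own jobs currently queued or running
squeue -u $USER
```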
Inside a Slurm job script (a bash script), `#SBATCH` directives define how the job should be handled by the scheduler. These lines must begin with `#SBATCH` and are placed at the top of the script. The table below shows the main options; the complete list can be found in the official documentation. Be careful about the default values that apply when options are not defined in the script.
Option | Description |
---|---|
`--job-name=NAME` | Set the name of the job |
`--nodelist=NODES` | Specify the host(s)/node(s) on which to launch the script |
`--partition=PARTITION` | Select which partition (queue) to submit to |
`--ntasks=N` | Number of tasks/processes to run |
`--cpus-per-task=N` | Number of CPU cores per task |
`--output=FILE` | File to which standard output is written |
`--error=FILE` | File to which standard error is written (optional) |
`--gres=gpu:N` | Request N GPUs (if available) |
`--mem=4G` | Request a specific amount of memory |
`--mail-user=EMAIL` | Email address for notifications |
`--mail-type=BEGIN,END,FAIL` | When to send email notifications |
More examples can be found here. The script below is a complete example: in the log file names, `%x` expands to the job name and `%j` to the job ID. Note that the `logs/` directory must exist before submission, otherwise the log files cannot be created.
```bash
#!/bin/bash
#SBATCH --job-name=my_nice_job
#SBATCH --nodelist=diufrd202
#SBATCH --partition=GPU
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --output=logs/%x_%j.log
#SBATCH --error=logs/%x_%j.log
#SBATCH --gres=gpu:1
#SBATCH --mem=4G
#SBATCH --mail-user=john.doe@unifr.ch
#SBATCH --mail-type=END,FAIL

python <python_script_name>
```
Then launch the bash script with `sbatch <bash_script_name>`.
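For reference, a typical submit-and-monitor cycle could look like the following sketch (the script name and the job ID 12345 are placeholders):

```bash
# Submit the job; Slurm replies with a line such as "Submitted batch job 12345"
sbatch my_nice_job.sh

# Check the state of your jobs (ST column: PD = pending, R = running)
squeue -u $USER

# Cancel the job if needed, using the job ID printed by sbatch
scancel 12345
```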
- Slurm does not print job output to the shell, which can be annoying when you want to follow progress or other output. You can continuously stream new lines of the log file to the terminal with the `tail -f <log_file>` command, as shown below.
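For example, with the `--output=logs/%x_%j.log` pattern used in the script above, and assuming the job got ID 12345:

```bash
# Print new lines of the log file as the job writes them (Ctrl+C to stop)
tail -f logs/my_nice_job_12345.log
```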
More examples and details can be found on the UniFR intranet page (only accessible through the UniFR VPN).