SLURM¶
Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
Batch jobs are typically submitted using a batch script, which is a text file containing a number of job directives and GNU/Linux commands or utilities. Batch scripts are submitted to SLURM, where they are queued while awaiting free resources.
SLURM includes numerous directives, which are used to specify resource requirements and other attributes for batch and interactive jobs. SLURM directives can appear as header lines (lines that start with #SBATCH) in a batch job script or as command-line options to the sbatch command.
Here is an example of a generic batch script:
#!/bin/bash
#SBATCH --partition=general # This is the default partition
#SBATCH --qos=regular
#SBATCH --job-name=gpu-cnn-test
#SBATCH --cpus-per-task=1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=10:00:00
#SBATCH --mem=24000
echo "Hello DIPC!"
#!/bin/bash : run the job under the BASH shell.
#SBATCH --partition=general : send the job to the general partition.
#SBATCH --qos=regular : send the job to the regular QoS.
#SBATCH --job-name=gpu-cnn-test : give a name to the job.
#SBATCH --cpus-per-task=1 : number of CPUs/threads per task/process.
#SBATCH --nodes=1 : make a reservation of 1 node.
#SBATCH --ntasks-per-node=1 : number of tasks/processes per node.
#SBATCH --time=10:00:00 : requested walltime in HH:MM:SS (or DD-HH:MM:SS) format.
#SBATCH --mem=24000 : requested memory per node in MB.
echo "Hello DIPC!" : the actual piece of code we want to run.
Tip
If you do not know the amount of memory or time your jobs are going to need, overestimate these values in the first runs and adjust them as you learn how your jobs behave.
Once your batch script is prepared, you can submit it to the queue system as a batch job.
Tip
If you do not redirect the output of your runs to a specific file, the standard output will be redirected to a file named slurm-<job_id>.log.
You can find examples of batch scripts tailored to a particular HPC system on its dedicated webpage.
List of job resources and options¶
Resource/Option | Syntax | Description |
---|---|---|
partition | --partition=general | Default partition containing all the publicly available nodes. |
qos | --qos=regular | Quality of Service of the job. |
nodes | --nodes=2 | Number of nodes for the job. |
cores/processes | --ntasks-per-node=8 | Number of cores/processes per compute node. |
threads per process | --cpus-per-task=4 | Number of threads per process. |
memory | --mem=128000 | Memory per node in MB. |
memory | --mem=120G | Memory per node in GB. |
memory | --mem-per-cpu=12000 | Memory per core in MB. |
gpu | --gres=gpu:p40:2 | Number of GPUs (and type). |
time | --time=12:00:00 | Time limit for the job in HH:MM:SS. |
job name | --job-name=my_job | Name of the job. |
output file | --output=job.out | File for stdout. |
error file | --error=job.err | File for stderr. |
email address | --mail-user=your@mail.com | Email address to which the notifications will be sent. |
email type | --mail-type=ALL | Notification type: BEGIN, END, FAIL, ALL. |
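For instance, a minimal job header requesting two P40 GPUs might combine several of these options as follows (the partition, QoS, and GPU type are taken from the examples above and are site-specific assumptions; adapt them to your system):
#!/bin/bash
#SBATCH --partition=general
#SBATCH --qos=regular
#SBATCH --job-name=gpu_job
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:p40:2      # two GPUs of type p40
#SBATCH --mem=120G
#SBATCH --time=12:00:00
#SBATCH --output=job.out
#SBATCH --error=job.err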
How to submit jobs¶
You will typically want to submit your jobs from a directory located under your /scratch directory, so before anything else you will need to move or copy all the necessary files to this filesystem.
sbatch¶
To submit your job script (for example, batch_script.slurm), use the sbatch command:
$ sbatch batch_script.slurm
Submitted batch job 123
As a result of invoking the command, SLURM will return a job identifier.
Warning
By default, all the environment variables from the submission environment are propagated to the launched application. If you want to override this default behaviour, set the SLURM_EXPORT_ENV environment variable to NONE in the submission script. You could also give a list of the variables you wish to propagate: SLURM_EXPORT_ENV=PATH,EDITOR.
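As a minimal sketch of this, a submission script could control propagation like this (my_program is a placeholder for your actual executable):
#!/bin/bash
#SBATCH --job-name=clean-env
#SBATCH --ntasks=1
#SBATCH --time=00:30:00
# Do not propagate any variables from the submission environment...
export SLURM_EXPORT_ENV=NONE
# ...or, alternatively, propagate only a selected list:
# export SLURM_EXPORT_ENV=PATH,EDITOR
srun ./my_program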
salloc¶
The salloc command can be used to request resources be allocated without needing a batch script. Running salloc with a list of resources will allocate said resources, create the job, and drop you into a subshell with the environment variables necessary to run commands in the newly created job allocation. When your time is up or you exit the subshell, your job allocation will be relinquished.
[dilasgo@atlas-edr-login-02]$ salloc -N 1 --mem=20gb --time=02:00:00
salloc: Granted job allocation 1999
[dilasgo@atlas-edr-login-02]$ srun ls -l /scratch/dilasgo
drwx------. 5 dilasgo cc 3 Jul 6 11:29 applications
drwx------. 4 dilasgo cc 2 Jul 6 11:54 benchmarks
drwx------. 4 dilasgo cc 2 May 12 14:13 conda
drwx------. 5 dilasgo cc 3 May 4 19:35 Singularity
[dilasgo@atlas-edr-login-02]$ exit
exit
salloc: Relinquishing job allocation 1999
Warning
Any commands not invoked with srun will be run locally on the submit node.
srun¶
srun starts an interactive job.
[dilasgo@atlas-edr-login-02 ~]$ srun --mem=10mb --time=00:12:00 bash -c 'echo "Hello from" `hostname`'
Hello from atlas-281
srun will always wait for the command passed to it to finish before exiting, so if you start a long-running process and your terminal session is closed or expires, your process will stop and your job will end. To run a non-interactive submission that keeps running after you log out, you will need to wrap your srun command in a batch script and submit it with sbatch.
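A minimal sketch of such a wrapper script (my_long_job is a placeholder for your actual program):
#!/bin/bash
#SBATCH --job-name=long-run
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=12:00:00
# srun launches the actual work inside the allocation; the job keeps
# running even after you log out from the login node.
srun ./my_long_job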
Job arrays¶
A job array script is a template to create multiple instances of a job. For example:
#SBATCH --array=1-9
Creates 9 instances of the job, one for each index from 1 to 9. Each instance is an individual job with the same requested resources. The job array template holds a fixed place in the job queue and spawns job instances as resources become available. Each job instance has a unique index that is stored in the environment variable $SLURM_ARRAY_TASK_ID, and you can use this index in different ways.
A simple job array script is:
#!/bin/bash
#SBATCH --qos=regular
#SBATCH --job-name=ARRAY_JOB
#SBATCH --time=00:10:00
#SBATCH --nodes=1 # nodes per instance
#SBATCH --ntasks=1 # tasks per instance
#SBATCH --array=0-9 # instance indexes
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err
echo "Slurm job id is ${SLURM_JOB_ID}"
echo "Array job id is ${SLURM_ARRAY_JOB_ID}"
echo "Instance index is ${SLURM_ARRAY_TASK_ID}."
This will submit a single job that splits into 10 instances, each with 1 CPU core allocated. Users who would otherwise submit a large number of individual jobs are encouraged to use job arrays instead.
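A common pattern is to use the index to select a different input file for each instance. The following sketch assumes input files named input_0.dat to input_9.dat exist in the submission directory and that my_program is a placeholder for your own executable:
#!/bin/bash
#SBATCH --qos=regular
#SBATCH --job-name=ARRAY_JOB
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --array=0-9
# Each instance processes the input file matching its own index.
srun ./my_program input_${SLURM_ARRAY_TASK_ID}.dat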
Managing your jobs¶
squeue¶
The squeue command shows the status of the queue/jobs. Adding some options can enrich the output to display even more information:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
123 long gpu-cnn- user1 R 1:02 1 atlas-262
124 small vasp-cu- user2 R 10:02:23 2 [atlas-262,atlas-271]
List all running jobs for a user:
squeue -u <username> -t RUNNING
List all pending jobs for a user:
squeue -u <username> -t PENDING
List all current jobs in a given partition (for example, xlong) for a user:
squeue -u <username> -p xlong
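You can also customize the columns shown with the --format (or -o) option. For example, a sketch printing the job id, partition, name, user, state, elapsed time, node count, and reason/nodelist:
$ squeue -u <username> --format="%.10i %.10P %.20j %.8u %.2t %.10M %.5D %R"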
sstat¶
Lists status info for a currently running job:
sstat --format=JobID,Nodelist -j <jobid>
To see all available format options:
$ sstat --helpformat
AveCPU AveCPUFreq AveDiskRead AveDiskWrite
AvePages AveRSS AveVMSize ConsumedEnergy
ConsumedEnergyRaw JobID MaxDiskRead MaxDiskReadNode
MaxDiskReadTask MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask
MaxPages MaxPagesNode MaxPagesTask MaxRSS
MaxRSSNode MaxRSSTask MaxVMSize MaxVMSizeNode
MaxVMSizeTask MinCPU MinCPUNode MinCPUTask
Nodelist NTasks Pids ReqCPUFreq
ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov TRESUsageInAve
TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin
TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutAve
TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutMin
TRESUsageOutMinNode TRESUsageOutMinTask TRESUsageOutTot
sacct¶
The sacct utility extracts job status and job events from the accounting records. Using it can help identify where, how, and why a job failed. This tool takes a job id as a parameter.
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed,nodelist
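For example, to check how a job finished you can ask for the State and ExitCode fields (both appear in the list of format options below):
$ sacct -j <jobid> --format=JobID,JobName,State,ExitCode,Elapsed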
To see all available format options:
$ sacct --helpformat
Account AdminComment AllocCPUS AllocGRES
AllocNodes AllocTRES AssocID AveCPU
AveCPUFreq AveDiskRead AveDiskWrite AvePages
AveRSS AveVMSize BlockID Cluster
Comment Constraints ConsumedEnergy ConsumedEnergyRaw
CPUTime CPUTimeRAW DBIndex DerivedExitCode
Elapsed ElapsedRaw Eligible End
ExitCode Flags GID Group
JobID JobIDRaw JobName Layout
MaxDiskRead MaxDiskReadNode MaxDiskReadTask MaxDiskWrite
MaxDiskWriteNode MaxDiskWriteTask MaxPages MaxPagesNode
MaxPagesTask MaxRSS MaxRSSNode MaxRSSTask
MaxVMSize MaxVMSizeNode MaxVMSizeTask McsLabel
MinCPU MinCPUNode MinCPUTask NCPUS
NNodes NodeList NTasks Priority
Partition QOS QOSRAW Reason
ReqCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov
ReqCPUS ReqGRES ReqMem ReqNodes
ReqTRES Reservation ReservationId Reserved
ResvCPU ResvCPURAW Start State
Submit Suspended SystemCPU SystemComment
Timelimit TimelimitRaw TotalCPU TRESUsageInAve
TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin
TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutAve
TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutMin
TRESUsageOutMinNode TRESUsageOutMinTask TRESUsageOutTot UID
User UserCPU WCKey WCKeyID
scancel¶
Deletes your job. It takes the job identifier as a parameter.
$ scancel <jobid>
To cancel all the jobs for a user:
scancel -u <username>
To cancel all the pending jobs for a user:
scancel -t PENDING -u <username>
seff¶
SLURM provides a tool called seff to check the memory utilization and CPU efficiency for completed jobs. Note that for running and failed jobs the efficiency numbers reported by seff are not reliable, so please use this tool only for successfully completed jobs:
$ seff <job_id>
How to set up email notifications¶
You may want to have SLURM send you email on certain events affecting your jobs. For example, if your job is expected to run for a long time, you might want the scheduler to email you if it fails or when it completes. You can also receive an email notifying you when a job starts.
Add the following to your batch script to set the email address:
#SBATCH --mail-user=your@email.org
These are the different types of notifications:
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --mail-type=ALL
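Notification types can also be combined in a comma-separated list. For example, to be notified only when the job ends or fails:
#SBATCH --mail-user=your@email.org
#SBATCH --mail-type=END,FAIL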
Job states¶
Status | Description |
---|---|
PENDING | Job is in the queue waiting to be scheduled |
HOLD | Job is in the queue but has been put on hold |
RUNNING | Job has been granted an allocation and is running |
COMPLETED | Job was completed successfully |
TIMEOUT | Job ran longer than the wallclock limit and was terminated |
FAILED | Job terminated with a non-zero status |
NODE_FAIL | Job terminated after a compute node reported a failure |
Dependency chains¶
Job dependencies are used to defer the start of a job until some dependencies have been satisfied. Job dependencies can be defined using the --dependency argument of the sbatch command:
$ sbatch --dependency=<dependency_type>:<jobID> batch_script
or they can be declared within the batch script:
#!/bin/bash
#SBATCH --dependency=<dependency_type>
Available dependencies are (see the example after this list):
after:<jobID> : job starts when the job with jobID has begun execution.
afterany:<jobID> : job starts when the job with jobID terminates.
afterok:<jobID> : job starts when the job with jobID terminates successfully.
afternotok:<jobID> : job starts when the job with jobID terminates with a non-zero status.
singleton : job starts when any previously submitted job with the same job name and user terminates. This can be used to chain restartable jobs.
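For example, a simple two-step chain can be built by capturing the first job id with the --parsable option of sbatch and using it in the dependency of the second job (the script names are placeholders):
$ jobid=$(sbatch --parsable preprocess.slurm)
$ sbatch --dependency=afterok:${jobid} analysis.slurm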
Requeuing policy¶
In case your job is running on a node that goes into a failed state for any reason (freezing, reboot, memory error, etc.), Slurm will attempt to requeue it. This can be beneficial for jobs that use checkpoints and can continue their execution from a point close to where they were aborted, but it can be dangerous otherwise, as it is likely that the outputs will be overwritten and replaced by the new output of the job that has started from scratch.
To avoid this behavior, you need to add the header #SBATCH --no-requeue to your Slurm submit scripts.
Warning
Add #SBATCH --no-requeue to your Slurm submit scripts to prevent your jobs from being requeued.
Some useful SLURM/OS Environment Variables¶
You can use any of the following environment variables in your batch scripts:
Variable | Description |
---|---|
SLURM_SUBMIT_DIR | Name of the directory from which the user submitted the job. |
SLURM_JOB_NODELIST | Contains a list of nodes assigned to the job. |
SLURM_JOBID | Job's SLURM identifier. |
SLURM_NTASKS | Number of tasks requested. |
SLURM_CPUS_PER_TASK | Number of threads/cpus per task/process. |
USER | Contains the username of the user that submitted the job. |
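As an illustration, here is a sketch of a batch script that uses some of these variables to run from the submission directory and to set the number of OpenMP threads (my_openmp_program is a placeholder for your own executable):
#!/bin/bash
#SBATCH --job-name=omp-test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=01:00:00
# Run from the directory the job was submitted from.
cd ${SLURM_SUBMIT_DIR}
# Match the OpenMP thread count to the allocated CPUs per task.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./my_openmp_program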