SLURM

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

Batch jobs are typically submitted using a batch script, which is a text file containing a number of job directives and GNU/Linux commands or utilities. Batch scripts are submitted to SLURM, where they are queued until resources become available.

SLURM includes numerous directives, which are used to specify resource requirements and other attributes for batch and interactive jobs. SLURM directives can appear as header lines (lines that start with #SBATCH) in a batch job script or as command-line options to the sbatch command.

Here is an example of a generic batch script:

#!/bin/bash                                                          
#SBATCH --partition=general # This is the default partition
#SBATCH --qos=regular
#SBATCH --job-name=gpu-cnn-test
#SBATCH --cpus-per-task=1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=10:00:00
#SBATCH --mem=24000

echo "Hello DIPC!"
This batch script example can be read line by line as follows:

  • #!/bin/bash: run the job under the bash shell.
  • #SBATCH --partition=general: send job to the general partition.
  • #SBATCH --qos=regular: send job to the regular QoS.
  • #SBATCH --job-name=gpu-cnn-test: give a name to the job.
  • #SBATCH --cpus-per-task=1: number of cpus/threads per task/process.
  • #SBATCH --nodes=1: make a reservation of 1 node.
  • #SBATCH --ntasks-per-node=1: number of tasks/processes per node.
  • #SBATCH --time=10:00:00: requested walltime in HH:MM:SS format.
  • #SBATCH --mem=24000: requested memory per node in MB.
  • echo "Hello DIPC!": actual piece of code we want to run.

Tip

If you do not know how much memory or time your jobs will need, overestimate these values in the first runs and then tune them as you learn how your jobs behave.

Once your batch script is prepared, you can submit it to the queue system as a batch job.

Tip

If you do not redirect the output of your runs to a specific file, the standard output will be written to a file named slurm-<job_id>.out.
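
If you prefer to choose the file names yourself, add the --output and --error directives to the header; %j expands to the job ID. The names below are only placeholders:

#SBATCH --output=my_run-%j.out
#SBATCH --error=my_run-%j.err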

You can find examples of batch scripts tailored for a particular HPC system on its dedicated webpage.

List of job resources and options

Resource/Option       Syntax                      Description
partition             --partition=general         Default partition containing all the publicly available nodes.
qos                   --qos=regular               Quality of Service of the job.
nodes                 --nodes=2                   Number of nodes for the job.
cores/processes       --ntasks-per-node=8         Number of cores/processes per compute node.
threads per process   --cpus-per-task=4           Number of threads per process.
memory                --mem=128000                Memory per node in MB.
memory                --mem=120G                  Memory per node in GB.
memory                --mem-per-cpu=12000         Memory per core in MB.
gpu                   --gres=gpu:p40:2            Number (and type) of GPUs.
time                  --time=12:00:00             Time limit for the job in HH:MM:SS.
job name              --job-name=my_job           Name of the job.
output file           --output=job.out            File for the standard output (stdout).
error file            --error=job.err             File for the standard error (stderr).
email                 --mail-user=your@mail.com   Email address the notifications are sent to.
email type            --mail-type=ALL             Notification type: BEGIN, END, FAIL, ALL.
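
As an illustration of how these options combine, the header below requests one node with two GPUs, 120 GB of memory and email notifications. The GPU type (p40), job name and email address are placeholders that you should adapt to your system and account:

#!/bin/bash
#SBATCH --partition=general
#SBATCH --qos=regular
#SBATCH --job-name=gpu_job
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:p40:2            # 2 GPUs of type p40 (placeholder type)
#SBATCH --mem=120G
#SBATCH --time=12:00:00
#SBATCH --output=gpu_job-%j.out
#SBATCH --error=gpu_job-%j.err
#SBATCH --mail-user=your@mail.com   # placeholder address
#SBATCH --mail-type=ALL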

How to submit jobs

You will typically want to submit your jobs from a directory located under your /scratch directory, so before anything else you will need to move or copy all the necessary files to this filesystem.

sbatch

To submit your job script (for example, batch_script.slurm), use the sbatch command.

$ sbatch batch_script.slurm
Submitted batch job 123

As a result of invoking the command, SLURM returns a job identifier (job ID).

Warning

By default all the environment variables from the submission environment are propagated to the launched application. If you want to override this default behaviour, set the SLURM_EXPORT_ENV environment variable to NONE in the submission script. You could also give a list of the variables you wish to propagate: SLURM_EXPORT_ENV=PATH,EDITOR.
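
As a sketch of the behaviour described in the warning above, the following lines near the top of a submission script would stop the submission environment from being propagated (or restrict propagation to a few selected variables):

#!/bin/bash
#SBATCH --job-name=clean_env_job
#SBATCH --time=01:00:00

# Do not propagate the submission environment to the job
export SLURM_EXPORT_ENV=NONE
# ...or propagate only selected variables instead:
# export SLURM_EXPORT_ENV=PATH,EDITOR

echo "Running with a reduced environment"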

salloc

The salloc command can be used to request resources be allocated without needing a batch script. Running salloc with a list of resources will allocate said resources, create the job, and drop you into a subshell with the environment variables necessary to run commands in the newly created job allocation. When your time is up or you exit the subshell, your job allocation will be relinquished.

[dilasgo@atlas-edr-login-02]$ salloc -N 1 --mem=20gb --time=02:00:00
salloc: Granted job allocation 1999
[dilasgo@atlas-edr-login-02]$ srun ls -l /scratch/dilasgo
drwx------. 5 dilasgo cc   3 Jul  6 11:29 applications
drwx------. 4 dilasgo cc   2 Jul  6 11:54 benchmarks
drwx------. 4 dilasgo cc   2 May 12 14:13 conda
drwx------. 5 dilasgo cc   3 May  4 19:35 Singularity
[dilasgo@atlas-edr-login-02]$ exit
exit
salloc: Relinquishing job allocation 1999

Warning

Any commands not invoked with srun will be run locally on the submit node.
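
For example, inside an allocation the first command below runs on the login node, while the one launched with srun runs on the allocated compute node (job ID and host names are illustrative):

[dilasgo@atlas-edr-login-02]$ salloc -N 1 --time=00:30:00
salloc: Granted job allocation 2001
[dilasgo@atlas-edr-login-02]$ hostname        # no srun: runs locally on the login node
atlas-edr-login-02
[dilasgo@atlas-edr-login-02]$ srun hostname   # runs on the allocated compute node
atlas-281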

srun

srun starts an interactive job.

[dilasgo@atlas-edr-login-02 ~]$ srun --mem=10mb --time=00:12:00 bash -c 'echo "Hello from" `hostname`'
Hello from atlas-281

srun will always wait for the command passed to it to finish before exiting, so if you start a long-running process and your terminal session is closed or expires, your process will be killed and your job will end. To run a non-interactive submission that keeps running after you log out, wrap your srun command in a batch script and submit it with sbatch.
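
A minimal sketch of such a wrapper script, using the same command as the interactive example above, could look like this:

#!/bin/bash
#SBATCH --job-name=srun-wrapper
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:12:00
#SBATCH --mem=100

# srun launches the task inside the allocation created by sbatch,
# so it keeps running even after you log out of the login node.
srun bash -c 'echo "Hello from" `hostname`'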

Job arrays

A job array script is a template to create multiple instances of a job. For example:

#SBATCH --array=1-9

Creates 9 instances of the job, one for each index 1,2,3,4,5,6,7,8,9. Each instance is an individual job with the same requested resources. The job array template holds a fixed place in the job queue and spawns job instances as resources become available. Each instance has a unique index, stored in the environment variable $SLURM_ARRAY_TASK_ID, which you can use however you want.

A simple job array script is:

#!/bin/bash
#SBATCH --qos=regular
#SBATCH --job-name=ARRAY_JOB
#SBATCH --time=00:10:00
#SBATCH --nodes=1              # nodes per instance
#SBATCH --ntasks=1             # tasks per instance
#SBATCH --array=0-9           # instance indexes
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err

echo "Slurm job id is ${SLURM_JOB_ID}"
echo "Array job id is ${SLURM_ARRAY_JOB_ID}"
echo "Instance index is ${SLURM_ARRAY_TASK_ID}."

This will submit a single job that splits into 10 instances, each with 1 CPU core allocated. Users who would otherwise submit a large number of individual jobs are encouraged to use job arrays instead.
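
A common pattern is to use the index to pick a different input file for each instance. The file naming scheme (input_0.dat, input_1.dat, ...) and the executable name below are only assumptions for illustration:

#!/bin/bash
#SBATCH --qos=regular
#SBATCH --job-name=ARRAY_INPUTS
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --array=0-9

# Each instance processes its own input file: input_0.dat ... input_9.dat
INPUT="input_${SLURM_ARRAY_TASK_ID}.dat"
echo "Instance ${SLURM_ARRAY_TASK_ID} processing ${INPUT}"
srun ./my_program "${INPUT}"   # my_program is a placeholder for your executable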

Managing your jobs

squeue

The squeue command shows the status of the queue and of your jobs. Adding some options enriches the output with more information:

$ squeue
JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON) 
123     long   gpu-cnn-    user1  R       1:02      1 atlas-262
124     small  vasp-cu-    user2  R   10:02:23      2 atlas-[262,271]

List all running jobs for a user:

squeue -u <username> -t RUNNING

List all pending jobs for a user:

squeue -u <username> -t PENDING

List all current jobs for a user in a given partition (here, xlong):

squeue -u <username> -p xlong
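
squeue also accepts a custom output format through the -o/--format option. For example, to print the job ID, partition, name, state, elapsed time and node count for your jobs:

squeue -u <username> -o "%.10i %.12P %.20j %.4t %.12M %.6D"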

sstat

Lists status info for a currently running job:

sstat --format=JobID,Nodelist -j <jobid>

To see all available format options:

$ sstat --helpformat
AveCPU              AveCPUFreq          AveDiskRead         AveDiskWrite       
AvePages            AveRSS              AveVMSize           ConsumedEnergy     
ConsumedEnergyRaw   JobID               MaxDiskRead         MaxDiskReadNode    
MaxDiskReadTask     MaxDiskWrite        MaxDiskWriteNode    MaxDiskWriteTask   
MaxPages            MaxPagesNode        MaxPagesTask        MaxRSS             
MaxRSSNode          MaxRSSTask          MaxVMSize           MaxVMSizeNode      
MaxVMSizeTask       MinCPU              MinCPUNode          MinCPUTask         
Nodelist            NTasks              Pids                ReqCPUFreq         
ReqCPUFreqMin       ReqCPUFreqMax       ReqCPUFreqGov       TRESUsageInAve     
TRESUsageInMax      TRESUsageInMaxNode  TRESUsageInMaxTask  TRESUsageInMin     
TRESUsageInMinNode  TRESUsageInMinTask  TRESUsageInTot      TRESUsageOutAve    
TRESUsageOutMax     TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutMin    
TRESUsageOutMinNode TRESUsageOutMinTask TRESUsageOutTot

sacct

The sacct utility extracts job status and job events from the accounting records. It can help identify where, how, and why a job failed. It takes a job ID as a parameter.

sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed,nodelist

To see all available format options:

$ sacct --helpformat
Account             AdminComment        AllocCPUS           AllocGRES          
AllocNodes          AllocTRES           AssocID             AveCPU             
AveCPUFreq          AveDiskRead         AveDiskWrite        AvePages           
AveRSS              AveVMSize           BlockID             Cluster            
Comment             Constraints         ConsumedEnergy      ConsumedEnergyRaw  
CPUTime             CPUTimeRAW          DBIndex             DerivedExitCode    
Elapsed             ElapsedRaw          Eligible            End                
ExitCode            Flags               GID                 Group              
JobID               JobIDRaw            JobName             Layout             
MaxDiskRead         MaxDiskReadNode     MaxDiskReadTask     MaxDiskWrite       
MaxDiskWriteNode    MaxDiskWriteTask    MaxPages            MaxPagesNode       
MaxPagesTask        MaxRSS              MaxRSSNode          MaxRSSTask         
MaxVMSize           MaxVMSizeNode       MaxVMSizeTask       McsLabel           
MinCPU              MinCPUNode          MinCPUTask          NCPUS              
NNodes              NodeList            NTasks              Priority           
Partition           QOS                 QOSRAW              Reason             
ReqCPUFreq          ReqCPUFreqMin       ReqCPUFreqMax       ReqCPUFreqGov      
ReqCPUS             ReqGRES             ReqMem              ReqNodes           
ReqTRES             Reservation         ReservationId       Reserved           
ResvCPU             ResvCPURAW          Start               State              
Submit              Suspended           SystemCPU           SystemComment      
Timelimit           TimelimitRaw        TotalCPU            TRESUsageInAve     
TRESUsageInMax      TRESUsageInMaxNode  TRESUsageInMaxTask  TRESUsageInMin     
TRESUsageInMinNode  TRESUsageInMinTask  TRESUsageInTot      TRESUsageOutAve    
TRESUsageOutMax     TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutMin    
TRESUsageOutMinNode TRESUsageOutMinTask TRESUsageOutTot     UID                
User                UserCPU             WCKey               WCKeyID
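
sacct can also report on past jobs by user and time window instead of a single job ID. For example, to list all of your jobs that started after a given date (the date below is a placeholder):

sacct -u <username> --starttime=2024-01-01 --format=JobID,JobName,State,Elapsed,MaxRSS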

scancel

Cancels your job. It takes the job identifier as a parameter.

$ scancel <jobid>

To cancel all the jobs for a user:

scancel -u <username>

To cancel all the pending jobs for a user:

scancel -t PENDING -u <username>
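
scancel can also select jobs by name, which is convenient when several related jobs share the same --job-name (my_job is a placeholder):

scancel --name=my_job -u <username>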

seff

SLURM provides a tool called seff to check the memory utilization and CPU efficiency of completed jobs. Note that for running and failed jobs the efficiency numbers reported by seff are not reliable, so please use this tool only for successfully completed jobs:

$ seff <job_id>

How to set up email notifications

You may want SLURM to send you email on certain events affecting your jobs. For example, if your job is expected to run for a long time, you might want the scheduler to email you if it fails or when it completes. You can also be notified by email when a job starts.

Add the following to your batch script to set the email address:

#SBATCH --mail-user=your@email.org

These are the different types of notifications:

#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --mail-type=ALL
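
Putting both directives together, a job that emails you when it starts, ends or fails would include a header block like this (the address is a placeholder):

#SBATCH --job-name=long_run
#SBATCH --mail-user=your@email.org
#SBATCH --mail-type=BEGIN,END,FAIL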

Job states

Status     Description
PENDING    Job is in the queue awaiting to be scheduled.
HOLD       Job is in the queue, but was put on hold.
RUNNING    Job has been granted an allocation and is running.
COMPLETED  Job completed successfully.
TIMEOUT    Job ran longer than the wallclock limit and was terminated.
FAILED     Job terminated with a non-zero exit status.
NODE_FAIL  Job terminated after a compute node reported a failure.

Dependency chains

Job dependencies are used to defer the start of a job until some dependencies have been satisfied. Job dependencies can be defined using the --dependency argument of the sbatch command:

$ sbatch --dependency=<dependency_type>:<jobID> batch_script

or can also be declared within the batch script:

#!/bin/bash
#SBATCH --dependency=<dependency_type>

Available dependencies are:

  • after:<jobID> job starts once the job with jobID has begun execution.
  • afterany:<jobID> job starts once the job with jobID terminates.
  • afterok:<jobID> job starts once the job with jobID terminates successfully (exit code zero).
  • afternotok:<jobID> job starts once the job with jobID terminates with a non-zero status.
  • singleton job starts once any previously submitted job with the same job name and user has terminated. This can be used to chain restartable jobs.
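
A typical chain captures the job ID of one submission and feeds it to the next. sbatch's --parsable option prints only the job ID, which makes this easy to script (the script names below are placeholders):

jobid1=$(sbatch --parsable step1.slurm)
jobid2=$(sbatch --parsable --dependency=afterok:${jobid1} step2.slurm)
sbatch --dependency=afterok:${jobid2} step3.slurm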

Requeuing policy

If your job is running on a node that goes into a failed state for any reason (freeze, reboot, memory error, etc.), Slurm will attempt to requeue it. This can be beneficial for jobs that use checkpoints and can resume execution close to the point where they were aborted, but it can be dangerous otherwise, as the existing outputs are likely to be overwritten by the new run that starts from scratch.

To avoid this behavior, you need to add the header #SBATCH --no-requeue to your Slurm submit scripts.

Warning

Add #SBATCH --no-requeue to your Slurm submit scripts to prevent your jobs from being requeued.

Some useful SLURM/OS Environment Variables

You can use any of the following environment variables in your batch scripts:

Variable              Description
SLURM_SUBMIT_DIR      Directory from which the user submitted the job.
SLURM_JOB_NODELIST    List of nodes assigned to the job.
SLURM_JOBID           Job's SLURM identifier.
SLURM_NTASKS          Number of tasks requested.
SLURM_CPUS_PER_TASK   Number of threads/cpus per task/process.
USER                  Username of the user that submitted the job.
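
A short sketch of how these variables can be used inside a batch script (my_program is a placeholder for your executable):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:30:00

cd "${SLURM_SUBMIT_DIR}"                      # start from the submission directory
echo "Job ${SLURM_JOBID} submitted by ${USER}"
echo "Running on nodes: ${SLURM_JOB_NODELIST}"

# Match the threading of the application to the requested CPUs per task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./my_program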