High Performance Computing | Slurm Job Monitoring FAQ

How do I check my job status?

Use squeue to view your jobs. Plain squeue may list every user's jobs, so restrict it to your own with -u:

$ squeue -u $USER
JOBID   PARTITION  NAME       USER     ST  TIME     NODES  NODELIST(REASON)
123456  compute    analysis   unityID  R   2:15:30  1      c001n01
123457  compute    preprocess unityID  PD  0:00     1      (Priority)

Key columns:

JOBID - Unique job identifier
ST - Job state (R=running, PD=pending)
TIME - Elapsed time for running jobs
NODELIST(REASON) - Node assignment or pending reason

Show more details

squeue -l

Show only running jobs

sq --run

Show only pending jobs

sq --pend
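
For a quick count of your jobs by state, squeue output can be tallied with standard tools. The sketch below is one way to do it; summarize_states is a hypothetical helper name, not a site command. It expects one state code per line, which is what squeue prints with -h (no header) and -o "%t" (compact state code).

```shell
# summarize_states: hypothetical helper, not part of Slurm or the local tools.
# Reads one state code per line on stdin and prints CODE=COUNT for each state.
summarize_states() {
    sort | uniq -c | awk '{ printf "%s=%s\n", $2, $1 }'
}

# Intended use on a login node (requires Slurm):
#   squeue -u "$USER" -h -o "%t" | summarize_states
```

With two running jobs and one pending job, this prints PD=1 and R=2 on separate lines.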

What do job states mean?

State	Code	Description
PENDING	PD	Waiting for resources or dependencies
RUNNING	R	Currently executing
COMPLETING	CG	Finishing up (epilog running)
COMPLETED	CD	Finished successfully (exit code 0)
FAILED	F	Finished with non-zero exit code
TIMEOUT	TO	Exceeded time limit
CANCELLED	CA	Cancelled by user or admin
NODE_FAIL	NF	Node failure during execution
OUT_OF_MEMORY	OOM	Exceeded memory limit
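
Scripts that poll a job usually need to know whether a state is final. The sketch below encodes the table above as a shell test; is_finished_state is a hypothetical helper name. It accepts both the short code and the long name, and treats anything else (PD, R, CG, or an unknown value) as not finished.

```shell
# is_finished_state: hypothetical helper based on the state table above.
# Returns 0 (true) when the given state code or name is a final state.
is_finished_state() {
    case "$1" in
        CD|COMPLETED|F|FAILED|TO|TIMEOUT|CA|CANCELLED*|NF|NODE_FAIL|OOM|OUT_OF_MEMORY)
            return 0 ;;
        *)
            return 1 ;;   # PENDING, RUNNING, COMPLETING, or unrecognized
    esac
}

# Example: is_finished_state "$state" && echo "job is done"
```

The CANCELLED* pattern is there because accounting output can append text such as "by UID" to that state name.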

Why is my job pending?

The NODELIST(REASON) column in sq --pend output shows, in parentheses, why each job is waiting:

Reason	Meaning	Action
Priority	Other jobs have higher priority	Wait, or use the short QOS for small jobs
Resources	Waiting for requested resources to become available	Wait, or reduce resource request
QOSMaxCpuPerUserLimit	You've hit your CPU limit for this QOS	Wait for running jobs to finish
QOSMaxJobsPerUserLimit	You've hit your job count limit	Wait for running jobs to finish
Dependency	Waiting for dependent job to complete	Wait for dependency to resolve
PartitionNodeLimit	Requesting more nodes than partition allows	Reduce node count
PartitionTimeLimit	Requested time exceeds partition limit	Reduce time or use different QOS
ReqNodeNotAvail	Required nodes are down or reserved	Remove constraint or wait
AssocGrpCPURunMinutesLimit	Account's running jobs already hold its limit of CPU-minutes	Wait for running jobs to finish, or contact HPC support

See Job Priority and Fairshare for details on how job priority is calculated.

Get detailed pending reason

squeue -j JOBID -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"

Estimate start time

squeue -j JOBID --start

How do I see detailed job information?

The sj local Slurm command prints a concise summary of a job's resource request and key metadata — partition, QOS, node/task/CPU counts, memory, GPU request, constraints, time limit, and submit/start/end times. It reads the job from the controller while it is pending, running, or recently completed, and falls back to the accounting database for older jobs.

sj JOBID                # request + state for a job
sj JOBID_7              # an array task element

For the complete record (working directory, command, output paths, every scheduling parameter), use scontrol show job:

scontrol show job JOBID

This shows all job parameters including:

Submit time and start time
Requested resources (CPUs, memory, time)
Node assignment
Working directory and command
Standard output and error file paths

Is my running job making progress?

squeue tells you a job is running, not whether it is doing any work, and seff only reports once the job has finished. To see what a job is doing right now, use the sjs local Slurm command, which summarizes the live usage counters Slurm collects on the compute nodes:

$ sjs 464645
464645  unityID, RUNNING, elapsed 4:44:21
  alloc: cpu=12,mem=48000M,node=1,billing=23

Step          NTasks  AveCPU    %CPU  MaxRSS  %Mem  MaxDiskRead  MaxDiskWrite
464645.batch  1       04:55:28  9%    9.24G   20%   181.72G      235.09G

Column	Meaning
alloc	What the job actually holds: cores, memory, nodes
AveCPU	CPU time the step has accumulated so far
%CPU	That CPU time as a fraction of what the allocation could have used (elapsed × allocated cores). 100% means every requested core is busy
MaxRSS	Peak memory used on any one node
%Mem	That peak as a fraction of the memory allocated per node
MaxDiskRead / MaxDiskWrite	Bytes read and written by the step so far (running totals, so they only grow)

The job above is healthy but wasteful: %CPU of 9% on a 12-core request means it is using about one core, so 11 sit idle for the life of the job. %Mem of 20% says the 48 GB request is generous too. Both are worth fixing — a smaller request starts sooner and costs your group less. See How do I check job efficiency? for the same idea applied after a job finishes.
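
The %CPU figure can be reproduced by hand from the sample output, which helps when reading raw counters. The sketch below assumes a single-task step (so AveCPU is the step's total CPU time) and uses hypothetical helper names to_seconds and pct_cpu. It illustrates the arithmetic described in the table, not how sjs is implemented.

```shell
# to_seconds: hypothetical helper. Converts [DD-]HH:MM:SS or MM:SS to seconds.
to_seconds() {
    awk -v t="$1" 'BEGIN {
        days = 0
        if (split(t, d, "-") == 2) { days = d[1]; t = d[2] }
        n = split(t, p, ":")
        if (n == 3)      s = p[1] * 3600 + p[2] * 60 + p[3]
        else if (n == 2) s = p[1] * 60 + p[2]
        else             s = p[1]
        printf "%d\n", days * 86400 + s
    }'
}

# pct_cpu: CPU time used as a percentage of elapsed time x allocated cores.
# Arguments: CPU time, elapsed time, allocated cores.
pct_cpu() {
    cpu=$(to_seconds "$1")
    elapsed=$(to_seconds "$2")
    awk -v c="$cpu" -v e="$elapsed" -v n="$3" 'BEGIN {
        if (e * n == 0) { print 0; exit }    # job just started; avoid dividing by zero
        printf "%.0f\n", 100 * c / (e * n)
    }'
}

# Sample job above: 04:55:28 of CPU time over 4:44:21 elapsed on 12 cores.
pct_cpu 04:55:28 4:44:21 12    # prints 9
```

That is 17728 CPU-seconds out of 17061 × 12 = 204732 available, or roughly one busy core out of twelve.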

Making progress, or stuck?

A single snapshot cannot tell a busy job from a wedged one — both just sit there. Add -r to re-sample and show what changed since the last sample:

sjs JOBID -r            # re-sample every 30 seconds until Ctrl-C
sjs JOBID -r 60         # every 60 seconds

This adds dCPU, dRead, and dWrite columns and a Status of running or idle?:

No CPU and no I/O across an interval (idle?) is the fingerprint of a stuck job — deadlocked, or blocked waiting on a lock, a network mount, or a license server.
Little CPU but growing I/O is normal for an I/O-bound job. It is working; the disk is the bottleneck, and more cores will not help.

Slurm only updates these counters every 30 seconds, so sjs uses that as both the default and the minimum interval — sampling faster would just show the same numbers twice and look like a stall.
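
The same comparison can be approximated with native tools by sampling sstat twice and checking whether anything changed. The sketch below is an assumption about how one might do this by hand, not the sjs implementation; interval_status and watch_step are hypothetical names, and the loop only makes sense on a system with Slurm and a running .batch step.

```shell
# interval_status: hypothetical helper. Compares two raw counter snapshots
# (any text) and reports "idle?" when nothing changed between them.
interval_status() {
    if [ "$1" = "$2" ]; then
        echo "idle?"
    else
        echo "running"
    fi
}

# watch_step: hypothetical polling loop around sstat. Defined only, not run here.
# Arguments: job ID, optional interval in seconds (default 30).
watch_step() {
    prev=""
    while cur=$(sstat -j "$1.batch" -n -P --format=AveCPU,MaxDiskRead,MaxDiskWrite) \
          && [ -n "$cur" ]; do
        if [ -n "$prev" ]; then
            echo "$(date +%T) $(interval_status "$prev" "$cur") $cur"
        fi
        prev=$cur
        sleep "${2:-30}"
    done
}
```

Comparing the raw text avoids parsing units. The cost is that any change at all, however small, counts as progress.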

Array jobs

sjs 464713_64           # one array element
sjs 464713              # every running element of the array

The whole-array view adds Task and Elapsed columns. Elements start at different times, so each is measured against its own clock — an element that started a minute ago is not compared to one that has run for hours.

Which step is shown

A Slurm job is made of steps. For a typical sbatch job the work happens in the .batch step, and that is what sjs shows. If your script launches work with srun, those steps are shown instead, since .batch would then be just the script's shell sitting idle while the real work runs elsewhere. To see everything, including the always-idle .extern step:

sjs JOBID --all-steps

Only running jobs have live counters. For a job's resource request see sj; for a finished job's efficiency see seff.

Equivalent native command: sstat -j JOBID.batch. Note that sstat requires a job step, not just a job ID: plain sstat JOBID looks for step .0, which a batch job does not have, so it returns nothing and the job looks dead. sstat also does not accept array notation such as 464713_64. sjs works out the step and resolves array elements for you, and adds the %CPU / %Mem comparisons against what the job actually requested.

How do I check completed job information?

Use sacct to view completed jobs:

# Jobs from today
sacct

# Specific job
sacct -j JOBID

# Jobs from last 7 days
sacct --starttime=$(date -d '7 days ago' +%Y-%m-%d)

Useful output formats

# Show elapsed time and exit status
sacct -j JOBID --format=JobID,JobName,Elapsed,State,ExitCode

# Show resource usage
sacct -j JOBID --format=JobID,JobName,MaxRSS,MaxVMSize,CPUTime,Elapsed

Common sacct fields

Field	Description
Elapsed	Actual runtime
CPUTime	Allocated CPU time (elapsed × allocated cores), not CPU time actually used; see TotalCPU for usage
MaxRSS	Maximum memory used
MaxVMSize	Maximum virtual memory
State	Final job state
ExitCode	Exit code (0 = success)
ReqMem	Requested memory
ReqCPUS	Requested CPUs
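
sacct reports MaxRSS per step and leaves it blank on the job's own line, so finding a job's peak memory means taking the largest value across its steps. The sketch below shows one way to do that from pipe-delimited output; mem_to_mib and peak_rss_mib are hypothetical helper names, and the unit handling assumes the K/M/G/T suffixes sacct normally prints.

```shell
# mem_to_mib: hypothetical helper. Converts values such as 2048K, 512M or
# 9.24G to whole MiB. A value with no suffix is assumed to be bytes.
mem_to_mib() {
    awk -v v="$1" 'BEGIN {
        unit = substr(v, length(v))
        num = v + 0                      # awk keeps the leading number only
        if (unit == "K")      num /= 1024
        else if (unit == "G") num *= 1024
        else if (unit == "T") num *= 1024 * 1024
        else if (unit != "M") num /= 1024 * 1024
        printf "%.0f\n", num
    }'
}

# peak_rss_mib: reads "JobID|MaxRSS" lines on stdin and prints the largest
# MaxRSS in MiB. Lines with an empty MaxRSS (the job line) are skipped.
peak_rss_mib() {
    max=0
    while IFS='|' read -r id rss; do
        [ -n "$rss" ] || continue
        mib=$(mem_to_mib "$rss")
        if [ "$mib" -gt "$max" ]; then max=$mib; fi
    done
    echo "$max"
}

# Intended use (requires Slurm):
#   sacct -j JOBID -n -P --format=JobID,MaxRSS | peak_rss_mib
```

Comparing the result with ReqMem gives the same signal as seff's memory efficiency line.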

How do I check job efficiency?

Use seff for a quick efficiency report:

$ seff JOBID
Job ID: 123456
Cluster: cluster
User/Group: unityID/users
State: COMPLETED (exit code 0)
Cores: 8
CPU Utilized: 06:45:23
CPU Efficiency: 84.45% of 08:00:00 core-walltime
Memory Utilized: 12.5 GB
Memory Efficiency: 78.12% of 16.00 GB

Low CPU efficiency may indicate:

Requesting more cores than the code can use
I/O bottlenecks
Load imbalance in parallel code

Low memory efficiency means you could request less memory.

seff only works once a job has finished. To check the same things while a job is still running, see Is my running job making progress?

How do I check cluster availability?

The si local Slurm command prints one table of resource availability. Each row is a partition and architecture with Avail / Alloc / Total counts — by default CPU cores, summed over the deployed nodes:

$ si
Partition  Architecture  Avail  Alloc  Total
compute    Cascadelake     468    364    896
compute    Genoa           312    128    512
gpu        Genoa            72     56    128

Other useful forms:

si -p gpu               # restrict to one partition
si --gpus               # report GPUs (architecture becomes GPU model)
si --memory             # report memory in GiB
si --all                # also include down nodes (adds a Down/Drain column)

Avail counts resources on idle (no jobs) or mix (partly used) nodes that can accept a new job; Alloc is in use; Total is the full deployed capacity — idle, mix, alloc (fully allocated), and drain (being drained for maintenance) nodes. down nodes are excluded unless you add --all, which counts their capacity under Total with Avail 0 and breaks out a Down/Drain column.

Equivalent native commands: sinfo, sinfo -s, sinfo -o "%P %a %D %c %m %G". In sinfo -o output, GRES on compute nodes reports the CPU architecture as a typed resource (e.g., cpu:haswell:20, cpu:cascadelake:32, cpu:genoa:192) and on GPU nodes the GPU type (e.g., gpu:a100:4).
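
For core counts specifically, sinfo's %C field prints CPUs as allocated/idle/other/total in a single slash-separated value. The sketch below splits that value into labeled numbers; split_aiot is a hypothetical helper name, and the sample value is illustrative. Note that sinfo's idle count is not necessarily the same figure as si's Avail column.

```shell
# split_aiot: hypothetical helper. Splits sinfo's %C value
# (allocated/idle/other/total) into labeled counts.
split_aiot() {
    echo "$1" | awk -F/ '{
        printf "alloc=%d idle=%d other=%d total=%d\n", $1, $2, $3, $4
    }'
}

split_aiot 364/468/64/896    # prints alloc=364 idle=468 other=64 total=896

# Intended use (requires Slurm):
#   sinfo -h -p compute -o "%C"
```

The "other" count covers cores on nodes that are down, drained, or otherwise unavailable, which is why allocated plus idle can be less than the total.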

How do I see node details?

si --nodes prints one row per node, in name order, with the same Avail / Alloc / Total columns. A node that belongs to several partitions is listed once, with all its partitions in the Partition column. Add --gpus or --memory to switch the resource:

si --nodes              # per-node cores
si --nodes --gpus       # per-node GPUs (GPU nodes only)
si --nodes --memory     # per-node memory (GiB)
si --nodes -p gpu       # restrict to one partition
si --nodes --drain      # only drained/draining nodes, with reason

Show a specific node

scontrol show node NODENAME

Equivalent native commands: sinfo -N -l, sinfo -p gpu -o "%N %G %t %C".

Graphical view

See the cluster status page for a visual representation.

What QOS and accounts can I use?

Two local Slurm commands answer this without parsing raw sacctmgr output. sqos is usually what you want: it lists the QOS you can submit with, which partitions each is valid on, and the limits that apply (wall time, and per-user / per-job CPU, GPU, and memory caps).

$ sqos
QOS     Partitions        MaxWall      MaxTRES/User     MaxTRES/Job  GrpTRES
normal  compute           4-00:00:00   cpu=512          -            -
long    compute           10-00:00:00  cpu=512          -            -
gpu     gpu               4-00:00:00   -                -            -
short   compute_partners  02:00:00     -                -            -

Add -v to also show each QOS's priority and flags.

sa shows your underlying associations — the accounts you can charge to, the partitions, your default QOS, and the full allowed QOS list per account:

sa                      # your associations (account, partition, default QOS, allowed QOS)
sa --tree               # your place in the account hierarchy, back to root

If a job is rejected for an invalid QOS or partition, sqos will show which combinations are actually open to you. See Partitions and Resources for the full QOS / partition reference and Job Priority and Fairshare for how QOS affects scheduling.

Equivalent native command: sacctmgr show assoc user=$USER format=account,qos,maxcpus,maxnodes.

How do I cancel jobs?

Use scancel:

# Cancel specific job
scancel JOBID

# Cancel all your jobs
scancel -u $USER

# Cancel all pending jobs
scancel -u $USER -t PENDING

# Cancel jobs by name
scancel -n jobname

# Cancel array job tasks
scancel JOBID_[1-50]

Can I modify a pending job?

Yes, use scontrol update for pending jobs:

# Change time limit
scontrol update jobid=JOBID TimeLimit=4:00:00

# Change partition
scontrol update jobid=JOBID Partition=gpu

# Change QOS
scontrol update jobid=JOBID QOS=short

# Change job name
scontrol update jobid=JOBID JobName=newname

Note: Regular users can lower a job's time limit but not raise it; increasing TimeLimit requires an administrator. A running job's resources cannot be increased beyond the original request.

Hold and release jobs

# Hold a pending job
scontrol hold JOBID

# Release a held job
scontrol release JOBID
