Slurm Job Monitoring FAQ
Frequently asked questions about monitoring jobs, understanding job states, and checking cluster status.
Quick Links
- How do I check my job status?
- What do job states mean?
- Why is my job pending?
- How do I see detailed job information?
- How do I check completed job information?
- How do I check job efficiency?
- How do I check cluster availability?
- How do I see node details?
- What QOS and accounts can I use?
- How do I cancel jobs?
- Can I modify a pending job?
How do I check my job status?
Use squeue to view your jobs:
    $ squeue -u $USER
    JOBID   PARTITION  NAME        USER     ST  TIME     NODES  NODELIST(REASON)
    123456  compute    analysis    unityID  R   2:15:30  1      c001n01
    123457  compute    preprocess  unityID  PD  0:00     1      (Priority)
Key columns:
- JOBID - Unique job identifier
- ST - Job state (R=running, PD=pending)
- TIME - Elapsed time for running jobs
- NODELIST(REASON) - Node assignment or pending reason
Show more details
squeue -u $USER -l
Show only running jobs
squeue -u $USER -t RUNNING
Show only pending jobs
squeue -u $USER -t PENDING
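When many jobs are queued, a per-state count can be easier to scan than the full listing. The sketch below assumes squeue's `-h` (no header) and `-o "%T"` (full state name) options; `tally_states` is a hypothetical helper name, not a command provided by Slurm or the cluster.

```bash
# tally_states: count identical lines on stdin and print "<value> <count>".
# Hypothetical helper; expects one job state per line, for example from:
#   squeue -u "$USER" -h -o "%T"
tally_states() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Typical use on a login node (needs Slurm, so left commented out here):
#   squeue -u "$USER" -h -o "%T" | tally_states
```

The function only reads text from stdin, so it can be tried with sample input such as `printf 'RUNNING\nPENDING\nRUNNING\n' | tally_states` before pointing it at a live queue.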
What do job states mean?
| State | Code | Description |
|---|---|---|
| PENDING | PD | Waiting for resources or dependencies |
| RUNNING | R | Currently executing |
| COMPLETING | CG | Finishing up (epilog running) |
| COMPLETED | CD | Finished successfully (exit code 0) |
| FAILED | F | Finished with non-zero exit code |
| TIMEOUT | TO | Exceeded time limit |
| CANCELLED | CA | Cancelled by user or admin |
| NODE_FAIL | NF | Node failure during execution |
| OUT_OF_MEMORY | OOM | Exceeded memory limit |
Why is my job pending?
The REASON column in squeue output explains why:
| Reason | Meaning | Action |
|---|---|---|
| Priority | Other jobs have higher priority | Wait, or use short QOS for small jobs |
| Resources | Waiting for requested resources to become available | Wait, or reduce resource request |
| QOSMaxCpuPerUserLimit | You've hit your CPU limit for this QOS | Wait for running jobs to finish |
| QOSMaxJobsPerUserLimit | You've hit your job count limit | Wait for running jobs to finish |
| Dependency | Waiting for dependent job to complete | Wait for dependency to resolve |
| PartitionNodeLimit | Requesting more nodes than partition allows | Reduce node count |
| PartitionTimeLimit | Requested time exceeds partition limit | Reduce time or use different QOS |
| ReqNodeNotAvail | Required nodes are down or reserved | Remove constraint or wait |
| AssocGrpCPURunMinutesLimit | Account has reached its limit on CPU-minutes committed to running jobs | Contact HPC support |
See Job Priority and Fairshare for details on how job priority is calculated.
Get detailed pending reason
squeue -j JOBID -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
Estimate start time
squeue -j JOBID --start
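The reason table above can also be kept as a small lookup for use in scripts. This is a sketch only: `pending_hint` is a hypothetical helper, the actions are copied from the table, and the wildcard on `ReqNodeNotAvail` assumes that squeue may append extra detail after that reason name.

```bash
# pending_hint: print the suggested action for a squeue pending reason.
# Hypothetical helper; the reason/action pairs are copied from the table above.
pending_hint() {
    case "$1" in
        Priority)                   echo "Wait, or use short QOS for small jobs" ;;
        Resources)                  echo "Wait, or reduce resource request" ;;
        QOSMaxCpuPerUserLimit)      echo "Wait for running jobs to finish" ;;
        QOSMaxJobsPerUserLimit)     echo "Wait for running jobs to finish" ;;
        Dependency)                 echo "Wait for dependency to resolve" ;;
        PartitionNodeLimit)         echo "Reduce node count" ;;
        PartitionTimeLimit)         echo "Reduce time or use different QOS" ;;
        ReqNodeNotAvail*)           echo "Remove constraint or wait" ;;
        AssocGrpCPURunMinutesLimit) echo "Contact HPC support" ;;
        *)                          echo "No hint available for: $1" ;;
    esac
}

# Typical use (needs Slurm, so left commented out here); %r prints the bare reason:
#   pending_hint "$(squeue -j JOBID -h -o "%r")"
```

Reasons that are not in the table fall through to the last branch rather than being guessed at.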
How do I see detailed job information?
The sj helper prints a concise summary of a job's resource request and key metadata — partition, QOS, node/task/CPU counts, memory, GPU request, constraints, time limit, and submit/start/end times. It reads the job from the controller while it is pending, running, or recently completed, and falls back to the accounting database for older jobs.
    sj JOBID     # request + state for a job
    sj JOBID_7   # an array task element
For the complete record (working directory, command, environment, every parameter), use scontrol show job:
scontrol show job JOBID
This shows all job parameters including:
- Submit time and start time
- Requested resources (CPUs, memory, time)
- Node assignment
- Working directory and command
- Environment variables
How do I check completed job information?
Use sacct to view completed jobs:
    # Jobs from today
    sacct

    # Specific job
    sacct -j JOBID

    # Jobs from last 7 days
    sacct --starttime=$(date -d '7 days ago' +%Y-%m-%d)
Useful output formats
    # Show elapsed time and exit status
    sacct -j JOBID --format=JobID,JobName,Elapsed,State,ExitCode

    # Show resource usage
    sacct -j JOBID --format=JobID,JobName,MaxRSS,MaxVMSize,CPUTime,Elapsed
Common sacct fields
| Field | Description |
|---|---|
| Elapsed | Actual runtime |
| CPUTime | Allocated CPU time (allocated cores × elapsed time), not the CPU time actually consumed |
| MaxRSS | Peak resident memory (RSS) of any task in the job step |
| MaxVMSize | Maximum virtual memory |
| State | Final job state |
| ExitCode | Exit code (0 = success) |
| ReqMem | Requested memory |
| ReqCPUS | Requested CPUs |
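sacct reports memory fields such as MaxRSS with a unit suffix, which makes them awkward to compare directly against a request. The sketch below converts such a value to MiB; `rss_to_mib` is a hypothetical helper, and treating a value with no suffix as bytes is an assumption rather than documented sacct behaviour.

```bash
# rss_to_mib: convert a memory value such as "2048K", "512M" or "1.5G" to MiB.
# Hypothetical helper; a value with no recognised suffix is ASSUMED to be in bytes.
rss_to_mib() {
    echo "$1" | awk '{
        value = $1 + 0                   # numeric prefix, e.g. 2048 from "2048K"
        unit  = substr($1, length($1))   # last character, e.g. "K"
        if      (unit == "K") value /= 1024
        else if (unit == "G") value *= 1024
        else if (unit == "T") value *= 1024 * 1024
        else if (unit != "M") value /= 1024 * 1024
        printf "%.1f\n", value
    }'
}

# Typical use (needs Slurm, so left commented out here):
#   sacct -j JOBID -n -P --format=JobID,MaxRSS
# prints pipe-separated "JobID|MaxRSS" lines whose second field can be passed in.
```

For example, `rss_to_mib 1.5G` prints `1536.0`.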
How do I check job efficiency?
Use seff for a quick efficiency report:
    $ seff JOBID
    Job ID: 123456
    Cluster: cluster
    User/Group: unityID/users
    State: COMPLETED (exit code 0)
    Cores: 8
    CPU Utilized: 06:45:23
    CPU Efficiency: 84.45% of 08:00:00 core-walltime
    Memory Utilized: 12.5 GB
    Memory Efficiency: 78.12% of 16.00 GB
Low CPU efficiency may indicate:
- Requesting more cores than the code can use
- I/O bottlenecks
- Load imbalance in parallel code
Low memory efficiency means you could request less memory.
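The CPU efficiency figure is the CPU time actually used divided by the core-walltime, that is, cores × elapsed time. The sketch below reproduces that arithmetic for values you already have in hand; `to_seconds` and `cpu_efficiency` are hypothetical helper names, and the numbers in the usage note are made up for illustration.

```bash
# to_seconds: convert a Slurm duration ([DD-]HH:MM:SS) to seconds.
to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) print $1 * 86400 + $2 * 3600 + $3 * 60 + $4   # DD-HH:MM:SS
        else         print $1 * 3600 + $2 * 60 + $3                # HH:MM:SS
    }'
}

# cpu_efficiency CPU_UTILIZED CORES WALLTIME
# Prints the percentage of allocated core-time that was actually used.
# Hypothetical helper; it does not guard against a zero wall time.
cpu_efficiency() {
    used=$(to_seconds "$1")
    wall=$(to_seconds "$3")
    awk -v used="$used" -v cores="$2" -v wall="$wall" \
        'BEGIN { printf "%.2f\n", 100 * used / (cores * wall) }'
}
```

A job that used 06:00:00 of CPU time on 4 cores over a 02:00:00 run gives `cpu_efficiency 06:00:00 4 02:00:00`, which prints `75.00`.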
How do I check cluster availability?
The si helper gives a concise per-partition summary of how many nodes are currently free (idle or mixed), grouped by CPU architecture on the compute partitions and by GPU model on the GPU partitions:
    $ si
    Partition........... Available Nodes
    compute............. 45 Cascadelake 20 Icelake 5 Genoa
    gpu................. 8 H100 2 A100
Other useful forms:
    si -p gpu     # restrict to one partition
    si --memory   # group free nodes by total memory size
    si --all      # include nodes in every state, not just free
"Free" counts nodes whose state is idle (available, no jobs) or mix (some cores in use, some available). The raw sinfo command lists every partition/state group, including states si hides by default: alloc (fully allocated), down (unavailable), and drain (being drained for maintenance).
Equivalent native commands: sinfo, sinfo -s, sinfo -o "%P %a %D %c %m %G". In sinfo -o output, GRES on compute nodes reports the CPU architecture as a typed resource (e.g., cpu:haswell:20, cpu:cascadelake:32, cpu:genoa:192) and on GPU nodes the GPU type (e.g., gpu:a100:4).
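The "free" count described above can be approximated from raw sinfo output by summing the node counts of the idle and mix state groups. This is a sketch only: `count_free_nodes` is a hypothetical helper, it relies on sinfo's `%D` (node count) and `%t` (compact state) format fields, and it deliberately ignores states that carry a flag suffix such as `idle*`. Whether si treats flagged states the same way is not stated here.

```bash
# count_free_nodes: sum node counts whose state is exactly "idle" or "mix".
# Hypothetical helper; expects "<node count> <state>" per line, for example from:
#   sinfo -h -p compute -o "%D %t"
count_free_nodes() {
    awk '$2 == "idle" || $2 == "mix" { free += $1 } END { print free + 0 }'
}

# Typical use (needs Slurm, so left commented out here):
#   sinfo -h -p compute -o "%D %t" | count_free_nodes
```

With the sample input `printf '45 idle\n20 mix\n10 alloc\n' | count_free_nodes` the function prints `65`.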
How do I see node details?
si --nodes prints a per-node table. By default it shows available / allocated / offline / total cores; add --memory or --gpus for per-node memory or GPU counts instead:
    si --nodes            # per-node cores
    si --nodes --memory   # per-node memory (free / allocated / total)
    si --nodes --gpus     # per-node GPUs (GPU nodes only)
    si --nodes -p gpu     # restrict to one partition
Show a specific node
scontrol show node NODENAME
Equivalent native commands: sinfo -N -l, sinfo -p gpu -o "%N %G %t %C".
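In the native sinfo output, the `%C` field packs four CPU counts into one value in the order allocated/idle/other/total. The sketch below splits that field to list nodes that still have idle cores; `nodes_with_idle_cores` is a hypothetical helper, and a node that belongs to several partitions may appear more than once in the input.

```bash
# nodes_with_idle_cores: print "<node> <idle cores>" for nodes with at least one idle core.
# Hypothetical helper; expects "<node> <allocated/idle/other/total>" per line, e.g. from:
#   sinfo -h -N -p gpu -o "%N %C"
nodes_with_idle_cores() {
    awk '{
        split($2, cpus, "/")               # cpus[1..4] = allocated, idle, other, total
        if (cpus[2] > 0) print $1, cpus[2]
    }'
}

# Typical use (needs Slurm, so left commented out here):
#   sinfo -h -N -p gpu -o "%N %C" | nodes_with_idle_cores
```

Fully allocated nodes, where the idle count is 0, are dropped from the output.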
Graphical view
See the cluster status page for a visual representation.
What QOS and accounts can I use?
Two helpers answer this without parsing raw sacctmgr output. sqos is usually what you want: it lists the QOS you can submit with, which partitions each is valid on, and the limits that apply (wall time, and per-user / per-job CPU, GPU, and memory caps).
    $ sqos
    QOS     Partitions        MaxWall      MaxTRES/User  MaxTRES/Job  GrpTRES
    normal  compute           4-00:00:00   cpu=512       -            -
    long    compute           10-00:00:00  cpu=512       -            -
    gpu     gpu               4-00:00:00   -             -            -
    short   compute_partners  02:00:00     -             -            -
Add -v to also show each QOS's priority and flags. Pass a login (e.g. sqos alice) to see another user's QOS.
sa shows your underlying associations — the accounts you can charge to, the partitions, your default QOS, and the full allowed QOS list per account:
    sa          # your associations (account, partition, default QOS, allowed QOS)
    sa --tree   # your place in the account hierarchy, back to root
    sa alice    # another user's associations
If a job is rejected for an invalid QOS or partition, sqos will show which combinations are actually open to you. See Partitions and Resources for the full QOS / partition reference and Job Priority and Fairshare for how QOS affects scheduling.
Equivalent native command: sacctmgr show assoc user=$USER format=account,qos,maxcpus,maxnodes.
How do I cancel jobs?
Use scancel:
    # Cancel specific job
    scancel JOBID

    # Cancel all your jobs
    scancel -u $USER

    # Cancel all pending jobs
    scancel -u $USER -t PENDING

    # Cancel jobs by name
    scancel -n jobname

    # Cancel array job tasks
    scancel JOBID_[1-50]
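Bulk cancels cannot be undone, so it can help to review exactly which jobs would be affected first. The sketch below prints one scancel command per job ID without running anything; `cancel_preview` is a hypothetical helper, and the usage line assumes squeue's `-h` (no header) and `-o "%i"` (job ID) options.

```bash
# cancel_preview: print the scancel command for each job ID on stdin without running it.
# Hypothetical helper for reviewing a bulk cancel before committing to it.
cancel_preview() {
    while read -r job_id; do
        if [ -n "$job_id" ]; then      # skip blank lines
            echo "scancel $job_id"
        fi
    done
}

# Typical use (needs Slurm, so left commented out here):
#   squeue -u "$USER" -t PENDING -h -o "%i" | cancel_preview
```

Once the printed list looks right, run the matching scancel command yourself.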
Can I modify a pending job?
Yes, use scontrol update for pending jobs:
    # Change time limit
    scontrol update jobid=JOBID TimeLimit=4:00:00

    # Change partition
    scontrol update jobid=JOBID Partition=gpu

    # Change QOS
    scontrol update jobid=JOBID QOS=short

    # Change job name
    scontrol update jobid=JOBID JobName=newname
Note: You cannot increase resources beyond the original request for a running job, but you can decrease its time limit.
Hold and release jobs
    # Hold a pending job
    scontrol hold JOBID

    # Release a held job
    scontrol release JOBID