# Slurm Job Monitoring FAQ

Frequently asked questions about monitoring jobs, understanding job states, and checking cluster status.
## Quick Links
- How do I check my job status?
- What do job states mean?
- Why is my job pending?
- How do I see detailed job information?
- How do I check completed job information?
- How do I check job efficiency?
- How do I check cluster availability?
- How do I see node details?
- How do I cancel jobs?
- Can I modify a pending job?
## How do I check my job status?

Use `squeue` to view your jobs:

```
$ squeue -u $USER
  JOBID PARTITION       NAME    USER ST    TIME NODES NODELIST(REASON)
 123456   compute   analysis unityID  R 2:15:30     1 c001n01
 123457   compute preprocess unityID PD    0:00     1 (Priority)
```
Key columns:
- JOBID - Unique job identifier
- ST - Job state (R=running, PD=pending)
- TIME - Elapsed time for running jobs
- NODELIST(REASON) - Node assignment or pending reason
**Show more details:**

```
squeue -u $USER -l
```

**Show only running jobs:**

```
squeue -u $USER -t RUNNING
```

**Show only pending jobs:**

```
squeue -u $USER -t PENDING
```
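If you script around `squeue`, it can help to parse its table into structured records. A minimal sketch, assuming the eight default columns shown above; in practice you would feed it the output of `squeue -u $USER --noheader`, but here it parses a captured sample (`parse_squeue` is a hypothetical helper, not a Slurm tool):

```python
def parse_squeue(text):
    """Parse default squeue table rows into dicts (hypothetical helper)."""
    rows = []
    for line in text.strip().splitlines():
        # NODELIST(REASON) is the last field; split at most 7 times so
        # it survives intact even if it ever contains spaces.
        jobid, partition, name, user, state, time, nodes, nodelist = line.split(None, 7)
        rows.append({
            "jobid": jobid, "partition": partition, "name": name,
            "user": user, "state": state, "time": time,
            "nodes": int(nodes), "nodelist": nodelist,
        })
    return rows

sample = """\
123456 compute analysis unityID R 2:15:30 1 c001n01
123457 compute preprocess unityID PD 0:00 1 (Priority)
"""
jobs = parse_squeue(sample)
pending = [j["jobid"] for j in jobs if j["state"] == "PD"]
print(pending)  # ['123457']
```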
## What do job states mean?
| State | Code | Description |
|---|---|---|
| PENDING | PD | Waiting for resources or dependencies |
| RUNNING | R | Currently executing |
| COMPLETING | CG | Finishing up (epilog running) |
| COMPLETED | CD | Finished successfully (exit code 0) |
| FAILED | F | Finished with non-zero exit code |
| TIMEOUT | TO | Exceeded time limit |
| CANCELLED | CA | Cancelled by user or admin |
| NODE_FAIL | NF | Node failure during execution |
| OUT_OF_MEMORY | OOM | Exceeded memory limit |
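When processing `squeue` output in a script, the table above translates directly into a lookup. A small sketch (the dict and `is_terminal` helper are illustrative, not part of Slurm):

```python
# Map squeue's short state codes (the ST column) to full state names,
# mirroring the table above.
STATE_CODES = {
    "PD": "PENDING", "R": "RUNNING", "CG": "COMPLETING",
    "CD": "COMPLETED", "F": "FAILED", "TO": "TIMEOUT",
    "CA": "CANCELLED", "NF": "NODE_FAIL", "OOM": "OUT_OF_MEMORY",
}

def is_terminal(code):
    # Everything except PENDING, RUNNING, and COMPLETING is final.
    return code not in ("PD", "R", "CG")

print(STATE_CODES["PD"], is_terminal("PD"))  # PENDING False
```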
## Why is my job pending?

The REASON column in `squeue` output explains why:
| Reason | Meaning | Action |
|---|---|---|
| Priority | Other jobs have higher priority | Wait, or use short QOS for small jobs |
| Resources | Waiting for requested resources to become available | Wait, or reduce resource request |
| QOSMaxCpuPerUserLimit | You've hit your CPU limit for this QOS | Wait for running jobs to finish |
| QOSMaxJobsPerUserLimit | You've hit your job count limit | Wait for running jobs to finish |
| Dependency | Waiting for dependent job to complete | Wait for dependency to resolve |
| PartitionNodeLimit | Requesting more nodes than partition allows | Reduce node count |
| PartitionTimeLimit | Requested time exceeds partition limit | Reduce time or use different QOS |
| ReqNodeNotAvail | Required nodes are down or reserved | Remove constraint or wait |
| AssocGrpCPURunMinutesLimit | Account has used allocation | Contact HPC support |
See Job Priority and Fairshare for details on how job priority is calculated.
**Get detailed pending reason:**

```
squeue -j JOBID -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
```

**Estimate start time:**

```
squeue -j JOBID --start
```
## How do I see detailed job information?

Use `scontrol show job`:

```
scontrol show job JOBID
```
This shows all job parameters including:
- Submit time and start time
- Requested resources (CPUs, memory, time)
- Node assignment
- Working directory and command
- Environment variables
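`scontrol show job` prints space-separated `Key=Value` pairs, which are easy to turn into a dict. A naive sketch (the `parse_scontrol` helper and the sample excerpt are made up; note that values containing spaces, such as some `Command` paths, would need more careful splitting):

```python
def parse_scontrol(text):
    """Collect Key=Value tokens from scontrol output (naive sketch)."""
    info = {}
    for token in text.split():
        if "=" in token:
            key, _, value = token.partition("=")
            info[key] = value
    return info

# Shortened, made-up excerpt of `scontrol show job` output.
sample = "JobId=123456 JobName=analysis NumCPUs=8 TimeLimit=04:00:00"
job = parse_scontrol(sample)
print(job["NumCPUs"])  # 8
```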
## How do I check completed job information?

Use `sacct` to view completed jobs:

```
# Jobs from today
sacct

# Specific job
sacct -j JOBID

# Jobs from the last 7 days
sacct --starttime=$(date -d '7 days ago' +%Y-%m-%d)
```
**Useful output formats:**

```
# Show elapsed time and exit status
sacct -j JOBID --format=JobID,JobName,Elapsed,State,ExitCode

# Show resource usage
sacct -j JOBID --format=JobID,JobName,MaxRSS,MaxVMSize,CPUTime,Elapsed
```
**Common sacct fields:**
| Field | Description |
|---|---|
| Elapsed | Actual runtime |
| CPUTime | Total CPU time (cores × time) |
| MaxRSS | Maximum memory used |
| MaxVMSize | Maximum virtual memory |
| State | Final job state |
| ExitCode | Exit code (0 = success) |
| ReqMem | Requested memory |
| ReqCPUS | Requested CPUs |
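Fields like MaxRSS come back as strings with unit suffixes (e.g. `13107200K`), so comparing used to requested memory takes a little conversion. A sketch assuming pipe-separated output in the style of `sacct --parsable2`; the sample line and the `rss_to_bytes` helper are illustrative:

```python
# Binary unit multipliers as used in sacct memory fields.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def rss_to_bytes(value):
    """Convert strings like '13107200K' or '16G' to bytes (sketch)."""
    if not value:
        return 0
    suffix = value[-1].upper()
    if suffix in UNITS:
        return int(float(value[:-1]) * UNITS[suffix])
    return int(value)

# Made-up line in the shape of:
# sacct -j JOBID --parsable2 --noheader --format=JobID,MaxRSS,ReqMem,Elapsed
line = "123456.batch|13107200K|16G|02:15:30"
jobid, maxrss, reqmem, elapsed = line.split("|")
print(rss_to_bytes(maxrss) / rss_to_bytes(reqmem))  # 0.78125
```

That fraction (78.125%) is the same memory-efficiency figure `seff` reports in the next section.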
## How do I check job efficiency?

Use `seff` for a quick efficiency report:

```
$ seff JOBID
Job ID: 123456
Cluster: cluster
User/Group: unityID/users
State: COMPLETED (exit code 0)
Cores: 8
CPU Utilized: 06:45:23
CPU Efficiency: 84.42% of 08:00:00 core-walltime
Memory Utilized: 12.5 GB
Memory Efficiency: 78.12% of 16.00 GB
```
Low CPU efficiency may indicate:
- Requesting more cores than the code can use
- I/O bottlenecks
- Load imbalance in parallel code
Low memory efficiency means you can safely request less memory in future jobs, which frees resources for other users and may help your jobs schedule sooner.
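The CPU-efficiency figure is just utilized CPU time divided by core-walltime (cores × elapsed time). A sketch of that arithmetic using the numbers from the report above, with Slurm's `[DD-]HH:MM:SS` time format (the `to_seconds` helper is illustrative; with these inputs the result agrees with seff's line up to rounding):

```python
def to_seconds(t):
    """Convert a Slurm [DD-]HH:MM:SS time string to seconds (sketch)."""
    days = 0
    if "-" in t:
        d, t = t.split("-")
        days = int(d)
    h, m, s = (int(x) for x in t.split(":"))
    return days * 86400 + h * 3600 + m * 60 + s

cores = 8                                   # "Cores: 8" in the report
cpu_utilized = to_seconds("06:45:23")       # "CPU Utilized" line
core_walltime = cores * to_seconds("01:00:00")  # 8 cores x 1h = 08:00:00
print(f"{100 * cpu_utilized / core_walltime:.2f}%")
```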
## How do I check cluster availability?

Use `sinfo` to see partition status:

```
$ sinfo
PARTITION AVAIL  TIMELIMIT NODES STATE NODELIST
compute*     up 4-00:00:00    45 idle  c[001-045]
compute*     up 4-00:00:00    20 mix   c[046-065]
compute*     up 4-00:00:00     5 alloc c[066-070]
gpu          up 4-00:00:00     8 idle  gpu[01-08]
gpu          up 4-00:00:00     2 mix   gpu[09-10]
```
Node states:
- idle - Available, no jobs running
- mix - Some cores in use, some available
- alloc - Fully allocated
- down - Unavailable
- drain - Being drained for maintenance
**Summary view:**

```
sinfo -s
```

**Show available resources:**

```
sinfo -o "%P %a %D %c %m %G"
```

Shows partition, availability, node count, CPUs, memory, and GRES (GPUs).
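To answer "how many idle nodes are there per partition?" in a script, you can tally `sinfo` rows by state. A sketch assuming output shaped like `sinfo --noheader -o "%P %t %D"`; the sample mirrors the example table above and is not live data (`idle_nodes` is a hypothetical helper):

```python
from collections import defaultdict

def idle_nodes(text):
    """Sum idle node counts per partition from 'PARTITION STATE NODES' rows."""
    counts = defaultdict(int)
    for line in text.strip().splitlines():
        partition, state, nodes = line.split()
        if state == "idle":
            # "*" marks the default partition; strip it for a clean key.
            counts[partition.rstrip("*")] += int(nodes)
    return dict(counts)

sample = """\
compute* idle 45
compute* mix 20
compute* alloc 5
gpu idle 8
gpu mix 2
"""
print(idle_nodes(sample))  # {'compute': 45, 'gpu': 8}
```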
## How do I see node details?

List nodes with details:

```
sinfo -N -l
```
**Show a specific node:**

```
scontrol show node NODENAME
```

**Show GPU availability:**

```
sinfo -p gpu -o "%N %G %t %C"
```

Shows node, GRES (GPUs), state, and CPU allocation.
**Graphical view:**
See the cluster status page for a visual representation.
## How do I cancel jobs?

Use `scancel`:

```
# Cancel a specific job
scancel JOBID

# Cancel all your jobs
scancel -u $USER

# Cancel all your pending jobs
scancel -u $USER -t PENDING

# Cancel jobs by name
scancel -n jobname

# Cancel array job tasks 1-50
scancel JOBID_[1-50]
```
## Can I modify a pending job?

Yes, use `scontrol update` for pending jobs:

```
# Change time limit
scontrol update jobid=JOBID TimeLimit=4:00:00

# Change partition
scontrol update jobid=JOBID Partition=gpu

# Change QOS
scontrol update jobid=JOBID QOS=short

# Change job name
scontrol update jobid=JOBID JobName=newname
```
Note: For a running job you cannot increase resources beyond the original request, but you can decrease its time limit.
**Hold and release jobs:**

```
# Hold a pending job
scontrol hold JOBID

# Release a held job
scontrol release JOBID
```