# Slurm Job Monitoring FAQ

Frequently asked questions about monitoring jobs, understanding job states, and checking cluster status.
## Quick Links
- How do I check my job status?
- What do job states mean?
- Why is my job pending?
- How do I see detailed job information?
- How do I check completed job information?
- How do I check job efficiency?
- How do I check cluster availability?
- How do I see node details?
- How do I cancel jobs?
- Can I modify a pending job?
## How do I check my job status?

Use `squeue` to view your jobs:

```
$ squeue -u $USER
  JOBID PARTITION       NAME    USER ST    TIME NODES NODELIST(REASON)
 123456   compute   analysis unityID  R 2:15:30     1 c001n01
 123457   compute preprocess unityID PD    0:00     1 (Priority)
```
Key columns:
- JOBID - Unique job identifier
- ST - Job state (R=running, PD=pending)
- TIME - Elapsed time for running jobs
- NODELIST(REASON) - Node assignment or pending reason
**Show more details:**

```
squeue -u $USER -l
```

**Show only running jobs:**

```
squeue -u $USER -t RUNNING
```

**Show only pending jobs:**

```
squeue -u $USER -t PENDING
```
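If you script around `squeue`, it can help to parse its table into structured records. A minimal sketch, assuming the eight default columns shown above; in practice you would feed it the output of `squeue -u $USER --noheader`, but here it parses a captured sample (`parse_squeue` is a hypothetical helper, not a Slurm tool):

```python
def parse_squeue(text):
    """Parse default squeue table rows into dicts (hypothetical helper)."""
    rows = []
    for line in text.strip().splitlines():
        # NODELIST(REASON) is the last field; split at most 7 times so
        # it survives intact even if it ever contains spaces.
        jobid, partition, name, user, state, time, nodes, nodelist = line.split(None, 7)
        rows.append({
            "jobid": jobid, "partition": partition, "name": name,
            "user": user, "state": state, "time": time,
            "nodes": int(nodes), "nodelist": nodelist,
        })
    return rows

sample = """\
123456 compute analysis unityID R 2:15:30 1 c001n01
123457 compute preprocess unityID PD 0:00 1 (Priority)
"""
jobs = parse_squeue(sample)
pending = [j["jobid"] for j in jobs if j["state"] == "PD"]
print(pending)  # ['123457']
```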
## What do job states mean?
| State | Code | Description |
|---|---|---|
| PENDING | PD | Waiting for resources or dependencies |
| RUNNING | R | Currently executing |
| COMPLETING | CG | Finishing up (epilog running) |
| COMPLETED | CD | Finished successfully (exit code 0) |
| FAILED | F | Finished with non-zero exit code |
| TIMEOUT | TO | Exceeded time limit |
| CANCELLED | CA | Cancelled by user or admin |
| NODE_FAIL | NF | Node failure during execution |
| OUT_OF_MEMORY | OOM | Exceeded memory limit |
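When processing `squeue` output in a script, the table above translates directly into a lookup. A small sketch (the dict and `is_terminal` helper are illustrative, not part of Slurm):

```python
# Map squeue's short state codes (the ST column) to full state names,
# mirroring the table above.
STATE_CODES = {
    "PD": "PENDING", "R": "RUNNING", "CG": "COMPLETING",
    "CD": "COMPLETED", "F": "FAILED", "TO": "TIMEOUT",
    "CA": "CANCELLED", "NF": "NODE_FAIL", "OOM": "OUT_OF_MEMORY",
}

def is_terminal(code):
    # Everything except PENDING, RUNNING, and COMPLETING is final.
    return code not in ("PD", "R", "CG")

print(STATE_CODES["PD"], is_terminal("PD"))  # PENDING False
```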
## Why is my job pending?

The REASON column in `squeue` output explains why:
| Reason | Meaning | Action |
|---|---|---|
| Priority | Other jobs have higher priority | Wait, or use short QOS for small jobs |
| Resources | Waiting for requested resources to become available | Wait, or reduce resource request |
| QOSMaxCpuPerUserLimit | You've hit your CPU limit for this QOS | Wait for running jobs to finish |
| QOSMaxJobsPerUserLimit | You've hit your job count limit | Wait for running jobs to finish |
| Dependency | Waiting for dependent job to complete | Wait for dependency to resolve |
| PartitionNodeLimit | Requesting more nodes than partition allows | Reduce node count |
| PartitionTimeLimit | Requested time exceeds partition limit | Reduce time or use different QOS |
| ReqNodeNotAvail | Required nodes are down or reserved | Remove constraint or wait |
| AssocGrpCPURunMinutesLimit | Account has used allocation | Contact HPC support |
See Job Priority and Fairshare for details on how job priority is calculated.
**Get detailed pending reason:**

```
squeue -j JOBID -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
```

**Estimate start time:**

```
squeue -j JOBID --start
```
## How do I see detailed job information?

Use `scontrol show job`:

```
scontrol show job JOBID
```
This shows all job parameters including:
- Submit time and start time
- Requested resources (CPUs, memory, time)
- Node assignment
- Working directory and command
- Environment variables
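`scontrol show job` prints space-separated `Key=Value` pairs, which are easy to turn into a dict. A naive sketch (the `parse_scontrol` helper and the sample excerpt are made up; note that values containing spaces, such as some `Command` paths, would need more careful splitting):

```python
def parse_scontrol(text):
    """Collect Key=Value tokens from scontrol output (naive sketch)."""
    info = {}
    for token in text.split():
        if "=" in token:
            key, _, value = token.partition("=")
            info[key] = value
    return info

# Shortened, made-up excerpt of `scontrol show job` output.
sample = "JobId=123456 JobName=analysis NumCPUs=8 TimeLimit=04:00:00"
job = parse_scontrol(sample)
print(job["NumCPUs"])  # 8
```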
## How do I check completed job information?

Use `sacct` to view completed jobs:

```
# Jobs from today
sacct

# Specific job
sacct -j JOBID

# Jobs from the last 7 days
sacct --starttime=$(date -d '7 days ago' +%Y-%m-%d)
```
**Useful output formats:**

```
# Show elapsed time and exit status
sacct -j JOBID --format=JobID,JobName,Elapsed,State,ExitCode

# Show resource usage
sacct -j JOBID --format=JobID,JobName,MaxRSS,MaxVMSize,CPUTime,Elapsed
```
**Common sacct fields:**
| Field | Description |
|---|---|
| Elapsed | Actual runtime |
| CPUTime | Total CPU time (cores × time) |
| MaxRSS | Maximum memory used |
| MaxVMSize | Maximum virtual memory |
| State | Final job state |
| ExitCode | Exit code (0 = success) |
| ReqMem | Requested memory |
| ReqCPUS | Requested CPUs |
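Fields like MaxRSS come back as strings with unit suffixes (e.g. `13107200K`), so comparing used to requested memory takes a little conversion. A sketch assuming pipe-separated output in the style of `sacct --parsable2`; the sample line and the `rss_to_bytes` helper are illustrative:

```python
# Binary unit multipliers as used in sacct memory fields.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def rss_to_bytes(value):
    """Convert strings like '13107200K' or '16G' to bytes (sketch)."""
    if not value:
        return 0
    suffix = value[-1].upper()
    if suffix in UNITS:
        return int(float(value[:-1]) * UNITS[suffix])
    return int(value)

# Made-up line in the shape of:
# sacct -j JOBID --parsable2 --noheader --format=JobID,MaxRSS,ReqMem,Elapsed
line = "123456.batch|13107200K|16G|02:15:30"
jobid, maxrss, reqmem, elapsed = line.split("|")
print(rss_to_bytes(maxrss) / rss_to_bytes(reqmem))  # 0.78125
```

That fraction (78.125%) is the same memory-efficiency figure `seff` reports in the next section.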
## How do I check job efficiency?

Use `seff` for a quick efficiency report:

```
$ seff JOBID
Job ID: 123456
Cluster: cluster
User/Group: unityID/users
State: COMPLETED (exit code 0)
Cores: 8
CPU Utilized: 06:45:23
CPU Efficiency: 84.42% of 08:00:00 core-walltime
Memory Utilized: 12.5 GB
Memory Efficiency: 78.12% of 16.00 GB
```
Low CPU efficiency may indicate:
- Requesting more cores than the code can use
- I/O bottlenecks
- Load imbalance in parallel code
Low memory efficiency means you can safely request less memory in future jobs, which frees resources for other users and may help your jobs schedule sooner.
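The CPU-efficiency figure is just utilized CPU time divided by core-walltime (cores × elapsed time). A sketch of that arithmetic using the numbers from the report above, with Slurm's `[DD-]HH:MM:SS` time format (the `to_seconds` helper is illustrative; with these inputs the result agrees with seff's line up to rounding):

```python
def to_seconds(t):
    """Convert a Slurm [DD-]HH:MM:SS time string to seconds (sketch)."""
    days = 0
    if "-" in t:
        d, t = t.split("-")
        days = int(d)
    h, m, s = (int(x) for x in t.split(":"))
    return days * 86400 + h * 3600 + m * 60 + s

cores = 8                                   # "Cores: 8" in the report
cpu_utilized = to_seconds("06:45:23")       # "CPU Utilized" line
core_walltime = cores * to_seconds("01:00:00")  # 8 cores x 1h = 08:00:00
print(f"{100 * cpu_utilized / core_walltime:.2f}%")
```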
## How do I check cluster availability?

Use `sinfo` to see partition status:

```
$ sinfo
PARTITION AVAIL  TIMELIMIT NODES STATE NODELIST
compute*     up 4-00:00:00    45 idle  c[001-045]
compute*     up 4-00:00:00    20 mix   c[046-065]
compute*     up 4-00:00:00     5 alloc c[066-070]
gpu          up 4-00:00:00     8 idle  gpu[01-08]
gpu          up 4-00:00:00     2 mix   gpu[09-10]
```
Node states:
- idle - Available, no jobs running
- mix - Some cores in use, some available
- alloc - Fully allocated
- down - Unavailable
- drain - Being drained for maintenance
**Summary view:**

```
sinfo -s
```

**Show available resources:**

```
sinfo -o "%P %a %D %c %m %G"
```

Shows partition, availability, node count, CPUs, memory, and GRES (GPUs).
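To answer "how many idle nodes are there per partition?" in a script, you can tally `sinfo` rows by state. A sketch assuming output shaped like `sinfo --noheader -o "%P %t %D"`; the sample mirrors the example table above and is not live data (`idle_nodes` is a hypothetical helper):

```python
from collections import defaultdict

def idle_nodes(text):
    """Sum idle node counts per partition from 'PARTITION STATE NODES' rows."""
    counts = defaultdict(int)
    for line in text.strip().splitlines():
        partition, state, nodes = line.split()
        if state == "idle":
            # "*" marks the default partition; strip it for a clean key.
            counts[partition.rstrip("*")] += int(nodes)
    return dict(counts)

sample = """\
compute* idle 45
compute* mix 20
compute* alloc 5
gpu idle 8
gpu mix 2
"""
print(idle_nodes(sample))  # {'compute': 45, 'gpu': 8}
```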
## How do I see node details?

List nodes with details:

```
sinfo -N -l
```
**Show a specific node:**

```
scontrol show node NODENAME
```

**Show GPU availability:**

```
sinfo -p gpu -o "%N %G %t %C"
```

Shows node, GRES (GPUs), state, and CPU allocation.
**Graphical view:**
See the cluster status page for a visual representation.
## How do I cancel jobs?

Use `scancel`:

```
# Cancel a specific job
scancel JOBID

# Cancel all your jobs
scancel -u $USER

# Cancel all your pending jobs
scancel -u $USER -t PENDING

# Cancel jobs by name
scancel -n jobname

# Cancel array job tasks 1-50
scancel JOBID_[1-50]
```
## Can I modify a pending job?

Yes, use `scontrol update` for pending jobs:

```
# Change time limit
scontrol update jobid=JOBID TimeLimit=4:00:00

# Change partition
scontrol update jobid=JOBID Partition=gpu

# Change QOS
scontrol update jobid=JOBID QOS=short

# Change job name
scontrol update jobid=JOBID JobName=newname
```
Note: For a running job you cannot increase resources beyond the original request, but you can decrease its time limit.
**Hold and release jobs:**

```
# Hold a pending job
scontrol hold JOBID

# Release a held job
scontrol release JOBID
```