How do I check my job status?

Use squeue to view your jobs:

$ squeue -u $USER
JOBID   PARTITION  NAME       USER     ST  TIME     NODES  NODELIST(REASON)
123456  compute    analysis   unityID  R   2:15:30  1      c001n01
123457  compute    preprocess unityID  PD  0:00     1      (Priority)

Key columns:

  • JOBID - Unique job identifier
  • ST - Job state (R=running, PD=pending)
  • TIME - Elapsed time for running jobs
  • NODELIST(REASON) - Node assignment or pending reason

Show more details

squeue -u $USER -l

Show only running jobs

squeue -u $USER -t RUNNING

Show only pending jobs

squeue -u $USER -t PENDING
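
For a quick tally instead of a full listing, the state filters above can be combined into one summary. The following is a minimal sketch, not an official tool: count_by_state is a hypothetical helper name, and it assumes one state name per line on stdin, which is what squeue prints with -h (no header) and -o "%T" (full state name).

```shell
# count_by_state (hypothetical helper): read one job state per line on stdin
# and print "STATE COUNT" for each distinct state, sorted by state name.
count_by_state() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Intended use on the cluster:
#   squeue -u "$USER" -h -o "%T" | count_by_state
```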

What do job states mean?

State          Code  Description
PENDING        PD    Waiting for resources or dependencies
RUNNING        R     Currently executing
COMPLETING     CG    Finishing up (epilog running)
COMPLETED      CD    Finished successfully (exit code 0)
FAILED         F     Finished with non-zero exit code
TIMEOUT        TO    Exceeded time limit
CANCELLED      CA    Cancelled by user or admin
NODE_FAIL      NF    Node failure during execution
OUT_OF_MEMORY  OOM   Exceeded memory limit
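
When scripting around squeue or sacct output, the short codes can be translated back into these descriptions. This is a small sketch that covers only the states listed above; describe_state is a hypothetical name, and Slurm defines further states that it does not handle.

```shell
# describe_state (hypothetical helper): print the description for a short
# Slurm state code. Covers only the codes listed in the table above.
describe_state() {
    case "$1" in
        PD)  echo "Waiting for resources or dependencies" ;;
        R)   echo "Currently executing" ;;
        CG)  echo "Finishing up (epilog running)" ;;
        CD)  echo "Finished successfully (exit code 0)" ;;
        F)   echo "Finished with non-zero exit code" ;;
        TO)  echo "Exceeded time limit" ;;
        CA)  echo "Cancelled by user or admin" ;;
        NF)  echo "Node failure during execution" ;;
        OOM) echo "Exceeded memory limit" ;;
        *)   echo "Unknown state code: $1" >&2; return 1 ;;
    esac
}

# Example: describe_state TO
```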

Why is my job pending?

The NODELIST(REASON) column in squeue output explains why:

Reason                      Meaning                                              Action
Priority                    Other jobs have higher priority                      Wait, or use short QOS for small jobs
Resources                   Waiting for requested resources to become available  Wait, or reduce resource request
QOSMaxCpuPerUserLimit       You've hit your CPU limit for this QOS               Wait for running jobs to finish
QOSMaxJobsPerUserLimit      You've hit your job count limit                      Wait for running jobs to finish
Dependency                  Waiting for dependent job to complete                Wait for dependency to resolve
PartitionNodeLimit          Requesting more nodes than partition allows          Reduce node count
PartitionTimeLimit          Requested time exceeds partition limit               Reduce time or use different QOS
ReqNodeNotAvail             Required nodes are down or reserved                  Remove constraint or wait
AssocGrpCPURunMinutesLimit  Account has used allocation                          Contact HPC support

See Job Priority and Fairshare for details on how job priority is calculated.

Get detailed pending reason

squeue -j JOBID -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"

Estimate start time

squeue -j JOBID --start

How do I see detailed job information?

Use scontrol show job:

scontrol show job JOBID

This shows all job parameters including:

  • Submit time and start time
  • Requested resources (CPUs, memory, time)
  • Node assignment
  • Working directory and command
  • Environment variables
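
The output of scontrol show job is a series of Key=Value pairs, so a single field can be pulled out with standard text tools. The sketch below is illustrative only: job_field is a hypothetical helper, and it assumes the value contains no spaces (values such as a command line with arguments would be cut at the first space).

```shell
# job_field KEY (hypothetical helper): read scontrol-style "Key=Value" text
# on stdin and print the value of the first field named KEY.
# Assumption: the value itself contains no spaces.
job_field() {
    tr -s ' ' '\n' |
        awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2); exit }'
}

# Intended use on the cluster:
#   scontrol show job JOBID | job_field JobState
#   scontrol show job JOBID | job_field Reason
```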

How do I check completed job information?

Use sacct to view completed jobs:

# Jobs from today
sacct

# Specific job
sacct -j JOBID

# Jobs from last 7 days
sacct --starttime=$(date -d '7 days ago' +%Y-%m-%d)

Useful output formats

# Show elapsed time and exit status
sacct -j JOBID --format=JobID,JobName,Elapsed,State,ExitCode

# Show resource usage
sacct -j JOBID --format=JobID,JobName,MaxRSS,MaxVMSize,CPUTime,Elapsed

Common sacct fields

Field      Description
Elapsed    Actual runtime (wall-clock)
CPUTime    Total CPU time allocated (cores × elapsed time)
MaxRSS     Maximum memory used (resident set size)
MaxVMSize  Maximum virtual memory
State      Final job state
ExitCode   Exit code (0 = success)
ReqMem     Requested memory
ReqCPUS    Requested CPUs
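
Time fields such as Elapsed and CPUTime are printed as [D-]HH:MM:SS, which is awkward to compare or sum in scripts. The following sketch converts such a value to seconds. slurm_to_seconds is a hypothetical helper name, and it assumes only the D-HH:MM:SS, HH:MM:SS, and MM:SS shapes; other forms are rejected rather than guessed at.

```shell
# slurm_to_seconds TIME (hypothetical helper): convert a Slurm time string
# of the form D-HH:MM:SS, HH:MM:SS, or MM:SS to a number of seconds.
slurm_to_seconds() {
    awk -v t="$1" 'BEGIN {
        days = 0
        # Split off an optional leading "D-" day count.
        if (split(t, d, "-") == 2) { days = d[1]; t = d[2] }
        n = split(t, p, ":")
        if (n == 3)      secs = p[1] * 3600 + p[2] * 60 + p[3]
        else if (n == 2) secs = p[1] * 60 + p[2]
        else { print "unrecognised time: " t > "/dev/stderr"; exit 1 }
        printf "%d\n", days * 86400 + secs
    }'
}

# Example: slurm_to_seconds 4-00:00:00
```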

How do I check job efficiency?

Use seff for a quick efficiency report:

$ seff JOBID
Job ID: 123456
Cluster: cluster
User/Group: unityID/users
State: COMPLETED (exit code 0)
Cores: 8
CPU Utilized: 06:45:23
CPU Efficiency: 84.45% of 08:00:00 core-walltime
Memory Utilized: 12.5 GB
Memory Efficiency: 78.12% of 16.00 GB

Low CPU efficiency may indicate:

  • Requesting more cores than the code can use
  • I/O bottlenecks
  • Load imbalance in parallel code

Low memory efficiency means you could request less memory.
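
The CPU efficiency figure is CPU time used divided by core-walltime (cores × wall-clock time). The sketch below reproduces that arithmetic so it can be applied to numbers taken from sacct. cpu_efficiency is a hypothetical helper name, all three arguments are assumed to be in seconds or plain counts, and the values in the example are made up for illustration.

```shell
# cpu_efficiency CPU_SECONDS CORES WALL_SECONDS (hypothetical helper):
# print the percentage of allocated core time that was actually used.
cpu_efficiency() {
    awk -v used="$1" -v cores="$2" -v wall="$3" 'BEGIN {
        if (cores * wall <= 0) {
            print "core-walltime must be positive" > "/dev/stderr"
            exit 1
        }
        printf "%.2f\n", 100 * used / (cores * wall)
    }'
}

# Illustrative values: 4 cores for 2 hours (7200 s) with 6 hours (21600 s)
# of CPU time used.
#   cpu_efficiency 21600 4 7200
```

A job that reports a low percentage here while using all of its cores only briefly is usually a candidate for a smaller core request.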

How do I check cluster availability?

Use sinfo to see partition status:

$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
compute*   up     4-00:00:00    45  idle   c[001-045]
compute*   up     4-00:00:00    20  mix    c[046-065]
compute*   up     4-00:00:00     5  alloc  c[066-070]
gpu        up     4-00:00:00     8  idle   gpu[01-08]
gpu        up     4-00:00:00     2  mix    gpu[09-10]

Node states:

  • idle - Available, no jobs running
  • mix - Some cores in use, some available
  • alloc - Fully allocated
  • down - Unavailable
  • drain - Being drained for maintenance

Summary view

sinfo -s

Show available resources

sinfo -o "%P %a %D %c %m %G"

Shows partition, availability, node count, CPUs per node, memory per node (in MB), and GRES (GPUs).
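
To answer "how many nodes are completely free right now?", sinfo output can be reduced to one line per partition. This is a rough sketch under stated assumptions: idle_nodes is a hypothetical helper, and it expects "PARTITION NODES STATE" lines on stdin, as produced by sinfo with -h (no header) and -o "%P %D %t". Only nodes whose state is exactly idle are counted.

```shell
# idle_nodes (hypothetical helper): read "PARTITION NODES STATE" lines on
# stdin and print the number of idle nodes per partition.
idle_nodes() {
    awk '$3 == "idle" {
             sub(/\*$/, "", $1)   # drop the "*" that marks the default partition
             n[$1] += $2
         }
         END { for (p in n) print p, n[p] }' | sort
}

# Intended use on the cluster:
#   sinfo -h -o "%P %D %t" | idle_nodes
```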

How do I see node details?

List nodes with details:

sinfo -N -l

Show specific node

scontrol show node NODENAME

Show GPU availability

sinfo -p gpu -o "%N %G %t %C"

Shows node names, GRES (GPUs), node state, and CPU counts as allocated/idle/other/total.

Graphical view

See the cluster status page for a visual representation.

How do I cancel jobs?

Use scancel:

# Cancel specific job
scancel JOBID

# Cancel all your jobs
scancel -u $USER

# Cancel all pending jobs
scancel -u $USER -t PENDING

# Cancel jobs by name
scancel -n jobname

# Cancel array job tasks
scancel JOBID_[1-50]

Can I modify a pending job?

Yes, use scontrol update for pending jobs:

# Change time limit
scontrol update jobid=JOBID TimeLimit=4:00:00

# Change partition
scontrol update jobid=JOBID Partition=gpu

# Change QOS
scontrol update jobid=JOBID QOS=short

# Change job name
scontrol update jobid=JOBID JobName=newname

Note: You can decrease a job's time limit yourself, but only an administrator can increase it. Resources for a running job cannot be increased beyond the original request.

Hold and release jobs

# Hold a pending job
scontrol hold JOBID

# Release a held job
scontrol release JOBID