How do I check my job status?

Use squeue to view your jobs:

$ squeue -u $USER
JOBID   PARTITION  NAME       USER     ST  TIME     NODES  NODELIST(REASON)
123456  compute    analysis   unityID  R   2:15:30  1      c001n01
123457  compute    preprocess unityID  PD  0:00     1      (Priority)

Key columns:

  • JOBID - Unique job identifier
  • ST - Job state (R=running, PD=pending)
  • TIME - Elapsed time for running jobs
  • NODELIST(REASON) - Node assignment or pending reason

Show more details

squeue -u $USER -l

Show only running jobs

squeue -u $USER -t RUNNING

Show only pending jobs

squeue -u $USER -t PENDING
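
For a quick tally rather than a full listing, the state column can be piped through a small counter. This is a minimal sketch: count_states is a hypothetical helper name, and it assumes input with one state code per line, which squeue produces with -h (no header) and -o "%t" (compact state only).

```shell
# Count jobs per state code, reading one code per line on stdin.
# count_states is a hypothetical helper, not a Slurm command.
count_states() {
    awk 'NF { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# Possible usage on the cluster:
#   squeue -u "$USER" -h -o "%t" | count_states
```

Blank lines are skipped, so an empty queue simply prints nothing.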

What do job states mean?

State          Code  Description
PENDING        PD    Waiting for resources or dependencies
RUNNING        R     Currently executing
COMPLETING     CG    Finishing up (epilog running)
COMPLETED      CD    Finished successfully (exit code 0)
FAILED         F     Finished with non-zero exit code
TIMEOUT        TO    Exceeded time limit
CANCELLED      CA    Cancelled by user or admin
NODE_FAIL      NF    Node failure during execution
OUT_OF_MEMORY  OOM   Exceeded memory limit
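
Scripts that poll a job usually only need to know whether it has ended. The sketch below groups the codes from the table above into "ended" and "not ended". job_finished is a hypothetical helper name, and treating CG (completing) as not yet finished is an assumption you may want to change.

```shell
# Return success (0) if the state code means the job has ended.
# job_finished is a hypothetical helper, not a Slurm command.
job_finished() {
    case "$1" in
        CD|F|TO|CA|NF|OOM) return 0 ;;   # terminal states
        *)                 return 1 ;;   # PD, R, CG, or anything unlisted
    esac
}

# Possible usage on the cluster:
#   state=$(squeue -j JOBID -h -o "%t")
#   if job_finished "$state"; then echo "job ended with state $state"; fi
```

Jobs that ended some time ago drop out of squeue entirely, so an empty result is worth checking with sacct.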

Why is my job pending?

The REASON column in squeue output explains why:

Reason                      Meaning                                              Action
Priority                    Other jobs have higher priority                      Wait, or use short QOS for small jobs
Resources                   Waiting for requested resources to become available  Wait, or reduce resource request
QOSMaxCpuPerUserLimit       You've hit your CPU limit for this QOS               Wait for running jobs to finish
QOSMaxJobsPerUserLimit      You've hit your job count limit                      Wait for running jobs to finish
Dependency                  Waiting for dependent job to complete                Wait for dependency to resolve
PartitionNodeLimit          Requesting more nodes than partition allows          Reduce node count
PartitionTimeLimit          Requested time exceeds partition limit               Reduce time or use different QOS
ReqNodeNotAvail             Required nodes are down or reserved                  Remove constraint or wait
AssocGrpCPURunMinutesLimit  Account has used allocation                          Contact HPC support

See Job Priority and Fairshare for details on how job priority is calculated.
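
The reason-to-action mapping above can also be applied to all of your pending jobs at once. This is a rough sketch: pending_advice is a hypothetical helper name, and the trailing wildcard on ReqNodeNotAvail assumes that squeue may append extra detail (such as a list of unavailable nodes) to that reason.

```shell
# Map a pending reason from squeue to the action suggested in the table above.
# pending_advice is a hypothetical helper, not a Slurm command.
pending_advice() {
    case "$1" in
        Priority)                   echo "Wait, or use short QOS for small jobs" ;;
        Resources)                  echo "Wait, or reduce resource request" ;;
        QOSMaxCpuPerUserLimit)      echo "Wait for running jobs to finish" ;;
        QOSMaxJobsPerUserLimit)     echo "Wait for running jobs to finish" ;;
        Dependency)                 echo "Wait for dependency to resolve" ;;
        PartitionNodeLimit)         echo "Reduce node count" ;;
        PartitionTimeLimit)         echo "Reduce time or use different QOS" ;;
        ReqNodeNotAvail*)           echo "Remove constraint or wait" ;;
        AssocGrpCPURunMinutesLimit) echo "Contact HPC support" ;;
        *)                          echo "Not listed; inspect with scontrol show job" ;;
    esac
}

# Possible usage on the cluster (one line per pending job):
#   squeue -u "$USER" -h -t PENDING -o "%i %r" |
#   while read -r id reason; do
#       printf '%s: %s\n' "$id" "$(pending_advice "$reason")"
#   done
```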

Get detailed pending reason

squeue -j JOBID -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"

Estimate start time

squeue -j JOBID --start

How do I see detailed job information?

The sj helper prints a concise summary of a job's resource request and key metadata — partition, QOS, node/task/CPU counts, memory, GPU request, constraints, time limit, and submit/start/end times. It reads the job from the controller while it is pending, running, or recently completed, and falls back to the accounting database for older jobs.

sj JOBID                # request + state for a job
sj JOBID_7              # an array task element

For the complete record (working directory, command, environment, every parameter), use scontrol show job:

scontrol show job JOBID

This shows all job parameters including:

  • Submit time and start time
  • Requested resources (CPUs, memory, time)
  • Node assignment
  • Working directory and command
  • Environment variables

How do I check completed job information?

Use sacct to view completed jobs:

# Jobs from today
sacct

# Specific job
sacct -j JOBID

# Jobs from last 7 days
sacct --starttime=$(date -d '7 days ago' +%Y-%m-%d)

Useful output formats

# Show elapsed time and exit status
sacct -j JOBID --format=JobID,JobName,Elapsed,State,ExitCode

# Show resource usage
sacct -j JOBID --format=JobID,JobName,MaxRSS,MaxVMSize,CPUTime,Elapsed

Common sacct fields

Field      Description
Elapsed    Actual runtime (wall-clock)
CPUTime    Allocated CPU time (allocated cores × elapsed time)
MaxRSS     Maximum memory used
MaxVMSize  Maximum virtual memory
State      Final job state
ExitCode   Exit code (0 = success)
ReqMem     Requested memory
ReqCPUS    Requested CPUs
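
MaxRSS is typically printed with a unit suffix (for example 12800K), which makes it awkward to compare against a request given in MB or GB. The sketch below normalises such values to MiB. to_mib is a hypothetical helper name; it assumes a K, M, G, or T suffix in powers of 1024 and deliberately fails on anything else, including an empty field.

```shell
# Convert a memory value such as 12800K, 512M, or 12.5G to MiB.
# to_mib is a hypothetical helper, not a Slurm command.
# Values without a recognised suffix are rejected (non-zero exit status).
to_mib() {
    echo "$1" | awk '{
        unit = substr($0, length($0))   # last character is the unit
        val  = $0 + 0                   # leading numeric part
        if      (unit == "K") mib = val / 1024
        else if (unit == "M") mib = val
        else if (unit == "G") mib = val * 1024
        else if (unit == "T") mib = val * 1024 * 1024
        else                  exit 1
        printf "%.1f\n", mib
    }'
}

# Possible usage on the cluster (-n drops the header, -P gives |-separated output):
#   sacct -j JOBID -n -P --format=JobID,MaxRSS
```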

How do I check job efficiency?

Use seff for a quick efficiency report:

$ seff JOBID
Job ID: 123456
Cluster: cluster
User/Group: unityID/users
State: COMPLETED (exit code 0)
Cores: 8
CPU Utilized: 06:45:23
CPU Efficiency: 84.45% of 08:00:00 core-walltime
Memory Utilized: 12.5 GB
Memory Efficiency: 78.12% of 16.00 GB

Low CPU efficiency may indicate:

  • Requesting more cores than the code can use
  • I/O bottlenecks
  • Load imbalance in parallel code

Low memory efficiency means you could request less memory.
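
The CPU efficiency figure is CPU time used divided by core-walltime (cores × elapsed time). The sketch below reproduces that arithmetic for two time strings. to_seconds and cpu_efficiency are hypothetical helper names, and the accepted format ([D-]HH:MM:SS) is an assumption about how the times are printed.

```shell
# Convert [D-]HH:MM:SS to seconds.
to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) print $1 * 86400 + $2 * 3600 + $3 * 60 + $4
        else         print $1 * 3600 + $2 * 60 + $3
    }'
}

# Print CPU efficiency in percent: 100 * CPU time used / core-walltime.
cpu_efficiency() {
    used=$(to_seconds "$1")
    avail=$(to_seconds "$2")
    awk -v u="$used" -v a="$avail" 'BEGIN {
        if (a <= 0) exit 1              # avoid dividing by zero
        printf "%.2f\n", 100 * u / a
    }'
}

# Example: 3 CPU-hours used out of 4 core-hours allocated
cpu_efficiency 03:00:00 04:00:00        # prints 75.00
```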

How do I check cluster availability?

The si helper gives a concise per-partition summary of how many nodes are currently free (idle or mixed), grouped by CPU architecture on the compute partitions and by GPU model on the GPU partitions:

$ si
Partition........... Available Nodes
compute............. 45 Cascadelake  20 Icelake  5 Genoa
gpu................. 8 H100  2 A100

Other useful forms:

si -p gpu               # restrict to one partition
si --memory             # group free nodes by total memory size
si --all                # include nodes in every state, not just free

"Free" counts nodes whose state is idle (available, no jobs) or mix (some cores in use, some available). The raw sinfo command lists every partition/state group, including states si hides by default: alloc (fully allocated), down (unavailable), and drain (being drained for maintenance).

Equivalent native commands: sinfo, sinfo -s, sinfo -o "%P %a %D %c %m %G". In sinfo -o output, GRES on compute nodes reports the CPU architecture as a typed resource (e.g., cpu:haswell:20, cpu:cascadelake:32, cpu:genoa:192) and on GPU nodes the GPU type (e.g., gpu:a100:4).
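
If si is not available, a similar "free node" count can be approximated from raw sinfo output. This is only a sketch: free_nodes is a hypothetical helper name, it counts exactly the states idle and mix (states carrying a flag suffix such as down* are ignored), and it expects "STATE COUNT" lines as produced by sinfo -h -o "%t %D".

```shell
# Sum node counts for the idle and mix states, reading "STATE COUNT" lines on stdin.
# free_nodes is a hypothetical helper, not a Slurm command.
free_nodes() {
    awk '$1 == "idle" || $1 == "mix" { total += $2 } END { print total + 0 }'
}

# Possible usage on the cluster (one partition at a time, so that nodes
# shared between partitions are not counted twice):
#   sinfo -h -p compute -o "%t %D" | free_nodes
```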

How do I see node details?

si --nodes prints a per-node table. By default it shows available / allocated / offline / total cores; add --memory or --gpus for per-node memory or GPU counts instead:

si --nodes              # per-node cores
si --nodes --memory     # per-node memory (free / allocated / total)
si --nodes --gpus       # per-node GPUs (GPU nodes only)
si --nodes -p gpu       # restrict to one partition

Show a specific node

scontrol show node NODENAME

Equivalent native commands: sinfo -N -l, sinfo -p gpu -o "%N %G %t %C".
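
The %C field in sinfo packs four core counts into one value, allocated/idle/other/total. A short filter can pull out the idle count per node. idle_cores is a hypothetical helper name, and the input format ("NODE A/I/O/T" per line) is what sinfo -h -N -o "%N %C" is expected to print.

```shell
# Print "NODE IDLE_CORES" for every node that has at least one idle core.
# Reads "NODE allocated/idle/other/total" lines on stdin.
# idle_cores is a hypothetical helper, not a Slurm command.
idle_cores() {
    awk '{
        split($2, c, "/")               # c[1]=allocated c[2]=idle c[3]=other c[4]=total
        if (c[2] + 0 > 0) print $1, c[2]
    }'
}

# Possible usage on the cluster:
#   sinfo -h -N -p compute -o "%N %C" | idle_cores
```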

Graphical view

See the cluster status page for a visual representation.

What QOS and accounts can I use?

Two helpers answer this without parsing raw sacctmgr output. sqos is usually what you want: it lists the QOS you can submit with, which partitions each is valid on, and the limits that apply (wall time, and per-user / per-job CPU, GPU, and memory caps).

$ sqos
QOS     Partitions        MaxWall      MaxTRES/User     MaxTRES/Job  GrpTRES
normal  compute           4-00:00:00   cpu=512          -            -
long    compute           10-00:00:00  cpu=512          -            -
gpu     gpu               4-00:00:00   -                -            -
short   compute_partners  02:00:00     -                -            -

Add -v to also show each QOS's priority and flags. Pass a login (e.g. sqos alice) to see another user's QOS.

sa shows your underlying associations — the accounts you can charge to, the partitions, your default QOS, and the full allowed QOS list per account:

sa                      # your associations (account, partition, default QOS, allowed QOS)
sa --tree               # your place in the account hierarchy, back to root
sa alice                # another user's associations

If a job is rejected for an invalid QOS or partition, sqos will show which combinations are actually open to you. See Partitions and Resources for the full QOS / partition reference and Job Priority and Fairshare for how QOS affects scheduling.

Equivalent native command: sacctmgr show assoc user=$USER format=account,qos,maxcpus,maxnodes.

How do I cancel jobs?

Use scancel:

# Cancel specific job
scancel JOBID

# Cancel all your jobs
scancel -u $USER

# Cancel all pending jobs
scancel -u $USER -t PENDING

# Cancel jobs by name
scancel -n jobname

# Cancel array job tasks
scancel JOBID_[1-50]

Can I modify a pending job?

Yes. Use scontrol update to change a job while it is still pending:

# Change time limit
scontrol update jobid=JOBID TimeLimit=4:00:00

# Change partition
scontrol update jobid=JOBID Partition=gpu

# Change QOS
scontrol update jobid=JOBID QOS=short

# Change job name
scontrol update jobid=JOBID JobName=newname

Note: Once a job is running, you cannot raise its resources above the original request, but you can still reduce its time limit.

Hold and release jobs

# Hold a pending job
scontrol hold JOBID

# Release a held job
scontrol release JOBID