GPU Jobs with Slurm
How to request and use GPU resources for CUDA, machine learning, and other GPU-accelerated workloads.
Basic GPU Job
To run a GPU job, use the gpu partition and request GPUs with --gres. You must specify the GPU type (e.g., --gres=gpu:a100:1). Untyped requests like --gres=gpu:1 will be rejected.
```bash
#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --output=gpu.out.%j
#SBATCH --error=gpu.err.%j
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a100:1
#SBATCH --ntasks=1
#SBATCH --time=02:00:00

module load cuda

./my_cuda_program
```
Available GPU Types
The cluster has several types of NVIDIA GPUs. Request specific types by name:
| GPU Type | Architecture | Memory | GPUs/Node | Request Syntax |
|---|---|---|---|---|
| NVIDIA H200 | Hopper | 141 GB HBM3 | 4 | --gres=gpu:h200:N |
| NVIDIA H100 | Hopper | 80 GB HBM3 | 4 | --gres=gpu:h100:N |
| NVIDIA A100 | Ampere | 40 GB HBM2e | 4 | --gres=gpu:a100:N |
| NVIDIA L40S | Ada Lovelace | 48 GB GDDR6 | 4 | --gres=gpu:l40s:N |
| NVIDIA L40 | Ada Lovelace | 48 GB GDDR6 | 4 | --gres=gpu:l40:N |
| NVIDIA A30 | Ampere | 24 GB HBM2 | 2 | --gres=gpu:a30:N |
| NVIDIA A10 | Ampere | 24 GB GDDR6 | 2 | --gres=gpu:a10:N |
| NVIDIA P100 | Pascal | 16 GB HBM2 | 2 | --gres=gpu:p100:N |
| NVIDIA RTX 2080 | Turing | 8 GB GDDR6 | 4 | --gres=gpu:rtx_2080:N |
| NVIDIA GTX 1080 | Pascal | 8 GB GDDR5X | 2 | --gres=gpu:gtx1080:N |
Choosing a GPU Type
- H200/H100: Best for large language models, transformer training, scientific computing, and workloads needing maximum memory bandwidth
- A100: Excellent for deep learning training, scientific computing, and multi-GPU workloads
- L40/L40S: Good for inference, visualization, and mixed AI/graphics workloads
- A30/A10: Suitable for smaller models, inference, and development/testing
- P100/RTX 2080/GTX 1080: Older generation; suitable for development, testing, and smaller workloads
Requesting Multiple GPUs
Multiple GPUs on one node
```bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:h100:4   # 4 H100 GPUs
#SBATCH --ntasks=1
```
Multiple nodes with GPUs
```bash
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:l40:4    # 4 L40 GPUs per node
```
Short GPU Jobs
For GPU jobs under 2 hours, use the short_gpu QOS. This provides access to idle partner GPUs, giving you a larger pool of available hardware:
```bash
#SBATCH --partition=gpu
#SBATCH --qos=short_gpu
#SBATCH --gres=gpu:a30:1
#SBATCH --time=01:30:00
```
Environment Setup
CUDA Toolkit
Load the CUDA module to access NVIDIA compilers and libraries:
```bash
module load cuda        # Default version
module load cuda/12.2   # Specific version
```
Check available versions:
```bash
module avail cuda
```
Verify GPU allocation
In your job, verify GPUs are allocated:
```bash
nvidia-smi
```
The CUDA_VISIBLE_DEVICES environment variable is automatically set to your allocated GPUs.
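For a quick sanity check, the allocation can also be confirmed from inside the job script itself. A minimal sketch (the exact indices printed depend on which GPUs Slurm assigned to the job):

```shell
# Print the GPU indices Slurm exposed to this job (e.g. "0" or "0,1,2,3")
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"

# List the visible GPUs by name and UUID, if nvidia-smi is on PATH
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L
fi
```

If the first line prints `unset`, the job was likely submitted without a `--gres` request or to a non-GPU partition.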
Example Scripts
PyTorch Training
```bash
#!/bin/bash
#SBATCH --job-name=pytorch_train
#SBATCH --output=train.out.%j
#SBATCH --error=train.err.%j
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a100:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=08:00:00

module load cuda
module load anaconda3

# Activate your conda environment
source activate pytorch_env

# Run training
python train.py --epochs 100 --batch-size 64
```
TensorFlow with Multiple GPUs
```bash
#!/bin/bash
#SBATCH --job-name=tf_multigpu
#SBATCH --output=tf.out.%j
#SBATCH --error=tf.err.%j
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a100:4
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --time=24:00:00

module load cuda
module load anaconda3

source activate tf_env

python train_distributed.py
```
CUDA C/C++ Application
```bash
#!/bin/bash
#SBATCH --job-name=cuda_app
#SBATCH --output=cuda.out.%j
#SBATCH --error=cuda.err.%j
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a10:1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

module load cuda

# Compile (if needed)
nvcc -o myprogram myprogram.cu

# Run
./myprogram
```
Multi-node GPU Job (MPI + CUDA)
```bash
#!/bin/bash
#SBATCH --job-name=mpi_cuda
#SBATCH --output=mpi_cuda.out.%j
#SBATCH --error=mpi_cuda.err.%j
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:h100:4
#SBATCH --time=04:00:00

module load cuda
module load openmpi

srun ./my_mpi_cuda_program
```
GPU Memory Considerations
If your job fails with out-of-memory errors:
- Reduce batch size
- Use gradient checkpointing
- Use mixed precision training (FP16/BF16)
- Request a GPU type with more memory (see the table above; H100 offers 80 GB and H200 offers 141 GB, versus 40 GB on the A100)
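As a sketch of the last option: retargeting the PyTorch example above from an A100 to an H200 only requires changing the `--gres` line (partition and QOS limits permitting):

```shell
# Same job, retargeted at a higher-memory GPU type
#SBATCH --partition=gpu
#SBATCH --gres=gpu:h200:1   # 141 GB HBM3 instead of the A100's 40 GB
```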
Monitor GPU memory
Add to your script to monitor GPU usage:
```bash
# Run nvidia-smi in the background, logging every 30 seconds
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used,memory.total \
    --format=csv -l 30 > gpu_usage.log &
```
Checking GPU Availability
```bash
# Show GPU nodes, their generic resources, state, and CPU usage
sinfo -p gpu -o "%N %G %t %C"

# Show available GPUs by type
sinfo -p gpu -o "%N %G"
```
Common Issues
| Problem | Cause | Solution |
|---|---|---|
| CUDA out of memory | Model/batch too large for GPU memory | Reduce batch size, use mixed precision, or request larger GPU |
| No CUDA-capable device | Missing --gres or wrong partition | Add --partition=gpu --gres=gpu:TYPE:N (e.g., --gres=gpu:a100:1) |
| CUDA version mismatch | Code compiled with different CUDA version | Load matching CUDA module |
| GPU not visible | CUDA_VISIBLE_DEVICES not set correctly | Don't override CUDA_VISIBLE_DEVICES in your script |