Basic GPU Job

To run a GPU job, use the gpu partition and request GPUs with --gres. You must specify the GPU type (e.g., --gres=gpu:a100:1). Untyped requests like --gres=gpu:1 will be rejected.

#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --output=gpu.out.%j
#SBATCH --error=gpu.err.%j
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a100:1
#SBATCH --ntasks=1
#SBATCH --time=02:00:00

module load cuda
./my_cuda_program
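For quick debugging before submitting a batch job, the same resources can be requested interactively with srun. This is a sketch using the partition and GPU type names from this page; adjust counts and limits to your needs:

```shell
# Interactive shell on a GPU node for up to 30 minutes
srun --partition=gpu --gres=gpu:a100:1 --cpus-per-task=4 --mem=16G \
     --time=00:30:00 --pty bash
```

Once the shell starts you are on the allocated node and can run nvidia-smi or test your program directly.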

Available GPU Types

The cluster has several types of NVIDIA GPUs. Request specific types by name:

GPU Type          Architecture   Memory        GPUs/Node   Request Syntax
NVIDIA H200       Hopper         141 GB HBM3   4           --gres=gpu:h200:N
NVIDIA H100       Hopper         80 GB HBM3    4           --gres=gpu:h100:N
NVIDIA A100       Ampere         40 GB HBM2e   4           --gres=gpu:a100:N
NVIDIA L40S       Ada Lovelace   48 GB GDDR6   4           --gres=gpu:l40s:N
NVIDIA L40        Ada Lovelace   48 GB GDDR6   4           --gres=gpu:l40:N
NVIDIA A30        Ampere         24 GB HBM2    2           --gres=gpu:a30:N
NVIDIA A10        Ampere         24 GB GDDR6   2           --gres=gpu:a10:N
NVIDIA P100       Pascal         16 GB HBM2    2           --gres=gpu:p100:N
NVIDIA RTX 2080   Turing         8 GB GDDR6    4           --gres=gpu:rtx_2080:N
NVIDIA GTX 1080   Pascal         8 GB GDDR5X   2           --gres=gpu:gtx1080:N

Choosing a GPU Type

  • H200/H100: Best for large language models, transformer training, scientific computing, and workloads needing maximum memory bandwidth
  • A100: Excellent for deep learning training, scientific computing, and multi-GPU workloads
  • L40/L40S: Good for inference, visualization, and mixed AI/graphics workloads
  • A30/A10: Suitable for smaller models, inference, and development/testing
  • P100/RTX 2080/GTX 1080: Older generation; suitable for development, testing, and smaller workloads

Requesting Multiple GPUs

Multiple GPUs on one node

#SBATCH --partition=gpu
#SBATCH --gres=gpu:h100:4         # 4 H100 GPUs
#SBATCH --ntasks=1

Multiple nodes with GPUs

#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:l40:4          # 4 L40 GPUs per node
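Newer Slurm releases (19.05 and later) also accept GPU-count flags as an alternative to --gres; whether these are enabled depends on the site's configuration, so treat this as an equivalent sketch rather than guaranteed syntax:

```shell
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=l40:4     # same allocation as --gres=gpu:l40:4 per node
```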

Short GPU Jobs

For GPU jobs under 2 hours, use the short_gpu QOS. This provides access to idle partner GPUs, giving you a larger pool of available hardware:

#SBATCH --partition=gpu
#SBATCH --qos=short_gpu
#SBATCH --gres=gpu:a30:1
#SBATCH --time=01:30:00
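If you want to confirm the QOS limits (wall-time cap, priority) yourself, standard Slurm accounting commands can display them; the field names below are stock sacctmgr format fields, but the output depends on how accounting is configured on this cluster:

```shell
# Show the wall-clock limit and priority of the short_gpu QOS
sacctmgr show qos short_gpu format=Name,MaxWall,Priority
```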

Environment Setup

CUDA Toolkit

Load the CUDA module to access NVIDIA compilers and libraries:

module load cuda              # Default version
module load cuda/12.2         # Specific version

Check available versions: module avail cuda
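After loading a module, it is worth confirming which toolkit actually landed on your PATH, especially when several versions are installed:

```shell
module load cuda/12.2
which nvcc          # path should point into the 12.2 install tree
nvcc --version      # release string should match the module you loaded
```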

Verify GPU allocation

In your job, verify GPUs are allocated:

nvidia-smi

The CUDA_VISIBLE_DEVICES environment variable is automatically set to your allocated GPUs.
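A minimal sanity check near the top of a job script can catch allocation mistakes early. This sketch guards the nvidia-smi call so it also runs cleanly on nodes where the tool is absent:

```shell
# Print the devices Slurm assigned to this job
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
# nvidia-smi exists only on GPU nodes, so guard the call
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L    # one line per visible GPU
fi
```

If the first line prints "unset", the job was likely submitted without --gres or to the wrong partition.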

Example Scripts

PyTorch Training

#!/bin/bash
#SBATCH --job-name=pytorch_train
#SBATCH --output=train.out.%j
#SBATCH --error=train.err.%j
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a100:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=08:00:00

module load cuda
module load anaconda3

# Activate your conda environment
source activate pytorch_env

# Run training
python train.py --epochs 100 --batch-size 64

TensorFlow with Multiple GPUs

#!/bin/bash
#SBATCH --job-name=tf_multigpu
#SBATCH --output=tf.out.%j
#SBATCH --error=tf.err.%j
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a100:4
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --time=24:00:00

module load cuda
module load anaconda3
source activate tf_env

python train_distributed.py

CUDA C/C++ Application

#!/bin/bash
#SBATCH --job-name=cuda_app
#SBATCH --output=cuda.out.%j
#SBATCH --error=cuda.err.%j
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a10:1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

module load cuda

# Compile (if needed)
nvcc -o myprogram myprogram.cu

# Run
./myprogram
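nvcc defaults to a generic target; compiling for the allocated GPU's compute capability can matter for performance. The mappings below come from NVIDIA's published compute-capability specs (the A10 requested in this script is capability 8.6):

```shell
# Build for the A10 (sm_86); for other GPUs from the table use e.g.
# sm_90 (H100/H200), sm_80 (A100/A30), sm_89 (L40/L40S),
# sm_75 (RTX 2080), sm_61 (GTX 1080), sm_60 (P100)
nvcc -arch=sm_86 -o myprogram myprogram.cu
```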

Multi-node GPU Job (MPI + CUDA)

#!/bin/bash
#SBATCH --job-name=mpi_cuda
#SBATCH --output=mpi_cuda.out.%j
#SBATCH --error=mpi_cuda.err.%j
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:h100:4
#SBATCH --time=04:00:00

module load cuda
module load openmpi

srun ./my_mpi_cuda_program
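With multiple ranks per node, each rank usually needs exactly one GPU. One common pattern (a sketch, not a site-mandated setup) is a small wrapper that maps each rank's node-local ID, which srun exports as SLURM_LOCALID, to a device ordinal:

```shell
#!/bin/bash
# bind_gpu.sh (hypothetical helper): give each MPI rank its own local GPU.
# SLURM_LOCALID is the rank's index on its node (0..3 here), which lines up
# with GPU ordinals when --ntasks-per-node equals the GPU count per node.
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"
```

Launch it as srun ./bind_gpu.sh ./my_mpi_cuda_program so every rank sees a single distinct GPU.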

GPU Memory Considerations

If your job fails with out-of-memory errors:

  • Reduce batch size
  • Use gradient checkpointing
  • Use mixed precision training (FP16/BF16)
  • Request GPUs with more memory (A100-80GB, H100, H200)

Monitor GPU memory

Add this to your job script to log GPU usage while your workload runs, and stop the logger when it finishes:

# Log utilization and memory every 30 seconds in the background
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used,memory.total \
    --format=csv -l 30 > gpu_usage.log &
NVSMI_PID=$!

# ... run your workload here ...

# Stop the background logger
kill $NVSMI_PID

Checking GPU Availability

# Show GPU nodes with their GRES, state, and CPU availability
sinfo -p gpu -o "%N %G %t %C"

# Show only idle GPU nodes and their GPU types
sinfo -p gpu -t idle -o "%N %G"

Common Issues

Problem                  Cause                                         Solution
CUDA out of memory       Model/batch too large for GPU memory          Reduce batch size, use mixed precision, or request a larger GPU
No CUDA-capable device   Missing --gres or wrong partition             Add --partition=gpu --gres=gpu:TYPE:N (e.g., --gres=gpu:a100:1)
CUDA version mismatch    Code compiled with a different CUDA version   Load the matching CUDA module
GPU not visible          CUDA_VISIBLE_DEVICES overridden in script     Don't override CUDA_VISIBLE_DEVICES; Slurm sets it for you

Further Resources