Basic GPU Job

To run a GPU job, use the gpu partition and request GPUs with --gres. You must specify the GPU type (e.g., --gres=gpu:a100:1). Untyped requests like --gres=gpu:1 will be rejected.

#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --output=gpu.out.%j
#SBATCH --error=gpu.err.%j
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a100:1
#SBATCH --ntasks=1
#SBATCH --time=02:00:00

module load cuda
./my_cuda_program
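For quick debugging before submitting a batch job, the same resources can be requested interactively with srun. This is a sketch using the partition and GPU type names from this page; adjust counts and limits to your needs:

```shell
# Interactive shell on a GPU node for up to 30 minutes
srun --partition=gpu --gres=gpu:a100:1 --cpus-per-task=4 --mem=16G \
     --time=00:30:00 --pty bash
```

Once the shell starts you are on the allocated node and can run nvidia-smi or test your program directly.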

Available GPU Types

The cluster has several types of NVIDIA GPUs. Request specific types by name:

GPU Type          Architecture   Memory        GPUs/Node   Request Syntax
NVIDIA H200       Hopper         141 GB HBM3   4           --gres=gpu:h200:N
NVIDIA H100       Hopper         80 GB HBM3    4           --gres=gpu:h100:N
NVIDIA A100       Ampere         40 GB HBM2e   4           --gres=gpu:a100:N
NVIDIA L40S       Ada Lovelace   48 GB GDDR6   4           --gres=gpu:l40s:N
NVIDIA L40        Ada Lovelace   48 GB GDDR6   4           --gres=gpu:l40:N
NVIDIA A30        Ampere         24 GB HBM2    2           --gres=gpu:a30:N
NVIDIA A10        Ampere         24 GB GDDR6   2           --gres=gpu:a10:N
NVIDIA P100       Pascal         16 GB HBM2    2           --gres=gpu:p100:N
NVIDIA RTX 2080   Turing         8 GB GDDR6    4           --gres=gpu:rtx_2080:N
NVIDIA GTX 1080   Pascal         8 GB GDDR5X   2           --gres=gpu:gtx1080:N

Choosing a GPU Type

  • H200/H100: Best for large language models, transformer training, scientific computing, and workloads needing maximum memory bandwidth
  • A100: Excellent for deep learning training, scientific computing, and multi-GPU workloads
  • L40/L40S: Good for inference, visualization, and mixed AI/graphics workloads
  • A30/A10: Suitable for smaller models, inference, and development/testing
  • P100/RTX 2080/GTX 1080: Older generation; suitable for development, testing, and smaller workloads

Requesting Multiple GPUs

Multiple GPUs on one node

#SBATCH --partition=gpu
#SBATCH --gres=gpu:h100:4         # 4 H100 GPUs
#SBATCH --ntasks=1

Multiple nodes with GPUs

#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:l40:4          # 4 L40 GPUs per node
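Newer Slurm releases (19.05 and later) also accept GPU-count flags as an alternative to --gres; whether these are enabled depends on the site's configuration, so treat this as an equivalent sketch rather than guaranteed syntax:

```shell
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=l40:4     # same allocation as --gres=gpu:l40:4 per node
```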

Short GPU Jobs

For GPU jobs under 2 hours, use the short_gpu QOS. This provides access to idle partner GPUs, giving you a larger pool of available hardware:

#SBATCH --partition=gpu
#SBATCH --qos=short_gpu
#SBATCH --gres=gpu:a30:1
#SBATCH --time=01:30:00
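If you want to confirm the QOS limits (wall-time cap, priority) yourself, standard Slurm accounting commands can display them; the field names below are stock sacctmgr format fields, but the output depends on how accounting is configured on this cluster:

```shell
# Show the wall-clock limit and priority of the short_gpu QOS
sacctmgr show qos short_gpu format=Name,MaxWall,Priority
```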

Environment Setup

CUDA Toolkit

Load the CUDA module to access NVIDIA compilers and libraries:

module load cuda              # Default version
module load cuda/12.2         # Specific version

Check available versions: module avail cuda
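After loading a module, it is worth confirming which toolkit actually landed on your PATH, especially when several versions are installed:

```shell
module load cuda/12.2
which nvcc          # path should point into the 12.2 install tree
nvcc --version      # release string should match the module you loaded
```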

Verify GPU allocation

In your job, verify GPUs are allocated:

nvidia-smi

The CUDA_VISIBLE_DEVICES environment variable is automatically set to your allocated GPUs.
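A minimal sanity check near the top of a job script can catch allocation mistakes early. This sketch guards the nvidia-smi call so it also runs cleanly on nodes where the tool is absent:

```shell
# Print the devices Slurm assigned to this job
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
# nvidia-smi exists only on GPU nodes, so guard the call
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L    # one line per visible GPU
fi
```

If the first line prints "unset", the job was likely submitted without --gres or to the wrong partition.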

Example Scripts

PyTorch Training

#!/bin/bash
#SBATCH --job-name=pytorch_train
#SBATCH --output=train.out.%j
#SBATCH --error=train.err.%j
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a100:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=08:00:00

module load cuda
module load anaconda3

# Activate your conda environment
source activate pytorch_env

# Run training
python train.py --epochs 100 --batch-size 64

TensorFlow with Multiple GPUs

#!/bin/bash
#SBATCH --job-name=tf_multigpu
#SBATCH --output=tf.out.%j
#SBATCH --error=tf.err.%j
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a100:4
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --time=24:00:00

module load cuda
module load anaconda3
source activate tf_env

python train_distributed.py

CUDA C/C++ Application

#!/bin/bash
#SBATCH --job-name=cuda_app
#SBATCH --output=cuda.out.%j
#SBATCH --error=cuda.err.%j
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a10:1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

module load cuda

# Compile (if needed)
nvcc -o myprogram myprogram.cu

# Run
./myprogram
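nvcc defaults to a generic target; compiling for the allocated GPU's compute capability can matter for performance. The mappings below come from NVIDIA's published compute-capability specs (the A10 requested in this script is capability 8.6):

```shell
# Build for the A10 (sm_86); for other GPUs from the table use e.g.
# sm_90 (H100/H200), sm_80 (A100/A30), sm_89 (L40/L40S),
# sm_75 (RTX 2080), sm_61 (GTX 1080), sm_60 (P100)
nvcc -arch=sm_86 -o myprogram myprogram.cu
```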

Multi-node GPU Job (MPI + CUDA)

#!/bin/bash
#SBATCH --job-name=mpi_cuda
#SBATCH --output=mpi_cuda.out.%j
#SBATCH --error=mpi_cuda.err.%j
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:h100:4
#SBATCH --time=04:00:00

module load cuda
module load openmpi

srun ./my_mpi_cuda_program
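With multiple ranks per node, each rank usually needs exactly one GPU. One common pattern (a sketch, not a site-mandated setup) is a small wrapper that maps each rank's node-local ID, which srun exports as SLURM_LOCALID, to a device ordinal:

```shell
#!/bin/bash
# bind_gpu.sh (hypothetical helper): give each MPI rank its own local GPU.
# SLURM_LOCALID is the rank's index on its node (0..3 here), which lines up
# with GPU ordinals when --ntasks-per-node equals the GPU count per node.
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"
```

Launch it as srun ./bind_gpu.sh ./my_mpi_cuda_program so every rank sees a single distinct GPU.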

GPU Memory Considerations

If your job fails with out-of-memory errors:

  • Reduce batch size
  • Use gradient checkpointing
  • Use mixed precision training (FP16/BF16)
  • Request GPUs with more memory (A100-80GB, H100, H200)

Monitor GPU memory

Add this to your job script to log GPU usage while your workload runs, and stop the logger when it finishes:

# Log utilization and memory every 30 seconds in the background
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used,memory.total \
    --format=csv -l 30 > gpu_usage.log &
NVSMI_PID=$!

# ... run your workload here ...

# Stop the background logger
kill $NVSMI_PID

Checking GPU Availability

# Show GPU nodes with their GRES, state, and CPU availability
sinfo -p gpu -o "%N %G %t %C"

# Show only idle GPU nodes and their GPU types
sinfo -p gpu -t idle -o "%N %G"

Common Issues

Problem                  Cause                                         Solution
CUDA out of memory       Model/batch too large for GPU memory          Reduce batch size, use mixed precision, or request a larger GPU
No CUDA-capable device   Missing --gres or wrong partition             Add --partition=gpu --gres=gpu:TYPE:N (e.g., --gres=gpu:a100:1)
CUDA version mismatch    Code compiled with a different CUDA version   Load the matching CUDA module
GPU not visible          CUDA_VISIBLE_DEVICES overridden in script     Don't override CUDA_VISIBLE_DEVICES; Slurm sets it for you

Further Resources