Read the documentation to determine the expected behavior of an application, then confirm the behavior with a short test.

General Guidelines

  • Serial jobs: Use --ntasks=1. Do not request multiple cores for serial code.
  • Memory intensive applications: Specify the maximum memory required with --mem or --mem-per-cpu. See estimating memory requirements.
  • Threaded applications (OpenMP): Use --ntasks=1 --cpus-per-task=N where N is the number of threads. Set OMP_NUM_THREADS in your script:
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8
    
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    ./my_threaded_program
    
  • MPI applications: Use --ntasks=N where N is the number of MPI ranks. Use srun to launch:
    #SBATCH --ntasks=32
    
    srun ./my_mpi_program
    
  • Shared-memory parallel (single node): Add --nodes=1 to ensure all tasks run on the same node:
    #SBATCH --nodes=1
    #SBATCH --ntasks=16
    
  • Hybrid MPI+OpenMP: Request MPI tasks with --ntasks and threads per task with --cpus-per-task. Use --exclusive to avoid overloading nodes:
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=4
    #SBATCH --cpus-per-task=8
    #SBATCH --exclusive
    
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    srun ./my_hybrid_program
    
    See hybrid jobs guide for details.
  • Automatic threading: Some applications automatically use all available cores. You must verify the threading behavior and either:
    • Set environment variables to limit threads (e.g., OMP_NUM_THREADS, MKL_NUM_THREADS)
    • Request enough cores to match the threading
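The thread-limiting approach can be sketched as a job-script fragment. The `:-1` fallbacks and the OPENBLAS_NUM_THREADS line are additions for illustration; adapt the core count to your request:

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

# Cap the common threading runtimes at the allocated core count.
# SLURM_CPUS_PER_TASK is unset outside a job, so fall back to 1.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export MKL_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export OPENBLAS_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
```

Launch the application after these exports; libraries that honor these variables will then use at most the requested cores.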

MPI Compilers and Libraries

Three MPI environments are available for compiling and running parallel applications:

Environment         Compiler       MPI Library     Module
GNU + OpenMPI       GCC 11.5       OpenMPI 4.1.8   openmpi-gcc/openmpi4.1.8-gcc11.5.0-slurm
Intel + Intel MPI   Intel 2025.3   Intel MPI       PrgEnv-intel/2025.3-slurm
NVIDIA HPC SDK      NVHPC 26.1     OpenMPI         PrgEnv-nvidia/26.1-slurm

GNU + OpenMPI

Use this environment for codes that compile with GCC:

#!/bin/bash
#SBATCH --job-name=mpi_gnu
#SBATCH --output=mpi.out.%j
#SBATCH --ntasks=32
#SBATCH --time=02:00:00

module load openmpi-gcc/openmpi4.1.8-gcc11.5.0-slurm

srun ./my_mpi_program

Compile with mpicc (C), mpicxx (C++), or mpif90 (Fortran):

module load openmpi-gcc/openmpi4.1.8-gcc11.5.0-slurm
mpicc -O2 -o my_mpi_program my_mpi_program.c
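
If a build picks up the wrong compiler, the wrapper's own options can help diagnose it. This sketch assumes the module above is loaded; OpenMPI's wrappers accept `--showme`:

```shell
# Inspect what the OpenMPI compiler wrapper actually invokes.
module load openmpi-gcc/openmpi4.1.8-gcc11.5.0-slurm
mpicc --version   # reports the underlying GCC version
mpicc --showme    # prints the full gcc command line, including MPI flags
```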

Intel + Intel MPI

Use this environment for codes that benefit from Intel optimizations or require Intel compilers:

#!/bin/bash
#SBATCH --job-name=mpi_intel
#SBATCH --output=mpi.out.%j
#SBATCH --ntasks=32
#SBATCH --time=02:00:00

module load PrgEnv-intel/2025.3-slurm

srun ./my_mpi_program

Compile with mpiicx (C), mpiicpx (C++), or mpiifx (Fortran); the classic mpiicc/mpiicpc/mpiifort wrappers are not available with the 2025 compilers:

module load PrgEnv-intel/2025.3-slurm
mpiicx -O2 -o my_mpi_program my_mpi_program.c

NVIDIA HPC SDK

Use this environment for GPU-accelerated codes or codes using NVIDIA compilers:

#!/bin/bash
#SBATCH --job-name=mpi_nvidia
#SBATCH --output=mpi.out.%j
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a100:2
#SBATCH --ntasks=2
#SBATCH --time=02:00:00

module load PrgEnv-nvidia/26.1-slurm

srun ./my_gpu_mpi_program

Compile with nvc (C), nvc++ (C++), or nvfortran (Fortran). For MPI, use the MPI wrappers:

module load PrgEnv-nvidia/26.1-slurm
mpicc -O2 -o my_mpi_program my_mpi_program.c

Choosing an MPI Environment

  • GNU + OpenMPI: Good default choice; widely compatible; open source
  • Intel + Intel MPI: Often faster on Intel processors; includes MKL math library; better for codes using Intel-specific optimizations
  • NVIDIA HPC SDK: Best for GPU-accelerated MPI codes; includes CUDA-aware MPI; supports OpenACC and CUDA Fortran

Important: Use the same module for compiling and running. Mixing environments causes errors.
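
One way to avoid a mixed environment is to reset modules at the top of every build and job script. A sketch — note that `module purge` removes all loaded modules, so reload anything else you need afterwards:

```shell
# Start from a clean module state, then load exactly one MPI environment.
module purge
module load PrgEnv-intel/2025.3-slurm
module list   # confirm only the intended environment is loaded
```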

How to Test Your Application

1. Start an interactive session

salloc --ntasks=4 --nodes=1 --time=00:30:00

2. Run your application

./my_program &

3. Check CPU usage with top or htop

top -u $USER

Look at the %CPU column. Values over 100% indicate multiple threads. A fully busy 4-thread program shows ~400%.

4. Verify the thread/process count

ps -T -p $(pgrep -u $USER my_program) | wc -l

The count includes the ps header line, so a program with 4 threads reports 5.
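
For MPI codes, a quick launch check inside the allocation confirms that tasks land where you expect; here `hostname` stands in for your program:

```shell
# One line of output per task; the hostnames show how ranks are distributed.
srun --ntasks=4 hostname
```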

Common Mistakes

Mistake                                          Problem                                                  Solution
Requesting multiple cores for serial code        Wastes resources, delays scheduling                      Use --ntasks=1
Shared-memory code not constrained to one node   Tasks may be split across nodes; failures or slowdowns   Add --nodes=1
Too few cores for auto-threading applications    Overloads the node, affecting other users                Set thread limits or request matching cores
Not using srun for MPI                           May not launch properly across nodes                     Use srun ./program instead of ./program
Hybrid job without --exclusive                   Node oversubscription, poor performance                  Add --exclusive for hybrid jobs

Further Resources