Read the documentation to determine the expected behavior of an application, then confirm the behavior with a short test.

General Guidelines

  • Serial jobs: Use --ntasks=1. Do not request multiple cores for serial code.
  • Memory intensive applications: Specify the maximum memory required with --mem or --mem-per-cpu. See estimating memory requirements.
  • Threaded applications (OpenMP): Use --ntasks=1 --cpus-per-task=N where N is the number of threads. Set OMP_NUM_THREADS in your script:
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8
    
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    ./my_threaded_program
    
  • MPI applications: Use --ntasks=N where N is the number of MPI ranks. Use srun to launch:
    #SBATCH --ntasks=32
    
    srun ./my_mpi_program
    
  • Shared-memory parallel (single node): Add --nodes=1 to ensure all tasks run on the same node:
    #SBATCH --nodes=1
    #SBATCH --ntasks=16
    
  • Hybrid MPI+OpenMP: Request MPI tasks with --ntasks and threads per task with --cpus-per-task. Use --exclusive to avoid overloading nodes:
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=4
    #SBATCH --cpus-per-task=8
    #SBATCH --exclusive
    
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    srun ./my_hybrid_program
    
    See hybrid jobs guide for details.
  • Automatic threading: Some applications automatically use all available cores. You must verify the threading behavior and either:
    • Set environment variables to limit threads (e.g., OMP_NUM_THREADS, MKL_NUM_THREADS)
    • Request enough cores to match the threading
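The thread-limiting approach can be sketched as a job-script fragment. The `:-1` fallbacks and the OPENBLAS_NUM_THREADS line are additions for illustration; adapt the core count to your request:

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

# Cap the common threading runtimes at the allocated core count.
# SLURM_CPUS_PER_TASK is unset outside a job, so fall back to 1.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export MKL_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export OPENBLAS_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
```

Launch the application after these exports; libraries that honor these variables will then use at most the requested cores.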

MPI Compilers and Libraries

Three MPI environments are available for compiling and running parallel applications:

Environment         Compiler       MPI Library     Module
GNU + OpenMPI       GCC 11.5       OpenMPI 4.1.8   openmpi-gcc/openmpi4.1.8-gcc11.5.0-slurm
Intel + Intel MPI   Intel 2025.3   Intel MPI       PrgEnv-intel/2025.3-slurm
NVIDIA HPC SDK      NVHPC 26.1     OpenMPI         PrgEnv-nvidia/26.1-slurm

GNU + OpenMPI

Use this environment for codes that compile with GCC:

#!/bin/bash
#SBATCH --job-name=mpi_gnu
#SBATCH --output=mpi.out.%j
#SBATCH --ntasks=32
#SBATCH --time=02:00:00

module load openmpi-gcc/openmpi4.1.8-gcc11.5.0-slurm

srun ./my_mpi_program

Compile with mpicc (C), mpicxx (C++), or mpif90 (Fortran):

module load openmpi-gcc/openmpi4.1.8-gcc11.5.0-slurm
mpicc -O2 -o my_mpi_program my_mpi_program.c
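
If a build picks up the wrong compiler, the wrapper's own options can help diagnose it. This sketch assumes the module above is loaded; OpenMPI's wrappers accept `--showme`:

```shell
# Inspect what the OpenMPI compiler wrapper actually invokes.
module load openmpi-gcc/openmpi4.1.8-gcc11.5.0-slurm
mpicc --version   # reports the underlying GCC version
mpicc --showme    # prints the full gcc command line, including MPI flags
```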

Intel + Intel MPI

Use this environment for codes that benefit from Intel optimizations or require Intel compilers:

#!/bin/bash
#SBATCH --job-name=mpi_intel
#SBATCH --output=mpi.out.%j
#SBATCH --ntasks=32
#SBATCH --time=02:00:00

module load PrgEnv-intel/2025.3-slurm

srun ./my_mpi_program

Compile with mpiicx (C), mpiicpx (C++), or mpiifx (Fortran); the classic mpiicc/mpiicpc/mpiifort wrappers are not available with the 2025 compilers:

module load PrgEnv-intel/2025.3-slurm
mpiicx -O2 -o my_mpi_program my_mpi_program.c

NVIDIA HPC SDK

Use this environment for GPU-accelerated codes or codes using NVIDIA compilers:

#!/bin/bash
#SBATCH --job-name=mpi_nvidia
#SBATCH --output=mpi.out.%j
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a100:2
#SBATCH --ntasks=2
#SBATCH --time=02:00:00

module load PrgEnv-nvidia/26.1-slurm

srun ./my_gpu_mpi_program

Compile with nvc (C), nvc++ (C++), or nvfortran (Fortran). For MPI, use the MPI wrappers:

module load PrgEnv-nvidia/26.1-slurm
mpicc -O2 -o my_mpi_program my_mpi_program.c

Choosing an MPI Environment

  • GNU + OpenMPI: Good default choice; widely compatible; open source
  • Intel + Intel MPI: Often faster on Intel processors; includes MKL math library; better for codes using Intel-specific optimizations
  • NVIDIA HPC SDK: Best for GPU-accelerated MPI codes; includes CUDA-aware MPI; supports OpenACC and CUDA Fortran

Important: Use the same module for compiling and running. Mixing environments causes errors.
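
One way to avoid a mixed environment is to reset modules at the top of every build and job script. A sketch — note that `module purge` removes all loaded modules, so reload anything else you need afterwards:

```shell
# Start from a clean module state, then load exactly one MPI environment.
module purge
module load PrgEnv-intel/2025.3-slurm
module list   # confirm only the intended environment is loaded
```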

How to Test Your Application

1. Start an interactive session

salloc --ntasks=4 --nodes=1 --time=00:30:00

2. Run your application

./my_program &

3. Check CPU usage with top or htop

top -u $USER

Look at the %CPU column. Values over 100% indicate multiple threads. A fully busy 4-thread program shows ~400%.

4. Verify the thread/process count

ps -T -p $(pgrep -u $USER my_program) | wc -l

The count includes the ps header line, so a program with 4 threads reports 5.
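
For MPI codes, a quick launch check inside the allocation confirms that tasks land where you expect; here `hostname` stands in for your program:

```shell
# One line of output per task; the hostnames show how ranks are distributed.
srun --ntasks=4 hostname
```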

Common Mistakes

Mistake                                          Problem                                                  Solution
Requesting multiple cores for serial code        Wastes resources, delays scheduling                      Use --ntasks=1
Shared-memory code not constrained to one node   Tasks may be split across nodes; failures or slowdowns   Add --nodes=1
Too few cores for auto-threading applications    Overloads the node, affecting other users                Set thread limits or request matching cores
Not using srun for MPI                           May not launch properly across nodes                     Use srun ./program instead of ./program
Hybrid job without --exclusive                   Node oversubscription, poor performance                  Add --exclusive for hybrid jobs

Further Resources