Why Topology Matters

Processes in a parallel job communicate with one another, and the physical placement of those processes determines how fast that communication is:

  • Same core (SMT/hyperthreading): Shared L1/L2 cache - fastest
  • Same socket: Shared L3 cache - very fast
  • Same node, different sockets: Memory bus (NUMA) - fast
  • Same switch: Single network hop - moderate
  • Different switches: Multiple network hops - slower

For communication-intensive parallel jobs, keeping processes close together can significantly improve performance.

Node Topology

Each compute node has internal topology:

Sockets and Cores

Most nodes have 2 CPU sockets, each with multiple cores. Cores on the same socket share L3 cache and have faster memory access to local NUMA memory.

Node
├── Socket 0 (NUMA node 0)
│   ├── Core 0, Core 1, ... Core N
│   └── Local Memory
└── Socket 1 (NUMA node 1)
    ├── Core 0, Core 1, ... Core N
    └── Local Memory

View node topology

On a compute node, use lscpu or numactl:

# View CPU topology
lscpu | grep -E "Socket|Core|Thread|NUMA"

# View NUMA topology
numactl --hardware

NUMA Effects

Non-Uniform Memory Access (NUMA) means memory access speed depends on which socket owns the memory. Accessing remote memory (memory attached to the other socket) is slower.

For memory-intensive applications, keeping processes and their memory on the same NUMA node improves performance.
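Outside of Slurm's own binding options, the same co-location can be done by hand with numactl (assuming it is installed; ./my_program is a placeholder for your binary):

```shell
# Run on NUMA node 0 only: CPUs and memory both come from socket 0
numactl --cpunodebind=0 --membind=0 ./my_program
```

With --membind, allocations fail rather than spill to the remote node; use --preferred=0 instead if you want a soft preference.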

Network Topology

Nodes are connected through a hierarchical network:

Core Switch
├── Leaf Switch 1
│   ├── Node 001
│   ├── Node 002
│   └── ...
├── Leaf Switch 2
│   ├── Node 033
│   ├── Node 034
│   └── ...
└── ...

Communication between nodes on the same leaf switch is faster than communication across switches.

Slurm Topology-Aware Scheduling

Slurm can consider topology when placing jobs. The scheduler attempts to place multi-node jobs on nodes that are close together in the network.

Request contiguous nodes

Use --contiguous to request nodes that are next to each other:

#SBATCH --nodes=4
#SBATCH --contiguous

Note: This may increase queue wait time if contiguous nodes aren't available.

Request nodes on same switch

Use --switches to limit the number of network switches:

#SBATCH --nodes=8
#SBATCH --switches=1           # All nodes on same switch

You can add a timeout to fall back if the request can't be satisfied:

#SBATCH --switches=1@00:30:00  # Wait up to 30 min for single switch

If Slurm can't place all nodes on one switch within 30 minutes, it will relax the constraint.

Task and Core Binding

Control how tasks are distributed across sockets and cores:

Tasks per node and socket

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-socket=2    # Spread across both sockets

CPU binding with srun

Use srun options to control binding:

# Bind each task to a specific core
srun --cpu-bind=cores ./my_mpi_program

# Bind each task to a socket
srun --cpu-bind=sockets ./my_mpi_program

# Bind to NUMA nodes
srun --cpu-bind=ldoms ./my_mpi_program

View current binding

Check how tasks are bound:

srun --cpu-bind=verbose ./my_program

Memory Binding

Control memory allocation policy for NUMA systems:

# Allocate memory on local NUMA node only
srun --mem-bind=local ./my_program

# Prefer local, but allow remote if needed
srun --mem-bind=prefer ./my_program

Distribution Patterns

The --distribution option controls how tasks are distributed:

Pattern         Description                                      Best For
block           Fill each node before moving to next             Jobs with heavy intra-node communication
cyclic          Round-robin across nodes                         Load balancing, memory distribution
block:block     Block across nodes, block across sockets         Default, good general choice
block:cyclic    Block across nodes, round-robin across sockets   Spread memory load within nodes
cyclic:cyclic   Round-robin across nodes and sockets             Maximum distribution

Example: Block distribution

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --distribution=block

srun ./my_mpi_program

Result: Tasks 0-3 on node 1, tasks 4-7 on node 2.

Example: Cyclic distribution

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --distribution=cyclic

srun ./my_mpi_program

Result: Tasks 0,2,4,6 on node 1; tasks 1,3,5,7 on node 2.
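The two placements above follow from simple arithmetic. A toy sketch (plain shell, not a Slurm command) that maps task IDs to nodes the way block and cyclic do:

```shell
# Map 8 task IDs onto 2 nodes using the block and cyclic rules
ntasks=8
nnodes=2
per_node=$((ntasks / nnodes))   # tasks per node under block distribution

block_map=""
cyclic_map=""
for t in $(seq 0 $((ntasks - 1))); do
  block_map="$block_map $t:node$((t / per_node))"    # consecutive IDs fill a node
  cyclic_map="$cyclic_map $t:node$((t % nnodes))"    # IDs round-robin over nodes
done
echo "block: $block_map"
echo "cyclic:$cyclic_map"
```

Running it reproduces the results above: block puts tasks 0-3 on the first node, while cyclic alternates nodes task by task.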

GPU Topology

On multi-GPU nodes, GPUs may have different connectivity:

  • NVLink: High-bandwidth direct GPU-to-GPU connection
  • PCIe: Standard bus connection through CPU

For multi-GPU jobs, GPUs connected via NVLink communicate faster than those connected only via PCIe.

Check GPU topology

nvidia-smi topo -m

This shows the connection matrix between GPUs and CPUs.

GPU affinity

Slurm automatically sets CUDA_VISIBLE_DEVICES to your allocated GPUs. For best performance, bind CPU tasks to cores near their assigned GPUs:

#SBATCH --gres=gpu:h100:2
#SBATCH --cpus-per-task=8

# Let Slurm handle GPU-CPU affinity
srun --gpus-per-task=1 ./my_gpu_program

Practical Recommendations

Single-node jobs

  • Use --nodes=1 to ensure all tasks are on the same node
  • For OpenMP, set OMP_PROC_BIND=spread or close depending on memory access patterns
  • For memory-intensive work, consider NUMA-aware allocation
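A minimal single-node OpenMP script following these points might look like the sketch below (the core count is illustrative; adjust it to your nodes):

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32        # illustrative; match your node's core count

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=close        # or spread, depending on memory access pattern
export OMP_PLACES=cores

srun ./my_openmp_program
```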

Multi-node MPI jobs

  • Use --ntasks-per-node to control task distribution
  • Consider --switches=1 for latency-sensitive applications
  • Test with --contiguous if communication patterns favor nearby nodes
  • Don't over-constrain: stricter topology requests mean longer queue times
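Putting these together, a latency-sensitive MPI job might be sketched as follows (node and task counts are illustrative):

```shell
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32          # illustrative; match your node's core count
#SBATCH --switches=1@00:30:00         # prefer one switch, relax after 30 min

srun ./my_mpi_program
```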

Hybrid MPI+OpenMP

  • Match MPI ranks to sockets: --ntasks-per-socket=1
  • Give each rank cores on its socket: --cpus-per-task=N
  • Use --exclusive to avoid interference
  • Set OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

Example: Hybrid job script

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2       # 2 MPI ranks per node (1 per socket)
#SBATCH --cpus-per-task=16        # 16 threads per rank
#SBATCH --exclusive

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=spread
export OMP_PLACES=cores

srun ./my_hybrid_program

Multi-GPU jobs

  • Request GPUs that are NVLink-connected when possible
  • Match CPU cores to GPU locality
  • For distributed training across nodes, network topology matters
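A sketch of a single-node multi-GPU job along these lines (the --gpu-bind=closest option asks Slurm to bind each task near its GPU; whether it is available depends on your Slurm version and gres configuration):

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8

# One GPU per task, each task bound to the CPUs closest to its GPU
srun --gpus-per-task=1 --gpu-bind=closest ./my_gpu_program
```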

Measuring Impact

To see if topology affects your application:

  1. Run a baseline job without topology constraints
  2. Run with --contiguous or --switches=1
  3. Compare runtime and wait time

If the constrained job is significantly faster but waits much longer, consider whether the trade-off is worthwhile.
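Once both jobs have finished, sacct can pull the numbers for the comparison (the job IDs below are placeholders for your own):

```shell
# Compare the baseline and constrained runs: Start minus Submit is the queue
# wait, Elapsed is the runtime
sacct -j 12345,12346 --format=JobID,JobName,Submit,Start,Elapsed,NNodes
```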

Further Resources