Cluster Topology and Job Placement
How cluster topology affects parallel job performance and scheduling.
Why Topology Matters
Parallel jobs run many communicating processes, and the physical location of those processes determines communication speed:
- Same core (SMT/hyperthreading): Shared L1/L2 cache - fastest
- Same socket: Shared L3 cache - very fast
- Same node, different sockets: Memory bus (NUMA) - fast
- Same switch: Single network hop - moderate
- Different switches: Multiple network hops - slower
For communication-intensive parallel jobs, keeping processes close together can significantly improve performance.
Node Topology
Each compute node has internal topology:
Sockets and Cores
Most nodes have 2 CPU sockets, each with multiple cores. Cores on the same socket share L3 cache and have faster memory access to local NUMA memory.
Node
├── Socket 0 (NUMA node 0)
│   ├── Core 0, Core 1, ... Core N
│   └── Local Memory
└── Socket 1 (NUMA node 1)
    ├── Core 0, Core 1, ... Core N
    └── Local Memory
View node topology
On a compute node, use lscpu or numactl:
```bash
# View CPU topology
lscpu | grep -E "Socket|Core|Thread|NUMA"

# View NUMA topology
numactl --hardware
```
NUMA Effects
Non-Uniform Memory Access (NUMA) means memory access speed depends on which socket owns the memory. Accessing remote memory (memory attached to the other socket) is slower.
For memory-intensive applications, keeping processes and their memory on the same NUMA node improves performance.
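The same idea can be applied by hand with numactl, outside of any Slurm binding options. This is a sketch, not a site-specific recipe: the binary name ./my_program is a placeholder, and the numactl package must be installed on the node.

```shell
# Run a process entirely on NUMA node 0: both its CPUs and its memory
# (placeholder binary ./my_program; requires the numactl package)
numactl --cpunodebind=0 --membind=0 ./my_program

# Prefer node 0's memory, but fall back to other nodes if it fills up
numactl --preferred=0 ./my_program
```

The strict --membind form fails allocations once node 0's memory is exhausted, while --preferred degrades gracefully to remote memory.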
Network Topology
Nodes are connected through a hierarchical network:
Core Switch
├── Leaf Switch 1
│   ├── Node 001
│   ├── Node 002
│   └── ...
├── Leaf Switch 2
│   ├── Node 033
│   ├── Node 034
│   └── ...
└── ...
Communication between nodes on the same leaf switch is faster than communication across switches.
Slurm Topology-Aware Scheduling
Slurm can consider topology when placing jobs. The scheduler attempts to place multi-node jobs on nodes that are close together in the network.
Request contiguous nodes
Use --contiguous to request a set of consecutively numbered nodes:
```bash
#SBATCH --nodes=4
#SBATCH --contiguous
```
Note: This may increase queue wait time if contiguous nodes aren't available.
Request nodes on same switch
Use --switches to limit the number of network switches:
```bash
#SBATCH --nodes=8
#SBATCH --switches=1   # All nodes on same switch
```
You can add a timeout to fall back if the request can't be satisfied:
```bash
#SBATCH --switches=1@00:30:00   # Wait up to 30 min for a single switch
```
If Slurm can't place all nodes on one switch within 30 minutes, it will relax the constraint.
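On clusters where administrators have configured Slurm's topology plugin, the switch hierarchy that --switches refers to can be inspected directly; the command below only returns output when that plugin is enabled.

```shell
# List the switch hierarchy known to Slurm
# (requires the topology/tree plugin; prints nothing otherwise)
scontrol show topology
```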
Task and Core Binding
Control how tasks are distributed across sockets and cores:
Tasks per node and socket
```bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-socket=2   # Spread across both sockets
```
CPU binding with srun
Use srun options to control binding:
```bash
# Bind each task to a specific core
srun --cpu-bind=cores ./my_mpi_program

# Bind each task to a socket
srun --cpu-bind=sockets ./my_mpi_program

# Bind to NUMA nodes (locality domains)
srun --cpu-bind=ldoms ./my_mpi_program
```
View current binding
Check how tasks are bound:
```bash
srun --cpu-bind=verbose ./my_program
```
Memory Binding
Control memory allocation policy for NUMA systems:
```bash
# Allocate memory on the local NUMA node only
srun --mem-bind=local ./my_program

# Prefer local memory, but allow remote if needed
srun --mem-bind=prefer ./my_program
```
Distribution Patterns
The --distribution option controls how tasks are distributed:
| Pattern | Description | Best For |
|---|---|---|
| block | Fill each node before moving to next | Jobs with heavy intra-node communication |
| cyclic | Round-robin across nodes | Load balancing, memory distribution |
| block:block | Block across nodes, block across sockets | Default, good general choice |
| block:cyclic | Block across nodes, round-robin across sockets | Spread memory load within nodes |
| cyclic:cyclic | Round-robin across nodes and sockets | Maximum distribution |
Example: Block distribution
```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --distribution=block

srun ./my_mpi_program
```
Result: Tasks 0-3 on node 1, tasks 4-7 on node 2.
Example: Cyclic distribution
```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --distribution=cyclic

srun ./my_mpi_program
```
Result: Tasks 0,2,4,6 on node 1; tasks 1,3,5,7 on node 2.
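Both mappings are plain integer arithmetic on the task rank. The POSIX-shell sketch below (not Slurm itself, just the arithmetic, with 0-based node indices) reproduces the two placements above:

```shell
#!/bin/sh
# Sketch: compute the node index for each task rank under block and
# cyclic distribution, for 8 tasks across 2 nodes (as in the examples).
NTASKS=8
NNODES=2
PER_NODE=$((NTASKS / NNODES))   # 4 tasks per node

# block: fill each node before moving to the next
block_node() { echo $(($1 / PER_NODE)); }

# cyclic: round-robin across nodes
cyclic_node() { echo $(($1 % NNODES)); }

for rank in 0 1 2 3 4 5 6 7; do
    echo "task $rank: block -> node $(block_node $rank), cyclic -> node $(cyclic_node $rank)"
done
```

Under block, ranks 0-3 map to node 0 and ranks 4-7 to node 1; under cyclic, even ranks map to node 0 and odd ranks to node 1.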
GPU Topology
On multi-GPU nodes, GPUs may have different connectivity:
- NVLink: High-bandwidth direct GPU-to-GPU connection
- PCIe: Standard bus connection through CPU
For multi-GPU jobs, GPUs connected via NVLink communicate faster than those connected only via PCIe.
Check GPU topology
```bash
nvidia-smi topo -m
```
This shows the connection matrix between GPUs and CPUs.
GPU affinity
Slurm automatically sets CUDA_VISIBLE_DEVICES to your allocated GPUs. For best performance, bind CPU tasks to cores near their assigned GPUs:
```bash
#SBATCH --gres=gpu:h100:2
#SBATCH --cpus-per-task=8

# Let Slurm handle GPU-CPU affinity
srun --gpus-per-task=1 ./my_gpu_program
```
Practical Recommendations
Single-node jobs
- Use --nodes=1 to ensure all tasks are on the same node
- For OpenMP, set OMP_PROC_BIND=spread or OMP_PROC_BIND=close, depending on memory access patterns
- For memory-intensive work, consider NUMA-aware allocation
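As a sketch of the OpenMP settings above, a job script can derive them from Slurm's environment. SLURM_CPUS_PER_TASK is only set inside a Slurm job, so a default of 1 is assumed here, and launching the actual binary is left out.

```shell
# Derive OpenMP settings from Slurm's environment (sketch)
# SLURM_CPUS_PER_TASK is set by Slurm inside a job; default to 1 elsewhere
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export OMP_PROC_BIND=close   # keep threads near each other (good for shared data)
export OMP_PLACES=cores      # one OpenMP place per physical core

echo "threads=$OMP_NUM_THREADS bind=$OMP_PROC_BIND places=$OMP_PLACES"
```

Swap close for spread when threads mostly touch private data and you want them spread across both sockets' memory bandwidth.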
Multi-node MPI jobs
- Use --ntasks-per-node to control task distribution
- Consider --switches=1 for latency-sensitive applications
- Test with --contiguous if communication patterns favor nearby nodes
- Don't over-constrain: stricter topology requests mean longer queue times
Hybrid MPI+OpenMP
- Match MPI ranks to sockets: --ntasks-per-socket=1
- Give each rank cores on its socket: --cpus-per-task=N
- Use --exclusive to avoid interference
- Set OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2   # 2 MPI ranks per node (1 per socket)
#SBATCH --cpus-per-task=16    # 16 threads per rank
#SBATCH --exclusive

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=spread
export OMP_PLACES=cores

srun ./my_hybrid_program
```
Multi-GPU jobs
- Request GPUs that are NVLink-connected when possible
- Match CPU cores to GPU locality
- For distributed training across nodes, network topology matters
Measuring Impact
To see if topology affects your application:
1. Run a baseline job without topology constraints
2. Run the same job with --contiguous or --switches=1
3. Compare both runtime and queue wait time
If the constrained job is significantly faster but waits much longer, consider whether the trade-off is worthwhile.
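Assuming Slurm accounting is enabled on your cluster, sacct reports both numbers: runtime is the Elapsed field, and queue wait is the gap between Submit and Start. The job IDs below are placeholders for your baseline and constrained runs.

```shell
# Compare runtime and queue wait for two completed jobs
# (requires Slurm accounting; replace the placeholder job IDs with your own)
sacct -j 1234567,1234568 --format=JobID,Submit,Start,Elapsed
```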