Advanced Job Techniques
Techniques for generating and submitting multiple jobs programmatically.
When to Automate
Consider automation when you need to:
- Submit many jobs with different input files or parameters
- Generate batch scripts dynamically based on data
- Create complex job workflows with dependencies
- Run the same analysis on multiple datasets
Note: For simple parameter sweeps, array jobs are often easier than scripted submissions.
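As a point of comparison, a 5x5 parameter sweep can be one array submission instead of 25 separate sbatch calls. A minimal sketch (parameter values and the simulation.sh script are illustrative) that decodes the array index into a parameter pair:

```shell
#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --array=1-25
#SBATCH --output=logs/sweep_%A_%a.out
# Decode the 1-based array index into a (temperature, pressure) pair
temps=(100 200 300 400 500)
pressures=(1 5 10 50 100)
idx=$((SLURM_ARRAY_TASK_ID - 1))
temp=${temps[idx / 5]}
pressure=${pressures[idx % 5]}
./simulation.sh "$temp" "$pressure"
```

One submission creates all 25 tasks, and squeue shows them under a single array job ID.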
Shell Script Loops
The simplest automation is a shell loop that submits multiple jobs:
Submit jobs for multiple input files
#!/bin/bash
# submit_all.sh - Submit a job for each input file
for file in data/*.csv; do
filename=$(basename "$file" .csv)
sbatch --job-name="process_${filename}" \
--output="logs/${filename}.out" \
--error="logs/${filename}.err" \
process_job.sh "$file"
done
The batch script process_job.sh receives the filename as $1:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
INPUT_FILE=$1
OUTPUT_FILE="results/$(basename "$INPUT_FILE" .csv)_result.csv"
./analyze.py --input "$INPUT_FILE" --output "$OUTPUT_FILE"
Parameter sweep with loops
#!/bin/bash
# submit_sweep.sh - Parameter sweep
for temp in 100 200 300 400 500; do
for pressure in 1 5 10 50 100; do
sbatch --job-name="sim_T${temp}_P${pressure}" \
--output="logs/sim_T${temp}_P${pressure}.out" \
simulation.sh $temp $pressure
done
done
Generating Batch Scripts
For more complex jobs, generate the entire batch script dynamically:
Shell script generator
#!/bin/bash
# generate_and_submit.sh
for i in $(seq 1 10); do
# Generate batch script
cat > job_${i}.sh << EOF
#!/bin/bash
#SBATCH --job-name=run_${i}
#SBATCH --output=logs/run_${i}.out
#SBATCH --error=logs/run_${i}.err
#SBATCH --ntasks=1
#SBATCH --time=02:00:00
echo "Running iteration ${i}"
./myprogram --iteration ${i} --seed $((RANDOM))
EOF
# Submit the generated script
sbatch job_${i}.sh
done
Python script generator
#!/usr/bin/env python3
# generate_jobs.py
import subprocess
import os
parameters = [
{'name': 'small', 'size': 100, 'time': '01:00:00'},
{'name': 'medium', 'size': 1000, 'time': '04:00:00'},
{'name': 'large', 'size': 10000, 'time': '12:00:00'},
]
os.makedirs('generated_scripts', exist_ok=True)
os.makedirs('logs', exist_ok=True)
for param in parameters:
script_content = f"""#!/bin/bash
#SBATCH --job-name={param['name']}
#SBATCH --output=logs/{param['name']}.out
#SBATCH --error=logs/{param['name']}.err
#SBATCH --ntasks=4
#SBATCH --time={param['time']}
./simulation --size {param['size']} --output results/{param['name']}.dat
"""
script_path = f"generated_scripts/{param['name']}.sh"
with open(script_path, 'w') as f:
f.write(script_content)
# Submit the job
result = subprocess.run(['sbatch', script_path], capture_output=True, text=True)
print(f"Submitted {param['name']}: {result.stdout.strip()}")
Job Dependencies
Chain jobs together so they run in sequence:
Linear pipeline
#!/bin/bash
# submit_pipeline.sh
# Submit first job, capture job ID
JOB1=$(sbatch --parsable preprocess.sh)
echo "Submitted preprocessing: $JOB1"
# Submit second job, depends on first
JOB2=$(sbatch --parsable --dependency=afterok:$JOB1 analyze.sh)
echo "Submitted analysis: $JOB2"
# Submit third job, depends on second
JOB3=$(sbatch --parsable --dependency=afterok:$JOB2 postprocess.sh)
echo "Submitted postprocessing: $JOB3"
Fan-out, fan-in pattern
#!/bin/bash
# submit_fanout.sh - Run multiple jobs, then merge results
# Submit parallel processing jobs
JOBS=""
for i in $(seq 1 10); do
JOB=$(sbatch --parsable process_chunk.sh $i)
JOBS="${JOBS}:${JOB}"
done
# Remove leading colon
JOBS=${JOBS#:}
echo "Submitted processing jobs: $JOBS"
# Submit merge job that waits for all processing jobs
MERGE_JOB=$(sbatch --parsable --dependency=afterok:$JOBS merge_results.sh)
echo "Submitted merge job: $MERGE_JOB"
Dependency types
| Option | Meaning |
|---|---|
| --dependency=afterok:JOBID | Run after JOBID completes successfully |
| --dependency=afterany:JOBID | Run after JOBID completes (success or failure) |
| --dependency=afternotok:JOBID | Run only if JOBID fails |
| --dependency=after:JOBID | Run after JOBID starts |
| --dependency=singleton | Run only one job with this name at a time |
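A common use of afternotok is a cleanup or alerting job that fires only when the main job fails. A hedged sketch (all script names are placeholders):

```shell
#!/bin/bash
# Submit a main job plus follow-ups keyed to its outcome (script names are placeholders)
JOB=$(sbatch --parsable main_job.sh)
# Runs only if the main job fails: salvage partial output, send an alert, etc.
sbatch --dependency=afternotok:$JOB cleanup_failed.sh
# Runs regardless of outcome: e.g. archive logs
sbatch --dependency=afterany:$JOB archive_logs.sh
```

Note that a dependent job whose dependency can never be satisfied (for example, an afternotok job whose parent succeeded) stays pending unless the cluster is configured to remove it.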
Reading Parameters from Files
For large parameter sets, read from a configuration file:
CSV parameter file
Create parameters.csv:
name,temperature,pressure,iterations
run1,100,1.0,1000
run2,200,1.5,2000
run3,300,2.0,3000
Submit jobs from CSV:
#!/bin/bash
# submit_from_csv.sh
# Skip header line, read each row
tail -n +2 parameters.csv | while IFS=, read -r name temp pressure iters; do
sbatch --job-name="$name" \
--output="logs/${name}.out" \
--export=ALL,TEMP=$temp,PRESSURE=$pressure,ITERATIONS=$iters \
simulation.sh
done
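The receiving script then reads the parameters from its environment rather than from positional arguments. A minimal sketch of what simulation.sh could look like under that assumption (the simulate binary and its flags are hypothetical):

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
# simulation.sh - parameters arrive via --export, not as arguments
echo "Job $SLURM_JOB_NAME: T=$TEMP P=$PRESSURE N=$ITERATIONS"
./simulate --temperature "$TEMP" --pressure "$PRESSURE" --iterations "$ITERATIONS"
```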
Python with CSV
#!/usr/bin/env python3
import csv
import subprocess
with open('parameters.csv') as f:
reader = csv.DictReader(f)
for row in reader:
cmd = [
'sbatch',
f'--job-name={row["name"]}',
f'--output=logs/{row["name"]}.out',
f'--export=ALL,TEMP={row["temperature"]},PRESSURE={row["pressure"]}',
'simulation.sh'
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(f'{row["name"]}: {result.stdout.strip()}')
Conditional Submissions
Submit jobs based on conditions:
#!/bin/bash
# submit_missing.sh - Only submit jobs for missing output files
for input in data/*.dat; do
base=$(basename "$input" .dat)
output="results/${base}_result.dat"
if [ ! -f "$output" ]; then
echo "Submitting job for $base (output missing)"
sbatch --job-name="$base" process.sh "$input"
else
echo "Skipping $base (output exists)"
fi
done
Tracking Submitted Jobs
Keep a log of submitted jobs:
#!/bin/bash
# submit_with_log.sh
LOGFILE="submission_log_$(date +%Y%m%d_%H%M%S).txt"
echo "Submission started: $(date)" > "$LOGFILE"
for file in data/*.csv; do
JOBID=$(sbatch --parsable process.sh "$file")
echo "$JOBID,$file,$(date +%Y-%m-%d_%H:%M:%S)" >> "$LOGFILE"
done
echo "Submission complete: $(date)" >> "$LOGFILE"
echo "Jobs logged to $LOGFILE"
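Such a log can later drive status checks. A possible companion script (hypothetical, assuming the jobid,file,timestamp row format written above):

```shell
#!/bin/bash
# check_logged_jobs.sh - report the state of every job recorded in a submission log
LOGFILE=$1
# The first field of each data row is the job ID; the header/footer lines are filtered out
cut -d, -f1 "$LOGFILE" | grep -E '^[0-9]+$' | while read -r jobid; do
    sacct -j "$jobid" --format=JobID,JobName,State,Elapsed --noheader | head -n 1
done
```

Run it as ./check_logged_jobs.sh submission_log_20240101_120000.txt once jobs have started.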
Rate Limiting
Avoid overwhelming the scheduler with too many submissions at once:
#!/bin/bash
# submit_with_delay.sh
for file in data/*.csv; do
sbatch process.sh "$file"
sleep 0.5 # Half-second delay between submissions
done
For very large submissions (1000+ jobs), consider using array jobs with a concurrency limit (--array=1-1000%50) instead.
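As a sketch, the per-file loop above could become a single throttled array submission (selecting the input file by line number with sed is one common idiom; adjust to your directory layout):

```shell
#!/bin/bash
#SBATCH --array=1-1000%50          # 1000 tasks, at most 50 running at once
#SBATCH --output=logs/chunk_%A_%a.out
# Each array task picks its own input file by line number
INPUT=$(ls data/*.csv | sed -n "${SLURM_ARRAY_TASK_ID}p")
./process_job.sh "$INPUT"
```

This is one submission call for the scheduler to track instead of a thousand.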
Best Practices
- Create directories first: Ensure log and output directories exist before submitting
mkdir -p logs results
./submit_all.sh
- Test with one job: Verify your script works before submitting hundreds of jobs
- Use --parsable: When capturing job IDs, use sbatch --parsable for clean output
- Quote variables: Always quote file paths and parameters to handle spaces correctly
- Prefer array jobs: For simple parameter sweeps, array jobs are more efficient than scripted loops
- Check queue limits: Be aware of QOS limits on concurrent jobs
- Keep submission scripts: Save your submission scripts for reproducibility
Further Resources
- Array Jobs - Simpler approach for many similar jobs
- Submission FAQ
- Batch Script Template
Cluster Topology and Job Placement
How cluster topology affects parallel job performance and scheduling.
Why Topology Matters
The processes of a parallel job communicate with one another, and the physical location of those processes affects communication speed:
- Same core (SMT/hyperthreading): Shared L1/L2 cache - fastest
- Same socket: Shared L3 cache - very fast
- Same node, different sockets: Memory bus (NUMA) - fast
- Same switch: Single network hop - moderate
- Different switches: Multiple network hops - slower
For communication-intensive parallel jobs, keeping processes close together can significantly improve performance.
Node Topology
Each compute node has internal topology:
Sockets and Cores
Most nodes have 2 CPU sockets, each with multiple cores. Cores on the same socket share L3 cache and have faster memory access to local NUMA memory.
Node
├── Socket 0 (NUMA node 0)
│ ├── Core 0, Core 1, ... Core N
│ └── Local Memory
└── Socket 1 (NUMA node 1)
├── Core 0, Core 1, ... Core N
└── Local Memory
View node topology
On a compute node, use lscpu or numactl:
# View CPU topology
lscpu | grep -E "Socket|Core|Thread|NUMA"
# View NUMA topology
numactl --hardware
NUMA Effects
Non-Uniform Memory Access (NUMA) means memory access speed depends on which socket owns the memory. Accessing remote memory (memory attached to the other socket) is slower.
For memory-intensive applications, keeping processes and their memory on the same NUMA node improves performance.
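Outside Slurm's own binding options, numactl can pin a process and its memory explicitly; comparing local against remote placement is a quick way to estimate the NUMA penalty for your code (the binary name is a placeholder):

```shell
# Pin CPU and memory to NUMA node 0 (local access)
numactl --cpunodebind=0 --membind=0 ./my_program
# Same CPUs, but memory forced onto node 1 (remote access) - compare runtimes
numactl --cpunodebind=0 --membind=1 ./my_program
```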
Network Topology
Nodes are connected through a hierarchical network:
Core Switch
├── Leaf Switch 1
│   ├── Node 001
│   ├── Node 002
│   └── ...
├── Leaf Switch 2
│   ├── Node 033
│   ├── Node 034
│   └── ...
└── ...
Communication between nodes on the same leaf switch is faster than communication across switches.
Slurm Topology-Aware Scheduling
Slurm can consider topology when placing jobs. The scheduler attempts to place multi-node jobs on nodes that are close together in the network.
Request contiguous nodes
Use --contiguous to request nodes that are next to each other:
#SBATCH --nodes=4
#SBATCH --contiguous
Note: This may increase queue wait time if contiguous nodes aren't available.
Request nodes on same switch
Use --switches to limit the number of network switches:
#SBATCH --nodes=8
#SBATCH --switches=1          # All nodes on same switch
You can add a timeout to fall back if the request can't be satisfied:
#SBATCH --switches=1@00:30:00 # Wait up to 30 min for single switch
If Slurm can't place all nodes on one switch within 30 minutes, it will relax the constraint.
Task and Core Binding
Control how tasks are distributed across sockets and cores:
Tasks per node and socket
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-socket=2   # Spread across both sockets
CPU binding with srun
Use srun options to control binding:
# Bind each task to a specific core
srun --cpu-bind=cores ./my_mpi_program
# Bind each task to a socket
srun --cpu-bind=sockets ./my_mpi_program
# Bind to NUMA nodes
srun --cpu-bind=ldoms ./my_mpi_program
View current binding
Check how tasks are bound:
srun --cpu-bind=verbose ./my_program
Memory Binding
Control memory allocation policy for NUMA systems:
# Allocate memory on local NUMA node only
srun --mem-bind=local ./my_program
# Prefer local, but allow remote if needed
srun --mem-bind=prefer ./my_program
Distribution Patterns
The --distribution option controls how tasks are distributed:
| Pattern | Description | Best For |
|---|---|---|
| block | Fill each node before moving to next | Jobs with heavy intra-node communication |
| cyclic | Round-robin across nodes | Load balancing, memory distribution |
| block:block | Block across nodes, block across sockets | Default, good general choice |
| block:cyclic | Block across nodes, round-robin across sockets | Spread memory load within nodes |
| cyclic:cyclic | Round-robin across nodes and sockets | Maximum distribution |
Example: Block distribution
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --distribution=block
srun ./my_mpi_program
Result: Tasks 0-3 on node 1, tasks 4-7 on node 2.
Example: Cyclic distribution
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --distribution=cyclic
srun ./my_mpi_program
Result: Tasks 0,2,4,6 on node 1; tasks 1,3,5,7 on node 2.
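To confirm how Slurm actually placed your tasks, have each task print its rank and host before launching the real program; a small sketch:

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --distribution=cyclic
# Each task reports where it landed; sort by rank for readability
srun bash -c 'echo "task $SLURM_PROCID on $(hostname)"' | sort -n -k2
```

Comparing this output between block and cyclic runs makes the placement patterns above directly visible.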
GPU Topology
On multi-GPU nodes, GPUs may have different connectivity:
- NVLink: High-bandwidth direct GPU-to-GPU connection
- PCIe: Standard bus connection through CPU
For multi-GPU jobs, GPUs connected via NVLink communicate faster than those connected only via PCIe.
Check GPU topology
nvidia-smi topo -m
This shows the connection matrix between GPUs and CPUs.
GPU affinity
Slurm automatically sets CUDA_VISIBLE_DEVICES to your allocated GPUs. For best performance, bind CPU tasks to cores near their assigned GPUs:
#SBATCH --gres=gpu:h100:2
#SBATCH --cpus-per-task=8
# Let Slurm handle GPU-CPU affinity
srun --gpus-per-task=1 ./my_gpu_program
Practical Recommendations
Single-node jobs
- Use --nodes=1 to ensure all tasks are on the same node
- For OpenMP, set OMP_PROC_BIND=spread or OMP_PROC_BIND=close, depending on whether threads benefit more from memory bandwidth (spread) or shared cache (close)
- For memory-intensive work, consider NUMA-aware allocation
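A single-node OpenMP sketch putting these together (the binary name is a placeholder):

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=close    # pack threads together; use spread to maximize bandwidth
export OMP_PLACES=cores
srun ./my_openmp_program
```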
Multi-node MPI jobs
- Use --ntasks-per-node to control task distribution
- Consider --switches=1 for latency-sensitive applications
- Test with --contiguous if communication patterns favor nearby nodes
- Don't over-constrain: stricter topology requests mean longer queue times
Hybrid MPI+OpenMP
- Match MPI ranks to sockets: --ntasks-per-socket=1
- Give each rank cores on its socket: --cpus-per-task=N
- Use --exclusive to avoid interference
- Set OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2   # 2 MPI ranks per node (1 per socket)
#SBATCH --cpus-per-task=16    # 16 threads per rank
#SBATCH --exclusive
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
srun ./my_hybrid_program
Multi-GPU jobs
- Request GPUs that are NVLink-connected when possible
- Match CPU cores to GPU locality
- For distributed training across nodes, network topology matters
Measuring Impact
To see if topology affects your application:
- Run a baseline job without topology constraints
- Run with --contiguous or --switches=1
- Compare runtime and wait time
If the constrained job is significantly faster but waits much longer, consider whether the trade-off is worthwhile.
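Once both jobs finish, sacct can report queue wait (Submit to Start) and runtime side by side; a sketch with placeholder job IDs:

```shell
# Compare wait time and runtime of the baseline and constrained runs
sacct -j <baseline_jobid>,<constrained_jobid> \
      --format=JobID,JobName,Submit,Start,Elapsed,State
```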