Advanced Job Techniques
Techniques for generating and submitting multiple jobs programmatically.
When to Automate
Consider automation when you need to:
- Submit many jobs with different input files or parameters
- Generate batch scripts dynamically based on data
- Create complex job workflows with dependencies
- Run the same analysis on multiple datasets
Note: For simple parameter sweeps, array jobs are often easier than scripted submissions.
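As a point of comparison, a 5x5 parameter sweep can be one array submission instead of 25 separate sbatch calls. A minimal sketch (parameter values and the simulation.sh script are illustrative) that decodes the array index into a parameter pair:

```shell
#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --array=1-25
#SBATCH --output=logs/sweep_%A_%a.out
# Decode the 1-based array index into a (temperature, pressure) pair
temps=(100 200 300 400 500)
pressures=(1 5 10 50 100)
idx=$((SLURM_ARRAY_TASK_ID - 1))
temp=${temps[idx / 5]}
pressure=${pressures[idx % 5]}
./simulation.sh "$temp" "$pressure"
```

One submission creates all 25 tasks, and squeue shows them under a single array job ID.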
Shell Script Loops
The simplest automation is a shell loop that submits multiple jobs:
Submit jobs for multiple input files
#!/bin/bash
# submit_all.sh - Submit a job for each input file
for file in data/*.csv; do
filename=$(basename "$file" .csv)
sbatch --job-name="process_${filename}" \
--output="logs/${filename}.out" \
--error="logs/${filename}.err" \
process_job.sh "$file"
done
The batch script process_job.sh receives the filename as $1:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
INPUT_FILE=$1
OUTPUT_FILE="results/$(basename "$INPUT_FILE" .csv)_result.csv"
./analyze.py --input "$INPUT_FILE" --output "$OUTPUT_FILE"
Parameter sweep with loops
#!/bin/bash
# submit_sweep.sh - Parameter sweep
for temp in 100 200 300 400 500; do
for pressure in 1 5 10 50 100; do
sbatch --job-name="sim_T${temp}_P${pressure}" \
--output="logs/sim_T${temp}_P${pressure}.out" \
simulation.sh $temp $pressure
done
done
Generating Batch Scripts
For more complex jobs, generate the entire batch script dynamically:
Shell script generator
#!/bin/bash
# generate_and_submit.sh
for i in $(seq 1 10); do
# Generate batch script
cat > job_${i}.sh << EOF
#!/bin/bash
#SBATCH --job-name=run_${i}
#SBATCH --output=logs/run_${i}.out
#SBATCH --error=logs/run_${i}.err
#SBATCH --ntasks=1
#SBATCH --time=02:00:00
echo "Running iteration ${i}"
./myprogram --iteration ${i} --seed $((RANDOM))
EOF
# Submit the generated script
sbatch job_${i}.sh
done
Python script generator
#!/usr/bin/env python3
# generate_jobs.py
import subprocess
import os
parameters = [
{'name': 'small', 'size': 100, 'time': '01:00:00'},
{'name': 'medium', 'size': 1000, 'time': '04:00:00'},
{'name': 'large', 'size': 10000, 'time': '12:00:00'},
]
os.makedirs('generated_scripts', exist_ok=True)
os.makedirs('logs', exist_ok=True)
for param in parameters:
script_content = f"""#!/bin/bash
#SBATCH --job-name={param['name']}
#SBATCH --output=logs/{param['name']}.out
#SBATCH --error=logs/{param['name']}.err
#SBATCH --ntasks=4
#SBATCH --time={param['time']}
./simulation --size {param['size']} --output results/{param['name']}.dat
"""
script_path = f"generated_scripts/{param['name']}.sh"
with open(script_path, 'w') as f:
f.write(script_content)
# Submit the job
result = subprocess.run(['sbatch', script_path], capture_output=True, text=True)
print(f"Submitted {param['name']}: {result.stdout.strip()}")
Job Dependencies
Chain jobs together so they run in sequence:
Linear pipeline
#!/bin/bash
# submit_pipeline.sh
# Submit first job, capture job ID
JOB1=$(sbatch --parsable preprocess.sh)
echo "Submitted preprocessing: $JOB1"
# Submit second job, depends on first
JOB2=$(sbatch --parsable --dependency=afterok:$JOB1 analyze.sh)
echo "Submitted analysis: $JOB2"
# Submit third job, depends on second
JOB3=$(sbatch --parsable --dependency=afterok:$JOB2 postprocess.sh)
echo "Submitted postprocessing: $JOB3"
Fan-out, fan-in pattern
#!/bin/bash
# submit_fanout.sh - Run multiple jobs, then merge results
# Submit parallel processing jobs
JOBS=""
for i in $(seq 1 10); do
JOB=$(sbatch --parsable process_chunk.sh $i)
JOBS="${JOBS}:${JOB}"
done
# Remove leading colon
JOBS=${JOBS#:}
echo "Submitted processing jobs: $JOBS"
# Submit merge job that waits for all processing jobs
MERGE_JOB=$(sbatch --parsable --dependency=afterok:$JOBS merge_results.sh)
echo "Submitted merge job: $MERGE_JOB"
Dependency types
| Option | Meaning |
|---|---|
| --dependency=afterok:JOBID | Run after JOBID completes successfully |
| --dependency=afterany:JOBID | Run after JOBID completes (success or failure) |
| --dependency=afternotok:JOBID | Run only if JOBID fails |
| --dependency=after:JOBID | Run after JOBID starts |
| --dependency=singleton | Run only one job with this name at a time |
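A common use of afternotok is a cleanup or alerting job that fires only when the main job fails. A hedged sketch (all script names are placeholders):

```shell
#!/bin/bash
# Submit a main job plus follow-ups keyed to its outcome (script names are placeholders)
JOB=$(sbatch --parsable main_job.sh)
# Runs only if the main job fails: salvage partial output, send an alert, etc.
sbatch --dependency=afternotok:$JOB cleanup_failed.sh
# Runs regardless of outcome: e.g. archive logs
sbatch --dependency=afterany:$JOB archive_logs.sh
```

Note that a dependent job whose dependency can never be satisfied (for example, an afternotok job whose parent succeeded) stays pending unless the cluster is configured to remove it.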
Reading Parameters from Files
For large parameter sets, read from a configuration file:
CSV parameter file
Create parameters.csv:
name,temperature,pressure,iterations
run1,100,1.0,1000
run2,200,1.5,2000
run3,300,2.0,3000
Submit jobs from CSV:
#!/bin/bash
# submit_from_csv.sh
# Skip header line, read each row
tail -n +2 parameters.csv | while IFS=, read -r name temp pressure iters; do
sbatch --job-name="$name" \
--output="logs/${name}.out" \
--export=ALL,TEMP=$temp,PRESSURE=$pressure,ITERATIONS=$iters \
simulation.sh
done
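The receiving script then reads the parameters from its environment rather than from positional arguments. A minimal sketch of what simulation.sh could look like under that assumption (the simulate binary and its flags are hypothetical):

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
# simulation.sh - parameters arrive via --export, not as arguments
echo "Job $SLURM_JOB_NAME: T=$TEMP P=$PRESSURE N=$ITERATIONS"
./simulate --temperature "$TEMP" --pressure "$PRESSURE" --iterations "$ITERATIONS"
```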
Python with CSV
#!/usr/bin/env python3
import csv
import subprocess
with open('parameters.csv') as f:
reader = csv.DictReader(f)
for row in reader:
cmd = [
'sbatch',
f'--job-name={row["name"]}',
f'--output=logs/{row["name"]}.out',
f'--export=ALL,TEMP={row["temperature"]},PRESSURE={row["pressure"]}',
'simulation.sh'
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(f'{row["name"]}: {result.stdout.strip()}')
Conditional Submissions
Submit jobs based on conditions:
#!/bin/bash
# submit_missing.sh - Only submit jobs for missing output files
for input in data/*.dat; do
base=$(basename "$input" .dat)
output="results/${base}_result.dat"
if [ ! -f "$output" ]; then
echo "Submitting job for $base (output missing)"
sbatch --job-name="$base" process.sh "$input"
else
echo "Skipping $base (output exists)"
fi
done
Tracking Submitted Jobs
Keep a log of submitted jobs:
#!/bin/bash
# submit_with_log.sh
LOGFILE="submission_log_$(date +%Y%m%d_%H%M%S).txt"
echo "Submission started: $(date)" > "$LOGFILE"
for file in data/*.csv; do
JOBID=$(sbatch --parsable process.sh "$file")
echo "$JOBID,$file,$(date +%Y-%m-%d_%H:%M:%S)" >> "$LOGFILE"
done
echo "Submission complete: $(date)" >> "$LOGFILE"
echo "Jobs logged to $LOGFILE"
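Such a log can later drive status checks. A possible companion script (hypothetical, assuming the jobid,file,timestamp row format written above):

```shell
#!/bin/bash
# check_logged_jobs.sh - report the state of every job recorded in a submission log
LOGFILE=$1
# The first field of each data row is the job ID; the header/footer lines are filtered out
cut -d, -f1 "$LOGFILE" | grep -E '^[0-9]+$' | while read -r jobid; do
    sacct -j "$jobid" --format=JobID,JobName,State,Elapsed --noheader | head -n 1
done
```

Run it as ./check_logged_jobs.sh submission_log_20240101_120000.txt once jobs have started.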
Rate Limiting
Avoid overwhelming the scheduler with too many submissions at once:
#!/bin/bash
# submit_with_delay.sh
for file in data/*.csv; do
sbatch process.sh "$file"
sleep 0.5 # Half-second delay between submissions
done
For very large submissions (1000+ jobs), consider using array jobs with a concurrency limit (--array=1-1000%50) instead.
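As a sketch, the per-file loop above could become a single throttled array submission (selecting the input file by line number with sed is one common idiom; adjust to your directory layout):

```shell
#!/bin/bash
#SBATCH --array=1-1000%50          # 1000 tasks, at most 50 running at once
#SBATCH --output=logs/chunk_%A_%a.out
# Each array task picks its own input file by line number
INPUT=$(ls data/*.csv | sed -n "${SLURM_ARRAY_TASK_ID}p")
./process_job.sh "$INPUT"
```

This is one submission call for the scheduler to track instead of a thousand.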
Best Practices
- Create directories first: Ensure log and output directories exist before submitting
mkdir -p logs results
./submit_all.sh
- Test with one job: Verify your script works before submitting hundreds of jobs
- Use --parsable: When capturing job IDs, use sbatch --parsable for clean output
- Quote variables: Always quote file paths and parameters to handle spaces correctly
- Prefer array jobs: For simple parameter sweeps, array jobs are more efficient than scripted loops
- Check queue limits: Be aware of QOS limits on concurrent jobs
- Keep submission scripts: Save your submission scripts for reproducibility
Further Resources
- Array Jobs - Simpler approach for many similar jobs
- Submission FAQ
- Batch Script Template
Cluster Topology and Job Placement
How cluster topology affects parallel job performance and scheduling.
Why Topology Matters
The processes of a parallel job communicate with one another, and the physical location of those processes affects communication speed:
- Same core (SMT/hyperthreading): Shared L1/L2 cache - fastest
- Same socket: Shared L3 cache - very fast
- Same node, different sockets: Memory bus (NUMA) - fast
- Same switch: Single network hop - moderate
- Different switches: Multiple network hops - slower
For communication-intensive parallel jobs, keeping processes close together can significantly improve performance.
Node Topology
Each compute node has internal topology:
Sockets and Cores
Most nodes have 2 CPU sockets, each with multiple cores. Cores on the same socket share L3 cache and have faster memory access to local NUMA memory.
Node
├── Socket 0 (NUMA node 0)
│ ├── Core 0, Core 1, ... Core N
│ └── Local Memory
└── Socket 1 (NUMA node 1)
├── Core 0, Core 1, ... Core N
└── Local Memory
View node topology
On a compute node, use lscpu or numactl:
# View CPU topology
lscpu | grep -E "Socket|Core|Thread|NUMA"
# View NUMA topology
numactl --hardware
NUMA Effects
Non-Uniform Memory Access (NUMA) means memory access speed depends on which socket owns the memory. Accessing remote memory (memory attached to the other socket) is slower.
For memory-intensive applications, keeping processes and their memory on the same NUMA node improves performance.
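Outside Slurm's own binding options, numactl can pin a process and its memory explicitly; comparing local against remote placement is a quick way to estimate the NUMA penalty for your code (the binary name is a placeholder):

```shell
# Pin CPU and memory to NUMA node 0 (local access)
numactl --cpunodebind=0 --membind=0 ./my_program
# Same CPUs, but memory forced onto node 1 (remote access) - compare runtimes
numactl --cpunodebind=0 --membind=1 ./my_program
```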
Network Topology
Nodes are connected through a hierarchical network:
Core Switch
├── Leaf Switch 1
│   ├── Node 001
│   ├── Node 002
│   └── ...
├── Leaf Switch 2
│   ├── Node 033
│   ├── Node 034
│   └── ...
└── ...
Communication between nodes on the same leaf switch is faster than communication across switches.
Slurm Topology-Aware Scheduling
Slurm can consider topology when placing jobs. The scheduler attempts to place multi-node jobs on nodes that are close together in the network.
Request contiguous nodes
Use --contiguous to request nodes that are next to each other:
#SBATCH --nodes=4
#SBATCH --contiguous
Note: This may increase queue wait time if contiguous nodes aren't available.
Request nodes on same switch
Use --switches to limit the number of network switches:
#SBATCH --nodes=8
#SBATCH --switches=1          # All nodes on same switch
You can add a timeout to fall back if the request can't be satisfied:
#SBATCH --switches=1@00:30:00 # Wait up to 30 min for single switch
If Slurm can't place all nodes on one switch within 30 minutes, it will relax the constraint.
Task and Core Binding
Control how tasks are distributed across sockets and cores:
Tasks per node and socket
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-socket=2   # Spread across both sockets
CPU binding with srun
Use srun options to control binding:
# Bind each task to a specific core
srun --cpu-bind=cores ./my_mpi_program
# Bind each task to a socket
srun --cpu-bind=sockets ./my_mpi_program
# Bind to NUMA nodes
srun --cpu-bind=ldoms ./my_mpi_program
View current binding
Check how tasks are bound:
srun --cpu-bind=verbose ./my_program
Memory Binding
Control memory allocation policy for NUMA systems:
# Allocate memory on local NUMA node only
srun --mem-bind=local ./my_program
# Prefer local, but allow remote if needed
srun --mem-bind=prefer ./my_program
Distribution Patterns
The --distribution option controls how tasks are distributed:
| Pattern | Description | Best For |
|---|---|---|
| block | Fill each node before moving to next | Jobs with heavy intra-node communication |
| cyclic | Round-robin across nodes | Load balancing, memory distribution |
| block:block | Block across nodes, block across sockets | Default, good general choice |
| block:cyclic | Block across nodes, round-robin across sockets | Spread memory load within nodes |
| cyclic:cyclic | Round-robin across nodes and sockets | Maximum distribution |
Example: Block distribution
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --distribution=block
srun ./my_mpi_program
Result: Tasks 0-3 on node 1, tasks 4-7 on node 2.
Example: Cyclic distribution
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --distribution=cyclic
srun ./my_mpi_program
Result: Tasks 0,2,4,6 on node 1; tasks 1,3,5,7 on node 2.
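To confirm how Slurm actually placed your tasks, have each task print its rank and host before launching the real program; a small sketch:

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --distribution=cyclic
# Each task reports where it landed; sort by rank for readability
srun bash -c 'echo "task $SLURM_PROCID on $(hostname)"' | sort -n -k2
```

Comparing this output between block and cyclic runs makes the placement patterns above directly visible.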
GPU Topology
On multi-GPU nodes, GPUs may have different connectivity:
- NVLink: High-bandwidth direct GPU-to-GPU connection
- PCIe: Standard bus connection through CPU
For multi-GPU jobs, GPUs connected via NVLink communicate faster than those connected only via PCIe.
Check GPU topology
nvidia-smi topo -m
This shows the connection matrix between GPUs and CPUs.
GPU affinity
Slurm automatically sets CUDA_VISIBLE_DEVICES to your allocated GPUs. For best performance, bind CPU tasks to cores near their assigned GPUs:
#SBATCH --gres=gpu:h100:2
#SBATCH --cpus-per-task=8
# Let Slurm handle GPU-CPU affinity
srun --gpus-per-task=1 ./my_gpu_program
Practical Recommendations
Single-node jobs
- Use --nodes=1 to ensure all tasks are on the same node
- For OpenMP, set OMP_PROC_BIND=spread or OMP_PROC_BIND=close, depending on whether threads benefit more from memory bandwidth (spread) or shared cache (close)
- For memory-intensive work, consider NUMA-aware allocation
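A single-node OpenMP sketch putting these together (the binary name is a placeholder):

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=close    # pack threads together; use spread to maximize bandwidth
export OMP_PLACES=cores
srun ./my_openmp_program
```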
Multi-node MPI jobs
- Use --ntasks-per-node to control task distribution
- Consider --switches=1 for latency-sensitive applications
- Test with --contiguous if communication patterns favor nearby nodes
- Don't over-constrain: stricter topology requests mean longer queue times
Hybrid MPI+OpenMP
- Match MPI ranks to sockets: --ntasks-per-socket=1
- Give each rank cores on its socket: --cpus-per-task=N
- Use --exclusive to avoid interference
- Set OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2   # 2 MPI ranks per node (1 per socket)
#SBATCH --cpus-per-task=16    # 16 threads per rank
#SBATCH --exclusive
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
srun ./my_hybrid_program
Multi-GPU jobs
- Request GPUs that are NVLink-connected when possible
- Match CPU cores to GPU locality
- For distributed training across nodes, network topology matters
Measuring Impact
To see if topology affects your application:
- Run a baseline job without topology constraints
- Run with --contiguous or --switches=1
- Compare runtime and wait time
If the constrained job is significantly faster but waits much longer, consider whether the trade-off is worthwhile.
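Once both jobs finish, sacct can report queue wait (Submit to Start) and runtime side by side; a sketch with placeholder job IDs:

```shell
# Compare wait time and runtime of the baseline and constrained runs
sacct -j <baseline_jobid>,<constrained_jobid> \
      --format=JobID,JobName,Submit,Start,Elapsed,State
```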