Be sure to use resources efficiently with hybrid OpenMP-MPI jobs.
Hazel is a heterogeneous cluster consisting of many different node types. When specifying ptile and OMP_NUM_THREADS, keep in mind that nodes have from 8 to 32 cores. The example script specifies an 8-core node, since it can only use 8 cores per node.
See the following video segment on finding specs for nodes on the cluster.
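Node specifications can also be checked directly from a login node. The following is a minimal sketch, assuming the standard LSF host-query commands are available on Hazel (the exact columns shown may vary):

lshosts   # lists each host with its type, core count (ncpus), and memory
bhosts    # lists each host with its job-slot limits and current usage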
If the code works optimally when using the maximum number of cores on the node in shared memory (OpenMP), then use ptile=1 and set OMP_NUM_THREADS to the total number of cores on the node.
In the following script, 1 MPI task per node is specified. The command nproc --all gives the total number of cores on the compute node, and OMP_NUM_THREADS is set to that value. This ensures all the cores are used, regardless of node type.
#!/bin/bash
#BSUB -n 6                  # Number of MPI tasks
#BSUB -R span[ptile=1]      # MPI tasks per node
#BSUB -x                    # Exclusive use of nodes
#BSUB -J chemtest1          # Name of job
#BSUB -W 2:30               # Wall clock time
#BSUB -o chemtest1.out.%J   # Standard out
#BSUB -e chemtest1.err.%J   # Standard error

module load openmpi-gcc     # Set environment
export OMP_NUM_THREADS=`nproc --all`
mpirun ./chemtest1.exe
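To run this example, save the script and submit it to LSF; chemtest1.sh below is simply an assumed file name for the script above.

bsub < chemtest1.sh   # submit the job script
bjobs                 # check whether the job is PEND or RUN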
If the code works optimally with more MPI tasks and fewer threads per MPI task, then use environment variables to set OMP_NUM_THREADS to a fraction of the cores on the node.
In the following script, 2 MPI tasks per node are specified. Each of those two MPI tasks can use only half of the cores on the compute node. The command nproc --all gives the total number of cores on the compute node, and the line halfCores=$((`nproc --all`/2)) assigns half that number to the variable halfCores, which is then used as the value of OMP_NUM_THREADS. This ensures all the cores are used, regardless of node type.
#!/bin/bash
#BSUB -n 6                  # Number of MPI tasks
#BSUB -R span[ptile=2]      # MPI tasks per node
#BSUB -x                    # Exclusive use of nodes
#BSUB -J chemtest1          # Name of job
#BSUB -W 2:30               # Wall clock time
#BSUB -o chemtest1.out.%J   # Standard out
#BSUB -e chemtest1.err.%J   # Standard error

module load openmpi-gcc     # Set environment
halfCores=$((`nproc --all`/2))
export OMP_NUM_THREADS=$halfCores
mpirun ./chemtest1.exe
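The same arithmetic generalizes to other splits between MPI tasks and threads. A minimal sketch, where ranksPerNode is a name introduced here for illustration and must be kept consistent with the ptile value in the #BSUB -R line:

ranksPerNode=4                                              # must match ptile in the #BSUB -R span[ptile=...] line
export OMP_NUM_THREADS=$(( $(nproc --all) / ranksPerNode ))
mpirun ./chemtest1.exe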
If the code scales to a large number of cores (many nodes), this script allows the user to drop the restriction of exclusive use of the nodes (a job that requires many exclusive nodes may stay in the PEND state for a long time). OMP_NUM_THREADS is controlled by the ptile scheduler option.
In the following script, a total of 80 threads are run: the scheduler reserves 8 cores on each of 10 nodes, one MPI task runs on each node, and each MPI task runs 8 OpenMP threads. Normally #BSUB -n 80 would mean a total of 80 MPI tasks, but here the -n value passed to mpirun is overridden (see mpirun -n $numNodes below). The number of threads run on each node (OMP_NUM_THREADS) is set from the ptile value, and this control is what allows the exclusivity restriction to be dropped.
#!/bin/bash
#BSUB -n 80                 # Normally this is the # of MPI tasks, but here -n/ptile is the # of MPI tasks - note the special mpirun arguments below
#BSUB -R span[ptile=8]      # Threads per node, the -n value above should be divisible by the ptile value
#BSUB -J chemtest1          # Name of job
#BSUB -W 2:30               # Wall clock time
#BSUB -o chemtest1.out.%J   # Standard out
#BSUB -e chemtest1.err.%J   # Standard error

module load PrgEnv-intel/2020.2.254   # Set environment
export Num=$((`echo "$LSB_SUB_RES_REQ" | awk -F 'ptile=' '{print $2}' | awk -F ']' '{print $1}'`))
echo "$LSB_SUB_RES_REQ"
echo "$Num"
export OMP_NUM_THREADS=$Num           # Set number of threads per node to ptile
echo "$OMP_NUM_THREADS"
numNodes=$(($LSB_DJOB_NUMPROC/$OMP_NUM_THREADS))
echo $numNodes
cat $LSB_DJOB_HOSTFILE | uniq | sed 's/$/:1/' > mf   # Create a machinefile called mf
export cmdstring="mpirun -n $numNodes -machinefile mf ./chemtest1.exe"
$cmdstring
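The two awk commands simply extract the ptile value from the resource-request string that LSF places in LSB_SUB_RES_REQ. A minimal sketch of the same parsing, using a hard-coded example string in place of the real environment variable (the actual LSB_SUB_RES_REQ for a job may contain additional resource terms):

resreq='span[ptile=8]'   # hypothetical value of $LSB_SUB_RES_REQ
ptile=$(echo "$resreq" | awk -F 'ptile=' '{print $2}' | awk -F ']' '{print $1}')
echo "$ptile"            # prints 8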
If the code scales to a large number of cores (many nodes), this next script likewise allows the user to drop the restriction of exclusive use of the nodes. Here OMP_NUM_THREADS is controlled by the ptile scheduler option together with how many MPI ranks the user wants per node.
In the following script, the same 80-core reservation is used (8 cores on each of 10 nodes), but the number of MPI ranks per node is set explicitly by the variable num_mpi_ranks_per_node. With num_mpi_ranks_per_node=1, one MPI rank runs on each node with 8 OpenMP threads, exactly as in the previous example; setting it to 2 or 4 would instead run that many ranks per node, with OMP_NUM_THREADS reduced accordingly so that all reserved cores are still used.
#!/bin/bash
#BSUB -n 80                 # This should be $total_num_mpi_ranks * $OMP_NUM_THREADS
#BSUB -R span[ptile=8]      # Number of cores reserved per node in LSF, the -n value above should be divisible by the ptile value
#BSUB -J test               # Name of job
#BSUB -W 30                 # Wall clock time
#BSUB -o out.%J             # Standard out
#BSUB -e err.%J             # Standard error

source ~/.bashrc
module load openmpi-gcc/openmpi5.0.5-gcc11.4.1   # Set environment

# How many MPI ranks do you want per node? The ptile value must be divisible by this number.
# Adjusting this value adjusts the number of OpenMP threads per MPI rank (OMP_NUM_THREADS, see below).
# There may be cases where you want just 1 MPI rank per node, and other cases where you want multiple MPI ranks per node.
export num_mpi_ranks_per_node=1
echo Number of mpi ranks per node is: $num_mpi_ranks_per_node

export num_cores_per_node_fr_ptile=$((`echo "$LSB_SUB_RES_REQ" | awk -F 'ptile=' '{print $2}' | awk -F ']' '{print $1}'`))
echo Number of cores reserved per node in LSF, as specified by ptile: $num_cores_per_node_fr_ptile

# Set number of OpenMP threads per MPI rank to the ptile value divided by the desired number of MPI ranks per node
export OMP_NUM_THREADS=$(($num_cores_per_node_fr_ptile/$num_mpi_ranks_per_node))
echo OMP_NUM_THREADS: "$OMP_NUM_THREADS"

num_nodes=$(($LSB_DJOB_NUMPROC/$num_cores_per_node_fr_ptile))   # Number of nodes in LSF reservation
echo Number of nodes in this LSF reservation: $num_nodes

export total_num_mpi_ranks=$(($num_nodes*$num_mpi_ranks_per_node))
echo Total number of MPI ranks: $total_num_mpi_ranks

# Make a machine file specifying how to distribute MPI ranks amongst nodes
cat $LSB_DJOB_HOSTFILE | uniq | sed "s/$/ slots=$num_mpi_ranks_per_node/" > mf

# Substitute the application you are running for ./hello_mpi
export cmdstring="mpirun -np $total_num_mpi_ranks -machinefile mf ./hello_mpi"
$cmdstring
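For reference, the machinefile mf is just the unique host list from $LSB_DJOB_HOSTFILE with a slots count appended. A sketch of the same pipeline using hypothetical hostnames and num_mpi_ranks_per_node=2:

printf 'n3c24\nn3c24\nn3c25\nn3c25\n' | uniq | sed 's/$/ slots=2/'
# n3c24 slots=2
# n3c25 slots=2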