High Performance Computing | XSEDE Boot Camp Exercises

Preliminary Exercise:
Normally you should never run anything on a login node, but you have permission to run these tiny examples there while doing these exercises.
The following commands copy the exercises, then compile and run the test programs:


cp /usr/local/apps/XSEDE_WORKSHOPS/BOOTCAMP_2019/Exercises.tar .

tar -xvf Exercises.tar

cd Exercises/Test

module load PrgEnv-pgi

pgf90 test.f90

./a.out

pgcc test.c

./a.out

Laplace OpenMP:
The example Laplace code asks for user input, so it is run interactively. To run interactively on a compute node, use bsub -Is. The following command requests 30 minutes (-W 30) and 8 cores (-n 8), with exclusive use of the node (-x) so as not to disrupt other users. Because OpenMP uses shared memory, all of the cores must be on a single node, which is enforced with span[hosts=1].
bsub -Is -W 30 -n 8 -x -R "span[hosts=1]" bash

Once on the node, check how many cores it has by running lshosts followed by the node name; the count is listed under ncpus. The node name appears in the prompt. For example, substituting your node's name for login01:
lshosts login01

To compile the code using PGI (the code is in Exercises/OpenMP/Solutions):

module load PrgEnv-pgi

pgf90 -mp laplace_omp.f90

To compile the code using Intel (run module purge first if you have already loaded PGI):


module load PrgEnv-intel

ifort -qopenmp laplace_omp.f90

To compile the code using GNU gfortran (run module purge first if you have already loaded Intel or PGI):
gfortran -fopenmp laplace_omp.f90

Do not run on the login node. Make sure you have started an interactive session with bsub -Is before running.
Before running, set OMP_NUM_THREADS. Vary the number as the training examples suggest, but note that Bridges has 28-core nodes, while your interactive node likely has between 8 and 16 cores (check this with lshosts).


export OMP_NUM_THREADS=4

./a.out
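
As a guard against requesting more threads than the node has cores, the thread count can be capped before it is exported. This is a minimal sketch, assuming that nproc from GNU coreutils is available on the compute node; cap_threads is a hypothetical helper name and is not part of the exercises.

```shell
# Hypothetical helper: print the smaller of the requested thread count
# and the number of cores reported by nproc (GNU coreutils, assumed present).
cap_threads() {
  requested="$1"
  cores=$(nproc)
  if [ "$requested" -gt "$cores" ]; then
    echo "$cores"
  else
    echo "$requested"
  fi
}

# Example: ask for 4 threads, but never more than the node provides.
OMP_NUM_THREADS=$(cap_threads 4)
export OMP_NUM_THREADS
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

The value reported by nproc may differ from the ncpus column of lshosts if the scheduler restricts the job to a subset of cores, so treat it as a cross-check rather than a replacement.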

With GNU or Intel, the time printed by the program will not show a speedup, because the timing subroutine behaves differently than it does with PGI: GNU and Intel report the sum of the time over all threads. To do timing tests that work for all versions, I suggest the following:

Comment out the read/write statement at the beginning of the program and hard-code 'max_iterations' to 4000.

Compile, set OMP_NUM_THREADS, run time ./a.out, and save the results.
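
The timing procedure above can be sketched as a loop over thread counts. This is only an illustration: run_scaling and speedup are hypothetical helper names, the log_omp_N file names are assumptions, and the wall time is taken from date +%s, so it is accurate only to one second (time ./a.out gives finer resolution). It assumes the read statement has already been commented out, so that a.out needs no input.

```shell
# Hypothetical helper: run a program once per thread count and print
# the elapsed wall-clock time in whole seconds.
run_scaling() {
  prog="$1"
  shift
  for n in "$@"; do
    start=$(date +%s)
    # Program output goes to an assumed per-run log file.
    OMP_NUM_THREADS="$n" "$prog" > "log_omp_$n" 2>&1
    end=$(date +%s)
    echo "threads=$n seconds=$((end - start))"
  done
}

# Hypothetical helper: speedup = serial time / parallel time.
speedup() {
  awk -v t1="$1" -v tn="$2" 'BEGIN { printf "%.2f\n", t1 / tn }'
}
```

For example, run_scaling ./a.out 1 2 4 8 prints one line per thread count, and speedup 10 2.5 prints 4.00.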

Laplace OpenACC:
By default, the PGI compiler generates code for the architecture of the machine it is run on, so a binary compiled on a login node may crash with an illegal instruction error when it runs on a compute node. To avoid having to specify a target architecture, compile on the compute node as part of the submitted job. For meaningful scaling tests between serial and parallel, you will have to specify the compute node type and GPU type in the submit script. PGI should be used for OpenACC code, so this example is not shown with the other compilers.

To compile and run, submit the following script (the code is in Exercises/OpenACC/Solutions).

#!/bin/bash
#BSUB -n 1
#BSUB -W 30
#BSUB -q gpu
#BSUB -R "rusage[ngpus_shared=1]"
#BSUB -o out.%J
#BSUB -e err.%J
module load PrgEnv-pgi
pgf90 -acc laplace_acc.f90
time echo "4000" | ./a.out >& log_acc

The script above runs the job on any available GPU. For scaling tests, request a specific node and GPU type as described in the web documentation about using LSF resources.

To do the scaling tests, run the same batch script with the last two lines changed as follows (the -acc compiler flag is removed and the output file is renamed):


pgf90 laplace_acc.f90

time echo "4000" | ./a.out >& log_serial

To select a particular node and GPU type, for example p100, change the rusage line to this:
#BSUB -R "rusage[ngpus_shared=1] select[p100]"

Laplace MPI:
The code is in Exercises/MPI/Solutions.
First, comment out the read/write statement and hard-code max_iterations=4000. Since every process then executes this assignment, MPI_BCAST is no longer needed, so you can comment that out too. If you want to test the MPI_BCAST routine instead, define max_iterations=4000 within the if statement for mype=0; the MPI_BCAST is then necessary. Try omitting the broadcast of this variable to see what happens.

To compile, see the instructions for parallel compilation on Henry2 using MPI. To run, make a batch script and use bsub.

Here is how to compile and run the sample code using the Intel compiler:


module load PrgEnv-intel

mpif90 laplace_mpi.f90

Make a submit script submit.sh:

#!/bin/bash
#BSUB -n 4
#BSUB -W 30
#BSUB -o out.%J
#BSUB -e err.%J
module load PrgEnv-intel
time mpirun -n 4 ./a.out

Then submit:
bsub < submit.sh
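
The process count appears twice in submit.sh (in #BSUB -n and in mpirun -n), and the two values should match. For runs at several process counts, one option is to generate the script from a single parameter. The following is a minimal sketch: make_submit and the submit_8.sh file name are hypothetical, and the directives are copied from the script above.

```shell
# Hypothetical helper: print a submit script for the given number of
# MPI processes, using the same directives as submit.sh above.
make_submit() {
  nprocs="$1"
  cat <<EOF
#!/bin/bash
#BSUB -n ${nprocs}
#BSUB -W 30
#BSUB -o out.%J
#BSUB -e err.%J
module load PrgEnv-intel
time mpirun -n ${nprocs} ./a.out
EOF
}
```

For example, make_submit 8 > submit_8.sh writes a script for 8 processes, which can then be submitted with bsub < submit_8.sh.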