Note: The video version of this segment of the tutorial is here.

What is a parallel job?

A parallel job uses more than one core. The earlier example program in R simply created a PDF file. It is a serial program, meaning it can run on only one core. It cannot run on multiple cores, because nowhere in the program does it specify how to do so; a program can only do what the programmer tells it to do. In this case, we wrote weather.R ourselves, so we are sure it is serial code. (What is a core?)

If we didn't write the code ourselves, we have to read the documentation to determine whether the code can run in parallel. The details of parallel programming are beyond the scope of a Quick Start, but in general, if you download a popular application that was written recently, it is probably multithreaded and uses shared memory. Shared memory codes are parallel but can run on only one node. If an application is capable of multithreading, it may use multiple cores whether or not you explicitly told it to do so.
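Many multithreaded applications take their thread count from an environment variable. Which variable (if any) applies depends on the libraries the application was built with, so treat the following as illustrative, not definitive:

```
# Illustrative only -- whether these are honored depends on the application
setenv OMP_NUM_THREADS 4        # programs built with OpenMP
setenv MKL_NUM_THREADS 4        # programs linked against Intel MKL
setenv OPENBLAS_NUM_THREADS 4   # programs linked against OpenBLAS
```

If none of these have any effect, the application's own documentation is the place to look.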

A program that can run on more than one node has distributed memory parallelization. Look at the documentation to confirm your application can run on multiple nodes, but note that a program will not run with distributed memory unless it is launched with mpirun in front of the executable name.
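As a sketch, a distributed memory job script pairs an LSF core request with an mpirun launch line. Here my_mpi_app is a placeholder for your executable, and the exact mpirun invocation varies by system:

```
#BSUB -n 12
#BSUB -R span[ptile=4]
# mpirun in front of the executable is the tell-tale sign of a
# distributed memory (MPI) run; my_mpi_app is a placeholder name
mpirun ./my_mpi_app
```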

Click here for more information on finding out whether or not your code is parallel.

The Super-Quick-Start to running parallel jobs

Remember, the first rule of the AUP is "play nice with others". If you know your job is multithreaded or takes a lot of memory, make sure the job is confined to a single node that is not being shared with others. This can be done by specifying hosts=1 and using the exclusive option:

#BSUB -R span[hosts=1]
#BSUB -x
When using a queue that does not allow the exclusive option, be specific in your request so that you fill all the cores on the node. For example, this requests 8 cores (-n 8), that all 8 cores are placed on one node (ptile=8), and that the node should have 8 cores (a node with two quad-core (qc) processors).
#BSUB -n 8
#BSUB -R span[ptile=8]
#BSUB -R select[qc]

Remember that the AUP also specifies that a job should make efficient use of resources. While using the exclusive option will ensure that you are not affecting others, you still might be over- or undersubscribing a node. To learn more about how to examine this, please continue with this exercise and consult HPC staff if you have further questions.

Introduction to the parallel examples

These examples use simple Fortran codes parallelized with OpenMP and MPI. You do not have to know any of those languages. Submitting the LSF script will load the modules, compile the code, and run the code. We are using these codes because they demonstrate some important aspects of parallel jobs.

Both codes are simple "Hello World" examples. The shared memory code only uses threads, and it can only run on one node. Each thread will say "hello", and it will print out the name of the node it is running on. Recall from an earlier example how to print out the name of the node:

echo $HOSTNAME

The code simply does this echo command, and in the shared memory example, every thread will be on the same node. In the MPI version, the code launches tasks, and each task spawns threads. Each thread will print out which task spawned it and also which node it is running on. In this case, the "hello" should come from more than one node...unless something is wrong!

In addition to saying hello, each code does a simple calculation in a very long loop. This is just so that the program doesn't exit immediately. It runs for about 30 seconds.

Exercise 6.1: Get the examples

  • Assuming you are already in the /share/$GROUP/$USER/guide directory, copy the parallel examples directory:
    cp -r /usr/local/apps/samples/guide/parallel .
    cd parallel
    

    Running the shared memory example

    Look at the script submit_shared.csh. The script requests 5 minutes with all cores on a single node (span[hosts=1]) with the exclusive use of a node with at least 8 cores. The environment variable OMP_NUM_THREADS is what controls the threading behavior. Submit the job:

    bsub < submit_shared.csh
    
    The output will look something like this:
    Hello from thread 4 on host n2e6-6
    Hello from thread 1 on host n2e6-6
    Hello from thread 3 on host n2e6-6
    Hello from thread 2 on host n2e6-6
    
    Notice that even though we requested 8 cores from LSF, the code only used 4. LSF only reserves the cores. It is up to the programmer (or the user) to dictate how many cores are actually used.

    Exercise 6.2: Change the number of threads used

  • Change OMP_NUM_THREADS to 8, resubmit, and check the output.
  • If you change OMP_NUM_THREADS to something greater than 8, will LSF limit it to 8? If OMP_NUM_THREADS is set to something higher than what LSF reserves, how might this affect your application, and how would it affect other users if you did not specify -x?

    Exercise 6.3: Check the behavior of the code with an interactive login

  • Reserve an interactive compute node by doing bsub -Is -n 1 -x -W 10 tcsh.
  • Check which node you are on by doing echo $HOSTNAME. Check how many cores are on the node with lscpu.
  • Do the following to set OMP_NUM_THREADS and run the code (this assumes you already ran the earlier exercises so that there is a compiled code hello_shared in your directory):
    setenv OMP_NUM_THREADS 4
    ./hello_shared &
    htop
    
    The command htop should show that there are 4 threads running. htop is dynamic, and sometimes it isn't clear how many threads are running at once. For a single snapshot (-n 1) that includes not just the executable but also its threads (-H), use top:
    [Use Control-C to exit htop]
    top -n 1 -H 
    

    You should see something like:

    P   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                        
    4 18084 lllowe    30  10  222192    864    696 R 99.9  0.0   0:10.53 hello_shared                                   
    3 18085 lllowe    30  10  222192    864    696 R 99.9  0.0   0:10.51 hello_shared                                   
    5 18086 lllowe    30  10  222192    864    696 R 99.9  0.0   0:10.53 hello_shared                                   
    1 18087 lllowe    30  10  222192    864    696 R 99.9  0.0   0:10.53 hello_shared                                   
    7 18101 lllowe    30  10  162532   2760   1544 R 12.5  0.0   0:00.02 top                     
    

    Exit the interactive session with exit.

    Running the distributed memory example

    You should still be in the parallel directory. If not, go back to it:

    cd /share/$GROUP/$USER/guide/parallel
    

    Exercise 6.4: Explaining the LSF script

    Look at the script submit_mpi.csh. The following is a description of what is going on in the script:
  • The specifier -n 6 will be passed to mpirun, meaning there will be 6 MPI tasks.
  • The LSF specifier ptile=2 tells LSF to put only 2 tasks on each node. Therefore, LSF reserves a total of 3 nodes: 6 tasks, with 2 on each of the three nodes.
  • The code has hybrid parallelism with MPI-OpenMP, and it will spawn OpenMP threads for each MPI task. Tasks and threads describe how the application is programmed. Nodes and cores describe the hardware. Ideally, programs should have one core available for each thread.
  • The environment variable OMP_NUM_THREADS controls the threading behavior for this code, and it is set to 4. Each of the 6 tasks will spawn 4 threads, so the program will use a total of 24 cores. In this case, we must specify -x, because LSF has reserved only 2 cores per node while the job will use 8 on each.
  • Since there are 2 MPI tasks per node, and they each spawn 4 threads, the number of total threads running on a node will be 8.
  • LSF is not aware of threading behavior, either as set by OMP_NUM_THREADS in this program or as set some other way in another code you may have downloaded from a package repository.
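The bookkeeping above can be checked with a little arithmetic. This sketch (plain sh, purely for illustration) derives the node count and the per-node core usage from the parameters described in submit_mpi.csh:

```shell
# Values from submit_mpi.csh as described above
NTASKS=6            # bsub -n 6: total MPI tasks
PTILE=2             # span[ptile=2]: tasks placed per node
OMP_NUM_THREADS=4   # threads spawned by each task

NODES=$(( NTASKS / PTILE ))                      # nodes LSF reserves
THREADS_PER_NODE=$(( PTILE * OMP_NUM_THREADS ))  # cores each node actually uses

echo "nodes reserved: $NODES"              # nodes reserved: 3
echo "threads per node: $THREADS_PER_NODE" # threads per node: 8
```

If THREADS_PER_NODE exceeds the cores on a node, the job is oversubscribed; if it is well below, cores reserved exclusively sit idle.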
    Exercise 6.5: Submit the job

  • Submit the job using bsub < submit_mpi.csh.
  • Do bjobs -l to check how LSF reserved the nodes. There should be 3 nodes reserved with 2 tasks each:
    Tue Feb  4 11:10:35: Started on 6 Hosts/Processors <2*n3l4-12> <2*n3l4-11> <2*n
                         3l4-14>, Execution Home , Execution CWD ;
    
    The LSF output should contain something like:
    Hello from thread 4 from MPI Task 1 on host n3g1-7
    Hello from thread 1 from MPI Task 2 on host n3g1-7
    Hello from thread 2 from MPI Task 5 on host n3g1-11
    Hello from thread 1 from MPI Task 1 on host n3g1-7
    Hello from thread 1 from MPI Task 6 on host n3g1-11
    Hello from thread 3 from MPI Task 2 on host n3g1-7
    Hello from thread 4 from MPI Task 5 on host n3g1-11
    Hello from thread 2 from MPI Task 6 on host n3g1-11
    Hello from thread 1 from MPI Task 5 on host n3g1-11
    Hello from thread 4 from MPI Task 3 on host n3g1-12
    Hello from thread 4 from MPI Task 4 on host n3g1-12
    Hello from thread 4 from MPI Task 6 on host n3g1-11
    Hello from thread 3 from MPI Task 6 on host n3g1-11
    Hello from thread 3 from MPI Task 5 on host n3g1-11
    Hello from thread 3 from MPI Task 1 on host n3g1-7
    Hello from thread 3 from MPI Task 4 on host n3g1-12
    Hello from thread 2 from MPI Task 2 on host n3g1-7
    Hello from thread 4 from MPI Task 2 on host n3g1-7
    Hello from thread 2 from MPI Task 1 on host n3g1-7
    Hello from thread 1 from MPI Task 3 on host n3g1-12
    Hello from thread 2 from MPI Task 4 on host n3g1-12
    Hello from thread 2 from MPI Task 3 on host n3g1-12
    Hello from thread 1 from MPI Task 4 on host n3g1-12
    Hello from thread 3 from MPI Task 3 on host n3g1-12
    
  • Check that there are 6 tasks, each having 4 threads, and that 3 nodes are used (2 tasks/8 threads on each node).
  • Change the parameters and run again.
  • Edit the submit script to remove mpirun and resubmit the script. What happens?
  • Note: We are controlling the number of threads with OMP_NUM_THREADS. Your application probably does not use this variable (unless you wrote it with OpenMP or are using threaded libraries). To control the threading behavior of your code, you must consult the documentation.

    Here is some advice on examining/testing/controlling the threading behavior of an application.


    Go to Step 7


    Copyright © 2022 · Office of Information Technology · NC State University · Raleigh, NC 27695