Learn best practices for creating an LSF batch script.
If the queue is not specified, LSF will attempt to choose the most appropriate queue based on core count and wall clock time. For jobs that do not require special resources, let LSF choose a default queue.
For parallel jobs that do not use MPI, i.e., shared memory jobs, use the LSF specifier #BSUB -R span[hosts=1] to ensure that all cores requested are confined to one node.
Sometimes LSF will choose an inappropriate queue given a very specific set of requirements. For jobs with resource requirements, investigate the available queues in the LSF Resources documentation.
There are three common queues that are not default queues: gpu, standard_ib, and mixed_ib.
For users who have access to a partner queue, using the partner queue may shorten wait times; however, this may not be the case if the partner queue is heavily used by other group members. Also, not all partner queues have access to all types of hardware. For example, most partner queues currently do not contain GPUs.
The queues available to a user can be displayed by using bqueues -u user_name, and the properties of a queue can be displayed by using bqueues -l queue_name.
For help interpreting the output of bqueues, see this example.
Requesting more cores will not automatically make an application faster. The software must have been written in a way that allows the program to utilize more cores.
Please look at the video on Parallel Jobs, which explains what cores are and how many should be requested.
Serial jobs: If a program is serial, i.e., it does not know how to use multiple cores, then ask for 1 core only. Requesting more than 1 core to avoid queue limitations violates the Acceptable Use Policy (AUP).
#BSUB -n 1
Shared memory jobs: If a program is documented to be multithreaded and to use shared memory, it may be run with up to as many cores as exist on a given node. For more information on requesting a specific core count, see LSF specification for resource by processor type. The requested number of cores must be confined to a single node by using the span resource specifier:
#BSUB -R span[hosts=1]
or the ptile resource specifier:
#BSUB -n #numcores
#BSUB -R span[ptile=#numcores]
Distributed memory jobs: If a program is documented to run in distributed memory or to use MPI, it may be run with many cores distributed over several nodes. The optimal number of cores and nodes depends heavily not only on the software but also on the problem size. Consult the software documentation and conduct short experiments with a small sample data set, a subset of the original data, or the entire original data set for a limited number of time steps. Too few cores may result in a run time that exceeds the wall clock limit allowed in the queues, while a very high core count request could result in more time spent waiting in a queue.
Finally, perform a small test of your application with different numbers of cores, e.g. 2, 4, 8. If the code doesn't get faster, do not run it with more cores.
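The results of such a core-count test can be summarized as a speedup and a parallel efficiency. The following is a minimal sketch, not an official tool: the function name scaling_report is hypothetical, and the run times are assumed to be read from the LSF output files of the test runs.

```shell
# scaling_report BASE_CORES BASE_SECONDS CORES SECONDS
# Compare one test run (CORES, SECONDS) against a baseline run
# (BASE_CORES, BASE_SECONDS) and print speedup and parallel efficiency.
# Efficiency is the measured speedup divided by the ideal speedup
# (CORES / BASE_CORES), expressed as a percentage.
scaling_report() (
    awk -v bc="$1" -v bs="$2" -v c="$3" -v s="$4" 'BEGIN {
        speedup = bs / s
        ideal   = c / bc
        printf "cores=%d speedup=%.2f efficiency=%.0f%%\n", c, speedup, 100 * speedup / ideal
    }'
)
```

For example, scaling_report 2 1000 8 500 prints cores=8 speedup=2.00 efficiency=50%: going from 2 to 8 cores only halved the run time, which suggests the extra cores are not well used.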
Please look at the video on Parallel Jobs, which explains the definitions for hardware (nodes and cores) or software (MPI) for parallel programming.
Running with the incorrect LSF specifications can result in violating the Acceptable Use Policy, and you may be asked to terminate your jobs. Read the documentation to determine the expected behavior of an application, then confirm the behavior with a short test.
Please look at the video on Parallel Jobs, which explains 'shared memory' and gives a demo of testing code behavior.
When searching through the application's documentation, search for words such as cores, threads, parallel, multithreading. An application usually has a default value, which could be a fixed number like 1 or 8, or it could be all cores available on the node. In some cases it is set to all processes detected minus 1, usually for applications developed for a PC in consideration of the OS. The default threading behavior of these programs can often be changed by adding a command line argument or a function call, e.g.:
-t
--threads
CPUCOUNT=
numThreads()
Some programming tools have parallel functions, including MATLAB's parpool and parfor, and also some R libraries including snow, parallel, doParallel, and foreach. Check the functions used in such scripts before running.
To confirm the threading behavior of an application, do a short interactive test using the following parameters:
bsub -Is -n 8 -R "span[hosts=1]" -x -W 10 bash
This will request a node with at least 8 cores. (Increase -n to reserve a node with a higher minimum core count. All nodes currently have at least 8 cores.) The -x option ensures exclusive use of the node. Interactive debugging sessions using the exclusive option should be kept very short to avoid creating long waits in the queue. Make sure to exit the session promptly after the testing is complete.
Before running, confirm the session is on a compute node by running echo $HOSTNAME. The host name should not have login in it.
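The host name check can also be scripted. This is a minimal sketch under the assumption stated above, namely that login nodes have login in their host name; the function names and the sample host names in the usage note are hypothetical.

```shell
# is_login_host NAME: succeeds (exit status 0) if NAME contains "login".
is_login_host() {
    case "$1" in
        *login*) return 0 ;;
        *)       return 1 ;;
    esac
}

# warn_if_login [NAME]: report whether NAME looks like a login node.
# With no argument, use $HOSTNAME (set by bash) or the hostname command.
warn_if_login() {
    if is_login_host "${1:-${HOSTNAME:-$(hostname)}}"; then
        echo "On a login node: do not run the test here"
    else
        echo "On a compute node"
    fi
}
```

Calling warn_if_login with no argument checks the current session; calling it with a name, e.g. warn_if_login login02, checks that name instead.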
Proceed using one of the following sets of directions. Use directions in a) if your application is able to run in the background using “&”, and use directions in b) if your application is NOT able to run in the background using “&”.
a) If your application is able to run in the background using "&"
module load mymodule
./mycode &
htop
The command htop shows the cores active on the node. It also shows the amount of memory used. htop can be confusing as it is not static. To show a snapshot of processes and threads running, use top:
top -n 1 -H
Important: htop/top are to be used to confirm the code's behavior, not to determine it experimentally! The number of threads used may depend on the inputs, and multithreading may come in bursts that are not visible during the htop session. This could be from the code spawning threads as it enters a multithreaded function or subroutine. When in doubt about the threading behavior of an application, use the -x option during the testing.
b) If your application is NOT able to run in the background using “&”
If your application does not run well in the background using "&" (e.g., MATLAB, etc.), then use these alternate directions. Note, for these directions to work, you must have ssh'd to a Hazel login node using the “-X” flag, e.g., ssh -X user_name@login.hpc.ncsu.edu, where user_name is the Unity ID. Once logged in, and on the compute node (from issuing the "bsub -Is …" command above), run these commands:
module load mymodule
xterm &
./mycode
Now, in the new interactive compute node xterm terminal, you can query information about the "./mycode" application that is running in the original interactive compute node terminal:
# Issue these commands from the new xterm terminal to monitor resources your "./mycode" is using:
htop
top -n 1 -H
top -n 1 -u $USER -H
Important: htop/top are to be used to confirm the code's behavior, not to determine it experimentally! The number of threads used may depend on the inputs, and multithreading may come in bursts that are not visible during the htop session. This could be from the code spawning threads as it enters a multithreaded function or subroutine. When in doubt about the threading behavior of an application, use the -x option during the testing.
When searching through the application's documentation, search for words such as multiple nodes, distributed memory, MPI. Also, if a code is running in distributed memory, it usually requires a module containing MPI (PrgEnv-intel or openmpi-gcc) and the use of mpirun:
module load openmpi-gcc
mpirun mycode
To test whether a code works properly over multiple nodes, i.e., works properly in distributed memory, do a short timing test using the following parameters:
#BSUB -n 2
#BSUB -R span[ptile=1]
#BSUB -x
#BSUB -W 10
This will reserve 1 core on each of 2 different nodes in LSF. If the code runs properly, the code will execute on both nodes. When the code doesn't work properly in distributed memory, it may try to run on two nodes without working communication, leaving all the work to a single task, or both tasks of the program may run on the first node the job lands on, resulting in more tasks on that node than requested. Here, the -x ensures the job doesn't interfere with someone else's job if it doesn't work as expected.
Note: If the user attempts to install their own version of MPI, or uses MPI installed by a package manager like Conda, it is highly unlikely to work properly with LSF.
To make sure the code runs properly, do a timing test. Run one test with the ptile=1 example above (which guarantees that 2 nodes are requested with 1 task scheduled per node), and a second test with ptile=2 (which guarantees that both tasks are scheduled on the same node). For the timing test to be meaningful, the nodes must have the same (or almost the same) clock speed and memory, or else one node may simply be faster than another. To ensure that, pick a host group or specify a particular resource.
Note the above timing test with -n 2 will be sufficient to show the code doesn't work properly if it fails, but it is not conclusive that it is correct if the expected speed-up does occur; it may simply mean the code is executing fine with both tasks on the first node.
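The two variants of the timing test differ only in the ptile value, so they can be generated from one template. This is a minimal sketch: the function name make_timing_job and the file name in the usage note are hypothetical, mycode is the placeholder application used in this guide, and the host group or resource selection needed for comparable nodes is deliberately left out.

```shell
# make_timing_job PTILE: print an LSF batch script for the 2-task timing test.
# PTILE=1 schedules the two tasks on 2 different nodes;
# PTILE=2 schedules both tasks on the same node.
make_timing_job() {
    cat <<EOF
#!/bin/bash
#BSUB -n 2
#BSUB -R span[ptile=$1]
#BSUB -x
#BSUB -W 10
module load openmpi-gcc
mpirun mycode
EOF
}
```

Each variant can be written to a file, e.g. make_timing_job 1 > ptile1.bsub, and submitted with bsub < ptile1.bsub. The run times reported in the two LSF output files can then be compared.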
HPC Staff have additional monitoring tools. If in doubt, contact HPC Staff to arrange for a staff monitored test. Staff generally do not have permissions to a user's application, and that is the preferred method of operation. Staff can schedule a time to monitor a job being run by the user.
Already sure the MPI code works properly? Do a timing test anyway. Do not request more cores if the code does not get better performance with more cores. The resources requested should be justified by the efficiency and performance of the code.
The LSF output file may have useful information regarding the parallel or threading behavior of an application. See the following example for more details.
If a code or script can be modified to take a command line argument, then the number of threads can be set with environment variables.
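One way this might look in a batch script is sketched below. It relies on LSB_DJOB_NUMPROC, the environment variable LSF sets inside a job to the number of allocated cores, and assumes a single-node job (span[hosts=1]). The function name job_threads and the --threads option are hypothetical; the real option name depends on the application.

```shell
# job_threads: print the number of cores LSF allocated to the current job.
# LSB_DJOB_NUMPROC is set by LSF inside a job; outside of a job it is
# unset, so fall back to 1 thread.
job_threads() {
    echo "${LSB_DJOB_NUMPROC:-1}"
}

# Hypothetical uses inside a batch script (assumptions, not site policy):
#   ./mycode --threads "$(job_threads)"       # application takes an argument
#   export OMP_NUM_THREADS="$(job_threads)"   # application uses OpenMP
```

Because the thread count follows the #BSUB -n request, changing the core count in the script header is enough to keep the application from using more threads than were reserved.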
Memory, multithreading, or scaling tests.
#BSUB -n 32
#BSUB -R span[hosts=1]
#BSUB -R select[stc]
Ask for the necessary amount of time, plus a reasonable buffer.
Specifying the maximum wall clock time for the chosen queue will result in longer queue waits not only for the submitter but for other users as well. LSF must reserve the requested number of nodes for the full specified time, so jobs that could otherwise have run in the gaps between other scheduled jobs are forced to wait.
Use small test runs to estimate a proper wall clock time. See the LSF output in next example; LSF output files show the run time.
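Turning a measured run time into a wall clock request is simple arithmetic. The sketch below pads a run time given in whole seconds and prints it in the hours:minutes form accepted by #BSUB -W. The function name wall_limit and the 20 percent default buffer are assumptions for illustration, not site recommendations, and a short test run still has to be scaled up to the full problem by the user.

```shell
# wall_limit SECONDS [BUFFER_PERCENT]
# Add BUFFER_PERCENT (default 20) to a run time in whole seconds, round up
# to whole minutes, and print the result as hours:minutes for #BSUB -W.
wall_limit() (
    seconds=$1
    buffer_pct=${2:-20}
    padded=$(( (seconds * (100 + buffer_pct) + 99) / 100 ))   # padded seconds, rounded up
    minutes=$(( (padded + 59) / 60 ))                          # whole minutes, rounded up
    printf '%d:%02d\n' $(( minutes / 60 )) $(( minutes % 60 ))
)
```

For a measured run time of 52134 seconds, wall_limit 52134 prints 17:23, which would be requested as #BSUB -W 17:23.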
Do a sample run and examine the LSF output file.
Resource usage summary:
    CPU time : 209542.33 sec. (1)
    Max Memory : 15725.28 MB (2)
    Average Memory : 10982.08 MB
    Total Requested Memory : -
    Delta Memory : -
    Max Swap : 17773 MB
    Max Processes : 4
    Max Threads : 38
    Run time : 52134 sec. (3)
    Turnaround time : 52125 sec.
1) CPU time should usually be about the elapsed wall clock time multiplied by the number of cores. (If the number does not seem to reflect this, it is possible that there were performance issues or other problems.)