Learn best practices for creating an LSF batch script.

  • Which queue should I use?
  • How many cores should I ask for?
  • Just what is a core anyway???
  • How do I confirm the parallel processing behavior of my code?
  • Can I change my code's behavior based on the number of cores I am assigned?
  • When should I ask for exclusive use of a node?
  • How can I specify exclusive use of the node if the queue doesn't allow it?
  • How much time should I ask for?
  • How much memory does my code use?
  • I have some other question.
  • Which queue should I use?

    If the queue is not specified, LSF will attempt to choose the most appropriate queue based on core count and wall clock time. For jobs that do not require special resources, let LSF choose a default queue.

    For parallel jobs that do not use MPI, i.e., shared memory jobs, use the LSF specifier #BSUB -R span[hosts=1] to ensure that all cores requested are confined to one node.

    Sometimes LSF will choose an inappropriate queue given a very specific set of requirements. For jobs with resource requirements, investigate the available queues in the LSF Resources documentation.

    There are three common queues that are not default queues: gpu, shared_memory, and sif.

  • If you are requesting a GPU, specify the gpu queue.
  • If you are running a shared-memory application, particularly one requiring large memory (more than 256 GB), specify the shared_memory queue.
  • If you are running an application using a container, specify the sif queue.

    For users who have access to a partner queue, using the partner queue may shorten wait times; however, this may not be the case if the partner queue is heavily used by other group members. Also, not all partner queues have access to all types of hardware. For example, most partner queues currently do not contain GPUs.

    The queues available to a user can be displayed by using bqueues -u user_name, and the properties of a queue can be displayed by using bqueues -l queue_name. For help interpreting the output of bqueues, see this example.
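
    As a concrete sketch (user_name is your Unity ID, and the queue name "standard" is a placeholder; use a name from the first command's output):

```shell
# List the queues you may submit to (replace user_name with your Unity ID)
bqueues -u user_name

# Show the limits and policies of one queue; "standard" is a placeholder name
bqueues -l standard
```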


    How many cores should I ask for?

    Requesting more cores will not automatically make an application faster. The software must have been written in a way that allows the program to utilize more cores.

    Please look at the video on Parallel Jobs, which explains what cores are and how many should be requested.

    Serial jobs: If a program is serial, i.e., it does not know how to use multiple cores, then ask for 1 core only. Requesting more than 1 core to avoid queue limitations violates the Acceptable Use Policy (AUP).

    #BSUB -n 1
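
    Putting the pieces together, a minimal serial job script might look like the following sketch (the module name, executable, and limits are placeholders for illustration):

```shell
#!/bin/bash
#BSUB -n 1                 # serial code: exactly one core
#BSUB -W 30                # wall clock limit in minutes; base this on a test run
#BSUB -o serial_%J.out     # standard output file (%J expands to the job ID)
#BSUB -e serial_%J.err     # standard error file

module load mymodule       # placeholder module
./mycode                   # placeholder executable
```

    Submit the script with bsub < myscript.sh; with LSF, the script is redirected on standard input so the #BSUB lines are parsed.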
    

    Shared memory jobs: If a program is documented to be multithreaded and to use shared memory, it may be run with as many cores as exist on a single node. For more information on requesting a specific core count, see LSF specification for resource by processor type. The requested number of cores must be confined to a single node by using the span resource specifier:

    #BSUB -R span[hosts=1]
    
    or the ptile resource specifier:
    #BSUB -n <numcores>
    #BSUB -R span[ptile=<numcores>]
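
    For example, a shared-memory job script that requests 8 cores on one node and passes the assigned core count to the application might look like this sketch (it assumes the code honors OMP_NUM_THREADS; check your application's documentation for its own thread-count setting):

```shell
#!/bin/bash
#BSUB -n 8                  # request 8 cores...
#BSUB -R span[hosts=1]      # ...all on one node
#BSUB -W 60
#BSUB -o shm_%J.out

# LSB_DJOB_NUMPROC holds the number of cores LSF assigned to the job
export OMP_NUM_THREADS=$LSB_DJOB_NUMPROC

module load mymodule        # placeholder module
./mycode                    # placeholder executable
```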
    

    Distributed memory jobs: If a program is documented to be able to run in distributed memory or to be using MPI, it may be run with many cores and distributed over several nodes. The optimal number of cores and nodes is highly dependent on not only the software but the problem size. Consult the software documentation and conduct short experiments with a small sample data set, a subset of the original data, or the entire original data set for a limited number of time steps. Too few cores may result in a wall clock limit higher than what is allowed in the queues, while a very high core count request could result in more time spent waiting in a queue.

    Finally, perform a small test of your application with different numbers of cores, e.g. 2, 4, 8. If the code doesn't get faster, do not run it with more cores.
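
    Such a scaling test can be sketched as a loop over thread counts (this assumes the code honors OMP_NUM_THREADS; substitute your application's own thread flag if it uses one):

```shell
#!/bin/bash
# Time the same small test problem with 2, 4, and 8 threads.
for n in 2 4 8; do
    export OMP_NUM_THREADS=$n
    echo "=== $n threads ==="
    /usr/bin/time -p ./mycode    # placeholder executable; -p prints real/user/sys
done
```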


    Just what is a core anyway???

    Please look at the video on Parallel Jobs, which explains the hardware (nodes and cores) and software (MPI) concepts of parallel programming.


    How do I confirm the parallel processing behavior of my code?

    Running with the incorrect LSF specifications can result in violating the Acceptable Use Policy, and you may be asked to terminate your jobs. Read the documentation to determine the expected behavior of an application, then confirm the behavior with a short test.

    Please look at the video on Parallel Jobs, which explains 'shared memory' and gives a demo of testing code behavior.

      Interactive session test for shared memory

      When searching through the application's documentation, search for words such as cores, threads, parallel, multithreading. An application usually has a default value, which could be a fixed number like 1 or 8, or it could be all cores available on the node. In some cases it is set to the number of cores detected minus 1, usually for applications developed for a PC, in consideration of the OS. The default threading behavior of these programs can often be changed by adding a command line argument or a function call, e.g.:

      -t 
      --threads 
      CPUCOUNT=
      numThreads()
      

      Some programming tools have parallel functions, including MATLAB's parpool and parfor, and also some R libraries including snow, parallel, doParallel, and foreach. Check the functions used in such scripts before running.

      To confirm the threading behavior of an application, do a short interactive test using the following parameters:
      bsub -Is -n 8 -R "span[hosts=1]" -x -W 10 bash

      This will request a node with at least 8 cores. (Increase -n to reserve a node with a higher minimum core count. All nodes currently have at least 8 cores.) The -x option ensures exclusive use of the node. Interactive debugging sessions using the exclusive option should be kept very short to avoid creating long waits in the queue. Make sure to exit the session promptly after the testing is complete.

      Before running, confirm the session is on a compute node by doing echo $HOSTNAME. It should not have login in the name.

      Proceed using one of the following sets of directions: use a) if your application runs well in the background (using "&"), and use b) if it does not allow itself to be run in the background.

      a) If your application is able to run in the background using "&"

      When on the compute node, set your environment, run the code in the background, and then use htop:
      module load mymodule
      ./mycode &
      htop
      

      The command htop shows the cores active on the node. It also shows the amount of memory used. htop can be confusing as it is not static. To show a snapshot of processes and threads running, use top:

      top -n 1 -H
      

      Important: htop/top are to be used to confirm the code's behavior, not to determine it experimentally! The number of threads used may depend on the inputs, and multithreading may come in bursts that are not visible during the htop session. This could be from the code spawning threads as it enters a multithreaded function or subroutine. When in doubt about the threading behavior of an application, use the -x option during the testing.

      b) If your application is NOT able to run in the background using "&"

      If your application does not run well in the background using "&" (e.g., MATLAB), then use these alternate directions. Note: for these directions to work, you must have ssh'd to a Hazel login node using the "-X" flag, e.g., ssh -X user_name@login.hpc.ncsu.edu, where user_name is the Unity ID. Once logged in, and on the compute node (from issuing the "bsub -Is …" command above), run these commands:

      module load mymodule
      xterm &
      ./mycode
      

      Now, in the new interactive compute node xterm terminal, you can query information about the "./mycode" application that is running in the original interactive compute node terminal:

      # Issue these commands from the new xterm terminal to monitor resources your "./mycode" is using:
      
      htop
      top -n 1 -H
      top -n 1 -u $USER -H
      

      Timing test for distributed memory

      When searching through the application's documentation, search for words such as multiple nodes, distributed memory, MPI. Also, if a code is running in distributed memory, it usually requires a module containing MPI (PrgEnv-intel or openmpi-gcc) and the use of mpirun:

      module load openmpi-gcc
      mpirun mycode
      

      To test whether a code works properly over multiple nodes, i.e., works properly in distributed memory, do a short timing test using the following parameters:

      #BSUB -n 2
      #BSUB -R span[ptile=1]
      #BSUB -x
      #BSUB -W 10
      
      This will reserve 1 core on each of 2 different nodes in LSF. If the code runs properly, it will execute on both nodes. When the code does not work properly in distributed memory, one of two things typically happens: the code tries to run on two nodes but the communication fails, leaving all the work to a single task; or both tasks run on the first node the job lands on, resulting in more tasks on that node than requested. Here, the -x ensures the job doesn't interfere with someone else's job if it doesn't work as expected.
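
      Assembled into a complete script, the test might look like this sketch (the mpirun hostname step is an illustrative addition; two different node names in the output confirm the job really spanned two nodes):

```shell
#!/bin/bash
#BSUB -n 2
#BSUB -R span[ptile=1]    # 1 task per node -> 2 nodes
#BSUB -x                  # exclusive, so a misbehaving test cannot disturb others
#BSUB -W 10
#BSUB -o mpitest_%J.out

module load openmpi-gcc

# Two different node names here confirm the job really spans two nodes
mpirun hostname

mpirun ./mycode           # placeholder executable
```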

      Note: If the user attempts to install their own version of MPI, or uses MPI installed by a package manager like Conda, it is highly unlikely to work properly with LSF.

      To make sure the code runs properly, do a timing test. Run one test with the ptile=1 example above (which guarantees that 2 nodes are requested with 1 task scheduled per node) and another with ptile=2 (which guarantees that both tasks are scheduled on the same node). For the timing test to be meaningful, the nodes must have the same (or nearly the same) clock speed and memory, or one node may simply be faster than the other. To ensure that, pick a host group or specify a particular resource.

      Note that the above timing test with -n 2 is sufficient to show the code doesn't work properly if it fails, but an expected speed-up is not conclusive proof that the code is correct; it may simply mean the code executed fine with both tasks on the first node.

      HPC Staff have additional monitoring tools. If in doubt, contact HPC Staff to arrange a staff-monitored test. Staff generally do not have permission to run a user's application, so the preferred method of operation is for staff to schedule a time to monitor a job being run by the user.

      Already sure the MPI code works properly? Do a timing test anyway. Do not request more cores if the code does not get better performance with more cores. The resources requested should be justified by the efficiency and performance of the code.

      Check the LSF output from a sample run

      The LSF output file may have useful information regarding the parallel or threading behavior of an application. See the following example for more details.


      Can I change my code's behavior based on the number of cores I am assigned?

      If a code or script can be modified to take a command line argument or to read an environment variable, then the number of threads can be set to match what LSF assigned.

      • Use the LSF variable $LSB_DJOB_NUMPROC to get the number of cores assigned by LSF.
      • Use nproc --all to get the number of cores on the assigned node, as demonstrated in the documentation on running hybrid jobs.
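
      A minimal sketch of using these values in a script (the :-1 fallback is only for testing the snippet outside LSF; inside a batch job LSB_DJOB_NUMPROC is always set):

```shell
#!/bin/bash
# Number of cores LSF assigned to this job; the :-1 fallback is only for
# running this snippet outside LSF (inside a batch job the variable is set)
assigned=${LSB_DJOB_NUMPROC:-1}

# Total number of cores physically present on the node
node_cores=$(nproc --all)

export OMP_NUM_THREADS=$assigned
echo "using $OMP_NUM_THREADS of $node_cores cores on this node"
```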


      When should I ask for exclusive use of a node?

      Memory, multithreading, or scaling tests.

      • Memory: Some serial jobs require a large amount of memory. If the job is serial, the proper number of cores to request is -n 1; however, requesting one core on a 16 core node may result in 15 other jobs being assigned to the same node. That could lead to all jobs failing because of lack of memory and could possibly crash the node. In this case, -x must be used to test the maximum memory required for the job. After the test is performed, use -R "rusage[mem=??]" to reserve enough memory such that LSF should limit the allocation of additional jobs according to the memory available on the assigned node. See the documentation on specifying memory usage for further instructions.
      • Multithreading: Some programs detect the number of cores on a node and automatically spawn the corresponding number of threads. This means that even though -n 4 was used, if the program starts on a node with 16 cores, it will use all 16 cores despite four being requested. In this case, -x must be used, or the documentation must be examined to identify how to limit the number of threads spawned by the program.
      • Scaling tests: When doing scaling tests, -x must be used to ensure the node is not being shared by other users. Timing tests can vary extensively when the application is sharing resources with other jobs. Note also that for meaningful scaling tests, the processor model, memory, and interconnect should be of the same type.
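
      The memory workflow in the first bullet can be sketched as a follow-up job script, once the -x test has shown the peak usage (the 16GB figure is a placeholder, and whether your site expects MB or GB units in rusage is an assumption; check the site's memory documentation):

```shell
#!/bin/bash
#BSUB -n 1
#BSUB -R "rusage[mem=16GB]"   # reserve the peak memory seen in the -x test,
                              # plus a margin; units depend on site configuration
#BSUB -W 120
#BSUB -o bigmem_%J.out

module load mymodule          # placeholder module
./mycode                      # placeholder executable
```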

        How can I specify exclusive use of the node if the queue doesn't allow it?

        Exclusive use of a node can be ensured without using -x by being more precise in specifying the resources. The following reserves 32 cores and specifies a 32-core node, which is a node with two sixteen-core processors:
        #BSUB -n 32
        #BSUB -R span[hosts=1]
        #BSUB -R select[stc]
        

        How much time should I ask for?

        Ask for the necessary amount of time, plus a reasonable buffer.

        Specifying the maximum wall clock time for the chosen queue will result in longer queue waits, not only for the submitter but for other users as well. LSF must reserve the proper number of nodes for the fully specified time, so jobs that might have run in between other scheduled jobs are forced to wait.

        Use small test runs to estimate a proper wall clock time. See the LSF output in the next example; LSF output files show the run time.

        How much memory does my code use?

        Do a sample run and examine the LSF output file.

        Resource usage summary:
        
            CPU time :                         209542.33 sec.   (1)
            Max Memory :                       15725.28 MB     (2)
            Average Memory :                   10982.08 MB
            Total Requested Memory :           -
            Delta Memory :                     -
            Max Swap :                         17773 MB
            Max Processes :                    4
            Max Threads :                      38
            Run time :                         52134 sec.       (3)
            Turnaround time :                  52125 sec.
        
        1) CPU time usually should be approximately the elapsed wall clock time multiplied by the number of cores used; here, 209542 sec / 52134 sec ≈ 4.0, consistent with the Max Processes value of 4. (If the ratio does not reflect the requested core count, there may have been performance issues or other problems.)
        2) Max memory should be used to inform how much memory to request in the job script.
        3) Run time should be used to inform how much time to request in the job script.
  • Copyright © 2024 · Office of Information Technology · NC State University · Raleigh, NC 27695 · Accessibility · Privacy · University Policies