High Performance Computing | Reserving proper memory resources with LSF

Cluster Specs

Nodes on the cluster have different memory sizes. See this segment from the Parallel Jobs video on finding the specs on the cluster.

Most nodes on the cluster have one of the following memory sizes, listed in GB:

*Nodes with this higher amount of memory have limited availability outside of partner queues.
**Very few nodes on the cluster have these higher amounts of memory. They may not be available in every queue, and requesting them may result in a longer wait in the queue.
Contact staff with questions about the availability of high memory nodes.

When sizing a request to match the RAM of a particular node type, request somewhat less than the node's full memory, because operating system processes also use memory. For example, to target a node with 128GB of memory (or more), request 120GB.

Examine the memory requirements based on a sample job

Open the standard output file, e.g. stdout.[JOBID], and scroll to the bottom, which should contain a summary such as:

Resource usage summary:

    CPU time :                         209542.33 sec.     
    Max Memory :                       15725.28 MB   (*)
    Average Memory :                   10982.08 MB

(*) Use the Max Memory value to decide how much memory to request in future job scripts.
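The lookup above can be scripted. The following is a minimal sketch, assuming the summary reports Max Memory in MB as in the sample; the function names and the 20% safety margin are illustrative assumptions, not site policy.

```shell
# Print the first "Max Memory" value (assumed to be in MB) found in an
# LSF output file. Fields: $1=Max $2=Memory $3=: $4=value $5=unit
max_mem_mb() {
    awk '/Max Memory/ { print $4; exit }' "$1"
}

# Convert a value in MB to a whole-GB request, rounded up.
# The default 20% margin is an assumption for illustration only.
mem_request_gb() {
    awk -v mb="$1" -v margin="${2:-0.20}" 'BEGIN {
        gb = mb * (1 + margin) / 1024
        req = int(gb)
        if (req < gb) req++    # round up to the next whole GB
        print req
    }'
}

# Usage (stdout.12345 is a hypothetical file name):
#   mem_request_gb "$(max_mem_mb stdout.12345)"
```

For the sample summary above (15725.28 MB), this sketch suggests a request of 19 GB.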

Check the maximum memory used by many previous jobs

If the jobs' LSF output files follow the naming convention stdout.[JOBID] (for example, set with #BSUB -o stdout.%J, where %J expands to the job ID), then do

grep "Max Memory" stdout*
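When there are many output files, it may help to reduce the grep output to a single number. The sketch below assumes every file reports Max Memory in MB; the function name is hypothetical.

```shell
# Print the largest "Max Memory" value (MB) found across the given files.
# grep -h suppresses file-name prefixes so the value is always field 4.
largest_max_mem_mb() {
    grep -h "Max Memory" "$@" |
        awk '$4 + 0 > max { max = $4 + 0 } END { printf "%.2f\n", max }'
}

# Usage:
#   largest_max_mem_mb stdout.*
```

Sizing future requests from the largest observed value, rather than from a typical one, reduces the chance that an unusually large job exceeds its request.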

Checking the memory of a running job

To find information about all your running jobs, do:

bjobs -r -X -o "jobid queue cpu_used run_time avg_mem max_mem slots delimiter=','"

This will return a CSV-formatted list of your running jobs showing the job ID, queue, total CPU time, elapsed wall clock time, average memory used, maximum memory used, and number of cores reserved.

Check the max memory for your jobs and make sure it exceeds neither the amount requested nor a significant portion of the memory on the assigned node. You can check how much memory your assigned node has by using lshosts. For example, if your job is running on node n3m3-1, you can find the memory by doing:

lshosts | grep n3m3-1

To find the nodes assigned to your job, do:

bjobs -l [JOBID]

Also check if the maximum memory of the job steadily increases with time. If it does, look at the code or documentation and decide whether that is expected behavior. If not, it might be a problem with the application, such as a memory leak.

Estimating the expected memory requirements of a job

You may be able to estimate how much memory will be required for your job based on the following:

The memory requirements may be roughly equal to the total size of the input files.
The memory may be proportional to the size of the input data. For example, if the code reads the input data as a matrix, makes a copy of that matrix, and multiplies the two to store a third matrix, then the memory requirement would be about 3x the file size.
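A rough version of this file-size estimate can be computed in the shell. This is a sketch only: the function name is hypothetical, and the multiplier depends on the application (3 matches the three-matrix example above).

```shell
# Estimate memory (MB) as a multiple of an input file's size on disk.
# factor defaults to 3, matching the three-matrix example; adjust per application.
estimate_mem_from_file_mb() {
    file=$1
    factor=${2:-3}
    bytes=$(wc -c < "$file")
    awk -v b="$bytes" -v f="$factor" \
        'BEGIN { printf "%.1f\n", b * f / (1024 * 1024) }'
}

# Usage (input.dat is a hypothetical file name):
#   estimate_mem_from_file_mb input.dat 3
```

Note that a text input file can occupy a different amount of space once parsed into binary values in memory, so treat the result as a starting point and confirm it against a measured Max Memory.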

Increasing the number of grid points usually will require a proportionate increase in the amount of memory.
Estimation based on calculated variables:
Number of grid points * number of variables * size of data type(s)
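The grid-point formula can be evaluated directly. The sketch below uses a hypothetical function name and example values; it counts only the main arrays, so temporary work arrays and the program itself add to the total.

```shell
# Estimate memory in GB for a gridded calculation:
#   grid points * number of variables * bytes per value
estimate_grid_mem_gb() {
    points=$1
    nvars=$2
    bytes=$3
    awk -v p="$points" -v v="$nvars" -v b="$bytes" \
        'BEGIN { printf "%.2f\n", p * v * b / (1024 * 1024 * 1024) }'
}

# Example: a 512 x 512 x 512 grid with 5 double-precision (8-byte)
# variables per grid point.
estimate_grid_mem_gb $((512 * 512 * 512)) 5 8   # prints 5.00
```

Doubling the resolution in each of three dimensions multiplies the number of grid points, and therefore this estimate, by 8.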

Look at the memory requirements documented for the examples provided with the application.
Look at user forums for the application.
For bioinformatics software, post queries in the NC State Bioinformatics Users Group. Post your own results so that other users may benefit.

Recommendations for LSF scripts

Once the memory requirements have been estimated, specify them in an LSF script by using rusage. See the documentation on using rusage for more information.

Usage requests are per host, and the default unit is GB. To request that your job be assigned to a node, or a set of nodes, each having at least 64GB of RAM, do

#BSUB -R "rusage[mem=64]"

or, with the unit stated explicitly,

#BSUB -R "rusage[mem=64GB]"
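For context, a complete batch script using such a request might look like the following sketch. The core count, time limit, memory value, and application name are placeholders chosen for illustration; only the rusage line follows directly from the text above.

```shell
#!/bin/bash
# Number of cores (placeholder value)
#BSUB -n 4
# Wall clock limit in minutes (placeholder value)
#BSUB -W 120
# Memory per host, sized from a measured Max Memory plus some margin
#BSUB -R "rusage[mem=16GB]"
# Output files; %J expands to the job ID
#BSUB -o stdout.%J
#BSUB -e stderr.%J

echo "Job ${LSB_JOBID:-unknown} running on $(uname -n)"

# Replace the line below with the real application
# (my_app and input.dat are hypothetical names):
# ./my_app input.dat
```

Writing the output to stdout.%J keeps the file names consistent with the stdout.[JOBID] convention used when checking Max Memory afterwards.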

When to use -x

As explained in the video on the Acceptable Use Policy, your job must not interfere with other users, and it should make efficient use of resources.

If your job will take most of the memory resources of a node, use -x to request exclusive use of the node. It is also appropriate to use -x when running tests to measure the amount of memory needed.

For production runs (e.g., submitting many simultaneous jobs), it is inappropriate to request -x when your job does not need exclusive use of a node.

On the other hand, for production runs, you may want to ensure that other users are not placed on your node. Using -x may be appropriate in this case, but you should contact staff to assist in creating LSF batch scripts that ensure you are using -x on the subset of resources appropriate to your job. Staff may also be able to assist in bundling jobs such that only your jobs occupy the nodes you are assigned to.