High Performance Computing | Examining available compute resources using LSF

Find out properties of queues, compute nodes, GPUs.

Why is my job still pending?

How does LSF determine job priority?

How many jobs are running in a particular queue?

Which resources have a GPU, and are there any GPUs available right now?

How many nodes are there that have ... [M2070 GPUs, nodes with AVX2 instructions, dual quad core nodes, etc.]

Using bjobs, I find that EXEC_HOST is bc2e4. What does that mean?

For scaling tests, I need to use the same piece of hardware. How do I specify this?

Why do I get drastically different run times for the same run script?

What kind of hardware does my advisor's partner queue have?

Which queues do I have access to?

What are the wall clock limits for the debug queue? The single_chassis queue?

How can I find out the maximum RAM I can ask for in each queue?

Why is my job still pending?

Use bjobs -l to find more information about a job; bjobs -lp -p3 gives even more detail. The output includes a list of the reasons the job is pending, and it may include an estimate of when the job will start, e.g., Job will start no sooner than indicated time stamp. The requested resources may simply be in use, but it is also possible that they do not exist on the system. For example, LSF does not report an error when a job requests a 64 core node with 500 GB of memory; the job waits for such a node to be installed and stays pending indefinitely.

How does LSF determine job priority?

Job priority is determined by several factors including fair share priority, queue priority, and time of submission.

See further details on job priority.

How many jobs are running in a particular queue?

List the jobs of all users and filter for those in that particular queue. The STAT column of the output distinguishes running jobs (RUN) from pending jobs (PEND). For example, to see the jobs in the gpu queue, use
bjobs -u all | grep gpu
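
Because grep matches the string gpu anywhere on a line (in a job name, for instance), counting by column can be more reliable. The following is a minimal sketch, assuming the default bjobs column order (JOBID USER STAT QUEUE ...); the function name count_queue_jobs is hypothetical and not part of LSF.

```shell
# Count running and pending jobs in one queue from default `bjobs` output
# read on stdin. Assumes column 3 is STAT and column 4 is QUEUE.
count_queue_jobs() {
    queue="$1"
    awk -v q="$queue" '
        $4 == q { n[$3]++ }          # tally job states for the chosen queue only
        END { printf "RUN=%d PEND=%d\n", n["RUN"], n["PEND"] }
    '
}

# Intended use on a login node:
#   bjobs -u all | count_queue_jobs gpu
```

Matching on the QUEUE column means a job named gpu_test in the debug queue is not counted as a gpu queue job.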

Which resources have a GPU, and are there any GPUs available right now?

You can find which hosts have a GPU by using
lshosts | grep gpu

bqueues shows the total number of jobs in a queue (NJOBS), how many are actually running (RUN), and how many are pending (PEND). MAX is the maximum number of cores available to the queue. For some queues, such as gpu, MAX is not shown.
bqueues -l gpu

How many nodes are there that have ... [M2070 GPUs, nodes with AVX2 instructions, dual quad core nodes, etc.]

The resources of each node are listed by lshosts. There is currently one P100 node:

[unityID@login04 ~]$ lshosts | grep p100
n3h39       LINUXRH E52650v4   1.0    24 262050M 32767M    Yes (gpu twc sse sse2 ssse3 sse4_1 sse4_2 avx avx2 p100)

Here are some commands to find other resources:

[unityID@login04 ~]$ lshosts | grep m2070
[unityID@login04 ~]$ lshosts | grep avx2
[unityID@login04 ~]$ lshosts | grep qc

See LSF Resources for more information on specific resources.
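
A plain grep matches substrings, so lshosts | grep avx also returns hosts tagged avx2. As a rough sketch, the function below matches a resource name exactly against the parenthesized RESOURCES list; it assumes the lshosts layout shown above, and the name hosts_with_resource is hypothetical.

```shell
# Print the names of hosts whose RESOURCES list contains an exact tag.
# Reads `lshosts` output on stdin; assumes the tags are the space-separated
# words inside the trailing parentheses of each line.
hosts_with_resource() {
    res="$1"
    awk -v r="$res" '
        {
            if (match($0, /\(.*\)/) == 0) next            # skip lines without a tag list
            list = substr($0, RSTART + 1, RLENGTH - 2)    # text between the parentheses
            n = split(list, tags, " ")
            for (i = 1; i <= n; i++)
                if (tags[i] == r) { print $1; break }     # exact match, not substring
        }
    '
}

# Intended use on a login node:
#   lshosts | hosts_with_resource avx2 | wc -l
```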

Using bjobs, I find that EXEC_HOST is bc2e4. What does that mean?

EXEC_HOST is the host group the job is running on. To find out more about the individual hosts in that group, use bmgroup:

[unityID@login04 ~]$ bmgroup bc2e4
GROUP_NAME    HOSTS                     
bc2e4        n2e4-1 n2e4-2 n2e4-3 n2e4-4 n2e4-5 n2e4-6 n2e4-7 n2e4-8 n2e4-9 n2e4-10 n2e4-11 n2e4-12 n2e4-13 n2e4-14

To find out more about a specific host, e.g., n2e4-3, use

[unityID@login04 ~]$ lshosts | grep n2e4-3
n2e4-3      LINUXRH    E5405   1.0     8 16383M 32767M    Yes (qc sse sse2 ssse3 sse4_1)

This shows that node n2e4-3 has processor model E5405, 8 cores (2 quad core processors), and 16 GB of memory. It does not support AVX instructions and does not have InfiniBand (ib).

For scaling tests, I need to use the same piece of hardware. How do I specify this?

If any node from the same group is acceptable, e.g., the same blade or the same rack on single_chassis, use the -m option with the group name.
#BSUB -m "bmgroup"
Example:

#BSUB -m "blade2a1"

If the exact same piece of hardware is needed, meaning the same actual node (host), use the -m option with the host name.
#BSUB -m "hostname"
Example:

#BSUB -m "n2e4-3"

Note that this may result in very long queue wait times. Also verify that the node belongs to the resource pool or queue the job is being submitted to.

See LSF Resources for more information on specific resources.

Why do I get drastically different run times for the same run script?

If the resource type is not specified, the queuing system will assign a job wherever it might fit. As a result, the job may be executed on different types of hardware: new or old, more or fewer cores, etc. For consistent run times, specify a particular resource. Run times may also differ markedly when a node is shared with other jobs. For scaling tests, use -x to avoid contention with other jobs.

See LSF Resources for more information on specific resources.
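
For a scaling test, the -m and -x options can be combined in one job script. The sketch below generates such a script for a given host and core count; the function name make_scaling_script, the 30 minute wall time, the output file names, and ./my_program are all placeholders chosen for illustration.

```shell
# Print an LSF job script that pins a run to one host with exclusive access.
# Arguments: host name, number of cores. ./my_program is a placeholder.
make_scaling_script() {
    host="$1"
    cores="$2"
    cat <<EOF
#!/bin/bash
#BSUB -n $cores
#BSUB -W 30
#BSUB -m "$host"
#BSUB -x
#BSUB -o scaling_${cores}.%J.out
#BSUB -e scaling_${cores}.%J.err
./my_program
EOF
}

# Intended use: one submission per core count, always on the same node.
#   for n in 1 2 4 8; do make_scaling_script n2e4-3 "$n" | bsub; done
```

Generating the script keeps every run identical except for the core count, so differences in run time reflect scaling rather than hardware.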

What kind of hardware does my advisor's partner queue have?

Suppose the partner queue is named monkey. Run bqueues -l monkey:

[unityID@login04 ~]$ bqueues -l monkey
QUEUE: monkey 
-- partner queue

PARAMETERS/STATISTICS
PRIO NICE STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN SSUSP USUSP  RSV PJOBS 
100   10  Open:Active     264    -    -    -     0     0     0     0     0    0     0

HOSTS:  monkey_ib+10 interconnect_ib+8 blade2h2+4

This shows that there are 264 cores available on the partner queue monkey. The queue has access to the monkey_ib group, the interconnect_ib group, and blade2h2. To find the hosts in a group, use bmgroup:

[unityID@login04 ~]$ bmgroup monkey_ib 
GROUP_NAME    HOSTS                     
monkey_ib  n2g3-2 n2g3-3 n2g3-4 n2g3-5 n2g3-6 n2g3-7 n2g3-8 n2g3-9 n2g3-10 n2g3-11 n2g3-1

To get more specific hardware information for one of those hosts, use lshosts:

[unityID@login04 ~]$ lshosts n2g3-2
HOST_NAME      type    model  cpuf ncpus maxmem maxswp server RESOURCES
n2g3-2      LINUXRH E52650v4   1.0    24 130237M 32767M    Yes (twc sse sse2 ssse3 sse4_1 sse4_2 avx avx2 ib)

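This shows that the monkey_ib group consists of eleven hosts and that n2g3-2 is a 24 core node that supports instructions up to AVX2 and has InfiniBand. Eleven such nodes give 11 x 24 = 264 cores, consistent with the MAX value above.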
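
The core count of a group can be checked by feeding its host list to lshosts and summing the ncpus column. This is only a sketch: it assumes the bmgroup and lshosts layouts shown above, and the helper names group_hosts and total_cores are hypothetical.

```shell
# Print one host name per line from `bmgroup` output read on stdin.
# Assumes a header line followed by lines of the form: GROUP host host ...
group_hosts() {
    awk 'NR > 1 { for (i = 2; i <= NF; i++) print $i }'
}

# Sum the ncpus column (field 5) of `lshosts` output read on stdin,
# skipping the header line.
total_cores() {
    awk '$1 != "HOST_NAME" { sum += $5 } END { print sum + 0 }'
}

# Intended use on a login node:
#   lshosts $(bmgroup monkey_ib | group_hosts) | total_cores
```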

Which queues do I have access to?

If the queue is not specified in the job script, LSF will attempt to choose the most appropriate queue. To find the queues that a user has access to, use bqueues -u followed by the login name (Unity ID).
bqueues -u unityID

What are the wall clock limits for the debug queue? The single_chassis queue?

Use bqueues -l:

[unityID@login04 ~]$ bqueues -l debug
MAXIMUM LIMITS:
RUNLIMIT                
10.0 min of servlsf

[unityID@login04 ~]$ bqueues -l single_chassis
MAXIMUM LIMITS:
RUNLIMIT                
5760.0 min of servlsf

At the date of this publication, the limit for the debug queue was 10 minutes, and the limit for the single_chassis queue was 4 days. Queue limits are subject to change without notice.
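
RUNLIMIT is reported in minutes, so 5760.0 min is 5760 / 1440 = 4 days. The conversion can be sketched as below, assuming the value appears on the line directly after the RUNLIMIT header as in the excerpts above; the function name runlimit_human is hypothetical.

```shell
# Convert the RUNLIMIT value in `bqueues -l` output (read on stdin)
# from minutes to days, hours, and minutes.
runlimit_human() {
    awk '
        prev ~ /^ *RUNLIMIT/ && $1 ~ /^[0-9.]+$/ {
            m = int($1)                                   # whole minutes
            printf "%dd %dh %dm\n", int(m / 1440), int((m % 1440) / 60), m % 60
        }
        { prev = $0 }                                     # remember the previous line
    '
}

# Intended use on a login node:
#   bqueues -l single_chassis | runlimit_human
```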

How can I find out the maximum RAM I can ask for in each queue?

LSF reports an error if a job requests more processors or wall time than a queue allows, but if a job requests too much memory, it is submitted and never runs.

Most default queues contain nodes of all memory sizes. Some non-default queues may be more limited, and some partner queues may contain larger memory nodes. See this documentation on memory resources for the most current information.

Instructions on finding specs are also listed in the FAQ about finding the types of hardware available in a queue.
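
As a rough guide, the largest maxmem value among a queue's hosts gives an upper bound on the memory a job can request there. The sketch below picks that host from lshosts output; it assumes maxmem is field 6 and is reported in megabytes with an M suffix, as in the excerpts above, and the function name max_mem_host is hypothetical. The memory actually available to a job may be lower than maxmem.

```shell
# Print the host with the largest maxmem from `lshosts` output read on stdin.
# Assumes field 6 is maxmem written as <number>M; other lines are ignored.
max_mem_host() {
    awk '
        $6 ~ /^[0-9.]+M$/ {
            mb = $6 + 0                                   # numeric prefix of e.g. 262050M
            if (mb > best) { best = mb; host = $1 }
        }
        END { if (host != "") printf "%s %dM\n", host, best }
    '
}

# Intended use on a login node, with host names taken from bmgroup:
#   lshosts n2g3-2 n2g3-3 | max_mem_host
```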