Linux Cluster organization used for Hazel:
This is a greatly simplified diagram of the Linux cluster organization used for Hazel. The cluster contains two independent networks: one connects the nodes to storage, and the second carries message passing traffic between job tasks.
The login node is accessible from the broader network (the campus network and the Internet) and provides a method to submit jobs to the compute nodes. No jobs should be run on the login node. It is a shared resource intended to be used only to submit and monitor jobs.
Compute nodes are where jobs are executed (that is, where computations are performed). Compute nodes should not be accessed directly: jobs are submitted to the scheduler from the login node, and only the scheduler should access the compute nodes and run jobs there.
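As a hedged illustration of this workflow, the sketch below submits a small job script from the login node. The scheduler is not named in this section; the sketch assumes an LSF-style scheduler, and the bsub command, its directives, and the program name are assumptions rather than details taken from this page.

```python
import subprocess

# Job script handed to the scheduler; the #BSUB directives and the program
# name are illustrative assumptions, not taken from this page.
job_script = """#!/bin/bash
#BSUB -n 4            # request 4 cores
#BSUB -W 30           # 30-minute wall-clock limit
#BSUB -o job.%J.out   # stdout file (%J expands to the job ID)
./my_program
"""

# LSF-style schedulers read the job script from standard input.
result = subprocess.run(["bsub"], input=job_script, text=True,
                        capture_output=True, check=True)
print(result.stdout)  # e.g. "Job <12345> is submitted to queue <...>"
```

The same script could equally be submitted by hand; the point is that the scheduler, not the user, decides which compute node runs the work.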
Note that compute nodes have two network connections: the HPC private network and the HPC message passing network. Compute nodes do not have a connection to any external network; that is, they are not connected to the campus network or the Internet. Access to any location outside the cluster has to be specifically configured via a proxy server that routes traffic from the compute nodes to that location.
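As a minimal sketch of what that looks like from inside a job, the snippet below points the standard proxy environment variables at a proxy before making an outbound request. The proxy hostname, port, and URL are placeholders, not the actual values configured on Hazel.

```python
import os
import urllib.request

# Placeholder proxy host and port; on a real cluster these values are
# configured by the HPC staff, not chosen by the user.
os.environ["http_proxy"] = "http://proxy.example.com:3128"
os.environ["https_proxy"] = "http://proxy.example.com:3128"

# urllib honors the proxy environment variables by default, so this request
# from a compute node is routed through the proxy to the outside world.
with urllib.request.urlopen("https://example.com/") as response:
    print(response.status)
```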
The HPC private network connects all the cluster nodes together. It is used primarily for job control and access to storage. In Hazel this is an Ethernet network with a 100Gbps Ethernet core and 25Gbps connections to the nodes. As of May 2025 there are still some older nodes (FlexChassis hardware) that have 10Gbps private network connections.
The HPC message passing network is dedicated to communication between job tasks, such as distributed-memory parallel jobs that use Message Passing Interface (MPI) communication. In Hazel this is an InfiniBand network with multiple 200Gbps (High Data Rate, HDR) links forming the core and 100Gbps (Enhanced Data Rate, EDR) links to the nodes.
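To make the role of this network concrete, here is a minimal distributed-memory sketch using the mpi4py binding (an assumption; any MPI implementation behaves the same way). Each task exchanges a message with its neighbors in a ring; between nodes, this traffic travels over the InfiniBand fabric.

```python
from mpi4py import MPI  # assumes an MPI library and mpi4py are available

comm = MPI.COMM_WORLD
rank = comm.Get_rank()  # this task's ID within the job
size = comm.Get_size()  # total number of tasks in the job

# Each task sends its rank to the next task in a ring and receives from the
# previous one; between nodes this traffic uses the message passing network.
dest = (rank + 1) % size
source = (rank - 1) % size
received = comm.sendrecv(rank, dest=dest, source=source)
print(f"task {rank} of {size} received {received} from task {source}")
```

Launched with, for example, mpirun -n 4 python ring.py, each of the four tasks prints the rank it received from its neighbor.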
The parallel file system (GPFS) is connected to the HPC private network via multiple 100Gbps links. There are two Lenovo DSS storage arrays, each with two NSD servers and two JBODs, that together form the file system where the scratch directories are located. This is the file system that should be used to hold data for running jobs. However, this storage is not backed up, and files that have not been accessed for 30 days are automatically deleted.
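Because of the 30-day purge, it can be useful to check which files are approaching deletion before assuming data is still in scratch. The sketch below (the scratch path is a placeholder) walks a directory tree and flags files whose last access time is older than 30 days.

```python
import os
import time

SCRATCH = "/path/to/scratch"  # placeholder; substitute your scratch directory
CUTOFF = time.time() - 30 * 24 * 3600  # 30 days ago, in seconds

# Walk the scratch tree and flag files whose last access time (atime) is
# older than the cutoff, i.e. candidates for the automatic 30-day purge.
for dirpath, _dirnames, filenames in os.walk(SCRATCH):
    for name in filenames:
        path = os.path.join(dirpath, name)
        if os.stat(path).st_atime < CUTOFF:
            print(f"at risk of purge: {path}")
```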
The storage for the home and application directories is not shown on this diagram. Currently these directories are located on the same physical storage (a NetApp FAS 8700) as Research Storage. This storage is also connected to Hazel by multiple 100Gbps Ethernet links. However, this network-attached storage is mounted using the Network File System (NFS) protocol, which has performance limitations and is not compatible with parallel I/O operations. These directories are backed up and also have periodic snapshots that enable self-service recovery of accidentally deleted files.
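As a hedged illustration of the parallel I/O that GPFS supports and the NFS-mounted directories do not, the sketch below uses MPI-IO (again via mpi4py, an assumption) to have every task write its own slice of a single shared file in scratch; the output path is a placeholder.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each task owns one contiguous slice of the shared output file.
data = np.full(1024, rank, dtype=np.int32)
offset = rank * data.nbytes

# Collective parallel write to a single file: efficient on GPFS (scratch),
# but not something the NFS-mounted home directories are suited for.
fh = MPI.File.Open(comm, "/path/to/scratch/output.bin",  # placeholder path
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
fh.Write_at_all(offset, data)
fh.Close()
```

This is why job data belongs in scratch: all tasks can write the file concurrently, whereas the same pattern against a home directory would be serialized by NFS.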