High-Performance Computing (HPC) Clusters

What is High-Performance Computing (HPC)?

High-Performance Computing (HPC) is the practice of aggregating computing power to deliver much higher performance than a typical desktop computer or workstation can provide. This is achieved by deploying supercomputers: highly powerful machines designed to tackle complex computational problems.

Why HPC Clusters?

HPC clusters are essential in various scientific, engineering, and business fields. They enable researchers to run complex simulations, engineers to design products, and businesses to process vast amounts of data.

Understanding HPC Clusters

An HPC cluster is a group of linked computers, known as nodes, working together to provide significant computational power. These nodes work in unison, effectively acting as a single supercomputer. Each node within a cluster is usually a self-contained computer with its own memory and operating system.

Architecture of HPC Clusters

The architecture of an HPC cluster involves a complex network of interconnected nodes. Here's a breakdown of the core components:

  • Compute Nodes: These are the workhorses of an HPC cluster. Each compute node is a server equipped with processors (CPUs or GPUs), memory, and local storage. The number of compute nodes can range from a few to several thousand, depending on the scale of the cluster.

  • Head Node or Master Node: The head node manages the cluster by controlling the distribution of jobs and the allocation of resources. It typically hosts the job scheduler, which queues and schedules user jobs for execution on the compute nodes.

  • Login Nodes: These nodes are the user's point of access to the cluster. Users log into these nodes and submit jobs to the job scheduler. Login nodes may also be used for compiling code, managing files, and other preparatory tasks. However, they are not designed for running computationally intensive tasks.

  • Storage Nodes: These nodes provide shared storage that is accessible to all nodes in the cluster. This can be in the form of a Network File System (NFS), a parallel file system, or other types of distributed storage (BeeGFS, Lustre, GPFS, etc.).

  • Network: The nodes are interconnected via a high-speed network, typically Ethernet or InfiniBand. The network's topology can vary depending on the cluster's design and performance requirements.

  • Software Stack: The software stack in an HPC cluster includes the operating system (often a variant of Linux), a job scheduler (like SLURM or PBS), compilers, libraries, and application software.
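
As a concrete illustration of how the software stack is used in practice, here is a minimal Python sketch that writes a SLURM batch script and submits it with the real sbatch command from a login node. The job name, resource requests, and the trivial srun hostname command are illustrative assumptions, not recommended settings for any particular cluster.

    import subprocess

    # A minimal SLURM batch script. The job name and resource requests are
    # hypothetical placeholders, not values tuned for any specific cluster.
    batch_script = """#!/bin/bash
    #SBATCH --job-name=example_job
    #SBATCH --nodes=1
    #SBATCH --ntasks=4
    #SBATCH --time=01:00:00
    #SBATCH --mem=8G

    # Replace with the real application; srun launches it on the allocated resources.
    srun hostname
    """

    # Write the script to disk and hand it to the scheduler. sbatch queues the
    # job; the scheduler decides when and on which compute nodes it runs.
    with open("example_job.sh", "w") as f:
        f.write(batch_script)

    result = subprocess.run(["sbatch", "example_job.sh"],
                            capture_output=True, text=True, check=True)
    print(result.stdout.strip())  # e.g. "Submitted batch job 123456"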

Each node within a cluster typically has a similar hardware and operating system configuration, ensuring consistent performance and compatibility across the cluster. The nodes access a shared file system, providing a unified data access interface for the entire cluster.

How HPC Clusters are Used

HPC clusters are used to handle tasks that require significant computational power. These tasks could be part of a single large job, such as simulating weather patterns, or they could be many small, independent jobs that can be processed in parallel, such as rendering the individual frames of a film.
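
For the second pattern, schedulers such as SLURM offer job arrays: the same script is launched many times, and each copy receives its own index through the SLURM_ARRAY_TASK_ID environment variable. Below is a minimal, hypothetical per-task sketch in Python; the frames/ directory layout and the processing step are assumptions for illustration only.

    import os

    # SLURM sets SLURM_ARRAY_TASK_ID for every task of a job array submitted
    # with "sbatch --array=...". Default to 0 so the script also runs standalone.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))

    # Hypothetical mapping from the task index to one input frame.
    input_file = f"frames/frame_{task_id:04d}.dat"

    def process(path):
        # Placeholder for the real per-frame work (rendering, analysis, ...).
        print(f"Task {task_id}: processing {path}")

    process(input_file)

Submitted with, for example, sbatch --array=0-999 and a small wrapper script that calls this file, the scheduler spreads the independent tasks across whichever compute nodes are free.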

Role of HPC Clusters in Science and Engineering

HPC clusters play a pivotal role in scientific and engineering applications. For example:

  • Physics: Simulating quantum mechanics phenomena or cosmic events.
  • Chemistry: Modeling molecular interactions for new drug discovery.
  • Biology: Sequencing genomes or modeling protein folding.
  • Engineering: Simulating aerodynamics for new aircraft designs or analyzing structural integrity of buildings.

Overview of the HPC systems at DIPC

  • Atlas EDR: 70-node system with 3,424 cores and 19 TB of memory. Some nodes have NVIDIA Tesla P40 or NVIDIA RTX 3090 GPGPUs. Nodes are interconnected at 100 Gbps with a 5:1 blocking factor.

  • Atlas FDR: Cluster-type system with 209 nodes and 6,116 cores. Nodes are interconnected at 56 Gbps with a 5:1 blocking factor. Managed by SLURM.