Using High-Performance Computing Clusters to Support Fine-Grained Parallel Applications

A custom-built serial board connects FPGAs to accelerate performance.

A heterogeneous cluster composed of host processors and field-programmable gate arrays (FPGAs) was used to accelerate the performance of fine-grained parallel applications by means of a direct FPGA-to-FPGA communications channel. The communications channel is implemented with an all-to-all board that attaches directly to the FPGA boards via their I/O interface. Parallel Discrete Event Simulation (PDES) was used to demonstrate the resulting acceleration.

Figure: Architecture of the individual HPCC node, showing the I/O card that connects each FPGA to the custom-built all-to-all serial board.
PDES is an approach to parallelizing simulation to increase its performance and capacity, allowing the simulation of bigger, more detailed models and more interesting scenarios in a given time. PDES underlies several areas of interest for the Department of Defense, including war games, planning and decision-making, and complex system design and analysis, including both hardware and software systems.

In previous efforts to accelerate the performance of PDES, it was found that the communication subsystem is a major bottleneck. In addition, initial efforts in exploiting the FPGAs on a Heterogeneous High Performance Cluster (HHPC) to accelerate the performance of a PDES simulation were reported. The goal of that study was to use the FPGA boards to accelerate some critical simulation subsystems. However, because PDES is fine-grained and communication with the FPGA board is expensive, it is nearly impossible to use the FPGAs to optimize the simulation kernel itself.

In response to this limitation, an alternative channel was created that lets the FPGAs communicate without interrupting the primary host processor. To achieve this, a serial all-to-all connector board was designed that provides direct, low-bandwidth, low-latency connectivity among the FPGA boards. This board provides a channel for the FPGAs to communicate directly, potentially greatly improving the performance of fine-grained applications that place components of the computation on the FPGAs.

To demonstrate such an application, the Global Virtual Time (GVT) computation was used as a target for FPGA implementation. Each node provides its local time and message counts to the FPGA board when it enters the GVT computation phase and whenever its in-transit message count changes. The boards communicate among themselves to track the global count of messages in transit. When that count reaches zero, they compute the minimum of the local times and broadcast it to all the host processors.
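The brief describes the computation only at this level. Purely as an illustration, the following C sketch models the message-count test and minimum reduction described above; the type gvt_report_t, the function gvt_reduce, and the field names are hypothetical, and in the actual system this logic runs on the FPGAs over the serial links rather than in host software.

#include <stdint.h>
#include <stdbool.h>

/* Per-node report pushed to the local FPGA board (hypothetical layout):
 * the node's local virtual time plus running counts of messages sent
 * and received, so the boards can track the global in-transit count. */
typedef struct {
    uint64_t local_time;  /* minimum timestamp of unprocessed events on this node */
    uint64_t sent;        /* total messages sent by this node */
    uint64_t received;    /* total messages received by this node */
} gvt_report_t;

/* Software model of the reduction the boards perform: once the global
 * in-transit count (sent minus received, summed over all nodes) reaches
 * zero, GVT is the minimum of the reported local times. */
bool gvt_reduce(const gvt_report_t *reports, int n_nodes, uint64_t *gvt_out)
{
    int64_t in_transit = 0;
    uint64_t min_time = UINT64_MAX;

    for (int i = 0; i < n_nodes; i++) {
        in_transit += (int64_t)reports[i].sent - (int64_t)reports[i].received;
        if (reports[i].local_time < min_time)
            min_time = reports[i].local_time;
    }

    if (in_transit != 0)
        return false;     /* messages still in flight; GVT is not yet valid */

    *gvt_out = min_time;  /* this value is broadcast back to every host processor */
    return true;
}

In the system described here, this reduction is carried out by the FPGA boards over the dedicated serial links, so the host simulation kernels only supply their reports and read back the broadcast GVT value.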

The all-to-all board was tested for functionality and performance to establish the baseline physical rate at which it can communicate. In addition, support for communication over the all-to-all board had to be developed: the equivalent of a link layer for this communication channel.
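The brief does not give the frame format used by this link layer. As a rough illustration of what such support involves, the following C sketch defines a hypothetical minimal frame for the dedicated serial lines, with a start delimiter, source node identifier, payload length, and checksum; all names and constants are assumptions.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define FRAME_SOF   0x7Eu   /* start-of-frame delimiter (assumed) */
#define MAX_PAYLOAD 16u     /* GVT reports are small, so frames can be too */

/* Hypothetical minimal frame carried on each dedicated serial line. */
typedef struct {
    uint8_t sof;                   /* start-of-frame marker */
    uint8_t src;                   /* sending node identifier */
    uint8_t len;                   /* payload length in bytes */
    uint8_t payload[MAX_PAYLOAD];  /* e.g., a GVT report */
    uint8_t checksum;              /* additive checksum over header and payload */
} serial_frame_t;

uint8_t frame_checksum(const serial_frame_t *f)
{
    uint8_t sum = (uint8_t)(f->sof + f->src + f->len);
    for (uint8_t i = 0; i < f->len; i++)
        sum = (uint8_t)(sum + f->payload[i]);
    return (uint8_t)~sum;   /* one's complement so an all-zero frame is rejected */
}

/* Wrap a small payload in a frame ready for transmission. */
int frame_pack(serial_frame_t *f, uint8_t src, const void *data, size_t len)
{
    if (len > MAX_PAYLOAD)
        return -1;          /* payload too large for this toy format */
    f->sof = FRAME_SOF;
    f->src = src;
    f->len = (uint8_t)len;
    memcpy(f->payload, data, len);
    f->checksum = frame_checksum(f);
    return 0;
}

Because every pair of nodes has its own dedicated line, no destination address or routing is needed; framing and error detection are the main concerns for such a layer.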

The HHPC is a Beowulf cluster made of off-the-shelf PCs (featuring dual Intel Xeon processors) interconnected via a Gigabit Ethernet network and a Myrinet network. In addition, each node has an Annapolis Micro Systems Wildstar II FPGA board on the PCI bus. The Wildstar has a Xilinx Virtex II FPGA, some DRAM and SRAM banks, and an LVDS I/O card. The I/O card was used to interconnect the FPGAs directly to each other using a custom-built all-to-all serial board. This board provides connectivity from every node to every other node concurrently using a dedicated serial line. The result is a low-latency but low-bandwidth communication channel among the FPGAs.

Without this connectivity, all communication must go through the communication fabric, at a latency ranging from about 10 microseconds (for Myrinet) to several tens of microseconds (for Gigabit Ethernet). Typically, FPGA boards are used to accelerate sequential or high-granularity parallel applications that have high data parallelism or unusual data paths. PDES does not fit this profile: it is fine-grained and does not, in general, exhibit high data parallelism.

This work was done by Nael Abu-Ghazaleh of the State University of New York – Binghamton for the Air Force Research Laboratory.

AFRL-0118



This Brief includes a Technical Support Package (TSP), "Using High-Performance Computing Clusters to Support Fine-Grained Parallel Applications" (reference AFRL-0118), which is currently available for download from the TSP library.
