Open MPI: Open Source High Performance Computing

Software in the Public Interest (SPI)

The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners.

Other

High Performance Computing

Software

https://www.open-mpi.org/

The big data revolution has ushered in an era of ever-increasing volumes and complexity of data, requiring ever-faster computational analysis. During this same era, CPU performance growth has been stagnating, pushing the industry either to scale its computation horizontally using multiple nodes in data centers, or to scale vertically using heterogeneous components to reduce compute time. However, networking and storage continue to provide both higher throughput and lower latency, which allows for leveraging heterogeneous components deployed in data centers around the world. Still, the integration of big data analytics frameworks with heterogeneous hardware components such as GPGPUs and FPGAs is challenging, because there is an increasing gap in the level of abstraction between analytics solutions developed with big data analytics frameworks and accelerated kernels developed with heterogeneous components. In this article, we focus on FPGA accelerators that have seen wide-scale deployment in large cloud infrastructures. FPGAs allow the implementation of highly optimized hardware architectures, tailored exactly to an application and unburdened by the overhead associated with traditional general-purpose computer architectures. FPGAs implementing dataflow-oriented architectures with high levels of (pipeline) parallelism can provide high application throughput, often with high energy efficiency. Latency-sensitive applications can leverage FPGA accelerators by connecting directly to the physical layer of a network and performing data transformations without going through the software stacks of the host system. While these advantages of FPGA accelerators hold promise, difficulties associated with programming and integration limit their use. This article explores the existing practices in big data analytics frameworks, discusses the aforementioned gap in development abstractions, and provides some perspectives on how to address these challenges in the future.

Engineering, Other

High Performance Data Analysis

Paper

https://ieeexplore.ieee.org/document/9439431

This paper presents an overview and performance analysis of a software-programmable domain-customizable System-on-Chip (SoC) overlay for low-latency inferencing of variable and low-precision Machine Learning (ML) networks targeting Internet-of-Things (IoT) edge devices. The SoC includes a 2-D processor array that can be customized at design time for FPGA logic families. The overlay resolves historic issues of poor designer productivity associated with traditional Field Programmable Gate Array (FPGA) design flows without the performance losses normally incurred by overlays. A standard Instruction Set Architecture (ISA) allows different ML networks to be quickly compiled and run on the overlay without the need to resynthesize. Performance results are presented that show the overlay achieves 1.3× to 8.0× speedup over custom designs while still allowing rapid changes to ML algorithms on the FPGA through standard compilation.

Engineering, Other

Machine Learning / AI

Paper

https://www.computer.org/csdl/proceedings-article/fpl/2021/375900a024/1xDQ3iK1FRK

Spiking Neural Networks (SNNs) are the next generation of Artificial Neural Networks (ANNs) that utilize an event-based representation to perform more efficient computation. Most SNN implementations have a systolic array-based architecture and, by assuming high sparsity in spikes, significantly reduce computing in their designs. This work shows this assumption does not hold for applications with signals of large temporal dimension. We develop a streaming SNN (S2N2) architecture that can support fixed-per-layer axonal and synaptic delays for its network. Our architecture is built upon FINN and thus efficiently utilizes FPGA resources. We show how radio frequency processing matches our S2N2 computational model. By not performing tick-batching, a stream of RF samples can efficiently be processed by S2N2, improving the memory utilization by more than three orders of magnitude.

Engineering, Other

Machine Learning / AI

Paper

https://dl.acm.org/doi/10.1145/3431920.3439283

Binary neural networks (BNNs) have 1-bit weights and activations. Such networks are well suited for FPGAs, as their dominant computations are bitwise arithmetic and the memory requirement is also significantly reduced. However, compared to state-of-the-art compact convolutional neural network (CNN) models, BNNs tend to produce a much lower accuracy on realistic datasets such as ImageNet. In addition, the input layer of BNNs has gradually become a major compute bottleneck, because it is conventionally excluded from binarization to avoid a large accuracy loss. This work proposes FracBNN, which exploits fractional activations to substantially improve the accuracy of BNNs. Specifically, our approach employs a dual-precision activation scheme to compute features with up to two bits, using an additional sparse binary convolution. We further binarize the input layer using a novel thermometer encoding. Overall, FracBNN preserves the key benefits of conventional BNNs, where all convolutional layers are computed in pure binary MAC operations (BMACs). We design an efficient FPGA-based accelerator for our novel BNN model that supports the fractional activations. To evaluate the performance of FracBNN under a resource-constrained scenario, we implement the entire optimized network architecture on an embedded FPGA (Xilinx Ultra96 v2). Our experiments on ImageNet show that FracBNN achieves an accuracy comparable to MobileNetV2, surpassing the best-known BNN design on FPGAs with an increase of 28.9% in top-1 accuracy and a 2.5x reduction in model size. FracBNN also outperforms a recently introduced BNN model with an increase of 2.4% in top-1 accuracy while using the same model size. On the embedded FPGA device, FracBNN demonstrates real-time image classification.
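As a rough illustration of why binary MAC operations map well to bitwise hardware, the sketch below computes a dot product of two {-1, +1} vectors with XNOR and popcount, and shows a generic thermometer code for a small integer. This is not code from the FracBNN paper: the function names `pack_bits`, `binary_dot`, and `thermometer_encode` are hypothetical, and the thermometer code shown is the textbook unary form, which may differ from the variant the authors use.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an int: +1 -> bit 1, -1 -> bit 0."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_word, w_word, n):
    """Dot product of two n-element {-1, +1} vectors stored as packed bits.

    XNOR marks the positions where the two vectors agree; each agreement
    contributes +1 and each disagreement contributes -1.
    """
    mask = (1 << n) - 1
    xnor = ~(a_word ^ w_word) & mask
    matches = bin(xnor).count("1")  # popcount
    return 2 * matches - n


def thermometer_encode(x, levels):
    """Generic thermometer (unary) code: the first x of `levels` bits are 1.

    Assumes 0 <= x <= levels. Illustrative only; not the paper's exact scheme.
    """
    return [1 if i < x else 0 for i in range(levels)]


if __name__ == "__main__":
    activations = [1, -1, 1, 1]
    weights = [1, 1, -1, 1]
    print(binary_dot(pack_bits(activations), pack_bits(weights), 4))
    print(thermometer_encode(3, 5))
```

On an FPGA the same XNOR and popcount steps would typically be realized directly in lookup tables rather than in software; the Python form is only meant to make the arithmetic concrete.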

Engineering, Other

Machine Learning / AI

Paper

https://dl.acm.org/doi/10.1145/3431920.3439296

The remarkable success of machine learning (ML) in a variety of research domains has inspired academic and industrial communities to explore its potential to address hardware Trojan (HT) attacks. While numerous works have been published over the past decade, few survey papers, to the best of our knowledge, have systematically reviewed the achievements and analyzed the remaining challenges in this area. To fill this gap, this article surveys ML-based approaches against HT attacks available in the literature. In particular, we first provide a classification of all possible HT attacks and then review recent developments from four perspectives, i.e., HT detection, design-for-security (DFS), bus security, and secure architecture. Based on the review, we further discuss the lessons learned in and challenges arising from previous studies. Although current work focuses more on chip-layer HT problems, it is notable that novel HT threats are constantly emerging and have evolved beyond chips to the component, device, and even behavior layers, compromising the security and trustworthiness of the overall hardware ecosystem. Therefore, we divide the HT threats into four layers and propose a hardware Trojan defense (HTD) reference model from the perspective of the overall hardware ecosystem, categorizing the security threats and requirements in each layer to provide a guideline for future research in this direction.

Engineering, Other

Machine Learning / AI

Paper

https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8952724

FPGA Architecture: Survey and Challenges

University of Toronto, Canada

Field-Programmable Gate Arrays (FPGAs) have become one of the key digital circuit implementation media over the last decade. A crucial part of their creation lies in their architecture, which governs the nature of their programmable logic functionality and their programmable interconnect. FPGA architecture has a dramatic effect on the quality of the final device's speed performance, area efficiency, and power consumption.

Engineering, Other

High Performance Computing

Paper

https://ieeexplore.ieee.org/document/8187326