COMPASSLAB | SparseFT: Sparsity-aware Fault Tolerance for Reliable CNN Inference on GPUs

Publications

2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT) (Poster)

Gwangeun Byeon
Seungtae Lee
Seongwook Kim
Yongjun Kim
Prashant J. Nair
Seokin Hong

Abstract

Graphics Processing Units (GPUs), while offering exceptional performance for CNN inference tasks, are susceptible to both transient and permanent hardware faults due to the integration of numerous processing elements and advancements in technology scaling. This paper proposes a novel and cost-effective fault mitigation technique, called Sparsity-aware Fault Tolerance (SparseFT), to ensure reliable CNN inference on GPUs. SparseFT leverages the inherent sparsity of activation maps to detect and correct errors in the processing elements without hardware redundancy. It exploits a property of dot products: a multiplication with a zero operand is ineffectual. SparseFT dynamically duplicates an effectual computation (i.e., a multiplication with non-zero operands) onto the processing element originally assigned an ineffectual one, then compares the two results to detect errors. Experimental results demonstrate that SparseFT achieves more than 97% error detection coverage with less than 1% performance overhead on state-of-the-art CNN models.
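
The duplication-and-compare idea described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' GPU implementation: the function name `sparse_ft_dot`, the lane numbering, the pairing of idle lanes with effectual lanes in index order, and the `fault` injection parameter are all hypothetical, and only detection (not correction) is modeled.

```python
def sparse_ft_dot(activations, weights, fault=None):
    """Model a dot product where each index is one processing-element lane.

    fault: optional (lane, delta) pair; any product computed ON that lane
           is corrupted by adding delta (hypothetical fault model).
    Returns (result, error_detected, checked_sources).
    """
    n = len(activations)

    def multiply(lane, a, w):
        # A multiplication executed on a specific lane; a faulty lane
        # returns a corrupted product.
        p = a * w
        if fault is not None and fault[0] == lane:
            p += fault[1]
        return p

    # Effectual lanes have two non-zero operands; the rest would only
    # contribute zero, so they are free to run a duplicate instead.
    effectual = [i for i in range(n) if activations[i] != 0 and weights[i] != 0]
    idle = [i for i in range(n) if i not in effectual]

    # Primary computation on the originally assigned lanes.
    products = {i: multiply(i, activations[i], weights[i]) for i in effectual}

    # Assumed pairing policy: the k-th idle lane re-executes the k-th
    # effectual multiplication, and the two results are compared.
    error_detected = False
    checked_sources = []
    for spare, src in zip(idle, effectual):
        duplicate = multiply(spare, activations[src], weights[src])
        checked_sources.append(src)
        if duplicate != products[src]:
            error_detected = True

    return sum(products.values()), error_detected, checked_sources


if __name__ == "__main__":
    acts = [3, 0, 2, 0]   # two zero activations -> two idle lanes
    wts = [4, 5, 6, 7]
    print(sparse_ft_dot(acts, wts))                 # fault-free run
    print(sparse_ft_dot(acts, wts, fault=(2, 1)))   # fault on an effectual lane
```

In this model a fault goes unnoticed when it hits an effectual lane that no idle lane duplicates, which is why detection coverage depends on how much sparsity is available.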

Keywords

Computational modeling

Fault tolerant systems

Redundancy

Graphics processing units

Hardware

Parallel architectures

Transient analysis