Publications

SparseFT: Sparsity-aware Fault Tolerance for Reliable CNN Inference on GPUs

2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT) (Poster)

  • Gwangeun Byeon

  • Seungtae Lee

  • Seongwook Kim

  • Yongjun Kim

  • Prashant J. Nair

  • Seokin Hong

Abstract

Graphics Processing Units (GPUs) offer exceptional performance for CNN inference, but they are susceptible to both transient and permanent hardware faults because they integrate numerous processing elements and because of continued technology scaling. This paper proposes a novel and cost-effective fault mitigation technique, called Sparsity-aware Fault Tolerance (SparseFT), to ensure reliable CNN inference on GPUs. SparseFT leverages the inherent sparsity in activation maps to detect and correct errors on the processing elements without hardware redundancy. It exploits a characteristic of dot-products: multiplications with zero operands are ineffectual. SparseFT dynamically duplicates an effectual computation (i.e., a multiplication with non-zero operands) onto the processing element that was originally assigned an ineffectual one, then compares the two results to detect errors. Experimental results demonstrate that SparseFT achieves more than 97% error detection coverage with less than 1% performance overhead for state-of-the-art CNN models.
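The duplicate-and-compare idea in the abstract can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the paper's implementation: each index of the dot-product is treated as one processing element ("lane"), idle lanes are paired with effectual multiplications in simple index order, and only detection (not correction) is modeled. The names `sparse_ft_dot`, `healthy_pe`, and `faulty_pe` are hypothetical and do not come from the paper.

```python
from typing import Callable, List, Tuple

# A processing element is modeled as a function (lane, a, w) -> product.
PE = Callable[[int, float, float], float]


def healthy_pe(lane: int, a: float, w: float) -> float:
    """Fault-free multiplier."""
    return a * w


def faulty_pe(lane: int, a: float, w: float) -> float:
    """Hypothetical fault model: lane 2 always adds 1.0 to its product."""
    result = a * w
    return result + 1.0 if lane == 2 else result


def sparse_ft_dot(
    acts: List[float], weights: List[float], pe: PE = healthy_pe
) -> Tuple[float, bool, int]:
    """Dot-product with sparsity-based duplication (illustrative only).

    Returns (dot_product, error_detected, number_of_checked_products).
    """
    n = len(acts)
    # Lanes whose multiplication has two non-zero operands do useful work.
    effectual = [i for i in range(n) if acts[i] != 0 and weights[i] != 0]
    # Lanes with a zero operand would only produce 0, so they are free.
    idle = [i for i in range(n) if acts[i] == 0 or weights[i] == 0]

    products = [0.0] * n
    for i in effectual:
        products[i] = pe(i, acts[i], weights[i])

    error_detected = False
    checked = 0
    # Assumed pairing policy: k-th idle lane re-computes k-th effectual product.
    for idle_lane, src in zip(idle, effectual):
        duplicate = pe(idle_lane, acts[src], weights[src])
        checked += 1
        if duplicate != products[src]:
            error_detected = True

    return sum(products), error_detected, checked


acts = [1.0, 0.0, 2.0, 0.0]      # two zero activations -> two idle lanes
weights = [3.0, 4.0, 5.0, 6.0]

ok = sparse_ft_dot(acts, weights, healthy_pe)
bad = sparse_ft_dot(acts, weights, faulty_pe)
print(ok)   # (13.0, False, 2)
print(bad)  # (14.0, True, 2)
```

In this model, an effectual product is only checked when an idle lane is available to re-compute it, so the number of checked products depends on how many zero operands the inputs contain.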

Keywords

  • Computational modeling
  • Fault tolerant systems
  • Redundancy
  • Graphics processing units
  • Hardware
  • Parallel architectures
  • Transient analysis