In-Cache Processing with Power-of-Two Quantization for Fast CNN Inference on CPUs
Joseph Woo
Seungtae Lee
Seongwook Kim
Gwangeun Byeon
Seokin Hong
Convolutional Neural Networks (CNNs) demand high computational capabilities, motivating researchers to leverage Processing-In-Memory (PIM) technology to achieve significant performance improvements. However, implementing complex arithmetic operations such as multiplication within memory remains a major challenge for PIM architectures. To address this challenge, this paper proposes a PIM-enabled cache (PEC) architecture that uses shifters to perform multiplication operations at low cost. We also introduce a filter-wise, hardware-friendly Power-of-Two (POT) quantization scheme that quantizes the weights of selected filters to power-of-two values, allowing the PEC to accelerate convolution operations. Our experimental results demonstrate that the proposed PEC, together with POT quantization, achieves a 2.28x speedup on average with an accuracy degradation of 0.784%.
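The core idea, that a power-of-two weight turns a multiply into a shift, can be illustrated with a minimal sketch. The function names and the rounding/clamping details below are illustrative assumptions, not the paper's exact quantizer:

```python
import math

def pot_quantize(w, bits=4):
    # Illustrative POT quantizer: map a weight to the nearest signed
    # power of two so multiplication by it reduces to a bit shift.
    if w == 0.0:
        return 0.0
    exp = round(math.log2(abs(w)))
    # Clamp the exponent to what a `bits`-bit exponent code can encode
    # (an assumed encoding, for illustration only).
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    exp = max(lo, min(hi, exp))
    return math.copysign(2.0 ** exp, w)

def shift_multiply(x, exp):
    # With a POT weight 2**exp, multiplying an integer activation x
    # becomes a shift, the operation the in-cache shifters provide.
    return x << exp if exp >= 0 else x >> -exp
```

For example, a weight of 0.3 quantizes to 2^-2 = 0.25, so the product with an activation is obtained by shifting right by two bit positions instead of invoking a multiplier.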
Keywords