Lupin: Spatial Resource Stealing with Outlier-First Encoding for Mixed-Precision LLM Acceleration
Taein Kim
Sukhyun Han
Seongwook Kim
Gwangeun Byeon
Jungmin Lee
Seokin Hong
The rapid growth of Large Language Models (LLMs) has pushed model sizes beyond on-chip memory capacity, so inference incurs frequent external memory accesses and becomes bandwidth-bound, degrading data-transfer efficiency and overall performance. Quantization lowers memory traffic by reducing the bit-width of data, but outliers still limit its accuracy. Existing mixed-precision accelerators attempt to mitigate this problem through specialized encoding schemes, yet they introduce hardware complexity, irregular memory layouts, or pipeline stalls, thereby limiting overall efficiency. We introduce \textit{Lupin}, an algorithm-architecture co-design featuring outlier-first encoding, which grants outliers high precision by letting them occupy the storage and computing resources of less critical normal values. This scheme preserves matrix regularity, ensures compatibility with conventional memory systems, and enables stall-free execution using paired low-precision MAC units. Experimental results show that Lupin maintains model accuracy while achieving a 2.02× speedup and 24% lower power consumption. These results highlight Lupin as an efficient solution for accelerating mixed-precision LLMs.
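To make the encoding idea concrete, the following is a minimal Python sketch of how an outlier-first layout could work. It assumes 4-bit normal values, 8-bit outliers, a fixed magnitude threshold, and a separate outlier-flag bitmap; the names, the threshold, and the exact bit format are illustrative assumptions, not Lupin's actual specification.

```python
import numpy as np

# Assumed parameters for illustration only (not the paper's exact format).
BITS = 4                                   # normal-value precision
THRESHOLD = 4.0                            # outlier magnitude cutoff
SCALE = THRESHOLD / (2 ** (BITS - 1) - 1)  # shared quantization step

def encode(row):
    """Pack one weight row into fixed low-bit slots.

    Every position owns one 4-bit slot. An outlier additionally steals
    the slot of the next (less critical) normal value, so its 8-bit code
    spans two slots while the slot count, and hence the memory layout,
    stays regular.
    """
    slots = np.zeros(len(row), dtype=np.uint8)   # one slot per position
    is_outlier = np.zeros(len(row), dtype=bool)  # metadata: marks outlier pairs
    i = 0
    while i < len(row):
        # An outlier at the last position has no slot to steal and is
        # clamped to the normal range instead.
        if abs(row[i]) >= THRESHOLD and i + 1 < len(row):
            q = int(np.clip(round(row[i] / SCALE), -128, 127))  # 8-bit code
            is_outlier[i] = True
            slots[i] = q & 0x0F               # low nibble in its own slot
            slots[i + 1] = (q >> 4) & 0x0F    # high nibble in the stolen slot
            i += 2                            # neighbor's value is sacrificed
        else:
            q = int(np.clip(round(row[i] / SCALE), -8, 7))      # 4-bit code
            slots[i] = q & 0x0F
            i += 1
    return slots, is_outlier

def decode(slots, is_outlier):
    """Inverse of encode; stolen slots decode to zero."""
    out = np.zeros(len(slots))
    i = 0
    while i < len(slots):
        if is_outlier[i]:
            code = int(slots[i]) | (int(slots[i + 1]) << 4)
            code = code - 256 if code >= 128 else code  # sign-extend 8 bits
            out[i] = code * SCALE
            i += 2
        else:
            code = int(slots[i])
            code = code - 16 if code >= 8 else code     # sign-extend 4 bits
            out[i] = code * SCALE
            i += 1
    return out

row = np.array([0.3, -5.2, 1.1, 0.0, 7.9, -0.4])
slots, flags = encode(row)
print(decode(slots, flags))  # outliers kept at 8-bit precision; stolen slots read as 0
```

Because every position still owns exactly one slot, the row's length and stride are unchanged, which is what would let paired low-precision MAC units consume the stream without stalls.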
Keywords