Lupin: Spatial Resource Stealing with Outlier-First Encoding for Mixed-Precision LLM Acceleration
Taein Kim
Sukhyun Han
Seongwook Kim
Gwangeun Byeon
Jungmin Lee
Seokin Hong
The rapid growth of Large Language Models (LLMs) has pushed model sizes beyond on-chip memory capacity, so inference incurs frequent external memory accesses and becomes bandwidth-bound, degrading data-transfer efficiency and overall performance. Quantization lowers memory traffic by reducing the bit-width of data, but outliers still limit its accuracy. Existing mixed-precision accelerators attempt to mitigate this problem through specialized encoding schemes, yet they introduce hardware complexity, irregular memory layouts, or pipeline stalls, thereby limiting overall efficiency. We introduce \textit{Lupin}, an algorithm-architecture co-design featuring outlier-first encoding, which grants outliers high precision by letting them occupy the storage and computing resources of less critical normal values. This scheme preserves matrix regularity, ensures compatibility with conventional memory systems, and enables stall-free execution using paired low-precision MAC units. Experimental results show that Lupin maintains model accuracy while achieving a 2.02× speedup and 24% lower power consumption. These results highlight Lupin as an efficient solution for accelerating mixed-precision LLMs.
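To make the encoding idea concrete, the following is a minimal Python sketch of how an outlier-first layout could work. It assumes 4-bit normal values, 8-bit outliers, a fixed magnitude threshold, and a separate outlier-flag bitmap; the names, the threshold, and the exact bit format are illustrative assumptions, not Lupin's actual specification.

```python
import numpy as np

# Assumed parameters for illustration only (not the paper's exact format).
BITS = 4                                   # normal-value precision
THRESHOLD = 4.0                            # outlier magnitude cutoff
SCALE = THRESHOLD / (2 ** (BITS - 1) - 1)  # shared quantization step

def encode(row):
    """Pack one weight row into fixed low-bit slots.

    Every position owns one 4-bit slot. An outlier additionally steals
    the slot of the next (less critical) normal value, so its 8-bit code
    spans two slots while the slot count, and hence the memory layout,
    stays regular.
    """
    slots = np.zeros(len(row), dtype=np.uint8)   # one slot per position
    is_outlier = np.zeros(len(row), dtype=bool)  # metadata: marks outlier pairs
    i = 0
    while i < len(row):
        # An outlier at the last position has no slot to steal and is
        # clamped to the normal range instead.
        if abs(row[i]) >= THRESHOLD and i + 1 < len(row):
            q = int(np.clip(round(row[i] / SCALE), -128, 127))  # 8-bit code
            is_outlier[i] = True
            slots[i] = q & 0x0F               # low nibble in its own slot
            slots[i + 1] = (q >> 4) & 0x0F    # high nibble in the stolen slot
            i += 2                            # neighbor's value is sacrificed
        else:
            q = int(np.clip(round(row[i] / SCALE), -8, 7))      # 4-bit code
            slots[i] = q & 0x0F
            i += 1
    return slots, is_outlier

def decode(slots, is_outlier):
    """Inverse of encode; stolen slots decode to zero."""
    out = np.zeros(len(slots))
    i = 0
    while i < len(slots):
        if is_outlier[i]:
            code = int(slots[i]) | (int(slots[i + 1]) << 4)
            code = code - 256 if code >= 128 else code  # sign-extend 8 bits
            out[i] = code * SCALE
            i += 2
        else:
            code = int(slots[i])
            code = code - 16 if code >= 8 else code     # sign-extend 4 bits
            out[i] = code * SCALE
            i += 1
    return out

row = np.array([0.3, -5.2, 1.1, 0.0, 7.9, -0.4])
slots, flags = encode(row)
print(decode(slots, flags))  # outliers kept at 8-bit precision; stolen slots read as 0
```

Because every position still owns exactly one slot, the row's length and stride are unchanged, which is what would let paired low-precision MAC units consume the stream without stalls.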
Keywords