Scrooge: Accelerating Attention Inference in LLMs via Early Termination Mechanism
Gwangeun Byeon
Seongwook Kim
Taein Kim
Jungmin Lee
Seokin Hong
Large Language Models (LLMs) have demonstrated remarkable performance in natural language processing and are now widely adopted in diverse applications. However, their substantial computation and memory costs make efficient inference difficult to achieve. In particular, the self-attention mechanism is a major bottleneck, as it cannot exploit batch parallelism across prompts, and its memory traffic grows quadratically with sequence length. In this paper, we propose Scrooge, a novel hardware accelerator framework that leverages an attention early termination mechanism to address the inefficiency of self-attention. The self-attention mechanism does not assign equal importance to all tokens; instead, semantically important tokens consistently receive higher attention scores. Consequently, preserving sufficient attention for a subset of important tokens is often enough to maintain model accuracy, even without computing attention for all tokens. Our key insight is that once sufficient attention has been accumulated, further computation on the remaining tokens only adds cost without improving accuracy. Scrooge exploits this insight by dynamically terminating the attention computation once sufficient attention has been gathered and approximating the contribution of the remaining tokens. With this method, Scrooge reduces both latency and memory traffic while maintaining accuracy. Experimental results show that Scrooge achieves a 1.7× speedup and reduces memory traffic to 0.47× of the baseline with negligible accuracy loss.
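To illustrate the idea behind threshold-based early termination, the sketch below computes single-query attention but stops accumulating value vectors once the processed tokens account for a target fraction of the softmax mass. It is a minimal functional sketch only, not Scrooge's actual design: the descending-score visiting order, the threshold parameter `tau`, and the choice to simply drop (rather than approximate) the skipped tokens are assumptions for illustration, and the full softmax denominator is precomputed here purely to measure accumulated mass, which a real accelerator would avoid.

```python
import numpy as np

def early_termination_attention(q, K, V, tau=0.95):
    """Single-query attention that terminates once `tau` of the softmax mass
    has been accumulated. q: (d,), K and V: (n, d). Sketch only."""
    d = q.shape[0]
    scores = K @ q / np.sqrt(d)            # raw attention logits
    order = np.argsort(-scores)            # visit most important tokens first (assumed ordering)
    weights = np.exp(scores - scores.max())
    total = weights.sum()                  # full denominator, used only to track accumulated mass

    acc_mass, denom = 0.0, 0.0
    out = np.zeros_like(V[0])
    for i in order:
        acc_mass += weights[i] / total
        out += weights[i] * V[i]
        denom += weights[i]
        if acc_mass >= tau:                # sufficient attention gathered: terminate early
            break
    return out / denom                     # renormalize over the tokens actually processed

# Usage: compare the early-terminated output against full attention on random data.
rng = np.random.default_rng(0)
q = rng.normal(size=8)
K, V = rng.normal(size=(128, 8)), rng.normal(size=(128, 8))
approx = early_termination_attention(q, K, V, tau=0.95)
exact = early_termination_attention(q, K, V, tau=1.0)   # processes every token
print("max abs error:", np.abs(approx - exact).max())
```

The renormalization over the processed tokens keeps the output a proper convex combination of value vectors, which is why truncating the low-weight tail changes the result only slightly when the attention distribution is concentrated on a few important tokens.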
Keywords