Evaluating the Performance of Modern GPUs with Partitioned Last-Level Caches
Jihun Yoon
Joonseong Hwang
Sukhyun Han
Sungbin Jang
Yoonho Jang
Seokin Hong
To meet the demand for high parallelism and greater computing capability, data-center GPUs (e.g., A100, H100) split their Network-on-Chip (NoC) into two partitions to improve local bandwidth. However, these partitioned designs introduce non-uniform access latencies that significantly affect performance yet are often neglected in current simulation frameworks. In this paper, we characterize the performance discrepancy that arises when traditional non-partitioned GPU simulations omit this network contention and therefore fail to reflect real hardware behavior. We extend Accel-Sim to accurately model the A100 GPU architecture, incorporating realistic L2 partitioning, cross-partition latency, and latency asymmetry. Through microbenchmark-guided calibration and validation against hardware profilers, we demonstrate that our simulator correlates strongly with real hardware and uncovers critical workload-specific trade-offs. Our model achieves a correlation of 0.991 with real hardware, compared to 0.989 for the prior design. Moreover, the prior non-partitioned design yields 1.17× longer L2 access latency than the partitioned design. These findings highlight the need for precise latency modeling when evaluating and designing future GPU architectures.
Keywords