The transformer architecture has enabled large language models (LLMs) to improve a wide range of AI applications. Its core component, the multi-head self-attention mechanism, presents a major bottleneck due to its extensive computational and memory bandwidth requirements. While recent approaches using sparse attention and attention formula reordering address these challenges, efficient LLM processing remains a key bottleneck on existing hardware. This thesis proposes an energy-efficient eDRAM-based compute-in/near-memory (CINM) processor that mitigates these bottlenecks through three key features. First, an attention block fusion computation strategy is employed to maximize data reuse within the attention map. This approach yields an 85.86% reduction in external memory access and raises hardware utilization to 86.1%. Second, a CINM architecture resolves the imbalance between memory and computation, which, combined with a heterogeneous pipeline, achieves a 77.27% reduction in system latency. Third, a compute-in-memory array supporting the cross-read operation eliminates data direction conflicts, resulting in a 98.44% latency reduction. Furthermore, this array utilizes dual-row computation with reduced adder logic to improve energy efficiency by 1.58×. The processor, designed in 28 nm CMOS technology, achieves 36.28–58.05 TOPS/W and demonstrates an F1 score of 92.41% on the SQuAD v1.1 benchmark using the BigBird-large model.
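To illustrate the general idea behind block-fused attention, the following minimal NumPy sketch computes attention tile by tile with an online softmax, so that each partial attention-map block stays local instead of being written to external memory. The tile sizes, variable names, and online-softmax bookkeeping are illustrative assumptions and do not reflect the processor's actual dataflow or RTL.

```python
# Minimal sketch (assumed dataflow, not the thesis implementation):
# blockwise-fused attention that never materializes the full attention map.
import numpy as np

def fused_block_attention(Q, K, V, block_q=64, block_k=64):
    n, d = Q.shape
    out = np.zeros((n, d))
    for qs in range(0, n, block_q):
        qe = min(qs + block_q, n)
        Qb = Q[qs:qe]                              # query tile kept "on chip"
        m = np.full(qe - qs, -np.inf)              # running row maximum
        l = np.zeros(qe - qs)                      # running softmax denominator
        acc = np.zeros((qe - qs, d))               # running weighted-V accumulator
        for ks in range(0, n, block_k):
            ke = min(ks + block_k, n)
            S = Qb @ K[ks:ke].T / np.sqrt(d)       # partial attention-map block
            m_new = np.maximum(m, S.max(axis=1))
            P = np.exp(S - m_new[:, None])
            scale = np.exp(m - m_new)              # rescale previous partial results
            l = l * scale + P.sum(axis=1)
            acc = acc * scale[:, None] + P @ V[ks:ke]
            m = m_new
        out[qs:qe] = acc / l[:, None]
    return out

# Quick check against the naive (full attention map) formulation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(fused_block_attention(Q, K, V), ref, atol=1e-6)
```

The point of the sketch is only that fusing the score, softmax, and value-weighting steps per tile bounds intermediate storage to one block, which is the kind of reuse that reduces external memory access in the fused strategy described above.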
Publisher: Ulsan National Institute of Science and Technology