Academic Paper

SpeedLimit: Neural Architecture Search for Quantized Transformer Models

Document Type

Working Paper

Author

Chai, Yuji; Bailey, Luke; Jin, Yunho; Karle, Matthew; Ko, Glenn G.; Brooks, David; Wei, Gu-Yeon; Kung, H. T.

Subject

Computer Science - Machine Learning

Abstract

While research in the field of transformer models has primarily focused on enhancing performance metrics such as accuracy and perplexity, practical applications in industry often necessitate a rigorous consideration of inference latency constraints. Addressing this challenge, we introduce SpeedLimit, a novel Neural Architecture Search (NAS) technique that optimizes accuracy whilst adhering to an upper-bound latency constraint. Our method incorporates 8-bit integer quantization in the search process to outperform the current state-of-the-art technique. Our results underline the feasibility and efficacy of seeking an optimal balance between performance and latency, providing new avenues for deploying state-of-the-art transformer models in latency-sensitive environments.

Online Access

Open Access (arXiv)