Academic Paper

DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
Document Type
Conference
Source
SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15, Nov. 2022
Subject
Communication, Networking and Broadcast Technologies
Technological innovation
Computational modeling
Aggregates
High performance computing
Graphics processing units
Production
Transformers
Deep Learning
Distributed Inference
Mixture of Experts
PyTorch
DeepSpeed
Transformer models
Language
ISSN
2167-4337
Abstract
The landscape of transformer model inference is increasingly diverse in model size, model characteristics, latency and throughput requirements, hardware requirements, etc. With such diversity, designing a versatile inference system is challenging. DeepSpeed-Inference addresses these challenges by providing (1) a multi-GPU inference solution that minimizes latency while maximizing throughput for both dense and sparse transformers when the model fits in aggregate GPU memory, and (2) a heterogeneous inference solution that leverages CPU/NVMe/GPU memory to enable high-throughput inference for models larger than aggregate GPU memory. DeepSpeed-Inference reduces latency by 6.4× and increases throughput by 1.5× over the state-of-the-art. It enables trillion-parameter-scale inference under real-time latency constraints by leveraging hundreds of GPUs, an unprecedented scale for inference. It can also perform inference on models 25× larger than GPU-only solutions allow, while delivering a high throughput of 84 TFLOPS (over 50% of A6000 peak).
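To make the abstract's multi-GPU inference solution concrete, the sketch below shows how a dense transformer is typically served through DeepSpeed's publicly documented deepspeed.init_inference API (tensor parallelism plus fused inference kernels). This is a minimal illustration, not code from the paper; the model name and tensor-parallel degree are placeholder assumptions.

```python
# Minimal sketch (illustrative, not from the paper): serve a Hugging Face
# causal LM with DeepSpeed-Inference's multi-GPU kernel injection.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # hypothetical choice; any causal LM follows the same pattern
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Shard the model across GPUs (tensor parallelism) and replace transformer
# layers with DeepSpeed's fused inference kernels. mp_size should match the
# number of GPUs used when launching the script with the `deepspeed` launcher.
engine = deepspeed.init_inference(
    model,
    mp_size=2,                       # illustrative tensor-parallel degree
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("DeepSpeed-Inference enables", return_tensors="pt").to("cuda")
outputs = engine.module.generate(inputs.input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For models larger than aggregate GPU memory, the paper's heterogeneous (ZeRO-Inference-style) path instead offloads weights to CPU or NVMe memory; that mode is configured rather than shown here.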