Academic Paper

Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers

Document Type

Conference

Author

Yoo, Jaehoon; Kim, Semin; Lee, Doyup; Kim, Chiheon; Hong, Seunghoon

Source

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 22888-22897, Jun. 2023

Subject

Computing and Processing
Training
Transformers
Robustness
Encoding
Complexity theory
Pattern recognition
Decoding
Image and video synthesis and generation

Language

ISSN

2575-7075

Abstract

Autoregressive transformers have shown remarkable success in video generation. However, these transformers cannot directly learn long-term dependencies in videos because of the quadratic complexity of self-attention, and they inherently suffer from slow inference and error propagation due to the autoregressive process. In this paper, we propose the Memory-efficient Bidirectional Transformer (MeBT) for end-to-end learning of long-term dependencies in videos and fast inference. Based on recent advances in bidirectional transformers, our method learns to decode the entire spatio-temporal volume of a video in parallel from partially observed patches. The proposed transformer achieves linear time complexity in both encoding and decoding by projecting observable context tokens into a fixed number of latent tokens and conditioning on them to decode the masked tokens through cross-attention. Empowered by linear complexity and bidirectional modeling, our method demonstrates significant improvement over autoregressive transformers in both quality and speed when generating moderately long videos. Videos and code are available at https://sites.google.com/view/mebt-cvpr2023.
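
The linear-complexity claim in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cross_attention`, `attention_pairs_full`, `attention_pairs_latent`) and the token counts are hypothetical, and the sketch assumes single-head scaled dot-product attention with no learned projections. It counts query-key pairs for full self-attention versus a two-step scheme in which a fixed number of latent tokens first attends to the context tokens and the masked tokens then attend to those latents.

```python
import math


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention on lists of vectors.

    Illustrative only: no learned projections, no batching.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def attention_pairs_full(n_tokens):
    """Query-key pairs for self-attention over all tokens: quadratic."""
    return n_tokens * n_tokens


def attention_pairs_latent(n_context, n_masked, n_latent):
    """Query-key pairs when attention is routed through n_latent latent tokens.

    Assumption: only the two cross-attention steps named in the abstract are
    counted. Any latent-to-latent processing would add a term that does not
    depend on video length.
    """
    encode = n_latent * n_context  # latents attend to context tokens
    decode = n_masked * n_latent   # masked tokens attend to latents
    return encode + decode


# Hypothetical token counts: doubling the video length quadruples the cost of
# full self-attention but only doubles the cost of the latent scheme.
short_full = attention_pairs_full(1000)
long_full = attention_pairs_full(2000)
short_latent = attention_pairs_latent(400, 600, 64)
long_latent = attention_pairs_latent(800, 1200, 64)

# Tiny end-to-end pass: 2 latent queries summarize 3 context tokens, then
# 4 masked-token queries read from the 2 latents.
context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latent_queries = [[0.5, 0.5], [1.0, -1.0]]
mask_queries = [[0.0, 0.0]] * 4
latents = cross_attention(latent_queries, context, context)
decoded = cross_attention(mask_queries, latents, latents)
```

With the number of latent tokens held fixed, both cross-attention steps grow in proportion to the number of video tokens, which is the sense in which the encoding and decoding cost is linear.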
