Academic Article

Linear Video Transformer Network
Document Type
Conference
Source
한국자동차공학회 춘계학술대회 (KSAE Spring Conference). 2021-06, 2021(6):942-945
Subject
Video Recognition
Video Classification
Action Recognition
Linear Transformers
Transformers
Visual Transformers
Artificial Intelligence
Deep Learning
Language
Korean
ISSN
2713-7163
Abstract
This paper presents a video classification model built on linear-complexity Transformers. The method applies attention in both the spatial and the temporal domain, but uses efficient Transformers (namely the Linformer and the Longformer) to reduce the complexity of attention from quadratic to linear in both domains. By exploiting the low-rank structure of self-attention (Linformer) and a pattern of local attention within a fixed window combined with a small amount of global attention (Longformer), the quadratic complexity of full self-attention in the temporal and spatial domains is reduced to linear, which drastically lowers memory usage and training/inference time. Transformer-based models have been shown in the literature to achieve results comparable to, and speed superior to, 3D-convolution-based models (namely I3D and SlowFast). Experiments show comparable results on the Kinetics-400 dataset. By comparing the pros and cons of the Longformer and the Linformer in the temporal and spatial domains, we conclude which model offers the best accuracy and/or inference speed.
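As a rough illustration of the Linformer-style mechanism described in the abstract, the sketch below shows single-head low-rank self-attention over a flattened sequence of video tokens: keys and values are projected along the sequence axis from length n down to a fixed rank k, so the attention map is n×k rather than n×n. The class name, dimensions, and projection rank are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LinformerSelfAttention(nn.Module):
    """Single-head Linformer-style attention (illustrative sketch).

    K and V are compressed from sequence length n to rank k, so the
    attention map has shape (n, k) instead of (n, n), giving linear
    rather than quadratic cost in the number of video tokens.
    """
    def __init__(self, dim, seq_len, k=64):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        # Low-rank projections along the sequence (token) dimension.
        self.proj_k = nn.Linear(seq_len, k, bias=False)
        self.proj_v = nn.Linear(seq_len, k, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):                 # x: (batch, n, dim)
        q = self.to_q(x)                  # (b, n, d)
        k = self.to_k(x)                  # (b, n, d)
        v = self.to_v(x)                  # (b, n, d)
        # Compress the sequence axis of K and V: (b, n, d) -> (b, k, d).
        k = self.proj_k(k.transpose(1, 2)).transpose(1, 2)
        v = self.proj_v(v.transpose(1, 2)).transpose(1, 2)
        attn = (q @ k.transpose(1, 2)) * self.scale   # (b, n, k)
        attn = attn.softmax(dim=-1)
        out = attn @ v                                # (b, n, d)
        return self.to_out(out)

# Example: 8 frames x 196 spatial patches flattened into one token sequence
# (hypothetical sizes chosen only for illustration).
tokens = torch.randn(2, 8 * 196, 256)
attn = LinformerSelfAttention(dim=256, seq_len=8 * 196, k=64)
print(attn(tokens).shape)  # torch.Size([2, 1568, 256])
```

A Longformer-style variant would instead restrict each token's attention to a fixed local window (plus a few global tokens), which likewise keeps the cost linear in the sequence length; the paper compares both choices in the spatial and temporal domains.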
