Academic Article

Linear Video Transformer Network
Document Type
Conference
Source
한국자동차공학회 춘계학술대회 (KSAE Spring Conference). 2021-06, 2021(6):942-945
Subject
Video Recognition
Video Classification
Action Recognition
Linear Transformers
Transformers
Visual Transformers
Artificial Intelligence
Deep Learning
Language
Korean
ISSN
2713-7163
Abstract
This paper presents a video classification model built on linear-complexity Transformers. The method applies attention in both the spatial and the temporal domain, but uses efficient Transformers (namely the Linformer and the Longformer) to reduce the complexity of attention from quadratic to linear in both domains. By exploiting the low-rank structure of self-attention (Linformer) and a pattern of local attention within a fixed window combined with a small amount of global attention (Longformer), the quadratic complexity of full self-attention in the temporal and spatial domains is reduced to linear, which drastically lowers memory usage and training/inference time. Transformer-based models have been shown in the literature to achieve results comparable to, and speed superior to, 3D-convolution-based models (namely I3D and SlowFast). Experiments show comparable results on the Kinetics-400 dataset. By comparing the pros and cons of the Longformer and the Linformer in the temporal and spatial domains, we conclude which model offers the best accuracy and/or inference speed.
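As a rough illustration of the Linformer-style mechanism described in the abstract, the sketch below shows single-head low-rank self-attention over a flattened sequence of video tokens: keys and values are projected along the sequence axis from length n down to a fixed rank k, so the attention map is n×k rather than n×n. The class name, dimensions, and projection rank are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LinformerSelfAttention(nn.Module):
    """Single-head Linformer-style attention (illustrative sketch).

    K and V are compressed from sequence length n to rank k, so the
    attention map has shape (n, k) instead of (n, n), giving linear
    rather than quadratic cost in the number of video tokens.
    """
    def __init__(self, dim, seq_len, k=64):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        # Low-rank projections along the sequence (token) dimension.
        self.proj_k = nn.Linear(seq_len, k, bias=False)
        self.proj_v = nn.Linear(seq_len, k, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):                 # x: (batch, n, dim)
        q = self.to_q(x)                  # (b, n, d)
        k = self.to_k(x)                  # (b, n, d)
        v = self.to_v(x)                  # (b, n, d)
        # Compress the sequence axis of K and V: (b, n, d) -> (b, k, d).
        k = self.proj_k(k.transpose(1, 2)).transpose(1, 2)
        v = self.proj_v(v.transpose(1, 2)).transpose(1, 2)
        attn = (q @ k.transpose(1, 2)) * self.scale   # (b, n, k)
        attn = attn.softmax(dim=-1)
        out = attn @ v                                # (b, n, d)
        return self.to_out(out)

# Example: 8 frames x 196 spatial patches flattened into one token sequence
# (hypothetical sizes chosen only for illustration).
tokens = torch.randn(2, 8 * 196, 256)
attn = LinformerSelfAttention(dim=256, seq_len=8 * 196, k=64)
print(attn(tokens).shape)  # torch.Size([2, 1568, 256])
```

A Longformer-style variant would instead restrict each token's attention to a fixed local window (plus a few global tokens), which likewise keeps the cost linear in the sequence length; the paper compares both choices in the spatial and temporal domains.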
