Academic article

Empowering lightweight video transformer via the kernel learning
Document Type
article
Source
Electronics Letters, Vol 60, Iss 9, Pp n/a-n/a (2024)
Subject
artificial intelligence
multimedia computing
video signal processing
Electrical engineering. Electronics. Nuclear engineering
TK1-9971
Language
English
ISSN
1350-911X
0013-5194
Abstract
Video transformers achieve superior performance in video recognition, but they still require substantial computation and memory resources. To improve computational efficiency, a kernel-based video transformer is proposed, comprising: (1) a new formulation of the video transformer via kernel learning, presented to better explain its individual components; (2) a lightweight kernel-based spatial–temporal multi-head self-attention block that learns a compact joint spatial–temporal video feature; (3) an adaptive-score position embedding method that improves the flexibility of the video transformer. Experimental results on several action recognition datasets demonstrate the effectiveness of the proposed method. Pretrained only on ImageNet-1K, the method achieves a preferable balance between computation and accuracy while requiring 7× fewer parameters and 13× fewer floating-point operations than comparable methods.
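The abstract does not give the paper's exact kernel formulation, but the general idea behind kernel-based self-attention can be sketched as follows: the softmax similarity is replaced by an inner product of kernel feature maps, which lets the attention be computed in linear rather than quadratic time in the sequence length. The sketch below uses the common `elu(x) + 1` feature map as an illustrative assumption; the paper's actual learned kernel and spatial–temporal block will differ.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a common positive feature map for kernelized attention
    # (an illustrative choice, not necessarily the paper's learned kernel)
    return np.where(x > 0, x + 1.0, np.exp(x))

def kernel_attention(Q, K, V):
    # Standard attention computes softmax(Q K^T) V in O(N^2 d).
    # Kernelized attention approximates it as
    #   phi(Q) (phi(K)^T V) / (phi(Q) sum_j phi(K_j)),
    # which costs O(N d^2) -- the source of the efficiency gain.
    phi_q = feature_map(Q)              # (N, d)
    phi_k = feature_map(K)              # (N, d)
    kv = phi_k.T @ V                    # (d, d_v), aggregated once
    z = phi_q @ phi_k.sum(axis=0)       # (N,) per-query normalizer
    return (phi_q @ kv) / z[:, None]    # (N, d_v)

rng = np.random.default_rng(0)
N, d = 8, 4                             # e.g. 8 spatial-temporal tokens
Q, K, V = rng.normal(size=(3, N, d))
out = kernel_attention(Q, K, V)
print(out.shape)                        # (8, 4)
```

Because the `(d, d_v)` summary `kv` is shared across all queries, the per-token cost no longer grows with the number of video tokens, which is why kernel formulations suit lightweight video transformers.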