Academic Paper

Actor-Aware Self-Supervised Learning for Semi-Supervised Video Representation Learning
Document Type
Periodical
Source
IEEE Transactions on Circuits and Systems for Video Technology (IEEE Trans. Circuits Syst. Video Technol.). 33(11):6679-6692, Nov. 2023
Subject
Components, Circuits, Devices and Systems
Communication, Networking and Broadcast Technologies
Computing and Processing
Signal Processing and Analysis
Representation learning
Training
Task analysis
Sports
Semisupervised learning
Visualization
Smoothing methods
Action recognition
actor-aware pseudo-labeling
contrastive learning
inter-video background mixing
semi-supervised learning
Language
ISSN
1051-8215
1558-2205
Abstract
Self-supervised contrastive learning has significantly improved action recognition performance by discovering useful signals in unlabeled videos. However, the characteristics of existing video benchmark datasets cause learned video representations to be contextually biased toward dominant backgrounds and scene correlations, which ultimately leads to poor generalization in scene-invariant action recognition. We therefore propose Actor-aware Self-supervised Learning for Semi-supervised Video Representation Learning (ActorSL). We align localized actors with their corresponding scene information to encourage the model to learn discriminative regions and to reduce its dependency on the video background during contrastive training. Furthermore, we present an inter-video Background Mixing (iBM) augmentation strategy to introduce scene consistency into the model. For iBM, we patch together inter-video crops from four randomly selected frames to create a unique frame for each video. The patched frame is blended with the target video frames to generate a spatially augmented sample. The actor-scene aligned features and the features of iBM-augmented videos are then used to jointly optimize a contrastive loss and consistency regularization in a semi-supervised manner. Moreover, iBM combines the one-hot-encoded labels of the patches with the label of the target video as a label-smoothing regularizer that softens the decision boundaries of the semi-supervised model. Our experimental results show that ActorSL notably improves on current state-of-the-art semi-supervised methods on the Kinetics-400, UCF101, and HMDB51 datasets under a low-label regime. Code is released at https://github.com/Endarzboy/ActorSL.
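The iBM step described in the abstract (patching crops from four frames, blending the result with the target clip, and mixing the labels) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not give the details, so the following are all assumptions made for illustration: the 2x2 grid layout, half-size random crops, a linear blend with weight `alpha`, and the label-mixing rule (target one-hot weighted by `alpha`, mean of the patch one-hots weighted by `1 - alpha`). The function names `one_hot` and `ibm_augment` are hypothetical.

```python
import numpy as np


def one_hot(label, num_classes):
    """Return a float32 one-hot vector for an integer class label."""
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[label] = 1.0
    return vec


def ibm_augment(clip, frames, clip_label, frame_labels, num_classes,
                alpha=0.7, rng=None):
    """Illustrative inter-video background mixing (assumed details).

    clip:         (T, H, W, C) float array, the target video clip.
    frames:       four (H, W, C) frames taken from other videos.
    clip_label:   integer label of the target clip.
    frame_labels: four integer labels, one per frame in `frames`.
    alpha:        assumed blend weight given to the target clip.
    """
    rng = np.random.default_rng() if rng is None else rng
    _, H, W, C = clip.shape
    h, w = H // 2, W // 2  # assumed: each crop fills one cell of a 2x2 grid

    # Patch four random half-size crops into a single background frame.
    patched = np.zeros((2 * h, 2 * w, C), dtype=clip.dtype)
    for i, frame in enumerate(frames):
        top = rng.integers(0, H - h + 1)
        left = rng.integers(0, W - w + 1)
        row, col = divmod(i, 2)
        patched[row * h:(row + 1) * h, col * w:(col + 1) * w] = \
            frame[top:top + h, left:left + w]

    # Blend the patched frame into every frame of the target clip
    # (assumed: plain linear interpolation, broadcast over time).
    mixed = alpha * clip[:, :2 * h, :2 * w] + (1.0 - alpha) * patched[None]

    # Soft label: target one-hot mixed with the averaged patch one-hots.
    patch_label = np.mean([one_hot(l, num_classes) for l in frame_labels],
                          axis=0)
    soft_label = alpha * one_hot(clip_label, num_classes) \
        + (1.0 - alpha) * patch_label
    return mixed, soft_label


# Usage: an 8-frame 32x32 RGB clip mixed with four frames from other videos.
clip = np.ones((8, 32, 32, 3), dtype=np.float32)
frames = [np.zeros((32, 32, 3), dtype=np.float32) for _ in range(4)]
mixed, soft_label = ibm_augment(clip, frames, clip_label=2,
                                frame_labels=[0, 1, 1, 3], num_classes=5,
                                rng=np.random.default_rng(0))
```

Under these assumptions the soft label still sums to one, so it can be used directly as a smoothed target for a cross-entropy loss.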