Academic Article

Neighbor-Guided Pseudo-Label Generation and Refinement for Single-Frame Supervised Temporal Action Localization
Document Type
Periodical
Source
IEEE Transactions on Image Processing (IEEE Trans. on Image Process.). 33:2419-2430, 2024
Subject
Signal Processing and Analysis
Communication, Networking and Broadcast Technologies
Computing and Processing
Semantics
Videos
Location awareness
Predictive models
Annotations
Feature extraction
Transformers
Neighbor information
pseudo label generation
pseudo label refinement
single-frame temporal action localization
Language
ISSN
1057-7149
1941-0042
Abstract
Because single-frame annotations are sparse, current Single-Frame Temporal Action Localization (SF-TAL) methods generally rely on threshold-based pseudo-label generation strategies. However, these approaches use data inefficiently, as only the unlabeled frames whose confidence scores exceed a predefined threshold are selected for training. Moreover, the variability of single-frame annotations and unreliable model predictions introduce pseudo-label noise. To address these challenges, we propose two strategies that exploit the relationship between video segments and their neighbors: 1) temporal neighbor-guided soft pseudo-label generation (TNPG); and 2) semantic neighbor-guided pseudo-label refinement (SNPR). TNPG utilizes a local-global self-attention mechanism in a transformer encoder to capture temporal neighbor information while attending to the whole video. The resulting self-attention map is then multiplied by the network predictions to propagate information between labeled and unlabeled frames and to produce soft pseudo-labels for all segments. Even so, label noise persists due to unreliable model predictions. To mitigate this, SNPR refines pseudo-labels based on the assumption that a segment's prediction should resemble those of its semantic nearest neighbors. Specifically, we search for the semantic nearest neighbors of each video segment by cosine similarity in the feature space. The refined soft pseudo-labels are then obtained as a weighted combination of the original pseudo-label and those of the semantic nearest neighbors. Finally, the model is trained with the refined pseudo-labels, which greatly improves performance. Comprehensive experiments on different benchmarks show that we achieve state-of-the-art performance on the THUMOS14, ActivityNet1.2, and ActivityNet1.3 datasets.
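The two steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the row normalization of the attention map, the uniform averaging over neighbors, and the parameters `k` and `alpha` are all assumptions made for illustration, and the random attention map stands in for the output of the local-global transformer encoder, which the abstract does not specify in detail.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def propagate_soft_labels(attention, predictions):
    """TNPG-style step (sketch): multiply a (T, T) attention map by (T, C) predictions.

    Assumes a non-negative attention map with positive row sums. Rows are
    normalized here (an assumption) so each soft label is a convex mix of
    the per-segment predictions.
    """
    weights = attention / attention.sum(axis=1, keepdims=True)
    return weights @ predictions


def refine_with_semantic_neighbors(features, soft_labels, k=2, alpha=0.5):
    """SNPR-style step (sketch): blend each label with its k nearest neighbors' labels.

    Neighbors are found by cosine similarity in feature space. `k`, `alpha`,
    and the uniform neighbor average are hypothetical choices.
    """
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    similarity = unit @ unit.T
    np.fill_diagonal(similarity, -np.inf)  # a segment is not its own neighbor
    neighbors = np.argsort(-similarity, axis=1)[:, :k]  # (T, k) indices
    neighbor_mean = soft_labels[neighbors].mean(axis=1)  # (T, C)
    return alpha * soft_labels + (1.0 - alpha) * neighbor_mean


# Toy example: T segments, C action classes, D feature dimensions.
rng = np.random.default_rng(0)
T, C, D = 6, 3, 4
attention = softmax(rng.normal(size=(T, T)))    # stand-in for a self-attention map
predictions = softmax(rng.normal(size=(T, C)))  # stand-in for network predictions
features = rng.normal(size=(T, D))              # stand-in for segment features

soft = propagate_soft_labels(attention, predictions)
refined = refine_with_semantic_neighbors(features, soft, k=2, alpha=0.5)
print(refined.sum(axis=1))  # each row remains a distribution over classes
```

Because both steps are convex combinations of class distributions, every segment receives a soft label that still sums to one, in contrast to threshold-based schemes that discard low-confidence frames.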