Journal Article

Cross-Modal Learning for Event-Based Semantic Segmentation via Attention Soft Alignment
Document Type
Periodical
Author
Source
IEEE Robotics and Automation Letters, 9(3):2359-2366, Mar. 2024
Subject
Robotics and Control Systems
Computing and Processing
Components, Circuits, Devices and Systems
Semantic segmentation
Cameras
Image reconstruction
Task analysis
Streams
Training
Semantics
Deep learning for visual perception
transfer learning
semantic scene understanding
Language
English
ISSN
2377-3766
2377-3774
Abstract
Event cameras, which have demonstrated robustness in scenarios characterized by high-speed motion and extreme lighting changes, hold great potential for enhancing the perception reliability of autonomous driving systems. However, owing to the novelty of the modality and the sparsity of its data, progress in event-based algorithms is hindered by the scarcity of high-quality labeled datasets. In this letter, we propose CMESS (Cross-Modal learning for Event-based Semantic Segmentation), which eliminates the need for event labels by transferring a model from labeled image datasets (the source domain) to unlabeled event datasets (the target domain) via unsupervised domain adaptation (UDA). Whereas existing UDA methods require hard alignment of visually consistent embeddings, our approach achieves soft alignment via cross-attention and then augments it with knowledge distillation to convey fine-grained source knowledge to the target domain. Additionally, we introduce an event-driven bidirectional self-labeling method to generate weakly supervised signals for event-only datasets. These designs enable cross-modal learning without requiring per-pixel paired frames or online reconstruction. Experimental results show that our method outperforms existing state-of-the-art methods in both UDA and supervised settings on common evaluation benchmarks, making it a universal framework for further unlabeled event-related visual tasks.
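
Editor's note: the abstract describes soft alignment via cross-attention combined with knowledge distillation. The minimal PyTorch sketch below illustrates what such a scheme could look like; it is not the authors' implementation. The module names, tensor shapes, and the specific choice of nn.MultiheadAttention and a KL-based distillation loss are assumptions made for illustration only.

# Hypothetical sketch: event (target) features query image (source) features,
# so alignment is an attention-weighted soft mixture rather than a hard
# one-to-one embedding match; a soft-label KD loss then transfers knowledge
# from a source-trained teacher to the event branch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionAlign(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        # Event tokens are queries; image tokens supply keys and values.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, event_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        # event_feat, image_feat: (B, N, C) token sequences from the two encoders
        aligned, _ = self.attn(query=event_feat, key=image_feat, value=image_feat)
        return aligned

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      T: float = 2.0) -> torch.Tensor:
    # Soft-label KD: the event-branch student mimics the softened
    # predictions of the image-trained teacher.
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

# Usage (illustrative shapes): align 196 event tokens of width 256 against
# image tokens, then distill per-pixel class logits from the teacher head.
# align = CrossAttentionAlign(dim=256)
# aligned = align(event_tokens, image_tokens)          # (B, 196, 256)
# loss = distillation_loss(student_head(aligned), teacher_logits.detach())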