Journal Article

Multimodal Supervised Contrastive Learning in Remote Sensing Downstream Tasks
Document Type
Periodical
Source
IEEE Geoscience and Remote Sensing Letters, vol. 21, pp. 1-5, 2024
Subject
Geoscience
Power, Energy and Industry Applications
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Signal Processing and Analysis
Task analysis
Self-supervised learning
Remote sensing
Training
Sensors
Optical sensors
Image sensors
Deep learning
multimodality
remote sensing
scene classification
supervised contrastive (SupCon) learning
Language
English
ISSN
1545-598X (print)
1558-0571 (electronic)
Abstract
To leverage the large amount of unlabeled data available in remote sensing datasets, self-supervised learning (SSL) methods have recently emerged as a ubiquitous tool for pretraining robust image encoders from unlabeled images. However, when used in a downstream setting, these models often need to be fine-tuned for a specific task after pretraining. This fine-tuning still requires label information to train a classifier on top of the encoder while also updating the encoder weights. In this letter, we investigate the specific task of multimodal scene classification, where a sample is composed of multiple views from multiple heterogeneous satellite sensors. We propose a method to improve the categorical cross-entropy fine-tuning process that is often used to specialize the model for this downstream task. Our approach, based on supervised contrastive (SupCon) learning, uses the available label information to train an image encoder in a contrastive manner from multiple modalities while also training the task-specific classifier online. Such a multimodal SupCon loss helps better align the representations of samples that come from different sensors but share the same class label, thus improving the performance of the fine-tuning process. Our experiments on two public datasets, DFC2020 and Meter-ML, with Sentinel-1/Sentinel-2 images show a significant gain over the baseline multimodal cross-entropy loss. All code and datasets will be made publicly available for reproducibility upon acceptance.
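
For readers wanting a concrete picture of the loss described in the abstract, the following PyTorch sketch shows what a two-modality SupCon term might look like, following the standard SupCon formulation: embeddings with the same class label, from either sensor, are treated as positives. The function name and all implementation details are illustrative assumptions, not the authors' released code (which the abstract states will be made available upon acceptance).

    import torch
    import torch.nn.functional as F

    def multimodal_supcon_loss(z_s1, z_s2, labels, temperature=0.07):
        # Hypothetical two-modality SupCon loss.
        # z_s1, z_s2: (N, D) embeddings from the Sentinel-1 and Sentinel-2
        # encoders; labels: (N,) class labels shared by each cross-sensor pair.
        # All embeddings with the same label, from either modality, are positives.
        z = F.normalize(torch.cat([z_s1, z_s2], dim=0), dim=1)  # (2N, D), unit norm
        y = torch.cat([labels, labels], dim=0)                  # (2N,)

        sim = (z @ z.T) / temperature                           # pairwise similarities
        eye = torch.eye(len(y), dtype=torch.bool, device=z.device)
        sim = sim.masked_fill(eye, -1e9)                        # exclude self-pairs

        pos = (y.unsqueeze(0) == y.unsqueeze(1)) & ~eye         # positive-pair mask
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
        # Mean log-probability over each anchor's positives, averaged over anchors.
        return -((log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)).mean()

Since the abstract says the task-specific classifier is trained online alongside the contrastive objective, this term would presumably be combined with the categorical cross-entropy loss during fine-tuning, e.g. total = ce + lambda * supcon; the weighting and architecture are not specified in the abstract.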