
e-Article

Bi-Modal Transformer-Based Approach for Visual Question Answering in Remote Sensing Imagery
Document Type
Periodical
Source
IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1-11, 2022
Subject
Geoscience
Signal Processing and Analysis
Transformers
Task analysis
Remote sensing
Visualization
Feature extraction
Head
Computer vision
Co-attention
remote sensing
self-attention
transformers
vision-language models
visual question answering (VQA)
Language
English
ISSN
0196-2892 (print)
1558-0644 (electronic)
Abstract
Vision-language models based on transformers have recently gained popularity for the joint modeling of visual and textual modalities. In particular, they show impressive results when transferred to several downstream tasks, such as zero- and few-shot classification. In this article, we propose a visual question answering (VQA) approach for remote sensing images based on these models. The VQA task attempts to provide answers to image-related questions. While VQA has gained popularity in computer vision, it is not yet widespread in remote sensing. First, we use the contrastive language-image pretraining (CLIP) network to embed the image patches and question words into sequences of visual and textual representations. Then, we learn attention mechanisms to capture the intradependencies and interdependencies within and between these representations. Afterward, we generate the final answer by averaging the predictions of two classifiers mounted on top of the resulting contextual representations. In the experiments, we study the performance of the proposed approach on two datasets acquired with Sentinel-2 and aerial sensors. In particular, we demonstrate that our approach achieves better results with a reduced training set size compared with the recent state of the art.
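The pipeline described in the abstract (CLIP token embeddings, self-attention for intradependencies, cross-attention for interdependencies, and an averaged pair of classifiers) can be illustrated with a minimal PyTorch sketch. This is not the paper's exact architecture: the embedding dimension (512), number of heads, pooling strategy, answer vocabulary size, and all module names are illustrative assumptions, and random tensors stand in for the CLIP patch and word embeddings.

```python
# Hypothetical sketch of the bi-modal transformer VQA pipeline.
# Dimensions, module names, and the answer vocabulary size are assumptions.
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Self-attention (intradependencies) followed by cross-attention
    (interdependencies) over a pair of token sequences."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Attend within the modality, then to the other modality.
        x = self.norm1(x + self.self_attn(x, x, x)[0])
        x = self.norm2(x + self.cross_attn(x, context, context)[0])
        return x

class BiModalVQA(nn.Module):
    def __init__(self, dim: int = 512, num_answers: int = 100):
        super().__init__()
        self.visual_branch = CoAttentionBlock(dim)
        self.text_branch = CoAttentionBlock(dim)
        # Two classifiers on top of the contextual representations;
        # the final answer averages their predictions.
        self.visual_head = nn.Linear(dim, num_answers)
        self.text_head = nn.Linear(dim, num_answers)

    def forward(self, visual_tokens: torch.Tensor,
                text_tokens: torch.Tensor) -> torch.Tensor:
        v = self.visual_branch(visual_tokens, text_tokens)
        t = self.text_branch(text_tokens, visual_tokens)
        logits_v = self.visual_head(v.mean(dim=1))  # mean-pool over tokens
        logits_t = self.text_head(t.mean(dim=1))
        return (logits_v + logits_t) / 2

# Toy usage with CLIP ViT-B/32-like shapes: 49 image patches, 16 question
# words, both embedded in 512 dimensions (random stand-ins for CLIP output).
model = BiModalVQA()
img = torch.randn(2, 49, 512)
txt = torch.randn(2, 16, 512)
print(model(img, txt).shape)  # torch.Size([2, 100])
```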