학술논문

Masked Vision Transformers for Hyperspectral Image Classification
Document Type
Conference
Source
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) CVPRW Computer Vision and Pattern Recognition Workshops (CVPRW), 2023 IEEE/CVF Conference on. :2166-2176 Jun, 2023
Subject
Computing and Processing
Engineering Profession
Solid modeling
Three-dimensional displays
Training data
Computer architecture
Transformers
Data models
Convolutional neural networks
Language
ISSN
2160-7516
Abstract
Transformer architectures have become state-of-the-art models in computer vision and natural language processing. To a significant degree, their success can be attributed to self-supervised pre-training on large scale unlabeled datasets. This work investigates the use of self-supervised masked image reconstruction to advance transformer models for hyperspectral remote sensing imagery. To facilitate self-supervised pre-training, we build a large dataset of unlabeled hyperspectral observations from the EnMAP satellite and systematically investigate modifications of the vision transformer architecture to optimally leverage the characteristics of hyperspectral data. We find significant improvements in accuracy on different land cover classification tasks over both standard vision and sequence transformers using (i) blockwise patch embeddings, (ii) spatialspectral self-attention, (iii) spectral positional embeddings and (iv) masked self-supervised pre-training 1 . The resulting model outperforms standard transformer architectures by +5% accuracy on a labeled subset of our EnMAP data and by +15% on Houston2018 hyperspectral dataset, making it competitive with a strong 3D convolutional neural network baseline. In an ablation study on label-efficiency based on the Houston2018 dataset, self-supervised pre-training significantly improves transformer accuracy when little labeled training data is available. The self-supervised model outperforms randomly initialized transformers and the 3D convolutional neural network by +7-8% when only 0.1-10% of the training labels are available.