Academic Paper

Dual-Stream Transformer With Distribution Alignment for Visible-Infrared Person Re-Identification
Document Type
Periodical
Source
IEEE Transactions on Circuits and Systems for Video Technology (IEEE Trans. Circuits Syst. Video Technol.), 33(11):6764-6776, Nov. 2023
Subject
Components, Circuits, Devices and Systems
Communication, Networking and Broadcast Technologies
Computing and Processing
Signal Processing and Analysis
Transformers
Feature extraction
Task analysis
Training
Computer architecture
Measurement
Data mining
Person re-identification
visible-infrared
distribution alignment
cross-modality
dissimilarity space
Language
ISSN
1051-8215
1558-2205
Abstract
Visible-infrared person re-identification (VI-ReID) aims to match person images captured by visible and infrared cameras, and it suffers from severe cross-modality discrepancy and intra-modality variations. Existing approaches mainly use convolutional neural network (CNN)-based architectures to extract pedestrian features, which fail to capture the long-range dependencies within an image. In addition, previous works usually attempt to bridge the modality gap either by using adversarial learning to generate style-consistent images or by designing different feature-level metric learning constraints. However, few works consider the cross-modality disparity from the perspective of the overall distance distribution discrepancy. To address these problems, we design a pure Transformer-based Visible-Infrared (TransVI) network with a conventional two-stream structure, which can explicitly capture modality-specific representations and learn knowledge shared across modalities. Through the multi-head self-attention modules of the transformer, TransVI addresses the lack of global dependency modeling in CNN-based architectures and captures the long-range dependencies of pedestrian images. Furthermore, we introduce the Cross-Modality Dissimilarity-based Maximum Mean Discrepancy (CMD-MMD) constraint to handle the cross-modality discrepancy at the distance distribution level. Specifically, CMD-MMD leverages intra-modality distribution separability to guide inter-modality distribution separability learning, aligning the intra- and inter-modality pairwise distance distributions separately for within-class pairs and for between-class pairs. In this way, the intra- and inter-modality distance distributions become more similar, which significantly mitigates the cross-modality discrepancy and yields more modality-invariant representations. Extensive experimental results on two public VI-ReID datasets confirm that our proposed framework can achieve state-of-the-art performance.
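The distance-distribution alignment described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the CMD-MMD formula, so the Gaussian kernel, the biased MMD estimator, the `bandwidth` parameter, the use of only visible-visible pairs as the intra-modality reference, and all function names (`pairwise_dist`, `mmd2`, `cmd_mmd_sketch`) are assumptions made for illustration.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between rows of a (n, d) and rows of b (m, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between two 1-D samples.

    Assumption: a Gaussian kernel with a fixed bandwidth.
    """
    def k(u, v):
        return np.exp(-((u[:, None] - v[None, :]) ** 2) / (2.0 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

def cmd_mmd_sketch(feat_v, feat_i, labels_v, labels_i, bandwidth=1.0):
    """Hypothetical alignment term between intra- and inter-modality distances.

    Within-class and between-class distance sets are compared separately.
    Assumes every identity has at least two visible images in the batch.
    """
    d_vv = pairwise_dist(feat_v, feat_v)   # intra-modality (visible-visible)
    d_vi = pairwise_dist(feat_v, feat_i)   # inter-modality (visible-infrared)
    same_vv = labels_v[:, None] == labels_v[None, :]
    same_vi = labels_v[:, None] == labels_i[None, :]
    off_diag = ~np.eye(len(labels_v), dtype=bool)  # drop self-distances
    intra_pos, intra_neg = d_vv[same_vv & off_diag], d_vv[~same_vv]
    inter_pos, inter_neg = d_vi[same_vi], d_vi[~same_vi]
    return (mmd2(intra_pos, inter_pos, bandwidth)
            + mmd2(intra_neg, inter_neg, bandwidth))

# Toy batch: 4 identities, 2 images per identity in each modality.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 2)
feat_v = rng.normal(size=(8, 16))
feat_i = rng.normal(size=(8, 16)) + 0.5   # simulated modality shift
loss = cmd_mmd_sketch(feat_v, feat_i, labels, labels)
```

In actual training, a term like this would be computed on network features inside an automatic-differentiation framework and combined with the other ReID losses; the sketch only shows how the four distance sets are formed and compared.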