Academic Paper

DSAP: Dynamic Sparse Attention Perception Matcher for Accurate Local Feature Matching
Document Type
Periodical
Source
IEEE Transactions on Instrumentation and Measurement, vol. 73, pp. 1-16, 2024
Subject
Power, Energy and Industry Applications
Components, Circuits, Devices and Systems
Visualization
Transformers
Feature extraction
Sparse matrices
Correlation
Backpropagation
Vectors
Deep learning
dynamic attention perception
local feature matching
relative pose estimation
sparse attention
visual localization
Language
English
ISSN
0018-9456
1557-9662
Abstract
Local feature matching, which aims to establish matches between image pairs, is a pivotal component of multiple visual applications. While current transformer-based works exhibit remarkable performance, they mechanically alternate self- and cross-attention in a predetermined order without considering their prioritization, culminating in inadequate enhancement of visual descriptors. Moreover, when calculating attention matrices to integrate global context, current methods only explicitly model the correlation among the feature channels without taking their importance into account, leaving insufficient message propagation. In this work, we develop a dynamic sparse attention perception (DSAP) matcher to tackle the aforementioned issues. To resolve the first issue, DSAP presents a dynamic perception strategy (DPS) that enables the network to dynamically implement feature enhancement via modifying both forward and backward propagation. During forward propagation, DPS assigns a learnable perception score to each transformer layer and employs an exponential moving average (EMA) algorithm to calculate the current score. After that, DPS utilizes an indicator function to binarize the score, allowing DSAP to adaptively determine the appropriate utilization of self- or cross-attention at the current iteration. During backward propagation, DPS employs a gradient estimator that adjusts the gradient of perception scores, thus rendering them differentiable. To tackle the second issue, DSAP introduces a weighted sparse transformer (WSFormer) that recalibrates attention matrices by concurrently considering both channel importance and channel correlation. WSFormer predicts attention vectors to weight attention matrices while constructing multiple sparse attention matrices to integrate various global messages, thus highlighting informative channels and inhibiting redundant message propagation.
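The DPS mechanism in the abstract can be illustrated with a minimal, dependency-free sketch. The smoothing coefficient `alpha`, threshold `tau`, and all function names below are assumptions for illustration, not details from the paper; the gradient estimator is shown as a simple straight-through identity, which is one common choice for non-differentiable indicators.

```python
def ema_update(prev_score, raw_score, alpha=0.9):
    # Exponential moving average of the learnable perception score;
    # alpha is a hypothetical smoothing coefficient.
    return alpha * prev_score + (1.0 - alpha) * raw_score

def select_attention(score, tau=0.5):
    # Indicator function: binarize the smoothed score to decide whether
    # the current layer applies self- or cross-attention this iteration.
    return "cross" if score > tau else "self"

def straight_through(grad_out):
    # Gradient estimator: the hard indicator has zero gradient almost
    # everywhere, so backward treats it as the identity (straight-through),
    # keeping the perception scores trainable.
    return grad_out

# Toy loop: smooth incoming raw scores, then pick the attention type.
score = 0.3
for raw in (0.2, 0.6, 0.9, 0.95):
    score = ema_update(score, raw)
    choice = select_attention(score)
```

This only mimics the forward/backward behavior the abstract describes; the actual DSAP scores are learned parameters inside the network.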
Extensive experiments on public datasets and in real environments demonstrate that DSAP achieves exceptional performance across various downstream tasks, including relative pose estimation and visual localization. The code is available at https://github.com/mooncake199809/DSAP.
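The WSFormer idea of weighting an attention row by predicted channel importance and then sparsifying it can be sketched in a few lines. The top-k masking scheme, the function names, and the treatment of a single attention row (rather than full matrices) are illustrative assumptions, not the paper's implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sparse_attention(scores, channel_weights, k=2):
    # Reweight raw attention scores by a predicted importance vector,
    # then keep only the top-k entries (a sparse attention row) before
    # normalizing, so low-importance channels propagate no message.
    weighted = [s * w for s, w in zip(scores, channel_weights)]
    kth = sorted(weighted, reverse=True)[k - 1]
    masked = [v if v >= kth else float("-inf") for v in weighted]
    return softmax(masked)
```

For example, with uniform weights and `k=2`, the smallest of three scores is masked out and the remaining two are renormalized, which captures the "highlight informative channels, inhibit redundant propagation" behavior at toy scale.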