Academic Paper

DIML: Deep Interpretable Metric Learning via Structural Matching
Document Type
Periodical
Source
IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE Trans. Pattern Anal. Mach. Intell.). 46(4):2518-2532, Apr. 2024
Subject
Computing and Processing
Bioengineering
Measurement
Transformers
Learning systems
Visualization
Computer architecture
Task analysis
Computational modeling
Distance metric learning
interpretable AI
visual recognition
Language
ISSN
0162-8828
2160-9292
1939-3539
Abstract
In this paper, we present a new framework named DIML to achieve more interpretable deep metric learning. Unlike traditional deep metric learning methods that simply produce a global similarity given two images, DIML computes the overall similarity as the weighted sum of multiple local part-wise similarities, making it easier for humans to understand how the model distinguishes two images. Specifically, we propose a structural matching strategy that explicitly aligns the spatial embeddings by computing an optimal matching flow between the feature maps of the two images. We also devise a multi-scale matching strategy, which considers both global and local similarities and can significantly reduce the computational cost of image retrieval. To handle view variance in complicated scenarios, we propose to use cross-correlation as the marginal distribution of the optimal transport, leveraging semantic information to locate the important regions in the images. Our framework is model-agnostic and can be applied to off-the-shelf backbone networks and metric learning methods. To extend DIML to more advanced architectures such as vision Transformers (ViTs), we further propose truncated attention rollout and partial similarity to overcome the lack of locality in ViTs. We evaluate our method on three major deep metric learning benchmarks, CUB200-2011, Cars196, and Stanford Online Products, and achieve substantial improvements over popular metric learning methods with better interpretability.
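The structural matching idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not name a solver, so entropic-regularized Sinkhorn iterations are assumed here, and the marginal formula (each location's correlation with the other image's mean feature, clamped at zero) is only one plausible reading of "cross-correlation as the marginal distribution". All function names and the `reg` and `n_iters` parameters are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row (one spatial location's embedding) to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def cross_correlation_marginals(a, b, eps=1e-6):
    """Assumed marginals: correlate each location with the other image's mean feature.

    a: (n, d) and b: (m, d), both L2-normalized. Negative correlations are
    clamped to zero; eps keeps every weight strictly positive.
    """
    mu = np.maximum(a @ b.mean(axis=0), 0.0) + eps
    nu = np.maximum(b @ a.mean(axis=0), 0.0) + eps
    return mu / mu.sum(), nu / nu.sum()


def sinkhorn(cost, mu, nu, reg=0.05, n_iters=200):
    """Entropic-regularized optimal transport (assumed solver).

    Returns a non-negative (n, m) flow whose row sums equal mu and whose
    column sums approach nu as the iterations converge.
    """
    kernel = np.exp(-cost / reg)
    u = np.ones_like(mu)
    for _ in range(n_iters):
        v = nu / (kernel.T @ u)
        u = mu / (kernel @ v)
    return u[:, None] * kernel * v[None, :]


def structural_similarity(feat_a, feat_b, reg=0.05, n_iters=200):
    """Overall similarity as a flow-weighted sum of part-wise cosine similarities."""
    a = l2_normalize(np.asarray(feat_a, dtype=float))
    b = l2_normalize(np.asarray(feat_b, dtype=float))
    sim = a @ b.T                       # (n, m) local cosine similarities
    mu, nu = cross_correlation_marginals(a, b)
    flow = sinkhorn(1.0 - sim, mu, nu, reg, n_iters)
    return float((flow * sim).sum()), flow
```

Under these assumptions, a backbone's spatial feature map of shape `(H, W, d)` would be flattened to `(H * W, d)` before calling `structural_similarity`. The returned `flow` matrix is what makes the score inspectable: entry `(i, j)` is the weight given to the similarity between location `i` of the first image and location `j` of the second.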