Scholarly Article

Neural Tree Decoder for Interpretation of Vision Transformers
Document Type
Periodical
Author
Source
IEEE Transactions on Artificial Intelligence, 5(5):2067–2078, May 2024
Subject
Computing and Processing
Visualization
Transformers
Decision trees
Decision making
Task analysis
Decoding
Computational modeling
explainable artificial intelligence
interpretable artificial intelligence
machine learning
visualization of decision and learning models
Language
English
ISSN
2691-4581
Abstract
In this study, we propose a novel vision transformer neural tree decoder (ViT-NeT) that is interpretable and highly accurate for fine-grained visual categorization (FGVC). A ViT acts as the backbone, and to overcome its limitations, the output contextual image patches are fed to the proposed NeT. NeT aims to classify fine-grained objects more accurately by exploiting their similar inter-class correlations and differing intra-class correlations. ViT-NeT can also describe its decision-making process and visually interpret the results through tree structures and prototypes. Because the proposed ViT-NeT is designed not only to improve FGVC classification performance but also to provide human-friendly interpretation, it is effective in resolving the tradeoff between performance and interpretability. We compared the performance of ViT-NeT with other state-of-the-art (SoTA) methods on the widely used FGVC benchmark datasets CUB-200-2011, Stanford Dogs, Stanford Cars, NABirds, and iNaturalist. The proposed method shows promising quantitative and qualitative performance compared to previous SoTA methods, as well as excellent interpretability.