Journal Article

Visualization Comparison of Vision Transformers and Convolutional Neural Networks
Document Type
Periodical
Source
IEEE Transactions on Multimedia, vol. 26, pp. 2327-2339, 2024
Subject
Components, Circuits, Devices and Systems
Communication, Networking and Broadcast Technologies
Computing and Processing
General Topics for Engineers
Visualization
Transformers
Optimization
Image reconstruction
Neural networks
Task analysis
Minimization
Vision Transformer
convolutional neural network
feature representation
optimization visualization
Language
English
ISSN
1520-9210 (Print)
1941-0077 (Electronic)
Abstract
Recent research has demonstrated that Vision Transformers (ViTs) can match or even surpass the performance of convolutional neural network (CNN) baselines. While the differences in their structural designs are obvious, our understanding of the differences in their feature representations remains limited. In this work, we propose several techniques to achieve high-quality visualization of representations in ViTs. Both qualitative and quantitative experiments show that our technical improvements yield noticeably better ViT visualization quality than previous studies. Furthermore, we conduct visualizations to explore the disparities between ViTs and CNNs pre-trained on ImageNet1K, revealing three intriguing properties of ViTs: a) ViT feature propagation preserves image detail with minimal loss, whereas CNNs discard most image details in favor of class discrimination. b) Unlike in CNNs, object-related features do not emerge in the higher layers of ViTs, suggesting that class-discriminative features may not be required for ViT classification. c) Our visualization-assisted texture-bias experiment reveals that both ViTs and CNNs exhibit texture bias, with ViTs appearing more biased towards local textures.
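For context, feature visualization of the kind described in the abstract is typically done by gradient-based activation maximization: an input image is optimized so that a chosen internal feature responds strongly. The sketch below illustrates that general idea for a ViT; it is not the authors' method, and the timm model name, block index, feature channel, step count, and learning rate are all illustrative assumptions.

    # Minimal sketch of activation maximization for a ViT (assumptions noted above).
    import torch
    import timm

    model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
    for p in model.parameters():
        p.requires_grad_(False)  # only the input image is optimized

    layer_idx = 6  # assumed: which transformer block to visualize
    feat = {}

    def hook(module, inputs, output):
        feat["act"] = output  # block output: (batch, tokens, dim)

    handle = model.blocks[layer_idx].register_forward_hook(hook)

    # Start from random noise and ascend the gradient of a chosen feature
    # channel's mean activation over all patch tokens (cls token skipped).
    img = torch.randn(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([img], lr=0.05)
    channel = 100  # assumed feature dimension to maximize

    for step in range(256):
        opt.zero_grad()
        model(img)
        loss = -feat["act"][0, 1:, channel].mean()  # negate for ascent
        loss.backward()
        opt.step()

    handle.remove()

In practice, such naive optimization tends to produce high-frequency noise, so visualization pipelines add regularizers such as input jitter or total-variation penalties; improving this kind of pipeline specifically for ViTs is the problem the paper addresses.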