Academic Article

Parameter-Efficient and Student-Friendly Knowledge Distillation
Document Type
Periodical
Source
IEEE Transactions on Multimedia, vol. 26, pp. 4230-4241, 2024
Subject
Components, Circuits, Devices and Systems
Communication, Networking and Broadcast Technologies
Computing and Processing
General Topics for Engineers
Training
Smoothing methods
Knowledge transfer
Data models
Adaptation models
Predictive models
Knowledge engineering
Knowledge distillation
parameter-efficient
image classification
Language
English
ISSN
1520-9210
1941-0077
Abstract
Pre-trained models are frequently employed in multimodal learning; however, they contain a large number of parameters and require substantial effort to fine-tune on downstream tasks. Knowledge distillation (KD) transfers knowledge from such a pre-trained teacher model to a smaller student through the teacher's soft labels, with the teacher's parameters fully (or partially) fixed during training. Recent studies show that this setting can hinder knowledge transfer because of the mismatch between teacher and student model capacities. To alleviate this mismatch, methods that smooth the teacher's soft labels have been proposed, including temperature adjustment, label smoothing, and teacher-student joint training (online distillation). However, these methods rarely explain why smoothed soft labels improve KD performance. The main contributions of this work are the discovery, analysis, and validation of the effect of smoothed soft labels, together with a less time-consuming, adaptive method for transferring the pre-trained teacher's knowledge, namely PESF-KD, which adaptively tunes the soft labels of the teacher network. Technically, we first formulate the mismatch mathematically as the sharpness gap between the teacher's and the student's predictive distributions and show that this gap can be narrowed by choosing an appropriate smoothness for the soft labels. We then introduce an adapter module for the teacher and update only the adapter to obtain soft labels with the appropriate smoothness. Experiments on various benchmarks in both computer vision and natural language processing show that PESF-KD significantly reduces the training cost while achieving results competitive with advanced online distillation methods.
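The abstract describes the core mechanism at a high level: the teacher backbone stays frozen, a small trainable adapter reshapes the teacher's outputs into smoother soft labels, and only the adapter (together with the student) is updated under a standard distillation loss. The sketch below illustrates that idea in PyTorch; it is a minimal illustration under assumed names and design choices (e.g., LogitAdapter, kd_loss, temperature, alpha), not the authors' implementation, which may place the adapter and compute the smoothness differently.

```python
# Minimal sketch (assumed structure; not the authors' implementation).
# Idea: freeze the teacher, attach a small trainable adapter that turns the
# teacher's logits into smoother soft labels, and update only the adapter
# and the student with a temperature-scaled KD loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LogitAdapter(nn.Module):
    """Tiny trainable module placed on top of the frozen teacher's logits."""

    def __init__(self, num_classes: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_classes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, teacher_logits: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the adapted logits close to the originals,
        # so the adapter mainly adjusts the sharpness of the soft labels.
        return teacher_logits + self.net(teacher_logits)


def kd_loss(student_logits, soft_logits, temperature: float = 4.0):
    """Standard KL-based distillation loss on temperature-scaled logits."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_soft = F.softmax(soft_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_soft, reduction="batchmean") * temperature ** 2


def train_step(teacher, student, adapter, optimizer, x, y, alpha: float = 0.5):
    """One update of the student and the adapter; the teacher stays frozen."""
    with torch.no_grad():
        teacher_logits = teacher(x)            # frozen teacher forward pass
    soft_logits = adapter(teacher_logits)      # adapter produces smoothed soft labels
    student_logits = student(x)
    loss = alpha * kd_loss(student_logits, soft_logits) \
        + (1 - alpha) * F.cross_entropy(student_logits, y)
    optimizer.zero_grad()
    loss.backward()                            # gradients reach student + adapter only
    optimizer.step()
    return loss.item()
```

In this sketch the optimizer would be built over the student's and the adapter's parameters only, e.g. torch.optim.SGD(list(student.parameters()) + list(adapter.parameters()), lr=0.01), so the teacher backbone never receives gradient updates; that restriction is what makes the scheme parameter-efficient relative to full online (joint) distillation.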