Academic Article

Randomness Regularization With Simple Consistency Training for Neural Networks

Document Type

Periodical

Author

Li, J.; Liang, X.; Wu, L.; Wang, Y.; Meng, Q.; Qin, T.; Zhang, M.; Liu, T.

Source

IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE Trans. Pattern Anal. Mach. Intell.), 46(8):5763-5778, Aug. 2024

Subject

Computing and Processing
Bioengineering
Training
Data models
Transformers
Mathematical models
Task analysis
Predictive models
Computational modeling
Randomness
regularization
consistency training
neural networks

Language

ISSN

0162-8828
2160-9292
1939-3539

Abstract

Randomness is widely introduced in neural network training to simplify model optimization or to avoid over-fitting. Among such techniques, dropout and its variants, applied to different aspects of training (e.g., data, model structure), are prevalent for regularizing deep neural networks. Although effective, the randomness introduced by these dropout-based methods causes non-negligible inconsistency between training and inference. In this paper, we introduce a simple consistency training strategy to regularize such randomness, namely R-Drop, which forces two output distributions sampled by each type of randomness to be consistent. Specifically, R-Drop minimizes the bidirectional KL-divergence between two output distributions produced by dropout-based randomness for each training sample. Theoretical analysis reveals that R-Drop can reduce the above inconsistency by reducing the inconsistency among the sampled sub-structures and by bridging the gap between the loss calculated by the full model and the loss calculated by the sub-structures. Experiments on 7 widely used deep learning tasks (23 datasets in total) demonstrate that R-Drop is universally effective for different types of neural networks (i.e., feed-forward, recurrent, and graph neural networks) and different learning paradigms (supervised, parameter-efficient, and semi-supervised). In particular, it achieves state-of-the-art performance with the vanilla Transformer model on WMT14 English → German translation (30.91 BLEU) and WMT14 English → French translation (43.95 BLEU), even surpassing models trained with extra large-scale data and expert-designed advanced variants of Transformer models.
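
The consistency term described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The toy linear classifier, the input-level dropout, the helper names (`forward`, `r_drop_loss`), and the weighting coefficient `alpha` are all assumptions made for illustration; the abstract states only that a bidirectional KL-divergence between two dropout-sampled output distributions is minimized for each training sample.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def symmetric_kl(p, q):
    """Bidirectional KL: the average of KL(p || q) and KL(q || p)."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def forward(features, weights, rate, rng):
    """Hypothetical toy model: dropout on the inputs, then a linear layer and softmax."""
    dropped = dropout(features, rate, rng)
    logits = [sum(w * x for w, x in zip(row, dropped)) for row in weights]
    return softmax(logits)


def r_drop_loss(features, label, weights, rate=0.3, alpha=1.0, rng=None):
    """Two stochastic forward passes on the same sample, plus a consistency penalty.

    Assumption: the two negative log-likelihood terms are summed and the
    symmetric KL term is weighted by `alpha`; the abstract does not specify this.
    """
    rng = rng or random.Random(0)
    p1 = forward(features, weights, rate, rng)  # first sampled sub-model
    p2 = forward(features, weights, rate, rng)  # second sampled sub-model
    nll = -(math.log(p1[label]) + math.log(p2[label]))
    return nll + alpha * symmetric_kl(p1, p2)


if __name__ == "__main__":
    weights = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]  # 2 classes, 3 features
    print(r_drop_loss([1.0, 2.0, 3.0], label=0, weights=weights))
```

With `rate=0.0` the two passes are identical, so the KL term vanishes and the loss reduces to twice the ordinary negative log-likelihood. With dropout enabled, the penalty grows as the two sampled sub-models disagree.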
