학술논문

Dynamic Masking Rate Schedules for MLM Pretraining

Document Type

Working Paper

Author

Ankner, Zachary; Saphra, Naomi; Blalock, Davis; Frankle, Jonathan; Leavitt, Matthew L.

Source

Subject

Computer Science - Computation and Language
Computer Science - Artificial Intelligence

Language

Abstract

Most works on transformers trained with the Masked Language Modeling (MLM) objective use the original BERT model's fixed masking rate of 15%. We propose to instead dynamically schedule the masking rate throughout training. We find that linearly decreasing the masking rate over the course of pretraining improves average GLUE accuracy by up to 0.46% and 0.25% in BERT-base and BERT-large, respectively, compared to fixed rate baselines. These gains come from exposure to both high and low masking rate regimes, providing benefits from both settings. Our results demonstrate that masking rate scheduling is a simple way to improve the quality of masked language models, achieving up to a 1.89x speedup in pretraining for BERT-base as well as a Pareto improvement for BERT-large.

Online Access

Open Access (Arxiv) Find it@PNU

이메일

부산대학교 도서관

Online Access

메일 발송