Academic Paper

NormSoftmax: Normalizing the Input of Softmax to Accelerate and Stabilize Training
Document Type
Conference
Source
2023 IEEE International Conference on Omni-layer Intelligent Systems (COINS), pp. 1-6, Jul. 2023
Subject
Communication, Networking and Broadcast Technologies
Computing and Processing
General Topics for Engineers
Robotics and Control Systems
Training
Neural networks
Machine learning
Transformers
Robustness
Stability analysis
Probability distribution
Language
English
Abstract
Softmax is a basic function that normalizes a vector into a probability distribution and is widely used in machine learning, most notably in the cross-entropy loss function and in dot-product attention operations. However, the optimization of softmax-based models is sensitive to changes in the input statistics. We observe that the input of softmax changes significantly during the initial training stage, causing slow and unstable convergence when training the model from scratch. To remedy the optimization difficulty of softmax, we propose a simple yet effective substitute, named NormSoftmax, in which the input vector is first normalized to unit variance and then fed to the standard softmax function. Like other normalization layers in machine learning models, NormSoftmax can stabilize and accelerate the training process and also increase the robustness of training against hyperparameter choices. Experiments on Transformer-based models and convolutional neural networks validate that the proposed NormSoftmax is an effective plug-and-play module for stabilizing and speeding up the optimization of neural networks with a cross-entropy loss or dot-product attention operations.
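As a rough illustration of the idea described in the abstract, the following PyTorch sketch normalizes the softmax input to unit variance before applying the standard softmax. The function name norm_softmax, the eps constant, the biased variance estimate, and the absence of mean-centering are assumptions made for illustration, not details taken from the paper.

import torch

def norm_softmax(x: torch.Tensor, dim: int = -1, eps: float = 1e-6) -> torch.Tensor:
    # Estimate the standard deviation of the logits along `dim` and rescale them
    # to (approximately) unit variance before applying the standard softmax.
    std = x.std(dim=dim, keepdim=True, unbiased=False)
    return torch.softmax(x / (std + eps), dim=dim)

# Hypothetical usage in dot-product attention (names and shapes are illustrative):
# scores = q @ k.transpose(-2, -1)      # raw attention logits
# attn = norm_softmax(scores, dim=-1)   # NormSoftmax in place of plain softmax

Used this way, the module is plug-and-play in the sense described above: a call to softmax over cross-entropy logits or attention scores can be swapped for the normalized variant without changing the surrounding architecture.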