Academic Paper

Pre-training Fine-tuning data Enhancement method based on active learning
Document Type
Conference
Source
2022 IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), pp. 1447-1454, Dec. 2022
Subject
Communication, Networking and Broadcast Technologies
Computing and Processing
Training
Analytical models
Costs
Annotations
Training data
Predictive models
Data models
Active learning
Clustering analysis
Pre-training model
Natural Language Processing
Language
ISSN
2324-9013
Abstract
With the development of Internet technology, the number of Internet users has grown rapidly, and a very large amount of data is generated on the Internet every day. Advances in storage and query technology make it easy to collect massive data, but the information value of these data is uneven, and most of them are unlabeled. Traditional supervised learning, however, requires large numbers of labeled samples. For large unlabeled collections, effective automatic labeling methods are lacking and manual labeling is costly. If simple random sampling is used to choose samples for annotation, noisy samples may be selected and resources wasted, and low-quality training data can also reduce the prediction accuracy of the model. Meanwhile, traditional deep learning methods train poorly on small labeled training sets. This paper takes text sentiment analysis in natural language processing as its task and uses IMDB movie review data as the training and test sets. It designs an active learning algorithm based on clustering analysis and combines it with a suitable pre-training and fine-tuning model to construct a data enhancement method based on active learning. Experiments show that when the labeled training set is reduced by 90%, the prediction accuracy of the pre-trained model drops by no more than 2%, which verifies the effectiveness of combining active learning with the pre-trained model.
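The abstract does not say which clustering algorithm, text features, or selection rule the authors use. As an illustration only, the general idea of clustering-based sample selection for annotation can be sketched as follows, assuming k-means clustering over precomputed text embeddings and picking the sample nearest each cluster centroid. The function names and the toy 2-D vectors are hypothetical and are not taken from the paper.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means (an assumed choice; the paper's clustering may differ)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assign(points, centroids)


def select_for_labeling(points, k, seed=0):
    """Pick one representative per cluster: the point nearest its centroid."""
    centroids, labels = kmeans(points, k, seed=seed)
    selected = []
    for c in range(k):
        member_ids = [i for i, lab in enumerate(labels) if lab == c]
        if member_ids:
            selected.append(
                min(member_ids, key=lambda i: squared_distance(points[i], centroids[c]))
            )
    return sorted(selected)


# Toy stand-in for review embeddings: two well-separated groups of 2-D vectors.
embeddings = [
    (0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
    (5.0, 5.0), (5.2, 4.9), (4.9, 5.3),
]
to_label = select_for_labeling(embeddings, k=2)
print(to_label)  # indices of the samples that would be sent for annotation
```

In a pipeline of the kind the abstract describes, the selected indices would be annotated and then used to fine-tune a pre-trained language model. The aim of choosing one representative per cluster is to cover the data distribution with far fewer labels than random sampling would need.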