Academic Journal Article

Multitask Fine-Tuning for Passage Re-Ranking Using BM25 and Pseudo Relevance Feedback
Document Type
Periodical
Author
Source
IEEE Access, vol. 10, pp. 54254-54262, 2022
Subject
Aerospace
Bioengineering
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Engineered Materials, Dielectrics and Plasmas
Engineering Profession
Fields, Waves and Electromagnetics
General Topics for Engineers
Geoscience
Nuclear Engineering
Photonics and Electrooptics
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Transportation
Task analysis
Training
Computational modeling
Information retrieval
Data models
Semantics
Neural networks
passage ranking
pre-trained language model
self-supervised learning
Language
English
ISSN
2169-3536
Abstract
Passage re-ranking is a machine learning task that estimates relevance scores between a given query and candidate passages. Keyword features based on lexical similarity between queries and passages have traditionally been used in passage re-ranking models. However, such approaches have a limitation: they cannot capture semantic and contextual features beyond word-matching information. Recently, several studies based on neural pre-trained language models such as BERT have overcome the limitations of traditional keyword-based models and shown significant performance improvements. Such ranking models capture the contextual features of queries and documents better than traditional keyword-based methods. However, these deep learning-based models require large amounts of training data, which is usually labeled manually at high cost, so using it efficiently is an important issue. This paper proposes a fine-tuning method for efficient training of a neural re-ranking model. The proposed model uses data augmentation by learning the ranking and masked language modeling (MLM) tasks simultaneously during fine-tuning. For the MLM task, different parts of a passage are masked at each training epoch, so even when only one query-passage pair is given, the model is exposed to diverse cases derived from dynamically masked versions of that passage. In addition, the model is trained on a probability distribution of term importance. We calculate term importance weights with two novel methods, one based on BM25 and one based on pseudo relevance feedback; terms are sampled and masked according to these weights, so the ranking model learns representations that reflect the term weight distribution while executing the MLM task. The pseudo-relevance-feedback method, in particular, enables the neural ranking model to form representations according to feedback from an initial retrieval stage. The proposed model is trained with data from the MS MARCO re-ranking leaderboard. Our model achieves the state-of-the-art MRR@10 score on the leaderboard among non-ensemble methods. In addition, it performs strongly on three evaluation metrics: MRR@10, Mean Rank, and Hit@(5,10,20,50).
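
The central mechanism the abstract describes, masking tokens in proportion to a term importance distribution instead of uniformly at random, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: all function names are hypothetical, whitespace tokenization stands in for BERT subword tokenization, and the BM25 parameters (k1, b) and mask ratio are conventional defaults rather than values taken from the paper.

import math
import random
from collections import Counter

def bm25_term_weights(passage_tokens, doc_freq, num_docs, avg_len, k1=1.2, b=0.75):
    # BM25-style importance weight for each token in the passage.
    # doc_freq maps a term to its document frequency in the collection;
    # num_docs and avg_len describe the collection (illustrative inputs).
    tf = Counter(passage_tokens)
    dl = len(passage_tokens)
    weights = []
    for tok in passage_tokens:
        df = doc_freq.get(tok, 0)
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        score = idf * tf[tok] * (k1 + 1) / (tf[tok] + k1 * (1 - b + b * dl / avg_len))
        weights.append(max(score, 1e-6))
    total = sum(weights)
    return [w / total for w in weights]

def prf_term_weights(passage_tokens, feedback_passages):
    # Pseudo-relevance-feedback weighting: terms that appear often in the
    # top-ranked passages of an initial search are treated as more important.
    feedback_tf = Counter(tok for p in feedback_passages for tok in p)
    weights = [feedback_tf.get(tok, 0) + 1e-6 for tok in passage_tokens]
    total = sum(weights)
    return [w / total for w in weights]

def mask_by_importance(tokens, probs, mask_ratio=0.15, mask_token="[MASK]"):
    # Sample positions according to the importance distribution, so
    # informative terms are masked more often than uniform masking would.
    # Resampling each epoch yields a different masked view of the passage.
    n_mask = max(1, int(len(tokens) * mask_ratio))
    positions = random.choices(range(len(tokens)), weights=probs, k=n_mask)
    masked = list(tokens)
    for pos in set(positions):
        masked[pos] = mask_token
    return masked

For example, calling mask_by_importance(tokens, bm25_term_weights(tokens, doc_freq, num_docs, avg_len)) once per epoch produces a fresh masked passage for the MLM objective, which is the data-augmentation effect the abstract attributes to dynamic masking; in the actual model this step would feed a BERT-style encoder trained jointly on the ranking loss.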