Academic Paper

MultiEM: Efficient and Effective Unsupervised Multi-Table Entity Matching
Document Type
Conference
Source
2024 IEEE 40th International Conference on Data Engineering (ICDE), pp. 3421-3434, May 2024
Subject
Computing and Processing
Merging
Pipelines
Distributed databases
Self-supervised learning
Benchmark testing
Data engineering
Labeling
Entity Matching
Data Integration
Language
ISSN
2375-026X
Abstract
Entity Matching (EM), which aims to identify all pairs of records that refer to the same real-world entity across relational tables, is one of the most important tasks in real-world data management systems. Because labeling data for EM is extremely labor-intensive, unsupervised EM is more applicable than supervised EM in practical scenarios. Traditional unsupervised EM assumes that all entities come from two tables; in practical applications, however, it is more common to match entities from multiple tables, that is, multi-table entity matching (multi-table EM). Unfortunately, effective and efficient unsupervised multi-table EM remains under-explored. To fill this gap, this paper formally studies the problem of unsupervised multi-table entity matching and proposes an effective and efficient solution, termed MultiEM. MultiEM is a parallelizable pipeline of enhanced entity representation, table-wise hierarchical merging, and density-based pruning. Extensive experimental results on six real-world benchmark datasets demonstrate the superiority of MultiEM in terms of effectiveness and efficiency.
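
The three stages named in the abstract (representation, table-wise hierarchical merging, pruning) can be illustrated with a toy sketch. This is not the authors' implementation, and the abstract does not specify any of the details below. As assumptions for illustration only: character-trigram sets with Jaccard similarity stand in for the enhanced entity representation, a mutual-best-match rule stands in for the merging criterion, and an average-similarity filter stands in for density-based pruning. All function names and thresholds are hypothetical.

```python
def trigrams(text):
    """Toy representation (assumption): the set of lowercase character trigrams."""
    s = text.lower()
    return frozenset(s[i:i + 3] for i in range(len(s) - 2))


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_sim(c1, c2):
    """Best pairwise record similarity between two clusters."""
    return max(jaccard(trigrams(x), trigrams(y)) for x in c1 for y in c2)


def merge_two(left, right, threshold=0.5):
    """Merge two tables of clusters; join mutual best matches above threshold."""
    if not left or not right:
        return left + right
    sims = [[cluster_sim(a, b) for b in right] for a in left]
    merged, used_right = [], set()
    for i, a in enumerate(left):
        j = max(range(len(right)), key=lambda k: sims[i][k])
        mutual = max(range(len(left)), key=lambda k: sims[k][j]) == i
        if mutual and sims[i][j] >= threshold:
            merged.append(a + right[j])
            used_right.add(j)
        else:
            merged.append(a)
    merged.extend(b for j, b in enumerate(right) if j not in used_right)
    return merged


def hierarchical_merge(tables, threshold=0.5):
    """Merge tables pairwise, level by level, until one table of clusters remains."""
    level = [[[rec] for rec in table] for table in tables]
    while len(level) > 1:
        nxt = [merge_two(level[k], level[k + 1], threshold)
               for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an odd table out is carried to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0] if level else []


def prune(cluster, min_avg_sim=0.3):
    """Drop records whose average similarity to the rest of the cluster is low."""
    if len(cluster) < 3:
        return cluster
    keep = []
    for i, x in enumerate(cluster):
        others = [y for j, y in enumerate(cluster) if j != i]
        avg = sum(jaccard(trigrams(x), trigrams(y)) for y in others) / len(others)
        if avg >= min_avg_sim:
            keep.append(x)
    return keep


def match_tables(tables, threshold=0.5, min_avg_sim=0.3):
    """Run the toy pipeline and return clusters with at least two records."""
    pruned = [prune(c, min_avg_sim) for c in hierarchical_merge(tables, threshold)]
    return [c for c in pruned if len(c) >= 2]


if __name__ == "__main__":
    tables = [
        ["Apple iPhone 13", "Galaxy S21"],
        ["apple iphone 13", "Sony WH-1000XM4"],
        ["APPLE IPHONE 13", "galaxy s21"],
    ]
    for cluster in match_tables(tables):
        print(cluster)
```

The pairwise merges within one level are independent of each other, which is one plausible reading of why such a pipeline can be parallelized; the sketch runs them sequentially for simplicity.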