Academic Paper

Data Ambiguity Profiling for the Generation of Training Examples
Document Type
Conference
Source
2023 IEEE 39th International Conference on Data Engineering (ICDE), pp. 450-463, Apr. 2023
Subject
Computing and Processing
Training
Deep learning
COVID-19
Structured Query Language
Correlation
Scalability
Natural languages
data to text
example generation
tabular natural language inference
NLP
data ambiguity
Language
ISSN
2375-026X
Abstract
Several applications, such as text-to-SQL and computational fact checking, exploit the relationship between relational data and natural language text. However, state-of-the-art solutions fail to handle "data ambiguity", i.e., the case in which there are multiple interpretations of the relationship between text and data. Given the ambiguity of language, text can be mapped to different subsets of data, but existing training corpora contain only examples in which every sentence or question is annotated precisely w.r.t. the relation. This unrealistic assumption leaves the target applications unable to handle ambiguous cases. To tackle this problem, we present an end-to-end solution that, given a table D, generates examples consisting of text, annotated with its data evidence, with factual ambiguities w.r.t. D. We formulate the problem of profiling relational tables to identify row and attribute data ambiguity. For the latter, we propose a deep learning method that identifies every pair of data-ambiguous attributes and a label that describes both columns. Such metadata is then used to generate examples with data ambiguities for any input table. To enable scalability, we finally introduce a SQL approach that can generate millions of examples in seconds. We show the high accuracy of our solution in profiling relational tables and report on how our automatically generated examples lead to drastic quality improvements in two fact-checking applications, including a website with thousands of users, and in a text-to-SQL system.
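To make the notion of attribute data ambiguity concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical table `players` whose columns `fg_pct` and `three_pct` have already been profiled as an ambiguous pair sharing the label "shooting percentage"; the table, the column names, the label, and the function `generate_ambiguous_examples` are all illustrative assumptions. A single SQL self-join selects row pairs on which the two attributes disagree, so a sentence written with the shared label is true under one reading and false under the other.

```python
import sqlite3


def generate_ambiguous_examples(conn, table, key, attr_a, attr_b, label):
    """Return (sentence, evidence) pairs for rows where attr_a and attr_b
    rank the two rows in opposite order, so the sentence is ambiguous.

    Identifiers are interpolated directly into the SQL text, so they are
    assumed to be trusted names (hypothetical helper, illustration only).
    """
    query = f"""
        SELECT r1.{key}, r2.{key},
               r1.{attr_a}, r2.{attr_a},
               r1.{attr_b}, r2.{attr_b}
        FROM {table} AS r1
        JOIN {table} AS r2 ON r1.{key} < r2.{key}
        WHERE (r1.{attr_a} > r2.{attr_a} AND r1.{attr_b} < r2.{attr_b})
           OR (r1.{attr_a} < r2.{attr_a} AND r1.{attr_b} > r2.{attr_b})
        ORDER BY r1.{key}, r2.{key}
    """
    examples = []
    for k1, k2, a1, a2, b1, b2 in conn.execute(query):
        # The shared label hides which attribute the sentence refers to.
        sentence = f"{k1} has a higher {label} than {k2}."
        # Data evidence: the cell values under both possible readings.
        evidence = {attr_a: (a1, a2), attr_b: (b1, b2)}
        examples.append((sentence, evidence))
    return examples


# Hypothetical toy data, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE players (name TEXT PRIMARY KEY, fg_pct REAL, three_pct REAL)"
)
conn.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [("Carter", 56.0, 33.0), ("Smith", 50.0, 40.0), ("Lee", 45.0, 30.0)],
)

examples = generate_ambiguous_examples(
    conn, "players", "name", "fg_pct", "three_pct", "shooting percentage"
)
for sentence, evidence in examples:
    print(sentence, evidence)
```

On this toy table only the pair (Carter, Smith) qualifies: Carter leads on `fg_pct` but trails on `three_pct`, so the generated sentence cannot be verified without resolving which column "shooting percentage" refers to. Pushing the pair selection into the SQL engine, rather than looping in application code, is one plausible reading of why a SQL-based generator can scale to large tables.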