Academic Paper

DataOps-4G: On Supporting Generalists in Data Quality Discovery
Document Type
Periodical
Source
IEEE Transactions on Knowledge and Data Engineering (IEEE Trans. Knowl. Data Eng.). 35(5):4668-4681, May 2023
Subject
Computing and Processing
Data integrity
Task analysis
Programming
Codes
Cleaning
Data models
Annotations
Data quality
generalist
dataops
log analysis
Language
ISSN
1041-4347
1558-2191
2326-3865
Abstract
Data preparation has become a necessary but labor- and resource-intensive step in data analytics. To date, such activities still require considerable manual effort from experts. In this paper, we focus on a specific data preparation activity, namely data quality discovery. We explore different settings in which data workers undertake data quality discovery tasks and the implications of those settings for their efficiency and effectiveness. To this end, we propose DataOps-4G, a data quality discovery platform for generalists that allows users to interact with data without writing code. We wrap pre-defined code snippets that implement useful data quality exploration functions and bundle them into so-called DataOps. We then conduct a lab-based user study to evaluate the DataOps-4G platform from two perspectives: (i) effectiveness, the accuracy of the outcomes achieved by participants; and (ii) efficiency, their effort and strategies in task completion. Our experimental results uncover how participants' task completion patterns and strategies affect effectiveness and efficiency. This opens up the possibility of popularizing data quality discovery by employing non-experts (e.g., from crowdsourcing platforms), consequently allowing experts to focus on more complex activities (e.g., building machine learning models).
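The idea of bundling pre-defined snippets into named DataOps that a user invokes without writing code can be illustrated with a minimal sketch. This is not the DataOps-4G implementation: the registry `DATAOPS`, the decorator `dataop`, the two example checks, and `SAMPLE_ROWS` are all hypothetical names chosen for illustration, and the sketch assumes tabular data held as a list of dictionaries.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical data model: each row maps a column name to a string or None.
Row = Dict[str, Optional[str]]
DataOp = Callable[[List[Row], str], dict]

# Hypothetical registry of pre-defined snippets, keyed by a user-facing name.
DATAOPS: Dict[str, DataOp] = {}


def dataop(name: str):
    """Register a function as a named DataOp (illustrative, not the paper's API)."""
    def register(func: DataOp) -> DataOp:
        DATAOPS[name] = func
        return func
    return register


@dataop("missing_values")
def missing_values(rows: List[Row], column: str) -> dict:
    """Report the indices of rows whose value in `column` is None or empty."""
    missing = [i for i, row in enumerate(rows) if row.get(column) in (None, "")]
    return {"column": column, "missing_rows": missing}


@dataop("duplicate_values")
def duplicate_values(rows: List[Row], column: str) -> dict:
    """Report values of `column` that occur in more than one row."""
    seen: Dict[Optional[str], List[int]] = {}
    for i, row in enumerate(rows):
        seen.setdefault(row.get(column), []).append(i)
    duplicates = {value: idx for value, idx in seen.items() if len(idx) > 1}
    return {"column": column, "duplicates": duplicates}


def run_dataop(name: str, rows: List[Row], column: str) -> dict:
    """Entry point a no-code front end could call when a user picks a DataOp."""
    return DATAOPS[name](rows, column)


# Made-up sample data used only to exercise the sketch.
SAMPLE_ROWS: List[Row] = [
    {"id": "1", "email": "a@example.com"},
    {"id": "2", "email": ""},
    {"id": "2", "email": "b@example.com"},
    {"id": "3", "email": None},
]

if __name__ == "__main__":
    print(run_dataop("missing_values", SAMPLE_ROWS, "email"))
    print(run_dataop("duplicate_values", SAMPLE_ROWS, "id"))
```

In this sketch the user only selects a DataOp name and a column; the code behind each check stays hidden in the registry, which is one plausible reading of interacting with data without writing code.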