Academic Paper

QED: Groupon's ETL management and curated feature catalog system for machine learning
Document Type
Conference
Source
2016 IEEE International Conference on Big Data (Big Data), pp. 1639-1646, Dec. 2016
Subject
Aerospace
Bioengineering
Computing and Processing
General Topics for Engineers
Geoscience
Signal Processing and Analysis
Feature extraction
Data mining
Machine learning algorithms
Pipelines
Training
Metadata
big data management
data pipeline
feature catalog
machine learning
Language
Abstract
In today's technology industry, where machine learning has become essential, the effectiveness of algorithms ultimately depends on a robust data pipeline, and fast model prototyping and tuning require easy feature discovery and consumption. Careful management of ETL processes and the datasets they produce is key both to model development in the research stage and to model execution in the production environment. In this paper we present QED, an ETL management and curated feature catalog system that provides robust, streamlined machine learning pipelines. First, QED promises dynamic, reliable, and timely data delivery to the production pipeline. Its enhanced ETL process persists data from upstream sources in local data stores and ensures their correctness. Second, in contrast to previous systems, QED is capable of producing not only a daily scoring dataset but also a training dataset with minimized bias, by preserving the historical observations of feature values. Third, QED's multiple data store design allows batch processing of large datasets as well as fast random access to single records. Finally, its curated feature catalog system enables sharing and reuse of machine learning features. QED serves as the data backend for a variety of machine learning models that provide key insights into the global business and optimize the daily operations of Groupon.
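The abstract's second point, building a training dataset from preserved historical observations of feature values, can be illustrated with a small point-in-time lookup. This is only a minimal sketch under stated assumptions, not QED's actual design: the class `FeatureHistory`, the helpers `training_rows` and `scoring_rows`, and the feature name `purchases_7d` are all hypothetical, and an in-memory dictionary stands in for whatever data stores the system uses. The idea shown is that each training row reads feature values as they were on the label's date, while the daily scoring dataset reads the latest values.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date


class FeatureHistory:
    """Hypothetical append-only history of feature observations (not QED's API)."""

    def __init__(self):
        # (entity_id, feature) -> parallel lists, kept sorted by observation date
        self._dates = defaultdict(list)
        self._values = defaultdict(list)

    def record(self, entity_id, feature, observed_on, value):
        """Store one observation without overwriting earlier ones."""
        key = (entity_id, feature)
        idx = bisect_right(self._dates[key], observed_on)
        self._dates[key].insert(idx, observed_on)
        self._values[key].insert(idx, value)

    def as_of(self, entity_id, feature, when):
        """Return the latest value observed on or before `when`, or None."""
        key = (entity_id, feature)
        idx = bisect_right(self._dates.get(key, []), when)
        return self._values[key][idx - 1] if idx else None


def training_rows(history, labels, features):
    """One row per labeled event, using feature values as of the event date."""
    return [
        {"entity_id": e, "label": y,
         **{f: history.as_of(e, f, d) for f in features}}
        for e, d, y in labels
    ]


def scoring_rows(history, entity_ids, features, today):
    """One row per entity, using the most recent feature values."""
    return [
        {"entity_id": e, **{f: history.as_of(e, f, today) for f in features}}
        for e in entity_ids
    ]


# Illustrative data only; the values are made up for the example.
history = FeatureHistory()
history.record("deal-1", "purchases_7d", date(2016, 1, 1), 3)
history.record("deal-1", "purchases_7d", date(2016, 2, 1), 10)

train = training_rows(history, [("deal-1", date(2016, 1, 15), 1)], ["purchases_7d"])
score = scoring_rows(history, ["deal-1"], ["purchases_7d"], date(2016, 3, 1))
```

In this sketch the training row for the event on 2016-01-15 sees the value observed on 2016-01-01 rather than the later one, which is one way to avoid leaking future information into training data. Whether QED minimizes bias in exactly this manner is not stated in the abstract.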