Academic Paper
QED: Groupon's ETL management and curated feature catalog system for machine learning
Document Type
Conference
Author
Source
2016 IEEE International Conference on Big Data (Big Data), pp. 1639-1646, Dec. 2016
Subject
Language
Abstract
In today's technology industry, where machine learning has become essential, the effectiveness of algorithms ultimately depends on a robust data pipeline, and fast model prototyping and tuning require easy feature discovery and consumption. Careful management of ETL processes and the datasets they produce is key both to model development in the research stage and to model execution in the production environment. In this paper we present QED, an ETL management and curated feature catalog system that provides robust, streamlined machine learning pipelines. First, QED promises dynamic, reliable, and timely data delivery to the production pipeline. Its enhanced ETL process persists data from upstream sources in local data stores and ensures their correctness. Second, in contrast to previous systems, QED can produce not only a daily scoring dataset but also a training dataset with minimized bias, by preserving the historical observations of feature values. Third, QED's multiple data store design allows batch processing of large datasets as well as fast random access to single records. Finally, its curated feature catalog system enables sharing and reuse of machine learning features. QED serves as the data backend for a variety of machine learning models that provide key insights into the global business and optimize the daily operations of Groupon.
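The abstract's second point, building a training dataset from preserved historical observations of feature values, can be illustrated with a minimal sketch. This is not QED's implementation, which the abstract does not describe. The class `FeatureHistory`, the function `build_row`, and the entity and feature names below are all hypothetical, chosen only to show the general idea of an "as of" lookup: a training row reads feature values as they stood on the label date, while a scoring row reads the latest values.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date


class FeatureHistory:
    """Hypothetical store keeping every dated observation of a feature value."""

    def __init__(self):
        # (entity_id, feature) -> parallel lists, kept sorted by observation date
        self._dates = defaultdict(list)
        self._values = defaultdict(list)

    def record(self, entity_id, feature, observed_on, value):
        """Insert an observation while keeping the per-key history sorted."""
        key = (entity_id, feature)
        idx = bisect_right(self._dates[key], observed_on)
        self._dates[key].insert(idx, observed_on)
        self._values[key].insert(idx, value)

    def as_of(self, entity_id, feature, when):
        """Return the latest value observed on or before `when`, else None."""
        key = (entity_id, feature)
        idx = bisect_right(self._dates.get(key, []), when)
        return self._values[key][idx - 1] if idx else None


def build_row(history, entity_id, features, as_of_date):
    """Assemble one feature row using only values known by `as_of_date`."""
    return {f: history.as_of(entity_id, f, as_of_date) for f in features}


# Illustrative data only (assumed names and values, not from the paper).
history = FeatureHistory()
history.record("deal_1", "views_7d", date(2016, 1, 1), 10)
history.record("deal_1", "views_7d", date(2016, 2, 1), 25)

# Training row: features as they stood when the label was observed.
training_row = build_row(history, "deal_1", ["views_7d"], date(2016, 1, 15))
# Scoring row: the most recent values available on the scoring date.
scoring_row = build_row(history, "deal_1", ["views_7d"], date(2016, 2, 10))
```

In this sketch, reading the training row as of the label date keeps values observed later out of the training example. That is one common way to limit the kind of bias the abstract mentions.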