Academic Paper

Zebra: A novel method for optimizing text classification query in overload scenario

Document Type

Original Paper

Author

Yu, Tianhuan; He, Zhenying; Yang, Zhihui; Ye, Fei; Fan, Yuankai; Jing, Yinan; Zhang, Kai; Wang, X. Sean

Source

World Wide Web: Internet and Web Information Systems. 26(3):905-931

Subject

Query processing
Text classification
Overload
Probabilistic filter
Load shedding

Language

English

ISSN

1386-145X
1573-1413

Abstract

Text classification is a central task in text mining and can be embedded in queries through user-defined functions (UDFs). In many web applications, such as Twitter mining or real-time Weibo processing, the volume of incoming text can be large enough to overload the system. In a streaming scenario, the resulting query delays degrade the user experience. This paper focuses on queries that apply text classification to streaming data. We propose Zebra, a novel method that uses progressive pipelines to optimize query processing under overload. The core module of Zebra is a probabilistic filter that discards a large share of the text data based on the semantic information of the query predicate. We train weak classifiers as filters on data labeled by brute-force pipelines. We then use a parameter search method to select a suitable filter with its best settings and apply it to the progressive pipelines. Experiments with several text workloads on real-world datasets show that Zebra consistently achieves higher accuracy while answering queries in time.
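The filter-then-classify idea in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the expensive classifier is a stand-in keyword rule, the weak filter is a simple token log-odds scorer, the parameter search is a scan over score thresholds, and every name (expensive_classifier, train_filter, choose_threshold, progressive_query) is hypothetical.

```python
import math
from collections import Counter


def expensive_classifier(text):
    # Hypothetical stand-in for the costly text-classification UDF.
    return "outage" in text.lower()


def tokens(text):
    return set(text.lower().split())


def train_filter(texts, labels):
    # Weak classifier: per-token log-odds with add-one smoothing,
    # trained on labels produced by the brute-force pipeline.
    pos, neg = Counter(), Counter()
    for text, label in zip(texts, labels):
        (pos if label else neg).update(tokens(text))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return {
        w: math.log((pos[w] + 1) / (n_pos + 2))
        - math.log((neg[w] + 1) / (n_neg + 2))
        for w in set(pos) | set(neg)
    }


def score(weights, text):
    return sum(weights.get(w, 0.0) for w in tokens(text))


def choose_threshold(weights, texts, labels, min_recall):
    # Parameter search: highest threshold that still keeps at least
    # min_recall of the known positives (higher threshold = more shedding).
    n_pos = sum(labels)
    for th in sorted({score(weights, t) for t in texts}, reverse=True):
        kept = sum(1 for t, y in zip(texts, labels) if y and score(weights, t) >= th)
        if n_pos == 0 or kept / n_pos >= min_recall:
            return th
    return float("-inf")


def progressive_query(stream, weights, threshold):
    # Only texts that pass the cheap filter reach the expensive classifier.
    for text in stream:
        if score(weights, text) >= threshold and expensive_classifier(text):
            yield text


sample = [
    "server outage reported in region east",
    "lunch menu for today",
    "network outage affecting users",
    "weekend football results",
    "outage resolved after restart",
    "new cafe opening downtown",
]
labels = [expensive_classifier(t) for t in sample]  # brute-force labels
weights = train_filter(sample, labels)
threshold = choose_threshold(weights, sample, labels, min_recall=1.0)
matches = list(progressive_query(sample, weights, threshold))
passed_filter = sum(1 for t in sample if score(weights, t) >= threshold)
```

On this toy sample the filter forwards only part of the stream to the expensive classifier. On unseen text a filter like this can drop true positives, which is the accuracy-versus-latency trade-off that a parameter search has to balance.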
