Academic Paper

Online Learning From Incomplete and Imbalanced Data Streams
Document Type
Periodical
Source
IEEE Transactions on Knowledge and Data Engineering (IEEE Trans. Knowl. Data Eng.). 35(10):10650-10665, Oct. 2023
Subject
Computing and Processing
Heuristic algorithms
Real-time systems
Costs
Data mining
Classification algorithms
Optimization
Aerospace electronics
Data streams
F-measure
incomplete feature spaces
imbalanced data
online learning
Language
ISSN
1041-4347
1558-2191
2326-3865
Abstract
Learning with streaming data has attracted extensive research interest in recent years. Existing online learning approaches make specific assumptions about data streams, such as fixed feature spaces, feature spaces that vary with explicit patterns, and balanced class distributions. However, the data streams generated in many real scenarios commonly have arbitrarily incomplete feature spaces and dynamically imbalanced class distributions, which makes existing approaches unsuitable for real applications. To address this issue, this paper proposes a novel Online Learning from Incomplete and Imbalanced Data Streams (OLI$^{2}$DS) algorithm. OLI$^{2}$DS rests on two main ideas: 1) it follows the empirical risk minimization principle to identify the most informative features of incomplete feature spaces, and 2) it develops a dynamic cost strategy to handle imbalanced class distributions in real time by transforming F-measure optimization into a weighted surrogate loss minimization. To evaluate OLI$^{2}$DS, we compare it with state-of-the-art related algorithms in three kinds of experiments. First, we adopt 14 real datasets to simulate three scenarios of incomplete feature spaces, i.e., trapezoidal, feature evolvable, and capricious data streams. Second, based on a benchmark online analyzer, we generate 13 datasets to simulate incomplete data streams with different imbalance ratios. Third, we analyze concept drift in two simulated settings, i.e., online learning and data stream mining, and verify the adaptation of OLI$^{2}$DS to repeated concept drifts and variable imbalance ratios. The results demonstrate that OLI$^{2}$DS achieves significantly better performance than its rivals. In addition, a real-world case study on movie review classification further illustrates the effectiveness of OLI$^{2}$DS. Code is released at https://github.com/youdianlong/OLI2DS.
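
The abstract's general idea of cost-sensitive online learning over incomplete feature spaces can be illustrated with a small sketch. This is not the OLI$^{2}$DS algorithm and is not taken from the released code. The class name `CostSensitiveOnlineLearner`, the inverse-class-frequency cost, and the hinge-style update are all assumptions chosen for illustration. In particular, the inverse-frequency cost is a simple stand-in for the paper's F-measure-derived cost, which the abstract does not specify. Each instance is a dict holding only the features observed at that time step, so missing features are simply absent.

```python
class CostSensitiveOnlineLearner:
    """Hypothetical illustration only; NOT the OLI2DS method from the paper."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = {}               # feature name -> weight, grows as features appear
        self.counts = {1: 0, -1: 0}     # running class counts seen on the stream

    def score(self, x):
        # Only observed features contribute; unseen features have weight 0.
        return sum(self.weights.get(f, 0.0) * v for f, v in x.items())

    def predict(self, x):
        return 1 if self.score(x) >= 0 else -1

    def cost(self, y):
        # Assumed dynamic cost: inverse running class frequency, so the
        # currently rarer class receives the larger misclassification cost.
        total = self.counts[1] + self.counts[-1]
        return total / (2.0 * self.counts[y])

    def update(self, x, y):
        # Count first, so cost(y) never divides by zero.
        self.counts[y] += 1
        if y * self.score(x) < 1.0:     # weighted hinge loss is active
            step = self.learning_rate * self.cost(y)
            for f, v in x.items():
                self.weights[f] = self.weights.get(f, 0.0) + step * y * v


# Toy imbalanced stream: nine negatives observing only feature "a",
# then one positive observing only feature "b".
learner = CostSensitiveOnlineLearner()
stream = [({"a": 1.0}, -1)] * 9 + [({"b": 1.0}, 1)]
for x, y in stream:
    learner.update(x, y)

print(learner.cost(1), learner.cost(-1))
print(learner.predict({"a": 1.0}), learner.predict({"b": 1.0}))
```

In this toy stream the single positive instance receives a much larger update than any one negative instance, which is the intuition behind weighting the loss by class cost. The weight dictionary grows as new features appear, so no fixed feature space is required.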