Academic Article

Streaming Data Analysis: Clustering or Classification?
Document Type
Periodical
Source
IEEE Transactions on Systems, Man, and Cybernetics: Systems (IEEE Trans. Syst. Man Cybern. Syst.), 51(1):91-102, Jan. 2021
Subject
Signal Processing and Analysis
Robotics and Control Systems
Power, Energy and Industry Applications
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
General Topics for Engineers
Clustering algorithms
Animals
Indexes
Data analysis
Data models
Streaming media
Probabilistic logic
Classifier design
cluster footprints
CluStream
DenStream
sequential k-means
stream clustering
streaming data analysis (SDA)
Language
ISSN
2168-2216
2168-2232
Abstract
This article is a position paper about models and algorithms that are generally called “stream clustering.” Semantics and methods used in this field are often co-opted from static clustering, but they do not serve streaming data analysis well. Most “state-of-the-art” methods, such as sequential k-means, BIRCH, CluStream, and DenStream, acknowledge that the data are seen only once in real streaming analysis (e.g., intrusion detection or voter fraud). Interpretation of their outputs generally overlooks the fact that when the data cannot be saved, batch clustering ideas such as preclustering assessment, partitioning, and cluster validity are not relevant. Yet in the current literature, the data, or some subset of them, are often saved for hindsight evaluation (we call this fake stream clustering). Our position? Useful analysis of real streaming data is in its infancy. We do not argue that current approaches to streaming clustering are wrong; rather, we regard them as transitional methods that will eventually lead to a new and useful paradigm for this type of computation. We think that the models and algorithms in this class are actually classifiers, but with a special added component, viz., continuously updated cluster footprints of the in-stream processing. We need to carefully define the objectives of streaming analysis, and then choose terminology and methods that suit this evolving paradigm.