학술논문

Scalable Kernel $k$-Means With Randomized Sketching: From Theory to Algorithm
Document Type
Periodical
Source
IEEE Transactions on Knowledge and Data Engineering IEEE Trans. Knowl. Data Eng. Knowledge and Data Engineering, IEEE Transactions on. 35(9):9210-9224 Sep, 2023
Subject
Computing and Processing
Kernel
Time complexity
Complexity theory
Hilbert space
Costs
Unsupervised learning
Feature extraction
++%24k%24<%2Ftex-math>+++k<%2Fmml%3Ami>+<%2Fmml%3Amath>++<%2Falternatives>+<%2Finline-formula>+<%2Fnamed-content>-means%22">Kernel $k$ k -means
randomized sketching
statistical and computational trade-offs
excess risk bound
Language
ISSN
1041-4347
1558-2191
2326-3865
Abstract
Kernel $k$k-means is a fundamental unsupervised learning in data mining. Its computational requirements are typically at least quadratic in the number of data, which are prohibitive for large-scale scenarios. To address these issues, we propose a novel randomized sketching approach SKK based on the circulant matrix. SKK projects the kernel matrix left and right according to the proposed sketch matrices to obtain a smaller one and accelerates the matrix-matrix product by the fast Fourier transform based on the circulant matrix, which can greatly reduce the computational requirements of the approximate kernel $k$k-means estimator with the same generalization bound as the exact kernel $k$k-means in the statistical setting. In particular, theoretical analysis shows that taking the sketch dimension of $\sqrt{n}$n is sufficient for SKK to achieve the optimal excess risk bound with only a fraction of computations, where $n$n is the number of data. The extensive experiments verify our theoretical analysis, and SKK achieves the state-of-the-art performances on 12 real-world datasets. To the best of our knowledge, in randomized sketching, this is the first time that unsupervised learning makes such a significant breakthrough.