Academic Paper

CaFe DBSCAN: A Density-based Clustering Algorithm for Causal Feature Learning
Document Type
Conference
Source
2023 IEEE 10th International Conference on Data Science and Advanced Analytics (DSAA), pp. 1-10, Oct. 2023
Subject
Communication, Networking and Broadcast Technologies
Computing and Processing
Signal Processing and Analysis
Representation learning
Machine learning algorithms
Clustering methods
Clustering algorithms
Estimation
Data science
Probabilistic logic
Causal Feature Learning
Density-based Clustering
Macro-level Causal Effects
Abstract
Causal Feature Learning (CFL) infers macro-level causes (e.g., an aggregation of pixels in a traffic light image) from micro-level data (e.g., pixels of the image) by clustering the predicted probabilities of effect states (e.g., state of the traffic light). The current method for CFL uses a two-step procedure. First, a classifier for the effect states is trained, and afterwards, the predicted effect state probabilities are clustered. With CaFe DBSCAN, we present a novel density-based clustering method that conducts CFL directly by estimating conditional probabilities during clustering. To this end, we introduce the notion of clustering regions with similar conditional probabilities of the effect states given their micro-level data points. Our single-step approach has the following benefits: (1) CaFe DBSCAN introduces a comprehensive approach to Causal Feature Learning. Unlike existing methods, CaFe DBSCAN uses a probabilistic framework and does not require separate classification and clustering steps implemented by different algorithms relying on various assumptions, parameter settings, and optimization goals. (2) We do not need to train and tune a classifier first, hence the algorithm is more runtime-efficient than the current approach. (3) Due to the properties of density-based clustering algorithms, CaFe DBSCAN is robust against noise and outliers, which leads to purer clusters. (4) Our algorithm automatically infers a reasonable number of clusters, i.e., macro-level causes. We demonstrate the benefits of CaFe DBSCAN on synthetic and real-world data.
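The two-step procedure that the abstract describes as the current method can be illustrated with a minimal sketch. It is not the authors' implementation and it is not CaFe DBSCAN. A k-nearest-neighbour frequency estimate stands in for the trained classifier, and a simple gap-based grouping of the one-dimensional probabilities stands in for the clustering step. The function names, the toy data, and the parameters `k` and `gap` are all assumptions made for illustration.

```python
import math


def knn_effect_probability(X, y, k):
    """Step 1 (stand-in for a classifier): estimate P(Y=1 | x) for every
    micro-level point as the share of its k nearest neighbours
    (the point itself included) whose effect state is 1."""
    probs = []
    for xi in X:
        nearest = sorted(range(len(X)), key=lambda j: math.dist(xi, X[j]))[:k]
        probs.append(sum(y[j] for j in nearest) / k)
    return probs


def cluster_probabilities(probs, gap):
    """Step 2 (stand-in for a clustering algorithm): sort the predicted
    probabilities and open a new cluster whenever two consecutive
    values differ by more than `gap`."""
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    labels = [0] * len(probs)
    current = 0
    for prev, nxt in zip(order, order[1:]):
        if probs[nxt] - probs[prev] > gap:
            current += 1
        labels[nxt] = current
    return labels


# Hypothetical toy data: two well-separated groups of micro-level points.
# In the left group one point in five shows the effect (P = 0.2);
# in the right group four points in five show it (P = 0.8).
X = [(i,) for i in range(20)] + [(100 + i,) for i in range(20)]
y = [1 if i % 5 == 0 else 0 for i in range(20)] + \
    [0 if i % 5 == 0 else 1 for i in range(20)]

probs = knn_effect_probability(X, y, k=5)     # step 1: predict effect probabilities
labels = cluster_probabilities(probs, gap=0.3)  # step 2: cluster the probabilities

# Each resulting cluster plays the role of one macro-level cause.
print(sorted(set(labels)))
```

In this sketch the two steps rely on separate parameters (`k` for the estimate, `gap` for the grouping). The abstract names this kind of separation, with its differing assumptions and parameter settings, as a drawback that the single-step approach avoids.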