Academic Article

Learning-Based Dissimilarity for Clustering Categorical Data

Document Type

article

Author

Edgar Jacob Rivera Rios; Miguel Angel Medina-Pérez; Manuel S. Lazo-Cortés; Raúl Monroy

Source

Applied Sciences, Vol 11, Iss 8, p 3509 (2021)

Subject

dissimilarity
categorical data
clustering
Technology
Engineering (General). Civil engineering (General)
TA1-2040
Biology (General)
QH301-705.5
Physics
QC1-999
Chemistry
QD1-999

Language

English

ISSN

2076-3417

Abstract

Comparing data objects is at the heart of machine learning. For continuous data, object dissimilarity is usually taken to be object distance; however, for categorical data, there is no universal agreement, for categories can be ordered in several different ways. Most existing category dissimilarity measures characterize the distance among the values an attribute may take using precisely the number of different values the attribute takes (the attribute space) and the frequency at which they occur. These kinds of measures overlook attribute interdependence, which may provide valuable information when capturing per-attribute object dissimilarity. In this paper, we introduce a novel object dissimilarity measure that we call Learning-Based Dissimilarity, for comparing categorical data. Our measure characterizes the distance between two categorical values of a given attribute in terms of how likely it is that such values are confused or not when all the dataset objects with the remaining attributes are used to predict them. To that end, we provide an algorithm that, given a target attribute, first learns a classification model in order to compute a confusion matrix for the attribute. Then, our method transforms the confusion matrix into a per-attribute dissimilarity measure. We have successfully tested our measure against 55 datasets gathered from the University of California, Irvine (UCI) Machine Learning Repository. Our results show that it surpasses, in terms of various performance indicators for data clustering, the most prominent distance relations put forward in the literature.
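The pipeline the abstract describes (predict a target attribute from the remaining attributes, collect a confusion matrix, turn it into a per-attribute dissimilarity) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract names neither the classifier nor the transformation, so the sketch swaps in a categorical naive Bayes classifier with Laplace smoothing, leave-one-out evaluation, and the rule "one minus the mean confusion rate between two values". All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def predict(train, target, query):
    """Stand-in classifier (an assumption): categorical naive Bayes with
    Laplace smoothing, predicting attribute `target` from the other attributes."""
    labels = Counter(row[target] for row in train)
    best_label, best_score = None, float("-inf")
    for label in sorted(labels):
        members = [row for row in train if row[target] == label]
        score = math.log(labels[label] / len(train))
        for j in range(len(query)):
            if j == target:
                continue
            domain = {row[j] for row in train} | {query[j]}
            matches = sum(1 for row in members if row[j] == query[j])
            score += math.log((matches + 1) / (labels[label] + len(domain)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def confusion_counts(rows, target):
    """Leave-one-out confusion counts: counts[actual][predicted]."""
    values = sorted({row[target] for row in rows})
    counts = {value: Counter() for value in values}
    for i, row in enumerate(rows):
        train = rows[:i] + rows[i + 1:]
        counts[row[target]][predict(train, target, row)] += 1
    return values, counts


def value_dissimilarity(rows, target):
    """Assumed transformation: d(a, b) = 1 - mean confusion rate between a and b,
    so values that are often mistaken for each other end up close together."""
    values, counts = confusion_counts(rows, target)
    totals = {a: sum(counts[a].values()) for a in values}
    table = {}
    for a in values:
        for b in values:
            if a == b:
                table[(a, b)] = 0.0
            else:
                confused = (counts[a][b] / totals[a] + counts[b][a] / totals[b]) / 2
                table[(a, b)] = 1.0 - confused
    return table


def object_dissimilarity(x, y, tables):
    """Sum the per-attribute value dissimilarities of two objects."""
    return sum(tables[j][(x[j], y[j])] for j in range(len(x)))


# Hypothetical toy dataset: (colour, shape, label).
rows = [
    ("red", "round", "apple"),
    ("red", "round", "apple"),
    ("green", "round", "apple"),
    ("yellow", "long", "banana"),
    ("yellow", "long", "banana"),
    ("green", "long", "banana"),
]
tables = [value_dissimilarity(rows, j) for j in range(len(rows[0]))]
print(object_dissimilarity(rows[0], rows[3], tables))
```

The resulting pairwise dissimilarities could then be passed to any clustering method that accepts a precomputed dissimilarity matrix. Which classifier and which transformation the paper actually uses would need to be checked against the full text.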

Online Access

Full Text (ProQuest Central); Full Text (Gale Academic Onefile); Open Access (DOAJ); Open Access (EBSCO); Web of Science; JCR Journal Information; Scopus; Find it@PNU
