Academic Paper

Optimizing Clustering Algorithms for Anti-Microbial Evaluation Data: A Majority Score-Based Evaluation of K-Means, Gaussian Mixture Model, and Multivariate T-Distribution Mixtures
Document Type
Periodical
Source
IEEE Access, vol. 11, pp. 79793-79800, 2023
Subject
Aerospace
Bioengineering
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Engineered Materials, Dielectrics and Plasmas
Engineering Profession
Fields, Waves and Electromagnetics
General Topics for Engineers
Geoscience
Nuclear Engineering
Photonics and Electrooptics
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Transportation
Clustering algorithms
Chemical compounds
Antibacterial activity
Partitioning algorithms
Indexes
Mixture models
Machine learning algorithms
Clustering
K-means
GMM
multivariate t distribution
Silhouette width
within sum square
Dunn index
Language
ISSN
2169-3536
Abstract
This study presents a detailed analysis of the performance of the majority score clustering algorithm on three anti-microbial evaluation datasets: the minimum inhibitory concentration (MIC) of bacteria, and the antibacterial and antifungal activity of chemical compounds against 4 bacteria (E. coli, P. aeruginosa, S. aureus, S. pyogenes) and 2 fungi (C. albicans, As. fumigatus). Clustering is an unsupervised machine learning method, used here to group chemical compounds by similarity. We apply k-means clustering, the Gaussian mixture model (GMM), and mixtures of multivariate t-distributions to these anti-microbial datasets. To determine the optimal number of clusters and the best-performing algorithm, we use several clustering validation indices (CVIs): within sum of squares (to be minimized), connectivity (to be minimized), Silhouette Width (to be maximized), and the Dunn Index (to be maximized). Based on the majority score clustering algorithm, we conclude that k-means and the mixture of multivariate t-distributions perform best on the CVIs to be maximized, while GMM performs best on the CVIs to be minimized. K-means clustering and the mixture of multivariate t-distributions yield 3 optimal clusters for the antibacterial activity dataset and 5 optimal clusters for the MIC bacteria dataset. K-means clustering, the mixture of multivariate t-distributions, and GMM yield 3 optimal clusters for both the antibacterial and antifungal activity datasets. Overall, k-means clustering performs best under the majority score criterion. These results may be useful to the pharmaceutical industry, chemists, and medical professionals.
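The abstract does not spell out how the majority score is computed. The sketch below shows one plausible reading, assuming each CVI casts a single vote for the candidate (algorithm, number of clusters) that optimizes it in the direction stated above, and the candidate with the most votes wins. The function name `majority_score`, the dictionary layout, and all numeric values are hypothetical placeholders for illustration; they are not taken from the paper.

```python
from collections import Counter

# Optimization direction of each CVI, as stated in the abstract.
CVI_DIRECTION = {
    "within_sum_square": "min",
    "connectivity": "min",
    "silhouette_width": "max",
    "dunn_index": "max",
}


def majority_score(scores, directions=CVI_DIRECTION):
    """Return (winner, votes) for a table of CVI values.

    scores maps a candidate, e.g. ("kmeans", 3), to a dict of CVI values.
    Each CVI votes for the candidate with the best value in its direction.
    Ties in the final count go to the candidate that received a vote first.
    """
    votes = Counter()
    for cvi, direction in directions.items():
        pick = min if direction == "min" else max
        best = pick(scores, key=lambda cand: scores[cand][cvi])
        votes[best] += 1
    winner, _ = votes.most_common(1)[0]
    return winner, dict(votes)


# Made-up CVI values for three candidates (illustration only, not paper results).
scores = {
    ("kmeans", 3): {"within_sum_square": 120.0, "connectivity": 14.0,
                    "silhouette_width": 0.61, "dunn_index": 0.32},
    ("gmm", 3): {"within_sum_square": 135.0, "connectivity": 9.0,
                 "silhouette_width": 0.55, "dunn_index": 0.21},
    ("kmeans", 5): {"within_sum_square": 90.0, "connectivity": 30.0,
                    "silhouette_width": 0.48, "dunn_index": 0.18},
}

winner, votes = majority_score(scores)
print(winner, votes)
```

In practice the CVI values would come from fitting each algorithm over a range of cluster counts. Because the within sum of squares generally decreases as the number of clusters grows, its vote tends to favor the largest candidate k, which is one reason to combine it with the other indices rather than use it alone.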