Academic Journal Article

3gClust: Human Protein Cluster Analysis
Document Type
Periodical
Source
IEEE/ACM Transactions on Computational Biology and Bioinformatics (IEEE/ACM Trans. Comput. Biol. and Bioinf.). 16(6):1773-1784 Jan, 2019
Subject
Bioengineering
Computing and Processing
Proteins
Feature extraction
Amino acids
Clustering algorithms
Databases
Measurement
Human protein cluster analysis
amino acid frequency features
hierarchical clustering
cluster partitioning
biological function
structural similarity
Language
ISSN
1545-5963
1557-9964
2374-0043
Abstract
We present a human protein cluster analysis that combines: 1) n-gram based amino acid frequency features, 2) optimal feature selection, 3) hierarchical clustering, and 4) advanced partitioning techniques. Our method qualitatively and quantitatively groups proteins with increasing sequence similarity into similar clusters by calculating the frequency model of amino acids using n-grams. We experiment with n = 1 (unigrams), n = 2 (bigrams), and finally n = 3 (trigrams) for optimal selection of features to design the 3gClust algorithm. The benchmarking results on 20,105 manually curated human proteins show that 3gClust ensures better cluster compactness for proteins with similar functional groups, biological processes, structural alignment, and shared domains (e.g., aquaporins, keratins). Quantitative analysis of non-singleton clusters shows significant improvement in their compactness in comparison to other state-of-the-art methodologies. 3gClust is available for academic use, along with datasets, at https://sites.google.com/site/bioinfoju/projects/3gclust. Supplementary materials can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2840996.
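The n-gram frequency features and the hierarchical clustering step named in the abstract can be illustrated with a minimal sketch. This is not the 3gClust implementation: the abstract does not specify the feature selection, distance measure, linkage, or partitioning, so the Euclidean distance, the average linkage, the distance threshold, and all function names below are assumptions chosen for illustration. The toy sequences are invented and are not real proteins.

```python
from collections import Counter
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def ngram_frequencies(sequence, n):
    """Relative frequency of every possible n-gram (20**n of them), in a fixed order."""
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    vocabulary = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    total = sum(counts[g] for g in vocabulary)
    return [counts[g] / total if total else 0.0 for g in vocabulary]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (an assumed metric)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def agglomerate(vectors, threshold):
    """Naive average-linkage agglomerative clustering.

    Repeatedly merges the two closest clusters and stops once the closest
    pair is farther apart than `threshold` (a hypothetical cut-off).
    Returns clusters as lists of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        pairs = [euclidean(vectors[i], vectors[j]) for i in a for j in b]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        d, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy example: bigram (n = 2) features for three invented sequences.
toy_sequences = ["ACACACAC", "ACACACCA", "WYWYWYWY"]
features = [ngram_frequencies(s, 2) for s in toy_sequences]
clusters_found = agglomerate(features, threshold=0.5)
print(clusters_found)  # [[0, 1], [2]]
```

The feature vector grows as 20^n (20, 400, and 8,000 dimensions for n = 1, 2, 3), which is one reason a feature selection step of the kind the abstract mentions becomes relevant for trigrams. For a dataset of the size reported (20,105 proteins), a pairwise pure-Python loop like this one would be far too slow; it is meant only to make the two steps concrete.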