Academic Paper

Estimating the statistical significance of classifiers by varying the number of genes
Document Type
Conference
Source
2006 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '06), pp. 109-110, May 2006
Subject
Computing and Processing
Signal Processing and Analysis
Support vector machines
Support vector machine classification
Cancer
Gene expression
Resonance light scattering
Data mining
Colon
Pathology
Data analysis
Least squares methods
Language
ISSN
2150-3001
2150-301X
Abstract
We present a statistically well-founded method for constructing cancer predictors from gene expression profiles. The methodology is applied to a new microarray data set obtained from 25 patients affected by colon cancer. In particular, we address two precise questions: how many gene expression levels are correlated with the pathology, and how many are sufficient for accurate classification? The proposed method answers these questions while avoiding the potential pitfalls hidden in the analysis of microarray data. We evaluated the generalization error, estimated through the Leave-K-Out Cross Validation error, of two classification schemes while varying the number of selected genes. We found that Regularized Least Squares (RLS) and Support Vector Machine (SVM) classifiers, using the whole gene set, have error rates of e = 14% (p = 0.023) and e = 11% (p = 0.016), respectively. Concerning the number of genes, the performance of the RLS and SVM classifiers does not change when 74% of the genes are used. Statistical significance was measured using a permutation test.
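
The evaluation described in the abstract (a Leave-K-Out Cross Validation error whose significance is assessed with a permutation test) can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not give the value of K, the number of splits or permutations, the regularization parameter, or the kernel. The sketch assumes a linear-kernel RLS classifier without a bias term, random held-out subsets of size `k` instead of all possible subsets, a "+1" corrected p-value, and synthetic data. All function names (`make_toy_data`, `rls_predict`, `leave_k_out_error`, `permutation_p_value`) are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def rls_predict(X_train, y_train, X_test, lam=1.0):
    """Linear-kernel RLS in dual form: solve (K + lam*I) c = y, then
    predict the sign of sum_i c_i <x_i, x>. lam is an assumed value."""
    n = len(X_train)
    K = [[dot(X_train[i], X_train[j]) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    c = solve(K, [float(t) for t in y_train])
    return [1 if sum(ci * dot(xi, x) for ci, xi in zip(c, X_train)) >= 0 else -1
            for x in X_test]

def leave_k_out_error(X, y, splits):
    """Fraction of misclassified held-out samples over the given splits."""
    errors, total = 0, 0
    for test_idx in splits:
        held = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held]
        pred = rls_predict([X[i] for i in train_idx],
                           [y[i] for i in train_idx],
                           [X[i] for i in test_idx])
        errors += sum(p != y[i] for p, i in zip(pred, test_idx))
        total += len(test_idx)
    return errors / total

def permutation_p_value(X, y, k=3, n_splits=20, n_perm=100, seed=0):
    """Return (observed error, p-value). The p-value is the corrected share
    of label permutations whose error is at most the observed error."""
    rng = random.Random(seed)
    n = len(y)
    # Assumption: random subsets of size k rather than all possible subsets.
    splits = [rng.sample(range(n), k) for _ in range(n_splits)]
    observed = leave_k_out_error(X, y, splits)
    count = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)  # break any link between profiles and labels
        if leave_k_out_error(X, y_perm, splits) <= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)

def make_toy_data(n_per_class=6, n_genes=5, seed=1):
    """Synthetic stand-in for expression profiles: the first two 'genes'
    carry the class signal, the remaining ones are pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (1, -1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 0.5) for _ in range(n_genes)]
            row[0] += 2.0 * label
            row[1] += 2.0 * label
            X.append(row)
            y.append(label)
    return X, y
```

Calling `permutation_p_value(*make_toy_data())` returns the cross-validation error on the true labels together with its permutation p-value. To study the effect of the number of genes, the same procedure could be repeated on gene subsets of different sizes; in that case the gene selection would need to be performed inside each training split, so that held-out samples never influence which genes are chosen.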