13-09-2013, 03:23 PM
Machine learning-based receiver operating characteristic (ROC) curves for crisp and fuzzy classification of DNA microarrays in cancer research
Abstract
Receiver operating characteristic (ROC) curves were generated to obtain classification area under the curve (AUC) as a
function of feature standardization, fuzzification, and sample size from nine large sets of cancer-related DNA microarrays.
Classifiers used included k-nearest neighbor (kNN), naïve Bayes classifier (NBC), linear discriminant analysis (LDA), quadratic
discriminant analysis (QDA), learning vector quantization (LVQ1), logistic regression (LOG), polytomous logistic
regression (PLOG), artificial neural networks (ANN), particle swarm optimization (PSO), constricted particle swarm
optimization (CPSO), kernel regression (RBF), radial basis function networks (RBFN), gradient descent support vector
machines (SVMGD), and least squares support vector machines (SVMLS). For each data set, AUC was determined
for a number of combinations of sample size, total sum[−log(p)] of feature t-tests, with and without feature standardization
and with (fuzzy) and without (crisp) fuzzification of features. Altogether, a total of 2,123,530 classification runs were
made. At the greatest level of sample size, ANN resulted in a fitted AUC of 90%, while PSO resulted in the lowest fitted
AUC of 72.1%. AUC values derived from 4NN were the most dependent on sample size, while PSO was the least. ANN
depended the most on total statistical significance of features used based on sum[−log(p)], whereas PSO was the least
dependent. Standardization of features increased AUC by 8.1% for PSO and changed it by −0.2% for QDA, while fuzzification increased
AUC by 9.4% for PSO and reduced AUC by 3.8% for QDA. AUC determination in planned microarray experiments without
standardization and fuzzification of features will benefit the most if CPSO is used for lower levels of feature significance
(i.e., sum[−log(p)] ~ 50) and ANN is used for greater levels of significance (i.e., sum[−log(p)] ~ 500). When only
standardization of features is performed, studies are likely to benefit most by using CPSO for low levels of feature statistical
significance and LVQ1 for greater levels of significance. Studies involving only fuzzification of features should employ
LVQ1 because of the substantial gain in AUC observed and low expense of LVQ1. Lastly, PSO resulted in significantly
greater levels of AUC (89.5% average) when feature standardization and fuzzification were performed. In consideration
of the data sets used and factors influencing AUC which were investigated, if low-expense computation is desired then
LVQ1 is recommended. However, if computational expense is of less concern, then PSO or CPSO is recommended.
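As a rough illustration of the quantities discussed in the abstract, the Python sketch below computes a rank-based AUC, a z-score standardization of one feature, and a total sum[−log(p)] from a list of p-values. The function names, the use of the sample standard deviation, the natural logarithm, and the example scores are all assumptions made for illustration; the paper's own implementation and its fuzzification step are not described in this excerpt.

```python
import math
from statistics import mean, stdev


def auc(pos_scores, neg_scores):
    """Rank-based AUC: fraction of (positive, negative) pairs in which the
    positive sample scores higher; ties count as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def standardize(values):
    """Z-score standardization (assumes the sample standard deviation)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def total_significance(p_values):
    """Sum of -log(p) over features (natural log is an assumption)."""
    return sum(-math.log(p) for p in p_values)


# Hypothetical classifier scores for three positive and three negative samples.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2]
print(auc(pos, neg))
print(standardize([1.0, 2.0, 3.0]))
print(total_significance([0.01, 0.001]))
```

Because this AUC depends only on the ranking of the scores, standardizing a single score vector leaves it unchanged; standardization can only affect AUC through the way a classifier combines several features.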
Introduction
DNA microarrays have been used extensively to interrogate gene expression profiles of cells in different classes
of treatment or disease. Analyses performed with DNA microarrays commonly include identification
of differentially expressed genes via inferential tests of hypothesis, predictive modeling through
function approximation (e.g., survival analysis), unsupervised classification to identify similar profiles over
samples or features, or supervised classification for sample class prediction. There is a voluminous literature
on statistical power and sample size determination for inferential testing to identify differentially expressed genes
[1–9]. That so much concentration on power and sample size is devoted to differential expression stems from the
predominance of applications focusing on biological questions, where differential expression is the primary goal.
Etiological (cause-effect) biological questions are routinely part of both experimental and clinical applications,
which ultimately target the roles of molecules and pathways responsible for the observed effects. On the other
hand, expression-based sample classification (e.g., patient classification) is less biologically focused on genes in
causal pathways and more directed toward clinical questions related to patient classification.
DNA microarray data sets used
Data used for classification analysis were available in C4.5 format from the Kent Ridge Biomedical Data
Set Repository (http://sdmc.i2r.a-star.edu.sg/rp). The two-class adult brain cancer data comprised 60
arrays (21 censored, 39 failures) with expression for 7129 genes [12]. The two-class adult prostate cancer data
set consisted of 102 training samples (52 tumor and 50 normal) with 12,600 features. The original report for
the prostate data supplement was published by Singh et al. [13]. Two breast cancer data sets were used. The
first had 2 classes and consisted of 15 arrays for 8 BRCA1 positive women and 7 BRCA2 positive women with
expression profiles of 3170 genes [14], and the second was also a two-class set including 78 patient samples and
24,481 features (genes), comprising 34 cases with distant metastases who relapsed ("relapse") within 5 years
after initial diagnosis and 44 who remained disease-free ("non-relapse") for more than 5 years after diagnosis [15]. Two-class
expression data for adult colon cancer were based on the paper published by Alon et al. [16]. The data set
contains 62 samples based on expression of 2000 genes in 40 tumor biopsies ("negative") and 22 normal
("positive") biopsies taken from non-diseased colon tissue of the same patients. An adult two-class lung cancer
set including 32 samples (16 malignant pleural mesothelioma (MPM) and 16 adenocarcinoma (ADCA)) of the
lung with expression values for 12,533 genes [20] was also considered. Two leukemia data sets were evaluated:
one two-class data set with 38 arrays (27 ALL, 11 AML) containing expression for 7129 genes [21], and the
other consisting of 3 classes for 57 pediatric samples for lymphoblastic and myelogenous leukemia (20 ALL,
17 MLL and 20 AML) with expression values for 12,582 genes [22]. The Khan et al. [23] data set on pediatric
small round blue-cell tumors (SRBCT) had expression profiles for 2308 genes and 63 arrays comprising 4 classes
(23 arrays for EWS – Ewing sarcoma, 8 arrays for BL – Burkitt lymphoma, 12 arrays for NB – neuroblastoma,
and 20 arrays for RMS – rhabdomyosarcoma).
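The array, gene, and class counts quoted above can be collected into a small structure and checked for internal consistency. The sketch below is only a bookkeeping aid: the short data set labels are hypothetical names chosen here, and the numbers are copied from the text rather than read from the repository files.

```python
# (label, total arrays, number of genes/features, class counts) as quoted in the text.
# The short labels are hypothetical names chosen for this sketch.
DATASETS = [
    ("brain", 60, 7129, {"censored": 21, "failure": 39}),
    ("prostate", 102, 12600, {"tumor": 52, "normal": 50}),
    ("breast_brca", 15, 3170, {"BRCA1": 8, "BRCA2": 7}),
    ("breast_relapse", 78, 24481, {"relapse": 34, "non-relapse": 44}),
    ("colon", 62, 2000, {"tumor": 40, "normal": 22}),
    ("lung", 32, 12533, {"MPM": 16, "ADCA": 16}),
    ("leukemia_2class", 38, 7129, {"ALL": 27, "AML": 11}),
    ("leukemia_3class", 57, 12582, {"ALL": 20, "MLL": 17, "AML": 20}),
    ("srbct", 63, 2308, {"EWS": 23, "BL": 8, "NB": 12, "RMS": 20}),
]


def inconsistent(datasets):
    """Return labels whose class counts do not add up to the stated total."""
    return [name for name, total, _, classes in datasets
            if sum(classes.values()) != total]


for name, total, genes, classes in DATASETS:
    print(f"{name:16s} arrays={total:4d} genes={genes:6d} classes={len(classes)}")
print("inconsistent:", inconsistent(DATASETS))
```

For the nine data sets as described, the class counts add up to the stated number of arrays in every case.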
Discussion and conclusions
The effects of sample size, feature significance, feature standardization, and feature fuzzification varied over
the classifiers used. Particle swarm optimization (PSO) and constricted particle swarm optimization (CPSO)
were the best-performing classifiers, resulting in the greatest levels of fitted AUC. LVQ1 typically resulted
in the second greatest levels of fitted AUC. Quadratic discriminant analysis (QDA) and logistic regression
(LOG) commonly resulted in the lowest levels of fitted AUC. Artificial neural networks (ANN) were at times
either the best or the worst classifier. Table 3 lists the two best and two worst classifiers for fitted AUC values shown in
Figs. 6–13.