17-03-2012, 10:16 AM
A Comparative Study on Application of Data Mining Technique in Human Shape Clustering: Principal Component Analysis vs. Factor Analysis
Abstract
Traditional human shape classification usually relies on
a few key measurements, which causes several problems for
ergonomic product design. Multivariate analysis can compensate
for the shortcomings of the traditional method. Among methods of
multivariate analysis, Principal component analysis (PCA) and
Factor analysis (FA) have enjoyed widespread popularity.
Though both aim to reduce the dimensionality of the sample
data, there are differences between PCA and FA that merit
further investigation. The purpose of the paper is to demonstrate
the differences between PCA and FA by analyzing the
multivariate anthropometric data. K-means cluster analysis was
used to divide samples into groups with homogeneous
characteristics according to the PCA scores (or FA scores).
ANOVA (analysis of variance) was adopted to compare the
dimensions in corresponding clusters between PCA and FA. For
all the dimensions, the p-value is reported as 0.000 (i.e., below
0.001), indicating a significant difference between the PCA and
FA samples at the significance level of 0.005. Finally, regression
models of the reference dimensions based on the key dimensions,
i.e., stature and waist girth, were investigated to ease the use
of the FA (or PCA) results in applications such as building a
family of digital manikins. In conclusion, the techniques have
similarities and differences, and should not be misused. PCA
analyzes all variance of the data set, while FA analyzes only the
common variance. An a priori decision between the techniques
depends on domain expertise and the statistical characteristics
of the sample.
Keywords: Principal component analysis, Factor analysis,
Cluster analysis, Sizing system, ANOVA
I. INTRODUCTION
Human shape classification, usually called a sizing system in
the field of ergonomics, classifies a population into groups with
homogeneous characteristics on the basis of several key
dimensions of body shape. Sizing systems are essential in many
industrial fields, especially garment design and production.
With the development of information technology, sizing is also
used to design digital manikins. Numerous software packages,
such as Jack and Santo [1], have been developed to help build
them. However, the widely misused concept of the "average man"
has a severe problem: no actual person has average values on
all dimensions. The key problem is therefore how to analyze
multiple dimensions together. Various methods have been used to
solve this problem [2]. Factor analysis (FA) and Principal
Component Analysis (PCA), two classical variable reduction
techniques, have gained widespread popularity. However, they are
often mistaken for the same statistical method, although they
differ: FA analyzes only the common variance, whereas PCA
considers all the variance in the data set. PCA and FA differ
substantially in definition, theory, results, and appropriate
application. Here we take human shape classification as a case
study to compare PCA and FA. Detailed mathematical and
statistical explanations are not given here; interested readers
can consult the references.
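The distinction between total and common variance can be written compactly. The block below is the standard textbook formulation, not notation taken from this paper: x is assumed to be the vector of p standardized dimensions, and the symbols R, W, \Lambda, L, f, \varepsilon and \Psi are introduced here for illustration only.

```latex
% PCA: eigendecomposition of the p x p correlation matrix R.
% All p components together reproduce the total variance tr(R) = p.
\begin{align*}
R &= W \Lambda W^{\top}, &
z &= W^{\top} x, &
\sum_{j=1}^{p} \lambda_j &= \operatorname{tr}(R) = p
\end{align*}

% FA: m < p common factors f plus a unique part for each variable.
\begin{align*}
x &= L f + \varepsilon, &
\operatorname{Cov}(f) &= I_m, &
\operatorname{Cov}(\varepsilon) &= \Psi = \operatorname{diag}(\psi_1, \dots, \psi_p)
\end{align*}

% Implied structure: only L L^T (the common variance) is modelled by the factors.
\begin{align*}
R &= L L^{\top} + \Psi, &
1 &= \underbrace{\sum_{j=1}^{m} l_{ij}^{2}}_{\text{communality } h_i^{2}} + \psi_i
\end{align*}
```

Under these assumptions, PCA decomposes the full diagonal of R, while FA explains only the communalities and leaves the unique variances \psi_i outside the factors.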
A. Data Preparation
The anthropometric data used in this paper were obtained from a
large-scale national anthropometric survey conducted by the China
National Institute of Standardization. The survey, carried out
from 2006 to 2007, focused mainly on minors aged 4 to 17. The
survey collected nearly 20,000 samples of children from 11
provinces. In total, 19,892 samples, comprising 9,937 males and
9,955 females, were used in this study. Twenty-five
anthropometric dimensions were selected to compare the results
from PCA and FA. The definitions of the anthropometric dimensions
were based on ISO 8559.
B. Principal component analysis
PCA is a statistical procedure that transforms a number of
highly correlated variables into a smaller number of principal
components which account for most of the variance of the
observed variables. Usually, components with an eigenvalue larger
than one are chosen, and the proportion of the variance explained
by these components is also taken into consideration. After
that, cluster analysis was used to divide the samples into three
types.
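As an illustration of the eigenvalue-greater-than-one rule, the following minimal sketch runs PCA on the correlation matrix of a small synthetic data set. The function name `pca_kaiser` and the synthetic "height" and "girth" variables are hypothetical stand-ins; the survey data and the software used by the authors are not reproduced here.

```python
import numpy as np

def pca_kaiser(X):
    """PCA on the correlation matrix, keeping components with eigenvalue > 1."""
    # Standardize each column to zero mean and unit (sample) variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Correlation matrix of the original variables.
    R = np.corrcoef(X, rowvar=False)
    # eigh returns eigenvalues in ascending order; sort them descending.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Kaiser criterion: retain components whose eigenvalue exceeds 1.
    keep = eigvals > 1.0
    explained = eigvals[keep].sum() / eigvals.sum()
    # Component scores of every sample on the retained components.
    scores = Z @ eigvecs[:, keep]
    return eigvals, explained, scores

# Synthetic stand-in data (assumption): three "height-like" and three
# "girth-like" variables, each group driven by one latent variable.
rng = np.random.default_rng(0)
n = 500
height = rng.normal(size=(n, 1))
girth = rng.normal(size=(n, 1))
X = np.hstack([
    height + 0.2 * rng.normal(size=(n, 3)),
    girth + 0.2 * rng.normal(size=(n, 3)),
])

eigvals, explained, scores = pca_kaiser(X)
print("eigenvalues:", np.round(eigvals, 3))
print("retained components:", scores.shape[1])
print("proportion of variance explained:", round(float(explained), 4))
```

The scores returned here are the kind of per-sample values that would then be passed to the clustering step.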
C. Factor analysis
Factor analysis is a multivariate statistical analysis method
that examines the inter-relationships among a large number of
variables and extracts the underlying factors [3]. The key
factors of the body size are acquired by using factor analysis.
The interpretability of the solutions can be enhanced by
different rotation techniques. By using factor analysis and
cluster analysis, the samples were divided into three types,
according to the factor scores.
978-1-4244-5046-6/10/$26.00 © 2010 IEEE
D. Cluster analysis
Cluster analysis was used as an unsupervised data analysis
tool for classification [4]. It comprises procedures that attempt
to find natural partitions of patterns. Clustering has been
applied to population accommodation and sizing system design
[5]-[8]. For example, by using k-means clustering on 48
anthropometric body measurements, 414 male subjects drawn
from the CAESAR™ database were partitioned into five
well-defined clusters, i.e., small, medium, large, extra-large and
extra-extra-large body sizes [5]. K-means clustering was used
by Moon and Nam [6] to classify the lower body shapes of
elderly women into fewer figure types, and then to establish a
lower garment sizing system. By using Principal Component
Analysis (PCA) and k-means cluster analysis, Zheng et al. [7]
identified the under-bust girth and the breast depth width ratio
as the two most critical parameters for a bra sizing system for
Chinese women. The two parameters were drawn out of 98
measurements obtained from 3D body scanning and 5
supplementary manual measurements. Chung et al. [8] used
factor analysis to extract critical factors, i.e., girth factor, width
factor and height factor, from anthropometric dimensions. Then,
by using factor scores of the three extracted factors as
independent variables, cluster analysis was performed for the
classification of body shape for the elementary school, junior
high-school and senior high-school students. K-means
clustering was adopted in this study. It optimizes a criterion
function that minimizes the distance of each sample from the
center of the cluster to which the sample belongs.
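The criterion described above, the sum of squared distances from each sample to its cluster center, can be illustrated with a minimal sketch of the standard Lloyd iteration. The function `kmeans`, its parameters, and the synthetic two-dimensional scores are hypothetical; the paper does not state which implementation or initialization was used, and a single random start such as this one may settle in a local optimum.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd iteration; returns labels, centers and the criterion value."""
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct samples (an assumption).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its samples; keep it if empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment and criterion: total within-cluster squared distance.
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = d2[np.arange(len(points)), labels].sum()
    return labels, centers, inertia

# Synthetic stand-in for two-dimensional PCA or FA scores (assumption).
rng = np.random.default_rng(1)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in blob_centers])

labels, centers, inertia = kmeans(points, k=3)
total_ss = ((points - points.mean(axis=0)) ** 2).sum()
print("cluster sizes:", np.bincount(labels, minlength=3))
print("criterion value:", float(inertia), "total sum of squares:", float(total_ss))
```

Comparing the criterion value with the total sum of squares gives a quick check that the partition reduces within-group spread.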
However, an underlying disadvantage of k-means clustering
is that the number of clusters must be specified beforehand.
Efforts have been made to find an automatic strategy to
determine the number of clusters [9]. In practice, it is difficult
or even impossible to find a general method for determining the
number of clusters in k-means clustering. For the specific
application of sizing system design in anthropometry, it has been
argued that no single correct number of sizes exists [10]. The
number of sizes of an article should be based on the degree of
fit required for proper performance of the item [11]. Domain
knowledge should be used as much as possible to determine the
number of clusters. Furthermore, factors such as cost, the
anticipated lifetime of the sizing system, and the variety of
applications should be taken into account as well [10]. Schneider
et al. [12] selected four as an optimal number for representing
the driving population. The current garment sizing system in
China uses three sizes: small, medium, and large. For ease of
comparison, the number of clusters K was set to three in our
case study.
Based on the principal component analysis and factor
analysis above, the procedure to compare the difference
between these two methods is presented as follows.
A. Principal component analysis
PCA was applied to extract principal components from the
25 dimensions. According to Kaiser’s eigenvalue criterion,
two components were selected because their eigenvalues are
greater than 1.0, as shown in Table 1. Together, the first two
components account for 88.51% of the variance, while the
remaining components account for less, so only the first two
were retained.
Cluster analysis was used to divide samples into groups
with similar characteristics. Here the three clusters are based on
the two PCA scores of each sample. The first cluster contains
7674 samples, the second 7600, and the third 4618.
The relationship among the three clusters is shown in Figure 1.
The three lines are nearly parallel and do not cross. The
differences between the clusters are approximately equal at the
majority of the points. The most obvious difference occurs at the
height dimensions, such as eye height and shoulder height.
B. Factor analysis
The KMO measure and Bartlett's test should be examined first to
determine whether the data are suitable for factor analysis. It is
often suggested that the Kaiser-Meyer-Olkin (KMO) measure should
be at least 0.5 and that Bartlett's test should be significant.
According to the results, the KMO measure is 0.976, greater
than the minimal level. Bartlett's test of sphericity is
significant (p-value = 0.000); that is, its associated probability
is less than 0.05, which means that the correlation matrix is not
an identity matrix. The samples are therefore suitable for FA.
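Both suitability checks can be computed directly from the correlation matrix. The sketch below uses the usual textbook definitions of the KMO measure and of Bartlett's sphericity statistic; the function names and the synthetic one-factor data are hypothetical, and the paper's own values (0.976 and p = 0.000) come from the survey data, not from this code.

```python
import numpy as np

def kmo(X):
    """Kaiser-Meyer-Olkin measure of sampling adequacy (overall)."""
    R = np.corrcoef(X, rowvar=False)
    S = np.linalg.inv(R)
    # Partial correlations from the inverse correlation matrix.
    d = np.sqrt(np.diag(S))
    partial = -S / np.outer(d, d)
    off = ~np.eye(R.shape[0], dtype=bool)   # off-diagonal mask
    r2 = (R[off] ** 2).sum()
    a2 = (partial[off] ** 2).sum()
    return r2 / (r2 + a2)

def bartlett_sphericity(X):
    """Chi-square statistic and degrees of freedom for H0: R is an identity matrix."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    _, logdet = np.linalg.slogdet(R)
    chi2 = -(n - 1 - (2 * p + 5) / 6.0) * logdet
    dof = p * (p - 1) // 2
    return chi2, dof

# Synthetic stand-in data (assumption): six variables sharing one factor.
rng = np.random.default_rng(0)
n = 500
common = rng.normal(size=(n, 1))
X = common + 0.5 * rng.normal(size=(n, 6))

kmo_value = kmo(X)
chi2, dof = bartlett_sphericity(X)
print("KMO:", round(float(kmo_value), 3))
print("Bartlett chi-square:", round(float(chi2), 1), "with", dof, "degrees of freedom")
# The p-value is the upper tail of a chi-square distribution with `dof`
# degrees of freedom, e.g. scipy.stats.chi2.sf(chi2, dof) if SciPy is available.
```

For 15 degrees of freedom the 5% critical value of the chi-square distribution is about 25.0, so a statistic well above that rejects the hypothesis that the correlation matrix is an identity matrix.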
Taking the factor loadings into consideration, the first
key factor is mainly related to the height dimensions, and the
second key factor represents the girth dimensions. Only the
first two key factors were considered, because together they
explain about 88.51% of the total variance of the original
variables.
Factor loadings of the body dimensions on the first two factors
are shown in Table 2. Although the loading of the anterior
superior iliac spine height on factor 1 is larger than that of
stature, stature is used in its place, for two reasons. First,
the anterior superior iliac spine height is more difficult to
measure than stature. In addition, according to the correlation
matrix, the correlation between anterior superior iliac spine
height and stature is 0.981, so stature can represent the
anterior superior iliac spine height well. It is therefore
reasonable to identify stature and waist girth as the two most
influential dimensions for the key factors. Consequently, the
first factor was named the height factor, and the second factor
the girth factor.