High-dimensional datasets
CHAPTER 1: INTRODUCTION
High-dimensional datasets are becoming increasingly common in many application fields. Spectral imaging studies in biology and astronomy, omics data analysis in bioinformatics, and cohort studies of large groups of patients are examples where analysts have to deal with datasets that have a large number of dimensions. It is not even uncommon for such datasets to have more dimensions than data items, which makes the application of standard statistical methods substantially difficult (the "p >> n" problem). Most of the available analysis approaches are tailored to multidimensional datasets with several, but not a very large number of, dimensions, and they easily fail to provide reliable and interpretable results when the dimension count reaches the hundreds or thousands.
Dimensions can have different, difficult-to-relate scales of measure, such as categorical, discrete, and continuous. Some can be replicates of other dimensions or encode exactly the same information acquired using a different method. There can be explicit relations between the dimensions that are known a priori by the expert. Some of these relations are likely to be represented as meta-data already. Very importantly, there are usually also inherent structures between the dimensions that could be discovered with the help of computational and visual analysis, e.g., correlation relations or common distribution types. Standard methods from data mining or statistics do not consider any known heterogeneity within the space of dimensions. While this might be appropriate for certain cases where the data dimensions actually are homogeneous, it is obvious that ignoring an actually present heterogeneity must lead to analysis results of limited quality.
A natural approach to understanding high-dimensional datasets is to use multivariate statistical analysis methods, for instance Principal Component Analysis (PCA), a widely used method for dimension reduction.
At this point, the exploitation of any known structure between the dimensions can help the analyst to perform a more reliable and interpretable analysis. With an interactive visual exploration and analysis of these structures, the analyst can make informed selections of subgroups of dimensions. These groups provide sub-domains where the computational analysis can be done locally. The outcomes of such local analyses can then be merged and provide a better overall understanding of the high-dimensional dataset. Such an approach is very much in line with the goal of visual analytics, where the analyst makes decisions with the support of interactive visual analysis methods.
In this paper, we present an approach that enables a structure-aware analysis of high-dimensional datasets. We introduce the interactive visual identification of representative factors as a method to consider these structures for the interactive visual analysis of high-dimensional datasets. Our method is based on generating a manageable number of representative factors, or just factors, where each represents a sub-group of dimensions. These factors are then analyzed iteratively and together with the original dimensions. At each iteration, factors are refined or generated to provide a better representation of the relations between the dimensions.
CONTRIBUTION
We borrow ideas from factor analysis in statistics and feature selection in machine learning. Factor analysis aims at determining factors, representing groups of dimensions that are highly interrelated (correlated). Factor analysis operates solely on the correlation relation between the dimensions and does not allow the analyst to incorporate a priori information on the structure. A second inspiration for our approach is feature subset selection, where variables (dimensions) are ordered and grouped according to their relevance and usefulness to the analysis.
In order to visually analyze dimensions through the generation of factors, we make use of visualizations where the dimensions are the main visual entities. We analyze the generated factors together with the original dimensions and make them a seamless part of the analysis. Due to the iterative nature of our analysis pipeline, a number of factors can be generated and refined as results of individual iterations.
We present techniques to compare and evaluate these factors in the course of the analysis. Our factor generation mechanism can be considered both as a method to represent the aggregated information from groups of dimensions and as a method to apply computational analysis more locally, i.e., to groups of dimensions.
Altogether, we present the following contributions in this paper:
• Methods to create representative factors for different types of dimension groups
• A visual analysis methodology that jointly considers the representative factors and the original dimensions
• Methods to assess and compare factors
LITERATURE SURVEY
3.1 Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors.
3.1.1 Type of factor analysis
1. Exploratory factor analysis (EFA) is used to identify complex interrelationships among items and group items that are part of unified concepts. The researcher makes no "a priori" assumptions about relationships among factors.
2. Confirmatory factor analysis (CFA) is a more complex approach that tests the hypothesis that the items are associated with specific factors. CFA uses structural equation modeling to test a measurement model whereby loading on the factors allows for evaluation of relationships between observed variables and unobserved variables. Structural equation modeling approaches can accommodate measurement error, and are less restrictive than least-squares estimation. Hypothesized models are tested against actual data, and the analysis would demonstrate loadings of observed variables on the latent variables (factors), as well as the correlation between the latent variables.
3.1.2 Types of factoring
1. Principal component analysis (PCA): PCA is a widely used method for factor extraction, which is the first phase of EFA.[2] Factor weights are computed in order to extract the maximum possible variance, with successive factoring continuing until there is no further meaningful variance left. The factor model must then be rotated for analysis.
2. Canonical factor analysis, also called Rao's canonical factoring, is a different method of computing the same model as PCA, which uses the principal axis method. Canonical factor analysis seeks factors which have the highest canonical correlation with the observed variables. Canonical factor analysis is unaffected by arbitrary rescaling of the data.
3. Common factor analysis, also called principal factor analysis (PFA) or principal axis factoring (PAF), seeks the least number of factors which can account for the common variance (correlation) of a set of variables.
4. Image factoring: based on the correlation matrix of predicted variables rather than actual variables, where each variable is predicted from the others using multiple regression.
5. Alpha factoring: based on maximizing the reliability of factors, assuming variables are randomly sampled from a universe of variables. All other methods assume cases to be sampled and variables fixed.
6. Factor regression model: a combinatorial model of factor model and regression model; or alternatively, it can be viewed as the hybrid factor model,[3] whose factors are partially known.
3.2 Principal Component Analysis (PCA): a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
3.3 Relation between PCA and Factor Analysis
Principal component analysis creates variables that are linear combinations of the original variables. The new variables have the property that they are all orthogonal. The principal components can be used to find clusters in a set of data. PCA is a variance-focused approach seeking to reproduce the total variable variance, in which components reflect both common and unique variance of the variable. PCA is generally preferred for purposes of data reduction (i.e., translating variable space into optimal factor space) but not when the goal is to detect the latent constructs or factors.
Factor analysis is similar to principal component analysis, in that factor analysis also involves linear combinations of variables. Different from PCA, factor analysis is a correlation-focused approach seeking to reproduce the inter-correlations among variables, in which the factors "represent the common variance of variables, excluding unique variance". Factor analysis is generally used when the research purpose is detecting data structure (i.e., latent constructs or factors) or causal modeling.
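To make the contrast concrete, here is a minimal sketch (assuming Python with NumPy and scikit-learn; the synthetic data matrix X and all variable names are illustrative, not taken from the paper) that extracts two PCA components and two varimax-rotated factors from the same standardized data:

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Illustrative data: n observations (rows) x p correlated variables (columns)
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))                      # two hidden constructs
X = latent @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(200, 6))

Xz = StandardScaler().fit_transform(X)                  # z-standardize before extraction

# PCA: orthogonal components that reproduce the total variance
pca = PCA(n_components=2).fit(Xz)
print("PCA explained variance ratio:", pca.explained_variance_ratio_)

# Common factor analysis: factors that reproduce the inter-correlations
# (varimax rotation makes the loadings easier to interpret)
fa = FactorAnalysis(n_components=2, rotation="varimax").fit(Xz)
print("Factor loadings (p x k):")
print(fa.components_.T)
```

In such a comparison, the PCA components account for the total variance of the variables, whereas the factor loadings aim to reproduce only the shared correlation structure.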
3.4 Statistical Model:
Definition: Suppose we have a set of $p$ observable random variables $x_1, \dots, x_p$ with means $\mu_1, \dots, \mu_p$. Suppose that, for some unknown constants $l_{ij}$ and $k$ unobserved random variables $F_j$, where $i \in \{1, \dots, p\}$ and $j \in \{1, \dots, k\}$ with $k < p$, we have
$$x_i - \mu_i = l_{i1} F_1 + \dots + l_{ik} F_k + \varepsilon_i.$$
Here, the $\varepsilon_i$ are independently distributed error terms with zero mean and finite variance, which may not be the same for all $i$. In matrix terms, with $x = (x_1, \dots, x_p)^T$, $\mu = (\mu_1, \dots, \mu_p)^T$, $L = (l_{ij})$, $F = (F_1, \dots, F_k)^T$, and $\varepsilon = (\varepsilon_1, \dots, \varepsilon_p)^T$, we have
$$x - \mu = L F + \varepsilon.$$
3.4.1 Basic Descriptive Statistics:
• Skewness is a measure of the extent to which a probability distribution of a real-valued random variable "leans" to one side of the mean. The skewness value can be positive or negative, or even undefined.
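For reference, a standard estimator of the sample skewness of a dimension with values $x_1, \dots, x_n$ and mean $\bar{x}$ is
$$g_1 = \frac{\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}},$$
with positive values indicating a longer right tail and negative values a longer left tail.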
MATHEMATICAL MODEL
4.1 COMPUTATIONAL AND STATISTICAL TOOLBOX
In order to make the distinction easier, the visualizations with a blue background are visualizations of data items and those with a yellow background are visualizations of the dimensions. For the construction of the factors, we determine a selection of computational tools and statistics that can help us to analyze the structure of the dimensions space.
As one building block, we use a selection of statistics to populate several columns of the S table. In order to summarize the distributions of the dimensions, we estimate several basic descriptive statistics. For each dimension d, we estimate the mean (μ), the standard deviation (σ), the skewness (skew) as a measure of symmetry, the kurtosis (kurt) to represent peakedness, and the quartiles (Q1−4) that divide the ordered values into four equally sized buckets. We also include robust estimates of the center and the spread of the data, namely the median (med) and the inter-quartile range (IQR). Additionally, we compute the count of unique values (uniq) and the percentage of univariate outliers (%out) in a dimension. The uniq values are usually higher for continuous dimensions and lower for categorical dimensions. We use a method based on robust statistics to determine the %out values. One common measure to study the relation between dimensions is the correlation between them. Correlation values are in the range [-1, +1], where -1 indicates a perfect negative and +1 a perfect positive correlation.
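A minimal sketch of how such a per-dimension statistics table S could be populated (assuming Python with NumPy, SciPy, and pandas; the helper name dimension_statistics and the Tukey-style 1.5 × IQR outlier rule are illustrative choices, not prescribed by the text):

```python
import numpy as np
import pandas as pd
from scipy import stats

def dimension_statistics(data: pd.DataFrame) -> pd.DataFrame:
    """Build a p x k table S with one row of summary statistics per dimension."""
    rows = {}
    for name, col in data.items():
        x = col.dropna().to_numpy(dtype=float)
        q1, med, q3 = np.percentile(x, [25, 50, 75])
        iqr = q3 - q1
        # Illustrative robust outlier rule: outside the Tukey fences [Q1-1.5*IQR, Q3+1.5*IQR]
        outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
        rows[name] = {
            "mean": x.mean(), "std": x.std(ddof=1),
            "skew": stats.skew(x), "kurt": stats.kurtosis(x),
            "Q1": q1, "med": med, "Q3": q3, "IQR": iqr,
            "uniq": np.unique(x).size,
            "%out": 100.0 * outliers.mean(),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# S = dimension_statistics(dataset)   # dataset: n items (rows) x p dimensions (columns)
```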
4.2 FACTOR CONSTRUCTION
The machine learning and data mining literature provides valuable methods and concepts for feature selection and feature extraction (a feature is generally called an attribute in data mining). Feature selection is the process of selecting a subset of relevant features for use in model construction; feature subset selection methods try to find dimensions that are more relevant and useful by evaluating them with respect to certain measures. Feature extraction is applied when the input data to an algorithm is too large to be processed and is suspected to be highly redundant; the input data is then transformed into a reduced representation set of features, usually by mapping the data to a lower-dimensional space.
There are three different methods to construct representative factors using a combination of feature extraction and selection techniques. Each factor construction method is a mapping from a subset of dimensions D_1 to a representative factor D_R. The mapping can be denoted as f : D_1 → D_R, where D_1 ∈ 2^D. The t dimensions that are represented by D_R are denoted as d_R1, ..., d_Rt.
Each factor creation is followed by a step where we compute a number of statistics for D_R and add these values to the S table. In other words, we extend the D table with a D_R column and the S table with a row associated with D_R.
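As a minimal sketch of one such construction (assuming Python with pandas and scikit-learn, taking D_R as the first principal component of the standardized group and reusing the illustrative dimension_statistics helper from Section 4.1; the function and table names are hypothetical):

```python
import pandas as pd
from sklearn.decomposition import PCA

def add_projection_factor(D: pd.DataFrame, S: pd.DataFrame, group: list[str], name: str):
    """Map a dimension subset D_1 to a representative factor D_R and register it.

    D_R is taken here as the first principal component of the standardized
    group -- one possible construction, used purely as an illustration.
    """
    sub = D[group].astype(float)
    sub = (sub - sub.mean()) / sub.std(ddof=1)              # z-standardize the group
    d_R = PCA(n_components=1).fit_transform(sub.to_numpy()).ravel()

    D[name] = d_R                                           # extend the D table with a D_R column
    S.loc[name] = dimension_statistics(D[[name]]).loc[name] # add a row for D_R to the S table
    return D, S

# D, S = add_projection_factor(dataset, S, ["dim_a", "dim_b", "dim_c"], "D_R_group1")
```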
CHAPTER 5: PROBLEM DEFINITION: OBJECTIVE, ASSUMPTION
Datasets with a large number of dimensions per data item (hundreds or more) are challenging for both computational and visual analysis. In this paper, we extract knowledge from such large and complex datasets through visual analysis, which is a complicated task. Moreover, these dimensions have different characteristics and relations that result in sub-groups and/or hierarchies over the set of dimensions. Such structures lead to heterogeneity within the dimensions.
5.1 REPRESENTATIVE FACTORS
In order to achieve a structure-aware analysis of the data, we represent the underlying structures with representative factors, or simply factors. We then analyze and evaluate these factors together with the original data to achieve a more informed use of the computational analysis tools.
5.2 Construction of Dataset Table:
Our method operates (in addition to the original dataset) on a data table dedicated specifically to the dimensions. We construct this dimensions-related data table by combining a set of derived statistics with available meta-data on the dimensions. In order to achieve this, we assign a feature vector to each dimension, where each value is a computed statistic/property or some meta-data about this dimension. If we consider the original dataset to consist of n items (rows) and p dimensions (columns), the derived data table has a size of p × k, i.e., each dimension has k values associated with it. The set of dimensions is denoted as D and the new dimension-properties table as S.
Through a visual analysis of S, we determine structures within the dimensions that then result in a number of sub-groups. We represent these sub-groups of dimensions with representative factors and assign feature vectors to these factors by computing certain features, e.g., statistics. Factors serve both as a data aggregation and as a method to apply computational tools locally and to represent their results in a common frame together with the original dimensions.
6 EVALUATION OF FACTORS
6.1 Evaluation of the representatives: The evaluation and a more quantitative comparison of the factors are an essential part of a representative-factor-based analysis pipeline as presented here.
The first method is related to the correlation-based coloring of the factors and the represented dimensions. As an inherent part of the factor generation, we compute the Pearson correlation between D_R and each of the dimensions d_Ri that it represents. The result is a set of t values corr_R, where each value is in the range [-1, +1]. We color-code these pieces of correlation information in the views using two different color maps (Figure 3-e). Firstly, we represent the aggregated correlation values as shades of green. For each D_R, we find the average of the absolute values of corr_R. More saturated green represents higher levels of correlation (either positive or negative) and paler green represents lower levels. Secondly, we encode the individual values of corr_R when a factor is expanded. Each represented dimension d_Ri is colored according to its correlation with D_R. Here, we use a second color map where negative correlations are depicted with blue and positive correlations with red.
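A minimal sketch of this correlation-based coloring (Python with NumPy and Matplotlib; the concrete colormaps Greens and coolwarm are illustrative stand-ins for the two color maps described above, and the function name is hypothetical):

```python
import numpy as np
from matplotlib import cm, colors

def factor_colors(corr_R: dict[str, float]):
    """Derive the two color encodings described above from the corr_R values."""
    vals = np.array(list(corr_R.values()))

    # 1) Collapsed factor: mean |corr| mapped to a shade of green
    agg = np.abs(vals).mean()                       # in [0, 1]
    factor_color = cm.Greens(agg)                   # more saturated = higher correlation

    # 2) Expanded factor: each dimension colored from blue (negative) to red (positive)
    norm = colors.Normalize(vmin=-1.0, vmax=1.0)
    dim_colors = {dim: cm.coolwarm(norm(v)) for dim, v in corr_R.items()}
    return factor_color, dim_colors
```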
The second mechanism to evaluate the factors is called profile plots. When the set of statistics associated with the dimensions is considered, factors do not represent all of the properties equally.
ANALYTICAL PROCESS
The structure-aware analysis of the dimensions space through the use of these factors involves a number of steps. In the following, we go through the steps and exemplify them in the analysis of the ECG data.
Step 1: Handling missing data – Missing data are often marked prior to the analysis and available as meta-data. It is important to handle missing data properly. We employ a simple approach here and replace the missing values of continuous dimensions with the mean value of the dimension, prior to the normalization step. Similarly, in the case of categorical data, we replace the missing values with the mode of the dimension, i.e., its most frequent value. Moreover, we store the number of missing values for each dimension in S for further reference.
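A minimal sketch of this imputation step (Python with pandas; distinguishing continuous from categorical dimensions by the column dtype is an assumption made for this example, not the rule used in the text):

```python
import pandas as pd

def impute_missing(D: pd.DataFrame, S: pd.DataFrame) -> pd.DataFrame:
    """Replace missing values before normalization and record the counts in S."""
    S["n_missing"] = D.isna().sum()                   # per-dimension missing counts
    for name in D.columns:
        col = D[name]
        if pd.api.types.is_numeric_dtype(col):
            D[name] = col.fillna(col.mean())          # continuous: impute with the mean
        else:
            D[name] = col.fillna(col.mode().iloc[0])  # categorical: impute with the mode
    return D
```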
Step 2: Informed normalization – Normalization is an essential step in data analysis to make the dimensions comparable and suitable for computational analysis. Different data scales require different types of normalization (e.g., for categorical variables, scaling to the unit interval can be suitable, but not z-standardization) and different analysis tools require different normalizations, e.g., z-standardization is preferred prior to PCA. We enable three different normalization options, namely scaling to the unit interval [0,1], z-standardization, and robust z-standardization. In the robust version, we use med as the robust estimate of the distribution's center and IQR for its spread. In order to determine which normalization is suitable for the dimensions, we compute certain statistics, namely uniq, pVal_shp, and %out, prior to normalization. We visualize uniq vs. %out (Figure 6-a) to determine the groups of dimensions that are suitable for different types of normalization. Dimensions with low uniq values (marked with 1 in the figure) are usually categorical, and scaling to the unit interval is suitable for them.
Dimensions with higher uniq values (marked 2) are more suitable for z-standardization, and for those dimensions that contain a larger percentage of one-dimensional outliers (marked 3), a robust normalization is preferable. We normalize the same sub-group of dimensions using all three methods and apply PCA separately on the three differently normalized groups. Figure 6-b shows the resulting first two principal component factors. We observe that the non-robust and robust normalizations resulted in similar outputs; the unit scaling, however, resulted in PCs that carry lower variance.
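The three normalization options could be sketched as follows (Python/NumPy; the choice among them is left to the uniq vs. %out inspection described above):

```python
import numpy as np

def unit_scale(x: np.ndarray) -> np.ndarray:
    """Scale to the unit interval [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def z_standardize(x: np.ndarray) -> np.ndarray:
    """Classical z-standardization using the mean and standard deviation."""
    return (x - x.mean()) / x.std(ddof=1)

def robust_z_standardize(x: np.ndarray) -> np.ndarray:
    """Robust variant using the median (med) and inter-quartile range (IQR)."""
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return (x - med) / (q3 - q1)
```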
Step 3: Factor generation – In this step, we analyze the structures in the dimensions space, firstly with the help of the meta-data information. Each channel in the ECG data has 22 associated dimensions; however, we select a sub-group of these features (the continuous dimensions that have larger uniq values) and then construct projection factors for each channel. We choose to represent each channel only by its first principal component. The resulting groups are then displayed on a uniq vs. %out plot (Figure 7).
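This step could be sketched as a loop over the channel groups given by the meta-data (reusing the hypothetical add_projection_factor helper from Section 4.2; the channel and dimension names below are purely illustrative, since the actual ECG dimension names are not listed here):

```python
# Illustrative meta-data: channel name -> continuous dimensions with large uniq values
channel_groups = {
    "V1": ["V1_amp", "V1_dur", "V1_area"],
    "V2": ["V2_amp", "V2_dur", "V2_area"],
}

for channel, dims in channel_groups.items():
    # Represent each channel only by the first principal component of its group
    D, S = add_projection_factor(D, S, dims, name=f"D_R_{channel}")
```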
Step 4: Evaluating and refining factors iteratively – In Figure 7-1, we notice that the factor representing the V2 channel (denoted as D_R^V2) has a higher percentage of 1D outliers. This is interpreted as a sign of an irregular distribution of items in this factor, and we decide to analyze it further. First, we look at the items in a scatterplot of the first two components of D_R^V2 and clearly see that there are two separate groups (Figure 7-2).
ANALYSIS OF HEALTHY BRAIN AGING STUDY DATA
In this use case we analyze data related to a longitudinal study of cognitive aging. The participants in the study were healthy individuals, recruited through advertisements in local newspapers. Individuals with known neurological diseases were excluded before the study. All participants took part in a neuropsychological examination and a multimodal imaging procedure, with about 7 years between the first and third wave of the study. One purpose of the study was to investigate the association between specific, image-derived features and cognitive functions in healthy aging. The neuropsychological examination covered tests of motor function, attention/executive function, visual cognition, memory, and verbal function. The participants' results on these tests were evaluated by a group of neuropsychologists.
The dataset covers 83 healthy individuals with the measurements from the first wave of the study in 2005. For each subject, a weighted image was segmented into 45 anatomical regions, and 7 different measures were extracted for each region. The 7 features associated with each brain region are the number of voxels, the volume, and the mean, standard deviation, minimum, maximum, and range of the intensity values in the region. This information on the brain regions and the features is represented in the meta-data file, which is then used in the analysis. This operation creates 45 × 7 = 315 dimensions per subject. In addition, details about each individual, such as age and gender, and the results of the neuropsychological examination are added to this dataset. With this addition, the resulting dataset has 357 dimensions. In other words, the resulting table's size is 83 × 357, a great challenge for visual as well as computational analysis. Such a high dimensionality usually requires analysts to delimit the analysis to a selected subset of segments, based on an a priori specified hypothesis. Our aim here is to discover different subsets of individuals and brain regions that are relevant for building new hypotheses.
Four tests that are common in studies of brain aging were included:
• A vocabulary test involved the examinee selecting the best synonyms of target words, in each case from a set of five alternatives.
• A speed test required the participant to classify pairs of line patterns as the same or different as rapidly as possible.
• Reasoning was assessed with the Raven’s Progressive Matrices, in which each test item consists of a matrix of geometric patterns with one missing cell, and the task for the participant is to select the best completion of the missing cell from a set of alternatives.
• Finally, a memory test involved three auditory presentations of the same list of unrelated words, with the participant instructed to recall as many words as possible after each presentation.
CONCLUSIONS
With our method, we show how the structures in high-dimensional datasets can be incorporated into the visual analysis process. We introduce representative factors as a method to apply computational tools locally and as an aggregated representation for sub-groups of dimensions. A combination of the already available information and the derived features on the dimensions is utilized to discover the structures in the dimensions space.
We suggest three different approaches to generate representatives for groups with different characteristics. These factors are then compared and evaluated through different interactive visual representations. We mainly use dimension reduction methods locally to extract the information from the sub-structures. Our goal is not to solely assist dimension reduction but rather to enable an informed use of dimension reduction methods at different levels to achieve a better understanding of the data.
The usual workflow when dealing with such complex datasets is to delimit the analysis based on known hypotheses and try to confirm or reject these using computational and visual analysis. Our interactive visual analysis scheme proved to be helpful in exploring new relations between the dimensions that can provide a basis for the generation of new hypotheses.