Abstract
Time series clustering has proven effective in extracting useful information in various domains, and its use in temporal data mining research continues to grow. To provide an overview, this paper surveys and summarizes previous works that investigated the clustering of time series data in various application domains. The basics of time series clustering are presented, including the measures used to determine the distance between two time series being compared, whether in the form of raw data, extracted features, or model parameters. This paper also surveys the hidden Markov model as a model-based approach to clustering time series data. Past research is organized into three groups depending on whether it works directly with the raw data, either in the time or frequency domain; indirectly with features extracted from the raw data; or indirectly with models built from the raw data. The uniqueness and limitations of previous research are discussed, and several possible topics for future research are identified. The applications of time series clustering in various scientific fields are also summarized, including the sources of the data used. It is hoped that this review will serve as a stepping stone for those interested in advancing this area of research.
1. Introduction
A time series is a sequence of observations collected at regular time intervals. Time series data occur in a wide range of application areas:
• economics - e.g., monthly or yearly unemployment figures.
• finance - e.g., daily exchange rates of a currency, a share price.
• environment - e.g., daily rainfall, daily temperature, air quality readings.
• medicine - e.g., ECG or brain wave activity sampled every 2-10 seconds.
In a time series, data is collected over time for a single variable or a group of variables. Time series arise in almost all activities, and humans use them for communication, description, and visualization. Because time is a physical concept, the parameters and other characteristics of mathematical models for time series acquire real-world interpretations, which helps in the analysis and synthesis of time series.
Time series form the basis of many scientific investigations, in which medical behaviours, seasonal behaviours, trends, and changes must be studied and understood. Basic questions of scientific concern are formulated in terms of time series concepts: What is the predicted value? Which series is leading, and which is lagging? What is the signal? What is the seasonal effect? Are there new phenomena? What is the trend? How do the cycles behave? Because of the tremendous variety of possibilities, many simplifications are needed in most time series analyses.
Tasks:
Analysis of time series data has four major tasks:
1. Clustering
2. Indexing
3. Classification
4. Segmentation
Clustering finds groups of time series having similar patterns. Indexing finds the time series most similar to a given query series, ranked in order of similarity. Classification assigns each time series to a known category using a trained model. Segmentation partitions a time series. Time series data is considered multidimensional data: one observation is made per time unit, and each time unit constitutes one dimension.
Clustering:
Clustering aims to identify structure in an unlabeled data set by objectively organizing the data into homogeneous groups, where the within-group similarity is maximized and the between-group similarity is minimized. Clustering is essential when no labeled data is available, regardless of whether the data is binary, categorical, numerical, interval, spatial, temporal, spatio-temporal, ordinal, relational, textual, image, multimedia, or a mixture of these types. Data whose feature values do not change with time, or change only negligibly, is called static. The majority of clustering analyses have been performed on static data, and most clustering programs developed to date, whether as independent programs or as part of larger data analysis or data mining suites, work only with static data. Han and Kamber [1] have classified the clustering methods developed for handling static data into five categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. A description of each category follows.
Given a set of n unlabeled data objects or tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster containing at least one object and k ≤ n. If each object belongs to exactly one cluster, the partition is crisp; if an object is allowed to belong to more than one cluster to different degrees, the partition is fuzzy. Two well-known heuristic methods for crisp partitions are the k-means algorithm [2] and the k-medoids algorithm [3]. In the k-means algorithm, each cluster is represented by the mean value of the objects in the cluster; in the k-medoids algorithm [3], each cluster is represented by the most centrally located object in the cluster. Two methods for fuzzy partitions are the fuzzy c-means algorithm [4] and the fuzzy c-medoids algorithm [5]. These algorithms work well for finding spherical-shaped clusters and are efficient for small to medium data sets. Specially designed algorithms, such as the adaptive fuzzy clustering algorithm [6] and the Gustafson-Kessel algorithm, are required to find clusters with non-spherical or other complex shapes. Most time series clustering studies use partitioning methods, especially the k-means algorithm [7,8], the k-medoids algorithm [9], and the fuzzy c-means algorithm [10].
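To make the partitioning idea concrete, here is a minimal sketch of a crisp k-means procedure in plain Python. This is an illustration only, not any of the cited implementations; the function name, data set, and `seed` value are made up for the example.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means sketch: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)       # crude random initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: nearest centroid by Euclidean distance
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # update step: mean of each non-empty cluster
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, clusters

points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
centroids, clusters = kmeans(points, k=2)
```

With the four 2-D points above, the two well-separated pairs end up in separate clusters, with centroids near (0.1, 0.05) and (5.1, 5.0).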
A hierarchical clustering method groups data into a tree of clusters. There are two major types of hierarchical clustering methods: agglomerative and divisive.
Agglomerative methods start by placing each data object in its own cluster and then merge clusters into larger and larger clusters, until all objects are in a single cluster or until certain termination criteria, such as a desired number of clusters, are satisfied. Divisive methods work in exactly the opposite direction. A pure hierarchical clustering method cannot adjust its result once a merge or split decision has been executed. A current trend is to integrate other clustering techniques with hierarchical clustering to improve its quality: Chameleon [11] and CURE [12] perform careful analysis of object "linkages" at each hierarchical partitioning, whereas BIRCH [13] uses iterative relocation to refine the results obtained by hierarchical agglomeration.
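The agglomerative procedure described above can be sketched directly. The following toy single-linkage version (function name and data are illustrative, and the quadratic search for the closest pair is kept deliberately simple) merges the two closest clusters until the desired number remains:

```python
import math

def agglomerative(points, target_k):
    """Bottom-up single-linkage sketch: start with one cluster per point
    and repeatedly merge the two closest clusters until only target_k
    clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        # single-linkage distance: closest pair of member points
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge j into i
        del clusters[j]
    return clusters

pts = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,)]
result = agglomerative(pts, target_k=2)
```

Stopping at `target_k` clusters corresponds to the termination condition mentioned above; letting the loop run to a single cluster would instead record the full merge tree.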
The idea behind density-based methods such as DBSCAN [14] is to continue growing a cluster as long as the density (number of objects or data points) in its "neighborhood" exceeds some threshold. Rather than producing a clustering explicitly, OPTICS [15] computes an augmented cluster ordering for automatic and interactive cluster analysis. The ordering contains information equivalent to the density-based clusterings obtained from a wide range of parameter settings, thus overcoming the difficulty of selecting parameter values.
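A minimal sketch of the DBSCAN idea follows: clusters are grown from core points whose eps-neighborhood contains at least min_pts points, and points reachable from no core point are labelled noise. The parameter values and data here are illustrative, not from any cited study.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: grow a cluster from each unvisited core
    point; points belonging to no cluster get label -1 (noise)."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1                 # provisionally noise
            continue
        labels[i] = cluster                # i is a core point
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = neighbours(j)
            if len(nj) >= min_pts:         # j is also core: keep growing
                queue.extend(nj)
        cluster += 1
    return labels

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (9.0, 0.0)]
labels = dbscan(pts, eps=0.5, min_pts=3)
```

Here the two dense triples form clusters 0 and 1, while the isolated point at (9.0, 0.0) stays labelled -1, showing how density-based methods separate noise from clusters.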
Grid-based methods quantize the object space into a finite number of cells that form a grid structure on which all clustering operations are performed. An example of the grid-based approach is STING [16], which uses several levels of rectangular cells corresponding to different levels of resolution. Statistical information about the attributes in each cell is pre-calculated and stored. A query process usually starts at a higher level of the hierarchical structure. For each cell in the current layer, a confidence interval is calculated reflecting the cell's relevance to the given query; irrelevant cells are removed, and the query process continues to the next lower level for the remaining cells until the bottom layer is reached.
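The cell quantization and pre-computed per-cell statistics can be illustrated with a small sketch. Note that this shows only the grid idea (binning points into cells and storing summary statistics per cell); it is not the hierarchical query processing of STING itself, and the function name is invented for the example.

```python
from collections import defaultdict

def grid_summary(points, cell_size):
    """Quantize 2-D points into square cells and pre-compute per-cell
    statistics (count and mean), in the spirit of grid-based methods."""
    cells = defaultdict(list)
    for x, y in points:
        # integer cell coordinates obtained by quantizing each axis
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    summary = {}
    for key, members in cells.items():
        n = len(members)
        mx = sum(p[0] for p in members) / n
        my = sum(p[1] for p in members) / n
        summary[key] = {"count": n, "mean": (mx, my)}
    return summary

pts = [(0.2, 0.3), (0.4, 0.1), (2.5, 2.5)]
s = grid_summary(pts, cell_size=1.0)
```

All later operations would then work on the per-cell summaries rather than on the raw points, which is what makes grid-based methods fast.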
Model-based methods assume a model for each of the clusters and attempt to best fit the data to the assumed model. There are two main model-based approaches: the statistical approach and the neural network approach. An example of the statistical approach is AutoClass [17], which uses Bayesian statistical analysis to estimate the number of clusters. Two prominent neural network approaches to clustering are competitive learning, including ART [18], and self-organizing feature maps [19].
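The statistical approach can be illustrated with a toy expectation-maximization fit of a two-component, one-dimensional Gaussian mixture. Unit variances are assumed purely to keep the sketch short, and the data and function name are invented for the example; real systems such as AutoClass are far more general.

```python
import math

def em_gmm_1d(data, iters=50):
    """Toy EM for a two-component 1-D Gaussian mixture with unit
    variances: the E-step computes soft responsibilities, the M-step
    re-estimates the component means and mixing weights."""
    mu = [min(data), max(data)]            # crude initialization
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [w[k] * math.exp(-0.5 * (x - mu[k]) ** 2) for k in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: responsibility-weighted means and mixing proportions
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            w[k] = nk / len(data)
    return mu, w

data = [0.0, 0.2, -0.1, 5.0, 5.2, 4.9]
mu, w = em_gmm_1d(data)
```

The fitted means land near the centers of the two groups, and each point's responsibilities give a soft cluster assignment, which is exactly the sense in which a fitted model yields a clustering.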
2. Basics of time series clustering
Time series clustering requires a clustering algorithm or procedure to form clusters from a given set of unlabeled data objects. The choice of clustering algorithm depends both on the type of data available and on the particular purpose and application. For time series data, distinctions can be made as to whether the data is discrete-valued or real-valued, uniformly or non-uniformly sampled, univariate or multivariate, and whether the series are of equal or unequal length. Non-uniformly sampled data must be converted into uniformly sampled data before clustering operations can be performed. This can be achieved by a wide range of methods, from simple downsampling based on the coarsest sampling interval to sophisticated modeling and estimation approaches.
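One simple way, among the range of methods mentioned above, to convert a non-uniformly sampled series into a uniformly sampled one is linear interpolation onto a regular grid. The function below is an illustrative sketch; its name and the sample data are invented for the example.

```python
def resample_uniform(times, values, step):
    """Linearly interpolate an irregularly sampled series (times must be
    strictly increasing) onto a uniform grid with the given step."""
    t = times[0]
    out_t, out_v = [], []
    i = 0
    while t <= times[-1]:
        # advance to the segment [times[i], times[i+1]] containing t
        while times[i + 1] < t:
            i += 1
        frac = (t - times[i]) / (times[i + 1] - times[i])
        out_t.append(t)
        out_v.append(values[i] + frac * (values[i + 1] - values[i]))
        t += step
    return out_t, out_v

# irregular samples at t = 0, 1, 3 resampled to a unit-step grid
out_t, out_v = resample_uniform([0, 1, 3], [0, 2, 6], step=1)
```

The gap between t = 1 and t = 3 is filled with the interpolated value at t = 2, so downstream distance measures can compare series point by point on a common time grid.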
Various algorithms have been developed to cluster different types of time series data. All of them either modify existing algorithms for clustering static data so that they can handle time series, or convert the time series data into a static form so that existing algorithms can be used directly. The former approach works directly with raw time series data and is therefore called the raw-data-based approach; its major modification is replacing the distance measure for static data with one appropriate for time series. The latter approach converts raw time series data either into a feature vector of lower dimension or into a number of model parameters, and then applies a clustering algorithm to the extracted feature vectors or model parameters; it is accordingly called the feature-based or model-based approach, respectively. Fig. 1 depicts the three different approaches: raw-data-based, feature-based, and model-based. Note that in the left branch of the model-based approach, the model is trained and its parameters are used directly for clustering, without the need for another clustering algorithm.
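As an example of replacing the distance measure with one appropriate for time series, the sketch below implements dynamic time warping (DTW), a widely used elastic distance in raw-data-based approaches. It is shown as an illustration of such a replacement distance, not as the measure used by any particular cited work.

```python
def dtw(a, b):
    """Dynamic time warping distance between two numeric sequences:
    the classic O(len(a)*len(b)) dynamic program over cumulative
    alignment cost, allowing points to be stretched or compressed
    in time."""
    INF = float("inf")
    n, m = len(a), len(b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])       # local point-wise cost
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]
```

Unlike the Euclidean distance, DTW can compare series of unequal length and tolerates local shifts in time, e.g. `dtw([0, 1, 2], [0, 1, 1, 2])` is 0 even though the sequences differ point by point. Substituting such a measure into a static clustering algorithm like k-medoids is the core of the raw-data-based approach.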