19-07-2013, 02:47 PM
Data Preprocessing
Introduction
Today's real-world databases are highly susceptible to noisy, missing, and inconsistent data, owing to their typically
huge size, often several gigabytes or more. How can the data be preprocessed in order to help improve the quality of
the data, and consequently, of the mining results? How can the data be preprocessed so as to improve the efficiency
and ease of the mining process?
There are a number of data preprocessing techniques. Data cleaning can be applied to remove noise and correct
inconsistencies in the data. Data integration merges data from multiple sources into a coherent data store, such
as a data warehouse or a data cube. Data transformations, such as normalization, may be applied. For example,
normalization may improve the accuracy and efficiency of mining algorithms involving distance measurements. Data
reduction can reduce the data size by aggregating, eliminating redundant features, or clustering, for instance. These
data preprocessing techniques, when applied prior to mining, can substantially improve the overall data mining results.
In this chapter, you will learn methods for data preprocessing. These methods are organized into the following
categories: data cleaning, data integration and transformation, and data reduction. The use of concept hierarchies
for data discretization, an alternative form of data reduction, is also discussed.
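The normalization step mentioned above can be sketched briefly. The following is an illustrative example of min-max normalization, which rescales an attribute to a fixed range (here [0, 1]) so that a distance-based mining algorithm is not dominated by attributes with large value ranges; the income figures are made-up sample values.

```python
# Min-max normalization: map each value v in [old_min, old_max]
# linearly onto [new_min, new_max].
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min
            for v in values]

incomes = [30000, 45000, 60000, 90000]  # illustrative attribute values
print(min_max_normalize(incomes))  # [0.0, 0.25, 0.5, 1.0]
```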
Why preprocess the data?
Imagine that you are a manager at AllElectronics and have been charged with analyzing the company's data with
respect to the sales at your branch. You immediately set out to perform this task. You carefully inspect
the company's database or data warehouse, identifying and selecting the attributes or dimensions to be included
in your analysis, such as item, price, and units sold. Alas! You note that several of the attributes for various
tuples have no recorded value. For your analysis, you would like to include information as to whether each item
purchased was advertised as on sale, yet you discover that this information has not been recorded. Furthermore,
users of your database system have reported errors, unusual values, and inconsistencies in the data recorded for some
transactions. In other words, the data you wish to analyze by data mining techniques are incomplete (lacking
attribute values or certain attributes of interest, or containing only aggregate data), noisy (containing errors, or
outlier values that deviate from the expected), and inconsistent (e.g., containing discrepancies in the department
codes used to categorize items).
Inconsistent data
There may be inconsistencies in the data recorded for some transactions. Some data inconsistencies may be corrected
manually using external references. For example, errors made at data entry may be corrected by performing a
paper trace. This may be coupled with routines designed to help correct the inconsistent use of codes. Knowledge
engineering tools may also be used to detect the violation of known data constraints. For example, known functional
dependencies between attributes can be used to find values contradicting the functional constraints.
There may also be inconsistencies due to data integration, where a given attribute can have different names in
different databases.
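The functional-dependency check described above can be sketched as follows. This is a hypothetical illustration, not the book's implementation: it assumes a dependency item_code -> dept_code (each item code should determine exactly one department code) and flags tuples that contradict the department code seen earlier for the same item.

```python
def fd_violations(rows, lhs, rhs):
    """Return tuples whose `rhs` value contradicts the `rhs` value
    recorded earlier for the same `lhs` key (lhs -> rhs is violated)."""
    seen = {}
    violations = []
    for row in rows:
        key, value = row[lhs], row[rhs]
        if key in seen and seen[key] != value:
            violations.append(row)  # contradicts an earlier tuple
        else:
            seen.setdefault(key, value)
    return violations

# Illustrative transactions; the third tuple violates item_code -> dept_code.
sales = [
    {"item_code": "TV-01", "dept_code": "A"},
    {"item_code": "TV-01", "dept_code": "A"},
    {"item_code": "TV-01", "dept_code": "B"},
]
print(fd_violations(sales, "item_code", "dept_code"))
```

Such flagged tuples would then be corrected manually, e.g., by the paper trace mentioned above.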
Data cube aggregation
Imagine that you have collected the data for your analysis. These data consist of the AllElectronics sales per quarter,
for the years 1997 to 1999. You are, however, interested in the annual sales (the total per year), rather than the total
per quarter. Thus the data can be aggregated so that the resulting data summarize the total sales per year instead of
per quarter. This aggregation is illustrated in Figure 3.4. The resulting data set is smaller in volume, without loss
of information necessary for the analysis task.
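The aggregation above can be sketched in a few lines. The quarterly figures below are made up for illustration; the point is only that the annual totals carry all the information the analysis needs in far fewer tuples.

```python
from collections import defaultdict

# Illustrative quarterly sales tuples: (year, quarter, sales).
quarterly_sales = [
    (1997, "Q1", 224.00), (1997, "Q2", 408.00),
    (1997, "Q3", 350.00), (1997, "Q4", 586.00),
    (1998, "Q1", 281.00), (1998, "Q2", 402.00),
    (1998, "Q3", 373.00), (1998, "Q4", 556.00),
]

# Aggregate: sum the quarterly amounts within each year.
annual_sales = defaultdict(float)
for year, _quarter, amount in quarterly_sales:
    annual_sales[year] += amount

print(dict(annual_sales))  # one tuple per year instead of four
```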
Data cubes were discussed in Chapter 2. For completeness, we briefly review some of that material here. Data
cubes store multidimensional, aggregated information. For example, Figure 3.5 shows a data cube for multidimensional
analysis of sales data with respect to annual sales per item type for each AllElectronics branch. Each cell holds
an aggregate data value, corresponding to a data point in multidimensional space. Concept hierarchies may exist
for each attribute, allowing the analysis of data at multiple levels of abstraction. For example, a hierarchy for
branch could allow branches to be grouped into regions, based on their address. Data cubes provide fast access to
precomputed, summarized data, thereby benefiting on-line analytical processing as well as data mining.
The cube created at the lowest level of abstraction is referred to as the base cuboid. A cube for the highest level
of abstraction is the apex cuboid.
Dimensionality reduction
Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task, or
redundant. For example, if the task is to classify customers as to whether or not they are likely to purchase a popular
new CD at AllElectronics when notified of a sale, attributes such as the customer's telephone number are likely to be
irrelevant, unlike attributes such as age or music taste. Although it may be possible for a domain expert to pick out
some of the useful attributes, this can be a difficult and time-consuming task, especially when the behavior of the
data is not well known (hence, a reason behind its analysis!). Leaving out relevant attributes, or keeping irrelevant
attributes may be detrimental, causing confusion for the mining algorithm employed. This can result in discovered
patterns of poor quality. In addition, the added volume of irrelevant or redundant attributes can slow down the
mining process.
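One crude screen suggested by the telephone-number example can be sketched as follows. This is an illustrative assumption, not the chapter's method: attributes that are (nearly) unique per tuple, such as a phone number, act as identifiers and cannot generalize, so they can be dropped before mining. The threshold and customer records are hypothetical.

```python
def near_key_attributes(records, threshold=0.9):
    """Return attribute names whose distinct-value ratio exceeds threshold
    (i.e., attributes that are almost a key and thus likely irrelevant)."""
    n = len(records)
    return [attr for attr in records[0]
            if len({r[attr] for r in records}) / n > threshold]

# Illustrative customer tuples for the CD-purchase classification task.
customers = [
    {"phone": "555-0101", "age": 25, "buys_cd": "yes"},
    {"phone": "555-0102", "age": 32, "buys_cd": "no"},
    {"phone": "555-0103", "age": 25, "buys_cd": "yes"},
    {"phone": "555-0104", "age": 41, "buys_cd": "no"},
]
print(near_key_attributes(customers))  # ['phone']
```

A filter like this only removes identifier-like attributes; choosing among the remaining attributes still calls for the relevance analysis a domain expert or a selection algorithm would perform.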