
A robust missing value imputation method for noisy data

Abstract

Missing data imputation is an important research
topic in data mining. The impact of noise is seldom considered
in previous works while real-world data often contain
much noise. In this paper, we systematically investigate
the impact of noise on imputation methods and propose
a new imputation approach by introducing the mechanism
of Group Method of Data Handling (GMDH) to deal
with incomplete data with noise.

Introduction

Data in business are often corrupted by missing values,
especially the data collected from surveys. For example,
consumer data obtained from questionnaires usually contain
missing values because consumers refuse to answer
some sensitive questions (e.g., income level, age) or
simply have no opinion about them. Industrial
databases are another data source which contains
a lot of missing data. The databases maintained by Honeywell,
for instance, have more than 50% of their items
(or values) missing, despite great efforts taken in data collection
[23]. Such nonresponses complicate the data mining
process because most data mining algorithms cannot be immediately
and straightforwardly applied to incomplete data.
The simplest method to deal with missing data is data reduction,
which deletes the instances with missing values.

Related work

Methods to deal with missing values are not something new.
In 1976, Rubin developed a framework of inference from
incomplete data that is still in use today [38]. Since then,
many researchers have entered this area and proposed a
great number of methods. The imputation methods can
be roughly classified into six categories, including the following:
• Mean substitution: This is the simplest imputation method.
It replaces the missing values with the mean of all the observed
values, or of a subgroup of them, for the same variable. It is fast,
simple and easy to implement.
• Hot-deck imputation [14]: In hot-deck imputation,
missing values are recovered from similar cases drawn
from the same dataset. It is often used to handle missing
survey data.
• Regression imputation [7]: Regression imputation uses regression
models to predict missing values. Many forms of
regression models can be used for regression imputation,
such as linear regression, logistic regression and semiparametric
regression [36].
• EM imputation: EM imputation is based on the
Expectation-Maximization (EM) algorithm proposed by
Dempster, Laird and Rubin [10]. It uses the iterative procedure
of the EM algorithm to calculate the sufficient statistics
and estimate the parameters. The missing values
are estimated as part of this process.
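
The first and third categories above can be illustrated with a small
sketch. This is only an illustration under stated assumptions, not code
from the paper: missing entries are marked with None, the function names
mean_impute and regression_impute are hypothetical, and the regression
variant assumes a single fully observed predictor x with at least two
complete pairs that differ in x.

```python
from statistics import mean


def mean_impute(column):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def regression_impute(x, y):
    """Fill None entries of y from a least-squares line fitted on complete pairs.

    Assumption: x is fully observed and the complete pairs contain
    at least two distinct x values.
    """
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    mx = mean(a for a, _ in pairs)
    my = mean(b for _, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    slope = sxy / sxx
    intercept = my - slope * mx
    # Observed values are kept; only the gaps are predicted from x.
    return [intercept + slope * a if b is None else b for a, b in zip(x, y)]
```

Mean substitution ignores the other variables entirely, while the
regression variant uses the relationship between x and y, which is why
the two can give quite different values for the same gap.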

Algorithm of RIBG

The main idea of RIBG is to use the GMDH mechanism to
impute missing data, in the hope that it will give more accurate
imputation results than traditional imputation approaches
even when the data contain noise. Let us consider an incomplete
dataset D with r variables, D = {A1, A2, ..., Ar}. RIBG
first fills in the original incomplete dataset D by simple
mean imputation to get an initial complete dataset. We use
mean imputation to initially impute the missing values because
it has been shown to be an efficient pre-imputation
method [12]. Then we use the mechanism of GMDH to predict
and update these initial estimates of the missing values through
an iterative process.
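
The two-stage structure described above (mean pre-imputation followed by
iterative prediction and updating) can be sketched as follows. This is
not the RIBG algorithm itself: the excerpt does not give the details of
the GMDH step, so a one-variable least-squares line, chosen per variable
by lowest squared error on the observed rows, is swapped in as a simple
stand-in predictor. The names fit_line and iterative_impute, the use of
None for missing entries, and the fixed iteration count are all
assumptions made for illustration. The sketch also assumes at least two
variables and at least one observed value per variable.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept; falls back to the mean of ys."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my  # constant predictor carries no information
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def iterative_impute(rows, n_iter=10):
    """rows: list of equal-length lists; None marks a missing entry."""
    n_rows, n_cols = len(rows), len(rows[0])
    missing = [[v is None for v in row] for row in rows]

    # Stage 1: mean pre-imputation gives an initial complete dataset.
    col_means = [mean(r[j] for r in rows if r[j] is not None)
                 for j in range(n_cols)]
    data = [[col_means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]

    # Stage 2: repeatedly re-predict the initially imputed entries.
    for _ in range(n_iter):
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if not missing[i][j]]
            mis = [i for i in range(n_rows) if missing[i][j]]
            if not mis:
                continue
            ys = [data[i][j] for i in obs]
            best = None
            for k in range(n_cols):
                if k == j:
                    continue
                xs = [data[i][k] for i in obs]
                slope, icpt = fit_line(xs, ys)
                err = sum((icpt + slope * x - y) ** 2 for x, y in zip(xs, ys))
                if best is None or err < best[0]:
                    best = (err, k, slope, icpt)
            _, k, slope, icpt = best
            for i in mis:
                data[i][j] = icpt + slope * data[i][k]
    return data
```

Because the predictor columns may themselves contain imputed values, each
pass can change the fitted lines, which is what makes the update
iterative rather than a single regression step.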

Conclusions

In this paper, we systematically studied the impact of noise
on missing value imputation methods when noise and missing
values are distributed throughout the dataset. By observing
the behavior of the different imputation methods at different
noise levels, we concluded that noise has a
strong negative effect on imputation methods, especially
when the noise level is high. We also designed a
robust method, RIBG, based on GMDH to impute missing
values in noisy environments. Comparative studies have
shown that RIBG performs quite well in comparison with
four other popular imputation methods in the presence of
noise.