25-10-2016, 09:42 AM
Abstract: Anonymized data publication has received considerable attention from the research community in recent
years. For numerical sensitive attributes, most of the existing privacy-preserving data publishing techniques
concentrate on microdata with multiple categorical sensitive attributes or only one numerical sensitive attribute.
However, many real-world applications can contain multiple numerical sensitive attributes. Directly applying the
existing privacy-preserving techniques designed for a single numerical sensitive attribute or for multiple categorical sensitive attributes
often causes unexpected disclosure of private information. These techniques are particularly prone to
the proximity breach, which is a privacy threat specific to numerical sensitive attributes in data publication. In
this paper, we propose a privacy-preserving data publishing method, namely MNSACM, which uses the ideas of
clustering and Multi-Sensitive Bucketization (MSB) to publish microdata with multiple numerical sensitive attributes.
We use an example to show the effectiveness of this method in privacy protection when using multiple numerical sensitive attributes.
1 Introduction
The collection of digital information by governments,
corporations, and individuals has enabled knowledge
discovery and information-based decision making.
Publishing data for analysis from a table containing
personal records, while maintaining individual privacy,
is becoming a problem of increasing importance
today. The objective is to limit the privacy disclosure risk to an acceptable level while maximizing the
benefit due to publication of the data. The traditional
approach of anonymization is to remove identification
fields such as social security number and name. A
common anonymization approach is generalization,
which replaces quasi-identifier values with values
that are less-specific but semantically consistent. As
a result, more records will have the same set of
quasi-identifier values. In 2002, Sweeney[1] proposed
the k-anonymity model for privacy protection where
the corresponding attributes that leak information are
suppressed or generalized so that, for every record
in the modified table, there are at least k − 1 other
records that have exactly the same values for the quasi-identifiers.
There are many successful applications[2–6]
based on k-anonymity. However, while k-anonymity
protects data against identity disclosure, it is insufficient
to prevent attribute disclosure.
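The k-anonymity condition described above can be checked mechanically. The following is a minimal sketch, assuming records are stored as dictionaries; the function name `is_k_anonymous`, the column names, and the sample values are hypothetical and are not taken from the paper.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values is
    shared by at least k records, i.e. each record has at least k - 1
    other records with exactly the same quasi-identifier values."""
    counts = Counter(
        tuple(record[attr] for attr in quasi_identifiers) for record in records
    )
    return all(count >= k for count in counts.values())


# Hypothetical generalized table (illustrative values only).
table = [
    {"age": "20-29", "zipcode": "100**", "salary": 985},
    {"age": "20-29", "zipcode": "100**", "salary": 1030},
    {"age": "30-39", "zipcode": "200**", "salary": 2100},
    {"age": "30-39", "zipcode": "200**", "salary": 1950},
]

print(is_k_anonymous(table, ["age", "zipcode"], k=2))
print(is_k_anonymous(table, ["age", "zipcode"], k=3))
```

Note that the check only looks at quasi-identifier values; it says nothing about how the sensitive values inside a group are distributed, which is why k-anonymity alone does not prevent attribute disclosure.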
To address this limitation of k-anonymity,
Machanavajjhala et al.[7] introduced a new notion of privacy, called l-diversity, which requires that the
distribution of a sensitive attribute in each equivalence
class has at least l “well represented” values. Li et al.[8]
proposed a novel privacy notion called t-closeness,
which requires that the distribution of a sensitive
attribute in any equivalence class is close to the
distribution of the attribute in the overall table (i.e.,
the distance between the two distributions is no more
than a threshold t). This effectively limits the amount
of individual specific information an observer can
learn. In addition, several principles were introduced,
such as (c, k)-safety[9] and δ-presence[10]. In 2006,
Xiao and Tao[11] proposed Anatomy, which is a data
anonymization approach that divides one table into
two for release. One table includes the original quasi-identifier
and a group id, and the other includes the
association between the group id and the sensitive
attribute values.
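The two-table release of Anatomy can be illustrated with a short sketch. It assumes that the grouping of records is already given (how groups are formed is not covered here), and it simplifies the sensitive table to one row per record. The function name `anatomize`, the column names, and the age and zipcode values are hypothetical.

```python
def anatomize(records, quasi_identifiers, sensitive, group_of):
    """Split one table into a quasi-identifier table and a sensitive table.

    group_of[i] is the (assumed, precomputed) group id of records[i].
    """
    qi_table = []
    sensitive_table = []
    for index, record in enumerate(records):
        group_id = group_of[index]
        qi_row = {attr: record[attr] for attr in quasi_identifiers}
        qi_row["group_id"] = group_id
        qi_table.append(qi_row)
        sensitive_table.append({"group_id": group_id, sensitive: record[sensitive]})
    # Sort so that row order no longer links a sensitive value to a QI row.
    sensitive_table.sort(key=lambda row: (row["group_id"], row[sensitive]))
    return qi_table, sensitive_table


# Hypothetical records; only the salary values echo the running example.
records = [
    {"age": 28, "zipcode": "10001", "salary": 40000},
    {"age": 26, "zipcode": "10002", "salary": 985},
    {"age": 35, "zipcode": "20001", "salary": 1030},
    {"age": 37, "zipcode": "20002", "salary": 1015},
]
qi_table, sensitive_table = anatomize(
    records, ["age", "zipcode"], "salary", group_of=[1, 1, 2, 2]
)
print(qi_table)
print(sensitive_table)
```

An observer who joins the two tables on `group_id` obtains every pairing of a quasi-identifier row with a sensitive value of the same group, so the exact association within a group is hidden.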
In this paper, we study privacy protection on
numerical sensitive attributes, such as salary and bonus.
We introduce a privacy-preserving data publishing
method for multiple numerical sensitive attributes, called
MNSACM. This anonymization method uses the
ideas of clustering and Multi-Sensitive Bucketization
(MSB). We use an example to show that this method can
provide effective protection with multiple numerical
sensitive attributes.
2 Preliminaries
2.1 Proximity breach
The motivation of this work is that prior anonymization
principles are not effective at preventing the “proximity
breach”, which is a privacy threat specific to numerical
sensitive attributes (such as salary). Intuitively, a
proximity breach occurs when an adversary concludes,
with high confidence, that the sensitive value of a victim
individual must fall in a short interval, even though the
adversary has low confidence about the victim’s actual
value.
Consider an example, as shown in Table 1, where
a company intends to publish its payment records.
The publication must ensure that no adversary can
accurately infer the salary of any employee. Age and
Zipcode are Quasi-Identifier (QI) attributes and Table
2 demonstrates a generalized version of Table 1. As
seen from Table 2, an adversary can no longer uniquely
determine Andy’s salary because any tuple of the
first QI-group can belong to Andy. Hence, his salary can be 985, 1030, 1015, or 40 000. However, if an
adversary possesses the QI-values of Andy, he is able
to determine that Andy’s record is definitively in the
first QI-group of Table 2. Without further information,
he assumes that each tuple in the group has an equal
chance of belonging to Andy. Thus, he can confirm
that Andy’s salary is in the interval [985,1030] with a
75% probability. Consequently, the adversary arrives at
a privacy-intruding claim “Andy’s salary is very likely
around 1000.”
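The 75% figure in this example can be reproduced with a few lines of arithmetic. This is a minimal sketch under the same assumption the adversary makes, namely that each tuple in the QI-group is equally likely to belong to the victim; the function name `interval_confidence` is hypothetical.

```python
def interval_confidence(group_values, low, high):
    """Fraction of sensitive values in a QI-group that fall in [low, high],
    assuming every tuple is equally likely to belong to the victim."""
    inside = sum(1 for value in group_values if low <= value <= high)
    return inside / len(group_values)


# Salaries of the first QI-group, as listed in the text.
first_group_salaries = [985, 1030, 1015, 40000]

print(interval_confidence(first_group_salaries, 985, 1030))  # 0.75
```

Three of the four salaries lie in [985, 1030], so the adversary's confidence in the short interval is 3/4 even though the confidence in any single exact value is only 1/4.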
2.2 The (ε, m)-anonymity principle
For numerical sensitive attributes, Zhang et al.[12]
proposed the notion of (k, e)-anonymity in 2007. The
general idea is to partition the records into groups,
such that each group contains at least k different
sensitive values with a range of at least e. However,
(k, e)-anonymity ignores the distribution of sensitive
values within the range. If some sensitive values occur
frequently within a given subrange, the attacker can
still confidently infer the subrange in a group. In 2008,
Li et al.[13] introduced a new anonymization principle,
(ε, m)-anonymity, which eliminates the proximity
breach in publishing numeric sensitive attributes. This
principle is based on a natural rationale: given a QI-group
G, for every sensitive value x in G, at most 1/m
of the tuples in G can have sensitive values “similar”
to x. However, that work concentrates on microdata
that contains only a single sensitive attribute. Directly
applying the (ε, m)-anonymity principle to microdata
with multiple numerical sensitive attributes often causes
unexpected disclosure of private information.
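The difference between the two principles can be seen on a single QI-group. The sketch below is illustrative only: it assumes that "similar" means an absolute difference of at most ε and that a value counts as similar to itself, which is one possible reading and not necessarily the exact definition used by Li et al. The function names and parameter values are hypothetical.

```python
def satisfies_k_e(values, k, e):
    """(k, e)-anonymity for one non-empty group: at least k distinct
    sensitive values, spanning a range of at least e."""
    return len(set(values)) >= k and (max(values) - min(values)) >= e


def satisfies_eps_m(values, eps, m):
    """(eps, m)-anonymity for one group, assuming 'similar' means
    |y - x| <= eps: for every value x, at most 1/m of the tuples
    may carry a value similar to x (x itself is counted)."""
    total = len(values)
    for x in values:
        similar = sum(1 for y in values if abs(y - x) <= eps)
        if similar * m > total:  # same test as similar / total > 1 / m
            return False
    return True


# Salaries of the first QI-group from the proximity breach example.
group = [985, 1030, 1015, 40000]

print(satisfies_k_e(group, k=4, e=30000))   # range is wide, values distinct
print(satisfies_eps_m(group, eps=50, m=2))  # three values cluster near 1000
```

With these assumed parameters the group passes the (k, e) test because of the single outlier 40000, yet fails the (ε, m) test because three of the four salaries lie within 50 of each other, which is the situation the proximity breach exploits.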
3 Model and Method
3.1 Related works
3.1.1 Approaches of multi-sensitive MSB
In 2008, Yang et al.[14] discussed the problem of secure
publishing of sensitive data, which contains multiple
attributes. They proposed a multi-dimensional bucket
grouping approach based on the idea of lossy join,
called MSB. In addition, they used the suppression ratio,
which is the proportion of suppressed records among the
total records, to measure information loss. The calculation
method is as follows:
$$\mathrm{supRatio} = \frac{n_s}{\|T\|},$$
where $n_s$ is the number of suppressed records and $\|T\|$
is the number of total records. Clearly, a smaller
suppression ratio implies fewer suppressed records
and therefore smaller information loss. Ideally, the
suppression ratio is 0. In this paper, we also use the
suppression ratio to analyze the efficiency of the
MNSACM algorithm.
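The suppression ratio is a direct quotient and can be computed as follows. This is a minimal sketch; the function name `suppression_ratio` and the sample counts are hypothetical and do not come from the paper's experiments.

```python
def suppression_ratio(num_suppressed, num_total):
    """Proportion of suppressed records among all records of the table."""
    if num_total <= 0:
        raise ValueError("the table must contain at least one record")
    if not 0 <= num_suppressed <= num_total:
        raise ValueError("suppressed count must lie between 0 and the total")
    return num_suppressed / num_total


# Hypothetical counts: 3 suppressed records out of 120.
print(suppression_ratio(3, 120))
# Ideal case: nothing is suppressed, so the ratio is 0.
print(suppression_ratio(0, 120))
```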