INTRODUCTION
Data mining has attracted more and more attention in recent years, probably because of the popularity of the "big data" concept. Data mining is the process of discovering interesting patterns and knowledge from large amounts of data. As a highly application-driven discipline, data mining has been successfully applied to many domains, such as business intelligence, Web search, scientific discovery, and digital libraries.
2.1. THE PROCESS OF KDD
The term "data mining" is often treated as a synonym for "knowledge discovery from data" (KDD), a term which highlights the goal of the mining process. To obtain useful knowledge from data, the following steps are performed in an iterative way (see Fig. 1):
Step 1: Data preprocessing. Basic operations include data selection (to retrieve data relevant to the KDD task from the database), data cleaning (to remove noise and inconsistent data, to handle missing data fields, etc.) and data integration (to combine data from multiple sources).
Step 2: Data transformation. The goal is to transform the data into forms appropriate for the mining task, that is, to find useful features to represent the data. Feature selection and feature transformation are the basic operations.
Step 3: Data mining. This is an essential process where intelligent methods are employed to extract data patterns (e.g. association rules, clusters, classification rules, etc.).
Step 4: Pattern evaluation and presentation. Basic operations include identifying the truly interesting patterns which represent knowledge, and presenting the mined knowledge in an easy-to-understand fashion.
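To make the four steps concrete, the following is a minimal Python sketch of the pipeline; the file name, column names, and the choice of clustering as the mining method are illustrative assumptions, not details from the original text:

# A minimal, illustrative sketch of the KDD pipeline described above.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Step 1: data preprocessing -- select relevant records, clean missing values.
data = pd.read_csv("customers.csv")              # data selection (hypothetical file)
data = data.dropna(subset=["age", "income"])     # data cleaning

# Step 2: data transformation -- scale features into a mining-friendly form.
features = StandardScaler().fit_transform(data[["age", "income"]])

# Step 3: data mining -- here, clustering as one example pattern type.
data["cluster"] = KMeans(n_clusters=3, n_init=10).fit_predict(features)

# Step 4: pattern evaluation and presentation -- summarize each cluster.
print(data.groupby("cluster")[["age", "income"]].mean())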
2.2. THE PRIVACY CONCERN AND PPDM
Although the information discovered by data mining can be very valuable to many applications, people have shown increasing concern about the other side of the coin, namely the privacy threats posed by data mining [2]. An individual's privacy may be violated due to unauthorized access to personal data, the undesired discovery of one's embarrassing information, the use of personal data for purposes other than the one for which the data were collected, etc. For instance, the U.S. retailer Target once received a complaint from a customer who was angry that Target sent coupons for baby clothes to his teenage daughter. However, the daughter was in fact pregnant at that time, and Target had correctly inferred this by mining its customer data. From this story, we can see that the conflict between data mining and privacy does exist. To deal with the privacy issues in data mining, a sub-field of data mining, referred to as privacy-preserving data mining (PPDM), has developed rapidly in recent years. The objective of PPDM is to safeguard sensitive information from unsolicited or unsanctioned disclosure while preserving the utility of the data. The consideration of PPDM is two-fold. First, sensitive raw data, such as an individual's ID card number and cell phone number, should not be directly used for mining. Second, sensitive mining results whose disclosure would result in privacy violation should be excluded. Since the pioneering work in this area, numerous studies on PPDM have been conducted.
2.3. USER ROLE-BASED METHODOLOGY
Current models and algorithms proposed for PPDM mainly focus on how to hide sensitive information from certain mining operations. However, as depicted in Fig. 1, the whole KDD process involves multi-phase operations. Besides the mining phase, privacy issues may also arise in the data collecting or data preprocessing phase, and even in the delivery of the mining results. In this paper, we investigate the privacy aspects of data mining by considering the whole knowledge-discovery process. We present an overview of the many approaches which can help to make proper use of sensitive data and protect the security of sensitive information discovered by data mining. We use the term "sensitive information" to refer to privileged or proprietary information that only certain people are allowed to see and that is therefore not accessible to everyone. If sensitive information is lost or used in any way other than intended, the result can be severe damage to the person or organization to which that information belongs. The term "sensitive data" refers to data from which sensitive information can be extracted. Throughout the paper, we consider the two terms "privacy" and "sensitive information" to be interchangeable. In this paper, we develop a user-role-based methodology to conduct the review of related studies. Based on the stage division in the KDD process (see Fig. 1), we can identify four different types of users, namely four user roles, in a typical data mining scenario:
• Data Provider: the user who owns some data that are desired by the data mining task.
• Data Collector: the user who collects data from data providers and then publishes the data to the data miner.
• Data Miner: the user who performs data mining tasks on the data.
• Decision Maker: the user who makes decisions based on the data mining results in order to achieve certain goals.
In the data mining scenario depicted in Fig. 2, a user represents either a person or an organization. Also, one user can play multiple roles at once. For example, in the Target story we mentioned above, the customer plays the role of data provider, and the retailer plays the roles of data collector, data miner and decision maker. By differentiating the four user roles, we can explore the privacy issues in data mining in a principled way. All users care about the security of sensitive information, but each user role views the security issue from its own perspective. What we need to do is to identify the privacy problems that each user role is concerned about, and to find appropriate solutions to these problems. Here we briefly describe the privacy concerns of each user role. Detailed discussions will be presented in the following sections.
1) DATA PROVIDER
The major concern of a data provider is whether he can control the sensitivity of the data he provides to others. On one hand, the provider should be able to make his very private data, namely the data containing information that he does not want anyone else to know, inaccessible to the data collector. On the other hand, if the provider has to provide some data to the data collector, he wants to hide his sensitive information as much as possible and get enough compensation for the possible loss in privacy.
2) DATA COLLECTOR
The data collected from data providers may contain individuals' sensitive information. Directly releasing the data to the data miner would violate the data providers' privacy, hence data modification is required. On the other hand, the data should still be useful after modification, otherwise collecting the data will be meaningless. Therefore, the major concern of the data collector is to guarantee that the modified data contain no sensitive information but still preserve high utility.
3) DATA MINER
The data miner applies mining algorithms to the data provided by the data collector, and he wishes to extract useful information from the data in a privacy-preserving manner. As introduced in Section I-B, PPDM covers two types of protections, namely the protection of the sensitive data themselves and the protection of sensitive mining results. With the user role-based methodology proposed in this paper, we consider that the data collector should take the major responsibility for protecting sensitive data, while the data miner can focus on how to hide the sensitive mining results from untrusted parties.
4) DECISION MAKER
As shown in Fig. 2, a decision maker can get the data mining results directly from the data miner, or from some information transmitter. It is possible that the information transmitter changes the mining results intentionally or unintentionally, which may cause serious loss to the decision maker. Therefore, the major concern of the decision maker is whether the mining results are credible. In addition to investigating the privacy-protection approaches adopted by each user role, in this paper we emphasize a common type of approach, namely the game theoretical approach, that can be applied to many problems involving privacy protection in data mining. The rationale is that, in the data mining scenario, each user pursues high self-interest in terms of privacy preservation or data utility, and the interests of different users are correlated. Hence the interactions among different users can be modelled as a game. By using methodologies from game theory, we can obtain useful implications on how each user role should behave in an attempt to solve his privacy problems.
DATA PROVIDER
3.1. CONCERNS OF DATA PROVIDER
A data provider owns some data from which valuable information can be extracted. In the data mining scenario depicted in Fig. 2, there are actually two types of data providers: one refers to the data provider who provides data to the data collector, and the other refers to the data collector who provides data to the data miner. To differentiate the privacy-protecting methods adopted by different user roles, in this section we restrict ourselves to the ordinary data provider, the one who owns a relatively small amount of data which contain information only about himself. Data reporting information about an individual are often referred to as "microdata". If a data provider reveals his microdata to the data collector, his privacy might be compromised due to an unexpected data breach or the exposure of sensitive information. Hence, the privacy concern of a data provider is whether he can take control over what kind of and how much information other people can obtain from his data. To investigate the measures that the data provider can adopt to protect privacy, we consider the following three situations:
1) If the data provider considers his data to be very sensitive, that is, the data may reveal some information that he does not want anyone else to know, the provider can just refuse to provide such data. Effective access control measures are desired by the data provider, so that he can prevent his sensitive data from being stolen by the data collector.
2) Realizing that his data are valuable to the data collector (as well as the data miner), the data provider may be willing to hand over some of his private data in exchange for certain benefit, such as better services or monetary rewards. The data provider needs to know how to negotiate with the data collector, so that he will get enough compensation for any possible loss in privacy.
3) If the data provider can neither prevent the access to his sensitive data nor make a lucrative deal with the data collector, the data provider can distort the data that will be fetched by the data collector, so that his true information cannot be easily disclosed.
3.2. APPROACHES TO PRIVACY PROTECTION
1) LIMIT THE ACCESS
A data provider provides his data to the collector in an active way or a passive way. By "active" we mean that the data provider voluntarily opts in to a survey initiated by the data collector, or fills in registration forms to create an account on a website. By "passive" we mean that data generated by the provider's routine activities are recorded by the data collector, while the data provider may not even be aware of the disclosure of his data. When the data provider provides his data actively, he can simply ignore the collector's demand for the information that he deems very sensitive. If his data are passively provided to the data collector, the data provider can take some measures to limit the collector's access to his sensitive data. Suppose that the data provider is an Internet user who is afraid that his online activities may expose his privacy.
To protect his privacy, the user can try to erase the traces of his online activities by emptying the browser's cache, deleting cookies, clearing the usage records of applications, etc. The provider can also utilize various security tools developed for the Internet environment to protect his data. Many of these security tools are designed as browser extensions for ease of use. Based on their basic functions, current security tools can be categorized into the following three types:
1) Anti-tracking extensions. Knowing that valuable information can be extracted from the data produced by users' online activities, Internet companies have a strong motivation to track users' movements on the Internet. When browsing the Internet, a user can utilize an anti-tracking extension to block trackers from collecting cookies. Popular anti-tracking extensions include Disconnect, Do Not Track Me, Ghostery, etc. A major technology used for anti-tracking is called Do Not Track (DNT), which enables users to opt out of tracking by websites they do not visit. A user's opt-out preference is signalled by an HTTP header field named DNT: a DNT value of 1 means the user does not want to be tracked (opt out). Two U.S. researchers created a prototype add-on supporting the DNT header for the Firefox web browser in 2009. Since then, many web browsers have added support for DNT. DNT is not only a technology but also a policy framework for how companies that receive the signal should respond. The W3C Tracking Protection Working Group is now trying to standardize how websites should respond to a user's DNT request.
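For concreteness, the DNT preference travels as an ordinary HTTP request header. A minimal sketch using Python's requests library follows (the URL is a placeholder):

# Sending a Do Not Track preference with an ordinary HTTP request.
# "DNT: 1" signals the user's opt-out preference to the server.
import requests

response = requests.get("https://example.com", headers={"DNT": "1"})
print(response.status_code)

# A cooperating server or analytics script would check the header
# before setting tracking cookies, e.g.:
#   if request.headers.get("DNT") == "1": skip tracking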
2) Advertisement and script blockers. This type of browser extension can block advertisements on a site, and kill scripts and widgets that send the user's data to unknown third parties. Example tools include Adblock Plus, NoScript, Flashblock, etc.
3) Encryption tools. To make sure a private online communication between two parties cannot be intercepted by third parties, a user can utilize encryption tools, such as MailCloak and TorChat, to encrypt his emails, instant messages, or other types of web traffic. Also, a user can encrypt all of his Internet traffic by using a VPN (virtual private network) service. In addition to the tools mentioned above, an Internet user should always use anti-virus and anti-malware tools to protect the data stored on digital equipment such as personal computers, cell phones and tablets. With the help of all these security tools, the data provider can limit others' access to his personal data. Though there is no guarantee that one's sensitive data can be completely kept out of the reach of untrustworthy data collectors, making a habit of clearing online traces and using security tools can indeed help to reduce the risk of privacy disclosure.
2) TRADE PRIVACY FOR BENEFIT
In some cases, the data provider needs to make a trade-off between the loss of privacy and the benefits brought by participating in data mining. For example, by analysing a user's demographic information and browsing history, a shopping website can offer personalized product recommendations to the user. The user's sensitive preferences may be disclosed, but he can enjoy a better shopping experience. Driven by benefits such as a personalized service or monetary incentives, the data provider may be willing to provide his sensitive data to a trustworthy data collector who promises that the provider's sensitive information will not be revealed to an unauthorized third party. If the provider is able to predict how much benefit he can get, he can rationally decide what kind of and how much sensitive data to provide. For example, suppose a data collector asks the data provider to provide information about his age, gender, occupation and annual salary, and tells the data provider how much he would pay for each data item. If the data provider considers salary to be his sensitive information, then based on the prices offered by the collector, he chooses one of the following actions:
i) not to report his salary, if he thinks the price is too low;
ii) to report a fuzzy value of his salary, e.g. "less than 10,000 dollars", if he thinks the price is just acceptable;
iii) to report an accurate value of his salary, if he thinks the price is high enough. A minimal sketch of this three-way decision is given below.
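The following Python sketch illustrates the decision rule just described; the two price thresholds are hypothetical parameters standing in for the provider's privacy preference, not values from the original text:

# Illustrative decision rule for reporting a sensitive attribute,
# given the price offered by the data collector. Thresholds are
# hypothetical and encode the provider's privacy preference.
def salary_report(true_salary: float, offered_price: float,
                  low_threshold: float = 5.0, high_threshold: float = 20.0):
    if offered_price < low_threshold:
        return None                                    # i) refuse to report
    if offered_price < high_threshold:
        bucket = int(true_salary // 10000) * 10000
        return f"{bucket}-{bucket + 10000} dollars"    # ii) fuzzy value
    return true_salary                                 # iii) accurate value

print(salary_report(43500.0, 3.0))     # None (price too low)
print(salary_report(43500.0, 10.0))    # '40000-50000 dollars'
print(salary_report(43500.0, 25.0))    # 43500.0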
From this example we can see that both the privacy preference of the data provider and the incentives offered by the data collector affect the provider's decision about his sensitive data. On the other hand, the data collector can make a profit from the data collected from data providers, and this profit heavily depends on the quantity and quality of the data. Hence, data providers' privacy preferences have a great influence on the data collector's profit, and the profit in turn plays an important role when the data collector decides on the incentives. That is to say, the data collector's decision on incentives is related to data providers' privacy preferences. Therefore, if the data provider wants to obtain a satisfying benefit by "selling" his data to the data collector, he needs to consider the effect of his decision on the data collector's benefits (and even the data miner's benefits), which will in turn affect the benefits he can get from the collector. In the data-selling scenario, both the seller (i.e. the data provider) and the buyer (i.e. the data collector) want to get more benefits, so the interaction between data provider and data collector can be formally analyzed using game theory. Also, the sale of data can be treated as an auction, to which mechanism design theory can be applied. Considering that different user roles are involved in the sale, and that the privacy-preserving methods adopted by the data collector and the data miner may influence the data provider's decisions, we review the applications of game theory and mechanism design in Section VI, after the discussions of the other user roles.
3) PROVIDE FALSE DATA
As discussed above, a data provider can take measures to prevent the data collector from accessing his sensitive data. However, a disappointing fact we have to admit is that no matter how hard they try, Internet users cannot completely stop unwanted access to their personal information. So instead of trying to limit the access, the data provider can provide false information to untrustworthy data collectors. The following three methods can help an Internet user falsify his data:
1) Using "sock puppets" to hide one's true activities. A sock puppet is a false online identity through which a member of an Internet community speaks while pretending to be another person, like a puppeteer manipulating a hand puppet. By using multiple sock puppets, the data produced by one individual's activities will be deemed as data belonging to different individuals, assuming that the data collector does not have enough knowledge to relate different sock puppets to one specific individual. As a result, the user's true activities are unknown to others and his sensitive information (e.g. political preference) cannot be easily discovered.
2) Using a fake identity to create phony information. In 2012, Apple Inc. was granted a patent called "Techniques to pollute electronic profiling", which can help to protect users' privacy. This patent discloses a method for polluting the information gathered by "network eavesdroppers" by creating a clone of a principal agent's online identity, e.g. that of a service subscriber. The clone identity automatically carries out numerous online actions which are quite different from the user's true activities. When a network eavesdropper collects the data of a user who is utilizing this method, the eavesdropper will be confused by the massive data created by the clone identity. Real information about the user is buried under the manufactured phony information.
3) Using security tools to mask one's identity. When a user signs up for a web service or buys something online, he is often asked to provide information such as an email address, credit card number, phone number, etc. A browser extension called MaskMe, which was released by the online privacy company Abine, Inc. in 2013, can help the user create and manage aliases (or Masks) of such personal information. Users can use these aliases just as they normally would when such information is required, while the websites cannot get the real information. In this way, the user's privacy is protected.
DATA COLLECTOR
4.1. CONCERNS OF DATA COLLECTOR
A data collector collects data from data providers in order to support the subsequent data mining operations. The original data collected from data providers usually
contain sensitive information about individuals. If the data collector doesn't take sufficient precautions before releasing the data to the public or to data miners, that sensitive information may be disclosed, even though this is not the collector's intention. For example, on October 2, 2006, the U.S. online movie rental service Netflix released a dataset containing the movie ratings of about 500,000 subscribers to the public for a competition called the Netflix Prize. The goal of the competition was to improve the accuracy of personalized movie recommendations. The released data set was supposed to be privacy-safe, since each data record contained only a subscriber ID (unrelated to the subscriber's real identity), the movie info, the rating, and the date on which the subscriber rated the movie. However, soon after the release, two researchers from the University of Texas found that, with a little auxiliary information about an individual subscriber, e.g. 8 movie ratings (of which 2 may be completely wrong) and dates that may have a 14-day error, an adversary can easily identify the individual's record (if the record is present in the data set). From this example we can see that it is necessary for the data collector to modify the original data before releasing them to others, so that sensitive information about data providers can neither be found in the modified data nor be inferred by anyone with malicious intent. Generally, the modification will cause a loss in data utility. The data collector should also make sure that sufficient utility of the data is retained after the modification, otherwise collecting the data will have been a wasted effort. The data modification process adopted by the data collector, with the goal of preserving privacy and utility simultaneously, is usually called privacy-preserving data publishing (PPDP).
Extensive approaches to PPDP have been proposed in the last decade. Fung et al. have systematically summarized and evaluated different approaches in their frequently cited survey. Also, Wong and Fu have made a detailed review of studies on PPDP in their monograph. To differentiate our work from theirs, in this paper we mainly focus on how PPDP is realized in two emerging applications, namely social networks and location-based services. To make our review more self-contained, in the next subsection we introduce some basics of PPDP, e.g. the privacy model, typical anonymization operations, information metrics, etc., and then we review studies on social networks and location-based services respectively.
4.2. APPROACHES TO PRIVACY PROTECTION
1) BASICS OF PPDP
PPDP mainly studies anonymization approaches for publishing useful data while preserving privacy. The original data is assumed to be a private table consisting of multiple records.
Each record consists of the following 4 types of attributes:
• Identifier (ID): Attributes that can directly and uniquely identify an individual, such as name, ID number and mobile number.
• Quasi-identifier (QID): Attributes that can be linked with external data to re-identify individual records, such as gender, age and zip code.
• Sensitive Attribute (SA): Attributes that an individual wants to conceal, such as disease and salary.
• Non-sensitive Attribute (NSA): Attributes other than ID, QID and SA.
Before being published to others, the table is anonymized, that is, identifiers are removed and quasi-identifiers are modified.
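As a minimal illustration of these operations, the following Python/pandas sketch removes the identifier and generalizes the quasi-identifiers; the table and attribute values are hypothetical, and a real anonymization would also verify a formal privacy model such as k-anonymity:

# Illustrative anonymization: drop identifiers, generalize quasi-identifiers.
import pandas as pd

table = pd.DataFrame({
    "name":    ["Alice", "Bob", "Carol", "Dave"],     # ID
    "age":     [23, 27, 31, 36],                      # QID
    "zipcode": ["47677", "47602", "47678", "47905"],  # QID
    "disease": ["flu", "flu", "cancer", "flu"],       # SA
})

anonymized = table.drop(columns=["name"])             # remove the identifier
anonymized["age"] = (anonymized["age"] // 10) * 10    # generalize age to decades
anonymized["zipcode"] = anonymized["zipcode"].str[:3] + "**"  # coarsen zip code

# The size of each QID group hints at the k in k-anonymity.
print(anonymized.groupby(["age", "zipcode"]).size())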
CHAPTER 5
DATA MINER
5.1. CONCERNS OF DATA MINER
In order to discover useful knowledge desired by the decision maker, the data miner applies data mining algorithms to the data obtained from the data collector. The privacy issues accompanying the data mining operations are twofold. On one hand, if personal information can be directly observed in the data and a data breach happens, the privacy of the original data owner (i.e. the data provider) will be compromised. On the other hand, equipped with many powerful data mining techniques, the data miner is able to discover various kinds of information underlying the data.
5.2. APPROACHES TO PRIVACY PROTECTION
Extensive PPDM approaches have been proposed. These approaches can be classified by different criteria, such as data distribution, data modification method, data mining algorithm, etc.
Based on the distribution of data, PPDM approaches can be classified into two categories, namely approaches for centralized data mining and approaches for distributed data mining. Distributed data mining can be further categorized into data mining over horizontally partitioned data and data mining over vertically partitioned data. Based on the technique adopted for data modification, PPDM can be classified into perturbation-based, blocking-based, swapping-based, etc.
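The distinction between horizontally and vertically partitioned data can be made concrete with a small sketch over a hypothetical table:

# Horizontal vs. vertical partitioning of the same table across two parties.
import pandas as pd

table = pd.DataFrame({"id": [1, 2, 3, 4],
                      "age": [25, 34, 41, 29],
                      "salary": [40000, 52000, 61000, 45000]})

# Horizontal partitioning: each party holds different records, all attributes.
party_a_h = table.iloc[:2]               # records 1-2
party_b_h = table.iloc[2:]               # records 3-4

# Vertical partitioning: each party holds different attributes of all records.
party_a_v = table[["id", "age"]]
party_b_v = table[["id", "salary"]]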
2) PRIVACY-PRESERVING CLASSIFICATION
Classification is a form of data analysis that extracts models describing important data classes. Data classification can be seen as a two-step process. In the first step, called the learning step, a classification algorithm is employed to build a classifier (classification model) by analyzing a training set made up of tuples and their associated class labels.
In the second step, the classifier is used for classification, i.e. predicting the categorical class labels of new data. Typical classification models include decision trees, Bayesian models, support vector machines, etc.
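The two-step process can be sketched with scikit-learn as follows; the training tuples are illustrative, and a privacy-preserving variant would perturb the data or use secure protocols before this point:

# Learning step: build a decision-tree classifier from labeled training tuples.
# Classification step: predict class labels for new data.
from sklearn.tree import DecisionTreeClassifier

X_train = [[25, 40000], [34, 52000], [41, 61000], [29, 45000]]  # age, salary
y_train = ["no", "yes", "yes", "no"]                            # class labels

classifier = DecisionTreeClassifier().fit(X_train, y_train)     # step 1: learning
print(classifier.predict([[38, 58000]]))                        # step 2: classification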
3) PRIVACY-PRESERVING CLUSTERING
Cluster analysis is the process of grouping a set of data objects into multiple groups or clusters so that objects within a cluster have high similarity, but are very dissimilar to objects in other clusters. Dissimilarities and similarities are assessed based on the attribute values describing the objects and often involve distance measures. Clustering methods can be categorized into partitioning methods, hierarchical methods, density-based methods, etc. Current studies on privacy-preserving clustering can be roughly categorized into two types, namely approaches based on perturbation and approaches based on secure multi-party computation (SMC).
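The perturbation-based idea can be sketched in a few lines: each data owner adds random noise before release, and clustering runs on the perturbed data. The data and noise scale below are illustrative, not taken from the surveyed approaches (SMC-based approaches would instead compute the clustering jointly under a secure protocol):

# Perturbation-based privacy-preserving clustering sketch:
# additive random noise masks individual values before mining.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
original = rng.normal(size=(100, 2)) + np.repeat([[0, 0], [5, 5]], 50, axis=0)

perturbed = original + rng.normal(scale=0.5, size=original.shape)  # add noise

labels = KMeans(n_clusters=2, n_init=10).fit_predict(perturbed)
print(np.bincount(labels))  # the cluster structure largely survives the noise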
DECISION MAKER
6.1. CONCERNS OF DECISION MAKER
The ultimate goal of data mining is to provide useful information to the decision maker, so that the decision maker can choose a better way to achieve his objective, such as increasing sales of products or making correct diagnoses of diseases. At first glance, it seems that the decision maker has no responsibility for protecting privacy, since we usually interpret privacy as sensitive information about the original data owners (i.e. data providers). Generally, the data miner, the data collector and the data provider himself are considered to be responsible for the safety of privacy. However, if we look at the privacy issue from a wider perspective, we can see that the decision maker also has his own privacy concerns: the data mining results provided by the data miner are of high importance to the decision maker.
6.2. APPROACHES TO PRIVACY PROTECTION
To deal with the privacy issue raised above, i.e. to prevent the unwanted disclosure of sensitive mining results, the decision maker usually has to resort to legal measures.
1) DATA PROVENANCE
If the decision maker does not get the data mining results directly from the data miner, he would want to know how the results are delivered to him and what kind of modification may have been applied to the results, so that he can determine whether the results can be trusted.
2) WEB INFORMATION CREDIBILITY
Because of the lack of publishing barriers, the low cost of dissemination, and the lax control of quality, the credibility of web information has become a serious issue.
CHAPTER 7
GAME THEORY IN DATA PRIVACY
7.1. GAME THEORY PRELIMINARIES
In the above sections, we have discussed the privacy issues related to the data provider, data collector, data miner and decision maker, respectively. In this section, we focus on the interactions among different users. When participating in a data mining activity, each user has his own consideration about the benefit he may obtain and the (privacy) cost he has to pay.
7.2. PRIVATE DATA COLLECTION AND PUBLICATION
If a data collector wants to collect data from data providers who place a high value on their private data, the collector may need to negotiate with the providers about the "price" of the sensitive data and the level of privacy protection. Adl et al. build a sequential game model to analyze the private data collection process. In the proposed model, a data user, who wants to buy a data set from the data collector, makes a price offer to the collector at the beginning of the game. If the data collector accepts the offer, he then announces some incentives to data providers in order to collect private data from them. Before selling the collected data to the data user, the data collector applies an anonymization technique to the data, in order to protect the privacy of data providers at a certain level.
7.3. PRIVACY PRESERVING DISTRIBUTED DATA MINING
1) SMC-BASED PRIVACY-PRESERVING DISTRIBUTED DATA MINING
Secure multi-party computation (SMC) is widely used in privacy-preserving distributed data mining. In an SMC scenario, a set of mutually distrustful parties, each with a private input, jointly compute a function over their inputs. A protocol is established to ensure that each party learns only the computation result while his own data stay private (a minimal secure-sum sketch is given after the list below). However, during the execution of the protocol, a party may take one of the following actions in order to get more benefits:
• Semi-honest adversary: one follows the established protocol and correctly performs the computation but attempts to analyze others' private inputs.
• Malicious adversary: one arbitrarily deviates from the established protocol which leads to the failure of computation.
• Collusion: one colludes with several other parties to expose the private input of another party who doesn't participate in the collusion.
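As a concrete illustration of the SMC idea, the following sketch computes the sum of three parties' private inputs with additive secret sharing, so that no party sees another's input, only random-looking shares. This is a textbook secure-sum construction for the semi-honest model, not a protocol from the surveyed papers:

# Additive secret-sharing secure sum (semi-honest model, illustrative).
# Each party splits its private input into random shares that sum to the
# input modulo a large prime; no single share reveals anything.
import random

PRIME = 2**61 - 1

def make_shares(secret: int, n_parties: int):
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

private_inputs = [12, 7, 30]                     # one secret per party
n = len(private_inputs)
all_shares = [make_shares(x, n) for x in private_inputs]

# Party j locally sums the j-th share of every input; combining the n
# partial sums reveals only the total, never an individual input.
partial_sums = [sum(all_shares[i][j] for i in range(n)) % PRIME
                for j in range(n)]
print(sum(partial_sums) % PRIME)                 # 49 = 12 + 7 + 30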
Kargupta et al. model the SMC problem as a static game with complete information. By analyzing the Nash equilibria, they find that if nobody is penalized for dishonest behavior, parties tend to collude. They also propose a cheap-talk based protocol to implement a punishment mechanism which can lead to an equilibrium state corresponding to no collusion. Miyaji et al. propose a two-party secure set-intersection protocol in a game theoretic setting. They assume that parties are neither honest nor corrupt but act only in their own self-interest. They show that the proposed protocol satisfies computational versions of strict Nash equilibrium and stability with respect to trembles.
An SMC-based algorithm has also been proposed for privacy-preserving distributed association rule mining (PPDARM), which employs Shamir's secret sharing technique to prevent collusion among parties. Nanavati and Jinwala model the secret sharing in PPDARM as a repeated game, where a Nash equilibrium is achieved when all parties send their shares and attain non-collusive behavior. Based on the game model, they develop punishment policies which aim at getting the maximum possible number of participants involved in the game so that they can get maximum utilities.
2) RECOMMENDER SYSTEM
Personalized recommendation is a typical application of data mining. A recommender system predicts users' preferences by analyzing the item ratings they provide; consequently, a user can protect his private preferences by falsifying his ratings.
In the proposed game model, users are treated as players, and the rating data provided to the recommender server are seen as users' strategies. It has been shown that the Nash equilibrium strategy for each user is to declare a false rating for only one item: the one that is highly ranked in his private preferences and less correlated with the items for which he anticipates recommendations. To find the equilibrium strategy, the data exchange between users and the recommender server is modeled as an iterative process. At each iteration, using the ratings provided by other users at the previous iteration, each user computes a rating vector that maximizes the preservation of his privacy, subject to a constraint on the recommendation quality. Then the user declares this rating vector to the recommender server. After several iterations, the process converges to a Nash equilibrium.
3) LINEAR REGRESSION AS A NON-COOPERATIVE GAME
Ioannidis and Loiseau study the privacy issue in linear regression modeling. They consider a setting where a data analyst collects private data from multiple individuals to build a linear regression model. In order to protect privacy, individuals add noise to their data, which affects the accuracy of the model.
The interactions among individuals are modeled as a non-cooperative game, where each individual selects the variance level of the noise to minimize his cost. The cost relates to both the privacy loss incurred by the release of data and the accuracy of the estimated linear regression model. It is shown that under appropriate assumptions on privacy and estimation costs, there exists a unique pure Nash equilibrium at which each individual's cost is bounded.
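The core trade-off, namely that noise protects the individual but degrades the fitted model, can be seen in a few lines; the data and noise levels below are illustrative and do not reproduce the authors' cost model:

# Noise added by individuals to their released data degrades the regression
# fit: a higher variance means more privacy but a noisier estimate.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y_true = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

for noise_std in [0.0, 1.0, 5.0]:
    y_released = y_true + rng.normal(scale=noise_std, size=x.size)
    slope, intercept = np.polyfit(x, y_released, 1)
    print(f"noise std {noise_std}: slope={slope:.2f}, intercept={intercept:.2f}")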
7.4. DATA ANONYMIZATION
Chakravarthy et al. present an interesting application of game theory. They propose a k-anonymity method which utilizes coalitional game theory to achieve a proper privacy level, given a threshold for the information loss.
7.5. MECHANISM DESIGN AND PRIVACY PROTECTION
Mechanism design is a subfield of microeconomics and game theory. It considers how to implement good system-wide solutions to problems that involve multiple self-interested agents, each with private information about their preferences for different outcomes. Incorporating mechanism design into the study of privacy protection has recently attracted some attention. In a nutshell, a mechanism defines the strategies available to the agents and the method used to select the outcome based on the agents' strategies. Specifically, consider a group of n agents, where each agent i has a privately known type t_i ∈ T. A mechanism M : T^n → O is a mapping between the (reported) types of the n agents and some outcome space O. The agent's type t_i determines its preferences over different outcomes. The utility that agent i with type t_i gets from outcome o ∈ O is denoted by u_i(o, t_i). Agents are assumed to be rational, that is, agent i prefers outcome o_1 over o_2 when u_i(o_1, t_i) > u_i(o_2, t_i). The mechanism designer designs the rules of a game, so that the agents will participate in the game and the equilibrium strategies of the agents lead to the designer's desired outcome. Mechanism design is mostly applied to auction design, where an auction mechanism specifies how to determine the winning bidder(s) and how much the bidder should pay for the goods. In the context of data mining, the data collector,
who often plays the role of data miner as well, acts as the mechanism designer, and the data providers are agents with private information. The data collector wants data providers to participate in the data mining activity, i.e. to hand over their private data, but the data providers may choose to opt out because of privacy concerns. In order to get useful data mining results, the data collector needs to design mechanisms that encourage data providers to opt in.
1) MECHANISMS FOR TRUTHFUL DATA SHARING
A mechanism requires agents to report their preferences over the outcomes. Since the preferences are private information and agents are self-interested, it is likely that an agent would report false preferences. In many cases, the mechanism is expected to be incentive compatible, that is, reporting one's true preferences should bring the agent larger utility than reporting false preferences. Such a mechanism is also called a truthful mechanism. Researchers have investigated incentive compatible mechanisms for privacy-preserving distributed data mining. In distributed data mining, the data needed for the mining task are collected from multiple parties. Privacy-preserving methods such as secure multi-party computation protocols can guarantee that only the result is disclosed. However, there is no guarantee that the data provided by participating parties are truthful. If the data mining function is reversible, that is, given two inputs x and x′ and the result f(x) a data provider is able to calculate f(x′), then there is a motivation for the provider to provide false data in order to exclusively learn the correct mining result. To encourage truthful data sharing, Nix and Kantarcioglu model the distributed data mining scenario as an incomplete information game and propose two incentive compatible mechanisms. The first mechanism, designed for the non-cooperative setting, is a Vickrey-Clarke-Groves (VCG) mechanism. The VCG mechanism can encourage truthful data sharing for the risk-averse data provider, and can give a close approximation that encourages minimal deviation from the true data for the risk-neutral data provider. The second mechanism, designed for the cooperative setting, is based on the Shapley value. When data providers form multiple coalitions, this mechanism can create incentives for entire groups of providers to truthfully reveal their data. The practical viability of these two mechanisms has been tested on three data mining models, namely naïve Bayesian classification, decision tree classification, and support vector machine classification. In their later work, Nix and Kantarcioglu investigate what kinds of privacy-preserving data analysis (PPDA) techniques can be implemented in a way that participating parties have the incentive to provide their true private inputs upon engaging in the corresponding SMC protocols. Under the assumption that participating parties prefer to learn the data analysis result correctly and, if possible, exclusively, the study shows that several important PPDA tasks, including privacy-preserving association rule mining, privacy-preserving naïve Bayesian classification and privacy-preserving decision tree classification, are incentive driven. Based on Nix and Kantarcioglu's work, Panoui et al. employ the VCG mechanism to achieve privacy-preserving collaborative classification. They consider three types of strategies that a data provider can choose: providing true data, providing perturbed data, or providing randomized data. They show that the use of the VCG mechanism can lead to high accuracy of the data mining task, while data providers are allowed to provide perturbed data, which means the privacy of data providers can be preserved.
2) PRIVACY AUCTIONS
Aiming at providing support for some specific data mining task, the data collector may ask data providers to provide their sensitive data. The data provider will suffer a loss in privacy if he decides to hand over his sensitive data. In order to motivate data providers to participate in the task, the data collector needs to pay monetary incentives to data providers to compensate for their privacy loss. Since different data providers assign different values to their privacy, it is natural for the data collector to consider buying private data using an auction. In other words, the data provider can sell his privacy at auction. Ghosh and Roth [115] initiate the study of privacy auctions in a setting where n individuals sell their binary data to a data analyst. Each individual possesses a private bit b_i ∈ {0, 1} representing his sensitive information (e.g. whether the individual has some embarrassing disease), and reports a cost function c_i to the data analyst, who wants to estimate the sum of bits ∑_{i=1}^n b_i. Differential privacy [80] is employed to quantify the privacy cost: c_i(ε) = v_i · ε, where v_i is a privately known parameter representing the individual's value for privacy, and ε is the parameter of differential privacy. The cost function determines the individual's privacy loss when his private bit b_i is used in an ε-differentially private manner. The compensation (i.e. payment) that an individual can get from the data analyst is determined by a mechanism which takes the cost parameters v = (v_1, …, v_n) and the collection of private bit values b = (b_1, …, b_n) as input. In an attempt to maximize his payment, an individual may misreport his value for privacy (i.e. v_i), thus the data collector needs to design truthful mechanisms that can incentivize individuals to report their true privacy costs. Ghosh and Roth study the mechanism design problem for two models, namely the insensitive value model and the sensitive value model. The insensitive value model considers only the privacy cost incurred by b_i and ignores the potential loss due to implicit correlations between v_i and b_i. It is shown that a truthful mechanism can be derived for the insensitive value model to help the data analyst obtain an accurate estimate of the sum.
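For intuition about the ε-differentially private estimation underlying this setting, the following sketch applies the standard Laplace mechanism to a sum of bits (a query of sensitivity 1). This is the generic mechanism from differential privacy, not Ghosh and Roth's auction mechanism itself:

# Laplace mechanism: an epsilon-differentially private estimate of a sum of
# private bits. A larger epsilon means less noise but, per the cost model
# above, a higher privacy cost v_i * epsilon owed to each individual.
import numpy as np

rng = np.random.default_rng(2)
bits = rng.integers(0, 2, size=1000)    # private bits b_i in {0, 1}

def private_sum(bits, epsilon):
    return bits.sum() + rng.laplace(scale=1.0 / epsilon)   # sensitivity 1

print(bits.sum(), private_sum(bits, epsilon=0.1), private_sum(bits, epsilon=1.0))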
7.6. ASSUMPTIONS OF THE GAME MODEL
In the above discussions we have reviewed the game theoretical approaches to privacy issues in data mining. We present the basic elements of some proposed game models in Table 2.
Most of the proposed approaches adopt the following research paradigm:
• identify the elements of the game, namely the players, the actions and the payoffs;
• determine the type of the game: static or dynamic, complete information or incomplete information;
• solve the game to find its equilibria;
• analyze the equilibria to obtain implications for practice.
Usually we have to make a few assumptions when developing the game model. Unreasonable assumptions or too many assumptions will hurt the applicability of the game model.
CONCLUSION
We have discussed four different user roles that are commonly involved in data mining applications, i.e. data provider, data collector, data miner and decision maker. Each user role has its own privacy concerns, hence the privacy-preserving approaches adopted by one user role are generally different from those adopted by others.
For the data provider, the privacy-preserving objective is to effectively control the amount of sensitive data revealed to others. To achieve this goal, he can utilize security tools to limit others' access to his data, sell his data at auction to get enough compensation for the privacy loss, or falsify his data to hide his true identity.
For the data collector, the privacy-preserving objective is to release useful data to data miners without disclosing data providers' identities and sensitive information about them. To achieve this goal, he needs to develop proper privacy models to quantify the possible loss of privacy under different attacks, and apply anonymization techniques to the data.
For the data miner, the privacy-preserving objective is to get correct data mining results while keeping sensitive information undisclosed both in the process of data mining and in the mining results. To achieve this goal, he can choose a proper method to modify the data before certain mining algorithms are applied, or utilize secure computation protocols to ensure the safety of the private data and of the sensitive information contained in the learned model.
For the decision maker, the privacy-preserving objective is to make a correct judgement about the credibility of the data mining results he receives. To achieve this goal, he can utilize provenance techniques to trace back the history of the received information, or build classifiers to discriminate true information from false information.