05-03-2013, 04:56 PM
Query Profile Obfuscation by means of Optimal Query Exchange between Users
Query Profile Obfuscation.pdf (Size: 1.21 MB / Downloads: 26)
Abstract
We address the problem of query profile obfuscation by means of partial query exchanges between two users, so that
their profiles of interest appear distorted to the information provider (database, search engine, etc.). We describe a methodology to
reach mutual privacy gain, that is, a situation where both users increase their own privacy protection by collaborating in query
exchange. To this end, our approach starts with a mathematical formulation that models the users' apparent profiles
as probability distributions over categories of interest and measures their privacy as the corresponding Shannon entropy. The
question of which query categories to exchange then translates into an optimization over variables representing exchange policies,
with various objectives based on those entropies, possibly under exchange traffic constraints.
INTRODUCTION
User profiling is very common in information retrieval
systems, for example in Web search engines. The information
provider (IP) operating the information retrieval
system can derive substantial marketing benefits from
improved knowledge of the user interest profiles. For
the users themselves, profiling can result in improved
and more targeted search results; for example, thanks to
profiling, Amazon is able to present to each customer
a list of books which are likely to interest her. In the
more general setting of electronic commerce, profiling
customers by their interests is essential to customer
relationship management.
The negative side of user profiling is that the interests
and the query history of users may contain information
considered private. For example, if a user has looked
up a certain disease, it can be inferred that either the
user, or someone close to her, suffers from that disease.
This is not merely academic speculation: in 2006, 20 million
queries submitted by 658,000 users of the AOL search
engine were publicly disclosed, even though AOL claimed that the queries
were properly protected against user re-identification.
Two New York Times journalists identified a user after
studying the released queries [2].
Contribution and Organization
The main goal and contribution of this paper
is a first step towards modeling and analyzing
whether, and to what extent, mutual privacy gain may be
obtained from query exchange, for
the special case of two users. In a nutshell, we first model
user behavior in an information retrieval system by a
histogram of queries assigned to predefined categories.
The histogram of each user contains the queries that, from
the point of view of any external observer, including
the IP itself, appear to have been created by that user.
In reality, it comprises the queries originated by
this user that were not exchanged with the other user, together with the queries
submitted on behalf of the other user. Next, we propose
to measure the privacy of such a histogram, effectively
the apparent user profile, as the Shannon entropy of the
relative frequencies across the categories. Finally, modeling
exchange policies also as categorical histograms, we
quantify a number of specific query exchange strategies,
and find the optimal trade-off between the privacy of
one user and the other’s. The study of these strategies
is mainly theoretical, but illustrated numerically, and
draws upon concepts and techniques from the fields of
information theory and convex and numerical optimization.
We must stress that user profile perturbation by
means of query exchange may readily be integrated with
other privacy-protecting techniques, particularly pseudonymization,
in order to reinforce them.
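As a concrete illustration of the entropy-based privacy measure outlined above, the following minimal sketch (with hypothetical category counts, not taken from the paper) computes the Shannon entropy of the relative query frequencies of a profile histogram; under this criterion, a flatter apparent profile yields a higher entropy and therefore more privacy.

```python
import numpy as np

def profile_entropy(counts):
    """Shannon entropy, in bits, of the relative frequencies of a query histogram.

    Empty categories contribute nothing, by the usual convention 0 * log 0 = 0.
    """
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()   # relative frequencies, i.e., the (apparent) profile
    p = p[p > 0]                # drop empty categories
    return float(-(p * np.log2(p)).sum())

# Hypothetical 4-category query histograms.
skewed  = [40, 30, 20, 10]     # concentrated interests -> lower entropy (about 1.85 bits)
uniform = [25, 25, 25, 25]     # uniform interests -> maximum entropy, log2(4) = 2 bits
print(profile_entropy(skewed), profile_entropy(uniform))
```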
BACKGROUND
The literature on information retrieval provides numerous
approaches to user privacy [16], some of which we
touch upon here; many are extensible to scenarios other than
those originally intended. After surveying the state of the art in
such antiprofiling techniques, we proceed to review the
idea of coprivacy, related to the notion of mutual privacy
gain explored in this work.
Cryptographic methods for private information retrieval
(PIR) enable a user to privately retrieve the
contents of a database, indexed by a memory address
sent by the user, in the sense that it is not feasible for
the database provider to ascertain which of the entries
was retrieved [1], [10], [11], [26], [41]. Simply put, PIR
provides protocols whereby the IP does not learn what
item the user is retrieving. Yet, these protocols make a
number of problematic assumptions, some of which we
would like to mention. First, information-theoretically
secure PIR requires the user to “touch” all items to avoid
leaking any clues about the item the user is interested
in. Secondly, the IP is assumed to cooperate in PIR
protocols, which is dubious given that the IP normally
wishes to profile users. Thirdly, PIR assumes that the
IP stores items in a vector and that the user somehow
manages to find the address of the desired item. More
generally, these protocols are effectively limited to query-response
functions in the form of a finite lookup table
of precomputed answers.
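For illustration only, the following sketch (hypothetical names; not part of the paper) shows the trivial information-theoretically private strategy implied by the "touch all items" remark: the user downloads the whole lookup table of precomputed answers and selects the desired entry locally, so the provider learns nothing about the query, at the cost of transferring the entire table.

```python
def trivially_private_lookup(fetch_whole_table, index):
    """Trivial information-theoretically secure retrieval: request everything.

    The provider only ever sees "send me the whole table", so no information
    about `index` leaks; the price is communication linear in the table size.
    """
    table = fetch_whole_table()   # the user "touches" every item
    return table[index]           # selection happens entirely on the client side

# Hypothetical table of precomputed answers held by the information provider.
answers = ["entry-0", "entry-1", "entry-2", "entry-3"]
print(trivially_private_lookup(lambda: list(answers), 2))  # -> "entry-2"
```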
PROBLEM FORMULATION
As stated in Sec. 1.1 and conceptually depicted in Fig. 1,
our work addresses the problem of privacy protection
against user profiling, by means of altering user profiles
from the perspective of an external attacker, exploiting
the possibility of query exchange among users. In the
following subsections we present our assumptions and
propose a measure of privacy, all of which will finally
enable us to formulate the problem of query exchange
mathematically, as an optimization problem whose objective
is our measure of privacy and whose variables
represent a particular choice of exchange policy.
Preliminary Assumptions and Attacker Model
A tractable model of a user’s activity with regard to
profiling, used for example in [17], [18], [20], [42], [44],
consists in representing profiles by histograms of relative
frequencies of queries within a predefined set of
categories of interest. In practice, this theoretical model
would involve the establishment of such categories and
a categorization procedure, which will impact the quantification
of privacy [24], [33], [42]. The histogram that
a user would originate by submitting queries directly to
the IP, if privacy were not a concern, will be called the actual
profile. When this user decides to cooperate with another to
partly exchange queries, an external observer, such as
the IP itself, can only observe a perturbed version of
this histogram, containing some queries originated by
the user in question and some submitted on behalf of
the other. We shall call the resulting perturbed histogram
the apparent profile. This was informally depicted in Fig. 1.
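As a sketch of this model (hypothetical counts and exchange amounts; the paper's own notation and policy variables may differ), the apparent histogram of each user can be built by removing the queries she hands over and adding the queries she submits on behalf of the other user:

```python
import numpy as np

def entropy(counts):
    """Shannon entropy, in bits, of the relative frequencies of a histogram."""
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def apparent_profiles(q1, q2, x1, x2):
    """Apparent (absolute) histograms after a partial query exchange.

    q1, q2 : actual per-category query counts of users 1 and 2
    x1, x2 : per-category counts each user hands over to the other (0 <= x <= q)
    """
    q1, q2, x1, x2 = (np.asarray(a, dtype=float) for a in (q1, q2, x1, x2))
    s1 = q1 - x1 + x2   # user 1 submits her remaining queries plus user 2's exchanged ones
    s2 = q2 - x2 + x1   # symmetrically for user 2
    return s1, s2

# Hypothetical 3-category example.
q1, q2 = np.array([60., 30., 10.]), np.array([10., 20., 70.])
x1, x2 = [20., 0., 0.], [0., 0., 30.]    # each user gives away part of her dominant category
s1, s2 = apparent_profiles(q1, q2, x1, x2)
print(entropy(q1), "->", entropy(s1))    # user 1: actual vs apparent entropy
print(entropy(q2), "->", entropy(s2))    # user 2: actual vs apparent entropy
```

In this toy instance both apparent entropies exceed the corresponding actual ones, which is exactly the mutual privacy gain situation studied in the paper.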
THEORETICAL ANALYSIS
We devote this section to a partial characterization of the
privacy region, that is, the region of pairs of entropy values
associated with pairs of possible apparent profiles.
While partial, this characterization will suffice to identify
the optimal exchange policies when our objective is to
maximize either the minimum privacy or the minimum
privacy gain with respect to the original values.
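In symbols, and using hypothetical notation rather than the paper's own (H for Shannon entropy, p and p' for the two apparent relative-frequency profiles induced by an exchange policy, and q and q' for the corresponding actual profiles), the two objectives read roughly as

```latex
\max_{\text{exchange policy}} \; \min\{\, \mathrm{H}(p),\; \mathrm{H}(p') \,\}
\qquad \text{and} \qquad
\max_{\text{exchange policy}} \; \min\{\, \mathrm{H}(p) - \mathrm{H}(q),\; \mathrm{H}(p') - \mathrm{H}(q') \,\},
```

in both cases subject to the exchanged traffic in each category not exceeding what the corresponding user actually generates and, possibly, to an overall traffic constraint.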
We shall assume that, for each category i = 1, . . . , n, the
query count of at least one of the two users is positive. In other words, we shall assume
that n is the number of active categories. In practice, no
query exchange policy can modify the complete absence
of activity within a category in which none of the users
involved shows any interest. On a different note, full
traffic exchange will not be contemplated, as one of the
apparent profiles would have zero absolute activity, and
its entropy would become undefined. Instead, later in
this section, we shall consider the case of arbitrarily
low activity on a uniformly distributed profile, which
formally maximizes the entropy.
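Purely as a numerical illustration of exploring the privacy region (the paper characterizes the optimal policies analytically; all names and counts below are hypothetical), a coarse grid search over per-category exchange fractions that maximizes the minimum apparent entropy could look like this. Full exchange is excluded, consistently with the remark above, by capping the fractions below one.

```python
import itertools
import numpy as np

def entropy(counts):
    """Shannon entropy, in bits, of the relative frequencies of a histogram."""
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def max_min_privacy(q1, q2, steps=4):
    """Brute-force the per-category exchange fractions that maximize the
    minimum of the two apparent-profile entropies (max-min privacy)."""
    q1, q2 = np.asarray(q1, float), np.asarray(q2, float)
    n = len(q1)
    grid = np.linspace(0.0, 0.9, steps)   # cap below 1: full exchange would leave a user with zero activity
    best_value, best_policy = -np.inf, None
    for f1 in itertools.product(grid, repeat=n):       # fractions user 1 hands over, per category
        for f2 in itertools.product(grid, repeat=n):   # fractions user 2 hands over, per category
            x1, x2 = np.array(f1) * q1, np.array(f2) * q2
            s1, s2 = q1 - x1 + x2, q2 - x2 + x1        # apparent histograms
            value = min(entropy(s1), entropy(s2))
            if value > best_value:
                best_value, best_policy = value, (x1, x2)
    return best_value, best_policy

value, (x1, x2) = max_min_privacy([60, 30, 10], [10, 20, 70])
print("max-min apparent entropy:", round(value, 3))
print("queries handed over by user 1:", x1, "and by user 2:", x2)
```

For more categories this brute-force search quickly becomes impractical, which is where the theoretical characterization pursued in this section comes in.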