25-08-2017, 09:32 PM
The Information Bottleneck: Theory and Applications
Information Bottleneck:.pdf (Size: 1.34 MB / Downloads: 99)
Abstract
This thesis introduces the first comprehensive review of the Information Bottleneck (IB) method along
with its recent extension, the multivariate IB. The IB method was originally suggested in [82] as a new
information-theoretic approach for data analysis. The basic idea is surprisingly simple: Given a joint distri-
μ, find a compressed representation of , denoted by , that is as informative as possible about
bution ́
. This idea can be formulated as a variational principle of minimizing the mutual information, ́
μ
(which controls the compactness of the representation ), under some constraint on the minimal level of
μ . Hence, the fundamental trade-off between
mutual information that preserves about , given by ́
the complexity of the model and its precision is expressed here in an entirely symmetric form, where the
exact same concept of information controls both its sides. Indeed, an equivalent posing of the IB principle
μ
would be to maximize the information maintains about , where the (compression) information ́
is constrained to some maximal level.
As further shown in [82], this constrained optimization problem can be considered analogous to rate distor-
tion theory, but with an important distinction: the distortion measure does not need to be defined in advance,
μ. Moreover, it leads to a tractable mathematical
but rather naturally emerges from the joint statistics, ́
analysis which provides a formal characterization of the optimal solution to this problem. As an immedi-
ate implication, the IB method formulates a well defined information-theoretic framework for unsupervised
clustering problems, which is the main focus of this thesis. Nonetheless, it is important to keep in mind
that the same underlying principle of a trade-off between information terms may have further implications
in other related fields, as recently suggested in [37].
Introduction
In this chapter we provide a basic introduction to the remaining chapters. In the first section we present high
level descriptions of the fundamental trade-off between precision and complexity. One important variant of
this trade-off is formulated as the problem of unsupervised clustering, which is the main problem we address
in this thesis. In the next section we present the necessary preliminaries for our analysis. We conclude this
chapter by presenting a simple example in order to elucidate the central ideas that will be discussed later on.
Preliminaries
In this section we introduce the basic concepts required for the next chapters. We start with some nota-
divergence and
tions, and further state the definitions of entropy, mutual and multi information,
divergence. Most of this section is based on Chapter 3⁄4 in [20], Chapter ¿ in [5] (which provides a friendly
introduction to the concept of entropy), and a work in progress by Nemenman and Tishby [55], which
introduces a new axiomatic derivative of mutual and multi-information.
Entropy and related concepts
Consider the following situation. We are given a finite collection of documents, denoted by
A person chooses to read a single document out of this collection, and our task is to guess which document
was chosen. Without any prior knowledge, all guesses are equally likely. We now further assume that we
3⁄4 ,
have access to a definite set of (exhaustive and mutually exclusive) probabilities, denoted by ́ μ
for all the possible choices. For example, let us assume that longer documents are more probable than shorter
ones. More specifically, that the probability of choosing each document is proportional to the (known) num-
ber of words that occur in it. If all the documents consist of exactly the same number of words, ́ μ is
uniform and obviously we are back at the starting point where no guess is preferable. However, if one docu-
ment is much longer than all the others, ́ μ will have a clear peak for this document, hence our chances of
providing the correct answer will improve. How can we quantify the difference between these two scenarios
in a well defined way?
Mutual information and multi-information
Let us reconsider our previous example of trying to guess what document was chosen. However, we now
assume that we have access not only to the prior distribution ́ μ, but rather to a joint distribution of
with some other random variable, . For concreteness, if values correspond to all the possible document
values correspond to all the distinct words occurring in this document
identities, let us assume that
μ
collection. Thus, more formally stated, we assume that we have access to the joint distribution ́
which indicates the probability that a random word position in the corpus is equal to 3⁄4
while the
document identity is 3⁄4 . 3