Independent component analysis is a probabilistic
method for learning a linear transform of a random
vector. The goal is to find components that are
maximally independent and non-Gaussian (non-normal).
Its fundamental difference to classical multivariate
statistical methods is in the assumption of
non-Gaussianity, which enables the identification
of original, underlying components, in contrast to
classical methods. The basic theory of independent
component analysis was mainly developed in the
1990s and summarized, for example, in our monograph
in 2001. Here, we provide an overview of
some recent developments in the theory since the year
2000. The main topics are: analysis of causal relations,
testing independent components, analysing multiple
datasets (three-way data), modelling dependencies
between the components and improved methods for
estimating the basic model.
1. Introduction
It is often the case that the measurements provided
by a scientific device contain several interesting phenomena
mixed together. For example, an electrode placed on the
scalp as in electroencephalography measures a weighted
sum of the electrical activities of many brain areas.
A microphone measures sounds coming from different
sources in the environment. On a more abstract level,
a gene expression level may be considered the sum of
many different biological processes. A fundamental goal
in scientific enquiry is to find the underlying, original
signals or processes that usually provide important
information that cannot be directly or clearly seen in the
observed signals.
Independent component analysis (ICA; Jutten &
Hérault [1]) has been established as a fundamental way of analysing such multi-variate data. It learns a linear decomposition (transform) of the data,
such as the more classical methods of factor analysis and principal component analysis (PCA).
However, ICA is able to find the underlying components and sources mixed in the observed data
in many cases where the classical methods fail.
ICA attempts to find the original components or sources by making some simple assumptions about their
statistical properties. As in many other methods, the underlying processes are assumed to be
statistically independent of each other, which is realistic if they correspond to distinct physical processes.
However, what distinguishes ICA from PCA and factor analysis is that it uses the non-Gaussian
structure of the data, which is crucial for recovering the underlying components that created
the data.
ICA is an unsupervised method in the sense that it takes the input data in the form of a
single data matrix. It is not necessary to know the desired ‘output’ of the system, or to divide the
measurements into different conditions. This is in strong contrast to classical scientific methods
based on some experimentally manipulated variables, as formalized in regression or classification
methods. ICA is thus an exploratory, or data-driven method: we can simply measure some
system or phenomenon without designing different experimental conditions. ICA can be used
to investigate the structure of the data when suitable hypotheses are not available, or they are
considered too constrained or simplistic.
Previously, we wrote a tutorial on ICA [2] as well as a monograph [3]. However, that material
is more than 10 years old, so our purpose here is to provide an update on some of the main
developments in the field since the year 2000 (see Comon & Jutten [4] for a recent in-depth
reference). The main topics we consider below are:
— causal analysis, or structural equation modelling (SEM), using ICA (§3);
— testing of independent components for statistical significance (§4);
— group ICA, i.e. ICA on three-way data (§5);
— modelling dependencies between components (§6); and
— improvements in estimating the basic linear mixing model, including ICA using
time–frequency decompositions, ICA using non-negative constraints, and modelling
component distributions (§7).
We start with a very short exposition of the basic theory in §2.
2. Basic theory of independent component analysis
In this section, we provide a succinct exposition of the basic theory of ICA before going to recent
developments in subsequent sections.
(a) Definition
Let us denote the observed variables by xi(t), i = 1, ... , n, t = 1, ... , T. Here, i is the index
of the observed data variable and t is the time index, or some other index of the different
observations. The xi(t) are typically signals measured by a scientific device. We assume that they
can be modelled as linear combinations of hidden (latent) variables sj(t), j = 1, ... , m, with some
unknown coefficients aij,
$$x_i(t) = \sum_{j=1}^{m} a_{ij}\, s_j(t), \quad \text{for all } i = 1, \dots, n. \qquad (2.1)$$
The fundamental point is that we observe only the variables xi(t), whereas both aij and si(t) are to
be estimated or inferred. The si are the independent components, whereas the coefficients aij are
called the mixing coefficients.
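To make the notation concrete, the model (2.1) can also be written in matrix form as X = AS, which is the form used when whitening is discussed below. The following minimal sketch (in Python/NumPy; an illustration added for this exposition, not code from the paper) generates synthetic data from the model, using a Laplace distribution as an arbitrary example of a non-Gaussian source density:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 3, 10_000

# Independent, non-Gaussian sources s_j(t); the Laplace distribution is just
# one convenient example of a non-Gaussian (super-Gaussian) density.
S = rng.laplace(size=(n, T))

# Unknown mixing coefficients a_ij collected into a square mixing matrix A.
A = rng.normal(size=(n, n))

# Observed signals x_i(t) = sum_j a_ij s_j(t), i.e. X = A S in matrix form.
X = A @ S
```

Only X would be available to an ICA algorithm; A and S are what we try to recover.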
(b) Identifiability
The main breakthrough in the theory of ICA was the realization that the model can be made
identifiable by making the unconventional assumption of the non-Gaussianity of the independent
components [5]. More precisely, assume the following.
— The components si are mutually statistically independent. In other words, their joint
density function is factorizable: $p(s_1, \dots, s_m) = \prod_{j} p(s_j)$.
— The components si have non-Gaussian (non-normal) distributions.
— The mixing matrix A is square (i.e. n = m) and invertible.
Under these three conditions, the model is essentially identifiable [5,6]. This means that the mixing
matrix and the components can be estimated up to the following rather trivial indeterminacies:
(i) the signs and scales of the components are not determined, i.e. each component is
estimated only up to a multiplicative scalar factor, and (ii) the ordering of the components is
not determined.
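Continuing the synthetic sketch above, these indeterminacies are easy to verify numerically: rescaling and permuting the components while compensating in the mixing matrix leaves the observations unchanged (the particular scaling and permutation below are arbitrary illustrative choices):

```python
# Arbitrary signs/scales and an arbitrary permutation of the components.
D = np.diag([2.0, -0.5, 3.0])
P = np.eye(3)[[2, 0, 1]]

# Compensate in the mixing matrix; the observed data are exactly the same,
# so (A, S) and (A_alt, S_alt) cannot be told apart from X alone.
S_alt = P @ np.linalg.inv(D) @ S
A_alt = A @ D @ np.linalg.inv(P)
assert np.allclose(A @ S, A_alt @ S_alt)
```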
The assumption of independence can be seen as a rather natural ‘default’ assumption when
we do not want to postulate any specific dependencies between the components. It is also more
or less implicit in the theory of classical factor analysis, where the components or factors are
assumed uncorrelated and Gaussian, which implies that they are independent (more on this
below). A physical interpretation of independence is also sometimes possible: if the components
are created by physically separate and non-interacting entities, then they can be considered
statistically independent.
On the other hand, the third assumption is not necessary and can be relaxed in different ways,
but most of the theory makes this rather strict assumption for simplicity.
So, the real fundamental departure from conventional multi-variate statistics is to assume that
the components are non-Gaussian. Non-Gaussianity also gives a new meaning to independence:
for variables with a joint Gaussian distribution, uncorrelatedness and independence are
in fact equivalent. Only in the non-Gaussian case is independence something more than
uncorrelatedness. Uncorrelatedness is assumed in other methods such as PCA and factor analysis,
but this non-Gaussian form of independence is usually not.
As a trivial example, consider two-dimensional data that are concentrated on four points:
(−1, 0), (1, 0), (0, −1), (0, 1), each with probability 1/4. The variables x1 and x2 are uncorrelated
owing to symmetry with respect to the axes: if you flip the sign of x1, the distribution stays
the same, and thus we must have E{x1x2} = E{(−x1)x2}, which implies their correlation (and
covariance) must be zero. On the other hand, the variables clearly are not independent because if
x1 takes the value −1, we know that x2 must be zero.
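The sketch below (added here as an illustration, not from the paper) checks this numerically: the sample covariance is essentially zero, while, for instance, E{x1^2 x2^2} = 0 differs from E{x1^2}E{x2^2} = 1/4, so the variables are clearly dependent:

```python
import numpy as np

rng = np.random.default_rng(0)
points = np.array([(-1, 0), (1, 0), (0, -1), (0, 1)], dtype=float)
x1, x2 = points[rng.integers(0, 4, size=100_000)].T

print(np.mean(x1 * x2))                 # ~ 0: uncorrelated by symmetry
print(np.mean(x1**2 * x2**2))           # exactly 0: one coordinate is always zero
print(np.mean(x1**2) * np.mean(x2**2))  # ~ 0.25: so x1 and x2 are not independent
```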
(c) Objective functions and algorithms
Most ICA algorithms divide the estimation of the model into two steps: a preliminary whitening
and the actual ICA estimation. Whitening means that the data are first linearly transformed by a
matrix V such that Z = VX is white, i.e.
$$\frac{1}{T} Z Z^T = I \quad \text{or} \quad \frac{1}{T} \sum_{t=1}^{T} z(t) z(t)^T = I, \qquad (2.4)$$
where I is the identity matrix. Such a matrix V can be easily found by PCA: normalizing the
principal components to unit variance is one way of whitening data (but not the only one).
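As a concrete sketch (NumPy, written for this exposition rather than taken from the paper), such a whitening matrix can be computed from the eigendecomposition of the sample covariance; the symmetric ('ZCA') variant used below is just one of the possible choices:

```python
import numpy as np

def whiten(X):
    """Whiten the rows of X (variables x samples) via PCA.

    Returns whitened data Z and a whitening matrix V such that Z = V Xc
    has sample covariance equal to the identity, as in (2.4).
    """
    Xc = X - X.mean(axis=1, keepdims=True)   # centre each variable
    C = Xc @ Xc.T / Xc.shape[1]              # sample covariance matrix
    eigval, E = np.linalg.eigh(C)            # C = E diag(eigval) E^T
    V = E @ np.diag(eigval ** -0.5) @ E.T    # symmetric whitening (one choice of V)
    return V @ Xc, V

Z, V = whiten(X)                             # X from the earlier synthetic sketch
assert np.allclose(Z @ Z.T / Z.shape[1], np.eye(Z.shape[0]))
```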
The utility of this two-step procedure is that after whitening, the ICA model still holds,
$$Z = VX = VAS = \tilde{A}S \quad \text{or} \quad z = \tilde{A}s, \qquad (2.5)$$
where the transformed mixing matrix $\tilde{A} = VA$ is now orthogonal [2,5]. Thus, after whitening, we
can constrain the estimation of the mixing matrix to the space of orthogonal matrices, which
reduces the number of free parameters in the model. Numerical optimization in the space of
orthogonal matrices tends to be faster and more stable than in the general space of matrices,
which is probably the main reason for making this transformation.
It is important to point out that whitening is not uniquely defined. In fact, if z is white, then
any orthogonal transform Uz, with U being an orthogonal matrix, is white as well. This highlights
the importance of non-Gaussianity: mere information of uncorrelatedness does not lead to a
unique decomposition. Because, for Gaussian variables, uncorrelatedness implies independence,
whitening exhausts all the dependence information in the data, and we can estimate the mixing
matrix only up to an arbitrary orthogonal matrix. For non-Gaussian variables, on the other hand,
whitening does not at all imply independence, and there is much more information in the data
than what is used in whitening.
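To see this non-uniqueness concretely (again an illustrative sketch continuing the examples above): rotating whitened data by any orthogonal matrix leaves it white, since cov(Uz) = U I U^T = I.

```python
# A random orthogonal matrix U (via QR of a random matrix); U @ Z stays white.
U, _ = np.linalg.qr(np.random.default_rng(1).normal(size=(3, 3)))
Z_rot = U @ Z
assert np.allclose(Z_rot @ Z_rot.T / Z_rot.shape[1], np.eye(3))
```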
For whitened data, considering an orthogonal mixing matrix, we estimate $\tilde{A}$ by maximizing
some objective function that is related to a measure of non-Gaussianity of the components. For a
tutorial treatment on the theory of objective functions in ICA, we refer the reader to Hyvärinen &
Oja [2] and Hyvärinen et al. [3]. Basically, the main approaches are maximum-likelihood estimation [7], and minimization of the mutual information between estimated components [5].
Mutual information is an information-theoretically motivated measure of dependence; so its
minimization is simply motivated by the goal of finding components that are as independent
as possible. Interestingly, both of these approaches lead to essentially the same objective function.
Furthermore, a neural network approach called infomax was proposed by Bell & Sejnowski [8]
and Nadal & Parga [9], and was shown to be equivalent to likelihood by Cardoso [10].
The ensuing objective function is usually formulated in terms of the inverse of $\tilde{A}$, whose rows
are denoted by $w_i^T$, as
$$L(W) = \sum_{i=1}^{n} \sum_{t=1}^{T} G_i(w_i^T z(t)), \qquad (2.6)$$
where $G_i$ is the logarithm of the probability density function (pdf) of $s_i$, or its estimate $w_i^T z$. In
practice, quite rough approximations of the log-pdf are used; the choice G(u) = − log cosh(u),
which is essentially a smoothed version of the negative absolute value function −|u|, works well
in many applications. This function is to be maximized under the constraint of orthogonality of
the wi. The z(t) are here the observed data points that have been whitened.
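For concreteness, the objective (2.6) with this choice of G can be evaluated as follows (a sketch with assumed variable names, continuing the NumPy examples above; W is any candidate unmixing matrix with orthonormal rows):

```python
def ica_objective(W, Z):
    """Objective (2.6) with G(u) = -log cosh(u), for whitened data Z (n x T)."""
    Y = W @ Z                           # estimated components w_i^T z(t)
    return np.sum(-np.log(np.cosh(Y)))  # sum over components i and time points t
```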
Interestingly, this objective function depends only on the marginal densities of the estimated
independent components $w_i^T z(t)$. This is quite advantageous because it means we do not
need to estimate any dependencies between the components, which would be computationally
very complicated.
Another interesting feature of the objective function in (2.6) is that each term $\sum_{t} G_i(w_i^T z(t))$
can be interpreted as a measure of non-Gaussianity of the estimated component $w_i^T z$. In fact, this
is an estimate of the negative differential entropy of the components, and differential entropy can
be shown to be maximized for a Gaussian variable (for fixed variance). Thus, ICA estimation is
essentially performed by finding uncorrelated components that maximize non-Gaussianity (see
Hyvärinen & Oja [2] and Hyvärinen et al. [3] for more details).
Such objective functions are then optimized by a suitable optimization method, the most
popular ones being FastICA [11] and natural gradient methods [12].
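As an illustration of such an optimization, the following sketch implements a simplified symmetric fixed-point iteration of the FastICA type on whitened data, using g = tanh (the derivative of the log cosh nonlinearity above); it is a bare-bones variant written for this exposition, not the reference implementation of [11], and omits convergence checks.

```python
def fastica_symmetric(Z, n_iter=200, seed=0):
    """Simplified symmetric FastICA-style estimation on whitened data Z (n x T)."""
    n, T = Z.shape
    W = np.linalg.qr(np.random.default_rng(seed).normal(size=(n, n)))[0]
    for _ in range(n_iter):
        Y = W @ Z                                   # current component estimates
        g, g_prime = np.tanh(Y), 1.0 - np.tanh(Y) ** 2
        # Fixed-point update for every row: E{z g(w^T z)} - E{g'(w^T z)} w.
        W_new = (g @ Z.T) / T - np.diag(g_prime.mean(axis=1)) @ W
        # Symmetric decorrelation: project W_new back onto the orthogonal matrices.
        U, _, Vt = np.linalg.svd(W_new)
        W = U @ Vt
    return W                                        # rows estimate the w_i^T

W_est = fastica_symmetric(Z)
S_est = W_est @ Z    # recovered components, up to sign and permutation
```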