05-11-2016, 12:26 PM
1465235106-biclusteringalgorithmsAsurvey.pdf (Size: 322.37 KB / Downloads: 62)
Abstract
Analysis of large scale geonomics data, notably gene expression, has initially focused on
clustering methods. Recently, biclustering techniques were proposed for revealing submatrices
showing unique patterns. We review some of the algorithmic approaches to biclustering and
discuss their properties.
1 Introduction
Gene expression profiling has been established over the last decade as a standard technique for
obtaining a molecular fingerprint of tissues or cells in different biological conditions[18, 7]. Based
on the availability of whole genome sequences, the technology of DNA chips (or microarrays)
allowsthe measurement of mRNA levelssimultaneously for thousands of genes. The set (or vector)
of measured gene expression levels under one condition (or sample) are called the profile of that
condition. Gene expression profiles are powerful sources of information and have revolutionized
the way we study and understand function in biological systems [1].
Given a set of gene expression profiles, organized together as a gene expression matrix with
rows corresponding to genes and columns corresponding to conditions, a common analysis goal
is to group conditions and genes into subsets that convey biological significance. In its most
common form, this task translates to the computational problem known as clustering. Formally,
given a set of elements with a vector of attributes for each element, clustering aims to partition
the elements into (possibly hierarchically ordered) disjoint sets, called clusters, so that within each
set the attribute vectors are similar, while vectors of disjoint clusters are dissimilar. For example,
when analyzing a gene expression matrix we may apply clustering to the genes (as elements) given
the matrix rows (as attributes) or cluster the conditions (as elements) given the matrix columns (as
attributes). For reviews on clustering see an earlier chapter in this book. Analysis via clustering
makes several a-priori assumptions that may not be perfectly adequate in all circumstances. First,
clustering can be applied to either genes or samples, implicitly directing the analysis to a particular
Figure 1: Clustering and biclustering of a gene expression matrix. Clusters correspond to disjoint
strips in the matrix. A gene cluster must contain all columns, and a condition cluster must contain
all rows. Biclusters correspond to arbitrary subsets of rows and columns, shown here as rectangles.
Note that since gene (condition) clusters are disjoint, the rows (columns) of the matrix can be
reordered so that each cluster is a contiguous strip. Similar reordering of rows and columns that
shows all the biclusters as rectangles is usually impossible.
aspect of the system under study (e.g., groups of patients or groups of co-regulated genes). Second,
clustering algorithms usually seek a disjoint cover of the set of elements, requiring that no gene or
sample belongs to more than one cluster.
The notion of a bicluster gives rise to a more flexible computational framework. A bicluster is
defined as a submatrix spanned by a set of genes and a set of samples (compare Figure 1). Alternatively,
a bicluster may be defined as the corresponding gene and sample subsets. Given a gene
expression matrix, we can characterize the biological phenomena it embodies by a collection of
biclusters, each representing a different type of joint behavior of a set of genes in a corresponding
set of samples. Note that there are no a-priori constraints on the organization of biclusters and in
particular, genes or samples can be part of more than one bicluster or of no bicluster. The lack
of structural constrains on biclustering solutions allows greater freedom but is consequently more
vulnerable to overfitting. Hence, biclustering algorithms must guarantee that the output biclusters
are meaningful. This is usually done by an accompanying statistical model or a heuristic scoring
method that define which of the many possible submatrices represent a significant biological
behavior. The biclustering problem is to find a set of significant biclusters in a matrix.
In clinical applications, gene expression analysis is done on tissues taken from patients with
a medical condition. Using such assays, biologists have identified molecular fingerprints that can
help in the classification and diagnosis of the patient status and guide treatment protocols [2, 16].
In these studies, the focus is primarily on identifying profiles of expression over a subset of the
genes that can be associated with clinical conditions and treatment outcomes, where ideally, the
set of samples is equal in all but the subtype or the stage of the disease. However, a patient may be a
part of more than one clinical group, e.g., may suffer from syndrome A, have a genetic background
B and be exposed to environment C. Biclustering analysis is thus highly appropriate for identifying and distinguishing the biological factors affecting the patients along with the corresponding gene
subsets.
In functional genomics applications, the goal is to understand the functions of each of the genes
operating in a biological system. The rationale is that genes with similar expression patterns are
likely to be regulated by the same factors and therefore may share function. By collecting expression
profiles from many different biological conditions and identifying joint patterns of gene
expression among them, researchers have characterized transcriptional programs and assigned putative
function to thousands of genes [23, 11, 8]. Since genes have multiple functions, and since
transcriptional programs are often based on combinatorial regulation, biclustering is highly appropriate
for these applications as well.
An important aspect of gene expression data is their high noise levels. DNA chips provide only
rough approximation of expression levels, and are subject to errors of up to two-fold the measured
value [1]. Any analysis method, and biclustering algorithms in particular, should therefore be
robust enough to cope with significant levels of noise.
Below we survey some of the biclustering models and algorithms that were developed for gene
expression analysis. Our coverage is not exhaustive, and is biased toward what we believe are the
more practical methods. We attempt to cover at least one method from each class of algorithms
under development. We do not review methods that are based on extended biological models (e.g.,
inferring regulation or integrating data types [19, 24]), but focus on algorithms for biclustering
per-se. Throughout, we assume that we are given a set of genes a set of conditions , and a
gene expression matrix where is the expression level of gene in sample . We
assume that the matrix is normalized, though some of the algorithms below perform additional
normalization. A bicluster is defined by a subset of genes and a subset of
conditions (or samples) . Different algorithmic approaches to the biclustering problem use
different measures for the quality of a given biclustering solution. We therefore define the goal
function of each algorithm as part of its description.
2 Cheng and Church’s Algorithm
Cheng and Church were the first to introduce biclustering to gene expression analysis [6]. Their
algorithmic framework represents the biclustering problem as an optimization problem, defining a
score for each candidate bicluster and developing heuristics to solve the constrained optimization
problem defined by this score function. In short, the constraints force the uniformity of the matrix,
the procedure gives preference to larger submatrices and the heuristic is a relaxed greedy algorithm.
Cheng and Church implicitly assume that (gene, condition) pairs in a “good” bicluster have a
constant expression level, plus possibly additive row and column specific effects. After removing
row, column and submatrix averages, the residual level should be as small as possible. More
formally, given the gene expression matrix , a subset of genes and a subset of conditions ,
we define (row subset average) (column subset average) and
(submatrix average). We define the residue score of an element in a submatrix
Coupled Two-way Clustering
Coupled two-way clustering (CTWC), introduced by Getz, Levine and Domany [9], defines a
generic scheme for transforming a one-dimensional clustering algorithm into a biclustering algorithm.
The algorithm relies on having a one-dimensional (standard) clustering algorithm that can
discover significant (termed stable in [9]) clusters. Given such an algorithm, the coupled two-way
clustering procedure will recursively apply the one-dimensional algorithm to submatrices, aiming
to find subsets of genes giving rise to significant clusters of conditions and subsets of conditions
giving rise to significant gene clusters. The submatrices defined by such pairings are called stable
submatrices and correspond to biclusters. The algorithm, which is shown in Figure 3, operates on
a set of gene subsets and a set of condition subsets . Initially and . The
5
algorithm then iteratively selects a gene subset and a condition subset and applies
the one dimensional clustering algorithm twice, to cluster and on the submatrix . If
stable clusters are detected, their gene/condition subsets are added to the respective sets , . The
process is repeated until no new stable clusters can be found. The implementation makes sure that
each pair of subsets is not encountered more than once.
Note that the procedure avoids the consideration of all rows and column subsets, by starting
from an established row subset when forming subclusters of established column subsets, and vice
versa. The success of the coupled two-way clustering strategy depends on the performance of the
given one-dimensional clustering algorithm. We note that many popular clustering algorithms(e.g.
K-means, Hierarchical, SOM) cannot be plugged ”as is” into the coupled two-way machinery, as
they do not readily distinguish significant clusters from non-significant clusters or make a-priori
assumption on the number of clusters. Getz et al. have reported good results using the SPC
hierarchical clustering algorithm [10]. The results of the algorithm can be viewed in a hierarchical
form: each stable gene (condition) cluster is generated given a condition (resp. gene) subset. This
hierarchical relation is important when trying to understand the context of joint genes or conditions
behavior. For example, when analyzing clinical data, Getz et al. have focused on gene subsets
giving rise to stable tissue clusters that are correlative to known clinical attributes. Such gene sets
may have an important biological role in the disease under study.
The CTWC algorithm has been applied to a variety of clinical data sets (see, e.g., [17]),