22-02-2013, 03:01 PM
Transactional Database Transformation and Its Application in Prioritizing Human Disease Genes
Transactional Database.pdf (Size: 1.69 MB / Downloads: 19)
Abstract
Binary (0,1) matrices, commonly known as transactional databases, can represent many application data, including genephenotype
data where “1” represents a confirmed gene-phenotype relation and “0” represents an unknown relation. It is natural to ask
what information is hidden behind these “0”s and “1”s. Unfortunately, recent matrix completion methods, though very effective in many
cases, are less likely to infer something interesting from these (0,1)-matrices. To answer this challenge, we propose INDEVI, a very
succinct and effective algorithm to perform independent-evidence-based transactional database transformation. Each entry of a (0,1)-
matrix is evaluated by “independent evidence” (maximal supporting patterns) extracted from the whole matrix for this entry. The value
of an entry, regardless of its value as 0 or 1, has completely no effect for its independent evidence.
INTRODUCTION
A key task in biomedicine in postgenomic era is to
understand the gene-phenotype relationships. In humans,
such knowledge leads to the discovery of genes
causing or relating to (disease) phenotypes. The genephenotype
relationships can be exactly described by a
binary matrix if we associate each phenotype with
confirmed causative genes. However, since our knowledge
on causative disease genes are still limited, this binary
matrix is far from complete. Recovering a matrix with partly
known entries is a general problem with growing interest.
Recent advances in matrix completion techniques [10], [31],
[15], [5], [27] provide very good solutions to many
applications. Thus, motivated by discovering unknown
gene-phenotype relationships, a natural question appears
here: is it possible to use current matrix completion
methods to recover such a binary matrix describing genephenotype
relationships? Unfortunately, we find that the
answer is no. The reason will be explained immediately in
the following section.
Related Work
Matrix Completion. The problem of matrix completion is
of general interest: given a sampling of m entries of a
matrix M, how to recover the complete matrix? Several
methods [10], [31], [15], [5], [27] have been proposed
recently to recover a low-rank matrix with some samplings.
It is even proven in [11] that the recovery is most likely to
be successful when the number of samplings is larger than
a threshold, which is a function of the matrix size and its
rank. However, this sample model does not apply to
transactional databases such as gene-phenotype data,
because “1” is the only nonzero value. Each entry is either
0, which means unsampled, or 1, which means sampled
and the result is positive. When there are enough sampled
entries being 1 only, there is little reason to believe an
unsampled entry to be a value other than 1 (or not close to
1 when floating-point operation applies). We tried Opt-
Space [31] on gene-phenotype data and the results (every
entry is getting close to 1 after some iterations) confirmed
our analysis.
Our Contributions
Motivated by the related work as discussed above, in this
work we aim at proposing a succinct computational method
to unbiasedly evaluate every entry in a (0,1)-matrix. Our
main contributions are
. We propose a novel concept to transform transactional
databases into matrices of features. It can be
regarded as an extension of the matrix completion
concept to (0,1)-matrices.
. Each original entry of the (0,1)-matrix, regardless of
its value as 0 or 1, has no effect on its independent
evidence. This makes our final results unbiased. In
addition, it is easy to obtain a complete and
unbiased fold-enrichment evaluation for human
gene-phenotype relationships, without the need to
do any extra leave-one-out validation (i.e., test a
known gene phenotype after removing such knowledge,
see [58]).
PROBLEM FORMULATION
Recently, by formulating the disease and gene relationships
as a bipartite graph, disease networks and the corresponding
gene networks have been derived [24], [38], [6]. This
also implies that by matching a group of related disease
phenotypes with a group of corresponding genes, we can
predict previously unknown gene and disease phenotype
relationships, and generate new hypotheses for experimental
and clinical study. Our problem formulation is motivated
by this notion.
INDEVI AND INDEVIRE ALGORITHMS
In this section, we mainly focus on solving the specific
problem proposed in Section 2. First, we show how to find
independent evidence Sði; jÞ and calculate Fði; jÞ for one
entry ði; jÞ in the (0,1)-Matrix. Then, we propose an
efficient algorithm INDEVI to calculate Fði; jÞ for all
entries in the (0,1)-Matrix. Since INDEVI calculates Fði; jÞ
for all entries without building Sði; jÞ for all entries, we
will discuss how to efficiently reconstruct Sði; jÞ, the
details of the independent evidence, either completely or
partly, for a desired entry ði; jÞ.
Rank Cutoff
Although every gene has a rank with respect to a given
phenotype, it remains a good question what percentage of
top ranked gene-to-phenotype relations are most significant
for prediction. To answer this question, we plot the fraction
of known gene-to-phenotype relations over the top ranked
gene-to-phenotype relations in Fig. 3, which shows the
aggregated results over all 5,560 phenotypes.
From Fig. 3, we can observe that at the beginning, the
percentage of known gene-to-phenotype relations increases
sharply when rank increases. However, the increase rate
changes significantly twice: one is around rank 200-400, and
the other is around rank 1,000. For the two rate changes, we
found that more than 70 percent of known gene-tophenotype
relations are ranked within top 16 percent
gene-to-phenotype relations, and more than 98 percent of
known gene-to-phenotype relations are ranked within top
60 percent gene-to-phenotype relations.
DISCUSSION AND CONCLUSION
In this paper, we present a novel transformation algorithm
for transactional databases. This method will have
impact on many biomedical data mining applications,
since many biomedical relationships can be formulated
using transactional databases and the lack of complete
knowledge in these relationships is a ubiquitous problem.
Using the independent-evidence-based transactional database
transformation approach, new hypotheses with
strong evidence can be generated and prioritized for
further experimental or clinical studies. We demonstrated
the effectiveness of this method using a relatively small
human gene-phenotype database to prioritize potential
clinically significant genes. Even though the genes in this
database are less than 10 percent of the known human
genome and many well-known disease genes are not
included (e.g., BRCA1), our algorithm was able to predict
missed relations for which many can be confirmed by
other sources in our case study.