03-07-2013, 04:43 PM
A LINK ANALYSIS EXTENSION OF CORRESPONDENCE ANALYSIS FOR MINING RELATIONAL DATABASES
A LINK ANALYSIS EXTENSION.doc (Size: 49.5 KB / Downloads: 22)
ABSTRACT
This work introduces a link analysis procedure for discovering relationships in a relational database or a graph, generalizing both simple and multiple correspondence analyses. It is based on a random walk model through the database defining a Markov chain having as many states as elements in the database. Suppose we are interested in analyzing the relationships between some elements (or records) contained in two different tables of the relational database. To this end, in a first step, a reduced, much smaller, Markov chain containing only the elements of interest and preserving the main characteristics of the initial chain is extracted by stochastic complementation. This reduced chain is then analyzed by projecting jointly the elements of interest in the diffusion map subspace and visualizing the results. This two-step procedure reduces to simple correspondence analysis when only two tables are defined and to multiple correspondence analysis when the database takes the form of a simple star-schema. On the other hand, a kernel version of the diffusion map distance, generalizing the basic diffusion map distance to directed graphs, is also introduced and the links with spectral clustering are discussed. Several data sets are analyzed by using the proposed methodology, showing the usefulness of the technique for extracting relationships in relational databases or graphs.
EXISTING SYSTEM
Some authors showed that applying a specific cut criteria to bipartite graphs leads to simple correspondence analysis. Notice that the second justification 1) leads to exactly the same mapping as the basic diffusion map while the third and fourth justifications, 2) and 3) lead to the same embedding space, but with a possibly different ordering and rescaling of the axis. More generally, these mappings are, of course, also related to graph embedding and nonlinear dimensionality reduction, which have been highly studied topics in recent years, especially in the manifold learning community, for recent surveys or developments.
PROPSED SYSTEM
This paper precisely proposes a link-analysis-based technique allowing discovering relationships existing between elements of a relational database or, more generally, a graph. More specifically, this work is based on a random walk through the database defining a Markov chain having as many states as elements in the database. Suppose, for instance, we are interested in analyzing the relationships between elements contained in two different tables of a relational database. To this end, a two-step procedure is developed. First, a much smaller, reduced, Markov chain, only containing the elements of interest typically the elements contained in the two tables and preserving the main characteristics of the initial chain, is extracted by stochastic complementation. An efficient algorithm for extracting the reduced Markov chain from the large, sparse, Markov chain representing the database is proposed. Then, the reduced chain is analyzed by, for instance, projecting the states in the subspace spanned by the right eigenvectors of the transition matrix; called the basic diffusion map in this paper), or by computing a kernel principal component analysis on a diffusion map kernel computed from the reduced graph and visualizing the results. Indeed, a valid graph kernel based on the diffusion map distance, extending the basic diffusion map to directed graphs, is introduced.
MODULES DESCRIPTION
A Kernel View of the Diffusion Map Distance
We now introduce1 a variant of the basic “diffusion map” model introduced by Nadler et al. and Pons and Latapy, which is still well-defined when the original graph is directed. In other words, we do not assume that the initial adjacency matrix A is symmetric in this section.
The Diffusion Map Distance
In our two-step procedure, a diffusion map projection, based on the so-called diffusion map distance, will be performed after stochastic complementation. Now, since the original definition of the diffusion map distance deals only with undirected, a periodic, Markov chains, it will first be assumed in that the reduced Markov chain, obtained after stochastic complementation, is indeed undirected, a periodic, and connected in which case the corresponding random walk defines an irreducible reversible Markov chain. Notice, that it is not required that the original adjacency matrix is irreducible and reversible; these assumptions are only required for the reduced adjacency matrix obtained after stochastic.
Analyzing the reduced markov chain with The basic diffusion map
Once a reduced Markov chain containing only the nodes of interest has been obtained, one may want to visualize the graph in a low-dimensional space preserving as accurately as possible the proximity between the nodes. This is the second step of our procedure. For this purpose, we propose to use the diffusion maps introduced. Interestingly enough, computing a basic diffusion map on the reduced Markov chain is equivalent to correspondence analysis in two special cases of interest: a bipartite graph and a star-schema database.
Simple Correspondence Analysis
As stated before, simple correspondence analysis aims to study the relationships between two random variables x1 and x2 (the features) having each mutually exclusive, categorical, outcomes, denoted as attributes. Suppose the variable x1 has n1 observed attributes and the variable x2 has n2 observed attributes, each attribute being a possible outcome value for the feature.
CONCLUSIONS AND FURTHER WORK
This work introduced a link-analysis-based technique allowing analyzing relationships existing in relational databases. The database is viewed as a graph, where the nodes correspond to the elements contained in the tables and the links correspond to the relations between the tables. A two-step procedure is defined for analyzing the relationships between elements of interest contained in a table, or a subset of tables. More precisely, this work 1) proposes to use stochastic complementation for extracting a sub graph containing the elements of interest from the original graph and 2) introduces a kernel-based extension of the basic diffusion map for displaying and analyzing the reduced sub graph. It is shown that the resulting method is closely related to correspondence analysis.