25-09-2014, 11:55 AM
link mining
INTRODUCTION
Link mining is a fairly new research area that lies at the intersection of link analysis, hypertext and web mining, relational learning and inductive logic programming, and graph mining. Perhaps more importantly, it also provides an essential set of techniques for building useful data mining applications in a wide variety of real domains, especially those involving complex event detection from highly structured data. This article reviews progress in developing link mining techniques that can meet some of the requirements of such applications and outlines needs that are not yet met and are therefore both open research challenges and potential opportunities for new applications. The central claim of this article is that link mining presents both challenges and opportunities. It presents challenges because data mining techniques for non-linked data are inadequate for similar problems with linked data, and because the combinatorics of linked domains typically far exceed those of domains characterized by non-linked data. It presents opportunities because the structure of linked data provides both constraints on what can be inferred and more information for inference than can be obtained from non-linked data.
Link mining refers to data mining techniques that explicitly consider links when building predictive or descriptive models of linked data. Commonly addressed link mining tasks include object ranking, group detection, classification, link prediction and sub-graph discovery. While network analysis has been studied in depth in particular areas such as social network analysis, hypertext mining and web mining, only recently has there been a cross-fertilization of ideas among these communities. A key emerging challenge for data mining is the problem of mining richly structured, heterogeneous datasets. These kinds of datasets are best described as networks or graphs. The domains often consist of a variety of object types, and the objects can be linked in a variety of ways. Thus, the graph may have different node and edge (or hyper-edge) types. Naively applying traditional statistical inference procedures, which assume that instances are independent, can lead to inappropriate conclusions about the data. Care must be taken that potential correlations due to links are handled appropriately. In fact, object linkage is knowledge that should be exploited.
LITERATURE SURVEY
Link mining is situated at the intersection of graph theory, machine learning, and web mining. This research is potentially useful in a wide range of application areas, including bio-informatics, bibliographic analysis, financial analysis, national security, social network analysis, and internet search, to name a few. In recent years, there has been a growing interest in learning from structured, real-world data. This type of data can be described by a graph in which the nodes represent objects and the edges represent relationships between objects. A closely related line of work is hypertext and web page classification, which has its roots in the information retrieval (IR) community. Hypertext collections have a rich structure that should be exploited to improve classification accuracy. In addition to words, hypertext has both incoming and outgoing links, and traditional IR document models do not make full use of this link structure. In the web page classification problem, the web is viewed as a large directed graph. The objective is to label the category of a web page based on features of the current page and features of its linked neighbors. With the use of linkage information, such as anchor text and the text surrounding each incoming link, better categorization results can be achieved. Chakrabarti proposed a probabilistic model that uses both text and linkage information to classify a database of patents and a small web collection.
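The idea of labeling a page from its own features plus the features of its linked neighbors can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the probabilistic model attributed to Chakrabarti above: the toy graph, the keyword lists standing in for a text classifier, and the mixing weight alpha are all made up for the example.

```python
from collections import Counter

# Hypothetical toy web graph: page -> pages it links to.
out_links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
# Assumed known categories for some pages.
known_labels = {"a": "sports", "b": "sports", "c": "sports"}
# Assumed words on the unlabeled page.
page_words = {"d": ["match", "score", "election"]}
# Hand-made keyword lists standing in for a real text classifier.
keywords = {"sports": {"match", "score"}, "politics": {"election", "vote"}}

def local_scores(words):
    """Fraction of the page's words that match each category's keywords."""
    total = max(len(words), 1)
    return {cat: sum(w in kws for w in words) / total
            for cat, kws in keywords.items()}

def neighbor_scores(page):
    """Distribution of known labels among pages linked to or from `page`."""
    neighbors = set(out_links.get(page, []))
    neighbors |= {src for src, dsts in out_links.items() if page in dsts}
    counts = Counter(known_labels[n] for n in neighbors if n in known_labels)
    total = sum(counts.values())
    return {cat: (counts[cat] / total if total else 0.0) for cat in keywords}

def classify(page, alpha=0.5):
    """Mix text evidence and link evidence; alpha weights the text part."""
    local = local_scores(page_words[page])
    linked = neighbor_scores(page)
    combined = {cat: alpha * local[cat] + (1 - alpha) * linked[cat]
                for cat in keywords}
    return max(combined, key=combined.get), combined

label, scores = classify("d")
print(label, scores)
```

Setting alpha to 1.0 reduces the sketch to a text-only classifier, which makes it easy to compare results with and without link information on the same data.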
Traditional data mining approaches attempt to find patterns in a data set characterized by a collection of independent instances of a single relation. This is consistent with the classical statistical inference problem of trying to identify a model given a random sample from an underlying distribution. A key challenge for machine learning is the problem of mining more richly structured data sets in a way that leverages the linkages between records. In this paradigm, which more accurately resembles real-world data, instances in the data set are relational: different samples are related to each other explicitly, as typified by friendship relationships in a social network or by hyperlinks on the web. However, in most large data sets, relationships also exist that are not explicitly annotated.
LINK MINING TASKS
As mentioned in the introduction, link mining puts a new twist on some classic data mining tasks, and also poses new problems. Here we provide a (non-exhaustive) list of possible tasks. We illustrate each of them using the following domains as motivations.
Web page collection: In a web page collection, the objects are web pages, and the links are in-links, out-links and co-citation links (two pages that are both linked to by the same page). Attributes include HTML tags, word appearances and anchor text.
Bibliographic domain: In a bibliographic domain, the objects include papers, authors, institutions, journals and conferences. Links include paper citations, authorship and co-authorship, affiliations, and the appears-in relation between a paper and a journal or conference.
Epidemiological studies: In an epidemiology domain, the objects include patients, people they have come in contact with, and disease strains. Links represent contacts between people and which disease strain a person is infected with.
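The domains above share one representation: typed objects connected by typed links. As a minimal sketch, assuming a simple list of (source, link type, target) triples and made-up page names, the co-citation links defined for the web page domain can be derived from out-links as follows.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical typed edge list: (source, link_type, target).
edges = [
    ("page1", "links_to", "page2"),
    ("page1", "links_to", "page3"),
    ("page4", "links_to", "page2"),
    ("page4", "links_to", "page3"),
    ("page4", "links_to", "page5"),
]

def co_citations(edge_list, link_type="links_to"):
    """Count, for each pair of targets, how many sources link to both."""
    targets_by_source = defaultdict(set)
    for src, kind, dst in edge_list:
        if kind == link_type:
            targets_by_source[src].add(dst)
    counts = defaultdict(int)
    for targets in targets_by_source.values():
        # Every pair of pages cited by the same source is co-cited once.
        for a, b in combinations(sorted(targets), 2):
            counts[(a, b)] += 1
    return dict(counts)

pairs = co_citations(edges)
print(pairs)
```

The same triple format would also hold authorship or contact links by changing the link type, which is one reason an edge list with explicit types is a convenient starting point for heterogeneous data.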
Link-based Cluster Analysis
The goal in cluster analysis is to find naturally occurring subclasses. This is done by segmenting the data into groups, where objects in a group are similar to each other and very dissimilar from objects in other groups. Unlike classification, clustering is unsupervised and can be applied to discover hidden patterns in data. This makes it an ideal technique for applications such as scientific data exploration, information retrieval, computational biology, web log analysis, criminal analysis and many others. There has been extensive research on clustering in areas such as pattern recognition, statistics and machine learning. Hierarchical agglomerative clustering (HAC) and k-means are two of the most common clustering algorithms. For linked data, however, there has been surprisingly little work on this type of link mining. Subdue is the earliest line of research in this area.
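To make the idea of clustering on links concrete, here is a minimal sketch of hierarchical agglomerative clustering in which similarity comes only from link structure. It is not Subdue; the toy graph, the Jaccard similarity on neighborhoods, and the average-linkage rule are assumptions chosen for illustration.

```python
from itertools import combinations

# Hypothetical undirected graph: two triangles joined by the edge c-d.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}

def jaccard(u, v):
    """Overlap of closed neighborhoods (each node counts as its own neighbor)."""
    nu, nv = graph[u] | {u}, graph[v] | {v}
    return len(nu & nv) / len(nu | nv)

def cluster_similarity(c1, c2):
    """Average-linkage similarity between two clusters."""
    return sum(jaccard(u, v) for u in c1 for v in c2) / (len(c1) * len(c2))

def agglomerate(nodes, n_clusters):
    """Repeatedly merge the two most similar clusters."""
    clusters = [frozenset([n]) for n in nodes]
    while len(clusters) > n_clusters:
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: cluster_similarity(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters

result = agglomerate(sorted(graph), 2)
print(sorted(sorted(c) for c in result))
```

No object attributes are used here, so the two groups are recovered purely from who is linked to whom. The pairwise search is quadratic in the number of clusters at every step, so this form is only suitable for small graphs.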
Feature Construction
One challenge is feature construction in the multi-relational setting. The attributes of an object provide a basic description of the object, and traditional classification algorithms are based on these types of object features. In a link-based approach, it may also make sense to use attributes of linked objects. Further, if the links themselves have attributes, these may also be used. This is the idea behind propositionalization. However, as others have noted, simply flattening the relational neighborhood around an object can be problematic. Several authors have noted that in hypertext domains, simply including words from neighboring pages degrades classification performance. A further issue is how to deal appropriately with relationships that are not one-to-one. In this case, it may be appropriate to compute aggregate features over the set of related objects.
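Aggregation over a one-to-many relationship can be sketched as follows. The example assumes a small, made-up bibliographic data set and picks count, mode and mean as the aggregates; other choices are equally possible.

```python
from collections import Counter
from statistics import mean

# Hypothetical bibliographic data: each paper has a topic and a year.
papers = {
    "p1": {"topic": "databases", "year": 2001},
    "p2": {"topic": "learning", "year": 2003},
    "p3": {"topic": "learning", "year": 2005},
    "p4": {"topic": "graphs", "year": 2006},
}
# One-to-many citation links: paper -> papers it cites.
cites = {"p4": ["p1", "p2", "p3"], "p3": ["p2"]}

def aggregate_features(paper):
    """Summarize the set of cited papers as a fixed-length feature record."""
    cited = [papers[p] for p in cites.get(paper, [])]
    if not cited:
        return {"n_cited": 0, "mode_topic": None, "mean_year": None}
    topics = Counter(c["topic"] for c in cited)
    return {
        "n_cited": len(cited),                       # count aggregate
        "mode_topic": topics.most_common(1)[0][0],   # most frequent topic
        "mean_year": mean(c["year"] for c in cited), # numeric aggregate
    }

features = aggregate_features("p4")
print(features)
```

Whatever the number of cited papers, the result has the same three fields, so it can be appended to the paper's own attributes and passed to a traditional classifier that expects fixed-length input.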
Link Prediction
Another challenge is link discovery, or predicting the existence of links between objects. A range of the tasks that we have described fall under the category of link prediction. A difficulty here is that the prior probability of a link among any set of individuals is typically quite low. While we have had some success with simple probabilistic models of link existence, we believe this is an area where there is much research to be done.
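The sketch below illustrates both points on a made-up co-authorship graph: it computes the prior probability of a link as the fraction of node pairs that are linked, and ranks unlinked pairs by their number of common neighbors. Common-neighbor scoring is a simple stand-in chosen for illustration, not the probabilistic models mentioned above, and in a toy graph this small the prior is far higher than in real networks.

```python
from itertools import combinations

# Hypothetical undirected co-authorship graph.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat", "eve"},
    "eve": {"dan"},
    "fay": set(),
}

def link_prior(g):
    """Fraction of all possible node pairs that are actually linked."""
    n = len(g)
    n_edges = sum(len(nbrs) for nbrs in g.values()) // 2
    return n_edges / (n * (n - 1) // 2)

def rank_candidates(g):
    """Score each unlinked pair by its number of common neighbors."""
    scores = {
        (u, v): len(g[u] & g[v])
        for u, v in combinations(sorted(g), 2)
        if v not in g[u]
    }
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

prior = link_prior(graph)
ranking = rank_candidates(graph)
print(f"prior probability of a link: {prior:.2f}")
print(ranking[:3])
```

The number of candidate pairs grows quadratically with the number of objects while the number of true links usually grows much more slowly, which is why the prior shrinks as networks get larger.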
Object Identity
A final challenge is identity detection. How do we infer aliases, i.e., determine that two objects refer to the same individual? As mentioned earlier, some work has been done in this area by several research communities, but there is a great deal of room for additional work. Another aspect of this challenge is whether our statistical models refer explicitly to individuals, or only to classes or categories of objects. In many cases, we’d like to model that a connection to a particular object or individual is highly predictive; on the other hand, if we’d like to have our models generalize and be applicable to new, unseen objects, we also have to be able to model with and reason about generic collections of objects.
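A minimal sketch of alias detection might combine attribute similarity with link similarity. The records, the equal weighting, and the use of string similarity from Python's difflib are assumptions for illustration only; real identity detection typically needs more careful models.

```python
from difflib import SequenceMatcher

# Hypothetical author records: a name plus the set of co-authors.
records = {
    "r1": {"name": "J. Smith", "coauthors": {"lee", "kumar", "chen"}},
    "r2": {"name": "John Smith", "coauthors": {"lee", "chen", "olsen"}},
    "r3": {"name": "J. Smythe", "coauthors": {"garcia", "novak"}},
}

def name_similarity(a, b):
    """String similarity in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link_similarity(a, b):
    """Jaccard overlap of two co-author sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def same_entity_score(r, s, w_name=0.5):
    """Weighted mix of name evidence and link evidence."""
    x, y = records[r], records[s]
    return (w_name * name_similarity(x["name"], y["name"])
            + (1 - w_name) * link_similarity(x["coauthors"], y["coauthors"]))

print("r1 vs r2:", round(same_entity_score("r1", "r2"), 3))
print("r1 vs r3:", round(same_entity_score("r1", "r3"), 3))
```

In this toy data the two similar-looking names are hard to separate on spelling alone, while the shared co-authors favor r1 and r2 being the same person. This reflects the point made earlier that object linkage is knowledge that should be exploited.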
CONCLUSION
There has been a growing interest in learning from linked data, which are described by a graph in which the nodes are objects and the edges or hyper-edges are links (or relations) between objects. Tasks include hypertext classification, segmentation, information extraction, searching and information retrieval, discovery of authorities and link discovery. Domains include the world-wide web, bibliographic citations, criminology and bio-informatics, to name just a few. Learning tasks range from predictive tasks, such as classification, to descriptive tasks, such as the discovery of frequently occurring sub-patterns. We have given a brief summary of some of the work in this area and of some of the challenges in link mining. Link mining is a promising new area where relational learning meets statistical modeling; we believe many new and interesting machine learning research problems lie at the intersection, and it is a research area “whose time has come”. In recent years, significant progress has been made in defining and addressing the core link mining challenges, yet much work remains to be done in refining and combining the various approaches and solutions. The most important conclusion of this article is that while there are many link mining techniques that work well for individual link mining tasks, there is not yet a comprehensive framework that can support a combination of link mining tasks as needed for many real applications. The construction of successful and useful link mining applications is still very much an ad hoc enterprise. Designing an effective architecture that supports all necessary functions of an integrated application, and providing a way to use link mining for the Semantic Web, are also key to success. Link mining tasks and challenges provide interesting insights and catalyze new research directions.