01-01-2013, 11:59 AM
Web Data Mining
Web data mining is the use of data mining techniques to automatically discover interesting and potentially useful information from Web documents and services.
Web mining may be divided into three categories:
1. Web content mining
2. Web structure mining
3. Web usage mining
Web details
More than 20 billion pages in 2008
Many more documents in databases accessible from the Web
More than 4 million servers
A total of perhaps 100 terabytes
More than a million pages are added daily
Several hundred gigabytes change every month
Hyperlinks are created for navigation, endorsement, citation, criticism or plain whim
Graph terminology
Web is a graph – vertices and edges (V,E)
Directed graph – directed edges (p,q)
Undirected graph - undirected edges (p,q)
Strongly connected component - a set of nodes such that for every ordered pair (u,v) in the set there is a directed path from u to v
Breadth first search - layer 1 consists of all nodes pointed to by the root; layer k consists of all nodes, not already in an earlier layer, that are pointed to by nodes in layer k-1
Diameter of a graph - maximum over all ordered pairs (u,v) of the length of the shortest path from u to v
Average distance of the graph - average over all ordered pairs (u,v) for which a path exists of the length of the shortest path from u to v
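The graph terms above can be tried out on a small example. The following is only a minimal sketch: the four-page graph toy_web and the function names are made up for illustration, and every node is assumed to appear as a key of the adjacency dictionary.

```python
from collections import deque

def bfs_layers(graph, root):
    """Return {node: layer}; the layer is the shortest-path distance from root."""
    layer = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in layer:            # first visit gives the shortest distance
                layer[v] = layer[u] + 1
                queue.append(v)
    return layer

def distance_stats(graph):
    """Diameter and average distance over ordered pairs (u, v), u != v, joined by a path."""
    distances = []
    for u in graph:
        for v, d in bfs_layers(graph, u).items():
            if v != u:
                distances.append(d)
    return max(distances), sum(distances) / len(distances)

def is_strongly_connected(graph):
    """True if every node can reach every other node along directed edges."""
    return all(len(bfs_layers(graph, u)) == len(graph) for u in graph)

# Hypothetical site: A -> B -> C -> A forms a cycle, and A and D link to each other.
toy_web = {"A": ["B", "D"], "B": ["C"], "C": ["A"], "D": ["A"]}

print(bfs_layers(toy_web, "A"))
print(distance_stats(toy_web))
print(is_strongly_connected(toy_web))
```

Running one breadth first search per node is fine for a toy graph but far too slow for a graph the size of the Web, where such statistics would normally be estimated from a sample of start nodes.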
Citations
Lotka’s Inverse-Square Law - the number of authors publishing n papers is about 1/n^2 of the number publishing only one.
About 60% of all authors make a single contribution.
Less than 1% publish 10 or more papers.
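The 60% figure can be checked against the inverse-square law with a little arithmetic. This is a rough sketch that assumes a pure 1/n^2 law with no upper limit on n; the function name lotka_share and the cut-off max_n are illustrative choices, not part of the law.

```python
import math

def lotka_share(n, max_n=100_000):
    """Share of all authors with exactly n papers under a pure 1/n^2 law."""
    # Normalising constant: 1 + 1/4 + 1/9 + ... which approaches pi^2 / 6.
    total = sum(1.0 / k**2 for k in range(1, max_n + 1))
    return (1.0 / n**2) / total

single = lotka_share(1)
ten = lotka_share(10)

print(f"authors with exactly 1 paper:   {single:.3f}")
print(f"authors with exactly 10 papers: {ten:.4f}")
print(f"limit 6 / pi^2:                 {6 / math.pi**2:.3f}")
```

The share of single-paper authors comes out at roughly 0.61, in line with the 60% quoted above.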
Most web pages are linked only to one other page (many not linked to any). Number of pages with multiple in-links declines quickly.
Rich get richer concept!
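One common way to illustrate the rich get richer idea is a small simulation in which each new page links to one existing page, chosen with probability proportional to that page's current number of in-links plus one. This growth rule is an assumption made for illustration only (the slides do not give a model), and grow_web is a made-up name.

```python
import random
from collections import Counter

def grow_web(num_pages, seed=0):
    """Grow a toy Web one page at a time; return {page: number of in-links}."""
    rng = random.Random(seed)
    in_links = Counter({0: 0})
    # Page p appears in this list (in_links[p] + 1) times, so a uniform
    # choice from the list favours pages that already have many in-links.
    targets = [0]
    for new_page in range(1, num_pages):
        target = rng.choice(targets)
        in_links[target] += 1
        targets.append(target)       # the target just became more popular
        in_links[new_page] = 0
        targets.append(new_page)     # the new page can now be linked to
    return in_links

in_links = grow_web(1000)
histogram = Counter(in_links.values())
for degree in sorted(histogram)[:5]:
    print(degree, "in-links:", histogram[degree], "pages")
print("largest in-link count:", max(in_links.values()))
```

In runs of this sketch most pages end up with no in-links or a single one, while a handful of early pages collect many, which is the quick decline described above.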
Web graph structure
Tendrils – cannot reach SCC and cannot be reached by it - about 20%
Unconnected – about 10%
The Web is hierarchical in nature.
The Web has a strong locality feature.
Almost two-thirds of all links are to sites within the enterprise domain.
Only one-third of the links are external.
A higher percentage of external links are broken.
The distance between local links tends to be quite small.
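The internal versus external split can be measured for a single page as follows. This is a sketch under a simplifying assumption: a link counts as internal only when its host name matches the page's host name exactly, which is narrower than "within the enterprise domain". The URLs are placeholders.

```python
from urllib.parse import urljoin, urlparse

def link_locality(page_url, hrefs):
    """Return (internal, external) link counts for one page."""
    page_host = urlparse(page_url).hostname
    internal = external = 0
    for href in hrefs:
        absolute = urljoin(page_url, href)   # resolve relative links
        host = urlparse(absolute).hostname
        if host is None:                     # e.g. mailto: links have no host
            continue
        if host == page_host:
            internal += 1
        else:
            external += 1
    return internal, external

# Hypothetical page and the href values found on it.
page = "http://www.example.edu/dept/index.html"
hrefs = [
    "staff.html",
    "/courses/",
    "http://www.example.edu/library",
    "http://www.example.org/",
    "mailto:someone@example.edu",
]
internal, external = link_locality(page, hrefs)
print(internal, "internal,", external, "external")
```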
Web Content mining
Discovering useful information from contents of Web pages.
Web content is very rich, consisting of text, images, audio, video and metadata as well as hyperlinks.
The data may be unstructured (free text), structured (data from a database) or semi-structured (HTML), although much of the Web is unstructured.
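As a small illustration of working with semi-structured content, the sketch below pulls the free text and the hyperlinks out of an HTML fragment using Python's standard html.parser module. The class name PageContent and the sample markup are made up for the example, and real pages would need more care (scripts, styles, broken markup).

```python
from html.parser import HTMLParser

class PageContent(HTMLParser):
    """Collect visible text fragments and link targets from an HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Keep only fragments that contain something besides whitespace.
        if data.strip():
            self.text_parts.append(data.strip())

# Hypothetical page fragment.
sample = (
    "<html><body><h1>Web mining</h1>"
    '<p>See <a href="usage.html">usage mining</a>.</p>'
    "</body></html>"
)

parser = PageContent()
parser.feed(sample)
parser.close()

print(parser.text_parts)
print(parser.links)
```

The text fragments would feed content mining, while the collected links are the raw material for structure mining.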