23-09-2016, 10:40 AM
What is Web Mining?
Discovering useful information from the World-Wide Web and its usage patterns
Applications
Web search, e.g., Google, Yahoo, …
Vertical search, e.g., FatLens, Become, …
Recommendations, e.g., Amazon.com
Advertising, e.g., Google, Yahoo
Web site design, e.g., landing page optimization
How does it differ from “classical” Data Mining?
The web is not a relation
Textual information and linkage structure
Usage data is huge and growing rapidly
Google’s usage logs are bigger than its web crawl
Data generated per day is comparable to the largest conventional data warehouses
Ability to react in real-time to usage patterns
No human in the loop
The World-Wide Web
Huge
Distributed content creation, linking (no coordination)
Structured databases, unstructured text, semistructured data
Content includes truth, lies, obsolete information, contradictions, …
Our modern-day Library of Alexandria
Size of the Web
Number of pages
Technically, infinite
Because of dynamically generated content
Lots of duplication (30-40%)
Best estimate of “unique” static HTML pages comes from search engine claims
Google = 8 billion, Yahoo = 20 billion
Lots of marketing hype
Number of unique web sites
Netcraft survey says 72 million sites
The web as a graph
Pages = nodes, hyperlinks = edges
Ignore content
Directed graph
High linkage
8-10 links/page on average
Power-law degree distribution
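The graph view above can be sketched in a few lines. This is only an illustrative example, not anything from the slides: the `links` dictionary is a made-up toy web, and `degree_stats` is a hypothetical helper name. It treats pages as nodes and hyperlinks as directed edges, ignores page content, and reports in-degree, out-degree, and average links per page.

```python
from collections import Counter

# Hypothetical toy web: page -> list of pages it links to (directed edges).
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def degree_stats(links):
    """Return (in_degree, out_degree, average out-links per page)."""
    # Out-degree: number of hyperlinks leaving each page.
    out_degree = {page: len(targets) for page, targets in links.items()}
    # In-degree: number of hyperlinks pointing at each page.
    in_degree = Counter()
    for targets in links.values():
        in_degree.update(targets)
    # Pages that appear only as link targets still count as nodes.
    pages = set(links) | set(in_degree)
    avg_links = sum(out_degree.values()) / len(pages)
    return dict(in_degree), out_degree, avg_links

in_deg, out_deg, avg = degree_stats(links)
print(in_deg, out_deg, avg)
```

On a real crawl the same counts would feed a degree histogram, which is where the power-law shape shows up.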
Power-laws galore
In-degrees
Out-degrees
Number of pages per site
Number of visitors
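To make "power law" concrete: a quantity x follows a power law when its probability density is proportional to x^(-alpha) above some minimum value. The sketch below is an assumption-laden illustration, not part of the slides. It draws synthetic samples with an arbitrarily chosen exponent (2.1, picked only for the demo) and recovers the exponent with the standard maximum-likelihood estimator for a continuous power law. The names `sample_power_law` and `estimate_exponent` are hypothetical.

```python
import math
import random

def sample_power_law(n, alpha, x_min=1.0, seed=0):
    """Draw n samples with density proportional to x**(-alpha) for x >= x_min."""
    rng = random.Random(seed)
    # Inverse-transform sampling; 1 - random() lies in (0, 1], so no division by zero.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]

def estimate_exponent(values, x_min=1.0):
    """Maximum-likelihood estimate: alpha = 1 + n / sum(ln(x_i / x_min))."""
    tail = [v for v in values if v >= x_min]
    return 1.0 + len(tail) / sum(math.log(v / x_min) for v in tail)

# Synthetic data only; 2.1 is an arbitrary exponent chosen for the demo.
samples = sample_power_law(20000, alpha=2.1)
alpha_hat = estimate_exponent(samples)
print(round(alpha_hat, 2))
```

Real in-degree or visitor counts are discrete and noisy, so this continuous estimator is only a first approximation there.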
Let’s take a closer look at structure
Broder et al. (2000) studied a crawl of 200M pages and other smaller crawls
Bow-tie structure
Not a “small world”
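The bow-tie picture can be sketched on a toy graph. The code below is a minimal illustration under stated assumptions, not the procedure used by Broder et al.: the graph is invented, `reachable` and `bow_tie` are hypothetical helper names, and the strongly connected core is found by brute-force reachability, which is fine for small graphs but not for a 200M-page crawl. The core is the largest strongly connected component, IN is the set of pages that can reach the core, OUT is the set of pages reachable from the core, and everything else (tendrils, disconnected pieces) is lumped into `other`.

```python
from collections import deque

def reachable(start_nodes, adj):
    """Breadth-first search: all nodes reachable from start_nodes via adj."""
    seen = set(start_nodes)
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bow_tie(links):
    """Split a directed graph into (core, in_part, out_part, other)."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    # Reverse graph: follow hyperlinks backwards.
    reverse = {}
    for src, targets in links.items():
        for dst in targets:
            reverse.setdefault(dst, []).append(src)
    # A node's strongly connected component is the intersection of what it
    # can reach and what can reach it; keep the largest one as the core.
    core, assigned = set(), set()
    for node in sorted(nodes):
        if node in assigned:
            continue
        scc = reachable([node], links) & reachable([node], reverse)
        assigned |= scc
        if len(scc) > len(core):
            core = scc
    in_part = reachable(core, reverse) - core   # can reach the core
    out_part = reachable(core, links) - core    # reachable from the core
    other = nodes - core - in_part - out_part   # tendrils, disconnected pieces
    return core, in_part, out_part, other

# Hypothetical toy web: a -> b -> c -> a is the core; "i" links in, "o" is linked out.
toy = {"a": ["b"], "b": ["c"], "c": ["a", "o"], "i": ["a", "t"], "x": ["y"]}
core, in_part, out_part, other = bow_tie(toy)
print(core, in_part, out_part, other)
```

In the toy graph there is no directed path from "o" back to "i", which is the sense in which a bow-tie web is not a "small world": many ordered pairs of pages have no connecting path at all.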