
Introduction to Web Mining



What is Web Mining?


Web mining is the discovery of useful information from the World-Wide Web and its usage patterns.

Applications:
Web search, e.g., Google, Yahoo, …
Vertical search, e.g., FatLens, Become, …
Recommendations, e.g., Amazon.com
Advertising, e.g., Google, Yahoo
Web site design, e.g., landing page optimization


How does it differ from “classical” Data Mining?


The web is not a relation: it combines textual information with linkage structure
Usage data is huge and growing rapidly: Google’s usage logs are bigger than their web crawl, and the data generated per day is comparable to the largest conventional data warehouses
Ability to react in real time to usage patterns, with no human in the loop

The World-Wide Web


Huge
Distributed content creation, linking (no coordination)
Structured databases, unstructured text, and semistructured data
Content includes truth, lies, obsolete information, contradictions, …

Our modern-day Library of Alexandria




Size of the Web

Number of pages
Technically infinite, because of dynamically generated content
Lots of duplication (30-40%)
Best estimate of “unique” static HTML pages comes from search engine claims: Google = 8 billion, Yahoo = 20 billion (these figures carry lots of marketing hype)
Number of unique web sites
Netcraft survey says 72 million sites


The web as a graph


Pages = nodes, hyperlinks = edges (page content is ignored)
Directed graph
High linkage: 8-10 links/page on average
Power-law degree distribution
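
The graph view above can be made concrete with a small sketch. It assumes a made-up four-page site stored as an adjacency dictionary; the page names and the helper `degree_stats` are hypothetical and only illustrate how in-degree, out-degree, and average links per page fall out of a directed graph.

```python
# Hypothetical toy web: each page maps to the pages it links to.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}

def degree_stats(links):
    """Return (out_degree, in_degree, average out-links per page)."""
    # Collect every page, including ones that only appear as link targets.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    # Out-degree: number of distinct pages a page links to.
    out_degree = {page: len(set(links.get(page, []))) for page in nodes}

    # In-degree: number of distinct pages that link to a page.
    in_degree = {page: 0 for page in nodes}
    for page, targets in links.items():
        for target in set(targets):
            in_degree[target] += 1

    average = sum(out_degree.values()) / len(nodes)
    return out_degree, in_degree, average

out_degree, in_degree, avg_links = degree_stats(links)
print(out_degree, in_degree, avg_links)
```

Because edges are directed, a page can have a high in-degree and a low out-degree at the same time, as `c.html` does in this toy data.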


Power-laws galore


In-degrees
Out-degrees
Number of pages per site
Number of visitors
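
A power law means the number of items with value k is roughly proportional to k^(-alpha), which shows up as a straight line of slope -alpha on a log-log plot. The sketch below estimates that slope with an ordinary least-squares fit. The function name `loglog_slope` and the synthetic counts are assumptions for illustration, and a least-squares fit on a log-log histogram is only a rough estimator (maximum-likelihood fits are usually preferred for real data).

```python
import math

def loglog_slope(degree_counts):
    """Least-squares slope of log(count) against log(degree).

    degree_counts maps a degree k >= 1 to the number of pages with that degree.
    """
    points = [
        (math.log(k), math.log(n))
        for k, n in degree_counts.items()
        if k > 0 and n > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two (degree, count) points")

    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx

# Made-up counts that follow count = 10000 * k**-2 exactly.
synthetic = {k: 10000 * k ** -2.0 for k in (1, 2, 4, 8, 16)}
alpha = -loglog_slope(synthetic)
print(alpha)
```

On the synthetic data the recovered exponent is close to 2, since the counts were generated from an exact power law with that exponent.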


Let’s take a closer look at structure


Broder et al. (2000) studied a crawl of 200M pages and other smaller crawls
Bow-tie structure
Not a “small world”
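
The bow-tie picture splits the graph into a strongly connected core, an IN set of pages that can reach the core, an OUT set of pages reachable from the core, and everything else (tendrils and disconnected pages). A minimal sketch of that split follows. It assumes a seed page already known to lie in the core; the function names and the toy link data are hypothetical, and a real analysis would first find the largest strongly connected component.

```python
from collections import deque

def reachable(start, adjacency):
    """Breadth-first search: every node reachable from start, including start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bow_tie(links, seed):
    """Split pages into core / in / out / other relative to the seed's component."""
    nodes = set(links)
    reverse = {}  # same graph with every edge flipped
    for page, targets in links.items():
        for target in targets:
            nodes.add(target)
            reverse.setdefault(target, []).append(page)

    forward = reachable(seed, links)     # pages the seed can reach
    backward = reachable(seed, reverse)  # pages that can reach the seed
    core = forward & backward            # mutually reachable with the seed
    return {
        "core": core,
        "in": backward - core,
        "out": forward - core,
        "other": nodes - forward - backward,
    }

# Made-up example graph.
links = {
    "in1": ["core1"],
    "core1": ["core2"],
    "core2": ["core1", "out1"],
    "out1": [],
    "tendril": ["out1"],
}
parts = bow_tie(links, "core1")
print(parts)
```

Here `tendril` links into the OUT set without passing through the core, so it lands in the "other" group rather than in IN.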