Seminar Topics & Project Ideas On Computer Science Electronics Electrical Mechanical Engineering Civil MBA Medicine Nursing Science Physics Mathematics Chemistry ppt pdf doc presentation downloads and Abstract


What is Web Mining?

Web mining is the discovery of useful information from the World-Wide Web and its usage patterns.

Applications

Web search e.g., Google, Yahoo,…
Vertical Search e.g., FatLens, Become,…
Recommendations e.g., Amazon.com
Advertising e.g., Google, Yahoo
Web site design e.g., landing page optimization

How does it differ from “classical” Data Mining?

The web is not a relation
Textual information and linkage structure
Usage data is huge and growing rapidly
Google’s usage logs are bigger than its web crawl
Data generated per day is comparable to the largest conventional data warehouses
Ability to react in real-time to usage patterns
No human in the loop

The World-Wide Web

Huge
Distributed content creation, linking (no coordination)
Structured databases, unstructured text, semistructured data
Content includes truth, lies, obsolete information, contradictions, …

Our modern-day Library of Alexandria

Size of the Web

Number of pages
Technically, infinite
Because of dynamically generated content
Lots of duplication (30-40%)
Best estimate of “unique” static HTML pages comes from search engine claims
Google = 8 billion, Yahoo = 20 billion
Lots of marketing hype
Number of unique web sites
Netcraft survey says 72 million sites

The web as a graph

Pages = nodes, hyperlinks = edges
Ignore content
Directed graph
High linkage
8-10 links/page on average
Power-law degree distribution
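
To make the graph view concrete, here is a minimal Python sketch that treats pages as nodes and hyperlinks as directed edges, ignores page content, and tallies in-degrees, out-degrees, and how many pages share each in-degree. The toy link table and the function names are assumptions made for illustration only; a real crawl would be far larger, and it is on data of that scale that a power-law shape would show up in the histogram.

```python
from collections import Counter

# Hypothetical toy web: each page maps to the pages it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def degree_counts(links):
    """Return (in_degree, out_degree) dicts for a directed link graph."""
    out_degree = {page: len(targets) for page, targets in links.items()}
    in_degree = Counter()
    for targets in links.values():
        in_degree.update(targets)
    # Pages nobody links to still exist and have in-degree 0.
    for page in links:
        in_degree.setdefault(page, 0)
    # Pages that only appear as link targets have out-degree 0.
    for page in in_degree:
        out_degree.setdefault(page, 0)
    return dict(in_degree), out_degree

def degree_histogram(degrees):
    """Map each degree value to the number of pages that have it."""
    return dict(Counter(degrees.values()))

in_deg, out_deg = degree_counts(links)
in_hist = degree_histogram(in_deg)
print(in_deg)
print(in_hist)
```

On a large crawl, plotting the histogram on log-log axes is the usual first check for a heavy-tailed degree distribution.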

Power-laws galore

In-degrees
Out-degrees
Number of pages per site
Number of visitors

Let’s take a closer look at structure

Broder et al. (2000) studied a crawl of 200M pages and other smaller crawls
Bow-tie structure
Not a “small world”
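
The bow-tie picture can be sketched in a few lines of Python. The example below assumes the usual reading of the bow-tie: a strongly connected core, an "in" set of pages that can reach the core, an "out" set of pages reachable from the core, and everything else. The toy graph, the function names, and the choice of a seed page known to lie in the core are all assumptions for illustration, not the procedure used in the original study.

```python
from collections import deque

def reachable(start, adjacency):
    """Breadth-first search: all nodes reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bow_tie(links, seed):
    """Split a directed graph into bow-tie parts around a seed in the core."""
    nodes = set(links)
    reverse = {}
    for src, targets in links.items():
        for dst in targets:
            nodes.add(dst)
            reverse.setdefault(dst, []).append(src)
    forward = reachable(seed, links)     # pages the seed can reach
    backward = reachable(seed, reverse)  # pages that can reach the seed
    core = forward & backward            # mutually reachable with the seed
    return {
        "core": core,
        "in": backward - core,
        "out": forward - core,
        "other": nodes - forward - backward,
    }

# Hypothetical toy web with one page on each side of a three-page core.
toy_links = {
    "in1": ["c1"],
    "c1": ["c2"],
    "c2": ["c3"],
    "c3": ["c1", "out1"],
    "out1": [],
    "island": ["island2"],
}
parts = bow_tie(toy_links, seed="c1")
print(parts)
```

The "other" bucket here lumps together everything outside the three main parts; a fuller analysis would split it further.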


mkaasees