07-09-2016, 04:30 PM
Background and Motivation
With the explosive growth of information sources available on the World Wide Web and the rapidly increasing adoption of Internet commerce, the Internet has evolved into a gold mine that contains or dynamically generates information that is beneficial to E-businesses. A web site is the most direct link a company has to its current and potential customers. Companies can study visitors' activities through web analysis and find patterns in visitors' behavior. The rich results yielded by web analysis, when coupled with company data warehouses, offer great opportunities for the near future.
What is Web Mining?
Web mining can be broadly defined as the discovery and analysis of useful information from the World Wide Web. Based on the different emphases and the different ways of obtaining information, web mining can be divided into two major parts: Web Contents Mining and Web Usage Mining. Web Contents Mining can be described as the automatic search and retrieval of information and resources available from millions of sites and on-line databases through search engines / web spiders. Web Usage Mining can be described as the discovery and analysis of user access patterns, through the mining of log files and associated data from a particular Web site.
Why Web Usage Mining?
In this paper, we focus on Web usage mining. The reasons are simple: with the explosion of E-commerce, the way companies do business has changed. E-commerce, mainly characterized by electronic transactions over the Internet, has provided a cost-efficient and effective way of doing business. The growth of some E-businesses is astonishing, considering how E-commerce has made Amazon.com the so-called “on-line Wal-Mart”. Unfortunately, to most companies, the web is nothing more than a place where transactions take place. They do not realize that as millions of visitors interact daily with Web sites around the world, massive amounts of data are being generated. Nor do they realize that this information could be very valuable to the company for understanding customer behavior, improving customer service and relationships, launching targeted marketing campaigns, measuring the success of marketing efforts, and so on.
How to perform Web Usage Mining?
Web usage mining begins with reporting visitor traffic information based on Web server log files and other sources of traffic data (as discussed below). Web server log files were used initially by webmasters and system administrators to answer questions such as “how much traffic they are getting, how many requests fail, and what kind of errors are being generated”. However, Web server log files can also record and trace visitors’ on-line behavior. For example, after some basic traffic analysis, the log files can help us answer questions such as “From what search engine are visitors coming? What pages are the most and least popular? Which browsers and operating systems are most commonly used by visitors?”
A Web log file is one way to collect Web traffic data. Other ways are to “sniff” TCP/IP packets as they cross the network, and to “plug in” to each Web server.
After the Web traffic data is obtained, it may be combined with other relational databases, over which data mining techniques are applied. Through data mining techniques such as association rules, path analysis, sequential analysis, clustering and classification, visitors’ behavior patterns are found and interpreted.
The above is a brief explanation of how Web usage mining is done. Most sophisticated systems and techniques for the discovery and analysis of patterns can be placed into two main categories, Pattern Analysis Tools and Pattern Discovery Tools, as discussed below in detail.
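The basic traffic questions above can be illustrated with a minimal sketch. It assumes the log is in the widely used Combined Log Format; the sample lines, the `LOG_PATTERN` expression and the `parse_line` helper are made up for illustration and are not taken from any particular tool.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Combined Log Format:
# host ident user [time] "request" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    '198.227.55.153 - - [09/Jul/2016:16:30:00 +0000] "GET /company/index.html HTTP/1.1" 200 5120 "http://search.yahoo.com/search?p=widgets" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '198.227.55.153 - - [09/Jul/2016:16:31:10 +0000] "GET /company/products/sample.html HTTP/1.1" 200 2048 "http://www.example.com/company/index.html" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '203.0.113.7 - - [09/Jul/2016:16:32:05 +0000] "GET /company/index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [09/Jul/2016:16:32:40 +0000] "GET /missing.html HTTP/1.1" 404 312 "-" "Mozilla/5.0"',
]

records = [r for r in map(parse_line, SAMPLE_LOG) if r is not None]

# Which pages are most popular?
page_hits = Counter(r["url"] for r in records)

# Where are visitors coming from? A referrer of "-" has no host part.
referrer_hosts = Counter(
    urlparse(r["referrer"]).netloc or "(direct)" for r in records
)

# How many requests failed (4xx or 5xx status)?
failed = sum(1 for r in records if r["status"].startswith(("4", "5")))
```

Calling `page_hits.most_common()` lists pages from most to least requested; browser and operating system counts could be derived from the `agent` field in the same way.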
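As one illustration of the techniques named above, the support and confidence of an association rule can be computed directly from visitor sessions. This is only a sketch: the sessions, page names and the helper functions `support` and `confidence` are hypothetical, and each session is assumed to be reduced to the set of pages it touched.

```python
def support(sessions, pages):
    """Fraction of sessions that contain every page in `pages`."""
    pages = set(pages)
    return sum(1 for s in sessions if pages <= s) / len(sessions)

def confidence(sessions, antecedent, consequent):
    """Of the sessions containing `antecedent`, the fraction that also contain `consequent`."""
    base = support(sessions, antecedent)
    if base == 0:
        return 0.0
    return support(sessions, set(antecedent) | set(consequent)) / base

# Hypothetical sessions, each reduced to the set of pages visited.
SESSIONS = [
    {"/products/sample.html", "/products/order.asp"},
    {"/products/sample.html", "/products/order.asp", "/whatsnew.html"},
    {"/products/sample.html"},
    {"/whatsnew.html"},
]

rule_support = support(SESSIONS, {"/products/sample.html", "/products/order.asp"})
rule_confidence = confidence(
    SESSIONS, {"/products/sample.html"}, {"/products/order.asp"}
)
```

Here the rule "visitors who viewed the sample page also visited the order page" holds in half of all sessions and in two thirds of the sessions that include the sample page.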
Pattern Analysis Tools
Web site administrators are extremely interested in questions like "How are people using the site?" "Which pages are being accessed most frequently?", etc. These questions require the analysis of the structure of hyperlinks as well as the contents of the pages. The end products of such analysis might include:
1. the frequency of visits per document,
2. most recent visit per document,
3. who is visiting which documents,
4. frequency of use of each hyperlink, and
5. most recent use of each hyperlink.
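The five end products listed above can be derived in a single pass over parsed traffic records. The sketch below assumes each record already carries a visitor identifier, the referring page, the requested page and a timestamp; the `HITS` data and all variable names are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime

# (visitor, referring page, requested page, time): hypothetical parsed records.
HITS = [
    ("visitor-a", "/index.html", "/products.html", datetime(2016, 7, 9, 16, 30)),
    ("visitor-b", "/index.html", "/products.html", datetime(2016, 7, 9, 17, 5)),
    ("visitor-a", "/products.html", "/order.html", datetime(2016, 7, 9, 16, 42)),
]

visits_per_doc = Counter()            # 1. frequency of visits per document
last_visit = {}                       # 2. most recent visit per document
visitors_per_doc = defaultdict(set)   # 3. who is visiting which documents
link_use = Counter()                  # 4. frequency of use of each hyperlink
last_link_use = {}                    # 5. most recent use of each hyperlink

for visitor, referrer, page, when in HITS:
    visits_per_doc[page] += 1
    last_visit[page] = max(when, last_visit.get(page, when))
    visitors_per_doc[page].add(visitor)

    # A hyperlink is approximated here by the (referring page, requested page) pair.
    link = (referrer, page)
    link_use[link] += 1
    last_link_use[link] = max(when, last_link_use.get(link, when))
```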
The techniques of Web usage pattern discovery, such as association, path analysis, and sequential patterns, will be illustrated below in detail.
The common techniques used for pattern analysis are visualization techniques, OLAP techniques, Data & Knowledge Querying, and Usability Analysis. However, this paper focuses mainly on Pattern Discovery, so Pattern Analysis will not be discussed in further detail.
Pattern Discovery Tools
Pattern Discovery Tools implement techniques from data mining, psychology, and information theory on the Web traffic data collected.
Data Pre-processing
Portions of Web usage data exist in sources as diverse as Web server logs, referral logs, registration files and index server logs. This information needs to be integrated to form a complete data set for data mining. However, before the data is integrated, Web log files need to be cleaned and filtered, using techniques such as filtering the raw data to eliminate outliers and/or irrelevant items, and grouping individual page accesses into semantic units.
Filtering the raw data to eliminate irrelevant items is important for web traffic analysis. Irrelevant items can be eliminated by checking the suffix of the URL name, which indicates the format of the requested file. For example, embedded graphics, whose suffix is usually “gif”, “jpeg”, “jpg”, “GIF”, “JPEG” or “JPG”, can be filtered out of the Web log file.
The next step is to integrate data from all sources to form visitor profile data. In other words, the data in registration files (mainly visitors' demographic and household information) can be appended to log and forms data. The figure gives an example of data integration.
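The suffix check can be sketched as follows. The comparison is made case-insensitive so that “gif” and “GIF” are handled alike, and any query string is ignored; the function names and the sample URLs are illustrative assumptions.

```python
import posixpath
from urllib.parse import urlparse

# Suffixes of embedded graphics, compared in lower case.
IMAGE_SUFFIXES = {".gif", ".jpeg", ".jpg"}

def is_embedded_graphic(url):
    """True if the URL path ends in one of the image suffixes."""
    path = urlparse(url).path               # drops any ?query part
    suffix = posixpath.splitext(path)[1].lower()
    return suffix in IMAGE_SUFFIXES

def drop_graphics(urls):
    """Keep only the requests that are not embedded graphics."""
    return [u for u in urls if not is_embedded_graphic(u)]

# Hypothetical requested URLs taken from a log.
requested = [
    "/index.html",
    "/img/logo.GIF",
    "/photo.jpeg?size=small",
    "/products/order.asp",
]
pages_only = drop_graphics(requested)
```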
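A minimal sketch of this integration step is given here, under the assumption that log records and registration records share a visitor identifier (for example a login name or cookie value). The field names, the sample data and the `build_profiles` helper are all hypothetical.

```python
# Hypothetical registration data keyed by visitor identifier.
REGISTRATIONS = {
    "u1001": {"age_group": "25-34", "household_size": 3},
}

# Hypothetical cleaned log records.
LOG_RECORDS = [
    {"visitor_id": "u1001", "url": "/company/products/sample.html"},
    {"visitor_id": "u1001", "url": "/company/products/order.asp"},
    {"visitor_id": "u2002", "url": "/company/index.html"},
]

def build_profiles(log_records, registrations):
    """Append registration fields to each visitor's page history."""
    profiles = {}
    for record in log_records:
        vid = record["visitor_id"]
        if vid not in profiles:
            # Visitors without a registration record get no demographic fields.
            profiles[vid] = {"visitor_id": vid, "pages": [], **registrations.get(vid, {})}
        profiles[vid]["pages"].append(record["url"])
    return profiles

profiles = build_profiles(LOG_RECORDS, REGISTRATIONS)
```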
Pattern Discovery Techniques
Converting IP addresses to Domain Names
Every visitor to a Web site connects to the Internet through an IP address (for example, 198.227.55.153). Many IP addresses have a corresponding domain name, and the two are linked through the Domain Name System (DNS). DNS converts a domain name that a visitor enters in a Web browser into the corresponding IP address. A visitor’s IP address can be converted into a domain name by using DNS in reverse, which is called a reverse DNS lookup.
You can hardly mine any knowledge from an IP number alone. However, if you convert the IP number into a domain name, some knowledge can be discovered. For example, you can estimate where visitors live by looking at the extension of each visitor’s domain name, such as .ca (Canada), .au (Australia), .cn (China), etc.
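The two steps can be sketched with the standard `socket` module. The reverse lookup needs network access and returns nothing when an address has no reverse record; the country table is limited to the three extensions mentioned above, and the function names are illustrative assumptions.

```python
import socket

# Country-code extensions mentioned in the text; a real table would be longer.
COUNTRY_BY_TLD = {"ca": "Canada", "au": "Australia", "cn": "China"}

def reverse_lookup(ip_address):
    """Return the host name for an IP address, or None if the lookup fails."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip_address)
        return hostname
    except OSError:  # covers socket.herror and socket.gaierror
        return None

def country_from_hostname(hostname):
    """Estimate a country from the last label of a host name, if it is known."""
    tld = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    return COUNTRY_BY_TLD.get(tld)
```

For example, `country_from_hostname("proxy.example.ca")` gives "Canada", while a generic extension such as .com gives no country at all, which is why this remains an estimate.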
Converting File Names to Page Titles
A well-designed site will have a title (between <title> and </title>) for every page. Rather than simply reporting the file names (URLs) requested, a good system should look at these files and determine their titles. Page titles are much easier to read than URLs, so a good system should show page titles on reports in addition to URLs.
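Title extraction can be sketched with Python's standard `html.parser` module. The sketch assumes the page's HTML text has already been read from disk or fetched; `TitleParser` and `page_title` are hypothetical names, and the URL is used as a fallback when a page has no title.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)

def page_title(html_text, fallback):
    """Return the page title, or `fallback` (e.g. the URL) if none is found."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    title = "".join(parser.title_parts).strip()
    return title or fallback
```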
Path Analysis
Graph models are most commonly used for Path Analysis. In a graph model, the graph represents some relation defined on Web pages (or the web), and each tree of the graph represents a web site. Each node in a tree represents a web page (HTML document). Edges between trees represent links between web sites, while edges between nodes inside the same tree represent links between documents at a web site.
When path analysis is used on the site as a whole, this information can offer valuable insights about navigational problems. Examples of information that can be discovered through path analysis are:
• 78% of clients who accessed /company/products/order.asp did so by starting at /company and proceeding through /company/whatsnew.html and /company/products/sample.html;
• 60% of clients left the site after four or fewer page references.
The first rule tells us that 78% of the visitors who reached the order page did so after seeing the product sample, which suggests that the sample page plays a role in the decision to purchase. The second rule indicates an attrition rate for the site. Since many users don’t browse further than four pages into the site, it is prudent to ensure that the most important information (the product sample, for example) is contained within four pages of the common site entry points.
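Both kinds of rule can be computed from ordered sessions, as in the sketch below. It assumes each session is a list of pages in the order they were requested and that a path counts as followed when its pages appear in that order, not necessarily consecutively. The sessions are made up, so the resulting percentages are not the 78% and 60% quoted above.

```python
TARGET = "/company/products/order.asp"
PATH = [
    "/company",
    "/company/whatsnew.html",
    "/company/products/sample.html",
    TARGET,
]

# Hypothetical sessions: pages in the order each visitor requested them.
SESSIONS = [
    ["/company", "/company/whatsnew.html", "/company/products/sample.html", TARGET],
    ["/company", "/company/whatsnew.html", "/company/products/sample.html",
     "/company/faq.html", TARGET],
    ["/company/products/sample.html", TARGET],
    ["/company", "/company/whatsnew.html"],
    ["/company"],
]

def follows_path(session, path):
    """True if the pages of `path` occur in `session` in the same order."""
    remaining = iter(session)
    # `step in remaining` advances the iterator, so order is enforced.
    return all(step in remaining for step in path)

def path_share(sessions, path, target):
    """Among sessions that reached `target`, the fraction that followed `path`."""
    reached = [s for s in sessions if target in s]
    if not reached:
        return 0.0
    return sum(follows_path(s, path) for s in reached) / len(reached)

def short_session_share(sessions, max_pages=4):
    """Fraction of sessions that ended after `max_pages` or fewer page references."""
    if not sessions:
        return 0.0
    return sum(len(s) <= max_pages for s in sessions) / len(sessions)

share_via_path = path_share(SESSIONS, PATH, TARGET)
attrition = short_session_share(SESSIONS)
```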
Grouping
Users usually can draw higher-level conclusions by grouping similar information. For example, grouping all Netscape browsers together and all Microsoft browsers together will show which browser is more popular on the site, regardless of minor versions. Similarly, grouping all referring URLs containing the word “Yahoo” shows how many visitors came from a Yahoo server. For example