22-09-2016, 12:44 PM
Abstract: With the rapid growth in use of the World Wide Web, IP addresses give the server administrator the information needed to find similar web pages (links) opened by users in a given session. Web access data is recorded by the server in a human-readable form referred to as a web log. Storing and retrieving information from the log server is a challenging task. Web mining is classified into three sub-tasks: web content mining, web structure mining, and web usage mining. This paper explains web usage mining based on server log files using the open source tool RapidMiner. The proposed work analyses the usage of web pages (i.e. the browsing behaviour of users and browsing based on IP address) using two clustering algorithms, k-means and random clustering, both of which are available in RapidMiner.
Introduction:
Data mining refers to extracting useful knowledge from large amounts of data. It is the process of analyzing data from different perspectives and summarizing it into useful information. Users must be able to collect the required information that flows through the Internet. Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services. For users, the Web is the source from which the required information is extracted by following hyperlinks. Web mining is divided into three classes based on the kind of information extracted, as shown in Fig. 1. Extracting content such as video, audio, text, and images is known as Web Content Mining. Extracting information from the structure of web pages is known as Web Structure Mining. Web Usage Mining analyzes web access by users based on IP address and clusters users by IP address and web page similarity. To perform this analysis, web usage data must be collected from the Web server log files, which are also the basis of website statistics. A server log is a simple text file that records activity on the server.
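To make the idea of a server log concrete, the following is a minimal sketch of reading log lines and grouping the requested pages by IP address. It assumes the log is in the NCSA Common Log Format, which the paper does not specify; the names LOG_PATTERN, parse_line, pages_by_ip and the SAMPLE lines are hypothetical and used only for illustration.

```python
import re
from collections import defaultdict

# Assumption: NCSA Common Log Format, i.e.
# host ident user [time] "method url protocol" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

def pages_by_ip(lines):
    """Group requested URLs by client IP address, keeping request order."""
    visits = defaultdict(list)
    for line in lines:
        record = parse_line(line)
        if record is not None:
            visits[record["ip"]].append(record["url"])
    return dict(visits)

# Made-up sample lines (documentation IP ranges), not data from the paper.
SAMPLE = [
    '192.0.2.1 - - [22/Sep/2016:12:44:01 +0000] "GET /index.html HTTP/1.1" 200 1043',
    '192.0.2.1 - - [22/Sep/2016:12:44:05 +0000] "GET /about.html HTTP/1.1" 200 512',
    '198.51.100.7 - - [22/Sep/2016:12:45:10 +0000] "GET /index.html HTTP/1.1" 200 1043',
]

if __name__ == "__main__":
    for ip, urls in pages_by_ip(SAMPLE).items():
        print(ip, urls)
```

The per-IP page lists produced this way are the kind of input on which users could later be compared for page similarity.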
RapidMiner Tool:
RapidMiner is a software platform developed by the RapidMiner company that provides an integrated environment for data mining, text mining, web mining, predictive analytics and business analytics. Here, RapidMiner is used to analyze web access information for Web Usage Mining. It is open source licensed software that supports data mining and web mining, including data extraction, transformation and loading (ETL), data pre-processing, and visualization. The analyzed results can be viewed as scatter plots, bar graphs, pie charts, histograms, etc. RapidMiner is written in the Java programming language. An advantage of this tool is that results can be analyzed without writing any code. The tool contains built-in operators, each of which performs a single task within the process; the output of each operator forms the input of the next one. Datasets in different formats can be imported into the tool, such as Excel, ARFF, text documents, and web server log files. RapidMiner functionality can be extended with additional plug-ins, which are made available via the RapidMiner Marketplace. Web Usage Mining can be performed by adding the Web Mining plug-in from the tool's Marketplace.
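The operator chaining described above, where each operator's output becomes the next operator's input, can be pictured with a small sketch. This is plain Python and not RapidMiner's API; run_process, drop_empty and to_lower are hypothetical names chosen for illustration.

```python
from functools import reduce

def run_process(operators, data):
    """Pass data through the operators in order; each output feeds the next operator."""
    return reduce(lambda current, operator: operator(current), operators, data)

def drop_empty(rows):
    """Operator 1: discard empty entries."""
    return [row for row in rows if row]

def to_lower(rows):
    """Operator 2: normalise entries to lower case."""
    return [row.lower() for row in rows]

# The process is just an ordered list of single-task operators.
result = run_process([drop_empty, to_lower], ["/Index.html", "", "/About.html"])
```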
Pre-processing:
Real-world data are generally incomplete (lacking attribute values, lacking certain attributes of interest, or containing only aggregate data), noisy (containing errors or outliers), and inconsistent (containing discrepancies in codes or names). Here, the web server log data needs to be complete in order to form clusters based on IP address or similar users. The first task is therefore to clean the data using the Replace Missing Values operator, which is built into RapidMiner; the cleaned data can then be clustered as per user requirements. Pre-processing also removes entries that are not required for web usage analysis, such as requests for images, audio, and video.
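Outside RapidMiner, the two cleaning steps described above could be sketched as follows. This is an illustrative approximation, not the behaviour of the Replace Missing Values operator itself: the record layout (dicts with "url" and "size" keys), the extension list in MEDIA_EXTENSIONS, and the default of 0 for a missing size are all assumptions.

```python
# Assumed list of file extensions treated as images, audio, or video.
MEDIA_EXTENSIONS = (
    ".jpg", ".jpeg", ".png", ".gif", ".ico",
    ".mp3", ".wav", ".mp4", ".avi",
)

def clean_records(records, default_size=0):
    """Drop media requests and replace a missing response size with a default."""
    cleaned = []
    for record in records:
        # Ignore any query string when checking the file extension.
        path = (record.get("url") or "").split("?", 1)[0].lower()
        if path.endswith(MEDIA_EXTENSIONS):
            continue  # not needed for web usage analysis
        fixed = dict(record)  # copy so the input is left unchanged
        size = record.get("size")
        # Logs commonly write "-" when no size was recorded (assumption).
        fixed["size"] = default_size if size in (None, "", "-") else int(size)
        cleaned.append(fixed)
    return cleaned
```

A constant default is only one possible replacement strategy; an average or minimum over the column would fit the same structure.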