21-04-2012, 12:26 PM
E-MINE : A WEB MINING APPROACH
ABSTRACT
In recent years, government agencies and industrial enterprises have been using the web as a medium of publication. As a result, a large collection of documents, images, text files and other unstructured data is available on the web. It has become increasingly difficult to identify relevant pieces of information, since pages are often cluttered with irrelevant content such as advertisements and copyright notices surrounding the main content. We therefore propose a technique that mines the relevant data regions from a web page.
Introduction
Extracting the regularly structured data records from web pages is an important problem.
The main disadvantage of the existing automatic approaches is their assumption that the relevant information of a data record is contained in a contiguous segment of HTML code, which is not always true.
Thus, we propose a more effective method to mine the data region in a web page. eMine finds the data regions formed by all types of tags.
Related Work
The main related work in the area of mining data records from a web page is MDR (Mining Data Records).
The MDR algorithm uses the HTML tag tree of the web page to extract data records from the page. However, misuse of HTML tags may cause an incorrect tag tree to be constructed, which in turn makes it impossible to extract data records correctly.
The Proposed Technique
We propose eMine, an effective method for mining the data region from a web page automatically.
The basic criterion eMine uses is visual information, i.e. the locations on the screen at which tags are rendered.
Algorithm eMine
1. Determine the height and width of all the bounding rectangles in the HTML document.
2. Calculate the areas of all the bounding rectangles.
3. Identify the maximum rectangle from all the bounding rectangles.
4. Identify the container within the maximum rectangle obtained from step 3.
5. Identify the data region in the container obtained from step 4.
6. Filter the data region obtained after step 5 to remove further irrelevant data.
Determining the Height and width of all bounding rectangles
In the first step, we determine the dimensions of all the bounding rectangles in the web page. A <table> tag in a web page may carry explicit height and width attributes; where present, we extract them.
If they are not specified, the MSHTML parsing and rendering engine of Microsoft Internet Explorer 6.0 can be used. This engine gives us the coordinates of a bounding rectangle.
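The attribute-extraction part of this step could be sketched as follows. This is only an illustration using Python's standard html.parser module, not the tooling used by eMine; the names TableSizeParser, to_pixels and table_sizes are hypothetical. It handles plain pixel values only and returns None wherever a dimension is missing or relative (e.g. "50%"), which are the cases where a rendering engine would have to supply the coordinates.

```python
from html.parser import HTMLParser


def to_pixels(value):
    """Return an int for plain pixel values such as "600" or "600px"; None otherwise."""
    if value is None:
        return None
    value = value.strip()
    if value.endswith("px"):
        value = value[:-2]
    return int(value) if value.isdigit() else None


class TableSizeParser(HTMLParser):
    """Collect the declared width and height of every <table> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sizes = []  # one (width, height) pair per <table>, None if not declared

    def handle_starttag(self, tag, attrs):
        if tag != "table":
            return
        attr = dict(attrs)  # attribute names arrive lowercased
        self.sizes.append((to_pixels(attr.get("width")),
                           to_pixels(attr.get("height"))))


def table_sizes(html):
    """Return the declared (width, height) of each <table> in document order."""
    parser = TableSizeParser()
    parser.feed(html)
    parser.close()
    return parser.sizes
```

For a page with an outer table declared as 600 by 400 and an inner table declared with a relative width, table_sizes would report the outer dimensions and a pair of None values for the inner one.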
Identification of the largest rectangle
Based on the height and width of the bounding rectangles obtained in the previous step, we determine the area of the bounding rectangle of each child of the <body> tag. We then determine the largest rectangle amongst these bounding rectangles.
The rationale is the assumption that the largest bounding rectangle will always contain the most relevant data in that web page.
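A minimal sketch of the area calculation and largest-rectangle selection, assuming the bounding rectangles are already available as simple records. The Rect type, its field names and the sample values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    name: str    # hypothetical label for the rendered element
    left: int    # x coordinate of the top-left corner, in pixels
    top: int     # y coordinate of the top-left corner, in pixels
    width: int
    height: int

    @property
    def area(self):
        return self.width * self.height


def largest_rectangle(rects):
    """Return the rectangle with the greatest area, or None for an empty list."""
    return max(rects, key=lambda r: r.area, default=None)


# Made-up layout: a banner, a main content block and a sidebar.
rects = [
    Rect("banner", 0, 0, 800, 90),
    Rect("content", 0, 90, 620, 900),
    Rect("sidebar", 620, 90, 180, 900),
]
print(largest_rectangle(rects).name)  # content
```

In this made-up layout the content block has the largest area (620 x 900), so it is the one selected for further processing.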