E-MINE : A WEB MINING APPROACH

project uploader · 21-04-2012, 12:26 PM

ABSTRACT
In recent years, government agencies and industrial enterprises have been using the web as a medium of publication. As a result, a large collection of documents, images, text files and other forms of unstructured data is available on the web. It has become increasingly difficult to identify the relevant pieces of information, since pages are often cluttered with irrelevant content such as advertisements and copyright notices surrounding the main content. We therefore propose a technique that mines the relevant data regions from a web page.
Introduction
Extracting regularly structured data records from web pages is an important problem.
The main disadvantage of the existing automatic approaches is that they assume the relevant information of a data record is contained in a contiguous segment of HTML code, which is not always true.
We therefore propose a more effective method to mine the data region in a web page. eMine finds the data regions formed by all types of tags.
Related Work
The main related work in the area of mining data records from a web page is MDR (Mining Data Records).
The MDR algorithm uses the HTML tag tree of the web page to extract data records from the page. However, misused HTML tags can cause an incorrect tag tree to be constructed, which in turn makes it impossible to extract the data records correctly.
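To illustrate how misused tags can distort a tag tree, here is a minimal sketch in Python. It is not MDR itself: NaiveTreeBuilder, build_tree and outline are hypothetical names, and the builder deliberately performs no error recovery, which is an assumption made only to show the failure mode.

```python
from html.parser import HTMLParser


class Node:
    """One element in the tag tree."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class NaiveTreeBuilder(HTMLParser):
    """Builds a tag tree by trusting the markup (no error recovery)."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # Every start tag becomes a child of the element that is currently open.
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def build_tree(html):
    builder = NaiveTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node):
    """Return the tree as nested (tag, [children]) tuples for easy comparison."""
    return (node.tag, [outline(child) for child in node.children])


# Two records as sibling <li> elements, first closed properly, then not.
well_formed = "<ul><li>A</li><li>B</li></ul>"
misused = "<ul><li>A<li>B</ul>"

print(outline(build_tree(well_formed)))
print(outline(build_tree(misused)))
```

With the closing tags missing, the second record ends up nested inside the first instead of beside it, so any method that looks for repeated sibling subtrees would no longer see two records.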
The Proposed Technique
We propose eMine, an effective method to mine the data region from a web page automatically.
The basic criterion eMine uses is the location on the screen at which each tag is rendered, i.e. visual information.
Algorithm eMine
1. Determine the height and width of all the bounding rectangles in the HTML document.
2. Calculate the areas of all the bounding rectangles.
3. Identify the maximum rectangle from all the bounding rectangles.
4. Identify the container within the maximum rectangle obtained from step 3.
5. Identify the data region in the container obtained from step 4.
6. Filter the data region obtained from step 5 to remove some more irrelevant data.
Determining the Height and width of all bounding rectangles
In the first step, we determine the dimensions of all the bounding rectangles in the web page. A <table> tag in a web page is usually associated with specific height and width attributes. Where they are present, we extract them.
If they are not specified, the MSHTML parsing and rendering engine of Microsoft Internet Explorer 6.0 can be used. This parsing and rendering engine of the web browser gives us the coordinates of a bounding rectangle.
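The attribute-extraction part of this step can be sketched with Python's standard html.parser module. The names TableSizeParser, parse_pixels and table_sizes are hypothetical, and treating missing or percentage values as unknown (None) is an assumption; those are the cases where a rendering engine would have to supply the dimensions.

```python
from html.parser import HTMLParser


def parse_pixels(value):
    """Return an int pixel count, or None for missing, percentage or invalid values."""
    if value is None:
        return None
    value = value.strip().lower()
    if value.endswith("px"):
        value = value[:-2]
    return int(value) if value.isdigit() else None


class TableSizeParser(HTMLParser):
    """Collects (width, height) for every <table> start tag, in document order."""
    def __init__(self):
        super().__init__()
        self.sizes = []

    def handle_starttag(self, tag, attrs):
        if tag != "table":
            return
        attr = dict(attrs)  # attribute names arrive lowercased
        self.sizes.append((parse_pixels(attr.get("width")),
                           parse_pixels(attr.get("height"))))


def table_sizes(html):
    parser = TableSizeParser()
    parser.feed(html)
    parser.close()
    return parser.sizes


page = (
    '<table width="600" height="400"><tr><td>main</td></tr></table>'
    '<table width="100%"><tr><td>banner</td></tr></table>'
    '<table><tr><td>footer</td></tr></table>'
)
print(table_sizes(page))
```

Only the first table in this sample yields usable dimensions. The other two come back as (None, None) and would need coordinates from the rendering engine.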
Identification of the largest rectangle
Based on the height and width of the bounding rectangles obtained in the previous step, we determine the area of the bounding rectangle of each child of the <body> tag. We then determine the largest rectangle amongst these bounding rectangles.
The reason for doing this is the assumption that the largest bounding rectangle contains the most relevant data in the web page.
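The area computation and the selection of the largest rectangle can be sketched as follows. BoundingRect and largest_rectangle are hypothetical names, and the sample dimensions are made up for illustration only.

```python
from dataclasses import dataclass


@dataclass
class BoundingRect:
    """Rendered size of one element, in pixels."""
    tag: str
    width: int
    height: int

    @property
    def area(self):
        return self.width * self.height


def largest_rectangle(rects):
    """Return the rectangle with the greatest area (the first one on ties)."""
    if not rects:
        raise ValueError("no bounding rectangles to compare")
    return max(rects, key=lambda r: r.area)


# Illustrative values only: a header, a side menu, the main table and a footer.
rects = [
    BoundingRect("header", 800, 100),
    BoundingRect("menu", 200, 600),
    BoundingRect("content", 600, 600),
    BoundingRect("footer", 800, 50),
]

best = largest_rectangle(rects)
print(best.tag, best.area)
```

Here the content rectangle wins with an area of 360000 square pixels, ahead of the menu at 120000, so it would be passed on to the container-identification step.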
