18-09-2012, 01:56 PM
Efficient Near Duplicate Document Detection for Specialized Corpora
Eficient Near Duplicate.doc (Size: 363.26 KB / Downloads: 32)
Abstract
Knowledge of near duplicate documents can be advantageous to search engines, even those that only cover a small enterprise or specialized corpus. In this thesis, we investigate improvements to simhash, a signature-based method which can be used to efficiently detect near duplicate documents. We implement simhash in its original form, demonstrate its effectiveness on a small corpus of newspaper articles, and improve its accuracy by utilizing external metadata and altering its feature selection approach. We also demonstrate the fragility of simhash towards changes in the weighting of features by applying novel changes to the weights. As motivation for performing this near duplicate detection, we discuss the impact it can have on search engines.
Introduction
The purpose of this research is to investigate methods to detect near duplicate documents and their impact on the search process in an enterprise setting, which includes crawling a corpus, indexing the documents, and ranking the results.
This thesis will detail what it means to be considered a "near duplicate" document, and describe how detecting them can be beneficial. After describing methods that are currently in use for such detection, we implement simhash [24] to demonstrate its effectiveness on a small, specialized corpus, and offer improvements to its algorithm.
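To make the signature-based idea concrete, here is a minimal sketch of simhash in Python. It is illustrative only: it treats each whitespace-delimited token as an equally weighted feature and uses MD5 as the per-feature hash, whereas a real implementation would weight features (for example by term frequency) and typically uses word shingles rather than single tokens.

```python
import hashlib

def simhash(text, bits=64):
    """Minimal simhash sketch: unweighted, single-token features."""
    v = [0] * bits
    for token in text.split():
        # Hash each feature to a bits-bit value (lower bits of MD5 here).
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % (1 << bits)
        for i in range(bits):
            # For each bit position, add +1 if the bit is set, -1 otherwise.
            v[i] += 1 if (h >> i) & 1 else -1
    # The fingerprint keeps only the sign of each accumulated component.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(a, b):
    """Near duplicates yield fingerprints within a small Hamming distance."""
    return bin(a ^ b).count("1")
```

Two identical documents produce identical fingerprints (distance 0), and documents that share most of their features tend to produce fingerprints that differ in only a few bit positions, which is what makes the signature comparison efficient.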
Motivation
The driving motivation behind these near duplicate algorithms is to quickly and efficiently determine which documents in a large set are similar to each other. Knowing this information can lead to improvements in a variety of ways, depending on the application for which it is being used. Here we consider the case of a search engine working to serve a small corpus of documents.
Background and Related Work
When comparing two documents, it is easy to detect whether they are exactly identical, using checksums or other comparison techniques. Detecting whether a document varies only slightly, as in the case of near duplicates, is more difficult.
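The checksum approach can be sketched in a few lines. This is a generic illustration using SHA-256 from Python's standard library, not a method taken from the thesis: identical byte streams share a digest, but even a one-character edit produces a completely different digest, which is exactly why checksums cannot catch near duplicates.

```python
import hashlib

def checksum(doc: bytes) -> str:
    """Exact duplicates share an identical digest; near duplicates do not."""
    return hashlib.sha256(doc).hexdigest()
```

For example, `checksum(b"report v1")` equals itself on every call, while `checksum(b"report v2")` bears no resemblance to it, so a corpus can be deduplicated exactly by grouping documents with equal digests.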
Several approaches to detecting near duplicate documents have been proposed and tested on the web, and these can be implemented and explored further. Here we look at how a few of them work, and discuss their relevance to the enterprise domain.
Enterprise Corpora
Enterprise corpora differ from general web corpora in their size, quality, and (lack of) diversity.
Definition of Near Duplicate
We now consider exactly what it means for two documents to be near duplicates of each other. Many different definitions have been used, each depending on the particular use case and motivation. In a general sense, "near duplicates" are documents that differ only slightly in content.
Definition 1
In the extreme case, near duplicates are documents that appear identical to the user when viewed in a browser, despite the underlying code being different. This case arises when a document contains non-visible changes (such as HTML comments), or when multiple documents map to a common document (such as a print-style version that eliminates most of the formatting).