18-09-2012, 01:56 PM
Efficient Near Duplicate Document Detection for Specialized Corpora
Eficient Near Duplicate.doc (Size: 363.26 KB / Downloads: 32)
Abstract
Knowledge of near duplicate documents can be advantageous to search engines, even those that only cover a small enterprise or specialized corpus. In this thesis, we investigate improvements to simhash, a signature-based method which can be used to efficiently detect near duplicate documents. We implement simhash in its original form, demonstrate its effectiveness on a small corpus of newspaper articles, and improve its accuracy by utilizing external metadata and altering its feature selection approach. We also demonstrate the fragility of simhash towards changes in the weighting of features by applying novel changes to the weights. As motivation for performing this near duplicate detection, we discuss the impact it can have on search engines.
Introduction
The purpose of this research is to investigate methods to detect near duplicate documents and their impact on the search process in an enterprise setting, which includes crawling a corpus, indexing the documents, and ranking the results.
This thesis will detail what it means to be considered a "near duplicate" document, and describe how detecting them can be beneficial. After describing methods that are currently in use for such detection, we implement simhash [24] to demonstrate its effectiveness on a small, specialized corpus, and offer improvements to its algorithm.
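To make the signature-based idea concrete, here is a minimal sketch of simhash in Python. It is illustrative only: it treats each whitespace-delimited token as an equally weighted feature and uses MD5 as the per-feature hash, whereas a real implementation would weight features (for example by term frequency) and typically uses word shingles rather than single tokens.

```python
import hashlib

def simhash(text, bits=64):
    """Minimal simhash sketch: unweighted, single-token features."""
    v = [0] * bits
    for token in text.split():
        # Hash each feature to a bits-bit value (lower bits of MD5 here).
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % (1 << bits)
        for i in range(bits):
            # For each bit position, add +1 if the bit is set, -1 otherwise.
            v[i] += 1 if (h >> i) & 1 else -1
    # The fingerprint keeps only the sign of each accumulated component.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(a, b):
    """Near duplicates yield fingerprints within a small Hamming distance."""
    return bin(a ^ b).count("1")
```

Two identical documents produce identical fingerprints (distance 0), and documents that share most of their features tend to produce fingerprints that differ in only a few bit positions, which is what makes the signature comparison efficient.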
Motivation
The driving motivation behind these near duplicate algorithms is to quickly and efficiently determine which documents in a large set are similar to each other. Knowing this information can lead to improvements in a variety of ways, depending on the application for which it is being used. Here we consider the case of a search engine working to serve a small corpus of documents.
Background and Related Work
When comparing two documents, it is easy to detect whether they are exactly identical, using checksums or other comparison techniques. Detecting whether a document varies only slightly, as in the case of near duplicates, is more difficult.
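The checksum approach can be sketched in a few lines. This is a generic illustration using SHA-256 from Python's standard library, not a method taken from the thesis: identical byte streams share a digest, but even a one-character edit produces a completely different digest, which is exactly why checksums cannot catch near duplicates.

```python
import hashlib

def checksum(doc: bytes) -> str:
    """Exact duplicates share an identical digest; near duplicates do not."""
    return hashlib.sha256(doc).hexdigest()
```

For example, `checksum(b"report v1")` equals itself on every call, while `checksum(b"report v2")` bears no resemblance to it, so a corpus can be deduplicated exactly by grouping documents with equal digests.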
Several approaches to detecting near duplicate documents have been proposed and tested on the web, and these can be implemented and explored further. Here we look at how a few of them work, and discuss their relevance to the enterprise domain.
Enterprise Corpora
Enterprise corpora differ from general web corpora in their size, quality, and (lack of) diversity.
Definition of Near Duplicate
We now consider exactly what it means for two documents to be near duplicates of each other. Many different definitions have been used, each depending on the particular use case and motivation. In a general sense, "near duplicates" are documents that differ only slightly in content.
Definition 1
In the extreme case, near duplicates are documents that appear identical to the user when viewed in a browser, despite the underlying code being different. This case arises when a document contains non-visible changes (such as HTML comments), or when multiple documents map to a common document (such as a print-style version that eliminates most of the formatting).