05-01-2013, 06:55 PM
A Genetic Programming Approach to Record Deduplication
Abstract:
Several systems that rely on consistent data to offer high-quality services, such as digital libraries and e-commerce
brokers, may be affected by the existence of duplicates, quasi-replicas, or near-duplicate entries in their repositories. Because of that,
there have been significant investments from private and government organizations for developing methods for removing replicas from
their data repositories. This is due to the fact that clean and replica-free repositories not only allow the retrieval of higher quality
information but also lead to more concise data and to potential savings in computational time and resources to process this data. In this
paper, we propose a genetic programming approach to record deduplication that combines several different pieces of evidence
extracted from the data content to find a deduplication function that is able to identify whether two entries in a repository are replicas or
not. As shown by our experiments, our approach outperforms an existing state-of-the-art method found in the literature. Moreover, the
suggested functions are computationally less demanding since they use less evidence. In addition, our genetic programming
approach is capable of automatically adapting these functions to a given fixed replica identification boundary, freeing the user from the
burden of having to choose and tune this parameter.
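To make the idea in the abstract more concrete, here is a minimal Python sketch of what a "deduplication function" built from several pieces of evidence might look like. It is only an illustration under assumptions, not the paper's implementation: the attribute names, the similarity functions, the tree encoding, the example tree, and the boundary value of 1.0 are all hypothetical. The genetic programming search itself (selection, crossover, mutation over many candidate trees) is not shown; the sketch only shows how one candidate function would be evaluated against a fixed replica identification boundary.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Similarity in [0, 1] using the standard-library SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_match(a, b):
    """1.0 if the two values are identical, otherwise 0.0."""
    return 1.0 if a == b else 0.0


# Hypothetical evidence: each entry pairs a record attribute with a
# similarity function applied to that attribute.
EVIDENCE = {
    "name_sim": ("name", string_similarity),
    "city_sim": ("city", string_similarity),
    "year_eq": ("year", exact_match),
}

# Operators allowed at the inner nodes of a candidate function tree.
OPS = {"+": lambda x, y: x + y, "*": lambda x, y: x * y}


def evidence_values(r1, r2):
    """Compute every piece of evidence for a pair of records."""
    return {key: fn(r1[attr], r2[attr]) for key, (attr, fn) in EVIDENCE.items()}


def evaluate(tree, values):
    """Evaluate a tree: (op, left, right), an evidence name, or a constant."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, values), evaluate(right, values))
    if isinstance(tree, str):
        return values[tree]
    return float(tree)


def is_replica(tree, r1, r2, boundary=1.0):
    """Flag the pair as replicas when the tree's output exceeds the boundary."""
    return evaluate(tree, evidence_values(r1, r2)) > boundary


# One hand-written candidate; a GP run would evolve many such trees and
# keep the ones that classify labeled example pairs best.
candidate = ("+", ("*", "name_sim", "year_eq"), "city_sim")

a = {"name": "John Smith", "city": "Boston", "year": "1999"}
b = {"name": "Jon Smith", "city": "Boston", "year": "1999"}
c = {"name": "Maria Lopez", "city": "Denver", "year": "2005"}

print(is_replica(candidate, a, b))  # near-duplicate pair
print(is_replica(candidate, a, c))  # unrelated pair
```

In this sketch the boundary stays fixed and only the tree changes, which matches the abstract's point that the functions adapt to a given boundary so the user does not have to tune it.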
About myself
I am Ramya. I am a third-year B.Tech student at RVR & JC College, Guntur.
I am giving a PPT presentation at my college, so I need to understand this topic. I do not know what the actual content of this paper is. For a better understanding, please send this paper along with a PPT and content on this topic.