14-06-2014, 03:55 PM
A Genetic Programming Approach to Record Deduplication
Abstract
Several systems that rely on consistent data to offer high-quality services, such as digital libraries and e-commerce brokers, may be affected by the existence of duplicates, quasi replicas, or near-duplicate entries in their repositories. Because of this, private and government organizations have invested significantly in methods for removing replicas from their data repositories. Clean, replica-free repositories not only allow the retrieval of higher-quality information but also lead to more concise data and to potential savings in the computational time and resources needed to process that data. We propose a genetic programming approach to record deduplication that combines several different pieces of evidence extracted from the data content to find a deduplication function able to identify whether two entries in a repository are replicas or not. As our experiments show, our approach outperforms an existing state-of-the-art method found in the literature. Moreover, the suggested functions are computationally less demanding since they use less evidence. In addition, our genetic programming approach is capable of automatically adapting these functions to a given fixed replica-identification boundary, freeing the user from the burden of having to choose and tune this parameter.
Existing System
To understand the impact of this problem, it is important to list and analyze the major consequences of allowing "dirty" data to exist in the repositories. A function used for record deduplication must accomplish distinct but conflicting objectives: it should efficiently maximize the identification of record replicas while avoiding mistakes during the process. Existing approaches to replica identification depend on several choices to set their parameters, these choices may not always be optimal, and setting the parameters places an extra burden on the user. A major cause of dirty data is the presence of duplicates, quasi replicas, or near-duplicates in these repositories, mainly those constructed by the aggregation or integration of distinct data sources. The problem of detecting and removing duplicate entries in a repository is generally known as record deduplication.
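To illustrate the parameter-tuning burden described above, the sketch below flags a pair of records as replicas when a normalized edit-distance similarity crosses a hand-tuned threshold. The function names and the 0.8 threshold are hypothetical choices for illustration, not values from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalized edit similarity in [0, 1]; 1.0 means identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# Hand-tuned parameter -- exactly the kind of setting the GP approach
# aims to free the user from choosing.
THRESHOLD = 0.8

def is_replica(rec1: str, rec2: str) -> bool:
    """Declare a pair of records replicas when similarity crosses THRESHOLD."""
    return similarity(rec1, rec2) >= THRESHOLD
```

A threshold that works well for one repository (e.g., short person names) may perform poorly on another (e.g., long citation strings), which is why tuning it by hand is error-prone.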
Proposed System
Previous work proposes a number of algorithms for matching citations from different sources based on edit distance, word matching, phrase matching, and subfield extraction, as well as a matching algorithm that, given a record in a file (or repository), looks for another record in a reference file that matches the first according to a given similarity function. The matched reference records are selected based on a user-defined minimum similarity threshold. The individuals are handled and modified by genetic operations such as reproduction, crossover, and mutation, in an iterative way that is expected to spawn better individuals (solutions to the proposed problem) in subsequent generations. We present a GP-based approach to record deduplication that is able to automatically suggest deduplication functions based on evidence present in the data repositories.
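A minimal sketch of the iterative loop just described, assuming individuals are simplified from full GP expression trees to plain weight vectors over three pieces of evidence; the training pairs, threshold, and population sizes are made up for illustration.

```python
import random

random.seed(42)  # make the toy run repeatable

# Hypothetical training data: (evidence similarities, is_true_replica)
TRAINING = [
    ([0.90, 0.80, 0.95], True),
    ([0.20, 0.10, 0.30], False),
    ([0.85, 0.90, 0.70], True),
    ([0.40, 0.30, 0.20], False),
]
THRESHOLD = 0.5  # fixed replica-identification boundary

def predict(weights, evidence):
    """Average weighted evidence, compared against the fixed boundary."""
    score = sum(w * e for w, e in zip(weights, evidence)) / len(weights)
    return score >= THRESHOLD

def fitness(weights):
    """Number of training pairs the individual classifies correctly."""
    return sum(predict(weights, ev) == label for ev, label in TRAINING)

def crossover(p1, p2):
    """Single-point crossover of two parent weight vectors."""
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]

def mutate(ind):
    """Replace one randomly chosen weight with a fresh random value."""
    child = list(ind)
    child[random.randrange(len(child))] = random.random()
    return child

population = [[random.random() for _ in range(3)] for _ in range(20)]
for generation in range(30):
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]                       # reproduction (elitism)
    children = [mutate(crossover(random.choice(survivors),
                                 random.choice(survivors)))
                for _ in range(10)]                   # crossover + mutation
    population = survivors + children

best = max(population, key=fitness)
```

Each generation keeps the fittest individuals and breeds new ones from them, so the best deduplication function found can only improve or stay the same over time.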
Hardware Requirements
• System : Pentium IV 2.4 GHz
• Hard Disk : 40 GB
• Floppy Drive : 1.44 MB
• Monitor : 15" VGA Colour
• Mouse : Logitech
• RAM : 256 MB
Software Requirements
• Operating System : Windows XP Professional
• Front End : Visual Studio .NET 2008
• Coding Language : Visual C# .NET
Domain: Knowledge and Data Engineering
Knowledge engineering is the building, maintenance, and development of knowledge-based systems. It has a great deal in common with software engineering and is used in many computer science domains related to artificial intelligence, including databases, data mining, expert systems, decision support systems, and geographic information systems. Knowledge engineering is also related to mathematical logic and is strongly involved in cognitive science and socio-cognitive engineering, where knowledge is produced by socio-cognitive aggregates (mainly humans) and is structured according to our understanding of how human reasoning and logic work.
Database administration
A database administrator (DBA) is a person responsible for the installation, configuration, upgrade, administration, monitoring, and maintenance of databases in an organization. The role includes the development and design of database strategies, monitoring and improving database performance and capacity, and planning for future expansion requirements. DBAs may also plan, coordinate, and implement security measures to safeguard the database.
Database Integration
A major cause of dirty data is the presence of duplicates, quasi replicas, or near-duplicates in these repositories, mainly those constructed by the aggregation or integration of distinct data sources.
Genetic Algorithms
The main aspect that distinguishes GP from other evolutionary techniques (e.g., genetic algorithms, evolutionary systems, genetic classifier systems) is that it represents the concepts and the interpretation of a problem as a computer program; even the data are viewed and manipulated in this way. We present a genetic programming (GP) approach to record deduplication. Our approach combines several different pieces of evidence extracted from the data content to produce a deduplication function that is able to identify whether two or more entries in a repository are replicas.
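To make the "problem represented as a computer program" idea concrete, the sketch below evaluates one candidate deduplication function represented as an expression tree. The evidence attributes, operator set, and sample tree are hypothetical, not the paper's actual representation details.

```python
# Leaves are either evidence names (per-attribute similarity scores) or
# numeric constants; internal nodes are (operator, left, right) tuples.
OPS = {
    "+": lambda a, b: a + b,
    "*": lambda a, b: a * b,
    "max": max,
    "min": min,
}

def evaluate(tree, evidence):
    """Recursively evaluate an expression tree against an evidence dict."""
    if isinstance(tree, (int, float)):   # constant leaf
        return tree
    if isinstance(tree, str):            # evidence leaf: look up its score
        return evidence[tree]
    op, left, right = tree               # internal node: apply the operator
    return OPS[op](evaluate(left, evidence), evaluate(right, evidence))

# One individual: title_sim * author_sim + max(year_sim, venue_sim)
individual = ("+", ("*", "title_sim", "author_sim"),
                   ("max", "year_sim", "venue_sim"))

evidence = {"title_sim": 0.9, "author_sim": 0.8,
            "year_sim": 1.0, "venue_sim": 0.4}

score = evaluate(individual, evidence)
# The pair would be declared a replica when this score crosses the fixed
# replica-identification boundary mentioned in the abstract.
```

Because each individual is itself a small program, crossover can swap whole subtrees between individuals and mutation can replace a subtree, which is what lets GP search the space of deduplication functions directly.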
Evolutionary Computing
Evolutionary programming is based on ideas inspired by a naturally observed process that influences virtually all living beings: natural selection. What distinguishes GP from other evolutionary techniques (e.g., genetic algorithms, evolutionary systems, genetic classifier systems) is that it represents the concepts and the interpretation of a problem as a computer program, and even the data are viewed and manipulated in this way.