14-06-2014, 03:55 PM
A Genetic Programming Approach to Record Deduplication
Abstract
Several systems that rely on consistent data to offer high-quality services, such as digital libraries and e-commerce brokers, may be affected by the existence of duplicates, quasi replicas, or near-duplicate entries in their repositories. Because of this, private and government organizations have invested significantly in methods for removing replicas from their data repositories. Clean, replica-free repositories not only allow the retrieval of higher-quality information but also lead to more concise data and to potential savings in the computational time and resources needed to process that data. We propose a genetic programming approach to record deduplication that combines several different pieces of evidence extracted from the data content to find a deduplication function able to identify whether two entries in a repository are replicas or not. As our experiments show, our approach outperforms an existing state-of-the-art method found in the literature. Moreover, the suggested functions are computationally less demanding since they use less evidence. In addition, our genetic programming approach is capable of automatically adapting these functions to a given fixed replica-identification boundary, freeing the user from the burden of having to choose and tune this parameter.
Existing System
To understand the impact of this problem, it is important to list and analyze the major consequences of allowing "dirty" data to exist in the repositories. A function used for record deduplication must accomplish distinct but conflicting objectives: it should efficiently maximize the identification of record replicas while avoiding mistakes during the process. Existing approaches to replica identification depend on several choices to set their parameters, these choices may not always be optimal, and setting the parameters places an extra burden on the user. A major cause of dirty data is the presence of duplicates, quasi replicas, or near-duplicates in these repositories, mainly those constructed by the aggregation or integration of distinct data sources. The problem of detecting and removing duplicate entries in a repository is generally known as record deduplication.
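To illustrate the parameter-tuning burden described above, the sketch below flags a pair of records as replicas when a normalized edit-distance similarity crosses a hand-tuned threshold. The function names and the 0.8 threshold are hypothetical choices for illustration, not values from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalized edit similarity in [0, 1]; 1.0 means identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# Hand-tuned parameter -- exactly the kind of setting the GP approach
# aims to free the user from choosing.
THRESHOLD = 0.8

def is_replica(rec1: str, rec2: str) -> bool:
    """Declare a pair of records replicas when similarity crosses THRESHOLD."""
    return similarity(rec1, rec2) >= THRESHOLD
```

A threshold that works well for one repository (e.g., short person names) may perform poorly on another (e.g., long citation strings), which is why tuning it by hand is error-prone.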
Proposed System
Previous work proposes a number of algorithms for matching citations from different sources based on edit distance, word matching, phrase matching, and subfield extraction, as well as a matching algorithm that, given a record in a file (or repository), looks for another record in a reference file that matches the first according to a given similarity function. The matched reference records are selected based on a user-defined minimum similarity threshold. The individuals are handled and modified by genetic operations such as reproduction, crossover, and mutation, in an iterative way that is expected to spawn better individuals (solutions to the proposed problem) in subsequent generations. We present a GP-based approach to record deduplication that is able to automatically suggest deduplication functions based on evidence present in the data repositories.
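A minimal sketch of the iterative loop just described, assuming individuals are simplified from full GP expression trees to plain weight vectors over three pieces of evidence; the training pairs, threshold, and population sizes are made up for illustration.

```python
import random

random.seed(42)  # make the toy run repeatable

# Hypothetical training data: (evidence similarities, is_true_replica)
TRAINING = [
    ([0.90, 0.80, 0.95], True),
    ([0.20, 0.10, 0.30], False),
    ([0.85, 0.90, 0.70], True),
    ([0.40, 0.30, 0.20], False),
]
THRESHOLD = 0.5  # fixed replica-identification boundary

def predict(weights, evidence):
    """Average weighted evidence, compared against the fixed boundary."""
    score = sum(w * e for w, e in zip(weights, evidence)) / len(weights)
    return score >= THRESHOLD

def fitness(weights):
    """Number of training pairs the individual classifies correctly."""
    return sum(predict(weights, ev) == label for ev, label in TRAINING)

def crossover(p1, p2):
    """Single-point crossover of two parent weight vectors."""
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]

def mutate(ind):
    """Replace one randomly chosen weight with a fresh random value."""
    child = list(ind)
    child[random.randrange(len(child))] = random.random()
    return child

population = [[random.random() for _ in range(3)] for _ in range(20)]
for generation in range(30):
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]                       # reproduction (elitism)
    children = [mutate(crossover(random.choice(survivors),
                                 random.choice(survivors)))
                for _ in range(10)]                   # crossover + mutation
    population = survivors + children

best = max(population, key=fitness)
```

Each generation keeps the fittest individuals and breeds new ones from them, so the best deduplication function found can only improve or stay the same over time.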
Hardware Requirements
• System : Pentium IV 2.4 GHz
• Hard Disk : 40 GB
• Floppy Drive : 1.44 MB
• Monitor : 15" VGA Colour
• Mouse : Logitech
• RAM : 256 MB
Software Requirements
• Operating System : Windows XP Professional
• Front End : Visual Studio .NET 2008
• Coding Language : Visual C# .NET
Domain: Knowledge and Data Engineering
Knowledge engineering is the building, maintenance, and development of knowledge-based systems. It has a great deal in common with software engineering and is used in many computer science domains related to artificial intelligence, including databases, data mining, expert systems, decision support systems, and geographic information systems. Knowledge engineering is also related to mathematical logic and is strongly involved in cognitive science and socio-cognitive engineering, where knowledge is produced by socio-cognitive aggregates (mainly humans) and is structured according to our understanding of how human reasoning and logic work.
Database administration
A database administrator (DBA) is a person responsible for the installation, configuration, upgrade, administration, monitoring, and maintenance of databases in an organization. The role includes the development and design of database strategies, monitoring and improving database performance and capacity, and planning for future expansion requirements. DBAs may also plan, coordinate, and implement security measures to safeguard the database.
Database Integration
A major cause of dirty data is the presence of duplicates, quasi replicas, or near-duplicates in these repositories, mainly those constructed by the aggregation or integration of distinct data sources.
Genetic Algorithms
The main aspect that distinguishes GP from other evolutionary techniques (e.g., genetic algorithms, evolutionary systems, genetic classifier systems) is that it represents the concepts and the interpretation of a problem as a computer program; even the data are viewed and manipulated in this way. We present a genetic programming (GP) approach to record deduplication. Our approach combines several different pieces of evidence extracted from the data content to produce a deduplication function that is able to identify whether two or more entries in a repository are replicas.
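To make the "problem represented as a computer program" idea concrete, the sketch below evaluates one candidate deduplication function represented as an expression tree. The evidence attributes, operator set, and sample tree are hypothetical, not the paper's actual representation details.

```python
# Leaves are either evidence names (per-attribute similarity scores) or
# numeric constants; internal nodes are (operator, left, right) tuples.
OPS = {
    "+": lambda a, b: a + b,
    "*": lambda a, b: a * b,
    "max": max,
    "min": min,
}

def evaluate(tree, evidence):
    """Recursively evaluate an expression tree against an evidence dict."""
    if isinstance(tree, (int, float)):   # constant leaf
        return tree
    if isinstance(tree, str):            # evidence leaf: look up its score
        return evidence[tree]
    op, left, right = tree               # internal node: apply the operator
    return OPS[op](evaluate(left, evidence), evaluate(right, evidence))

# One individual: title_sim * author_sim + max(year_sim, venue_sim)
individual = ("+", ("*", "title_sim", "author_sim"),
                   ("max", "year_sim", "venue_sim"))

evidence = {"title_sim": 0.9, "author_sim": 0.8,
            "year_sim": 1.0, "venue_sim": 0.4}

score = evaluate(individual, evidence)
# The pair would be declared a replica when this score crosses the fixed
# replica-identification boundary mentioned in the abstract.
```

Because each individual is itself a small program, crossover can swap whole subtrees between individuals and mutation can replace a subtree, which is what lets GP search the space of deduplication functions directly.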
Evolutionary Computing
Evolutionary programming is based on ideas inspired by a naturally observed process that influences virtually all living beings: natural selection. What distinguishes GP from other evolutionary techniques (e.g., genetic algorithms, evolutionary systems, genetic classifier systems) is that it represents the concepts and the interpretation of a problem as a computer program, and even the data are viewed and manipulated in this way.