25-08-2012, 01:04 PM
Efficient Techniques for Online Record Linkage
Introduction :
The record-linkage problem (identifying and linking duplicate records) arises in the context of data cleansing, which is a necessary pre-step to many database applications. Databases frequently contain approximately duplicate fields and records that refer to the same real-world entity but are not identical.
Given the importance of data linkage in a variety of data-analysis applications, developing effective and efficient techniques for record linkage has emerged as an important problem. This is further evidenced by the emergence of numerous organizations (e.g., Trillium, First Logic, Vality, Data Flux) that are developing specialized, domain-specific record-linkage and data-cleansing tools.
The data needed to support decisions are often scattered across heterogeneous distributed databases. In such cases, it may be necessary to link records in multiple databases so that one can consolidate and use the data pertaining to the same real-world entity. If the databases use the same set of design standards, this linking can easily be done using the primary key (or other common candidate keys). However, since these heterogeneous databases are usually designed and managed by different organizations (or different units within the same organization), there may be no common candidate key for linking the records. Although it may be possible to use common non-key attributes (such as name, address, and date of birth) for this purpose, the result obtained using these attributes may not always be accurate. This is because non-key attribute values may not match even when the records represent the same entity instance in reality.
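To make the last point concrete, here is a minimal Python sketch of comparing two records on non-key attributes. The records, the `normalize` rule, and the use of the standard library's `difflib.SequenceMatcher` as a similarity measure are all illustrative assumptions, not something taken from the attached document.

```python
from difflib import SequenceMatcher

def normalize(value: str) -> str:
    """Lower-case, drop periods and commas, and collapse whitespace (an assumed cleaning rule)."""
    cleaned = value.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two attribute values after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def compare_records(rec_a: dict, rec_b: dict, attributes: list) -> dict:
    """Per-attribute similarity scores for two records."""
    return {attr: similarity(rec_a[attr], rec_b[attr]) for attr in attributes}

# Hypothetical records for the same person held in two different databases.
local = {"name": "Jonathan Smith", "address": "12 Main Street", "dob": "1975-03-02"}
remote = {"name": "Jonathon Smith", "address": "12 Main St.", "dob": "1975-03-02"}

# An exact comparison on all three attributes fails...
exact_match = all(local[a] == remote[a] for a in ("name", "address", "dob"))

# ...while a similarity comparison still shows the records are close.
scores = compare_records(local, remote, ["name", "address", "dob"])
print(exact_match, scores)
```

An exact join on these attributes would miss the pair, which is why record linkage usually works with degrees of agreement rather than strict equality.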
USER REQUIREMENTS :
Provide requirements of the system, user or business, taking into account all major classes/categories of users. Provide the type of security or other distinguishing characteristics of each set of users. List the functional requirements that compose each user requirement. As the functional requirements are decomposed, the highest level functional requirements are traced to the user requirements. Inclusion of lower level functional requirements is not mandatory in the traceability to user requirements if the parent requirements are already traced to them.
User requirement information can be in text or process flow format for each major user class that shows what inputs will initiate the system functions, system interactions, and what outputs are expected to be generated by the system. The scenarios should be comprehensive, to the extent that all user types and all major functions are covered. Give each user requirement a unique number. Typically, user requirements have a numbering system that is separate from the functional requirements. Requirements may be labeled with a leading “U” or other label indicating user requirements.
Non-Functional Requirements :
Non-Functional Requirements are not really requirements at all. Rather, they are constraints on implementing the functional requirements as defined in the use case documents and other documents or models. However, for the purposes of requirements management they are considered to be requirements and, as such, need to be tested. The first rule of defining non-functional requirements, therefore, is to ensure that they are testable. A requirement that cannot be tested may as well not be included as a requirement. The best way to ensure that non-functional requirements can be tested is to think about how they might be tested when they are written. If it is not obvious then ask a test developer how they would test it.
Non-functional requirements should also be written so that they are unambiguous and easy to understand. Some of the requirements may be very technical in nature, so not all of them can be expected to be understood by non-technical people. However, those who will have to implement them and those who will have to test them should be able to understand them easily and completely.
Linked Record Health Data Systems
The goal of record linkage is to quickly and accurately link records that correspond to the same person or entity. Because certain patterns of agreements and disagreements on variables are more likely among records pertaining to a single person than among records for different people, the observed patterns for pairs of records can be viewed as arising from a mixture of matches and non-matches. Mixture model estimates can be used to partition record pairs into two or more groups that can be labeled as probable matches (links) and probable non-matches. A method is proposed and illustrated that uses marginal information in the database to select mixture models, identifies sets of records for clerks to review based on the models and marginal information, incorporates clerically reviewed data, as they become available, into estimates of model parameters, and classifies pairs as links, non-links, or in need of further clerical review. The procedure is illustrated with five datasets from the U.S. Bureau of the Census. It appears to be robust to variations in record-linkage sites. The clerical review corrects classifications of some pairs directly and leads to changes in classification of others through re-estimation of mixture models.
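The mixture idea above can be sketched with a generic two-class model fitted by EM over binary agreement patterns. This is only an assumed, simplified illustration: it treats fields as conditionally independent, uses made-up toy data, and picks arbitrary thresholds of 0.9 and 0.1; it does not reproduce the model selection with marginal information or the clerical-review feedback described in the abstract.

```python
def clamp(x, eps=1e-6):
    """Keep probabilities strictly inside (0, 1) to avoid division by zero."""
    return min(max(x, eps), 1.0 - eps)

def posterior(pattern, p, m, u):
    """P(match | agreement pattern) under conditional independence of fields."""
    pm, pu = p, 1.0 - p
    for j, agrees in enumerate(pattern):
        pm *= m[j] if agrees else 1.0 - m[j]
        pu *= u[j] if agrees else 1.0 - u[j]
    return pm / (pm + pu)

def fit_mixture(patterns, n_iter=100):
    """EM for a two-class mixture: p = P(match), m[j] = P(agree | match), u[j] = P(agree | non-match)."""
    k, n = len(patterns[0]), len(patterns)
    p, m, u = 0.1, [0.9] * k, [0.1] * k  # assumed starting values
    for _ in range(n_iter):
        # E-step: posterior match probability for every record pair
        w = [posterior(g, p, m, u) for g in patterns]
        total = sum(w)
        # M-step: re-estimate the mixing proportion and agreement probabilities
        p = clamp(total / n)
        m = [clamp(sum(wi * g[j] for wi, g in zip(w, patterns)) / total) for j in range(k)]
        u = [clamp(sum((1 - wi) * g[j] for wi, g in zip(w, patterns)) / (n - total)) for j in range(k)]
    return p, m, u

def classify(prob, upper=0.9, lower=0.1):
    """Three-way decision; the thresholds are illustrative assumptions."""
    if prob >= upper:
        return "link"
    if prob <= lower:
        return "non-link"
    return "clerical review"

# Toy agreement patterns (1 = field agrees, 0 = field disagrees) for three fields.
patterns = ([(1, 1, 1)] * 20 + [(1, 1, 0)] * 5 + [(1, 0, 1)] * 5
            + [(0, 0, 0)] * 60 + [(0, 0, 1)] * 10 + [(0, 1, 0)] * 10)

p, m, u = fit_mixture(patterns)
print(classify(posterior((1, 1, 1), p, m, u)))
print(classify(posterior((0, 0, 0), p, m, u)))
```

Pairs whose posterior falls between the two thresholds are the ones that would be sent to clerks for review.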
Efficient Private Record Linkage
Record linkage is the computation of the associations among records of multiple databases. It arises in contexts like the integration of such databases, online interactions and negotiations, and many others. The autonomous entities that wish to carry out the record-matching computation are often reluctant to fully share their data. In such a framework, where the entities are unwilling to share data with each other, the problem of carrying out the linkage computation without full data exchange has been called private record linkage. Previous private record linkage techniques have made use of a third party. We provide efficient techniques for private record linkage that improve on previous work in that (i) they make no use of a third party; and (ii) they achieve much better performance than previous schemes in terms of execution time and quality of output (i.e., practically no false negatives and minimal false positives). Our software implementation provides experimental validation of our approach and the above claims.
CONCLUSION
In this paper, we develop efficient techniques to facilitate record linkage decisions in a distributed, online setting. Record linkage is an important issue in heterogeneous database systems where the records representing the same real-world entity type are identified using different identifiers in different databases. In the absence of a common identifier, it is often difficult to find records in a remote database that are similar to a local enquiry record. Traditional record linkage uses a probability-based model to identify the closeness between records. The matching probability is computed based on common attribute values. This, of course, requires that common attribute values of all the remote records be transferred to the local site. The communication overhead of such an operation is significant. We propose techniques for record linkage that draw upon previous work in sequential decision making. More specifically, we develop a matching tree for attribute acquisition and propose three different schemes of using this tree for record linkage.
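The idea of acquiring attributes one at a time, rather than transferring all of them, can be illustrated with a small Python sketch. It is not the matching tree or any of the three schemes from the paper; it is a generic sequential test in which the per-attribute probabilities in `PARAMS`, the prior, the thresholds, and the `fetch_remote_attr` callback are all hypothetical.

```python
# Hypothetical per-attribute probabilities:
# (P(agree | same entity), P(agree | different entities))
PARAMS = {"dob": (0.95, 0.01), "name": (0.90, 0.05), "address": (0.80, 0.10)}

def sequential_match(local, fetch_remote_attr, order,
                     prior=0.01, accept=0.9, reject=0.001):
    """Fetch remote attributes one by one and stop as soon as a decision is possible.

    Returns the decision and the number of attribute values transferred.
    """
    odds = prior / (1.0 - prior)
    fetched = 0
    for attr in order:
        remote_value = fetch_remote_attr(attr)  # stands in for one network transfer
        fetched += 1
        m, u = PARAMS[attr]
        if local[attr] == remote_value:
            odds *= m / u                  # agreement raises the match odds
        else:
            odds *= (1.0 - m) / (1.0 - u)  # disagreement lowers them
        prob = odds / (1.0 + odds)
        if prob >= accept:
            return "match", fetched
        if prob <= reject:
            return "non-match", fetched
    return "undecided", fetched

local = {"dob": "1975-03-02", "name": "jonathan smith", "address": "12 main street"}
same = {"dob": "1975-03-02", "name": "jonathan smith", "address": "12 main st"}
other = {"dob": "1980-11-20", "name": "maria lopez", "address": "4 oak avenue"}

order = ["dob", "name", "address"]
result_same = sequential_match(local, lambda a: same[a], order)
result_other = sequential_match(local, lambda a: other[a], order)
print(result_same, result_other)
```

In this toy run the first remote record is accepted after two attribute values and the second is rejected after one, so neither requires transferring all three attributes.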