13.3.10

An Adaptive Algorithm for Detection of Duplicate Records (ppt)

Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this article, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with a coverage of existing tools and with a brief discussion of the big open problems in the area.


Introduction:

Databases play an important role in today's IT-based economy. Many industries and systems depend on the accuracy of databases to carry out operations. Therefore, the quality of the information (or the lack thereof) stored in the databases can have significant cost implications for a system that relies on information to function and conduct business. In an error-free system with perfectly clean data, the construction of a comprehensive view of the data consists of linking (in relational terms, joining) two or more tables on their key fields. Unfortunately, data often lack a unique, global identifier that would permit such an operation. Furthermore, the data are neither carefully controlled for quality nor defined in a consistent way across different data sources. Thus, data quality is often compromised by many factors, including data entry errors (e.g., Microsft instead of Microsoft), missing integrity constraints (e.g., allowing entries such as EmployeeAge=567), and multiple conventions for recording information (e.g., 44 W. 4th St. versus 44 West Fourth Street). To make things worse, in independently managed databases not only the values, but also the structure, semantics, and underlying assumptions about the data may differ.
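To make this concrete, here is a minimal sketch (the table contents and field names are hypothetical, not taken from the paper) of why an exact, key-style join cannot reconcile records once such errors creep in:

    # Hypothetical customer and order rows; "Microsft" is a data entry error.
    customers = [
        {"cust_id": 1, "name": "Microsoft", "city": "Redmond"},
        {"cust_id": 2, "name": "Microsft", "city": "Redmond"},
    ]
    orders = [
        {"order_id": 100, "vendor_name": "Microsoft"},
        {"order_id": 101, "vendor_name": "Microsft"},
    ]

    # An exact join on the name field links only literally identical strings,
    # so the two spellings of the same company are never brought together.
    joined = [(c["cust_id"], o["order_id"])
              for c in customers
              for o in orders
              if c["name"] == o["vendor_name"]]
    print(joined)  # [(1, 100), (2, 101)] -- the misspelled rows stay separate

Approximate matching, which the paper surveys, is what closes this gap.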


Often, while integrating data from different sources to implement a data warehouse, organizations become aware of potential systematic differences or conflicts. Such problems fall under the umbrella term data heterogeneity. Data cleaning, or data scrubbing, refers to the process of resolving such identification problems in the data. We distinguish between two types of data heterogeneity: structural and lexical. Structural heterogeneity occurs when the fields of the tuples in the database are structured differently in different databases. For example, in one database the customer address might be recorded in a single field named, say, addr, while in another database the same information might be stored in multiple fields such as street, city, state, and zipcode. Lexical heterogeneity occurs when the tuples have identically structured fields across databases, but the data use different representations to refer to the same real-world object (e.g., StreetAddress=44 W. 4th St. versus StreetAddress=44 West Fourth Street).
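As a rough illustration of how lexical heterogeneity is typically attacked, the sketch below normalizes the two address variants from the example to a single canonical form before comparison (the abbreviation table and function name are made up for this post, not taken from the paper):

    # Expand a few known abbreviations so both variants compare equal.
    ABBREVIATIONS = {"w.": "west", "st.": "street", "4th": "fourth"}

    def normalize_address(addr):
        # Lowercase, expand abbreviations token by token, rejoin.
        tokens = addr.lower().split()
        tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
        return " ".join(tokens)

    print(normalize_address("44 W. 4th St."))          # "44 west fourth street"
    print(normalize_address("44 West Fourth Street"))  # "44 west fourth street"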
In this paper, we focus on the problem of lexical heterogeneity and survey various techniques which have been developed for addressing this problem. We focus on the case where the input is a set of structured and properly segmented records, i.e., we focus mainly on cases of database records. Hence, we do not cover solutions for various other problems, such as that of mirror detection, in which the goal is to detect similar or identical web pages. Also, we do not cover solutions for problems such as anaphora resolution, in which the problem is to locate different mentions of the same entity in free text (e.g., recognizing that the phrase "President of the U.S." refers to the same entity as "George W. Bush"). We should note that the algorithms developed for mirror detection or for anaphora resolution are often applicable to the task of duplicate detection. Techniques for mirror detection have been used for the detection of duplicate database records (see, for example, the discussion of set joins), and techniques for anaphora resolution are commonly used as an integral part of deduplication in relations that are extracted from free text using information extraction systems.
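The set-join idea mentioned above can be pictured as comparing records as sets of tokens rather than as whole strings; the following toy example (Jaccard similarity over word sets, purely illustrative and not the paper's formulation) shows why that is more forgiving than exact equality:

    def jaccard(a, b):
        # Token-set overlap: 1.0 means identical word sets, 0.0 means disjoint.
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

    print(jaccard("44 west fourth street", "44 fourth street west"))  # 1.0
    print(jaccard("44 west fourth street", "44 west 4th st."))        # ~0.33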
The problem that we study has been known for more than five decades as the record linkage or the record matching problem in the statistics community. The goal of record matching is to identify records in the same or different databases that refer to the same real-world entity, even if the records are not identical. In slightly ironic fashion, the same problem has multiple names across research communities. In the database community, the problem is described as merge-purge, data deduplication, and instance identification; in the AI community, the same problem is described as database hardening and name matching. The names coreference resolution, identity uncertainty, and duplicate detection are also commonly used to refer to the same task. We will use the term duplicate record detection in this paper.


The remaining part of this paper is organized as follows: In Data preparation, we briefly discuss the necessary steps in the data cleaning process before the duplicate record detection phase. Then, Field matching describes techniques used to match individual fields, and Record matching presents techniques for matching records that contain multiple fields. Efficiency describes methods for improving the efficiency of the duplicate record detection process, and Tools presents a few commercial, off-the-shelf tools used in industry for duplicate record detection and for evaluating the initial quality of the data and of the matched records. Finally, Conclusions concludes the paper and discusses interesting directions for future research.



For more, see: Duplicate Record Detection - Wikipedia, the free encyclopedia.




Many organizations collect large amounts of data to support their business and decision-making processes. The data collected from various sources may have data quality problems, and these issues become prominent when various databases are integrated: the integrated databases inherit the data quality problems that were present in the source databases. The data in the integrated systems need to be cleaned for proper decision making, and cleansing of data is one of the most crucial steps. In this research, the focus is on one of the major issues of data cleansing, i.e., "duplicate record detection", which arises when the data is collected from various sources. As a result of this research study, a comparison among the standard duplicate elimination algorithm (SDE), the sorted neighborhood algorithm (SNA), the duplicate elimination sorted neighborhood algorithm (DE-SNA), and the adaptive duplicate detection algorithm (ADD) is provided. A prototype is also developed which shows that the adaptive duplicate detection algorithm is the optimal solution for the problem of duplicate record detection. For approximate matching of data records, string matching algorithms (a recursive algorithm with word base and a recursive algorithm with character base) have been implemented, and it is concluded that the results are much better with the recursive algorithm with word base.
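For readers who want a feel for how the sorted neighborhood family of algorithms works, here is a simplified sketch (the blocking key, window size, threshold, and use of a generic word-based similarity are my own illustrative choices, not the implementation from this study):

    from difflib import SequenceMatcher

    def blocking_key(record):
        # Hypothetical sorting key: first three letters of the last name plus zip.
        return record["last_name"][:3].upper() + record["zip"]

    def word_similarity(a, b):
        # Word-base comparison: similarity ratio over the two token sequences.
        return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

    def sorted_neighborhood(records, window=3, threshold=0.6):
        # Sort once by the key, then compare each record only with its
        # neighbors inside a sliding window instead of with every other record.
        records = sorted(records, key=blocking_key)
        pairs = []
        for i, rec in enumerate(records):
            for other in records[i + 1:i + window]:
                a = f"{rec['first_name']} {rec['last_name']} {rec['zip']}"
                b = f"{other['first_name']} {other['last_name']} {other['zip']}"
                if word_similarity(a, b) >= threshold:
                    pairs.append((rec, other))
        return pairs

    rows = [
        {"first_name": "John", "last_name": "Smith", "zip": "10001"},
        {"first_name": "Jon", "last_name": "Smith", "zip": "10001"},
        {"first_name": "Alice", "last_name": "Brown", "zip": "94110"},
    ]
    print(sorted_neighborhood(rows))  # flags only the John/Jon Smith pair

The sliding window is what makes the approach scale: the number of comparisons grows with the window size rather than with the square of the number of records.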


Click here to download the ppt files.






