The Merge-Purge Problem and the Data Cleaner



Merge/Purge is very interesting: every large repository of data that I have ever met always has numerous duplicate information about entities that are difficult to cull together without an intelligent "equational theory" that matches complex records. Our patented implementation of the Data Cleaner provides a rule programming module that is easy to program and quite good at finding duplicates especially in an environment with massive amounts of data. This is an especially important operation for direct marketing and fraud detection applications and all sorts of data mining applications.

Indeed, many organizations have learned that real-world data is very very dirty! (Data cleansing is a big business...If it isn't, it ought to be.) We studied the MERGE/PURGE (DE-DUPE, CONSOLIDATION&LINK ANALYSIS) problem for Citicorp's Technology Office and have now tested our solution against data from the State of Washington's Child Welfare Department (OCAR), besides our own statistically controlled generated data. The results are excellent: The Data Cleaner produces better, more accurate results in less time on cheap parallel hardware than a bohemeth mainframe (yep, a no-brainer win). The preliminary analysis by the Child Welfare Department indicates that our accuracy is considerably better than what they were able to achieve using other methods based purely on single scan clustering. Definitive quantitative measurements are difficult here simply because in real-world problems one does not know a priori what the best results are. Thus, our use of statistically generated data provides the best objective measure of performance. Nonetheless, the OCAR researchers issued a statement qualifying our results as excellent:

The OCAR staff have reported that their initial tests of the accuracy of record matching are surpassing their expectations. An initial test of our system using a sample of OCAR's data exhibited an accuracy rate of about 97%, a substantial improvement over OCAR's prior results, which achieved no more than 90% accuracy rate. Accuracy is measured simply as the percentage of correctly merged records out of the total number possible. The accuracy rate was established by the OCAR staff after a laborious "eye balling" of the resultant "cleaned" Database.

OCAR is currently in the process of deploying our Data Cleaner in their operational systems. OCAR's staff report that they are especially pleased with the rule-based equational-theory since it can be easily modified and recompiled as new characteristics of the data are discovered and as the characteristics of the data change in the future.

See the recent letter we received from OCAR detailing their analysis of our results by clicking here.

The 4.5 million record OCAR database was analyzed by our system on a Pentium PC in 5.6 hours! Our implementation has been ported to Unix-based workstation clusters, Linux and DOS (so also under Win 3.1, Win NT, and Win95, presumably).

(We've anounced these results on KDD-nuggets as promised.) You can download our current paper on the topic by clicking here.

The key to the efficacy of the system we built is to compute the closure over multiple scans of the database, where each scan sorts the database using different keys. This novel and simple technique dramatically reduces the error (missed matches) in each single scan. The transitive closure ensures that we capture duplicates that would otherwise be missed by a single scan process. (A newer version of our Merge/Purge paper is now available. Click here to see the detailed results (including a histogram that is quite illuminating) on the real-world database.)


Our Merge/Purge System will soon be available for wide distribution.
STAY TUNED.


BACK