At this point, the business could have gone in and cleaned up all the source data.
Instead, they chose to do the cleanup in their ETL process. The source system could
continue to hold misspellings of Ann Arbor, but the data in the warehouse would
be consistent. This was important because they wanted to analyze sales by city and
when someone chose Ann Arbor, they wanted all sales for that area to come up
without having to know to add in many misspellings of the city name.
The ETL process was written to handle all known misspellings of Ann Arbor.
Periodically, a new one would appear and the ETL would have to be updated.
Note that today, this would have been much easier thanks to the Fuzzy Lookup
transformation in SSIS, but at the time the customer had to do everything manually.
Given that there are such data quality issues inside any organization, it??™s important
for the business to come up with the rules for handling such issues. It??™s appealing
to be able to say that everything will be fixed in the warehouse, but sometimes the
effort to do so outweighs the benefits. Imagine a system in which bad records are
flagged as unknown. If the estimate to truly fix the data is 2000 man hours, and
less than one-tenth of one percent of the data is bad, is it worth the cost to fix? On
the other hand, if 40 percent of the data is bad, fixing the data probably has a much
higher priority.
Articles, whitepapers, and books can be written about data quality and the issues
around it.
Pages:
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89