ETL – When should we correct data?
In the article Six Key Decisions for ETL Architectures, the author stated that “data quality cannot be improved by ETL alone. In this environment, the ETL team needs to be agile and proactive, partnering with the business and the IT teams that support the source systems”.
The key decision is where to correct the data. The best solution is clearly to have the data captured accurately in the first place. In most cases the data should be corrected back in the source system. Unfortunately, it is inevitable that poor quality data will reach the ETL system, and when this happens, there are three choices:
1) Halt the entire load process
2) Send the offending record(s) to a suspense file for later processing
3) Tag the data and pass it through.
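The three choices can be sketched as a simple record handler. This is an illustrative sketch only; the function and variable names (`handle_record`, `validate`, `suspense_file`) are assumptions, not part of any specific ETL tool or the original article.

```python
# Hypothetical sketch of the three choices for handling a bad record
# during an ETL load. A "validate" callable returns None for a clean
# record, or a string describing the error condition.

HALT, SUSPEND, TAG = "halt", "suspend", "tag"

suspense_file = []  # stand-in for a real suspense table or file

def handle_record(record, validate, policy=TAG):
    """Apply one of the three error-handling policies to a record."""
    error = validate(record)          # None means the record is clean
    if error is None:
        return record                 # clean record passes through unchanged
    if policy == HALT:
        # Choice 1: stop the entire load on the first bad record
        raise RuntimeError(f"Load halted: {error}")
    if policy == SUSPEND:
        # Choice 2: park the record for later processing
        suspense_file.append(record)
        return None                   # drop it from this load
    # Choice 3: tag the record with the error condition and pass it through
    record["error_condition"] = error
    return record
```

Note how only the third policy keeps the load running while still preserving a trace of the problem on the record itself.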
The author recommended the third option: tagging the data with the error condition and letting it pass through.
The author recommended using an audit dimension to tag bad fact and dimension rows. Missing or garbage data can also be tagged with unique error values in the field itself.
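Both tagging styles can be illustrated together. In this sketch the audit keys, column names, and sentinel string are all assumptions made for the example, not values prescribed by the article.

```python
# Illustrative sketch: tag a fact row via an audit-dimension foreign key,
# and substitute a unique error value into a garbage field itself.

AUDIT_OK = 1             # assumed audit key: row loaded clean
AUDIT_OUT_OF_RANGE = 2   # assumed audit key: a measure was out of range

UNKNOWN_CUSTOMER = "UNKNOWN (source sent null)"  # sentinel for missing data

def tag_fact_row(row):
    """Attach an audit foreign key and replace garbage fields in place."""
    row["audit_key"] = AUDIT_OK
    if row.get("customer") in (None, ""):
        # Missing data: tag the field itself with a unique error value
        row["customer"] = UNKNOWN_CUSTOMER
    if row.get("amount") is not None and row["amount"] < 0:
        # Bad measure: tag the row through the audit dimension
        row["audit_key"] = AUDIT_OUT_OF_RANGE
    return row
```

The audit key lets downstream queries join to an audit dimension describing the error, while the sentinel value makes the problem visible directly in reports that group by the field.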
If we let bad fact or dimension data pass through with audit or error tagging, then we need data quality reporting capabilities that can flag the offending fact and dimension rows, indicating a need for resolution and, ideally, repair of the data in the source system.
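A minimal quality report over tagged rows might just count offending rows by error condition so they can be triaged and routed back to the source system. The `audit_key` column name and OK value are assumptions for this sketch.

```python
from collections import Counter

def quality_report(rows, ok_key=1):
    """Count offending rows by audit condition, assuming each row
    carries an "audit_key" column set during the load and that
    ok_key marks a clean row."""
    return Counter(r["audit_key"] for r in rows if r["audit_key"] != ok_key)
```

In practice this would run as a query against the fact table joined to the audit dimension, but the idea is the same: the tags applied at load time become the input to the report that drives source-system fixes.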