
ETL – When should we correct data?

In the article Six Key Decisions for ETL Architectures, the author stated that "data quality cannot be improved by ETL alone. In this environment, the ETL team needs to be agile and proactive, partnering with the business and the IT teams that support the source systems."

The key decision is where to correct the data. The best solution is clearly to have the data captured accurately in the first place, and in most cases bad data should be corrected back in the source system. Unfortunately, it is inevitable that some poor-quality data will reach the ETL system, and when this happens there are three choices (sketched in code after the list):

1) Halt the entire load process
2) Send the offending record(s) to a suspense file for later processing
3) Tag the data and pass it through.
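As a rough illustration of those three choices, and not anything prescribed by the article, a Python load routine might dispatch a row that fails a quality screen like this (the function, strategy names, and suspense file are all hypothetical):

```python
import csv
import sys

SUSPENSE_FILE = "suspense_rows.csv"  # hypothetical location for option 2

def handle_bad_row(row, error, strategy):
    """Apply one of the three strategies to a row that failed a quality screen."""
    if strategy == "halt":
        # Option 1: stop the entire load process immediately.
        sys.exit(f"Load halted: {error} in row {row}")
    if strategy == "suspend":
        # Option 2: park the record in a suspense file for later processing.
        with open(SUSPENSE_FILE, "a", newline="") as f:
            csv.writer(f).writerow(list(row.values()) + [error])
        return None  # the row is withheld from the warehouse for now
    if strategy == "tag":
        # Option 3: tag the record with its error condition and pass it through.
        return {**row, "error_condition": error}
    raise ValueError(f"unknown strategy: {strategy}")
```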

The author recommended the third option: tagging the data with the error condition and letting it pass through.

To do this, the author recommended using an audit dimension to tag the bad fact and dimension rows; missing or garbage data can also be tagged with unique error values in the field itself.
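A rough sketch of that idea, assuming pandas and made-up column names and audit keys (none of which come from the article), might look like this:

```python
import pandas as pd

# Hypothetical incoming fact rows; the column names and audit keys below are
# illustrative only.
facts = pd.DataFrame({
    "customer_name": ["Acme", None, "???"],
    "amount": [120.0, 75.5, -1.0],
})

AUDIT_OK = 1           # row passed all quality screens
AUDIT_MISSING_DIM = 2  # a dimension attribute was missing in the source
AUDIT_GARBAGE = 3      # a value failed a basic sanity screen

# Tag every row with an audit key instead of rejecting it.
facts["audit_key"] = AUDIT_OK
facts.loc[facts["customer_name"].isna(), "audit_key"] = AUDIT_MISSING_DIM
facts.loc[facts["customer_name"] == "???", "audit_key"] = AUDIT_GARBAGE

# Missing values get a unique, clearly recognizable error value in the field
# itself so downstream reports are not silently skewed.
facts["customer_name"] = facts["customer_name"].fillna("UNKNOWN (missing in source)")
```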

If we let bad fact or dimension data pass through with audit or error tagging, then we need data quality reporting capabilities that can flag the offending fact and dimension rows, indicating a need for resolution and, ideally, repair of the data in the source system.
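For example, a minimal data quality report over such tagged rows, again assuming the hypothetical audit_key column from the sketch above, could simply count the offending rows per error condition:

```python
import pandas as pd

# Hypothetical tagged fact rows, as produced by a tagging step like the one above.
facts = pd.DataFrame({
    "customer_name": ["Acme", "UNKNOWN (missing in source)", "???"],
    "audit_key": [1, 2, 3],  # 1 = clean, 2 = missing dimension, 3 = garbage value
})

# Count offending rows per error condition so they can be reported and,
# ideally, repaired back in the source system.
dq_report = (
    facts[facts["audit_key"] != 1]
    .groupby("audit_key")
    .size()
    .rename("offending_rows")
    .reset_index()
)
print(dq_report)
```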
