Home > ETL Best Practices > QA your own ETL process – no more, no less

QA your own ETL process – no more, no less

The most common task in an ETL process is to determine what to update, what to delete and what to insert using criteria that are specific to what you are doing.

One of the routine tasks in the asset management master database ETL project is to use a composite key of host name + serial number to determine if an asset needs to be updated from a network auto-discovery tool, or needs to be created as a new asset. No assets will be deleted though. They can be de-commissioned in variety of ways if they meet certain criteria.

One QA technique I use routinely is based on an ETL principal, that is “ETL process should not create new data”, or what I called “no more, no less”.

Suppose I have the following 5 milestone points and staging.

  1. Extracting raw data
  2. Cleansing and standardizing and integrating
  3. Conforming data to the master database
  4. Delivering to the target
  5. Process reporting

If I start from the stage 1 and get these row counts from the delivering stage:

  1. Insert: 2,000
  2. Update: 300,000

Now if I re-start from stage 2 (skip stage 1), my process should not create any new data, since my source data remains the same. I’d expect my row counts look like this:

  1. Insert: 0
  2. Update: 302,000 (300,000 + 2,000) 

This simple QA technique helped me tremendously in

1) debugging my own process

2) also discovering new patterns in the data.

Categories: ETL Best Practices
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: