Home > ETL Best Practices > Data Validation – Gap Analysis

Data Validation – Gap Analysis

I guess it’s common for users to ask to validate their reports from legacy sources against the master database (or data warehouse) only after the system is in production.

We get this kind of requests a lot lately. Here is an example.

From an email:

Sherry, it seems like we may not be pulling all the software from xyz and loading into EAM.  AAA ran a comparison between EAM and using a xyz reporting tool and got a significant difference.  Can you take a look at this.  Maybe we have some reason for filtering out some of the data from xyz, which is why it is not loading. 

Here is my reply:


Thanks for your patience. When we load software installation data into EAM, we did put in a few filters.

1) Any software installation that has a last modify data older than 61 days are not loaded.

2) Any software installation that has name such as .msi or .tmp, or installed in the installer folder are not loaded

3) For any software that have multiple version of the same product, only the latest version is loaded.

The first filter plays a significant role here for what you see in EAM.

Here is one example. For host x1234567, xyz does show that it has software BBB installed. However, the last modify date on xyz shows 2010-08-26 20:14:24.120. The first filter we put in would prevent this record from being loaded into EAM, which explains why you don’t see the host on the EAM report.

We might need to modify the first filter to allow more software loaded into EAM. But I’ll leave that decision to my supervisor.  

After we removed the filter, we still have discrepancies. Here is where a data gap analysis comes in.

I sent out my analysis results with this email:

I’ve attached a file here with two tabs for you to review.

The first tab shows the 823 unique hits I get from EAM.

The second tab shows the analysis results for the 189 assets EAM is not reporting for software BBB. Three assets have a question mark in the Analysis column. It means that the reason is unknown to me at this point.

Just FYI – I did this in our development environment, which I kept in sync with the production EAM database.

While I am doing the gap analysis, I realize that I need to mention the PC software vendor standardization, and the device type standardization we put in the ETL process. These two standardization tables serve as data filters. If a software vendor is not on the vendor standardization table, the software products will not be loaded. If the device type is not on the device type standardization table, the hardware will not be loaded.

Here is where these two standardization tables reside. If we decide to load device type CCC from xyz, then  CCC device type needs to be added to the standardization table.



I am also coping the analysis result here:


The gap analysis is a manual process. Here is part of the script:


I ran this script repeatedly until no Analysis field is left blank. I have to know what to look for based on my knowledge about the source data, and the business rules we put into the ETL process.

Categories: ETL Best Practices
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: