Validations at end of ETL indicate missing data
At the end of each ETL job, I always run some sort of validation process to give me an indication of whether the data loading and processing went as expected, or whether things have gone terribly wrong. The latter happened last Thursday when I received the validation result in my inbox.
Last Thursday also coincided with a “disk utilization” issue on the upstream server. So I waited until last Friday, when the “disk utilization” issue was fixed on the upstream server, to have the data reloaded. To my surprise, the data volume loaded on our side was still way below normal.
Steps to investigate missing data
It’s time to investigate. Blog writers at this point will usually “make a long story short”; I’ll do the same here, but with a few bullet points to highlight the steps I took to investigate.
- Pick one day for one employee: this is the most obvious data point to validate for my data. Our data goes down to the granularity of one row per employee, per interaction resource ID, per connection ID, per 15-minute interval. Picking one day for one employee gives me neither too little nor too much data to check.
- Validate on the upstream source server: the table in question is a fact table and has a unique interaction resource fact ID. The unique list of interaction resource fact IDs is an obvious candidate for my checking data points.
- Compare the interaction resource fact IDs between the upstream source server and our own server: because the data volume is small enough, a simple query revealed that about 20% of the interaction resource fact IDs were not loaded into our own table.
- An ETL process designed with atomic operations in mind helps to pinpoint exactly where the problem might be: our fact tables are usually long and also wide, so it’s not easy to see visually why these 20% of the data were not loaded. So it’s time to go to the step where the data was processed before the loading. Fortunately, my ETL processes are designed with atomic operations in mind, and I knew exactly which step to examine for the culprit.
- A WHERE clause was added to the processing stored procedure: a log entry in the procedure clearly says that a WHERE clause was added last Thursday. A quick run of the query inside the procedure shows that this WHERE clause filters out those 20% of the data.
ecda1.[CUSTOM_DATA_8] <> 'null'
A quick check of the raw data shows that out of 1.3+ million rows, only one row has the value 'null' in ecda1.[CUSTOM_DATA_8]. Then why were 20% of the data also filtered out along with that one ‘bad’ row?
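The two checks above can be sketched roughly as follows. Only ecda1.[CUSTOM_DATA_8] appears in the actual procedure; the linked server, table, column, and parameter names here are hypothetical stand-ins for illustration.

```sql
-- Sketch only: server/table/column names are assumptions.
-- 1) Fact IDs loaded upstream but missing from our table,
--    narrowed to one employee and one day
SELECT s.INTERACTION_RESOURCE_FACT_ID
FROM   [UpstreamServer].[SourceDb].dbo.INTERACTION_RESOURCE_FACT AS s
LEFT JOIN dbo.INTERACTION_RESOURCE_FACT AS t
       ON t.INTERACTION_RESOURCE_FACT_ID = s.INTERACTION_RESOURCE_FACT_ID
WHERE  s.EMPLOYEE_ID = @EmployeeId
  AND  CAST(s.START_TS AS date) = @OneDay
  AND  t.INTERACTION_RESOURCE_FACT_ID IS NULL;

-- 2) How many raw rows actually carry the literal string 'null'?
SELECT COUNT(*) AS total_rows,
       SUM(CASE WHEN ecda1.[CUSTOM_DATA_8] = 'null' THEN 1 ELSE 0 END)
           AS literal_null_rows
FROM   dbo.STG_INTERACTION_RESOURCE AS ecda1;  -- staging table name assumed
```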
Why a seemingly innocent WHERE clause can cause a large amount of missing data
This post is getting longer than I wanted. I’ll stop now. In the next post, I’ll explain:
- Why the above WHERE clause not only filtered out one ‘bad’ row, but also took 20% of the other rows along with it.
- Why, in ETL, adding a WHERE clause during data processing in production is almost never a good idea.
Validation of data loading and processing
One of the most often used validation methods at the end of each ETL run is a cross-reference check on a couple of metrics, which entails finding two independent sources for the same metric.
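As a rough sketch (all object names here are hypothetical), such a check might compare a daily row count computed from the loaded fact table against the same count reported by an independent source, such as a summary table maintained by the upstream system:

```sql
-- Hypothetical cross-reference check: the same metric from two
-- independent sources; any non-zero diff warrants a look.
SELECT COALESCE(f.TheDate, s.TheDate) AS TheDate,
       f.fact_rows,
       s.source_rows,
       f.fact_rows - s.source_rows    AS diff
FROM ( SELECT CAST(START_TS AS date) AS TheDate, COUNT(*) AS fact_rows
       FROM   dbo.INTERACTION_RESOURCE_FACT
       GROUP  BY CAST(START_TS AS date) ) AS f
FULL JOIN
     ( SELECT TheDate, source_rows
       FROM   dbo.UPSTREAM_DAILY_SUMMARY ) AS s
       ON s.TheDate = f.TheDate
ORDER BY TheDate;
```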
Atomic Operation in ETL
Atomic operation, atomic programming, atomic database design, atomic transaction, etc. There are many explanations of these concepts. I am probably not qualified to give a formal definition in ETL design, but it’s a strategy that every ETL designer/developer should learn to practice. As ETL designers/developers, our professional life depends on how well we understand the strategy and how well we apply it to every single task we design.
Slicer, Axes and Calculations Can All Filter Data in MDX
Every developer with SQL background knows how the WHERE clause works. You put some condition in the WHERE clause, such as TheDate = Yesterday (pseudo code), and it will only return data for yesterday.
Not so straightforward in MDX. We should expect more complex behaviors in MDX because of the multi-dimensional nature of the cubes.
But how different can it be?
There are many topics to explore, including why we prefer to call the WHERE clause a slicer in MDX, how a tuple is constructed, when the default member is used, how the slicer (WHERE clause, if you insist) can be used to limit data, how axes can be used, and how the MDX engine decides when to use the slicer, axes, and calculation formulas to limit the data.
Sub-Select Can Filter Data Too
In this blog, I’ll explore just one aspect of how we can use both the slicer and a sub-cube to limit the data, and where they are the same, and where they might give you different results.
Both Slicer and Sub-Select Produce the Same Result
Run these two queries (separated by GO), and you get the same Internet Sales Amount from both, for July 1, 2008.
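The queries themselves are not reproduced in the text; against the Adventure Works sample cube, a pair like the following illustrates the point (the member key format &[20080701] is an assumption and varies across Adventure Works versions):

```mdx
-- Query 1: the member in the slicer
SELECT [Measures].[Internet Sales Amount] ON COLUMNS
FROM   [Adventure Works]
WHERE  ( [Date].[Date].&[20080701] )
GO
-- Query 2: the same member in a sub-select
SELECT [Measures].[Internet Sales Amount] ON COLUMNS
FROM   ( SELECT [Date].[Date].&[20080701] ON COLUMNS
         FROM   [Adventure Works] )
```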
So our conclusion so far is that the slicer and a sub-select should give the same results.
If you make such a statement to some MDX experts, they will tell you that results from using a member in the slicer can be different from using the same member in a sub-select.
So what can be different?
While Sub-Select Does Not Change the Query Context, the Slicer Does
Run this query pair. We are using the CurrentMember function to show which date we are currently at. In MDX terms, we are checking the current member of the Date hierarchy in the query context (quite a mouthful).
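A pair along these lines (again assuming the Adventure Works cube and its date key format) demonstrates it; the calculated measure simply prints the current member’s name:

```mdx
WITH MEMBER [Measures].[Current Date] AS
     [Date].[Date].CurrentMember.Name
SELECT { [Measures].[Current Date],
         [Measures].[Internet Sales Amount] } ON COLUMNS
FROM   [Adventure Works]
WHERE  ( [Date].[Date].&[20080701] )
GO
WITH MEMBER [Measures].[Current Date] AS
     [Date].[Date].CurrentMember.Name
SELECT { [Measures].[Current Date],
         [Measures].[Internet Sales Amount] } ON COLUMNS
FROM   ( SELECT [Date].[Date].&[20080701] ON COLUMNS
         FROM   [Adventure Works] )
```

The first query reports July 1, 2008 as the current member; the second reports the All member.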
It turned out that the “current date” from the two queries is not the same. The first one with slicer says we are currently at just one day, July 1, 2008. The second one with the sub-select says we are actually currently at All Periods (all days in the entire Date dimension; the root member of the Date dimension).
Now we know that the query context (where we currently are) is different depending on whether we put our member in the slicer or in the sub-select. The sub-select does nothing to change the query context, while the slicer changes the context according to the tuple we put in it (in the above example the tuple has only one member).
You might already figure out that we need to be careful now about the context.
When Sub-Select and Slicer Might Give Different Results
Here is an example where your filtered result might not be exactly what you want if you are using the sub-select.
When using CurrentMember in a filter, you should expect the sub-select to give you the Internet Sales Amount for the entire date range in your Date dimension, not just for the one day, July 1, 2008.
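For example (Adventure Works assumed again), a calculation that references the current member of the Date hierarchy will see the All member under a sub-select, so it returns the total across every date rather than the July 1, 2008 amount:

```mdx
WITH MEMBER [Measures].[Sales For Current Date] AS
     -- Under a sub-select, CurrentMember is the All member,
     -- so this tuple resolves to the grand total.
     ( [Date].[Date].CurrentMember,
       [Measures].[Internet Sales Amount] )
SELECT [Measures].[Sales For Current Date] ON COLUMNS
FROM   ( SELECT [Date].[Date].&[20080701] ON COLUMNS
         FROM   [Adventure Works] )
```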
Confusing? Yes, it is. But no complaining!
For my other MDX blogs, visit https://bisherryli.wordpress.com/category/mdx-challenges/