Archive for the ‘ETL Best Practices’ Category

ETL #73–NULL or NOT NULL and SQL Interview Questions

November 25, 2016

Today is the day after Thanksgiving. There are many things to be thankful for, so I decided to write a short post today.

The first thing to be thankful for is that Tomislav and I have completed the third edition of our MDX book, MDX with Microsoft SQL Server 2016 Analysis Services Cookbook. The book is published by Packt Publishing and has been uploaded to all the major publishing channels.

Recommended SQL interview questions on deskbright.com

The second thing to be thankful for is the enthusiastic audience who have been faithfully reading my posts. I recently received two inquiries, which are somewhat related. One reader was a bit confused by what I said about SQL NULL values and about being careful with what we put in the WHERE clause. Another reader is from Deskbright.com, a new online learning platform. Deskbright.com offers a number of free online resources to help people learn business skills, such as SQL. They were wondering if I'd be willing to post a link to their page on SQL interview questions (https://www.deskbright.com/sql/sql-interview-questions/) on my site.

As I browsed through the SQL interview questions on deskbright.com, I saw question #3, "Why is this query not returning the expected results?", and thought it was a perfect fit for the question from the reader I mentioned above. Instead of overwhelming readers, deskbright.com lists only the seven most common SQL interview questions.

I'd rather not repeat what they have, so I'll simply recommend their SQL interview questions to those of you who are still new to SQL or Business Intelligence.

MDX Cookbook third edition

The full title of the book is MDX with Microsoft SQL Server 2016 Analysis Services Cookbook. By the time you see this post, the book should be on Amazon, PacktPub.com, and all the other online technology bookstores and e-subscription sites.

See also

ETL #72–Your data can mysteriously disappear after a WHERE clause was added (2)

ETL #71–Your data can mysteriously disappear after a WHERE clause was added (1)

ETL #72–Your data can mysteriously disappear after a WHERE clause was added (2)

Why a seemingly innocent WHERE clause can cause a large amount of data to go missing

In the previous post, ETL #71–Your data can mysteriously disappear after a WHERE clause was added (1), I talked about how I discovered that a WHERE clause had been added to the main procedure that processes the raw data prior to loading it into the fact table.

ecda1.[CUSTOM_DATA_8] <> 'null'

A quick check of the raw data shows that out of 1.3+ million rows, there is only one row with the string value 'null' in ecda1.[CUSTOM_DATA_8], and that about 20% of the rows have a SQL NULL value in the field CUSTOM_DATA_8. It's this 20% of the data that was filtered out along with the one 'bad' row.

The missing 20% of the data has SQL NULL in the field

The field [CUSTOM_DATA_8] on table alias ecda1 has one row with a string value of 'null', but about 20% of the rows have a value of SQL NULL.
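
A hedged sketch of the kind of profiling query behind that quick check (the table name below is hypothetical; ecda1 aliases whatever raw table the procedure actually reads):

-- Count the string 'null' rows and the SQL NULL rows separately
SELECT COUNT(CASE WHEN ecda1.[CUSTOM_DATA_8] = 'null' THEN 1 END) AS StringNullRows,
       COUNT(CASE WHEN ecda1.[CUSTOM_DATA_8] IS NULL  THEN 1 END) AS SqlNullRows,
       COUNT(*)                                                   AS TotalRows
FROM   dbo.RawInteractionData AS ecda1;  -- hypothetical table name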

We all know a thing or two about the SQL NULL:

  1. SQL NULL is the term used to represent a missing value.
  2. A NULL value in a table is a value in a field that appears to be blank.
  3. It is very important to understand that a NULL value is different from a zero value or a field that contains spaces.

IS NOT NULL is implied whenever we add a filter

But the most important thing about SQL NULL is this: any comparison with NULL, whether an equality or an inequality, evaluates to UNKNOWN rather than TRUE, so IS NOT NULL is effectively implied in the WHERE clause whenever we add such a filter.

So if we add a filter such as:

ecda1.[CUSTOM_DATA_8] <> 'aStringValue'

Then, we really mean this:

ecda1.[CUSTOM_DATA_8] <> 'aStringValue'

AND

ecda1.[CUSTOM_DATA_8] IS NOT NULL

If we add a filter such as:

ecda1.[CUSTOM_DATA_8] = 'aStringValue'

Then, we really mean this:

ecda1.[CUSTOM_DATA_8] = 'aStringValue'

AND

ecda1.[CUSTOM_DATA_8] IS NOT NULL

Now you can understand why those 20% of rows with SQL NULL values are missing.
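
If you want to see this for yourself, here is a minimal sketch with a throwaway temp table; the three rows stand in for a normal value, the one 'bad' row, and the missing 20%:

-- A tiny repro: <> 'null' silently drops the SQL NULL row as well
CREATE TABLE #ecda1 ([CUSTOM_DATA_8] VARCHAR(50) NULL);

INSERT INTO #ecda1 ([CUSTOM_DATA_8])
VALUES ('12345'),   -- a normal value
       ('null'),    -- the one 'bad' row holding the string 'null'
       (NULL);      -- a SQL NULL, standing in for the missing 20%

-- Returns only the '12345' row; the NULL row is gone along with the 'bad' row
SELECT [CUSTOM_DATA_8] FROM #ecda1 WHERE [CUSTOM_DATA_8] <> 'null';

-- To keep the NULL rows, the intent has to be stated explicitly
SELECT [CUSTOM_DATA_8] FROM #ecda1
WHERE  [CUSTOM_DATA_8] <> 'null' OR [CUSTOM_DATA_8] IS NULL;

DROP TABLE #ecda1;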

Adding a WHERE clause during data processing in a production ETL environment is almost never a good idea

You will probably also understand now why it is not a good idea to add WHERE clauses to the data processing stage in ETL.

In production, WHERE clauses should not be added unless it’s a documented business requirement.

In the data discovery stage, WHERE clauses are used only for discovery purposes and should not be carried over into production.

Handling of “bad” or missing dimensional data

If these "bad" values are part of the data you will use to slice and dice your measures, there are many ways to handle them. A full treatment of "bad" or missing dimensional data is beyond the scope of this short blog.
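
Just as a pointer, one common approach (a sketch only, with a hypothetical table name) is to map the NULLs, and the string 'null', to an explicit 'Unknown' member so they are never silently dropped:

-- Map NULL and the string 'null' to an 'Unknown' member instead of filtering the rows out
SELECT CASE
           WHEN ecda1.[CUSTOM_DATA_8] IS NULL OR ecda1.[CUSTOM_DATA_8] = 'null'
               THEN 'Unknown'
           ELSE ecda1.[CUSTOM_DATA_8]
       END AS CustomData8
FROM   dbo.RawInteractionData AS ecda1;  -- hypothetical table name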

ETL #71–Your data can mysteriously disappear after a WHERE clause was added (1)

April 24, 2015

Validations at the end of ETL indicate missing data

At the end of each ETL job, I always run some sort of validation process to give me an indication of whether the data loading and processing went as expected or things have gone terribly wrong. The latter happened last Thursday when I received the validation result in my inbox.

Last Thursday also coincided with a "disk utilization" issue on the upstream server. So I waited until last Friday, when the "disk utilization" issue was fixed on the upstream server, to have the data reloaded. To my surprise, the data volume loaded on our side was still way below normal.

Steps to investigate missing data

It's time to investigate. Blog writers at this point usually "make a long story short"; I'll do the same here, with a few bullet points to highlight the steps I took to investigate.

  1. Pick one day for one employee: this is the most obvious data point to validate in my data. Our data contains detail down to the granularity of one row per employee, per Interaction Resource ID, per connection ID, per 15-minute interval. Picking one day for one employee gives me neither too little nor too much data to check.
  2. Validate on the upstream source server: the table in question is a fact table with a unique interaction resource fact ID. The unique list of interaction resource fact IDs is an obvious candidate for the data points to check.
  3. Compare the interaction resource fact IDs between the upstream source server and our own server: because the data volume is small enough, a simple query revealed that about 20% of the interaction resource fact IDs were not loaded into our own table (see the comparison sketch below).
  4. An ETL process designed with atomic operations in mind helps to pinpoint exactly where the problem might be: our fact tables are usually long and also wide, so it's not easy to see visually why this 20% of the data was not loaded. So it was time to go to the step where the data is processed before the loading. Fortunately, my ETL processes are designed with atomic operations in mind, and I knew exactly in which step to look for the culprit.
  5. A WHERE clause was added to the processing stored procedure: a log entry in the procedure clearly says that a WHERE clause was added last Thursday. A quick run of the query inside the procedure showed that this WHERE clause filters out that 20% of the data.

ecda1.[CUSTOM_DATA_8] <> 'null'

A quick check of the raw data shows that out of 1.3+ million rows, there is only one row with the value 'null' in ecda1.[CUSTOM_DATA_8]. Then why was 20% of the data also filtered out along with the one 'bad' row?
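
As an aside, the ID comparison in step 3 can be done with a simple EXCEPT query; the server, database, table, and filter values below are hypothetical:

-- Interaction resource fact IDs that exist upstream but were never loaded into our table
SELECT InteractionResourceFactId
FROM   UpstreamServer.SourceDb.dbo.InteractionResourceFact
WHERE  EmployeeId = 12345
  AND  ReportDate = '20150416'
EXCEPT
SELECT InteractionResourceFactId
FROM   dbo.FactInteractionResource
WHERE  EmployeeId = 12345
  AND  ReportDate = '20150416';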

Why a seemingly innocent WHERE clause can cause a large amount of data to go missing

This post is getting longer than I wanted. I’ll stop now. In the next post, I’ll explain:

  • Why the above WHERE clause not only filtered out the one 'bad' row, but also took 20% of the other rows along with it.
  • Why, in ETL, adding a WHERE clause during data processing in production is almost never a good idea.

Validation of data loading and processing

One of the most often used validation methods at the end of each ETL run is a cross-reference check on a couple of metrics, which entails finding two independent sources for the same metric.
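
A minimal sketch of such a cross-reference check, assuming a linked server to the upstream source and a simple daily row count as the metric (all object names are hypothetical):

-- Flag any day where the upstream row count and our row count disagree
SELECT s.ReportDate,
       s.RowCnt AS SourceRowCnt,
       t.RowCnt AS TargetRowCnt
FROM  (SELECT CAST(IntervalStart AS DATE) AS ReportDate, COUNT(*) AS RowCnt
       FROM   UpstreamServer.SourceDb.dbo.InteractionResourceFact
       GROUP  BY CAST(IntervalStart AS DATE)) AS s
LEFT JOIN
      (SELECT CAST(IntervalStart AS DATE) AS ReportDate, COUNT(*) AS RowCnt
       FROM   dbo.FactInteractionResource
       GROUP  BY CAST(IntervalStart AS DATE)) AS t
  ON  t.ReportDate = s.ReportDate
WHERE ISNULL(t.RowCnt, 0) <> s.RowCnt;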

Atomic Operation in ETL

Atomic operation, atomic programming, atomic database design, atomic transaction, and so on. There are many explanations of these concepts. I am probably not qualified to give a formal definition in the context of ETL design, but it is a strategy that every ETL designer/developer should learn to practice. As ETL designers/developers, our professional lives depend on how well we understand the strategy and how well we apply it to every single task we design.
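
As an illustration only (certainly not a formal definition), here is the kind of atomic step I have in mind, sketched with hypothetical table and column names: one day's worth of data either lands completely or not at all, and a failure points straight at the step that raised it. (THROW assumes SQL Server 2012 or later; RAISERROR would do on older versions.)

-- Hypothetical atomic reload of one day's data: all or nothing
DECLARE @LoadDate DATE = '20150416';

BEGIN TRY
    BEGIN TRANSACTION;

    DELETE FROM dbo.FactInteractionResource
    WHERE  ReportDate = @LoadDate;

    INSERT INTO dbo.FactInteractionResource (ReportDate, InteractionResourceFactId, EmployeeId, IntervalStart)
    SELECT ReportDate, InteractionResourceFactId, EmployeeId, IntervalStart
    FROM   dbo.StageInteractionResource
    WHERE  ReportDate = @LoadDate;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    THROW;  -- fail loudly so the job log shows exactly which step broke
END CATCH;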

5th Blogging Year

December 31, 2014

It’s hard to believe that it’s been 5 years since my first ever post, in which I simply spelled Business Intelligence. Thanks to everyone for being with me since then!

I blogged about my daily experience as a Data Warehouse and Business Intelligence developer. Most of my posts are about how to use the tools, i.e. the Microsoft SQL Server Business Intelligence stack, with more productivity and efficiency, and, of course, about the two major data query languages, SQL and MDX. I enjoy working as a DWH and BI developer for the creativity I find in all I do. I mostly praise the tools and the query languages we use, but occasionally my posts complain about the inefficiency and even bugs in the tools.

In 2013, I co-authored the book MDX with SSAS 2012 Cookbook. As much as I'd love to write again and blog more, in 2014 I felt that I needed more time for myself to relax and recharge, and to spend more time with my family.

Looking forward to next year, I would like to think that being a DWH and BI developer will continue to give me professional satisfaction; I do not need to find a new career. I also hope that 2015 will be a year that gives me inspiration for a new book for DWH and BI developers.

My posts cover a lot of ground in SSIS, SSRS, SSAS, SQL jobs, SSMS, Visual Studio's SSDT, and SQL and MDX. But it's interesting that the top 5 posts that received the most views are all about SSIS, according to WordPress's annual report. More interestingly, these top SSIS posts are all about how to use variables in SSIS. I will consider writing about those topics again.

Once again, thanks for following and commenting on my posts in 2014. I look forward to seeing you all online again in 2015! Happy New Year!

 

Sherry Li

On the lighter side…

November 27, 2011

There have been a couple of good comments from readers recently, such as this one:

….You can not say that truncating a target table is a bad practice in general…

I totally agree with the above statement. Actually, truncating a target table is used in many of my ETL processes, and many of the processes I designed do incremental loading only. So the reader is correct that how we design our ETL process really depends on the characteristics of the data we are loading. It becomes a bad practice when we use the wrong design for the data at hand. In the particular example I gave in ETL #70 – The worst possible way to truncate your target table, truncating the target table was indeed a bad practice.

OK, enough of such serious topics. Let's move on to something lighter. I recently moved to a different group within the same company. I had been in the same group for quite a while, and feel pretty sad about leaving my co-workers and friends. My friend Hamid Y composed a poem for me. I am "publishing" the poem here with Hamid's consent.

I have been a SQL/BI developer for so long, but I still cannot explain to my family exactly what I do. I recently told an engineer's wife that I work for a bank. She immediately started consulting me on her most frustrating banking questions, such as whether she can join a credit union to avoid fees.

I am not sure whether an x-ray machine or a sewing machine will help me answer her question, but at least it's on the lighter side of our daily SQL/BI work. Here it goes:

Dear Sherry,

I wrote this little poem for you…hope you like it..

You’re one of the smartest people I’ve seen
Pumping out SQL code and arithmetic mean

You found program bugs like a x-ray machine
And you saved this company more money than a teller machine

You kept this department together like a sewing machine
And making it run like a pinball machine

So now that you’re leaving our team
I’ll try not to make a scene

Just know that you’re loved and well-respected
By everybody in this department you affected

I am already missing the people I've been working with day in and day out. Thanks, friends.


ETL #70 – The worst possible way to truncate your target table

April 22, 2011

In my previous post, Why truncating a target table is a bad practice in ETL?, I strongly suggested that we not truncate our target tables.

In this post, I'll tell you about the worst possible way I've seen to truncate a target table.

Before I get into that, I'll share the experience a business user shared with me:

….Sherry, can you check the report again? It has failed with the new data load today again.

….What caused it to do that? Has been happening sporadically lately where the whole report just goes down.

….I thought the issue has been resolved since last change.

Before the "last change", the report had no staging data. It pulled a large amount of data directly from our data source via OPENQUERY (you can read my other posts to learn my opinion of OPENQUERY). If you were lucky, the report would send back data within 5 minutes.

So the last change was to pull the data every morning and store it in a table.

Here is how this table (in my opinion, this is your target table, not a staging table) is populated every morning:

[Image: the job step that truncates the table and then repopulates it via OPENQUERY]

The table is truncated first, followed by an attempt to populate it. Unfortunately, on almost 5 out of 10 days the OPENQUERY fails, leaving the target table empty. Even on the 5 lucky days, the target table remains empty for about 10 minutes, which is how long it takes for the table to be fully populated.

That explains the business users’ experience with the report.

Do you still need more convincing not to truncate your target tables?
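
For the record, here is a hedged sketch of a safer pattern (object names are hypothetical, and THROW assumes SQL Server 2012 or later): pull the OPENQUERY result into a staging table first, and only touch the reporting table when that pull succeeds.

-- Hypothetical safer pattern: a failed pull never leaves the reporting table empty
BEGIN TRY
    -- Step 1: pull into a staging table; if OPENQUERY fails here, the reporting table is untouched
    TRUNCATE TABLE dbo.ReportData_Staging;

    INSERT INTO dbo.ReportData_Staging (Col1, Col2)
    SELECT Col1, Col2
    FROM   OPENQUERY(SourceLinkedServer, 'SELECT Col1, Col2 FROM dbo.SourceTable');

    -- Step 2: only after a successful pull, refresh the reporting table in one transaction
    BEGIN TRANSACTION;
        TRUNCATE TABLE dbo.ReportData;
        INSERT INTO dbo.ReportData (Col1, Col2)
        SELECT Col1, Col2 FROM dbo.ReportData_Staging;
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    THROW;  -- fail loudly instead of leaving an empty table behind
END CATCH;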

ETL Toolkit– Error Event Table and Audit Dimension

March 28, 2011

Chapter 4, "Cleaning and Conforming", of Kimball's The Data Warehouse ETL Toolkit is about 50 pages, not an extremely long chapter. Here is a warning from the beginning of the chapter.

Please stay with us in this chapter. It is enormously important. This chapter makes a serious effort to provide specific techniques and structure for an often amorphous topic. The chapter is long, and you should probably read it twice, but we think it will reward you with useful guidance for building the data cleaning and conforming steps of your ETL system.

I have to confess that I read it twice, but never finished the entire chapter either time. That doesn't mean I totally ignored data quality in my ETL design. Although I didn't finish the chapter, I focused on these three sections:

[Image: the three sections of the chapter I focused on]

Without the effort and hard work of the data profiling phase (or, as I call it, the data discovery phase), cleansing and auditing are not going anywhere. Here is an example from the IT Asset Management system in my previous project.

Network auto-discovery software is installed on the servers. Multiple such tools can be installed on the same server, collecting duplicate or conflicting data. Only after lots of hard data profiling work were we able to create a strategy to resolve the duplicate or conflicting data.

Looking at the error event table schema suggested in the ETL Toolkit, it doesn't take long to come up with a Screen Type of "Duplicate Composite Key" for my data exception table. I call it a data exception table, instead of an error event table.

By the way, the suggested schema here is a generic one. I'd think that you can design your table any way you want to fit your specific project.

[Image: the suggested error event table schema]
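
As an illustration only (the column names below are mine, not the book's), a data exception table and a "Duplicate Composite Key" screen might look something like this:

-- Hypothetical data exception table, loosely following the error event idea
CREATE TABLE dbo.DataException (
    DataExceptionId   INT IDENTITY(1, 1) PRIMARY KEY,
    BatchId           INT          NOT NULL,
    ScreenType        VARCHAR(50)  NOT NULL,   -- e.g. 'Duplicate Composite Key'
    SourceTableName   VARCHAR(128) NOT NULL,
    SourceKeyValue    VARCHAR(200) NOT NULL,
    ExceptionDateTime DATETIME     NOT NULL DEFAULT (GETDATE())
);

-- Hypothetical screen: servers reported more than once by the same discovery source
DECLARE @BatchId INT = 20110328;

INSERT INTO dbo.DataException (BatchId, ScreenType, SourceTableName, SourceKeyValue)
SELECT @BatchId,
       'Duplicate Composite Key',
       'stg.DiscoveredServer',
       ServerName + '|' + DiscoverySource
FROM   stg.DiscoveredServer
GROUP  BY ServerName, DiscoverySource
HAVING COUNT(*) > 1;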

As for cleaning deliverable #2, the "Audit Dimension", my understanding is that it is just a summary of your data exception screens, with a score for each exception category. I've designed an audit table to summarize all the important data exception measures and also the normal row counts, but I've never created scores for them. I guess I really didn't know how to score them.

[Image: my audit table]
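
And here is a hedged sketch of how such an audit table gets one summary row per load (again, the names are mine, and there is no scoring):

-- Hypothetical audit summary: normal row count plus exception counts per screen, one row per batch
DECLARE @BatchId INT = 20110328;

INSERT INTO dbo.EtlAudit (BatchId, TableName, RowsLoaded, DuplicateKeyExceptions, AuditDateTime)
SELECT @BatchId,
       'dbo.FactAsset',
       (SELECT COUNT(*) FROM dbo.FactAsset     WHERE BatchId = @BatchId),
       (SELECT COUNT(*) FROM dbo.DataException WHERE BatchId = @BatchId
                                                 AND ScreenType = 'Duplicate Composite Key'),
       GETDATE();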
