
Kimball Group – an authoritative voice in data warehousing

As I move into the incremental design phase of my ETL process, I have started to re-examine my overall architectural strategy so far.

Kimball Group is “the definitive source for dimensional data warehousing expertise” according to their website http://www.kimballgroup.com/.

Dimensional modeling is the most discussed topic in data warehousing. The ETL process is merely a passage that we must go through to get to the wonderful world of the multi-dimensional data warehouse.

I couldn’t find any book devoted to ETL alone, though there are many books on data warehousing and on ETL tools. Still, as an ETL developer, I don’t feel deserted when I come to the Kimball Group’s web site. Besides a few highly recommended books on data warehousing, the site hosts many good articles, a very active forum, webinars, and courses.

Here is a discussion that I found very interesting, because I had just used the recommended design strategy in my own ETL process.

In The Data Warehouse ETL Toolkit, Chapter 4 (Cleaning and Conforming), page 116, the first bullet under Assumptions reads: "…customer information from various sources initially can be staged in a table [. . .] whose configuration is the same regardless of the data source. Such a structure could be used for incoming data from various data sources deposited and queued for ETL work."

Some people questioned this strategy. Is this an all-encompassing guideline? Or does it describe a scenario where it just so happens (maybe by design) that the data structures of the different data sources are identical, allowing us to import them into the same initial staging table (or tables)?

The answer from the forum is this:

“No, it’s good advice to be used in the general case.

The issue isn’t whether the sources are the same but that the staging table is the same. Think of the staging table design as the API between the staging process and the loading process. If you get it right, you need only develop one load process regardless of source. "Getting it right" involves abstracting natural keys, conforming data types and other such techniques so that similar data populates the same columns in the staging table. You wind up writing one staging process for each source and only one load process, rather than unique staging and load processes for each source.

Another way to look at it is the content of the staging table should conform to the ultimate target (fact and/or dimension table) with the only difference being it contains natural rather than surrogate keys (surrogate key assignment occurs in the load processes). After all, if you have different sources going into a single fact table, how much more complicated could it be to have those same sources go into a common staging table?”
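To make the idea of the staging table as an API concrete, here is a minimal sketch in Python using SQLite. All the names in it — stg_customer, dim_customer, the stage_crm and stage_webshop adapters — are hypothetical illustrations of the technique, not anything from the book or the forum: each source gets its own small staging function that conforms its rows into one shared, natural-keyed staging table, and a single load process assigns surrogate keys for every source at once.

```python
import sqlite3

# Hypothetical schema: one shared staging table, conformed to the shape of
# the target dimension, but carrying a natural key instead of a surrogate
# key. (In a real design the natural key would be abstracted so that keys
# from different sources cannot collide.)
DDL = """
CREATE TABLE stg_customer (
    source_system  TEXT NOT NULL,   -- which source this row came from
    natural_key    TEXT NOT NULL,   -- the source's natural key
    customer_name  TEXT,
    email          TEXT
);
CREATE TABLE dim_customer (
    customer_sk    INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    natural_key    TEXT NOT NULL UNIQUE,
    customer_name  TEXT,
    email          TEXT
);
"""

def stage_crm(conn, rows):
    """One staging process per source: conform CRM rows to the shared shape."""
    conn.executemany(
        "INSERT INTO stg_customer VALUES ('crm', ?, ?, ?)",
        [(r["cust_id"], r["name"].strip().title(), r["mail"].lower()) for r in rows],
    )

def stage_webshop(conn, rows):
    """Another source with a different input shape, same staging table."""
    conn.executemany(
        "INSERT INTO stg_customer VALUES ('web', ?, ?, ?)",
        [(r["user"], r["full_name"].strip().title(), r["email"].lower()) for r in rows],
    )

def load_dim_customer(conn):
    """One load process for all sources: surrogate key assignment happens
    here, by letting the dimension's autoincrement column generate keys
    for natural keys not yet present in the dimension."""
    conn.execute("""
        INSERT INTO dim_customer (natural_key, customer_name, email)
        SELECT natural_key, customer_name, email FROM stg_customer s
        WHERE NOT EXISTS (
            SELECT 1 FROM dim_customer d WHERE d.natural_key = s.natural_key
        )
    """)

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
stage_crm(conn, [{"cust_id": "C-001", "name": "ada lovelace", "mail": "ADA@X.COM"}])
stage_webshop(conn, [{"user": "u42", "full_name": "alan turing", "email": "alan@y.com"}])
load_dim_customer(conn)
print(conn.execute("SELECT * FROM dim_customer").fetchall())
```

The point of the sketch is the shape of the work: adding a third source means writing one more stage_* function, while the load process stays untouched.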

To illustrate the above strategy of “one staging process per source, and one loading process”, here is what my design looks like in terms of staging steps:

[Image: diagram of the staging steps in my ETL design]

Categories: ETL Best Practices