
Incremental Data Loading – 3 day rule

Incremental loading is not an easy topic, and I am concentrating on a very narrow slice of it here.

Here is the context for this post:

  1. Data Source: a relational database on DB2
  2. Data Destination: a table in a SQL Server database
  3. Reporting: an SSRS report pulling directly from the above table
  4. ETL: implemented in an SSIS package with some business rules built in
  5. Daily Process: newly added data from the source needs to be loaded into the destination table

Challenges of the ETL process:

  1. Restartable: a failed run must be recoverable without any manual intervention
  2. Fast: the source can contain a large number of records
  3. Self-correcting: if source data is corrected and back-dated, the destination data should correct itself

In practice, this means choosing among the following options for the daily incremental load:

  1. truncate and re-populate the destination table daily
  2. start from the last date in the destination table: I need to pass the last loaded date from my SQL Server table to the source query. I can either read it into a user variable in my SSIS package and pass it to the query, or just “ETL” the last date over to the source DB2 database.
  3. pull only the previous day’s data from the source
  4. pull the previous 3 days’ data from the source
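To make the second option concrete, here is a minimal T-SQL sketch of the high-water-mark lookup (table and column names are hypothetical). In SSIS, the result would typically be captured in a package variable via an Execute SQL Task and substituted into the DB2 source query:

```sql
-- Read the high-water mark from the destination table (hypothetical names).
DECLARE @LastLoadDate DATE;

SELECT @LastLoadDate = MAX(TransactionDate)
FROM dbo.DestinationTable;

-- The SSIS source query against DB2 would then filter on this value, e.g.:
--   SELECT ... FROM SOURCE_SCHEMA.SOURCE_TABLE
--   WHERE TRANS_DATE > ?   -- parameter mapped to @LastLoadDate
```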

The first choice is simple enough, but will suffer from poor performance as the source grows.

The second choice can be a little messy to implement, but it sounds like a good choice.

The third choice is also simple, but fragile: our SQL job can potentially fail on any given day, and in that case a day of data is silently missed.

The last choice is simple, still catches the data even if the job failed on the previous two days, and should still be fast. Fortunately, in our environment the jobs run on a fixed Monday-to-Friday schedule, so Monday’s 3-day window also covers the weekend gap.

For the sake of simplicity, I picked the last option for some of my ETL processes. So far, the 3 day rule has worked pretty smoothly. 
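The 3-day rule boils down to a delete-and-reload of a sliding window, which makes the load idempotent: a failed run can simply be re-run. A minimal T-SQL sketch, with hypothetical table and column names (in the real package the window would be re-populated from the DB2 source via the SSIS data flow):

```sql
-- Re-load a sliding 3-day window.
DECLARE @WindowStart DATE = DATEADD(DAY, -3, CAST(GETDATE() AS DATE));

BEGIN TRANSACTION;

-- 1. Remove the window from the destination, so back-dated
--    corrections in the source are picked up on reload.
DELETE FROM dbo.DestinationTable
WHERE TransactionDate >= @WindowStart;

-- 2. Re-insert the window from the staged source extract.
INSERT INTO dbo.DestinationTable (TransactionID, TransactionDate, Amount)
SELECT TransactionID, TransactionDate, Amount
FROM dbo.StagingTable            -- populated from DB2 by the SSIS data flow
WHERE TransactionDate >= @WindowStart;

COMMIT TRANSACTION;
```

Wrapping the delete and insert in one transaction keeps the destination consistent for the SSRS report even if the reload fails partway through.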

Categories: ETL Best Practices
  1. October 20, 2014 at 4:44 pm

    Nice thread! I have a business scenario where business users can delete data in the sources, and that data has already been populated into our DW. How can we capture these deletions?

