I recently had a need to show three different views of the same data set in an employee productivity report: two summary views and one account-level detail view.
For the two summary views, employees are listed regardless of whether they have data in the date range. This allows the report to do “forced ranking” (I’ll blog more about forced ranking later). Because I had to show all the FTEs in different groupings, I had to create some dummy rows. I’ve written a post, SSRS – Drill down or Drill through 2?, discussing that approach.
Now if I use the same data set to show the data at the account level of detail, there is no need for forced ranking anymore, so I want to hide those dummy rows.
This is a simple task in SSRS. There is no need for a new data set. Create a new table in SSRS, using the same data set, with a filter.
The Expression for the table filter is this:
Note: IsNothing is VB’s equivalent to IsNull in T-SQL.
With the above filter, only rows that have a DECISION_DT will be shown in the account-level detail view, and all the dummy rows will be hidden.
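The filter screenshot isn’t reproduced here, but based on the note above, a table filter along these lines would do it (this is a sketch, assuming the dummy rows carry a NULL DECISION_DT; the field name comes from this report):

```
Expression: =IsNothing(Fields!DECISION_DT.Value)
Operator:   =
Value:      =False
```

In other words, keep only the rows where DECISION_DT is not nothing.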
Chapter 4 “Cleaning and Conforming” from Kimball’s The Data Warehouse ETL Toolkit takes about 50 pages, not an extremely long chapter. Here is a warning from the beginning of the chapter.
Please stay with us in this chapter. It is enormously important. This chapter
makes a serious effort to provide specific techniques and structure for an
often amorphous topic. The chapter is long, and you should probably read
it twice, but we think it will reward you with useful guidance for building
the data cleaning and conforming steps of your ETL system.
I have to confess that I started reading it twice, but never finished the entire chapter either time. That doesn’t mean I totally ignored data quality in my ETL design. Although I never finished the chapter, I focused on these three sections:
Without the effort and hard work of the data profiling phase (or, as I call it, the data discovery phase), cleansing and auditing are not going anywhere. Here is an example from the IT Asset Management system in my previous project.
Network auto-discovery tools are installed on servers. Multiple such tools can be installed on the same server, collecting duplicate or conflicting data. Only after a lot of hard data profiling work were we able to create a strategy to resolve the duplicate or conflicting data.
Looking at the error event table schema suggested in the ETL Toolkit, it doesn’t take long to come up with “Duplicate Composite Key” as a screen type for my data exception table. I call it a data exception table, instead of an error event table.
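To make that concrete, here is a minimal sketch of what such a data exception table might look like. The table and column names are my own illustration, not the ETL Toolkit’s exact schema:

```sql
-- Hypothetical data exception table, loosely modeled on the ETL Toolkit's
-- error event schema; all names and columns are illustrative only.
CREATE TABLE dbo.DataException (
    ExceptionID  INT IDENTITY(1,1) PRIMARY KEY,
    BatchID      INT          NOT NULL,  -- ETL batch that raised the exception
    ScreenType   VARCHAR(50)  NOT NULL,  -- e.g. 'Duplicate Composite Key'
    SourceTable  VARCHAR(128) NOT NULL,  -- where the offending row came from
    SourceKey    VARCHAR(200) NULL,      -- natural/composite key of the row
    ExceptionDT  DATETIME     NOT NULL DEFAULT GETDATE()
);
```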
By the way, the suggested schema here is a generic one. I’d think you can design your table any way you want to fit your specific project.
As for the #2 cleaning deliverable, the “Audit Dimension”, my understanding is that it is just a summary of your data exception screens, with a score for each exception category. I’ve designed an audit table to summarize all the important measures for data exceptions, and also for normal row counts, but I’ve never created scores for them. I guess I really didn’t know how to score them.
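The audit table I describe above could be sketched roughly like this (again, a hypothetical illustration with made-up names, and no score column, since I never scored the categories):

```sql
-- Hypothetical audit summary: one row per batch and category, covering both
-- exception counts and normal row counts. Names are illustrative only.
CREATE TABLE dbo.EtlAudit (
    AuditID      INT IDENTITY(1,1) PRIMARY KEY,
    BatchID      INT         NOT NULL,
    Category     VARCHAR(50) NOT NULL,  -- 'Duplicate Composite Key', 'Rows Loaded', ...
    MeasureValue INT         NOT NULL,  -- the count for that category
    AuditDT      DATETIME    NOT NULL DEFAULT GETDATE()
);
```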
I’ve done ETL design and development both in integration projects and in reporting environments.
In master data integration projects, it’s not hard to distinguish the staging area from the target area. In the staging area, truncating tables with old data and loading new data is a pretty common practice. I have not seen anyone truncating tables in the target area. This is a good thing. It means that developers understand the role of staging, and understand that loading data into the target area needs to be carefully designed, or “reconciled”, as a consultant called it. Depending on the business rules, we will end up creating new records in the target, updating existing records, or deleting obsolete records.
In the reporting environment, however, developers have a very different attitude toward, and understanding of, staging vs. target.
This is a very typical architecture I’ve seen so far. The flaw in this design is that there is no target area, only staging. When developers truncate the table in staging, the report shows empty data between the time the table is truncated and the time the new data finishes loading. Or even worse, the report hangs when the load takes a long time to finish.
You might argue that the odds are slim that users run the report at the exact moment the data is refreshing. In my experience, if I hear about it once from the business users, the chance is big, because the possibility of it happening again is always there.
Well, the above is not the only type of architectural design I’ve seen. Here is another one. There is no physical staging area in this design. The report accesses the data source directly with T-SQL code using OPENQUERY. The developer argued that this design eliminates the need to maintain a staging area on the SQL Server. That part is true. But it does not eliminate the emails and phone calls we get from frustrated business users when the report takes forever to load. Reporting Services routinely times out when trying to retrieve a large amount of data via OPENQUERY.
The third type of architectural design can avoid many of the problems of long-running reports or reports with empty data. By separating the target from staging, and not truncating the target tables, the reports will always look good.
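One common way to reconcile staging into the target without truncating is a T-SQL MERGE. This is only a sketch with hypothetical table and column names, covering the three outcomes mentioned earlier (new, updated, and obsolete records):

```sql
-- Hypothetical staging-to-target reconciliation: insert new accounts,
-- update changed ones, delete obsolete ones. Names are illustrative.
MERGE dbo.AccountTarget AS tgt
USING dbo.AccountStaging AS src
    ON tgt.AccountID = src.AccountID
WHEN MATCHED AND tgt.DecisionDT <> src.DecisionDT THEN
    UPDATE SET tgt.DecisionDT = src.DecisionDT
WHEN NOT MATCHED BY TARGET THEN
    INSERT (AccountID, DecisionDT)
    VALUES (src.AccountID, src.DecisionDT)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;
```

Because the target is never empty mid-load, reports keep showing the last good data while staging refreshes.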
Back to the title of this post, “why truncating target tables is a bad practice in ETL”. I have to confess that I was one of those developers who made this mistake, until I received enough emails and phone calls from our business users.
There is more to discuss on how to move data from staging to target. Stay tuned.
I’ve written a post, SSIS – Use “Foreach Loop Container” and pass a variable to an OLE DB Source in a Data Flow. I have to confess that I don’t use the Foreach Loop container very often. Actually, I’ve only used it once in my entire SQL development life. I know how wonderful it is, and how much other developers like it. Deep down, I am a SQL developer. I still resort to T-SQL whenever I feel a bit dizzy thinking about SSIS packages.
I start to feel dizzy when I need to iterate a Foreach Loop container more than 10 times. I think that’s my arbitrary limit. Looping through 10 days means connecting to the data source 10 times. I bet Integration Services does some tricks with connection pooling. Still, I don’t quite like it, especially when the data volume is large.
Here is what I would rather do. I’d bring in all 10 days’ (or even more, depending on the project) worth of data from my data source, then use a WHILE loop in T-SQL to process the data for each day.
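The idea can be sketched like this. The staging table, column, and procedure names are hypothetical, stand-ins for whatever your project uses:

```sql
-- Hypothetical sketch: pull all days of data once, then process each day
-- with a T-SQL WHILE loop instead of an SSIS Foreach Loop container.
DECLARE @CurrentDay DATE, @EndDay DATE;

SELECT @CurrentDay = MIN(ActionDT),
       @EndDay     = MAX(ActionDT)
FROM dbo.ActionStaging;          -- all 10+ days already loaded in one pull

WHILE @CurrentDay <= @EndDay
BEGIN
    -- per-day processing lives in one stored procedure
    EXEC dbo.ProcessDailyActions @ProcessDate = @CurrentDay;
    SET @CurrentDay = DATEADD(DAY, 1, @CurrentDay);
END;
```

One connection to the source, one data pull, and the day-by-day logic stays in set-based T-SQL.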
You might wonder what would drive me to process data for each day in a WHILE loop. That will be the topic for my next post.
I’ve written one post about SSIS – Enable Package Configuration for Deployment, and another one about SSIS – Add Configuration Files for Deployment.
I am afraid that after reading these two posts you are quite confused. You are probably screaming silently: Why is she talking about environment variables together with configuration files? Do we need both, or just one of them? Why did my job still fail after I changed my DB2 password? Questions, one after another…
So here I have more writing to do to clarify the confusion.
First of all, after reading those two posts, you already know:
1) how to create the environment variables,
2) how to create the configuration files in XML, and
3) how to enable package configurations in BIDS.
4) You also know how to point to a configuration file for your SQL job step when executing a SSIS package.
The first 3 “hows” are important for an SSIS developer. There are a few more “secrets” regarding package configuration that you will also need to know as an experienced SSIS developer.
Here are those “secrets” (assuming that you have done the first 3 hows in BIDS).
1) There are a couple of ways to test your package in BIDS. Here is one, by right-clicking on the package name in the Solution Explorer.
Have you ever thought about how you are connecting to your sources when you test your package in this manner?
Is the connection done through one of the configuration files?
Or by the login credential you entered in the Connection Manager?
The answer is by the package configuration files you’ve specified in your environment variables.
If this is how you habitually test your package locally, then you will need to maintain your own package configuration files with the correct user ID and password (or use Integrated Security).
2) Well, the above is not my favorite way for testing locally. I very often test just parts of my package at a time by right-clicking on a sequence container. When I test in this manner, how is my package connecting to the sources?
The correct answer this time is through the connection manager. So you had better make sure that you have the correct credentials in each of your connection managers.
3) So what’s the third secret? The third one has to do with the SQL Agent job you’ve created to execute your package.
Again, I assume that you have done all the right things in your BIDS locally, and have successfully uploaded your package to the Integration Services, and have created a job step to execute your package.
I will also assume that your server administrator has also done on the server exactly the same things you have done locally.
The secret is that you do nothing in the job step. After all the hard work you have put in, you deserve a break.
By nothing, I mean putting nothing on the Configurations tab.
I also mean doing nothing on the Data sources tab. Do not check any of the check boxes for the connection managers.
By doing nothing, your job step will use the environment variables on the server to point to the correct package configuration files (the DTSConfig files we’ve created), which have the correct credentials to connect to the data sources.
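For reference, a DTSConfig file pointed to by the environment variable is just XML. A minimal fragment looks roughly like this; the connection manager name, server, and database here are made up for illustration:

```xml
<?xml version="1.0"?>
<DTSConfiguration>
  <Configuration ConfiguredType="Property"
                 Path="\Package.Connections[SourceDB].Properties[ConnectionString]"
                 ValueType="String">
    <ConfiguredValue>Data Source=MyServer;Initial Catalog=MyDB;Integrated Security=SSPI;</ConfiguredValue>
  </Configuration>
</DTSConfiguration>
```

The server administrator maintains this file on the server with the server-side credentials, which is exactly why the job step itself needs nothing.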
4) The last secret. This is a secret I’d rather not know. Adding package configuration files to your job step is only necessary if you want to override the package configuration files defined by your environment variables. There are cases where you want to do this, but I hope you will only need to do this for very few of the packages you’ve designed.
I am hoping that this post will leave you feeling better about package configuration files and package deployment.
I came back to my old job, but I am not working on exactly the same projects. I am very happy that I have new puzzles to solve now.
Obviously employee compensation and productivity reporting has become a hot topic in the business recently.
I am going to blog on this subject as I get to understand more about the business needs.
After completing several ETL processes and reports so far, I am gradually starting to put all the puzzle pieces together.
In its simplest form, I can envision a subset of the data warehouse bus matrix. This matrix will evolve as I understand more.
The dimensions will involve hierarchies. All the facts basically come from a transactional table. But as we put together all the fact tables for reporting, we will end up with all three types of fact tables: transactional fact tables, periodic snapshot fact tables, and accumulating snapshot fact tables. The third type will be the most common.
The natural granularity for our business transactional fact table will be one row for each action a team member takes on each account. For the purpose of employee compensation and productivity reporting, the facts associated with the business process typically just include counts of the actions, such as the count of all “declines” or the count of all “approvals”.
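At that grain, the transactional fact table could be sketched like this. The dimension keys and names are hypothetical placeholders for this business:

```sql
-- Hypothetical transactional fact table at the stated grain:
-- one row per action a team member takes on an account.
CREATE TABLE dbo.FactAccountAction (
    DateKey       INT NOT NULL,  -- FK to the date dimension
    EmployeeKey   INT NOT NULL,  -- FK to the employee dimension (hierarchies live here)
    AccountKey    INT NOT NULL,  -- FK to the account dimension
    ActionTypeKey INT NOT NULL,  -- e.g. decline, approval
    ActionCount   INT NOT NULL DEFAULT 1  -- counts simply roll up in reports
);
```

With an ActionCount of 1 per row, the “count of declines” and “count of approvals” are just SUMs filtered by action type.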
In the next blog I’ll talk about whether I need to normalize my fact tables by fact types.