Validations at end of ETL indicate missing data
At the end of each ETL job, I always run some sort of validation process to give me an indication that the data loading and processing are as expected or things have gone terribly wrong. The latter happened last Thursday when I received the validation result in my inbox.
Last Thursday also coincided with a “disk utilization” issue in the upstream server. So I waited until last Friday when the “disk utilization” issue was fixed in the upstream server to have the data reloaded. To my surprise, the data volume that was loaded on our side was still way below normal.
Steps to investigate missing data
It’s time to investigate. Blog writers at this point will usually “make the long story short”, I’ll do the same here, but with a few bullet points to highlight the steps I took to investigate.
- Pick one day for one employee: this is the most obvious data point to validate for my data. Our data contains detail data down to the granularity of per employee, per Interaction Resource ID, per connection ID, per 15 minutes interval per row. Picking one day for one employee will give me not too little and not too much data to check.
- Validate on the upstream source server: the table in question is a fact table and has a unique interaction resource fact ID. The unique list of the interaction resource fact ID is an obvious candidate as my data points for checking.
- Compare the interaction resource fact ID: between the upstream source server and our own server. Because the data volume is small enough, a simple query revealed that about 20% of the interaction resource fact IDs are not loaded into our own table.
- An ETL process design with an atomic operation in mind helps to pinpoint exactly where the problem might be: our fact tables usually are long and also wide, so it’s not very easy to visually see why these 20% of the data were not loaded. So it’s time to go to the step where the data was processed before the loading. Fortunately my ETL processes are designed with the atomic operation in mind, and I know exactly which step to look for the culprit.
- A WHERE clause was added to the processing stored procedure: a log entry in the procedure clearly says that a WHERE clause was added last Thursday. A quick running of the query inside the procedure shows that this WHERE clause filter out those 20% data.
ecda1.[CUSTOM_DATA_8] <> ‘null’
A quick check of the raw data shows that out of 1.3+ million rows, there is only one row that has the value ‘null’ in ecda1.[CUSTOM_DATA_8]. Then why 20% of the data were also filtered out along with the one ‘bad’ row?
Why a seemingly innocent WHERE clause can cause large amount of data missing
This post is getting longer than I wanted. I’ll stop now. In the next post, I’ll explain:
- Why the above WHERE clause not only filtered out one ‘bad’ row, but also took 20% of other rows along with it.
- Why in ETL, adding WHERE clause during data processing in production is almost never a good idea.
Validation of data loading and processing
One of the most often used validation method at the end of each ETL run is to run a cross reference checking on a couple of metrics, which entails finding two independent sources of the same metric.
Atomic Operation in ETL
Atomic operation, atomic programming, atomic database design, atomic transaction, etc., etc.. There are many explanations to these concepts. I am probably not qualified to give it a formal definition in the ETL design, but it’s a strategy that every ETL designer/developer should learn to practice. As an ETL designer/developer, our professional life depends on how well we understand the strategy and how well we apply it to every single task we design.
It’s hard to believe that it’s been 5 years since my first ever post, in which I simply spelled Business Intelligence. Thanks to everyone for being with me since then!
I blogged around my daily experience as a Data Warehouse and Business Intelligence developer. Most of my posts are about how to use the tools, i.e. the Microsoft SQL Server Business Intelligence stacks with more productivity and efficiency, of cause, also about the two major data query languages, SQL and MDX. I enjoy working as a DWH and BI developer for the creativity I found in all I do. I mostly praise the tools and the query languages we use, but occasionally my posts complain about the the inefficiency and even bugs in the tools.
In 2013, I co-authored the book MDX with SSAS 2012 Cookbook. As much as I’d love to write again and blog more, in 2014 I felt that I I needed more time for myself to relax and recharge, and to spend more time with my family.
Looking forward to next year, I would like to think that being a DWH and BI developer will continue to give me professional satisfactions; I do not need to find a new career. I also hope that 2015 is a year that will give me inspiration for a new book for DWH and BI developers.
My posts cover a lot of grounds in SSIS, SSRS, SSAS, SQL Jobs, SSMS, Visual Studio’s SSDT, and SQL and MDX. But it’s interesting that the top 5 posts that received the most views are all in SSIS, according to the WordPress’s annual report. More interestingly, these top SSIS posts are all about how to use variables in SSIS. I will consider writing about those topics again.
Once again, thanks for following and commenting on my posts in 2014. I look forward to seeing you all online again in 2015! Happy New Year!
OLEDB is the native client for SQL Server. Every SSIS developer is familiar with how to use it. Almost every task in Integration Services can use an OLEDB connection. As a matter of fact, it’s a preferable client to use if your backend is SQL Server.
However, there is one exception, that is the Transfer SQL Server Objects Task. This task uses an SMO Server connection only.
If you don’t know this before you use this transfer object task, there can be confusions during development and also during deployment.
Here are my lessons learned.
- Visual differences between the OLEDB and the SMO Server connections in Integration Services 2012 helps
In SSIS 2012, I am happy to see the visual differences. If you are still stuck in 2008, you will not see any visual differences.
- Name your SMO Server connections properly
A good practice is to name the SMO Server connection properly. I used such names as:
- SMOServer_Staging (this could be the source server for the transfer task)
- SMOServer_Reporting (this could be the destination server for the transfer task)
- Rename the SMO Server connection properly
In SSIS 2012, you have three difference places to create connections.
- At project level: connections are available to all packages in the project
- At package level: only available to the packages in which the connection is created
- At task level: also only available to the packages in which the connection is created
The best practice is to create connections at the project level.
However, no matter where you are creating the connections, the UI in the SSIS always prefix the connection with the server name:
- For OLEDB – servername.databasename: removing the servername is usually the best practice here, leaving just the database name.
- For SMO Server – servername: since we cannot just simply remove the servername, the best practice here is to rename it to what is suggested in the section 2.
- Confusions during deployment over where to change to production/test server
In 2008 , during deployment, I’ve seen DBAs done the following:
- Create new connection: because of the confusions about OLEDB vs. SMOServer connections, and even bigger confusions if your connections have your development server name in it, DBAs would most likely choose to create new connections.
- Rename the connection name: again if you have not named your SMO connections properly, they will tend to rename it.
Neither of the above practices is good. If you as a developer have done if properly, DBA’s will not need to mess too much with the connections:
- Named your SMO Server connections properly so they know the differences with the OLEDB connections.
- Your connection names do not have your development server name so they don’t need to rename it.
- In 2012, the deployment is a lot easier
In 2012, deployment is much more straightforward because
- Connections can be created at project level: there is only one place to mess with the connections, including the connection name, server name and database name, and bunch of other connection properties
- Of cause you can take even greater advantages of many features such as project level parameters if you have chosen to deploy packages to SSIS Catalog
More about SMO connection manager
Quote from the MSDN article:
An SMO connection manager enables a package to connect to a SQL Management Object (SMO) server. The transfer tasks that SQL Server Integration Services includes use an SMO connection manager. For example, the Transfer Logins task that transfers SQL Server logins uses an SMO connection manager.
When you add an SMO connection manager to a package, Integration Services creates a connection manager that will resolve to an SMO connection at run time, sets the connection manager properties, and adds the connection manager to the Connections collection on the package. The ConnectionManagerType property of the connection manager is set to SMOServer.
You can configure an SMO connection manager in the following ways:
- Specify the name of a server on which SQL Server is installed.
- Select the authentication mode for connecting to the server.
More about Transfer SQL Server Objects task
Besides, the data flow object, the Transfer SQL Server Objects task is another task you can use to transfer not just data but also other types of objects between instances of SQL Server. For more info, check out this MSDN article.
MSOLAP is a native OLE DB provider for Analysis Services
A common task during ETL is to process the OLAP dimensions and the cube after all the dimension and fact tables are loaded into the data warehouse. Integration Services provide two special control flow tasks for this purpose, the Analysis Services Processing Task and the Analysis Services Execute DDL Task.
Before you can use either of these two tasks, you will need to create a SSIS Connection Manager that knows where the Analysis Services server is. The Connection Manager is just the name of the GUI in Integration Services. Behind the fancy GUI, it’s just a connection string that uses the native data provider, the Analysis Services OLE DB provider (or just short as MSOLAP DLL).
Location in the connection string causing “connection cannot be made”
In the above screenshot, I not only provided the Server name but also the Location which I made it the same as the server. The testing of the connection in the GUI was successful.
After the GUI is completed, the final connection string looks like this.
Provider=MSOLAP.5;Data Source=localhost\MULTI2012;Integrated Security=SSPI;Initial Catalog=SSAS2012DEMO;Location=localhost\MULTI2012
The control flow task Analysis Services Execute DDL Task is usually the more flexible choice for processing Analysis Services objects than the Processing Task. But to makes sure that my connection string is correct, I used the Analysis Services Processing Task, which is the GUI version of the DDL task, to test the connection string. I ran into this “connection cannot be made” error.
It turned about the problem is in the Location. Removing the Location from the connection string solved the connection issue.
AS a matter of fact, the shortest connection string that can pass the connection is just this:
No explanation for Location in connection strings for MSOLAP and ADOMD.NET provider
I checked both the connection string document in MSDN library for MSOLAP and ADOMD.NET , and couldn’t find much information about what the Location is supposed to be.
In the first link for the ADOMD.NET provider it just simply states:
Location: Sets the name of the server.
In the second link for the MSOLAP provider, I couldn’t even find any mentioning of Location.
Three types of data providers for Analysis Services
There are really three types of data providers for Analysis Services.
Analysis Services provides three data providers for server and data access. All applications connecting to Analysis Services do so using one of these providers. Two of the providers, ADOMD.NET and Analysis Services Management Objects (AMO), are managed data providers. The Analysis Services OLE DB provider (MSOLAP DLL) is a native data provider.
More ways to access Analysis Services in SSIS than the DDL Task and the Processing Task
Check out this book if you want to know more than just the DDL Task and the Processing Task to access Analysis Services in SSIS.
Quote from the book:
* Using the control flow object Analysis Services Processing Task is the most straightforward
approach, and is discussed in the next section. This allows any SSAS objects (that involve
processing) and their processing method to be selected. A few drawbacks are listed in the next
* Although the Analysis Services Execute DDL Task may sound as if it only allows objects to be
modified, created, or deleted, its functionality goes way beyond Data Definition Language
(DDL). This task runs an SSAS XMLA script. XMLA includes the capability not just to run DDL,
but also query and process. Therefore, this task is very useful for SSAS processing, because the
XMLA can be modified in an SSIS package before it is executed.
* Another approach involves using the Execute Process Task to call the ASCMD executable that
comes with SSAS. This executable can run MDX, Data Mining Expressions (DMX), and XMLA,
and also contains some command – line switches to make your processing more dynamic.
The data flow in SSIS includes two destinations: the Dimension Processing destination and the
Partition Processing destination . These allow data directly from the pipeline to be pushed into
either an SSAS dimension or SSAS partition, respectively. Unlike all the other SSIS approaches
discussed here, this approach is the only one where data is pushed directly to SSAS. The
other approaches essentially tell SSAS to start pulling the data from a data source. Like the
other SSIS approaches, this approach is described later in this chapter.
* Finally, the programming API for SSAS, called Analysis Management Objects (AMO), allows the
interface into the SSAS object model and can leverage the processing methods to process any of
the SSAS objects described. To use AMO for processing, you must leverage either the Script Task
in the control flow, or the Script Component in the data flow.
The WordPress.com stats helper monkeys prepared a 2013 annual report for this blog.
Here’s an excerpt:
The Louvre Museum has 8.5 million visitors per year. This blog was viewed about 120,000 times in 2013. If it were an exhibit at the Louvre Museum, it would take about 5 days for that many people to see it.
A reader asked me to share a C Sharp script I used in this post, SSIS #114 – How many ways can you do INSERT-UPDATE?.
This code snippet tried to determine if the original value of MyFirstColumn is different or not compared to the incoming data in the pipeline. It is used in the last UPDATE-DELTA design.
The Script Component needs some configuration. Without the proper configuration, the above code will not work.