I recently needed to modify an SSIS package to add a few new columns to a table. But I turned out that I had more than just adding columns to do.
I did three things prior to adding the columns:
1. I added three Sequence Containers as part of the Control Flow, and make them run in parallel.
2. I moved three staging tables into the schema that is designated as staging area.
3. I renamed the three staging tables to be consistent in naming.
In this post, I’ll focus on the Sequence Container part.
An SSIS package without Sequence Container
When you open the Toolbox, and take a closer look at the first three items in the Control Flow choices, you will see that they are all Containers of some sort. By the way, the Control Flow Items are not sorted alphabetically.
Why would Microsoft developers place these three items to try to catch our attentions first? I can only guess. My guess is that these three items must be the most commonly used Control Flow items. I cannot say for other SSIS developers, but, for me, this is true. Or at least one the Container is true, that is, Sequence Container.
The SSIS package that I need to modify looks like this. No Sequence Container.
Question 1 – Must each task be executed in the specified order?
With all the Precedence Constraints, I can see that each task must be executed in the order. Do we really need to execute each task in the specified order?
The answer is NO after some poking around each task.
Question 2 – Where are the “touching” points?
With the skeleton-like control sequence, I cannot really tell the “touching” points, i.e. when does the package “touch” the staging and when does it “touch” the production table(s).
In my ETL work, it’s important for me to distinguish the tasks that only “touch” the staging area, and the tasks that “touch” the target.
An SSIS package with 3 Sequence Containers in parallel
This is the final version of the package after I added 3 Sequence Containers.
1. Each Sequence Container has one specific task to do, that is, to populate one staging table.
2. These three tasks do not need to run in the sequence, instead, they can run in parallel.
3. Running tasks in parallel can increase the performance of the package, and also gives clear visual clues on each task’s duty.
4. The last step is to combine the three staging tables and populate a target table.
Give the Sequence Container a try in your own SSIS package design next time. Your co-workers will come to appreciate it.
Early this year, I had a post on SSIS – Pass a variable to a OLE DB Source in a Data Flow. Being able to make your ETL process dynamic by using dynamic variables will make you appreciate SSIS more, and enjoy working with it more.
In this post, I’ll show you one example from my work where my Execute SQL task is expecting a date time parameter in a parameterized SQL statement.
I have a variable varStartDate, which is always the first of the reporting month. I want to get a list of 2 moths prior to the reporting month.
In SQL, it’s something like this:
SELECT DATEADD(MONTH, -1, ‘2011-08-01’) AS MNTH_ID
SELECT DATEADD(MONTH, -2, ‘2011-08-01’) AS MNTH_ID
This will give me:
Once I have these two dates, I will pass them into a Foreach Loop Container to get data from a DB2 database.
The entire sequence of my control flow looks like this:
For the purpose of this post, I’ll focus on the second control flow task, where I create the MNTH_ID, based on the value from a variable, varStartDate, which is dynamically set in the first control flow task.
Step 1 – create a variable varStartDate
Of cause you need to create the variable varStartDate first. Since it’s date time field, I picked DateTime as Data Type.
The value of varStartDate is dynamically set in the first task. I have blogged before on how to use ”Full result set” or “Single row” to send dynamic values to variables.
I’ll skip this step in this post.
Step 2 – Use an appropriate parameter marker in the SQL query (OLE DB <—> ?)
This step is done in this Execute SQL task.
First, write a query in SQLStatement;
SELECT DATEADD(MONTH, -1, ?) AS MNTH_ID
SELECT DATEADD(MONTH, -2, ?) AS MNTH_ID
Instead of hard-coding the date, I used a question mark in the query. This question mark tells SSIS that we are expecting a variable.
Note: you might ask, “why did you use question mark?”. A technet article from Microsoft can answer your question.
According to this article, if I am using the OLE DB Connectio0n Type, the correct parameter marker is a question mark.
Step 3 – Map the variable varStartDate to appropriate parameter name
- Click Parameter Mapping
- To add a parameter mapping, click Add.
- Choose DATE as Data Type ( do not choose DBTIMESTAMP, DBDATE as Data Type if that’s what you are thinking. Choosing DATE will allow SSIS to map the date time value to the correct value to SQL Server.)
- Provide a name in the Parameter Name box. I chose 0 and 1 as the parameter names.
- You might ask, “how did you know 0 and 1 are the correct parameter names to use?”
Again, according to the above Microsoft technet article, 0, 1, 2, 3, … are the correct parameter names to use if I am using the OLE DB Connection Type.
In my previous post, SSIS #97 – When MERGE JOIN is your friend?, I showed you a scenario where Merge Join can be you real friend in your ETL work. I also showed you two ways to satisfy the sorting requirement by the Merge Join task.
One of the ways is to sort your data in a query in the OLE DB Source. Then you can just “tell” SSIS that your data set is indeed sorted. This is where I missed one important piece of information in my previous post, that is, sorted on what.
SSIS is a powerful tool for ETL, with a reputation of steep learning curve to go with it. This is one of the evidences, in my humble opinion.
OK, let’s have a recap.
Step 1 – Write a query with sorting in the OLE DB Source editor
Go the OLE DB Source editor.
I erased all the irrelevant info except the order by clause.
Step 2 – Go to the Advanced Editor for the OLE DB Source to set IsSorted to True
You need to tell SSIS two things:
1. The output data set is sorted
2. It is sorted on the column LOAN_NR
First, set IsSorted = True on the Input and Output Properties tab.
Step 3 – Set SortKeyPosition to 1
Next, expand the “Output Columns” branch, and select the column that corresponds to the first column in your ORDER BY clause. Set its “SortKeyPosition” property to the number 1.
Number 1 means that the column is the first column in my ORDER BY clause. Since I only have one column in my ORDER BY clause, my work is done. Otherwise, repeat the above step for each column in your ORDER BY clause, increasing the SortKeyPosition by one each time (1,2,3,…).
Now the Merge Join will understand and acknowledge that the incoming data is sorted and will not require a Sort transformation task between the source and a Merge Join component.
What does Merge Join in SSIS do?
Merge Join task in SSIS is actually a very straightforward task to use, if you know this:
- The Merge Join Transformation joins two data sets on a common key (or keys).
- It is the same as performing an SQL Join – the options you have are Inner Join, Full Outer Join and Left Outer Join.
What is the difference between Merge Join in SSIS and SQL Joins?
If you are a SQL developer, you would ask, “If I can do joins in SQL, why would I use merge Join”? On the other hand, if you are an open-minded SQL developer, you would ask, “when do I use Merge Join in SSIS, instead of SQL Join”? If you are an open-minded and a very stubborn SQL developer, you would ask, “what is the real differences between Merge Join in SSIS and SQL Join”? Here are the differences:
- In Merge Join, it is SSIS rather than SQL that is doing the join, so it’s an in-memory activity.
- Merge Join is a blocking transformation, that is, it needs to receive all data from all inputs before it can proceed to merge the data.
- Merge Join can do joins on data sets from difference data sources, and SQL joins cannot.
When did Merge Join in SSIS become my friend?
Your first reaction is performance hit, just like me, until you see the last difference. You want to give it a try! I don’t use Merge Join very often in my ETL work, because I am too much a STAGING person. But there are a few places where I would rather use Merge Join in SSIS, than staging any of those ugly/huge tables in DB2. They are ugly, because they are very difficult to use, and they are huge with 10-20 million records. I got frustrated with all the wrong where clauses that gave me the wrong/missing data. My life as a BI developer will be so much easier, only if I can:
- get all the data from the ugly/huge table in DB2 (with only the where clause I know for sure, so that no data will be missing)
- join to my nice/small table on SQL server
This is when I decided that I wanted to be friend with the Merge Join!
Data sets must be sorted when using Merge Join in SSIS
Once the decision is made, using and configuring the Merge Join is actually very straightforward, but only if you know this:
My data sets from DB2 and SQL Server must be sorted on the common key(s) first
Two ways to meet the sorting requirement
There are actually two ways to meet the sorting requirement:
- You can ask SSIS to sort the data set for you by using the Sort Transformation task, or,
- you can tell SSIS that your data sets are already sorted by setting the IsSorted property of the OLE DB Source Output to true (must use the advanced Editor)
Finally, I’ll show you some basic steps to make it work.
Steps when using Merge Join in SSIS
Below is the final screen shot. On the left, in the OLE DB Source, I have a query that creates a data set from DB2; on the right, in the OLE DB Source, I have a query that creates a data set from SQL Server.
The left data set can be a huge data set, which can have about 10 million records. The right data set is a very nice and small data set that has only about 5K records.
Step 1: Create your input data sets.
I assume that you know how to write queries for OLE DB Source to create data sets.
Step 2: Optionally, set IsSorted = True for your OLE DB Source Output
If you query has done the sorting, you can set the IsSorted property to true on the Advanced Editor.
On the Advanced Editor, go to the last tab, Input and Output Properties.
My example showed False. You can certainly set it to True if you have done the sorting in your query.
Step 3: Optionally, ask SSIS to sort the data sets
In my example, I asked SSIS to sort the data set on one column.
Remember, both of the data sets must be sorted on the same key(s).
Step 4: Configure the Merge Join
Here is the good news: there is no Advanced Editor for this Merge Join transformation task.
1) Join Type: pick from these three options. I picked Inner Join.
2) Pick Join Key from the two data sets. SSIS has picked the right key for me, because those are the key I’ve sorted the data set on.
3) Pick the output fields from the two data sets.
You should be done by now…but hang on…
Step 5: Two more things about Merge Join
If you look at the Merge Join properties, you will notice two more properties that might need your attention.
MaxBuffersPerInput – if your data sets are huge and the sorting is somehow incorrect, you will need to experiment with this number.
TreatNullsAsEqual –this property decides whether to join Null values or not. By default it does. In my example, there should be no NULL value. So I didn’t bother with this.
One last note, the result from the Merge Join is also sorted! This is a good news. If you need to use the Merge Join output in other transformations, there is no more sorting needed.
I am back from vacation, and have spent a couple of weeks catching up my workload. Now I am ready to finish the SQL Server BI Quiz “Mass Mailing Your Report”.
Here is my original blog, SSRS #55 – SQL Server BI Quiz “Mass Mailing Your Report”. Thanks to the following readers who have posted their answers on http://beyondrelational.com/quiz/sqlserver/bi/2011/questions/mass-mailing-your-report.aspx:
All of their answers mentioned about data-driven subscription engine in SSRS, but unfortunately, none has given details in implementation.
Since the quiz is all about the details in implementation, I have decided not to pick any winner.
In this post, I’ll blog the implementation using the data-driven subscription engine in Reporting Services.
I have seen developers using SSIS package to attach reports in emails. This approach requires spreadsheet templates and lots of scripting in SSIS.
With the data-driven subscription engine, you just need to create a SSRS report, and pass dynamic parameters to the SSRS report, and tell the data-driven subscription engine to send out the report as an attachment via email. There is no need for spreadsheet template to make the report pretty, and you can pass any values to the report parameters, and best of all, there is no need for any scripting.
There are four basic steps when working with data-driven subscription
- Create a SSRS report with the desired layout and parameters, and then
- Design a SQL table that will hold your dynamic parameters, and
- Pass the correct parameters to the SSRS report so the subscription engine does not reject them, and finally
- Tell the data-driven subscription engine to send the report (In my example, I’ll stay with email delivery option. With the file share option, it adds another layer of complexity to it, that is, using the correct service account to access the shared path when I get time, I’ll blog about it in a separate post.)
Step 1: Create a SSRS report with the desired layout and parameters
I assume that you already know how to create SSRS reports with parameters.
Step 2: Design a SQL table that will hold your dynamic parameters
- In my example, I have the begin date, end date and the supervisor ID as my report parameters. When the supervisor ID is 1, I need send report to Supervisor1@company.com; when the supervisor ID is 2, I need send report to Supervisor2@company.com.
I can also use this same table to set the begin and end date for a month-to-date report, rather than just a daily report.
Step 3: Pass the correct parameters to the SSRS report so the subscription engine does not reject them
On the Reporting Server, find your SSRS report. Under tab Subscription, click “New Data-driven Subscription”.
In the Step 1 window,
- 1) give the subscription a description.
- 2) Select E-mail as the delivery option
- 3) Select “Specify a shared data source”
On the Step 2 window, select the same data source your SSRS report uses.
On the Step 3 window, enter the SQL command as shown and click Validate button to make sure the command is good.
Table SUBSCRIBER.T_SUBSCRIBER_TEST stores all my parameter values shown in step 2. Change it to your own table.
In the Step 4 window, choose “Get the value from the database” for the following fields:
Also select Excel as the Render Format.
This is the fun part. When you click the drop-down box, a pick list showing all the fields from my table SUBSCRIBER.T_SUBSCRIBER_TEST showed up.
You can see that you can also make the email subject dynamic too.
On the Step 5 window,
- Use field ParamBeginDate as the begin date
- Use field ParamEndDate as the end date
- Use field ParamSupervisor as the Supervisor
On the step 6 window, select the second choice.
This is the last window for the subscription wizard. Instead of triggering the subscription with a fixed schedule, we will trigger it in a SQL job step right after my SSIS package finishes the daily ETL process.
Selecting Once will disable the fixed schedule.
Last step, i.e. Step 4, trigger the subscription in a SQL job step right after my SSIS package finishes the daily ETL process
Since I use SSIS packages to automate 99% of my SQL tasks, there is no surprise that I did this last step also in a SSIS package.
This SSIS package is executed right after my daily ETL process. The primary purpose of this package is to trigger the above subscription I just created.
There are three things I need to take care before I start the subscription.
1. I don’t want to send emails to users on weekends or during holidays. So I need to read in the holiday and the weekend flag. Then I used a Precedence Constraint to decide whether I need to continue to not,.
2. The second thing I need to take care of is the table SUBSCRIBER.T_SUBSCRIBER_TEST. I need to populate it with the correct begin date and end date, and possible supervisor IDs.
3. The last thing I need to take care of is to read the Subscription ID from the Reporting Server for my SSRS report.
This is the query you can use to get the Subscription ID from the Reporting Server for your SSRS report.
I save the Subscription ID to a user variable varSubscriptionID.
Finally, I use the AddEvent SQL command to trigger the subscription, of cause, with the Subscription ID (passed in by ?).
We are done!
Hope I didn’t confuse you or scare you away from using the data-driven subscription. If you are not comfortable with Integration Services, you do not need to use it. Simply triggering your subscription from a fixed schedule with a hard-coded Subscription ID in a SQL command is perfectly acceptable.