SSIS #97 – When MERGE JOIN is your friend?
What does Merge Join in SSIS do?
Merge Join task in SSIS is actually a very straightforward task to use, if you know this:
- The Merge Join Transformation joins two data sets on a common key (or keys).
- It is the same as performing an SQL Join – the options you have are Inner Join, Full Outer Join and Left Outer Join.
What is the difference between Merge Join in SSIS and SQL Joins?
If you are a SQL developer, you would ask, “If I can do joins in SQL, why would I use merge Join”? On the other hand, if you are an open-minded SQL developer, you would ask, “when do I use Merge Join in SSIS, instead of SQL Join”? If you are an open-minded and a very stubborn SQL developer, you would ask, “what is the real differences between Merge Join in SSIS and SQL Join”? Here are the differences:
- In Merge Join, it is SSIS rather than SQL that is doing the join, so it’s an in-memory activity.
- Merge Join is a blocking transformation, that is, it needs to receive all data from all inputs before it can proceed to merge the data.
- Merge Join can do joins on data sets from difference data sources, and SQL joins cannot.
When did Merge Join in SSIS become my friend?
Your first reaction is performance hit, just like me, until you see the last difference. You want to give it a try! I don’t use Merge Join very often in my ETL work, because I am too much a STAGING person. But there are a few places where I would rather use Merge Join in SSIS, than staging any of those ugly/huge tables in DB2. They are ugly, because they are very difficult to use, and they are huge with 10-20 million records. I got frustrated with all the wrong where clauses that gave me the wrong/missing data. My life as a BI developer will be so much easier, only if I can:
- get all the data from the ugly/huge table in DB2 (with only the where clause I know for sure, so that no data will be missing)
- join to my nice/small table on SQL server
This is when I decided that I wanted to be friend with the Merge Join!
Data sets must be sorted when using Merge Join in SSIS
Once the decision is made, using and configuring the Merge Join is actually very straightforward, but only if you know this:
My data sets from DB2 and SQL Server must be sorted on the common key(s) first
Two ways to meet the sorting requirement
There are actually two ways to meet the sorting requirement:
- You can ask SSIS to sort the data set for you by using the Sort Transformation task, or,
- you can tell SSIS that your data sets are already sorted by setting the IsSorted property of the OLE DB Source Output to true (must use the advanced Editor)
Finally, I’ll show you some basic steps to make it work.
Steps when using Merge Join in SSIS
Below is the final screen shot. On the left, in the OLE DB Source, I have a query that creates a data set from DB2; on the right, in the OLE DB Source, I have a query that creates a data set from SQL Server.
The left data set can be a huge data set, which can have about 10 million records. The right data set is a very nice and small data set that has only about 5K records.
Step 1: Create your input data sets.
I assume that you know how to write queries for OLE DB Source to create data sets.
Step 2: Optionally, set IsSorted = True for your OLE DB Source Output
If you query has done the sorting, you can set the IsSorted property to true on the Advanced Editor.
On the Advanced Editor, go to the last tab, Input and Output Properties.
My example showed False. You can certainly set it to True if you have done the sorting in your query.
Step 3: Optionally, ask SSIS to sort the data sets
In my example, I asked SSIS to sort the data set on one column.
Remember, both of the data sets must be sorted on the same key(s).
Step 4: Configure the Merge Join
Here is the good news: there is no Advanced Editor for this Merge Join transformation task.
1) Join Type: pick from these three options. I picked Inner Join.
2) Pick Join Key from the two data sets. SSIS has picked the right key for me, because those are the key I’ve sorted the data set on.
3) Pick the output fields from the two data sets.
You should be done by now…but hang on…
Step 5: Two more things about Merge Join
If you look at the Merge Join properties, you will notice two more properties that might need your attention.
MaxBuffersPerInput – if your data sets are huge and the sorting is somehow incorrect, you will need to experiment with this number.
TreatNullsAsEqual –this property decides whether to join Null values or not. By default it does. In my example, there should be no NULL value. So I didn’t bother with this.
One last note, the result from the Merge Join is also sorted! This is a good news. If you need to use the Merge Join output in other transformations, there is no more sorting needed.