Home > SSIS Best Practices > SSIS #97 – When MERGE JOIN is your friend?

SSIS #97 – When MERGE JOIN is your friend?

What does Merge Join in SSIS do?

Merge Join task in SSIS is actually a very straightforward task to use, if you know this:

  • The Merge Join Transformation joins two data sets on a common key (or keys).
  • It is the same as performing an SQL Join – the options you have are Inner Join, Full Outer Join and Left Outer Join.

What is the difference between Merge Join in SSIS and SQL Joins?

If you are a SQL developer, you would ask, “If I can do joins in SQL, why would I use merge Join”? On the other hand, if you are an open-minded SQL developer, you would ask, “when do I use Merge Join in SSIS, instead of SQL Join”? If you are an open-minded and a very stubborn SQL developer, you would ask, “what is the real differences between Merge Join in SSIS and SQL Join”? Here are the differences:

  • In Merge Join, it is SSIS rather than SQL that is doing the join, so it’s an in-memory activity.
  • Merge Join is a blocking transformation, that is, it needs to receive all data from all inputs before it can proceed to merge the data.
  • Merge Join can do joins on data sets from difference data sources, and SQL joins cannot.

When did Merge Join in SSIS become my friend?

Your first reaction is performance hit, just like me, until you see the last difference. You want to give it a try! I don’t use Merge Join very often in my ETL work, because I am too much a STAGING person. But there are a few places where I would rather use Merge Join in SSIS, than staging any of those ugly/huge tables in DB2. They are ugly, because they are very difficult to use, and they are huge with 10-20 million records. I got frustrated with all the wrong where clauses that gave me the wrong/missing data. My life as a BI developer will be so much easier, only if I can:

  • get all the data from the ugly/huge table in DB2 (with only the where clause I know for sure, so that no data will be missing)
  • join to my nice/small table on SQL server

This is when I decided that I wanted to be friend with the Merge Join!

Data sets must be sorted when using Merge Join in SSIS

Once the decision is made, using and configuring the Merge Join is actually very straightforward, but only if you know this:

My data sets from DB2 and SQL Server must be sorted on the common key(s) first

Two ways to meet the sorting requirement

There are actually two ways to meet the sorting requirement:

  • You can ask SSIS to sort the data set for you by using the Sort Transformation task, or,
  • you can tell SSIS that your data sets are already sorted by setting the IsSorted property of the OLE DB Source Output to true (must use the advanced Editor)  

Finally, I’ll show you some basic steps to make it work.

Steps when using Merge Join in SSIS

Below is the final screen shot. On the left, in the OLE DB Source, I have a query that creates a data set from DB2; on the right, in the OLE DB Source, I have a query that creates a data set from SQL Server.

The left data set can be a huge data set, which can have about 10 million records. The right data set is a very nice and small data set that has only about 5K records.

image

Step 1: Create your input data sets.

I assume that you know how to write queries for OLE DB Source to create data sets.

Step 2: Optionally, set IsSorted = True for your OLE DB Source Output

If you query has done the sorting, you can set the IsSorted property to true on the Advanced Editor.

image

On the Advanced Editor, go to the last tab, Input and Output Properties.

My example showed False. You can certainly set it to True if you have done the sorting in your query.

image

Step 3: Optionally, ask SSIS to sort the data sets

In my example, I asked SSIS to sort the data set on one column.

Remember, both of the data sets must be sorted on the same key(s).

image

Step 4: Configure the Merge Join

Here is the good news: there is no Advanced Editor for this Merge Join transformation task.

1) Join Type: pick from these three options. I picked Inner Join.

image

2) Pick Join Key from the two data sets. SSIS has picked the right key for me, because those are the key I’ve sorted the data set on.

3) Pick the output fields from the two data sets.

image

You should be done by now…but hang on…

Step 5: Two more things about Merge Join

If you look at the Merge Join properties, you will notice two more properties that might need your attention.

image

MaxBuffersPerInput – if your data sets are huge and the sorting is somehow incorrect, you will need to experiment with this number.  
TreatNullsAsEqual –this property decides whether to join Null values or not. By default it does. In my example, there should be no NULL value. So I didn’t bother with this.

One last note, the result from the Merge Join is also sorted! This is a good news. If you need to use the Merge Join output in other transformations, there is no more sorting needed.

Categories: SSIS Best Practices
  1. DEREK
    August 23, 2011 at 1:56 am

    I suppose it should be semi-block component if the input result sets are sorted. when keys are different in comparison, then go though the pipeline.

    Like

  1. August 26, 2011 at 4:20 am

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: