SSIS # 102 – Don’t be afraid to leverage TSQL in SSIS package
I hope so far you are convinced that you need to learn at least one ETL tool in order to stay competitive in the SQL/BI profession. I also hope that Microsoft SSIS is your choice.
I’ve worked with two types of people with very different views on ETL tools.
One is to be very skeptical about using an ETL tool over hand coded SQL code. They are even more skeptical when they see I am using stored procedures in a SSIS package. Their argument is that if all can be done in a stored procedure, why use both stored procedures and SSIS. Would it be simpler to just stick to one tool?
Another type tends to completely abandon TSQL code, in favor of data flow tasks only, such as Union All, Merge, Sort, Lookup, Slowly Changing Dimension etc.
To the first group of friends, I hope they have eventually been convinced that an ETL tool is the way to go for ETL work.
Some high level benefits of SSIS:
- visual representation of work
- control flow (Including multiple execution paths, and conditional execution path)
- data transformations from source to target
- event logging
- source and target do not need to be on the same server or same DBMS
- audit details
- package configurations for secure and easy deployment
- memory management
In this blog, I’d like share with you one very important lesson I’ve learned. That is never be afraid to leverage TSQL in SSIS packages.
Why use the Sort data flow task when you can write TSQL code to SORT the data?
Why use the Merge task when you can write TSQL code to JOIN tables?
Why bring too much from the source when you can write TSQL code with proper WHERE clause to limit it to only the data you need?
Why use Derived Column task when you can write TSQL code to derive the column?
Why use data flow at all if you can write a stored procedure that encapsulates all business logics and have it executed from a Execute SQL Task?
Set-based INSERT and UPDATE
This is not just because we want to leverage the SQL Server engine’s computing power, but also for another very good reason. A very simple reason. That is to prefer set-based INSERT and UPDATE over row-by-row in-memory operation.
If you are working with large amount of data, writing TSQL set-based code to do INSERT and UPDATE, can quickly become your only option. In this case, sending large changes into staging will be your first step.
A reader once asked me if using SCD (Slowly Changing Dimension) is the only way for loading dimensions. I hope you have drawn your own conclusion so far. No, it’s is not the only way. It’s not always the best way either.