SSIS #108 – The Makeover of the Lookup Data Flow Component in SSIS 2008
Lookup data flow component is a very popular design pattern in SSIS. The idea is to determine whether a record in the pipeline already exists in the intended destination table or not, and then decide the next action.
SSIS has many out-of-box components that are in ready-to-use mode. They can be used in many different scenarios with default configurations and achieve good performance with little tuning on our side. The Lookup data flow component is one of them, until you run into a surprising situation where the data flow takes much longer than you have anticipated. Configuration of the Lookup component is simple enough. After you have passed the first step of configuring the component, the main focus is on how to achieve the optimal performance.
The makeover of the Lookup Component in SSIS 2008
As popular as it is, I have not used it a lot in SSIS 2005. In SSIS 2008, however, I was pleasantly surprised by a few improvements.
The new UI make it easier for configuration. The Cache mode now has three different choices. The Connection type now includes a Cache connection manager, and the output now includes a “Redirect rows to no match output”.
The concept of caching is not difficult to understand. When using SSIS data flow components, Integration Services load the necessary data into the memory. When not all the data can be loaded into the memory at once, Integration Services load and process data one buffer at a time. See Understanding SSIS Data Flow Buffers on MSDN. The Cache mode tells Integration Services your choice of whether to load and process the lookup data all at once into memory, not at all, or somewhere between. Full cache, in most cases, will give you the optimal performance since all the lookup data will be loaded into the memory.
Cache connection manager
Cache mode is not really a new concept in SSIS 2008. Cache connection, however, is a major improvement in SSIS 2008. Now, we are able to use a cache for our lookup data prior to the the execution of the dataflow. Cache Connection is a new connection manager type in SSIS 2008.
With this new type of connection type in SSIS 2008, now we are able to populate our cache many different types of sources, such as text files, raw files, XML files, Excel files, a recordset, or anything that can be accessed using an ADO.Net provider. This is compared to SSIS 2005 where we were limited to use an OLE DB source. This means that the cache can be used in multiple lookups. Imagine when using the Foreach iteration in a loop in SSIS 2005, we would repopulate the cache for every iteration.In SSIS 2008, the cache needs only to be populated once. If you had performance issue with your Lookup components in SSIS 2005, now it’s time to re-examine it. The new cache connection manager might be a life saver.
Creating a new Cache connection is simple. Select New Connection, and then CACHE.
Redirect rows to no match output
Another improvement is the addition of Redirect rows to no match output. In SSIS 2005, unmatched rows must be redirected as error output. This is a cosmetic improvement, but it does make me smile when I do not need to use the red error output.
In one of my usages of the Lookup component, I would only take the unmatched records,and insert them into the destination table.
In the Connection tab, I’d use a query (instead of data from the entire table) to get only the lookup column.
The DayID is also the only column I need for output.
The new improvements in the Lookup components in SSIS 2008, especially the Cache connection type, makes it using the Lookup component much more fun.