SSIS – A Strategy for Performance
Here is a very useful article about the design strategy for performance in ETL: http://msdn.microsoft.com/en-us/library/cc966530.aspx
Performance is a make or break success factor for data integration solutions. Performance influences how well your solutions can scale and how quickly they can be adapted to changing business needs: increasing data volumes, shrinking batch windows, expanding data complexity, and emerging data quality standards.
The above article addresses four performance elements: operations, volume, application, and location.
Operations is the key factor of the performance strategy.
To reduce the size of the data set and improve the performance of the overall data load, you should configure data types early in the sequence of operations.
Data type optimizations do not just apply to file data sources. Setting data types early in the SSIS data load process, as appropriate, can provide performance benefits for all data sources.
Example: Operations to Load and Aggregate a Flat File into a SQL Server Table
- Read data from the text file
- Load data from the text file into the SSIS Data Pipeline
- Aggregate data
- Open a transaction on SQL Server
- Pass data to the SQL Server interface
- Load the data into each SQL Server destination table
- Commit the transaction
Shrinking Data Set Size – To illustrate the concept of shrinking data sets, consider the following example. Assume that the 5,000,000-record source text file has 100 columns of numbers between 1 and 999. If you leave the default string data type for these columns, each column will need 50 bytes per column per record. This means that the calculated size of each row will be approximately 5,000 bytes, plus some additional memory overhead that SSIS requires. So, with a volume of 5,000,000 records and 5,000 bytes per row, the total size of the file will be approximately 23 gigabytes (GB). Consider the difference in the size of the data set if you define the columns more accurately as 2-byte integers. This will reduce the size per row to 200 bytes and the total size of the file to 954 megabytes (MB), a savings of approximately 22 GB over the string data type. With this small change in data type size, you now have far less data to load into memory. Note that this example is included to illustrate the relative difference of data types and the savings involved with shrinking the size of the data set. It is not intended to be a hard and fast sizing guideline.
Maximizing Throughput – In addition to configuring data types, you can also maximize throughput by configuring SSIS buffer settings such as DefaultMaxBufferSize and DefaultMaxBufferRows.
Defining Parallel Operations – The third way to sharpen volume is to consider how you can parallelize operations to maximize efficiency. SSIS supports parallel processing of packages, tasks, and transformations. When you design parallel processes, you need to consider how parallelism applies to all data load operations. For example, if you need to load and aggregate multiple files into SQL Server, you might consider performing one or more of the following operations in parallel: file reading, aggregation, and destination table loading. Although SSIS supports the parallelism of all these operations, you still have to be mindful of your machine resources and the level of parallelism that they can support. You might build a highly parallel SSIS package and then later realize that your machine does not have enough CPU and memory to take full advantage of the parallelism.
Assessing Application Alternatives
Alternative 1 – BULK INSERT Solution – You could certainly use BULK INSERT to load the data from the source file into SQL Server. However, you would also need to include additional operations to aggregate the data before loading it into the final destination. One way to accomplish this would be to use BULK INSERT to load the data into a staging table, use Transact-SQL to aggregate the data from the staging table, and then use Transact-SQL to load into the destination table.
Alternative 2 – SSIS Solution – SSIS offers a one-stop shop where you can both aggregate and load data to the destination table. In addition, within SSIS you can add workflow to control the order of operations and even set up execution logging to record the progress of operations as well as the error handling to capture any invalid records.
When you compare the two previously listed alternatives, Alternative 2, the SSIS solution, eliminates the need for a staging table and encapsulates all data load logic and workflow into a single SSIS package. To achieve the same functionality with Alternative 1, the BULK INSERT solution, you would not only have to use Transact-SQL to aggregate data, but you might also have to consider using it to encapsulate all the data logic in a stored procedure. So, based on this comparison, the SSIS solution proves to be an excellent choice for the aggregate-and-load example, thus providing the right amount of functionality with the ability to centralize and manage workflow.