SSIS – Staging in Relational Environment or in Raw Files?
If you’ve ever used SQL Server Integration Services (SSIS), you will have come across raw files. Microsoft introduced them with SSIS as a mechanism for storing data on the local file system. The design of raw files enables data to be written very quickly but, unfortunately, to achieve this performance Microsoft has used a proprietary binary storage format.
Raw files are proprietary sources and destinations in the SSIS data flow that are only accessible through SSIS. The files are binary files that contain all the metadata information for the file (data types and column names) in the file header. Because this file is proprietary to SSIS, it’s the fastest source and destination.
When to use raw files?
I’ve read about using raw files for error storing. One of the big benefits of working with raw files is the ease with which you can add them to your packages. The raw file destination requires you to specify only the name of file to use and the columns you want to store. If you use a text file destination, then you have to build the structure of the file and ensure it makes your data flow. If you use an OLEDB destination, then you have to have a database and also a table created to store the data.
Their ease of configuration makes raw files ideal for storing errors in your data flows. All you need to do is direct your error flows to a raw file destination, specify and filename and select the columns you want to store. If errors do occur, then they will be stored in the raw files. Of cause, you will need to use SSIS to be able to interrogate the errors stored in these raw files.
Should you use raw files for all your staging needs in the ETL process? I doubt that’s a smart choice for the following reasons.
- Debugging is difficult. You have to rely on the built-in debugging tools in SSIS.
- Interrogate the data is almost impossible. A few blogs suggested to extract data from sources and stage them in raw files. This is fine if you will never need to interrogate the data.
I am mostly for “staging in relational environment” for the reason that I can profile the data easily, and interrogate the data easily during troubleshooting.
But I do see three areas where staging in the raw file can be a good choice:
- Store errors as described above
- Staging for temporary needs where you will not need to interrogate the staged data
- Breaking up a big extract into more practical files for later loading