You might already know where I stand on using raw files or not. In a DI (Data Integration) project where data traceability and lineage are crucial, I find very little use for raw files. I prefer well-defined milestone points and staging in a SQL database over staging in raw files.
However, coming back to a reporting team where process performance is more critical than data traceability, I find that raw files can be a life-saving tool for the daily ETL processes.
So I ventured out and created a test package as a proof of concept. It is indeed straightforward enough.
- Extract a table from the source (DB2) that contains about 10K records: I have no need to stage the data in a SQL or DB2 database.
- One of the columns, software vendor (mname), needs to be standardized against a table in our SQL database.
- To get the standardized software vendor, I need to use a Merge Join between the raw file and the SQL table tblStandardPCSWVendor.
- Finally, I need to stage the merged result (with the standard software vendor) in a SQL table tblMergeJointTest.
Control flow: here are the 3 simple steps.
- EXT_RAW_inv_application: this data flow step extracts raw data from the source and stages the result in a raw file, whose path and file name are stored in a variable, varRawFileInvApp.
- TRUNC_tblMergeJointTest: this Execute SQL Task step simply truncates my staging table tblMergeJointTest.
- STG_tblMergeJointTest: this data flow step uses the raw file from above as a data source, merges it with table tblStandardPCSWVendor, then sends the result to the staging table tblMergeJointTest.
Step 1: Define variables for the raw file folder and name:
First, define varWorkFolder: this variable defines the root folder for all other working folders.
Second, define varRawFileInvApp: because I don’t want to repeat my root folder, which is already defined in the first variable varWorkFolder, I’ll use an expression here. Change the property EvaluateAsExpression to True, then type the following in the Expression:
@[RAW::varWorkFolder] + "\\INVAPP.raw"
Also, define the Namespace for both variables as RAW instead of the default User. This way it’s easy for me to pick them out from any drop down list.
Step 2: Data Flow EXT_RAW_inv_application:
This data flow has an OLE DB Source.
Here is the script. I included an ORDER BY clause here, because I will use the raw file later in a Merge Join, which requires both of its sources to be sorted.
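As a sketch, the extract query would look something like the following; apart from mname, the table and column names are hypothetical and stand in for the real DB2 source:

```sql
-- Hypothetical DB2 source query; only mname is from this walkthrough.
-- The ORDER BY is what lets the downstream Merge Join treat this
-- raw file as a pre-sorted input.
SELECT  invnt_id,          -- hypothetical key column
        mname,             -- software vendor, to be standardized
        install_date       -- hypothetical attribute
FROM    inv_application
ORDER BY mname;
```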
Follow these steps to configure the Raw File Destination:
- AccessMode: File name from variable
- FileNameVariable: pick RAW::varRawFileInvApp from the drop-down list
- WriteOption: pick Create Once for the first run (after your package has run successfully once, change it to Truncate and Append, and also change ValidateExternalMetadata to False).
(Tip: if you need a release coordinator to do the production release for you, create a small raw file by limiting the number of rows in the source, and save the raw file. Your release coordinator can copy the small raw file into the right folder on the production server.)
Step 3: STG_tblMergeJointTest
This data flow has the following steps.
Raw File Source: follow these steps to configure it.
SortKeyPosition: change it from 0 to 1 for column mname (in the Advanced Editor, also set IsSorted to True on the output). This is the column we sorted the data set on, and the only column we need for the Merge Join later. Note that SSIS will not validate whether your raw file is actually sorted by this column.
OLE DB Source tblStandardPCSWVendor: the query looks like this, and it is sorted on column ExistingSoftwareVendor. This is the column the Merge Join will use to join with mname.
Note: if we do not sort the data set here, then we will need to use the Sort transformation to sort it before the Merge Join.
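The standardization query would be along these lines; only the table name and the ExistingSoftwareVendor column come from this walkthrough, and the standardized-value column name is an assumption:

```sql
-- Sorted on the join key so it can feed straight into the Merge Join.
SELECT  ExistingSoftwareVendor,
        StandardSoftwareVendor   -- hypothetical name for the standardized value
FROM    dbo.tblStandardPCSWVendor
ORDER BY ExistingSoftwareVendor;
```

Remember that sorting in the query alone is not enough for an OLE DB Source: just as with the Raw File Source, you must tell SSIS about the sort by setting IsSorted and SortKeyPosition in the Advanced Editor.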
Merge Join, Left Outer Join (mname joins ExistingSoftwareVendor): the Merge Join transformation should be very straightforward to configure. It is equivalent to a SQL join, with options of Inner, Left Outer, and Full Outer Join (for a right outer join, you swap the inputs).
The Join Key is automatically set for you, because SSIS knows the Sort Key on each data set at this point.
OLE DB Destination tblMergeJointTest: finally this step stages the Merge Join result in a SQL table tblMergeJointTest.
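Logically, the whole data flow is equivalent to this single T-SQL statement (a sketch: the StandardSoftwareVendor column name and the extra columns are assumptions, and in reality the left side lives in the raw file, not in a table):

```sql
-- What the package does, expressed as one set-based statement:
-- left-join the extract to the standardization table and stage the result.
INSERT INTO dbo.tblMergeJointTest (mname, StandardSoftwareVendor)
SELECT  src.mname,
        std.StandardSoftwareVendor
FROM    inv_application AS src                     -- contents of the raw file
LEFT OUTER JOIN dbo.tblStandardPCSWVendor AS std
        ON src.mname = std.ExistingSoftwareVendor;
```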
The test package above uses a raw file as a destination first, then uses it as a data source, Merge Joins it with the standardization table, and finally stages the result in a SQL table.
Adding to the job description collection. Pay attention to the work deliverables.
The architect role would be more of a data analyst/modeler: they will work with the SAS developers and act as the SME for the data models and database structures. When the SAS developers are developing, they will reach out to this data architect/modeler/analyst to help figure out where to pull the data from and which data models to use, and to provide general data architecture, modeling, and analysis knowledge for the SAS developers.
Need someone with a strong data architecture, data modeling, and SQL background.
A data architect to assist with data mappings, data modeling, data profiling, query design, data flow design, etc. between multiple databases across multiple platforms. The Data Architect will support and contribute to the DI/ETL architecture, act as the data SME (subject matter expert) with respect to data profiling assessments, and participate in JAD sessions/interviews.
* Perform data profiling and analysis to assess data quality/patterns; recommend data cleansing rules, conforming data standardization rules, and matching algorithm changes to the CDI and data bus architecture (dimensions/facts)
* Database query performance and tuning (P&T) expertise on PL/SQL blocks and SQL statements.
* ETL Workflow and data mapping analysis, review, and tuning.
* Strong design artifact experience in documentation of data flow and data mapping and data exchange specifications.
* ETL development experience and technical adeptness at reviewing and initial prototyping tasks.
Lead Data Architect with solid project lifecycle and database design experience, and with complementary Business Analyst skills. This position is focused on building databases for transactional systems and the chance to work with new technologies in a collaborative team environment.
Function as a DA lead. Responsible for the creation of the following work deliverables:
- Data Sourcing
- Data Profiling/Gap Analysis
- Client Model
- Data Dictionary
- Data Exchange Specification
- Data Flow Diagram
- DDL Scripts
- ETL Data Mapping
- Building conceptual, logical and physical database designs
- Participating in data development and analysis, and data dictionary management
- Maintaining new and existing database systems and troubleshooting issues as needed
- Ensuring data consistency and accuracy across all systems
- Collaborating with business analysts, ETL developers, and other key internal personnel on a wide scope of data architecture functions
- Interface with business stakeholders
- Candidates must have solid experience working through the project development life cycle, building a conceptual, logical and physical database design.
- A background in data analysis in a data warehouse environment and data modeling is required.
- SQL skills and MS Office proficiency are needed.
- Ideal applicants would have experience in healthcare, Oracle and DB2.
- A Bachelor’s degree is needed.
- 4+ years’ experience with conceptual, logical, and physical database design, and data analysis skills.
- 4+ years’ experience with data modeling tools such as ERwin or PowerDesigner.
- Healthcare experience, SAS, or experience working with large products is considered a plus.
After 8 months away from the bank, I recently went back to the same team, and got back my old job.
At the time of my leaving last year, it seemed that the direction was favoring Analysis Services (SSAS, or the cube, as we all “affectionately” call it) and moving away from Reporting Services. As the chief cube designer told me, the cube had encapsulated all the business rules and therefore represented the “truth”, the one and only truth of the data. He was very unhappy when I told him that my manager had not told me to use his “master” cube as the only data source yet.
After the first few days of getting re-acquainted with my old co-workers, my first real technical question was: what happened to the cube? I see that the majority of the SSRS reports I developed are still in use. The answer was, “it died a harsh death”. One day the word came from above: abandon the cube; no more SSAS, no more MDX, just plain old SQL, SSRS, and SSIS. The lesson, according to my co-worker, is that we need more experience, more best practices, and more standards for developing cubes.
You can imagine how “disappointed” I am. Not really. I am glad that it died, because it wasn’t designed right to be the only truth. Hopefully we have all learned our lessons.
I wrote a blog post, SSIS – Staging in Relational Environment or in Raw Files?, to argue that the raw file format has its place in the ETL process, but is not the only choice. The post was “inspired” by an SSIS package design I saw that has no milestone points and no staging, neither in a relational database nor in flat files.
I read a blog post today, 31 Days of SSIS – Raw Files Are Awesome (1/31). I agree that the raw file format can be incredibly useful. It can be used as either a data source or a data destination.
Here are some of the scenarios that the author is suggesting where raw files can be awesome:
1) If you want to build an SSIS package that processes data in multiple steps, temporary versions of the data can be stored in raw files and passed between data flows. This allows packages to be built so that they can be restarted at different points in the data processing.
2) Raw files can be used to exchange data between SSIS packages. You can develop one package to convert a client’s file to a standardized structure in the raw file format, and then have other packages read that format and process the data in the standard form. All of this can be done without hitting the database.
3) You can export data from a database to a raw file to use as a reference file for multiple SSIS packages. If you need to cycle through a group of IDs or states multiple times, store the data in a raw file and use it as a data cache. Sure, you can use a Lookup Cache instead – but that doesn’t fit all scenarios, and not everyone is using SQL Server 2008.
If you have written stored procedures as data sources for SSRS reports, you must have noticed some weird behavior in the dataset designer. But after a while, you would figure out the reason for the odd behavior: the dataset designer initially does not have the “contract” of the column definitions of the data returned from the procedure, i.e. the metadata that SSRS needs to properly populate the column definitions.
Once you understand the reason, you would start to figure out a way to “fool” SSRS.
1) When you first set up a dataset in SSRS to use a stored procedure, go ahead and click OK in the dataset designer. It will throw an error because it is unable to obtain the column definitions. No need to be scared of the error message. Finish by going to the Parameters tab and setting up your parameters there.
2) Close the data set designer, and click the "Refresh fields" button on the tool bar. SSRS will prompt you for input parameters. Based on the results returned from the execution, SSRS will get the metadata needed and will be able to populate the column definitions.
Now you might be feeling more comfortable about writing stored procedures as data sources for SSRS reports. However, this is only one side of the story. The other side is from within the stored procedure: you MUST take care that your procedure returns the SAME “contract” of columns under ALL possible scenarios. If you don’t, SSRS will throw error messages you might not understand.
Here is an example to show how you would take care in your procedure to cover all possible scenarios.
Here is a simple template I would use. You can certainly write the procedure in your own style, but the key is to make sure that your stored procedure returns the same column definitions from every possible branch. This applies especially when you have return statements inside IF statements.
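The sketch below illustrates the idea; the procedure, table, and column names are made up, and the only point is that both branches return identical column names and types:

```sql
CREATE PROCEDURE dbo.uspReportData   -- hypothetical procedure
    @ShowDetail bit
AS
BEGIN
    SET NOCOUNT ON;

    IF @ShowDetail = 1
    BEGIN
        -- Detail branch: same column list, names, and types as below.
        SELECT CustomerID = c.CustomerID,
               Amount     = c.Amount
        FROM   dbo.Customer AS c;
        RETURN;
    END

    -- Summary branch: the "contract" must match the branch above exactly.
    SELECT CustomerID = c.CustomerID,
           Amount     = SUM(c.Amount)
    FROM   dbo.Customer AS c
    GROUP BY c.CustomerID;
END
```

If one branch returned an extra column, or the same column under a different name, SSRS would only know about the shape it saw when the fields were refreshed, and the report would break at run time.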
As for SSIS, I use stored procedures in Execute SQL tasks, not to return data sets, but to process data and create milestone points and staging.
Many people have written in their blogs about calling stored procedures in a data flow. In that case, the stored procedure does not create physical staging, but returns a data set to the data flow as a source. A similar problem with the “contract” of column definitions happens here. Unfortunately, SSIS isn’t easy to “fool” in this case. Without the metadata of the columns, the SQL command text editor will not let you move on.
Here is a blog, “31 Days of SSIS – No More Procedures (20/31)”, arguing that instead of using stored procedures, the best practice is to use table-valued functions in SSIS.
Here is its argument:
Whether the table-valued function is inline or multi-line, there is always a defined set of columns that will be returned. The SSIS data source knows specifically what will be returned from the function.
I mostly agree with this and will test it when I have time. However, table-valued functions are not procedures, and have their own limitations. Some of the limitations are well put in this blog post, SQL SERVER – User Defined Functions (UDF) Limitations, such as:
- No access to stored procedures
- Prohibits usage of non-deterministic built-in functions
- Returns only one result set or output parameter
- Cannot execute dynamic SQL or use temporary tables
- Cannot return XML
- Does not support error handling
Table-valued functions have their place, such as performing repetitive tasks in a SELECT statement or modularizing frequently used code. They’re not really a tool for processing data with complex business rules.
I might still stick with stored procedures in the future. But it’s good to know all the options out there.
There are two options, as I see it, for controlling transactions in an SSIS package. One is plain old T-SQL, using TRY...CATCH to trap errors and BEGIN/COMMIT/ROLLBACK TRANSACTION to handle transactions. There is a good article here, Handling SQL Server Errors.
The other choice is to use the built-in transaction support in Integration Services, with the TransactionOption property (Required, Supported, NotSupported) at the package, container, and task level. Here is a good article on that, How To Use Transactions in SQL Server Integration Services (SSIS).
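As a sketch, the T-SQL pattern looks like this; the two table names are placeholders for whatever statements must succeed or fail together:

```sql
BEGIN TRY
    BEGIN TRANSACTION;

    -- Placeholder statements that must commit or roll back as one unit.
    UPDATE dbo.SourceTable SET Processed = 1 WHERE Processed = 0;
    INSERT INTO dbo.TargetTable (Col1)
    SELECT Col1 FROM dbo.SourceTable WHERE Processed = 1;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;

    -- Re-raise so the caller (e.g. an Execute SQL task) sees the failure.
    DECLARE @msg nvarchar(2048) = ERROR_MESSAGE();
    RAISERROR (@msg, 16, 1);
END CATCH
```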
However, using the built-in transaction support in SSIS requires the Distributed Transaction Coordinator (MSDTC) service running on the Windows server where SSIS resides.
The Distributed Transaction Coordinator (MSDTC) service is a component of modern versions of Microsoft Windows that is responsible for coordinating transactions that span multiple resource managers, such as databases, message queues, and file systems. MSDTC is included in Windows 2000 and later operating systems, and is also available for Windows NT 4.0.
1/23/2011: An update on my test on using TransactionOption (Required, Supported, NotSupported) property in SSIS
Result: unfortunately this solution did not work for me so far. In every SSIS package I designed, I have an Execute SQL task, AUD_tblamETLAudit_OnError, in the OnError event handler at the package level to log the error into an audit table. For some reason unknown to me, as soon as I changed the TransactionOption property at the package level from the default “Supported” to “Required”, the Execute SQL task in the OnError event handler was never executed. This is a show stopper for me. By the way, I didn’t change the default transaction isolation level, Serializable.
I will need to format my T-SQL with a simple CSS stylesheet. http://www.manoli.net is a very nice site for that. However, I will need to upgrade my WordPress blog in order to use custom CSS. I will look into that later.