Most of us know ETL is an acronym for Extract, Transform, and Load, even if we have never been actively involved in an ETL project. Some of us also know it as ECCD (Extract, Clean, Conform, and Deliver) if you have read the book The Data Warehouse ETL Toolkit, or as ADCID (Access, Discover, Cleanse, Integration, Deliver) if you have read publications from Informatica.
All of these acronyms are good reminders for ETL designers and developers to produce high-quality ETL work with rigorous discipline. What do I mean by that? Here is an example: I use the ECCD acronym to create milestone points and staging for the ETL processes I've designed.
What about other vocabulary we've heard of that isn't in these acronyms? I've talked about the sub-systems, such as data profiling, validation, and standardization, in another blog, What is ETL?. What about the following practices? I am putting down my notes here because they are important issues for any ETL process.
Naming Conventions: this includes codes, prefixes, and standard descriptions used across the project. Naming conventions will differ from project to project; the key is to stick to the chosen convention religiously.
Modular Design: each ETL process should have no more than six stages, with each stage designed to carry out one distinct part of the processing.
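To make the idea concrete, here is a minimal sketch of a modular pipeline in Python: each stage is a small, single-purpose function, run in a fixed order. The stage names follow the ECCD pattern mentioned above; the function bodies and sample data are purely illustrative, not part of any real package.

```python
# Hypothetical modular ETL pipeline: one function per stage,
# each stage doing exactly one part of the processing.

def extract(ctx):
    # Stand-in for reading from a source system.
    ctx["rows"] = [{"id": 1, "name": " Alice "}, {"id": 2, "name": "bob"}]
    return ctx

def clean(ctx):
    # Remove stray whitespace from the source data.
    for row in ctx["rows"]:
        row["name"] = row["name"].strip()
    return ctx

def conform(ctx):
    # Standardize values to a common format.
    for row in ctx["rows"]:
        row["name"] = row["name"].title()
    return ctx

def deliver(ctx):
    # Stand-in for writing to the target; record what was delivered.
    ctx["delivered"] = len(ctx["rows"])
    return ctx

STAGES = [extract, clean, conform, deliver]

def run_pipeline():
    ctx = {}
    for stage in STAGES:
        ctx = stage(ctx)
    return ctx
```

Because each stage takes and returns the same context, stages can be tested, replaced, or restarted individually, which is exactly what the failover strategy below relies on.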
Failover and Recovery: a large or time-consuming task should be designed as a separate job, so that the core process is not a long-running process. When no stage in the main process carries a large load, the failover and recovery strategy is simply to restart the failed stage. The standard best practices of milestone points, staging, and modular design help tremendously here. If the entire process is small, restarting from the Extract stage at the beginning will certainly work too.
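One simple way to get restart-from-the-failed-stage behavior is a checkpoint file that records which stages completed. The sketch below is a hedged illustration of that idea, not SSIS's own checkpoint mechanism; the file name and structure are assumptions.

```python
import json
import os

CHECKPOINT = "etl_checkpoint.json"  # hypothetical checkpoint file

def load_checkpoint():
    # Resume state from a previous failed run, if any.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"completed": []}

def save_checkpoint(state):
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)

def run_with_recovery(stages):
    """Run (name, fn) stages in order, skipping any already marked
    complete, so a failed run restarts from the failed stage."""
    state = load_checkpoint()
    for name, fn in stages:
        if name in state["completed"]:
            continue  # finished in a previous run; do not redo it
        fn()
        state["completed"].append(name)
        save_checkpoint(state)
    os.remove(CHECKPOINT)  # whole run succeeded: clear the checkpoint
```

If a stage raises, the checkpoint survives, and the next invocation picks up at the stage that failed instead of re-running the whole extract.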
Error Handling: the target is 100% error trapping of jobs. I've designed a component in SSIS to handle trapping and logging errors.
Notification: email will be the primary notification vehicle, integrated with the error-handling and process-reporting routines so that information about both successful and unsuccessful runs is delivered to the administrator and stakeholders. Again, the component I designed in SSIS performs the notification.
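Outside SSIS, the same notification can be done with the standard library. A hedged sketch, assuming an SMTP relay is available; the addresses, subject format, and host are all made up for illustration.

```python
import smtplib
from email.message import EmailMessage

def build_notification(job, status, row_count, recipients):
    """Build a status email for one job run, covering both
    successful and unsuccessful outcomes."""
    msg = EmailMessage()
    msg["Subject"] = f"ETL {job}: {status}"
    msg["From"] = "etl@example.com"        # hypothetical sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(f"Job {job} finished with status {status}; "
                    f"{row_count} rows processed.")
    return msg

def send_notification(msg, host="localhost"):
    # Assumes an SMTP relay is reachable on the given host.
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(msg)
```

Feeding this the same status and row-count values used for process reporting keeps the email content and the audit trail consistent.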
Release Management: Team Foundation Server is the primary version control tool I've used so far. If you are using SQL Server Integration Services, there are features you can certainly take advantage of; see my other SSIS blogs for this.
Process Reporting: the status and row counts of jobs will be retrieved for accurate process reporting. Again, the component I designed in SSIS can perform the process/audit reporting.
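The core of an audit/reporting component is just a table that records each run. A minimal sketch using SQLite as a stand-in for SQL Server; the table name and columns are assumptions, not the schema of my SSIS component.

```python
import datetime
import sqlite3

def init_audit(conn):
    # Audit table: one row per job run.
    conn.execute("""CREATE TABLE IF NOT EXISTS etl_audit (
        job TEXT, status TEXT, row_count INTEGER, run_at TEXT)""")

def report_run(conn, job, status, row_count):
    """Record one job run so status and row counts can be
    reported accurately later."""
    conn.execute("INSERT INTO etl_audit VALUES (?, ?, ?, ?)",
                 (job, status, row_count,
                  datetime.datetime.now().isoformat()))
    conn.commit()
```

Querying this table over time also gives you a free sanity check: a sudden drop in a job's row count is often the first sign of an upstream problem.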
Parameter Management: the ability to manage job parameters across environments so that components can be delivered and run without requiring any modifications. One parameter I use determines whether the load is initial or incremental. Parameters can be stored in a file or in a SQL table, and you can design a task that reads the parameter(s) from the flat file or SQL table to determine the processing path.
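A sketch of that task in Python, using a JSON file as the parameter store (a SQL table works the same way, just with a SELECT instead of a file read). The file name, key, and path names are hypothetical.

```python
import json

def load_params(path):
    """Read job parameters from a JSON file; swapping in a SQL
    table only changes this function, not the callers."""
    with open(path) as f:
        return json.load(f)

def choose_load_path(params):
    # Branch on the initial-vs-incremental parameter.
    if params.get("load_type") == "initial":
        return "truncate_and_full_load"
    return "incremental_merge"
```

Because only `load_params` knows where the values live, the same packages run unmodified in development, test, and production; each environment just carries its own parameter file or table rows.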
Optimization: I've read about how multiple instances of jobs can be created to take advantage of a server farm (if you have one), but I've never seen how it's done. Since SSIS is my primary tool, I use parallelism when different types of extracts or processes can run at the same time. In my opinion, to design a performance-driven ETL process in SSIS, you need a suitable mix of SQL-based and SSIS-specific optimization techniques in place.
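The "run independent extracts at the same time" idea can be sketched with a thread pool. The extract functions and their row counts below are simulated; the only real point is that the tasks must not depend on each other's output.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_customers():
    return ("customers", 120)   # simulated source and row count

def extract_orders():
    return ("orders", 450)      # simulated source and row count

def run_parallel_extracts(extracts):
    """Run independent extract tasks concurrently. Only safe when
    the extracts share no dependencies, as noted above."""
    with ThreadPoolExecutor(max_workers=len(extracts)) as pool:
        return dict(pool.map(lambda f: f(), extracts))
```

In SSIS the equivalent is placing the independent tasks side by side (no precedence constraint between them) so the engine schedules them concurrently.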
Metadata Management: in SSIS it's not easy to see a central repository of metadata. To compensate, I've written some code that tries to extract the metadata from my SSIS packages. For the most part, though, I document the metadata manually in an Excel file.
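Since a .dtsx package is stored as XML, basic metadata can be pulled out with a standard XML parser. The sketch below uses a simplified stand-in document; real .dtsx files use the DTS namespace and a much richer schema, so treat this as the shape of the approach rather than working package-parsing code.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a .dtsx package; not the real schema.
SAMPLE = """<Package>
  <Task name="Extract Customers"/>
  <Task name="Load Customers"/>
</Package>"""

def list_task_names(xml_text):
    """Walk the package XML and collect task names, the kind of
    metadata you would otherwise copy into an Excel file by hand."""
    root = ET.fromstring(xml_text)
    return [t.get("name") for t in root.iter("Task")]
```

Dumping the collected names (plus connection strings, variables, and so on) to a spreadsheet gives the same Excel document mentioned above, just generated instead of maintained by hand.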