What is ETL?
I want to take a break from the metadata to a more basic question, what is ETL?
We all know that ETL stands for Extract, Transformation, Loading. This is ETL in it’s simplest term. More importantly, ETL developers should expand this simple 3 steps to the following concepts. I’ve seen developers who have ignored many of the following concepts, therefore, many of the sub processes are missing in the final implementation.
What is ETL?
From the book The Data Warehouse ETL Toolkit: a simple mission of ETL is to get data out of the source and load it into the data warehouse. I am very satisfied with this definition of ETL, for a simple reason that ETL is not only for data warehousing projects.
In the same book: ETL can be defined as more actionable steps of Extract, Clean, Conform, and Deliver. So instead of ETL, should we use ECCD? Now this one is closer.
Here is another one from ETLGuru.com: ETL is the automated and auditable data acquisition process from source system that involves one or more sub processes of data extraction, data transportation, data transformation, data consolidation, data integration, data loading and data cleansing.
Here is another one from Informatica.com: ADCID?
Other concepts that the ETL developers should understand are: Standardization, Data Quality, Data Profiling, Data Staging, Exception & Error handling, Production Support & Change Management For ETL Processes, ETL Performance Tuning, ETL commercial tools, ETL Metadata Management.
Commonly missing sub processes in ETL process design
- Automation: if the ETL process must run manually, then it’s not an ETL process.
- Auditability: without the auditable traces of each run, the data quality will be in doubt.
- Data Profiling/discovery (of the source data): developers do not fully understand the source data because this sub process is missing.
- Data Cleansing: one example, duplicate data is introduced into the target system because this step is missing.
- Data Standardization: data is normalized without being standardized first.
- Data Integration: data from multiple sources are not consolidated, integrated and conformed into the target system.
- ETL Performance Tuning: dealing with large amount of data without properly designed staging strategy; attempting to accomplish complex transformations, lookups, merges in single step.