What does ETL do really? – Extract, Clean, Conform, Deliver (ECCD)
I recently bought a book, The Data Warehouse ETL Toolkit. Although it was published in 2004, it is still the best book I can find for ETL professionals, and the principles and techniques are still discussed every day.
Even the first paragraph of the Introduction impresses me by answering the question of what really ETL does.
Here is a simple view of the basic mission of the ETL system: to get data out of the source and load it into the data warehouse.
1) The Extract-Transform-Load (ETL) system is the foundation of the data
2) A properly designed ETL system extracts data from the source
3) enforces data quality and consistency standards,
4) conforms data
so that separate sources can be used together, and finally
5) delivers data in a presentation-ready format so that application developers can build
applications and end users can make decisions.
If that’s what ETL does, then the most natural questions would be:
- How do we properly design an ETL system?
- How do we extract data from sources systems?
- How do we enforce data quality?
- How do we enforce consistency standards?
- How do we conform data?
- How do we make sure that data from separate sources can be used together?
- How do we make the data presentation-ready?
I guess I am jumping ahead too much now. Now stay with what EL really does.
5) The ETL system makes or breaks the data warehouse.
6) Building the ETL system is a back room activity that is not very visible to end users,
7) It easily consumes 70 percent of the resources needed for implementation
and maintenance of a typical data warehouse.
The ETL system adds significant value to data. It is far more than plumbing
for getting data out of source systems and into the data warehouse.
Specifically, the ETL system:
- Removes mistakes and corrects missing data
- Provides documented measures of confidence in data
- Captures the flow of transactional data for safekeeping
- Adjusts data from multiple sources to be used together
- Structures data to be usable by end-user tools
Here is the complicated view of what really ETL does: the ETL system breaks into a thousand little subcases, depending on your own weird data sources, business rules, existing software, and unusual destination-reporting applications. The challenge for all of us is to tolerate the thousand little subcases but to keep perspective on the simple overall mission of the ETL system.
Ralph Kimball is indeed a very effective writer. The book turns the ETL into the more actionable steps of Extract, Clean, Conform, and Deliver. So instead of ETL, should we use ECCD?