Flat or sequential file processing vs. DBMS in staging – Metadata tracking
It has been the conventional wisdom that sorting, merging, deleting, replacing, and many other data-migration functions are much faster when they are performed outside the DBMS. Many utility programs are dedicated to text-file manipulation.
However, the metadata tracking is lost in the flat file structure.
If the metadata tracking (say, for compliance purposes) is as important as the transformations themselves, you should think instead about handling these operations through a dedicated ETL tool that can automatically supply the metadata context.
The ETL Toolkit book recommend that it’s most practical to use flat files over database tables for portions of the ETL process when the fundamental purpose of the process
is one of the following:
- Staging source data for safekeeping and recovery
- Sorting data
- Replacing/substituting text strings
- Referencing source data
Besides flat file processing and DBMDS tables processing, XML Data Sets are also used as a data format for ETL processing.
At this point, XML data sets are not generally used for persistent staging in ETL
systems. Rather, they are a very common format for both input to and output from
the ETL system. It is likely that XML data sets will indeed become a persistent
data-storage format in the ETL system and in the data warehouse queryable
tables, but the hierarchical capabilities of XML will have to be integrated more
deeply with relational databases before that will be common.
Using database tables is most appropriate especially when you don’t have a dedicated ETL tool. Using a database to store staging tables has several advantages:
- Apparent metadata
- Relational abilities
- Open repository
- DBA support
- SQL interface