Home > ETL Best Practices > Metadata—What Is It (1)?

Metadata—What Is It (1)?

As I am wrapping up the ETL process design and development, and the system is going into UAT phase, I start to work on producing the metadata for my back room staging area.

Before I do that, I read a few books and articles about metadata management in data warehouse and data integration projects. I took the following notes.

So Metadata—What Is It?

We all know that metadata is any data about data. This wasn’t very helpful to me, because it didn’t paint a clear picture in my mind as to what exactly this metadata stuff was. The Data Warehouse Toolkit book talked about the back room metadata and front room metadata.

Back room metadata is process related and guides the extraction, cleaning, and loading processes.

Front room metadata is more descriptive and helps make our query tools and report writers function smoothly.

These definitions, as helpful as they are, still fail to give me a clear picture what I need to do.

3 Types of Back Room Metadata

  • Process Metadata
  • Technical Metadata
  • Business Metadata

As an ETL designer and developer, I am more interested in the back room metadata.

According to the Toolkit book, Process Metadata, Technical Metadata, and Business Metadata are three types of back room source metadata.

image

As for the Business Metadata, ETL developers are not involved in creating business metadata. However, we should understand the purpose of the data we are working with and review business definitions as we need them.

In the current data integration project I am working in, fortunately, Business Metadata is quite well defined.

Technical Metadata

Technical metadata is the most interesting type of metadata to me as an ETL designer/developer. Generally, there are 5 types of technical metadata:

  1. System Inventory
  2. Data Models
  3. Data Definitions
  4. Business Rules
  5. ETL-Generated Metadata
1. System Inventory
  • Source databases
  • Staging-area tables (for example, extracted, cleaned, conformed, prepared for delivery)
  • Data warehouse presentation area

For each of the above 3 environment, we need to provide an entity relationship diagram (ERD). At a minimum, a listing that includes the following elements for each system is required.

  • Tables
  • Columns
  • Data types
  • Relationships
2. Data Models

Data models refers to the form of physical schema diagrams (either normalized
or dimensional). They are really just a graphical display of metadata.

3. Data Definitions

Data definitions draw attentions to the required consistency between each of their potential data stores. Each time your data touches down to a database or file, it is vulnerable to data truncation or corruption. If data definitions are not alike
between environments, we must explicitly convert the data to avoid catastrophe. The following data-definition metadata elements must be supplied to the ETL team.

  • Table name
  • Column name
  • Data type
  • Domain – The set of values allowed to be entered into a column is known as its domain. The domain can be enforced by a foreign key, check constraints, or the application on top of the database.
  • Referential integrity
  • Constraints
  • Defaults
  • Stored procedures
  • Triggers
4. Business Rules

We like to refer to business rules as technical because they are the essence of the ETL process—which is very technical. Each and every business rule must be coded in the ETL process. Business rules can include anything from allowed values to default values to calculations for derived fields. In source systems, business rules are enforced by stored procedures, constraints, or database triggers. But most often, business rules exist only in the application code.

Business rules must be incorporated into logical data mapping. Sometimes business rules are omitted from logical data mapping and go unnoticed until the first attempt at the ETL process is complete and the exclusions are detected by users during UAT. As new business rules are learned, the metadata in the logical data mapping must be updated to reflect the new rules.

5. ETL-Generated Metadata

This is my focus. More notes in the next blog “Metadata – What Is It (2)? – ETL-Generated Metadata”.

Categories: ETL Best Practices
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: