Archive

Archive for September, 2010

Oracle – Where is the tnsnames.ora file?

September 28, 2010 Leave a comment

The default location should be: oracle install\NETWORK\ADMIN

On my local PC, here it is

C:\oracle\product\10.2.0\db_1\NETWORK\ADMIN

Here are three examples of how a service name can be set up in the file. Is 1521 the default port number? The SID is the Oracle database name.

image

Categories: SSIS Best Practices

SSIS – Port Numbers for Firewall Filter

September 28, 2010 Leave a comment

Just want to write down the port numbers that a firewall needs to open for SQL Server or SSIS to work across different un-trusting domains.

  • 135 for SSIS
  • 1433 for SQL Server
Categories: SSIS Best Practices

ETL Vendors

September 28, 2010 1 comment

There is an interesting article from IT Toolbox on ETL vendors.

image

All 19 ETL tools that is a mix of expensive ETL tools like DataStage and Informatica PowerCenter, cheap tools like Pervasive or IKAN and open source tools like Pentaho, Clover ETL and Talend.

Here is an ETL Vendors Comparison from Adeptia.

Categories: ETL Best Practices

What is ETL?

September 27, 2010 1 comment

I want to take a break from the metadata to a more basic question, what is ETL?

We all know that ETL stands for Extract, Transformation, Loading. This is ETL in it’s simplest term. More importantly, ETL developers should expand this simple 3 steps to the following concepts. I’ve seen developers who have ignored many of the following concepts, therefore, many of the sub processes are missing in the final implementation.

What is ETL?

From the book The Data Warehouse ETL Toolkit: a simple mission of ETL is to get data out of the source and load it into the data warehouse. I am very satisfied with this definition of ETL, for a simple reason that ETL is not only for data warehousing projects.

In the same book: ETL can be defined as more actionable steps of Extract, Clean, Conform, and Deliver. So instead of ETL, should we use ECCD? Now this one is closer.

Here is another one from ETLGuru.com: ETL is the automated and auditable data acquisition process from source system that involves one or more sub processes of data extraction, data transportation, data transformation, data consolidation, data integration, data loading and data cleansing.

Here is another one from Informatica.com: ADCID?

image

image

image

image

image

image

 

Other concepts that the ETL developers should understand are: Standardization, Data Quality, Data Profiling, Data Staging, Exception & Error handling, Production Support & Change Management For ETL Processes, ETL Performance Tuning, ETL commercial tools, ETL Metadata Management.

Commonly missing sub processes in ETL process design

  • Automation: if the ETL process must run manually, then it’s not an ETL process.
  • Auditability: without the auditable traces of each run, the data quality will be in doubt.
  • Data Profiling/discovery (of the source data): developers do not fully understand the source data because this sub process is missing.
  • Data Cleansing: one example, duplicate data is introduced into the target system because this step is missing.
  • Data Standardization: data is normalized without being standardized first.
  • Data Integration: data from multiple sources are not consolidated, integrated and conformed into the target system.
  • ETL Performance Tuning: dealing with large amount of data without properly designed staging strategy; attempting to accomplish complex transformations, lookups, merges in single step.
Categories: ETL Best Practices

Metadata – What Is It (4)? – Extract Metadata from SSIS Packages

September 26, 2010 Leave a comment

In my last blog Metadata – What Is It (3)? – ETL Package/Task Metadata, I put down my notes about the naming conventions I am using for tasks/components/transformations in my current ETL design project.

Once I settled down on the naming conventions, 50% of my metadata task is completed. Another 50% of the effort is to actually extract the metadata I need from my SSIS packages.

There are some discussions on Internet about how to do this. There are commercial products out there too. But for my purpose (document my packages really), a commercial product will be an over kill. Luckily, the DTSX file is just a XML file. I can use simple XQUERY functions to extract what I need out of the dtsx file.

Goal

Extract task names from a SSIS package. Remember all my tasks are “encoded” with my naming conventions. So the name of the task will reveal 50% of the purpose of the task. I’ll complete rest of the metadata manually in a spreadsheet.

Solution

I am a SQL developer, so naturally I want to manipulate the dtsx file in T-SQL.

Before I can manipulate the dtsx file (or any xml file) in T-SQL, I need to load the file into SQL. I can do it in a couple of ways (I am using SQL 2008 for this purpose).

1. I can use OPENROWSET to BULK load the dtsx file into a variable of data type XML

image

DECLARE @PackageXML XML
SET @PackageXML =
(    SELECT    *
    FROM    OPENROWSET
        (    BULK ‘C:\etlEAMPC_EXT_ExtractRawDataSW.dtsx’, SINGLE_CLOB  ) AS xmlData )
SELECT @PackageXML

The result looks like this in SQL Management Studio:

image

SQL Management Studio is one of the best xml viewers. So click on the xml link, the xml file is opened in a new tab, and it looks like this.

image

2. I can use OPENROWSET to BULK load the dtsx file into a SQL table

image

USE tempdb
GO

IF EXISTS (select * from sys.tables where name = N’tblPackageXML’)
        DROP TABLE tblPackageXML;

GO

CREATE TABLE tblPackageXML(
        PackagePath    varchar(900)    NOT NULL PRIMARY KEY
,       PackageXML    XML             NOT NULL
);

INSERT INTO tblPackageXML (PackagePath, PackageXML)
SELECT ‘C:\etlEAMPC_EXT_ExtractRawDataSW.dtsx’, xmlData
FROM  (    SELECT    *
        FROM    OPENROWSET 
        (    BULK ‘C:\etlEAMPC_EXT_ExtractRawDataSW.dtsx’, SINGLE_CLOB)  AS xmlData  ) AS PackageXML (xmlData)
GO

SELECT * FROM tblPackageXML

The result looks like this in SQL Management Studio:

image

3. I can use the xp_cmdshell SQL system procedure to recursively look for all dtsx files on my hard drive, and load them all into a SQL table. I only have 5 SSIS packages, so the first 2 ways are enough for me.
4. Now I can use XQuery to retrieve exactly what I need from the PackageXML column on tblPackageXML

Here I am using the Query function to retrieve only the task names. There are a few things you need to understand in this script.

  • The declare part is needed because the the XML is a typed column (in other words, there is a schema DTS registered for this column). Once DTS namespace is declared, all elements need to be prefixed with DTS:.
  • // indicate that this is the root element. What is inside [@….] are the expressions I am telling Query function what kind of attributes I need or do not need. Here I am telling the Query function to look for (or not look for) under attribute ExecutableType.
  • /DTS:Property indicates that the element I am looking for under DTS:Executable is DTS:Property.
  • Again [@DTS:Name=”ObjectName”] indicates that the only attribute I need is Name=”ObjectName”. 

image

Here is the result from the above query:

image

If I click on XML link, I see this. Notice that now I only see the task names.

image

5. Now I need to create a new table with the extracted data from the XML column. I’ll use the Nodes method to create a new table tblTask where each Task Name becomes a row in that table.

Along with the Nodes function, I need to use the CROSS APPLY statement with the above temp table #task to achieve this.

image

The result looks like this. Notice that each package now expanded to multiple rows with each row representing one task in the package.

image

6. Finally I need to convert XML data type to nvarchar and use REPLACE function to strip off the characters I don’t need.

image

Result:

image 

7. Now I can copy the above result to a spreadsheet and complete the rest of my metadata document manually there.

Continue to my next blog “Metadata – What Is It (5)?”

Drop Database

September 24, 2010 Leave a comment

–A database can be dropped regardless of its state: offline, read-only, suspect, and so on.
–To display the current state of a database, use sys.databases catalog view
select * from sys.databases
where name in (‘EAMPC’)

–Database currently being used cannot be dropped.
–This means open for reading or writing by any user.
–To remove users from the database, use ALTER DATABASE to set the database to
–SINGLE_USER.
ALTER DATABASE EAMPC
SET SINGLE_USER
WITH ROLLBACK IMMEDIATE;

–Dropping a database deletes the database from an instance of SQL Server and
–deletes the physical disk files used by the database.
–If the database or any one of its files is offline when it is dropped,
–the disk files are not deleted. These files can be deleted manually by using
–Windows Explorer. To remove a database from the current server without
–deleting the files from the file system, use sp_detach_db.
drop database EAMPC;

Categories: SQL Challenge

Metadata – What Is It (3)? – ETL Package/Task Metadata

September 23, 2010 1 comment

From my last blog: Metadata – What Is It (2)? – ETL-Generated Metadata

image

In this blog, I’ll put down my notes on the ETL Package/Task Metadata.

ETL Task Metadata

What is exactly ETL task metadata? I define ETL task as the smallest unit in my SSIS package. What does each task do? Each task is really a source-to-target mapping that contains programming code. This can be very cryptic to business users or nontechnical end users. However, source-to-target mapping at task level is crucial for understanding the true data lineage of the data in the data management system. When credibility of the data is in question or its integrity needs to be proven, source-to-target mapping becomes the most important document.

SSIS Task and Component Naming

Before I set out to get task metadata, I want to get my naming convention straight. There are many proposals about the naming in for SSIS tasks and components. Most proposals are based on the naming of tasks and components in the Integration Services. Here are some popular examples.

image

I am personally not in favor of this kind of naming. The visual clue in the SSIS designer is enough to tell me if it’s a sequence container or it’s a data flow task. The abbreviations in the proposed prefixes also lose the meanings to me very soon.

In the data warehouse world, it’s popular to use Dim (Dimension) or Fact as either prefix or suffix to name  SSIS tasks and components.

I particularly like some of the naming proposals on the ETL Toolkit book. On a high level, ETL tasks generally fall into one of three categories, extraction, staging, and target.

image

In my ETL process, I introduced two more categories, AUDIT and VARIABLE.

  • AUDIT. AUD_<table name>. It indicates that the task’s purpose is to create audit entry.
  • VARIABLE. VAR_<variable name>. This is very specific to implementation in SSIS.

Here are a few examples.

  • A package. etlEAMPC_STG_CleanPCAssetSources.dtsx. STG indicates that this package does not touch the target, and it stays in the staging area only.
  • A task. image
  • A sequence container. image

Transformation Naming

Transformations are the components of an ETL task. The ETL Toolkit book recommended
the following naming conventions while building your ETL transformations:

image

image

In my ETL process, in addition to the prefix, I would add the stored procedure name if I am using a stored procedure for transformation.

Continue to my next blog “Metadata – What Is It (4)? – ETL Package/Task Metadata”

Categories: ETL Best Practices
%d bloggers like this: