According to the book The Data Warehouse ETL Toolkit:
The ETL system makes or breaks the data warehouse. Although building
the ETL system is a back room activity that is not very visible to end users,
it easily consumes 70 percent of the resources needed for implementation
and maintenance of a typical data warehouse.
Of course, ETL is not used only in data warehousing projects; it is also commonly used in data integration projects. By common estimates, ETL typically takes up between 50 and 70 percent of a data warehousing or data integration project.
In the current data integration project I am working on with a large client, there are 3 full-time ETL designers/developers on a team of 11, including 2 project managers.
This is part of the table creation script that uses a computed column, colChecksum. The colChecksum column uses the SQL function Checksum() to create a checksum value for each row, based on the field list passed to the Checksum() function.
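The actual script is not reproduced here, but a minimal sketch of such a computed column might look like the following. The table and column names are illustrative (borrowed from the attribute breakout later in this post), not the real schema:

```sql
-- Hypothetical staging table for auto-discovered hardware assets.
-- The attribute list passed to CHECKSUM() is exactly the list of
-- columns we want to compare between loads.
CREATE TABLE dbo.stgComputerHardware (
    Hostname       varchar(100) NOT NULL,  -- natural key
    SerialNo       varchar(50)  NULL,
    OS             varchar(100) NULL,
    CPUSpeed       int          NULL,
    DefaultGateway varchar(15)  NULL,
    -- Computed column: one checksum over the compared attributes.
    colChecksum AS CHECKSUM(SerialNo, OS, CPUSpeed, DefaultGateway)
);
```

Because colChecksum is computed, it always reflects the current attribute values; excluding a volatile attribute from the comparison is just a matter of dropping it from the CHECKSUM() field list.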
Do not expect the SQL function Checksum() to give you a unique value for each unique row; different rows can have the same checksum value. You need not worry, though. For the refresh-rate analysis, you use the natural key together with the checksum value to determine whether a record needs to be refreshed.
Here is an example that compares the current data to the previous data, using the natural key Hostname plus the checksum computed column.
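A comparison along those lines (the stgCurrent/stgPrevious table names are placeholders, not the actual objects) could be sketched as:

```sql
-- Changed assets: same natural key (Hostname), different checksum.
SELECT cur.Hostname
FROM   dbo.stgCurrent  AS cur
JOIN   dbo.stgPrevious AS prev
       ON prev.Hostname = cur.Hostname
WHERE  prev.colChecksum <> cur.colChecksum;

-- New assets: present in the current load, absent from the previous one.
SELECT cur.Hostname
FROM   dbo.stgCurrent AS cur
WHERE  NOT EXISTS (SELECT 1
                   FROM dbo.stgPrevious AS prev
                   WHERE prev.Hostname = cur.Hostname);
```

Joining on the natural key first is what makes checksum collisions harmless: two different devices with the same checksum are never compared to each other, and a collision only matters in the rare case where the same device changes attributes yet lands on the identical checksum.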
After the historical loading of the assets into Asset Manager, I did an initial analysis to see how much volume needed to be refreshed daily. A rate of 80% seemed too high for assets that come from the network auto-discovery tools, so I did one more analysis, excluding certain attributes. The resulting 10% refresh rate seems more reasonable and more manageable.
The software installation data still needs to be analyzed.
ETL developer’s note:
- There are several candidates for the comparison key: the last modified date, the sequential number (primary key) from the source, or the natural key. I picked the natural key, i.e., the Device Name. The last modified date from the source does not give me the option to choose which attributes to compare. The sequential number that serves as the primary key in the source is not a good option either: devices in the network can be dropped from the auto-discovery source and come back with a different sequential number.
- Once the decision is made to use the natural key for comparison, the SQL function Checksum comes into the picture naturally. I’ve created a calculated column colChecksum for the tables I need to compare. See details in the next blog.
Sent: Friday, August 27, 2010 12:32 PM
Subject: Computer Hardware Daily Refreshing Rate Analysis
My initial analysis included all the attributes, and the refreshing rate seemed very high.
I did one more analysis. Here are my findings so far. The numbers seem manageable now. But I am not clear why some of the attributes, such as Serial Number and CPU core count, would change on a daily basis.
- Marimba auto-discovery data
- Data are cleansed before the analysis (duplicates removed, key values standardized, only needed attributes retrieved, etc.)
- Previous date: 8/26; 133,471 assets
- Current date: 8/27; 133,549 assets
- Device Name is used as the comparison key
- Certain attributes are excluded from the comparison, such as last logon date, IP address, and disk free space
- Attributes included in the comparison are listed below
- The current date has 247 new assets, less than 0.2%
- 169 assets are no longer reporting since the previous date
- 13,031 assets, about 10%, have some of the following attributes changed since the previous date. Some attributes contribute to the changes more than others. The breakout is listed below.
Changes Breakout by Attributes
- DefaultGateway: 3197
- Logonuser: 1483
- username: 1264
- CPUSpeed: 1834 (might have to do with the rounding)
- IPMask: 1150
- WINS: 2893
- DNS: 130
- DomainName: 126
- TcpIpDomain: 23
- OS: 35
- BIOS: 22
- SerialNo: 34 (not sure why SerialNo would change if it’s still the same PC)
- PCModel: 26
- CPUBrand: 32
- CPUType: 47
- CPUCores: 6
I think it’s common for different domains in an enterprise network not to trust each other fully. What does that mean for SQL developers who need to work from their local PCs and reach SQL servers on different domains?
A basic rule here is that we need a SQL account when crossing untrusted domains, because Windows Authentication will not work.
Your DBA or system admin can create a SQL account for you, or you can create one yourself if you have RDP rights to remote into the box and can use Windows Authentication to log in to the SQL server.
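For reference, the difference shows up directly in the connection string. The server, database, and account names below are placeholders, not the actual environment:

```
Windows Authentication (works only within trusted domains):
  Server=SQLSRV01;Database=AssetDB;Integrated Security=SSPI;

SQL Server Authentication (works across untrusted domains):
  Server=SQLSRV01;Database=AssetDB;User ID=etl_user;Password=********;
```

With Integrated Security the client passes your Windows credentials, which the remote domain will not honor if it does not trust yours; the dedicated SQL account sidesteps that because authentication happens inside SQL Server itself.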
Now I understand how a firewall should be configured to allow access to Integration Services when it reaches out to a remote SQL server behind the firewall.
I also know where to authenticate my own account to the firewall so that my SSIS package will work when reaching out to a remote SQL server behind a firewall.
But obviously this is not the only way.
1) I can VPN into domain 1, then authenticate myself to the firewall, then be able to access the remote SQL server using a dedicated SQL account. Or,
2) I can have an account in domain 2 and log in to domain 2, so I can access the SQL server using either a dedicated SQL account or Windows Authentication. Or,
3) I can belong to a group such as xxxADMIN, which is also a local admin on that remote server. This way I do not need a separate account in domain 2.
The Windows Firewall system helps prevent unauthorized access to computer resources over a network connection. To access Integration Services through this firewall (when your SSIS package sits in one network and the target SQL server sits in a different network), you have to configure the firewall to enable access.
In the company I am working with, we need to send a request to the data center to open up the firewall on a port number for a specific server, identifiable by its IP address on the network.
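If you administer the server yourself, the equivalent step is a firewall rule on the SQL Server box. This is a sketch, assuming a default instance on the default port; named instances may listen on a different (possibly dynamic) port:

```
rem Allow inbound TCP 1433 (default port of a SQL Server default instance).
netsh advfirewall firewall add rule name="SQL Server TCP 1433" dir=in action=allow protocol=TCP localport=1433
```

In a locked-down data center you would put the same information (server IP, port, protocol, direction) into the access request instead of running the command yourself.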
It will be slow, but it should work.
Here is an article on MSDN about Configuring a Windows Firewall for Integration Services Access.
Here is the actual access request to the data center:
There is a pretty good blog about how to resolve this issue. Here is the link: http://blog.sqlauthority.com/2009/05/21/sql-server-fix-error-provider-named-pipes-provider-error-40-could-not-open-a-connection-to-sql-server-microsoft-sql-server-error/
From both my SQL Server Management Studio (2005 and 2008 versions) and an SSIS package, I got this connection error. See the screenshot below.
According to the DBA of the target database:
After some digging, the problem turned out to be a domain-plus-firewall issue. My PC is in a different domain from the target SQL server, and the firewall between the two domains was not allowing my account to pass through (the target SQL database uses a SQL account, not Windows Authentication).
A phone call to the Helpdesk led to a firewall site where I could use my SSO (Single Sign-On) account to authenticate to the firewall. Problem solved. Partly.
There are still issues with connecting to SQL servers from an SSIS package that resides in a different domain from the target SQL server. How would the SSIS package pass through the firewall to reach the target SQL Server?