Table of contents
|IsNumeric() function is too forgiving|
|Use Pattern Matching with LIKE keyword|
|LIKE keyword in T-SQL doesn’t support regular expression|
|Replace white spaces with empty characters|
IsNumeric() function is too forgiving
One of my SSIS packages failed recently with the following error message.
“…. failed with the following error: "Error converting data type varchar to bigint….”
It turned out that a field contained some white-space characters (tab, line feed, and carriage return control characters), and CAST(col AS bigint) failed.
Almost every ETL developer has used the IsNumeric() function to check whether a string such as '123456' is a true number. The function seems very convenient and easy to use. However, more often than not, you will get errors in your ETL packages when loading these fields into your data warehouse, where these fields are expected to be of an integer (or bigint, or any other numeric) data type.
You are sure that you have used the IsNumeric() function to exclude those that are not numeric. So what went wrong?
It turned out that the IsNumeric() function is a very forgiving function, while Cast() (or Convert(), or the implicit conversion) is not.
Use Pattern Matching with LIKE keyword
Check out this MSDN article, Pattern Matching in Search Conditions. There are 4 wildcards that we can use with the LIKE keyword.
Using the last one [^], together with the first one %, the following pattern would only allow numbers 0-9 and a decimal point in the string.
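As a sketch (the column and table names here are placeholders, not from the original package), the pattern looks like this:

```sql
-- [^0-9.] matches any single character that is NOT a digit or a dot;
-- wrapping it in %...% asks "does such a character appear anywhere?",
-- so NOT LIKE inverts that into "contains no illegal characters".
SELECT col
FROM your_table                    -- placeholder table name
WHERE col NOT LIKE '%[^0-9.]%'
  AND col <> '';                   -- an empty string would otherwise slip through
```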
I ran the following test to compare the IsNumeric() function with a pattern matching using the LIKE keyword.
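A test along these lines (a reconstructed sketch, not the original script) shows how differently the two checks judge a few tricky strings:

```sql
-- Compare ISNUMERIC() with the LIKE pattern on strings that
-- commonly break CAST(... AS bigint).
SELECT s AS input,
       ISNUMERIC(s) AS is_numeric,
       CASE WHEN s NOT LIKE '%[^0-9.]%' THEN 1 ELSE 0 END AS pattern_ok
FROM (VALUES ('123456'),         -- both agree: numeric
             ('$100'),           -- ISNUMERIC accepts currency symbols; the pattern does not
             ('1e4'),            -- scientific notation passes ISNUMERIC only
             (CHAR(9) + '123'),  -- a leading tab passes ISNUMERIC but fails CAST to bigint
             ('1.2.3')           -- two decimal points: the pattern still passes this one
     ) AS t(s);
```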
The last one with two decimal points has passed the pattern test. Feel free to experiment to see if you can use pattern matching to eliminate it.
LIKE keyword in T-SQL doesn’t support regular expression
The LIKE (or the CHARINDEX) keyword in T-SQL doesn't support regular expressions, but pattern matching can be a good alternative when the IsNumeric() function is just not enough.
Replace white spaces with empty characters
You can also try to combine the pattern matching with replacing white spaces with empty characters. Tab, line feed and carriage return control characters have the values CHAR(9), CHAR(10) and CHAR(13) respectively.
cast(REPLACE(REPLACE(REPLACE(your_number_column, CHAR(9), ''), CHAR(10), ''), CHAR(13), '') as bigint)
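Putting the two techniques together might look like this sketch (your_number_column and your_table are placeholders):

```sql
-- Strip tab/LF/CR first, then apply the pattern test before casting.
SELECT CAST(clean_col AS bigint) AS n
FROM (
    SELECT REPLACE(REPLACE(REPLACE(your_number_column,
               CHAR(9),  ''),    -- tab
               CHAR(10), ''),    -- line feed
               CHAR(13), '')     -- carriage return
           AS clean_col
    FROM your_table
) AS cleaned
WHERE clean_col NOT LIKE '%[^0-9]%'   -- digits only; bigint allows no decimal point
  AND clean_col <> '';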
I recently read an article on SQLServerCentral.com, The Job Posting – Do I really have to be the SQL God? The author was passing on some advice on how to interpret various SQL job postings.
Do you need to be the SQL God/Goddess? You absolutely do, as a SQL consultant/contractor, in order to be a respectable SQL professional and command a higher rate. But the question is what kind of SQL god/goddess.
The author reviewed a concept that I thought was worth mentioning.
For SQL Server, there’s 5 general areas of work: Administration, T-SQL Development, ETL, Reporting, and Architecture. In most of these cases, you end up blending a few of these areas together, but a solid, 40 hr/week position is going to concentrate on one of these and incorporate a little bit of some of the others. For example, reporting will need some T-SQL and some ETL. Architecture requires a basic understanding of the other four. ETL will probably require a little architecture to build staging structures and some T-SQL to optimize some tasks.
Five general areas of SQL Server work, blending a few of these areas together. I agree with this concept. The longer I work on the SQL Server platform, the more blended it has become. It's a good thing for me, as a contractor, because it means that I've been down many different paths and gained experience in a wide range of areas. It's also a good thing when it comes to job interviews.
However, I always struggle with the question of “what title do I put on my resume”. When I was a full-time employee, I accepted the job title the company gave to me. Although essentially I’ve always been in SQL development, the job titles I received varied so much that people would think they are completely different jobs. Here are some of the job titles I had over the years: Data Analyst, Database Architect, Technical Advisor, Database Analyst, System Developer.
Now that I am on my own, I no longer have a job title (that's not entirely true: "Contingent worker" is usually attached to my email in the Global Address List), until it comes time to write a resume. What should I put in that blank space underneath my name? It's been 6 years, and the situation is getting worse. I start to panic whenever I think about that blank space.
Fortunately the companies that hired me as a “Contingent worker” really don’t care about that blank space on my resume.
I’ll take that back. I never left that space blank. Here are some of the words I put in that space:
SQL Developer, Data Analyst, Database Analyst, ETL Developer
Maybe next time, I’ll consider leaving that space blank.
I've done ETL design and development in both integration projects and reporting environments.
In master data integration projects, it's not hard to distinguish the staging area from the target area. In the staging area, truncating tables with old data and loading new data is a pretty common practice. I have not seen any truncating of tables in the target area. This is a good thing. It means that developers understand the role of staging, and understand that loading data into the target area needs to be carefully designed, or "reconciled", as a consultant called it. Depending on the business rules, we will end up either creating new records in the target, updating existing records, or deleting obsolete records.
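A reconciled load along those lines could be sketched with a MERGE statement (the table and column names below are illustrative placeholders, not from any project described here):

```sql
-- Sketch: insert new keys, update changed rows,
-- and delete rows that have disappeared from the source.
MERGE dbo.target_table AS t               -- placeholder target
USING dbo.staging_table AS s              -- placeholder staging
    ON t.business_key = s.business_key
WHEN MATCHED AND t.some_value <> s.some_value THEN
    UPDATE SET t.some_value = s.some_value
WHEN NOT MATCHED BY TARGET THEN
    INSERT (business_key, some_value)
    VALUES (s.business_key, s.some_value)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;
```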
In the reporting environment, however, developers have a very different attitude and a very different understanding of staging vs. target.
This is a very typical architecture I've seen so far. The flaw in this design is that there is no target area, only staging. When developers truncate the table in staging, the report shows empty data between the time the table is truncated and the time the new data finishes loading. Or even worse, the report hangs when the load takes a long time to finish.
You might argue that the chance of a user running the report at the exact moment the data is refreshing is small. In my experience, if I hear about it even once from the business users, then the chance is not small, because the possibility of it happening again is always there.
Well, the above is not the only type of architectural design I've seen. Here is another one. There is no physical staging area in this design; the report accesses the data source directly with T-SQL code via OPENQUERY. The developer argued that this type of design eliminates the need to maintain a staging area on the SQL Server. That part is true. But it does not eliminate the emails and phone calls we get from frustrated business users when the report takes forever to load the data. Reporting Services routinely times out when trying to retrieve a large amount of data via OPENQUERY.
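The no-staging pattern looks roughly like this (the linked server name and remote query are made-up placeholders):

```sql
-- The report's dataset pulls directly from the source system at run time.
-- REMOTE_SRV is a placeholder linked server name.
SELECT *
FROM OPENQUERY(REMOTE_SRV,
     'SELECT order_id, order_date, amount FROM sales.orders');
-- Every report execution pays the full remote query cost,
-- which is where the timeouts on large result sets come from.
```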
The third type of architectural design can avoid many of the problems of long-running reports, or reports with empty data. By separating target from staging, and not truncating the target tables, the reports will always look good.
Back to the title of this post, “why truncating target tables is a bad practice in ETL”. I have to confess that I was one of those developers who made such mistakes until I received enough emails and phone calls from our business users.
There is more to discuss on how to move data from staging to target. Stay tuned.
I've written a post SSIS – Use "Foreach Loop Container" and pass a variable to an OLE DB Source in a Data Flow. I have to confess that I don't use the Foreach loop container very often. Actually, I've only used it once in my entire SQL development life. I know how wonderful it is, and how much other developers like it. Deep down, I am a SQL developer. I still resort to T-SQL whenever I feel a bit dizzy thinking about SSIS packages.
I start to feel dizzy when I need to loop through the Foreach loop container more than 10 times. I think that's my arbitrary limit. Looping through 10 days means connecting to the data source 10 times. I bet Integration Services does some tricks with connection pooling. Still, I don't quite like it, especially when the data volume is large.
Here is what I would rather do. I'd bring all 10 days (or even more, depending on the project) worth of data from my data source in one pass, then use a WHILE loop in T-SQL to process the data for each day.
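A minimal sketch of that approach (the date range and table names are assumed for illustration):

```sql
-- Pull the whole range once, then walk it one day at a time in T-SQL.
DECLARE @day date = '2014-01-01';          -- placeholder start of range
DECLARE @end date = '2014-01-10';          -- placeholder end of range

WHILE @day <= @end
BEGIN
    -- Process a single day's slice of the already-loaded data.
    INSERT INTO dbo.daily_summary (load_date, row_count)
    SELECT @day, COUNT(*)
    FROM dbo.staged_data
    WHERE load_date = @day;

    SET @day = DATEADD(DAY, 1, @day);
END;
```

One trip to the data source, ten iterations of pure T-SQL: no repeated connections, and everything stays inside the database engine.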
You might wonder what would drive me to process data for each day in a WHILE loop. That will be the topic for my next post.