MDX #46–6 different ways to detect a particular member in MDX script

May 14, 2015

There are many great tips in MDX with SSAS 2012 Cookbook

The book MDX with SSAS 2012 Cookbook is full of great tips for MDX script writers.

Here are two that are from Chapter 5 Navigation:

  • detecting a particular member by comparing member objects with the IS keyword is better than comparing names
  • using the SCOPE() statement in MDX script is a more “permanent” solution than calculations in an MDX query

MDX script writers frequently need to include or exclude a particular member in a calculation. The first step is to determine whether the member exists in a hierarchy.

The book provides great recipes for doing this in MDX queries. In this post, I’ll focus on how it can be done in MDX script.

If you have questions about the following concepts, please refer to Chapter 5 Navigation:

  • Iteration on query axes
  • The CurrentMember function
  • The IS keyword
  • The SCOPE() statement
  • A member’s unique name
  • Why we should avoid comparing names
  • Why the SCOPE() statement is a better approach in MDX script

6 different ways it can be done in MDX script

Suppose that we need to detect the NA member in the Color hierarchy of the Product dimension. The result should show us TRUE for color NA only.

Color                 Member is detected
Black
Blue
Grey
Multi
NA                    TRUE
Red
Silver
Silver/Black
White
Yellow
Assembly Components


Here are 6 different ways it can be done in MDX script (there are more ways if you insist; see the screenshot below).

  1. Member is detected 1 – Name: compare the Name property of the current member
  2. Member is detected 2 – INTERSECT-COUNTING: intersect with the current member, then use Count()
  3. Member is detected 3 – UniqueName: compare the UniqueName property of the current member
  4. Member is detected 4 – VALUE: compare the full value of the current member
  5. Member is detected 5 – IS: use IS to compare member objects
  6. Member is detected 6 – SCOPE: use the SCOPE() statement

The solutions get better from 1 to 6: avoid numbers 1 and 2; numbers 3, 4, and 5 are comparable; number 6 is the best choice in MDX script. A sketch of the key calculations follows.
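Here is a minimal sketch of four of these calculations as they might appear in an MDX script. The hierarchy names follow the Adventure Works-style [Product].[Color]; the member key &[NA] and the exact measure names are illustrative assumptions:

/* 1 – Name (avoid): string comparison on the Name property */
CREATE MEMBER CURRENTCUBE.[Measures].[Member is detected 1 - Name] AS
    [Product].[Color].CurrentMember.Name = "NA";

/* 3 – UniqueName: still a string comparison, but unambiguous */
CREATE MEMBER CURRENTCUBE.[Measures].[Member is detected 3 - UniqueName] AS
    [Product].[Color].CurrentMember.UniqueName = "[Product].[Color].&[NA]";

/* 5 – IS: compare the member object itself */
CREATE MEMBER CURRENTCUBE.[Measures].[Member is detected 5 - IS] AS
    [Product].[Color].CurrentMember IS [Product].[Color].&[NA];

/* 6 – SCOPE: a "permanent" assignment in the MDX script */
CREATE MEMBER CURRENTCUBE.[Measures].[Member is detected 6 - SCOPE] AS NULL;
SCOPE ([Measures].[Member is detected 6 - SCOPE], [Product].[Color].&[NA]);
    THIS = "TRUE";
END SCOPE;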

Become familiar with the SCOPE statement

Check out this link:

http://sqlbits.com/Sessions/Event8/Fun_with_Scoped_Assignments_in_MDX

Chris Webb did a wonderful job in this “Fun with Scoped Assignments in MDX” video.

[Screenshot: SCOPE]

ETL #72–Your data can mysteriously disappear after a WHERE clause was added (2)

Why a seemingly innocent WHERE clause can cause a large amount of data to go missing

In the previous post, ETL #71–Your data can mysteriously disappear after a WHERE clause was added (1), I talked about how I discovered that a WHERE clause had been added to a main procedure that processes the raw data prior to loading it into the fact table:

ecda1.[CUSTOM_DATA_8] <> 'null'

A quick check of the raw data showed that, out of 1.3+ million rows, only one row has the string value 'null' in ecda1.[CUSTOM_DATA_8], and that about 20% of the rows have a SQL NULL value in CUSTOM_DATA_8. It’s these 20% of the data that were filtered out along with the one 'bad' row.

The missing 20% of the data has SQL NULL in the field

The field [CUSTOM_DATA_8] on table alias ecda1 has one row with the string value 'null', but about 20% of the rows have a value of SQL NULL.

We all know a thing or two about the SQL NULL:

  1. SQL NULL is the term used to represent a missing value.
  2. A NULL value in a table is a value in a field that appears to be blank.
  3. It is very important to understand that a NULL value is different from a zero value or a field that contains spaces.

IS NOT NULL is implied whenever we add a filter

But the most important thing about SQL NULL is this: IS NOT NULL is implied whenever we add a filter of either equality or inequality, because any comparison against NULL evaluates to UNKNOWN and the row is filtered out.

So if we add a filter such as:

ecda1.[CUSTOM_DATA_8] <> 'aStringValue'

then we really mean this:

ecda1.[CUSTOM_DATA_8] <> 'aStringValue'
AND ecda1.[CUSTOM_DATA_8] IS NOT NULL

If we add a filter such as:

ecda1.[CUSTOM_DATA_8] = 'aStringValue'

then we really mean this:

ecda1.[CUSTOM_DATA_8] = 'aStringValue'
AND ecda1.[CUSTOM_DATA_8] IS NOT NULL

Now you can understand why those 20% of rows with SQL NULL values are missing.
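You can see the behavior for yourself with a quick sketch (the temp table and its values are made up for illustration):

-- Rows with SQL NULL are excluded by both = and <> comparisons
CREATE TABLE #demo (CUSTOM_DATA_8 varchar(20) NULL);
INSERT INTO #demo VALUES ('null'), ('abc'), (NULL), (NULL);

-- Returns only 'abc': the two NULL rows are filtered out as well,
-- because NULL <> 'null' evaluates to UNKNOWN, not TRUE
SELECT * FROM #demo WHERE CUSTOM_DATA_8 <> 'null';

-- To keep the NULL rows, state the intent explicitly
SELECT * FROM #demo WHERE CUSTOM_DATA_8 <> 'null' OR CUSTOM_DATA_8 IS NULL;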

Adding a WHERE clause during data processing in the production environment of ETL is almost never a good idea

You can probably also see why it is not a good idea to add WHERE clauses to the data processing stage of an ETL job.

In production, WHERE clauses should not be added unless it’s a documented business requirement.

In the data discovery stage, WHERE clauses are used only for data discovery purposes and should not be carried over into production.

Handling of “bad” or missing dimensional data

If these “bad” data are part of the data you will use to slice and dice your measures, there are many ways you can handle them. A full treatment of handling “bad” or missing dimensional data is beyond the scope of this short post, but one common pattern is sketched below.
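The sketch reuses the #demo table from above; mapping SQL NULL (and junk strings like 'null') to an explicit 'Unknown' member (an arbitrary label) keeps every row:

-- Keep every row; give missing dimension values an explicit member
SELECT CASE
           WHEN CUSTOM_DATA_8 IS NULL OR CUSTOM_DATA_8 = 'null'
               THEN 'Unknown'
           ELSE CUSTOM_DATA_8
       END AS custom_data_8_clean
FROM #demo;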

ETL #71–Your data can mysteriously disappear after a WHERE clause was added (1)

April 24, 2015

Validations at end of ETL indicate missing data

At the end of each ETL job, I always run some sort of validation process to give me an indication that the data loading and processing are as expected, or that things have gone terribly wrong. The latter happened last Thursday, when I received the validation result in my inbox.

Last Thursday also coincided with a “disk utilization” issue on the upstream server, so I waited until Friday, when the issue was fixed, to have the data reloaded. To my surprise, the data volume loaded on our side was still way below normal.

Steps to investigate missing data

It’s time to investigate. Blog writers at this point will usually “make the long story short”; I’ll do the same here, but with a few bullet points to highlight the steps I took.

  1. Pick one day for one employee: this is the most obvious data point to validate. Our data contains detail down to the granularity of one row per employee, per Interaction Resource ID, per connection ID, per 15-minute interval. Picking one day for one employee gives me neither too little nor too much data to check.
  2. Validate on the upstream source server: the table in question is a fact table and has a unique interaction resource fact ID. The unique list of interaction resource fact IDs is an obvious candidate for my checking data points.
  3. Compare the interaction resource fact IDs between the upstream source server and our own server: because the data volume is small enough, a simple query revealed that about 20% of the interaction resource fact IDs were not loaded into our table.
  4. An ETL process designed with atomic operations in mind helps to pinpoint exactly where the problem might be: our fact tables are usually long and also wide, so it’s not easy to see visually why these 20% of the data were not loaded. It was time to go to the step where the data was processed before loading. Fortunately, my ETL processes are designed with atomic operations in mind, and I knew exactly where to look for the culprit.
  5. A WHERE clause was added to the processing stored procedure: a log entry in the procedure clearly says that a WHERE clause was added last Thursday. A quick run of the query inside the procedure showed that this WHERE clause filters out those 20% of the data:

ecda1.[CUSTOM_DATA_8] <> 'null'

A quick check of the raw data showed that, out of 1.3+ million rows, only one row has the value 'null' in ecda1.[CUSTOM_DATA_8]. Then why were 20% of the data also filtered out along with the one 'bad' row?

Why a seemingly innocent WHERE clause can cause a large amount of data to go missing

This post is getting longer than I wanted. I’ll stop now. In the next post, I’ll explain:

  • Why the above WHERE clause not only filtered out one ‘bad’ row, but also took 20% of the other rows along with it.
  • Why, in ETL, adding a WHERE clause during data processing in production is almost never a good idea.

Validation of data loading and processing

One of the most often used validation methods at the end of each ETL run is cross-reference checking on a couple of metrics, which entails finding two independent sources for the same metric.
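A minimal sketch of such a check (every table and column name here is hypothetical):

-- Cross-reference one metric (daily row counts) from two
-- independent sources; any mismatch flags the run for review
SELECT f.load_date,
       f.row_total AS fact_row_total,
       s.row_total AS source_row_total
FROM  (SELECT CAST(load_datetime AS date) AS load_date, COUNT(*) AS row_total
       FROM dbo.FactInteraction
       GROUP BY CAST(load_datetime AS date)) AS f
JOIN  (SELECT CAST(extract_datetime AS date) AS load_date, COUNT(*) AS row_total
       FROM staging.InteractionRaw
       GROUP BY CAST(extract_datetime AS date)) AS s
      ON s.load_date = f.load_date
WHERE f.row_total <> s.row_total;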

Atomic Operation in ETL

Atomic operation, atomic programming, atomic database design, atomic transaction: there are many explanations of these concepts. I am probably not qualified to give atomic operation a formal definition in ETL design, but it’s a strategy that every ETL designer/developer should learn to practice. As ETL designers/developers, our professional lives depend on how well we understand the strategy and how well we apply it to every single task we design.

SQL #58–Where are my own templates in SQL Server Management Studio 2012

January 27, 2015

The secret SQLFile.sql

We are all so used to clicking the New Query icon (or using the keyboard shortcut Ctrl+N). We know it opens a new, empty query window in SSMS, but many of us never knew that something else is designed to happen before the new query window opens.

I didn’t know that either, until a co-worker told us a trick to open a new query window pre-filled with a code snippet. To the credit of my co-worker, here is the story of the secret SQLFile.sql file.

“….earlier today about the upcoming OLAP environment and the need for standardization, and mentioned that I have a template that SSMS automatically launches every time that I open a new tab.  The template includes a USE database statement, along with some verbiage for getting a CREATE PROCEDURE or VIEW statement started, and then a comment block, including Change History log, that would be part of the procedure or view.

I have found that it makes it much harder for me to leave out the comments, because I no longer have to open an old stored procedure to find a comment block with which to start.

This script is just something I threw together; please feel free to modify it to suit your needs (like, replacing my name w/yours).

Having said that, I hope we can come up with a standard template for all of us to use, with minimal differences from one developer to another, so that it will help us document our code as consistently as possible.”

To make this template available in SSMS 2008 (or R2), put SQLFile.sql in this folder:

C:\Program Files (x86)\Microsoft SQL Server\100\Tools\Binn\VSShell\Common7\IDE\SqlWorkbenchProjectItems\Sql

In SSMS 2012, put SQLFile.sql in this folder:

C:\Program Files (x86)\Microsoft SQL Server\110\Tools\Binn\ManagementStudio\SqlWorkbenchProjectItems\Sql

Once you put your code snippet in the secret SQLFile.sql file, your new query window will no longer be empty. It will be pre-filled with your template.
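For illustration, a SQLFile.sql along the lines my co-worker described might look like this (the database name and the comment-block layout are placeholders to adapt to your own standards):

USE [YourDatabase];  -- replace with your working database
GO
/*========================================================
  Object:         <CREATE PROCEDURE or CREATE VIEW here>
  Author:         <your name>
  Created:        <date>
  Description:    <what this object does>
  Change History:
    <date>    <who>    <what changed>
========================================================*/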

Built-in script templates

SQL Server Management Studio is by far the best tool I’ve ever used. Its features, such as the Object Explorer, IntelliSense, the Registered Servers explorer, and of course the Template Explorer, are all powerful features that can significantly increase your productivity in SQL development.

In every version I have used, SQL Server Management Studio comes with a bunch of SQL (and MDX) script templates, including templates to create tables, views, stored procedures, triggers, statistics, functions, and so on. All you need to do is open the Template Explorer, by going to the View menu > Template Explorer, or simply by using the keyboard shortcut Ctrl+Alt+T. You’ll see all the built-in templates.

Create our own custom templates

What I really like about the template feature is that it allows us to create our own custom templates. Although the template feature allows you to auto-create code by filling out template parameters, I know most SQL developers use it only to store their own code snippets or team code templates, without bothering with the parameters.

But using the template feature is not without frustration. What frustrates me the most is to figure out where my own custom templates went.

Another mystery – where are my own code templates?

I am running Windows 7; here is the path where Microsoft puts all the custom template files (substitute your own user name for DefaultUser):

C:\Users\DefaultUser\AppData\Roaming\Microsoft\SQL Server Management Studio\11.0\Templates\Sql

On Windows 7, the AppData folder is a hidden folder, which adds another layer of secrecy to the whole template feature.

On Windows 7, to get hidden folders to show, open the Folder Options dialog (from the command window, control folders opens it directly; this is much faster than opening the Control Panel and hunting for the right applet wrapped in several folders).

Then choose Show hidden files, folders, and drives in the View tab, as shown in the following screenshot.

[Screenshot pict3: Folder Options > View tab > Show hidden files, folders, and drives]

Easy to use after all

After all the mysteries are solved, the template feature in SSMS is extremely easy to use and very user friendly.

  • Create a folder: this option lets you create your own folders under the root folder mentioned previously. Create as many folders or sub-folders as you need to organize your scripts.
  • Create Template: this option lets you create a template with your own code.
  • Open Template: this option lets you use the pre-filled code in your template.
  • Edit Template: this option lets you modify the code in your template.


SQL #57–A one-liner query with PARTITION BY

January 17, 2015

What can be done in SAS must be possible in SQL

A friend of mine is a data professional in marketing data analysis, with expertise in SAS and a pretty good working knowledge of SQL. When he sent me questions about “how to do … in SQL”, I often sent back a query with a subquery, a JOIN, a CTE, or all of them.

My friend has always been happy to take my suggestions, until one day he sent me a code snippet and insisted that what can be done in SAS must be possible in SQL too.

Getting the MAX row (by date) in each ID

In the following example, each ID can have multiple rows. What he wanted was to get the MAX row (by date) for each ID and then get its status. He also wanted a row count for each ID.

The three rows marked below should be the output.

 

id    date         status
101   1/3/2013     1
101   4/6/2012     3
101   11/23/2014   2    <- output row
108   6/16/2007    3    <- output row
108   1/24/2003    1
109   5/21/2014    1    <- output row

A simple aggregate query with a GROUP BY will not work

A simple aggregate query with a GROUP BY, such as the one below, is not going to work, because each MAX aggregate is computed independently of the others: the MAX(EventStatus) is not necessarily the status on the MAX(EventDate) row.

SELECT EventID,
       MAX(EventDate)   AS max_EventDate,
       MAX(EventStatus) AS max_EventStatus,
       COUNT(*)         AS row_count
FROM   #test_max_row
GROUP BY EventID

SAS has solved the classic MAX VALUE vs. MAX ROW problem

This is a classic MAX VALUE vs. MAX ROW problem in SQL.

It turns out that SAS solved this classic problem a long time ago, with no subquery whatsoever.

Here is the code in SAS my friend sent to me.

data test;
input id date mmddyy10. status;
cards;
101 01/03/2013 1
101 04/06/2012 3
101 11/23/2014 2
108 06/16/2007 3
108 01/24/2003 1
109 05/21/2014 1
;
run;

* In SAS;
proc sql;
create table results as
select id, date format=mmddyy10., status, count(id) as cnt_id
from test
group by id
having date=max(date);
quit;

The trick in the above SAS-flavored PROC SQL is in the HAVING clause: date=max(date).

In standard SQL, a bare column cannot appear in the HAVING clause; any column there must be accompanied by an aggregate function (or appear in the GROUP BY). In other words, the following is illegal in SQL:

having date=max(date)

I programmed in SAS for two years and have a basic understanding of PROC SQL, so it’s no surprise to me that SAS has its own implementation of SQL.

It’s easy in SQL too but…

Finding the max row in SQL is not hard either, but most approaches require a JOIN or a subquery. The most popular ways are:

  • Use a correlated subquery
  • Find the MAX(date) with a GROUP BY, then JOIN back to the original table
  • Use a subquery with ROW_NUMBER() OVER, using PARTITION BY and ORDER BY

ROW_NUMBER() OVER can be used to find the MAX row, and also to de-duplicate data

My favorite is the last one, using ROW_NUMBER() OVER. It’s very flexible and can be used not only to find the MAX row, but also to de-duplicate data.

But none of the above is a one-liner query, meaning a query with a single SELECT.
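For reference, here is what the ROW_NUMBER() version might look like (a sketch using the same temp table and column names as the final query below):

-- Not a one-liner: the ranking happens in a subquery
SELECT EventID, EventDate, EventStatus, row_count
FROM (
    SELECT EventID, EventDate, EventStatus,
           ROW_NUMBER() OVER (PARTITION BY EventID
                              ORDER BY EventDate DESC) AS rn,
           COUNT(*) OVER (PARTITION BY EventID) AS row_count
    FROM   #test_max_row
) AS t
WHERE rn = 1;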

A one-liner query is possible in SQL

I agree with my friend that a one-liner query is possible in SQL. In the end, this is what I came up with.

Here is my first try.

[Screenshot pic1: the first try]

I used the OVER() clause three times. The OVER() with PARTITION BY and ORDER BY sorts the rows by date in descending order within each ID. Then I used the FIRST_VALUE() function to get the MAX date row: the first call gets the first value of the date, and the second gets the first value of the status.

The last OVER() uses only a PARTITION BY, not an ORDER BY, because the COUNT() aggregate does not care about any particular ordering.

The result has the same number of rows as the original table, with each ID repeated. Adding DISTINCT reduces the result to one row per ID.

Here is the final query:


SELECT DISTINCT
       EventID,
       FIRST_VALUE(EventDate)   OVER (PARTITION BY EventID ORDER BY EventDate DESC) AS max_EventDate,
       FIRST_VALUE(EventStatus) OVER (PARTITION BY EventID ORDER BY EventDate DESC) AS max_row_EventStatus,
       COUNT(EventID)           OVER (PARTITION BY EventID) AS row_count_for_the_EventID
FROM   #test_max_row

Cannot recommend this

I didn’t have time to experiment with a large table to compare the execution plan and query time against the more conventional ways mentioned above, so I cannot recommend this yet. And I am not very optimistic about this query just from looking at it: the PARTITION BY is repeated three times, and with more columns in your output it would be repeated even more. SQL Server might optimize this kind of scenario, but I cannot be sure without further experimenting.

Another special thing about this query is that it doesn’t use GROUP BY. PARTITION BY is essentially the same as GROUP BY, but without reducing the result set; that’s why I had to use DISTINCT in the SELECT. DISTINCT itself is also a GROUP BY in disguise, which is another reason I am not very optimistic about this query.

It’s certainly fun to experiment!

Learning SQL #1–Start by Buying a Copy of SQL Server Developer Edition

January 9, 2015

If learning SQL is one of your New Year resolutions, you will be happy to read this post.

We all make resolutions that we fail to keep. What I am suggesting here will give that failure the smallest possible chance.

Here is what I am suggesting:

  • Get a copy of the SQL Server 2014 Developer Edition
  • Get a copy of Microsoft SQL Server 2012 T-SQL Fundamentals by Itzik Ben-Gan.

Developer Edition

A friend recently mentioned to me that SQL Server is too expensive to buy. I was surprised that $50 is considered too expensive. Then she said that someone told her she needs the Enterprise Edition, which is beyond her budget.

What I am suggesting is the Developer Edition, which has the same functionality as the Enterprise Edition but is licensed for use as a development and test system, not as a production server. With a price tag of about $50 (on Amazon), it is an ideal choice for people who build and test database applications, and of course for people who have just made a New Year resolution to learn SQL.

T-SQL Fundamentals

The T-SQL Fundamentals book by Itzik Ben-Gan is also a wonderful choice as your first SQL book. I took an advanced SQL class from Itzik two years ago. He is one of the best in the SQL community.

Minimum to get started

There is more in the SQL Server 2014 Developer Edition than you can imagine. But here is the minimum you need to get started:

  • SQL Server 2014 Database Engine (server instance): this is the relational database instance.
  • SQL Server 2014 Management Studio: this is the desktop tool that lets you run SQL queries against the database.

Other features on the installation media include SSAS, SSIS, and a whole bunch of tools. If you are just starting to learn the SQL language, don’t bother with the other features. A first test query is sketched below.
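Once both pieces are installed, a minimal first test from a new query window in Management Studio could look like this (nothing here depends on any particular database):

-- Confirms the client and the server instance are talking
SELECT @@VERSION AS sql_server_version;

-- A first taste of a real query, against a table you create yourself
CREATE TABLE #hello (greeting varchar(50));
INSERT INTO #hello VALUES ('Hello, SQL!');
SELECT greeting FROM #hello;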

Learning-by-doing

Like many of you, I’ve collected several bookshelves of professional books over the years. But I have to admit that most of them sit there collecting dust. I am not trying to minimize the benefits of professional books; what I want to promote here is learning-by-doing.

    Through learning-by-doing, productivity is achieved through practice, self-perfection, and minor innovations; it can be measured by progress.

“For the things we have to learn before we can do them, we learn by doing them.”
Aristotle

“Delight” in SQL

Part of my own New Year resolution is to read the Bible in one year. To mirror the ones I made, I suggest that you also add these three resolutions:

  • Resolve to “delight” in SQL.
  • Resolve to focus in the spirit of SQL.
  • Resolve to enjoy your innovation.

SQL is a wonderful language that I have never stopped learning.

Go beyond just the syntax; ponder how it differs from the other programming languages you already know.

Sooner or later, you will start to write SQL queries without checking your books first. So start to innovate now.

Have fun learning!

 

5th Blogging Year

December 31, 2014

It’s hard to believe that it’s been 5 years since my first ever post, in which I simply spelled Business Intelligence. Thanks to everyone for being with me since then!

I blogged about my daily experience as a Data Warehouse and Business Intelligence developer. Most of my posts are about how to use the tools, i.e. the Microsoft SQL Server Business Intelligence stack, with more productivity and efficiency, and of course also about the two major data query languages, SQL and MDX. I enjoy working as a DWH and BI developer for the creativity I find in all I do. I mostly praise the tools and the query languages we use, but occasionally my posts complain about the inefficiency and even bugs in the tools.

In 2013, I co-authored the book MDX with SSAS 2012 Cookbook. As much as I’d love to write again and blog more, in 2014 I felt that I needed more time for myself to relax and recharge, and to spend more time with my family.

Looking forward to next year, I would like to think that being a DWH and BI developer will continue to give me professional satisfaction; I do not need to find a new career. I also hope that 2015 will give me inspiration for a new book for DWH and BI developers.

My posts cover a lot of ground: SSIS, SSRS, SSAS, SQL jobs, SSMS, Visual Studio’s SSDT, and SQL and MDX. But it’s interesting that the top 5 posts that received the most views are all about SSIS, according to WordPress’s annual report. More interestingly, these top SSIS posts are all about how to use variables in SSIS. I will consider writing about those topics again.

Once again, thanks for following and commenting on my posts in 2014. I look forward to seeing you all online again in 2015! Happy New Year!

 

Sherry Li
