What can be done in SAS must be possible in SQL
A friend of mine is a data professional in marketing data analysis with expertise in SAS and pretty good working knowledge of SQL. When he sent me questions about “how to do in SQL…”, I often sent back a query with either a sub query, or a JOIN, or a CTE, or with all of them.
My friend has been always happy to take my suggestions, until one day he sent me a code snippet and insisted that what can be done in SAS must be possible in SQL too.
Getting the MAX row (by date) in each ID
In the following example, each ID can have multiple rows. What he wants to do is to get the MAX row (by date) in each ID, and then get the status. He also wanted to get a row count for each ID.
The 3 rows highlighted in red should be the output.
A simple aggregate query with a GROUP BY will not work
A simple aggregate query with a GROUP BY, such as the one below, is not going to work, because each aggregate function MAX are independent of each other.
SELECT EventID , MAX(EventDate) AS max_EventDate , MAX(EventStatus) AS max_EventStatus , COUNT(*) AS row_count FROM #test_max_row GROUP BY EventID
SAS has solved the classic MAX VALUE vs. MAX ROW problem
This is a classic MAX VALUE vs. MAX ROW problem in SQL.
It turned out that SAS has solved this classic problem long time ago, with no sub query whatsoever.
Here is the code in SAS my friend sent to me.
input id date mmddyy10. status;
101 01/03/2013 1
101 04/06/2012 3
101 11/23/2014 2
108 06/16/2007 3
108 01/24/2003 1
109 05/21/2014 1
* In SAS;
create table results as
select id, date format=mmddyy10., status, count(id) as cnt_id
group by id
The trick in the above proc SQL with SAS-flavor is in the in the HAVING clause, date=max(date).
In standard SQL, the HAVING clause cannot have any column by itself and any columns in the HAVING clause must be companied by an aggregate function. In another word, the following is illegal in SQL:
I have programmed in SAS before for 2 years and have some basic understanding of proc SQL in SAS. So it’s not a surprise to me that SAS has its own implementation of SQL.
It’s easy in SQL too but…
To find the max row in SQL is not hard either, but most would require a JOIN or sub query. Most popular ways are:
- Use Correlated query
- Find the MAX(date) with GROUP BY, then JOIN to the original table
- Use a sub query that uses ROW_NUMBER() OVER with PARTITION BY and ORDER BY.
ROW_NUMBER() OVER can be used to find the MAX row, and also to de-duplicate data
My favorite is the last one that uses the ROW_NUMBER() OVER. It’s very flexible, and can be used to not only find the MAX row, but also to de-duplicate data.
But none of the above is a one-liner query, meaning with single SELECT.
A one-liner query is possible in SQL
I agree with my friend that a one-liner query is possible in SQL. In the end, this is what I came out with.
Here is my first try.
I used the OVER() clause three times. This OVER() with PARTITION BY and ORDER BY sorted the rows by date in descending order in each ID. The I used the FIRST_VALUE() function to get the MAX date row. The first one gets the first value of the date, and the second one gets the first value of the status.
The last OVER() uses only a PARTITION by not the ORDER BY because the count() aggregate function does not care about any particular ordering.
The result has the same number of rows as the original table with each ID being repeated. By adding a DISTINCT word, the result is reduced to one ID per row.
Here is the final query:
Here is the query in text.
SELECT DISTINCT EventID , first_value(EventDate) OVER(PARTITION BY EventID ORDER BY EventDate DESC) AS max_EventDate , first_value(EventStatus) OVER(PARTITION BY EventID ORDER BY EventDate DESC) AS max_row_EventStatus , count(EventID) OVER(PARTITION BY EventID) AS row_count_for_the_EventID FROM #test_max_row
Cannot recommend this
I didn’t have time to experiment with a large table to test the execution plan and query time against the other more conventional ways mentioned above, so I cannot recommend this yet. But I am not very optimistic about this query by just looking at it. The PARTITION BY was repeated three times. If you have more columns in your output, the PARTITION BY will be repeated even more times. SQL Server might have done some optimization in this kind of scenario, but I cannot be sure without further experimenting.
Another special thing about this query is that it didn’t use GROUP BY. PARTITION BY is essentially the same as GROUP BY, but without reducing the results. That’s why I had to use the DISTINCT in the SELECT. DISTINCT itself is also a GROUP BY in disguise. This is another reason I am not very optimistic about this query.
It’s certainly fun to experiment!
If learning SQL is in your New Year Resolution, then you will be happy to read this post.
We all make resolutions that we fail to keep. What I am suggesting here will give that failure minimum chance.
Here is what I am suggesting:
- Get a copy of the SQL Server 2014 Developer Edition
- Get a copy of Microsoft SQL Server 2012 T-SQL Fundamentals By Itzik Ben-Gan.
A friend recently mentioned to me that SQL Server is too expensive to buy. I was surprised that $50 is considered too expensive. Then she said that someone told her she needs the Enterprise Edition, which is beyond her budget.
What I am suggesting is the Developer Edition, which has the same functionalities as the Enterprise Edition, but licensed for use as a development and test system, not as a production server. With a price tag of about $50 (on Amazon) it is an ideal choice for people who build and test database applications, and of cause, for people who have just made a New Year Resolution to learn SQL.
The T-SQL Fundamentals book by Itzik Ben-Gan is also a wonderful book to have as your first SQL book. I took an advanced SQL class from Itzik two years ago. He is one of the best in the SQL circle.
Minimum to get started
There is more in the SQL Server 2014 Developer Edition than what you can imagine. But here is the minimum for you to get started:
- SQL Server 2014 Database Engine (server instance): this is the relational database instance.
- SQL Server 2014 Management Studio: this is the desktop tool that allows you to run SLQ queries again the database.
Other features on the CD are SSAS, SSIS and whole bunch of tools. If you are just starting to learn the SQL language, don’t bother yourself with other features.
Like many of you, I’ve collected several bookshelves of professional books over the years. But I have to admit that most of them are sitting there collecting dusts. I am not trying to minimize the benefits of professional books. What I want to promote here is learning-by-doing.
- Through learning-by-doing your productivity is achieved through practice, self-perfection and minor innovations. Learning-by-doing can be measured by progress.
“For the things we have to learn before we can do them, we learn by doing them.”
“Delight” in SQL
Part of my own New Year Resolution is to read the Bible in One Year. To mirror what I made, I suggest that you also add these three resolutions:
- Resolve to “delight” in SQL.
- Resolve to focus in the spirit of SQL.
- Resolve to enjoy your innovation.
SQL is a wonderful language that I never stopped learning.
Go beyond just the syntax, ponder on how it is different from other programing languages you’ve already known.
Soon or later, you will start to write SQL queries without checking your books first. So start to innovate now.
Have fun learning!
It’s hard to believe that it’s been 5 years since my first ever post, in which I simply spelled Business Intelligence. Thanks to everyone for being with me since then!
I blogged around my daily experience as a Data Warehouse and Business Intelligence developer. Most of my posts are about how to use the tools, i.e. the Microsoft SQL Server Business Intelligence stacks with more productivity and efficiency, of cause, also about the two major data query languages, SQL and MDX. I enjoy working as a DWH and BI developer for the creativity I found in all I do. I mostly praise the tools and the query languages we use, but occasionally my posts complain about the the inefficiency and even bugs in the tools.
In 2013, I co-authored the book MDX with SSAS 2012 Cookbook. As much as I’d love to write again and blog more, in 2014 I felt that I I needed more time for myself to relax and recharge, and to spend more time with my family.
Looking forward to next year, I would like to think that being a DWH and BI developer will continue to give me professional satisfactions; I do not need to find a new career. I also hope that 2015 is a year that will give me inspiration for a new book for DWH and BI developers.
My posts cover a lot of grounds in SSIS, SSRS, SSAS, SQL Jobs, SSMS, Visual Studio’s SSDT, and SQL and MDX. But it’s interesting that the top 5 posts that received the most views are all in SSIS, according to the WordPress’s annual report. More interestingly, these top SSIS posts are all about how to use variables in SSIS. I will consider writing about those topics again.
Once again, thanks for following and commenting on my posts in 2014. I look forward to seeing you all online again in 2015! Happy New Year!
ROLLUP, CUBE, and GROUPING SETS Operators have been around since SQL Server 2005. These operators are extensions of the GROUP BY clause.
The GROUPING SETS operator might be introduced later in SQL Server 2008.
I only have experience with the ROLLUP operator. My test of GROUPING SET also suggested that it is not fully supported either.
The ROLLUP operator is useful in generating reports that contain subtotals and totals
The ROLLUP operator is useful in generating reports that contain subtotals and totals in the same dataset.
This simple GROUP BY query with a ROLLUP operator generates a grand total rows at the end.
Dynamic SQL and aggregates with the ROLLUP are commonly used in web reporting project
You might be wondering where you can actually use this feature. I never used it until I was involved in a web reporting project. To centralize all the data manipulation logic in the database layer, the decision was to write a very flexible stored procedure that meets the following three requirements:
- All metrics calculations must be table-driven.
- Only one stored procedure should be used.
- The stored procedure should return subtotals and grand totals in the same result set.
The first two requirements forced us to generate dynamic SQLs instead of static SQL queries. The last requirement forced us to resort to the ROLLUP operator with aggregations.
This all worked well until we had a new requirement for an additional metric, which needs to calculate:
- A percentage of a COUNT for group over a subtotal.
SUM(COUNT(*)) OVER() can generate an aggregate count
My first reaction is to simply add a TOTAL_CNT to the query using SUM(COUNT(*)) OVER(). Then the PCT_OF_TOTAL will be just the CNT over the TOTAL_CNT.
My thought was right on the mark, until I noticed that ROLLUP operator altered two calculations, TOTAL_CNT and PCT_OF_TOTAL.
To try to figure out what exactly went wrong, I tested a simple aggregation query.
Here I added a few things to the first query.
- VacationHours and AVG_VacationHours: I wanted to make sure that a simple averaging would work always.
- TOTAL_CNT: I used a SUM(COUNT(*)) OVER() to get the total count over all the DepartmentName.
- PCT_OF_TOTAL: this is just the CNT over the TOTAL_CNT.
This query has no ROLLUP or GROUPING SETS. All numbers look good including the TOTAL_CNT and PCT_OF_TOTAL
GROUPING SETS and ROLLUP messed up the SUM(COUNT(*)) OVER()
Next, I added the GROUPING SETS operator.
with the GROUPING SETS operator, a grand total row was added to the result set. The count and the average are all correct, but the TOTAL_CNT and the PCT_OF_TOTAL has been messed up.
Looking closely, you can see that the TOTAL_CNT was doubled. You might be tempted to try to correct it by dividing by 2 to get it right. But adding more hierarchies to the aggregation generates even more confusing results.
ROLLUP operator does not support the SUM() OVER()
I was quite disappointed that the ROLLUP operator does not support the SUM() OVER(). This basically means that I’d need to write a lot more inefficient code to achieve this. With the dynamic queries in the midst, the coding will not be as efficient and elegant.
The closest example I could find is from this SQL Magazine article from 2012 by D. Yitzhak, in which only examples of ranking functions with the ROLLUP, CUBE, and GROUPING SETS operators were given.
GROUPING SETS is supposed to supersede ROLLUP
In the last query, I used the GROUPING SETS instead of the ROLLUP operator. This is because GROUPING SETS is a newer additions to SQL Server, and is supposed to supersede ROLLUP. According to many blogs, the WITH ROLLUP and WITH CUBE features will be removed in future versions of SQL Server. But I have not seen any official statement from Microsoft.
For more on GROUPING SETS in SQL Server 2008 R2, check out these TechNet article:
OVER Clause has been around in SQL Server for a long time. If you have used any of the following types of functions, you are already familiar with it:
- Ranking functions
- Aggregate functions
- Analytic functions
- NEXT VALUE FOR function
The OVER clause defines a window within a query result set. A window function then computes a value for each row in the window. You can use the OVER clause with functions to compute aggregated values such as
- moving averages
- cumulative aggregates
- running totals
- top N per group results
In the following example, adding an ORDER BY clause to the SUM(COUNT(*)) OVER() generates running totals.
For more information, check out this MSDN SQL Server 2014 article:
When we design a database, we usually need to do an estimate on the size of the database based on the dimension and fact tables.
Keep the following information handy next time when you need to do so.
The lists are based on this article:
Table A: Character data types
|Data Type||Length||Storage Size||Max Characters||Unicode|
|char||Fixed||Always n bytes||8,000||No; each character requires 1 byte|
|varchar||Variable||Actual length of entry in bytes||8,000||No; each character requires 1 byte|
|nchar||Fixed||Twice n bytes||4,000||Yes; each character requires 2 bytes|
|nvarchar||Variable||Twice actual length of entry in bytes||4,000||Yes; each character requires 2 bytes|
Table B: Integer data types
|Data type||Minimum value||Maximum value||Storage size|
Table C: Precision storage requirements
|Total characters (precision)||Storage size|
|1 – 9||5 bytes|
|10 – 19||9 bytes|
|20 – 28||13 bytes|
|29 – 38||17 bytes|
Table D: Float and real data type restrictions
|Data type||n||Minimum Value||Maximum value||Precision||Storage size|
|float(n)||1 – 24||-1.79E + 308||1.79 + 308||7 digits||4 bytes|
|25 – 53||-1.79E + 308||1.79E + 308||15 digits||8 bytes|
|real||n/a||-3.40E + 38||3.40E + 38||7 digits||4 bytes|
Table E: Smalldatetime and datetime restrictions
|Data type||Minimum value||Maximum value||Accuracy||Storage size|
|smalldatetime||January 1, 1900||June 6, 2079||Up to a minute||4 bytes (the first 2 bytes store the date; the second 2 bytes store the time)|
|datetime||January 1, 1753||December 31, 9999||One three-hundredth of a second||8 bytes (the first 4 bytes store the date; the second 4 bytes store the time)|
|date||January 1, 1900||December 31, 9999||Only date||4 bytes ?|
|time||00:00:00.0000000||23:59:59.9999999||Only time||4 bytes ?|
|datetime2||larger year and second range||8 bytes ?|
Table F: Smallmoney and money restrictions
|Data type||Minimum value||Maximum value||Storage size|
Does SQL Server 2008 have any new data types?
SQL Server 2008 has several new data types:
- date stores only date values with a range of 0001-01-01 through 9999-12-31.
- time stores only time values with a range of 00:00:00.0000000 through 23:59:59.9999999.
- datetime2 has a larger year and second range.
- datetimeoffset lets you consider times in different zones.
- hierarchyid constructs relationships among data elements within a table, so you can represent a position in a hierarchy.
- spatial identifies geographical locations and shapes — landmarks, roads, and so on.
OLEDB is the native client for SQL Server. Every SSIS developer is familiar with how to use it. Almost every task in Integration Services can use an OLEDB connection. As a matter of fact, it’s a preferable client to use if your backend is SQL Server.
However, there is one exception, that is the Transfer SQL Server Objects Task. This task uses an SMO Server connection only.
If you don’t know this before you use this transfer object task, there can be confusions during development and also during deployment.
Here are my lessons learned.
- Visual differences between the OLEDB and the SMO Server connections in Integration Services 2012 helps
In SSIS 2012, I am happy to see the visual differences. If you are still stuck in 2008, you will not see any visual differences.
- Name your SMO Server connections properly
A good practice is to name the SMO Server connection properly. I used such names as:
- SMOServer_Staging (this could be the source server for the transfer task)
- SMOServer_Reporting (this could be the destination server for the transfer task)
- Rename the SMO Server connection properly
In SSIS 2012, you have three difference places to create connections.
- At project level: connections are available to all packages in the project
- At package level: only available to the packages in which the connection is created
- At task level: also only available to the packages in which the connection is created
The best practice is to create connections at the project level.
However, no matter where you are creating the connections, the UI in the SSIS always prefix the connection with the server name:
- For OLEDB – servername.databasename: removing the servername is usually the best practice here, leaving just the database name.
- For SMO Server – servername: since we cannot just simply remove the servername, the best practice here is to rename it to what is suggested in the section 2.
- Confusions during deployment over where to change to production/test server
In 2008 , during deployment, I’ve seen DBAs done the following:
- Create new connection: because of the confusions about OLEDB vs. SMOServer connections, and even bigger confusions if your connections have your development server name in it, DBAs would most likely choose to create new connections.
- Rename the connection name: again if you have not named your SMO connections properly, they will tend to rename it.
Neither of the above practices is good. If you as a developer have done if properly, DBA’s will not need to mess too much with the connections:
- Named your SMO Server connections properly so they know the differences with the OLEDB connections.
- Your connection names do not have your development server name so they don’t need to rename it.
- In 2012, the deployment is a lot easier
In 2012, deployment is much more straightforward because
- Connections can be created at project level: there is only one place to mess with the connections, including the connection name, server name and database name, and bunch of other connection properties
- Of cause you can take even greater advantages of many features such as project level parameters if you have chosen to deploy packages to SSIS Catalog
More about SMO connection manager
Quote from the MSDN article:
An SMO connection manager enables a package to connect to a SQL Management Object (SMO) server. The transfer tasks that SQL Server Integration Services includes use an SMO connection manager. For example, the Transfer Logins task that transfers SQL Server logins uses an SMO connection manager.
When you add an SMO connection manager to a package, Integration Services creates a connection manager that will resolve to an SMO connection at run time, sets the connection manager properties, and adds the connection manager to the Connections collection on the package. The ConnectionManagerType property of the connection manager is set to SMOServer.
You can configure an SMO connection manager in the following ways:
- Specify the name of a server on which SQL Server is installed.
- Select the authentication mode for connecting to the server.
More about Transfer SQL Server Objects task
Besides, the data flow object, the Transfer SQL Server Objects task is another task you can use to transfer not just data but also other types of objects between instances of SQL Server. For more info, check out this MSDN article.
The Analysis Services Performance Guide from bunch of Microsoft Analysis Services experts have been updated since 2005 edition for 2008 R2 and for 2012.
- Microsoft SQL Server 2005 Analysis Services Performance Guide
- Microsoft SQL Server 2008 R2 Analysis Services Performance Guide
- Microsoft SQL Server Analysis Services Multidimensional Performance and Operations Guide
I highly recommend the guide. It is short, to the point and it’s FREE.
Here is my summary of how to build optimal dimensions from the Part 1 of the guide.
- Efficient Use of Attributes Is a Key Design Skill to Master
- Dimensions are composed of attributes, which are related to each other through hierarchies.
- Studying and implementing the attribute relationships available in the business model can help improve cube performance.
- Using the KeyColumns, ValueColumn, and NameColumn properties effectively can reduce processing time. They also reduce the size of the dimension and the likelihood of user errors. This is especially true for attributes that have a large number of members, that is, greater than one million members.
- KeyColumns: define the attribute; uniquely identify each instance of the attribute; It is a best practice to assign a numeric source field, if available, to the KeyColumns property rather than a string property; Furthermore, use a single column key instead of a composite, multi‐column key.
- NameColumn: displayed to end users; if empty, default to the KeyColumns property.
- ValueColumn: typically used for calculations; strongly typed and can be accessed through the MemberValue MDX function.
- Using ValueColumn, and NameColumn properties: eliminates the need for extraneous attributes. This reduces the total number of attributes in your design, making it more efficient.
- Consider using Key fields (numeric) for the KeyColumns property, rather than the alphanumeric fields.
- Considering Hiding Most Attribute Hierarchies
- Set AttributeHierarchyVisible = false for most attribute hierarchies.
- Hiding the Surrogate Key.
- Considering disabling Attribute Hierarchy
- If attribute is included for ordering purpose, set AttributeHierarchyEnabled = false and AttributeHierarchyOptimizedState = NotOptimized to save on processing operations
- Considering Disabling Ordering of Attribute
- For such hidden attribute that you used only for implementation purposes, you don’t care about the ordering of an attribute (the surrogate key is one such case), set AttributeHierarchyOrdered = false to save time during processing of the dimension.
- Set Explicit Default Attribute Members Properly (carefully)
- Default Attribute Members: Any query that does not explicitly reference a hierarchy will use the current member of that hierarchy.
- The default behavior of Analysis Services is to assign the All member of a dimension as the default member.
- Explicitly set default members only on dimensions with single hierarchies or in hierarchies that do not have an All level.
- Removing Meaningless All Level
- It can even be expensive to ask for the All level of dimension if there is not good aggregate to respond to the query.
- You can disable the All member in a hierarchy. You do this by setting the IsAggregateable = false on the attribute at the top of the hierarchy.
- Note that if you disable the All level, you should also set a default member as described in the previous section–if you don’t, Analysis Services will choose one for you.
- Identifying Attribute Relationships Other Than the Relationships to the Primary Key
- Bushy attribute relationship: By default, all attributes are related to the primary key, and the attribute relationship diagram represents a “bush” where relationships all stem from the key attribute and end at each other’s attribute.
- Redefine attribute relationships: You can optimize performance by defining hierarchical relationships supported by the data.
- Attribute relationships help performance in 3 ways:
- Cross products between levels (for example the cross-product between Subcategory and Category) in the hierarchy do not need to go through the key attribute, saving CPU time during queries.
- Aggregations built on attributes can be reused on related attributes, saving resources during processing and for queries.
- Auto-exits can be more efficient to eliminate attribute combinations that do not exist in the data.
- If the attribute relationship is defined, the Analysis Services engine knows beforehand which category each subcategory belongs to via indexes built at process time.
- Flexible vs. Rigid Relationships: considering using rigid relationships
- A flexible attribute relationship (customer-city) is one where members can move around during dimension updates, and
- A rigid attribute relationship (month-year) is one where the member relationships are guaranteed to be fixed.
- Processing flexible attribute relationship is expensive: When a change is detected during process in a flexible relationship, all indexes for partitions referencing the affected dimension (including the indexes for attribute that are not affected) must be invalidated. This is an expensive operation and may cause Process Update operations to take a very long time. Indexes invalidated by changes in flexible relationships must be rebuilt after a Process Update operation with a Process Index on the affected partitions; this adds even more time to cube processing.
- Flexible relationships are the default setting. Carefully consider the advantages of rigid relationships and change the default where the design allows it.
- Considering unnatural user hierarchies
- There are two types of user hierarchies, natural and unnatural hierarchy.
- Unnatural hierarchies have at least two consecutive levels that have no attribute relationships, commonly created for drill‐down paths of commonly viewed attributes that do not follow any natural hierarchy. An example is gender-education relationship.
- Unnatural hierarchies are not materialized on disk, and the attributes participating in unnatural hierarchies are not automatically considered as aggregation candidates. Rather, they simply provide users with easy‐to‐use drill‐down paths for commonly viewed attributes that do not have natural relationships.
- By assembling these attributes into hierarchies, you can also use a variety of MDX navigation functions to easily perform calculations like percent of parent.
- Considering accessing attributes via member properties only
- Disable the attribute’s hierarchy: by setting the AttributeHierarchyEnabled property to False if you only want to access an attribute as member property.
- Member properties provide a different mechanism to expose dimension information.
- For a given attribute, member properties are automatically created for every direct attribute relationship.
- For the primary key attribute, this means that every attribute that is directly related to the primary key is available as a member property of the primary key attribute.
- After you verify that the correct relationship is in place, you can disable the attribute’s hierarchy.
- Disabling the attribute hierarchy can improve performance and decrease cube size.
- Because the attribute will no longer be indexed or aggregated. This can be especially useful for high‐cardinality attributes that have a one‐to‐one relationship with the primary key.
- High cardinality attributes such as phone numbers and addresses typically do not require slice‐and‐dice analysis.
- By disabling the hierarchies for these attributes and accessing them via member properties, you can save processing time and reduce cube size.