ROLLUP, CUBE, and GROUPING SETS Operators have been around since SQL Server 2005. These operators are extensions of the GROUP BY clause.
The GROUPING SETS operator was introduced later, in SQL Server 2008.
I only have experience with the ROLLUP operator. My testing of GROUPING SETS suggested that it is not fully supported either.
The ROLLUP operator is useful in generating reports that contain subtotals and totals in the same dataset.
This simple GROUP BY query with a ROLLUP operator generates a grand total row at the end.
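A minimal sketch of such a query (the table and column names below are illustrative, loosely modeled on the AdventureWorks sample database):

```sql
-- ROLLUP adds a grand total row where DepartmentName is NULL
SELECT DepartmentName,
       COUNT(*) AS CNT
FROM dbo.EmployeeDepartment        -- illustrative table
GROUP BY ROLLUP (DepartmentName);
```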
Dynamic SQL and aggregates with ROLLUP are commonly used in web reporting projects
You might be wondering where you can actually use this feature. I never used it until I was involved in a web reporting project. To centralize all the data manipulation logic in the database layer, the decision was to write a very flexible stored procedure that meets the following three requirements:
- All metrics calculations must be table-driven.
- Only one stored procedure should be used.
- The stored procedure should return subtotals and grand totals in the same result set.
The first two requirements forced us to generate dynamic SQL instead of static SQL queries. The last requirement forced us to resort to the ROLLUP operator with aggregations.
This all worked well until we had a new requirement for an additional metric, which needed to calculate:
- The percentage of each group's COUNT over a subtotal.
SUM(COUNT(*)) OVER() can generate an aggregate count
My first reaction was to simply add a TOTAL_CNT to the query using SUM(COUNT(*)) OVER(). Then the PCT_OF_TOTAL would be just the CNT over the TOTAL_CNT.
My thought was right on the mark, until I noticed that the ROLLUP operator altered two calculations, TOTAL_CNT and PCT_OF_TOTAL.
To try to figure out what exactly went wrong, I tested a simple aggregation query.
Here I added a few things to the first query.
- VacationHours and AVG_VacationHours: I wanted to make sure that a simple average would always work.
- TOTAL_CNT: I used a SUM(COUNT(*)) OVER() to get the total count over all the DepartmentName.
- PCT_OF_TOTAL: this is just the CNT over the TOTAL_CNT.
This query has no ROLLUP or GROUPING SETS. All numbers look good, including the TOTAL_CNT and PCT_OF_TOTAL.
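The test query looked roughly like this (table and column names are illustrative):

```sql
SELECT DepartmentName,
       COUNT(*) AS CNT,
       SUM(VacationHours) AS VacationHours,
       AVG(VacationHours) AS AVG_VacationHours,
       SUM(COUNT(*)) OVER () AS TOTAL_CNT,                     -- total count over all departments
       1.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS PCT_OF_TOTAL  -- CNT over TOTAL_CNT
FROM dbo.EmployeeDepartment        -- illustrative table
GROUP BY DepartmentName;
```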
GROUPING SETS and ROLLUP messed up the SUM(COUNT(*)) OVER()
Next, I added the GROUPING SETS operator.
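The change amounts to swapping the plain GROUP BY for a GROUPING SETS clause; a sketch (illustrative names):

```sql
SELECT DepartmentName,
       COUNT(*) AS CNT,
       AVG(VacationHours) AS AVG_VacationHours,
       SUM(COUNT(*)) OVER () AS TOTAL_CNT,
       1.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS PCT_OF_TOTAL
FROM dbo.EmployeeDepartment        -- illustrative table
GROUP BY GROUPING SETS ((DepartmentName), ());  -- the empty set () adds the grand total row
```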
With the GROUPING SETS operator, a grand total row was added to the result set. The count and the average are all correct, but the TOTAL_CNT and the PCT_OF_TOTAL have been messed up.
Looking closely, you can see that the TOTAL_CNT was doubled. You might be tempted to try to correct it by dividing by 2 to get it right. But adding more hierarchies to the aggregation generates even more confusing results.
ROLLUP operator does not support the SUM() OVER()
I was quite disappointed that the ROLLUP operator does not support SUM() OVER(). This basically means that I'd need to write a lot more inefficient code to achieve this. With dynamic queries in the mix, the code would be neither efficient nor elegant.
The closest example I could find is from this SQL Magazine article from 2012 by D. Yitzhak, in which only examples of ranking functions with the ROLLUP, CUBE, and GROUPING SETS operators were given.
GROUPING SETS is supposed to supersede ROLLUP
In the last query, I used GROUPING SETS instead of the ROLLUP operator. This is because GROUPING SETS is a newer addition to SQL Server, and is supposed to supersede ROLLUP. According to many blogs, the WITH ROLLUP and WITH CUBE features will be removed in future versions of SQL Server. But I have not seen any official statement from Microsoft.
For more on GROUPING SETS in SQL Server 2008 R2, check out this TechNet article:
OVER Clause has been around in SQL Server for a long time. If you have used any of the following types of functions, you are already familiar with it:
- Ranking functions
- Aggregate functions
- Analytic functions
- NEXT VALUE FOR function
The OVER clause defines a window within a query result set. A window function then computes a value for each row in the window. You can use the OVER clause with functions to compute aggregated values such as
- moving averages
- cumulative aggregates
- running totals
- top N per group results
In the following example, adding an ORDER BY clause to the SUM(COUNT(*)) OVER() generates running totals.
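A sketch of that query (illustrative names); the ORDER BY inside OVER() turns the single grand total into a running total across departments:

```sql
SELECT DepartmentName,
       COUNT(*) AS CNT,
       -- running total of the per-department counts
       SUM(COUNT(*)) OVER (ORDER BY DepartmentName) AS RUNNING_TOTAL
FROM dbo.EmployeeDepartment        -- illustrative table
GROUP BY DepartmentName;
```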
For more information, check out this MSDN SQL Server 2014 article:
When we design a database, we usually need to estimate the size of the database based on the dimension and fact tables.
Keep the following information handy the next time you need to do so.
The lists are based on this article:
Table A: Character data types
| Data Type | Length | Storage Size | Max Characters | Unicode |
|---|---|---|---|---|
| char | Fixed | Always n bytes | 8,000 | No; each character requires 1 byte |
| varchar | Variable | Actual length of entry in bytes | 8,000 | No; each character requires 1 byte |
| nchar | Fixed | Twice n bytes | 4,000 | Yes; each character requires 2 bytes |
| nvarchar | Variable | Twice actual length of entry in bytes | 4,000 | Yes; each character requires 2 bytes |
Table B: Integer data types
| Data type | Minimum value | Maximum value | Storage size |
|---|---|---|---|
| tinyint | 0 | 255 | 1 byte |
| smallint | -32,768 | 32,767 | 2 bytes |
| int | -2,147,483,648 | 2,147,483,647 | 4 bytes |
| bigint | -9,223,372,036,854,775,808 | 9,223,372,036,854,775,807 | 8 bytes |
Table C: Precision storage requirements
| Total characters (precision) | Storage size |
|---|---|
| 1 – 9 | 5 bytes |
| 10 – 19 | 9 bytes |
| 20 – 28 | 13 bytes |
| 29 – 38 | 17 bytes |
Table D: Float and real data type restrictions
| Data type | n | Minimum value | Maximum value | Precision | Storage size |
|---|---|---|---|---|---|
| float(n) | 1 – 24 | -1.79E+308 | 1.79E+308 | 7 digits | 4 bytes |
| float(n) | 25 – 53 | -1.79E+308 | 1.79E+308 | 15 digits | 8 bytes |
| real | n/a | -3.40E+38 | 3.40E+38 | 7 digits | 4 bytes |
Table E: Smalldatetime and datetime restrictions
| Data type | Minimum value | Maximum value | Accuracy | Storage size |
|---|---|---|---|---|
| smalldatetime | January 1, 1900 | June 6, 2079 | Up to a minute | 4 bytes (the first 2 bytes store the date; the second 2 bytes store the time) |
| datetime | January 1, 1753 | December 31, 9999 | One three-hundredth of a second | 8 bytes (the first 4 bytes store the date; the second 4 bytes store the time) |
| date | January 1, 0001 | December 31, 9999 | One day (date only) | 3 bytes |
| time | 00:00:00.0000000 | 23:59:59.9999999 | 100 nanoseconds (time only) | 3 to 5 bytes |
| datetime2 | January 1, 0001 | December 31, 9999 | 100 nanoseconds | 6 to 8 bytes |
Table F: Smallmoney and money restrictions
| Data type | Minimum value | Maximum value | Storage size |
|---|---|---|---|
| smallmoney | -214,748.3648 | 214,748.3647 | 4 bytes |
| money | -922,337,203,685,477.5808 | 922,337,203,685,477.5807 | 8 bytes |
Does SQL Server 2008 have any new data types?
SQL Server 2008 has several new data types:
- date stores only date values with a range of 0001-01-01 through 9999-12-31.
- time stores only time values with a range of 00:00:00.0000000 through 23:59:59.9999999.
- datetime2 has a larger year and second range.
- datetimeoffset lets you consider times in different zones.
- hierarchyid constructs relationships among data elements within a table, so you can represent a position in a hierarchy.
- spatial types (geometry and geography) identify geographical locations and shapes — landmarks, roads, and so on.
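A quick sketch of declaring the new date/time types:

```sql
DECLARE @d   date           = '2014-07-01';
DECLARE @t   time(7)        = '23:59:59.9999999';
DECLARE @dt2 datetime2(7)   = '2014-07-01 23:59:59.9999999';
DECLARE @dto datetimeoffset = '2014-07-01 08:00:00 -05:00';  -- time zone aware
SELECT @d AS d, @t AS t, @dt2 AS dt2, @dto AS dto;
```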
Keyboard shortcut Alt + F1 works now
SSMS has a rich set of keyboard shortcuts that work by default. But in SSMS 2008 I had one complaint about the default Alt+F1 shortcut, which is assigned to sp_help. It worked for tables, sort of. If you are a good designer and have followed the best practice of putting your tables in the appropriate schemas, using the Alt+F1 shortcut is quite a pain. You will need to wrap your schema-qualified table name as follows for the shortcut to work.
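For example (Sales.Customer is an illustrative table name), in SSMS 2008 you had to highlight the bracket-wrapped two-part name before pressing Alt+F1:

```sql
[Sales.Customer]
```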
Check out this older blog.
Not any more in SSMS 2012. The fix is finally in. Either of the following will work by default. The old way of wrapping with [ ] no longer works.
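For example (illustrative table name), I believe highlighting either of these and pressing Alt+F1 works in SSMS 2012:

```sql
Sales.Customer
'Sales.Customer'
```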
This shortcut is equivalent to executing the sp_help procedure.
Assign keyboard shortcut to sp_helptext
The system procedure sp_helptext is a great helper for developers who work on SQL code day in and day out. I used to have Ctrl+F1 mapped to it. Unfortunately, by default in SSMS 2012, the Ctrl+F1 shortcut is now mapped to Display SQL Server Books Online.
There are people who have tried to re-map it, but that seems like too much trouble. So I decided to map Ctrl+3 to sp_helptext instead.
Highlight either of the following and press Ctrl+3, and you will get the procedure definition.
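For example (an illustrative procedure name):

```sql
dbo.uspGetEmployees
'dbo.uspGetEmployees'
```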
It is equivalent to executing sp_helptext.
A co-worker recently asked me why she would get a timeout when updating a column in a big table. The column needed to be updated to 0 (zero) because a default constraint of 0 (zero) was being added to the table.
The table has over 40 million records. Although I didn't ask for the exact size in MB, I am pretty sure that a table with over 40 million records is not a small table anymore.
I had a blog a few weeks ago that talked about 4 different scenarios when deleting or updating data on a large table.
I didn't refer my co-worker to the above blog because it is really too long to read in one minute, and many scenarios do not apply to her issue. Scenario #3 in the above blog, "Updating a Nonpartitioned Table – Switch with a Temp Table", is the closest to the issue my co-worker was trying to resolve.
So I summarized a few steps for my co-worker. When we "update" a big table, we actually want to avoid using the UPDATE command directly on the main table. We want to take advantage of the bulk load optimizations in T-SQL with the INSERT … SELECT method or the SELECT … INTO method.
Update a large table
“Updating” a large table may be time-consuming.
Here is a summary of steps that use the INSERT … SELECT method:
1. Create a temporary table with the same table schema.
2. Copy all the data from the actual table to the temporary table using INSERT … SELECT method. Make sure that the column that needs to be updated to 0 (zero) has value 0 (zero) in the SELECT clause.
3. Drop the actual table. If the large table has constraints and indexes, make sure that you have the scripts to re-create them.
4. Rename the temporary table to the name of the actual table. Recreate any constraints and indexes.
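The four steps above can be sketched as follows (table and column names are illustrative; ColToUpdate stands for the column getting the new default of 0):

```sql
-- 1. Create an empty temp table with the same schema
SELECT * INTO dbo.BigTable_Temp FROM dbo.BigTable WHERE 1 = 0;

-- 2. Copy the data, forcing the updated column to 0
INSERT INTO dbo.BigTable_Temp WITH (TABLOCK)
SELECT Col1, Col2, 0 AS ColToUpdate
FROM dbo.BigTable;

-- 3. Drop the actual table (script its constraints and indexes first)
DROP TABLE dbo.BigTable;

-- 4. Rename the temp table, then re-create constraints and indexes
EXECUTE sp_rename 'dbo.BigTable_Temp', 'BigTable';
```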
I like the SELECT … INTO method. So here are the steps:
1. Create a temporary table with the same table schema using the SELECT … INTO method. The data is copied to the temporary table at the same time the temporary table is created. Make sure that the column that needs to be updated to 0 (zero) has the value 0 (zero) in the SELECT clause.
2. Truncate the actual table.
3. Use INSERT … SELECT method to insert the data from the temporary table to the actual table.
4. Drop the temporary table.
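A sketch of the SELECT … INTO variant (illustrative names again):

```sql
-- 1. Create and load the temp table in one statement, forcing the column to 0
SELECT Col1, Col2, 0 AS ColToUpdate
INTO dbo.BigTable_Temp
FROM dbo.BigTable;

-- 2. Truncate the actual table
TRUNCATE TABLE dbo.BigTable;

-- 3. Bulk insert the data back into the actual table
INSERT INTO dbo.BigTable WITH (TABLOCK)
SELECT Col1, Col2, ColToUpdate FROM dbo.BigTable_Temp;

-- 4. Drop the temporary table
DROP TABLE dbo.BigTable_Temp;
```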
My co-worker took my suggestion and eventually "updated" the column to all 0 (zero) values.
But it puzzled me how she got the timeout error. "Updating" a big table might take a long time, but where did the timeout come from?
Then it dawned on me that she was not "updating" the table using T-SQL, but rather modifying the big table by using the table designer in SQL Server Management Studio.
Then I found this Microsoft article.
Although I prefer coding everything in T-SQL, there are programmers who use SQL graphical tools instead. Information from this article can be helpful.
Here is a summary from the article.
When you try to modify a large table by using the table designer in Microsoft SQL Server Management Studio, you may receive an error message that is similar to the following:
Unable to modify table.
Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
This behavior occurs because of the transaction time-out setting for the table designer and for the database designer in SQL Server Management Studio. By default, this setting is 30 seconds.
From Tools > Options > Designers > Table and Database Designers, you can change the 'Transaction time-out after' setting to a higher number.
Notice that this setting differs from the setting in the Execution time-out box in SQL Server Management Studio. By default, the setting in the Execution time-out box for Query Editor in SQL Server Management Studio is zero. Therefore, Query Editor waits infinitely for the query to finish and never times out.
To resolve this by-design behavior, use one of the following methods:
- Click to clear the Override connection string time-out value for table designer updates check box for the table designer and for the database designer in SQL Server Management Studio. See the option in the first screenshot.
- Specify a high setting in the Transaction time-out after box for the table designer and for the database designer in SQL Server Management Studio. Also see the option in the first screenshot.
- Modify the large table by using Transact-SQL statements in Query Editor in SQL Server Management Studio. This is my recommendation to my co-worker. Make sure you don’t use the UPDATE statement directly on the large table, use a temporary table and take advantage of the bulk load optimizations in T-SQL with the INSERT … SELECT method or the SELECT … INTO method.
Now I wonder how my co-worker reacted when I sent her the T-SQL script to “update” the table to resolve the time-out issue.
I don’t have a definition of what is considered a large SQL table. I am pretty sure, however, that a table with 20+ GB data plus another 20+ GB for a few indices is not a small table anymore.
Although this article, The Data Loading Performance Guide, was written for SQL Server 2008, it contains a lot of good information for data warehouse developers designing efficient ETL processes.
Scenarios in the Guide involve partitioned tables and nonpartitioned tables, loading data from text files and also from inside the SQL Server Database Engine.
In a data warehouse environment, it's not uncommon to design a nightly (or weekly) ETL process to refresh data going back a longer period of time. During this nightly process, a large amount of data first needs to be deleted from the target fact table, and then the target fact table is reloaded with the data from the source.
Our tables are nonpartitioned tables. Here are four common deletion/update scenarios I can think of for nonpartitioned tables. I’ve used the methods below in both automated ETL processes and in ad-hoc queries.
1. Deleting All Rows from a Nonpartitioned Table – Use TRUNCATE
The fastest way to remove all data from a table is to simply execute a truncate statement.
TRUNCATE TABLE factMainTable;
The advantage of TRUNCATE TABLE over DELETE with no WHERE clause is well explained in the SQL Books Online.
- Less transaction log space is used. TRUNCATE TABLE removes all rows from a table without logging the individual row deletions. The DELETE statement, on the other hand, removes rows one at a time and the deletion is fully logged for each deleted row.
- Fewer locks are typically used. When the DELETE statement is executed using a row lock, each row in the table is locked for deletion. TRUNCATE TABLE always locks the table and page but not each row.
- Without exception, zero pages are left in the table. After a DELETE statement is executed, the table can still contain empty pages.
2. Deleting a Large Amount of Rows in a Nonpartitioned Table – Switch with a Temp Table
In this scenario, I'll assume that the rows that need to be deleted outnumber the rows you want to keep in the table. The idea is to use a temporary table to hold the rows we want to keep, and then switch the temporary table back to the main table. Because the rows we save to the temporary table are far fewer than the ones we delete, the logging is kept to a minimum compared to executing the DELETE command directly on the main table. In addition, the bulk-insert method can greatly reduce the overall execution time because of the optimization.
Here are the general steps:
1) Create a copy of the main table with the data you want to keep.
You need to use one of the bulk insert methods.
To perform the bulk insert, you can use one of the following three methods.
The INSERT … SELECT method:
INSERT INTO factMainTable_Temp WITH (TABLOCK)
SELECT * FROM factMainTable
WHERE OrderDate >= '20130701'
AND OrderDate < '20130901'
AND <other Keep Criteria>;
The SELECT … INTO method:
SELECT * INTO factMainTable_Temp
FROM factMainTable
WHERE OrderDate >= '20130701'
AND OrderDate < '20130901'
AND <other Keep Criteria>;
As pointed out in the Guide, another way to perform the bulk insert is to use Integration Services to run many concurrent streams into the factMainTable_Temp table. If you need to automate the DELETE in SSIS packages, then Integration Services is the way to go.
2) Truncate the old data to remove all rows from the main table
BEGIN TRAN; -- if you want to keep the operation transactional
TRUNCATE TABLE factMainTable;
3) Bulk Insert the Data from the Temp Table to the Main Table
INSERT INTO factMainTable WITH (TABLOCK)
SELECT * FROM factMainTable_Temp;
4) Drop the temporary table
DROP TABLE factMainTable_Temp;
COMMIT TRAN; -- if you used a transaction, commit it now
Optionally, in step 2) you can DROP the main table:
DROP TABLE factMainTable;
Don’t forget to script out all the constraints and indexes and keys on the main table before it is dropped.
Then in step 3), you can rename the temp table as the main table:
EXECUTE sp_rename 'TheSchemaName.factMainTable_Temp', 'factMainTable';
Finally, re-create all constraints and indexes and keys on the main table.
3. Updating a Nonpartitioned Table – Switch with a Temp Table
When updating a fairly good size table, you might also want to try the steps in the scenario #2 – Switch with a Temp Table.
Again the idea is to avoid using the UPDATE command directly on the main table, and to take advantage of the bulk load optimizations in T-SQL with the INSERT … SELECT method or the SELECT … INTO method.
The only difference in this UPDATE situation is that the first step, where the temporary table is created, needs to take care of the new values for the columns you want to update.
Typically, the new values come from a third table. In this case, a JOIN will be needed.
INSERT INTO factMainTable_Temp WITH (TABLOCK)
SELECT main.cols, <new values>, main.more_cols
FROM factMainTable main JOIN TheThirdTable third_tbl
ON main.join_col = third_tbl.join_col
WHERE main.OrderDate >= '20130701'
AND main.OrderDate < '20130901'
AND <other Keep Criteria>;
4. Deleting 10%+ (but less than 50%) Rows from a Nonpartitioned Table – DELETE in Chunks
In this scenario, I'll assume that the data you want to delete is less than the data you want to keep in the table. This is the opposite of scenario #2. When deleting more than 10% of the rows from a table with 20+ GB of data, I've noticed that the deletion starts to take much longer to complete.
Since the rows I want to keep far outnumber the rows I need to delete, I doubt the "Switch with a Temp Table" method would be worth it.
So I tried the DELETE in Chunks method in T-SQL. The basic idea is to keep each delete transaction small, so the transaction log does not grow out of control.
Check out this section “Using TOP to limit the number of rows deleted” in the link below.
When a TOP (n) clause is used with DELETE, the delete operation is performed on an undefined selection of n number of rows. That is, the DELETE statement chooses any (n) number of rows that meet the criteria defined in the WHERE clause.
In my example, 200,000 rows at a time are deleted from the factMainTable for order dates between the START_DATE and the END_DATE. I do not use the ORDER BY clause here, not only because ORDER BY would slow down the DELETE, but also because it is not needed. What breaks the WHILE loop is the system variable @@ROWCOUNT. When no rows remain in the date range, @@ROWCOUNT will be zero and the WHILE loop will end.
DECLARE @i int = 1;
WHILE @i > 0
BEGIN
    DELETE TOP (200000) FROM factMainTable
    WHERE OrderDate >= @START_DATE
    AND OrderDate <= @END_DATE;
    SET @i = @@ROWCOUNT;
END;
Conclusion – Bulk Load Optimizations
The Guide has a good summary on the Bulk Load Methods in SQL Server.
To provide fast data insert operations, SQL Server ships with several standard bulk load methods.
- Integration Services Data Destinations – The method for performing bulk load from the Integration Services ETL tool.
- BCP – The command line utility for performing bulk load from text files.
- BULK INSERT – The method for performing bulk load in Transact-SQL. The term “BULK INSERT” refers to the specific Transact-SQL based bulk load method.
- INSERT … SELECT – The method for performing bulk load in process with SQL Server from local queries or any OLE DB source. This method is only available as a minimally logged operation in SQL Server 2008.
- SELECT INTO – The method for creating a new table containing the results of a query; it utilizes bulk load optimizations.
- Similar bulk load techniques are supplied by programming interfaces to SQL Server, including the SQLBulkCopy class in ADO.NET, IRowsetFastload in OLE DB, and the SQL Server Native Client ODBC library.
As an aside, I have not been able to find an updated version of the Guide for SQL Server 2012.
In a previous blog, SQL #47 – Too many in-line comments totally destroy code readability, I made a point that all SQL developers need to have a consistent writing style to achieve high level of code readability.
Being productive in SQL Management Studio will be your first step to achieve this goal.
In this blog, I’ll show you one trick I use every day in SQL Management Studio. This is one of the many keyboard shortcuts I use to be productive and to achieve the goal of high level of code readability.
I use a lot of tabs/indents in every query I write. Hitting the Tab key twenty times for twenty columns can be very tiring (and boring).
Let’s say you want to type a simple SELECT statement with a style as the following. There are two tab spaces after each comma.
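Something like this (illustrative column and table names; each comma is followed by two tab stops, shown here as spaces):

```sql
SELECT  YearID
    ,       QuarterNumber
    ,       MonthNumber
    ,       MonthName
FROM dbo.DimDate;
```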
You can certainly insert the comma and two Tab spaces for each column, on each line.
I’ll show you a more fun way to do that, and a more productive way.
Block Selection with Keyboard Shortcut Alt + Shift
Yes, block selection is more fun.
Start by inserting the comma and two Tab spaces on the first three lines.
Put your cursor at the beginning of the YearID line, before the comma. Then hold down Alt + Shift and press the right and down arrow keys until you have a block selection as shown below.
Block Copy with Keyboard Shortcut Ctrl + C
Let go of the Alt and Shift keys; the block selection will remain. Then press Ctrl + C to copy the block selection.
Now you are ready to insert the block selection anywhere you want.
Block Insert with Keyboard Shortcut Ctrl + V
The keyboard shortcut pair Ctrl + C and Ctrl + V is the only thing I can remember from my early days of programming in C.
Now place your cursor before the QuarterNumber column and press Ctrl + V. You will witness the magic.
Practice the above three Keyboard Shortcuts a couple of times
You are going to be a pro!
I have been wanting to blog about this for a while. SQL code readability is not a topic SQL developers like to talk about. In my 10+ years of working, I have only discussed this topic with two co-workers. Both of them told me that I write very "readable" SQL code. One of them said he would not use coding style to judge a SQL developer. The other told me that he actually installed SQL code beautifying software to make other developers' SQL code readable before he even attempted to modify it.
Commenting is always welcome in SQL code. Single-line comments, multi-line comments, and in-line comments are always good to see when you work on other developers' SQL code — until you realize that not only do the comments themselves have no readability, but the messy comments also totally destroy the readability of the code.
I recently find myself spending time not only beautifying the SQL code, but also beautifying the comments, especially the in-line comments.
Multi-line comments in the header portion of the procedure
It's my preference to have multi-line comments in the header portion of the procedure (the beginning of the procedure).
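A sketch of the style (the procedure name and details are illustrative):

```sql
/*********************************************************************
  Procedure : dbo.uspLoadFactSales
  Purpose   : Reloads the sales fact table for a given date range.
  History   : 2014-07-01  Initial version
*********************************************************************/
```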
I also place multi-line comments in my SSIS packages by adding them as Annotations. The annotation is certainly not a sophisticated text editor, so make sure you type up your multi-line comments in a SQL/text/Word editor, format them with appropriate indents, and then paste them into the Annotation in the SSIS package.
Inside the procedure body, I do not use in-line comments
Inside the procedure body, I absolutely do not use in-line comments. Instead, I logically divide my code into sections and paragraphs, similar to sections and paragraphs in a book, where a paragraph performs only one task and a section is a collection of small tasks.
For sections, I use comments like this:
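For example (illustrative):

```sql
/*===================================================================
  Section 1: Stage the source data
===================================================================*/
```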
For paragraphs, I only use single line comments.
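For example (illustrative names; @START_DATE is assumed to be declared earlier):

```sql
-- Remove staged rows that fall outside the reporting date range
DELETE FROM #Staging
WHERE OrderDate < @START_DATE;
```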
One more note before I close this post. When you design your SSIS package and run into problems with your SQL query in an OLE DB Source in a data flow, check whether you have in-line comments embedded in the query. Removing them might solve your problem.
Let’s work together to make our SQL code readable and beautiful with a consistent style!