Home > SQL Challenge > SQL #57–A one-liner query with PARTITION BY

SQL #57–A one-liner query with PARTITION BY

What can be done in SAS must be possible in SQL

A friend of mine is a data professional in marketing data analysis with expertise in SAS and pretty good working knowledge of SQL. When he sent me questions about “how to do in SQL…”, I often sent back a query with either a sub query, or a JOIN, or a CTE, or with all of them.

My friend has been always happy to take my suggestions, until one day he sent me a code snippet and insisted that what can be done in SAS must be possible in SQL too.

Getting the MAX row (by date) in each ID

In the following example, each ID can have multiple rows. What he wants to do is to get the MAX row (by date) in each ID, and then get the status. He also wanted to get a row count for each ID.

The 3 rows highlighted in red should be the output.

 

id date status
101 1/3/2013 1
101 4/6/2012 3
101 11/23/2014 2
108 6/16/2007 3
108 1/24/2003 1
109 5/21/2014 1

A simple aggregate query with a GROUP BY will not work

A simple aggregate query with a GROUP BY, such as the one below, is not going to work, because each aggregate function MAX are independent of each other.

SELECT EventID ,      MAX(EventDate) AS max_EventDate ,      MAX(EventStatus) AS max_EventStatus ,      COUNT(*) AS row_count FROM   #test_max_row GROUP BY         EventID

SAS has solved the classic MAX VALUE vs. MAX ROW problem

This is a classic MAX VALUE vs. MAX ROW problem in SQL.

It turned out that SAS has solved this classic problem long time ago, with no sub query whatsoever.

Here is the code in SAS my friend sent to me.

data test;

input id date mmddyy10. status;

cards;

101 01/03/2013 1

101 04/06/2012 3

101 11/23/2014 2

108 06/16/2007 3

108 01/24/2003 1

109 05/21/2014 1

;

run;

* In SAS;

proc sql;

create table results as

select id, date format=mmddyy10., status, count(id) as cnt_id

from test

group by id

having date=max(date);

quit;

The trick in the above proc SQL with SAS-flavor is in the in the HAVING clause, date=max(date).

In standard SQL, the HAVING clause cannot have any column by itself and any columns in the HAVING clause must be companied by an aggregate function. In another word, the following is illegal in SQL:

having date=max(date)

I have programmed in SAS before for 2 years and have some basic understanding of proc SQL in SAS. So it’s not a surprise to me that SAS has its own implementation of SQL.

It’s easy in SQL too but…

To find the max row in SQL is not hard either, but most would require a JOIN or sub query. Most popular ways are:

  • Use Correlated query
  • Find the MAX(date) with GROUP BY, then JOIN to the original table
  • Use a sub query that uses ROW_NUMBER() OVER with PARTITION BY and ORDER BY.

ROW_NUMBER() OVER can be used to find the MAX row, and also to de-duplicate data

My favorite is the last one that uses the ROW_NUMBER() OVER. It’s very flexible, and can be used to not only find the MAX row, but also to de-duplicate data.

But none of the above is a one-liner query, meaning with single SELECT.

A one-liner query is possible in SQL

I agree with my friend that a one-liner query is possible in SQL. In the end, this is what I came out with.

Here is my first try.

pic1

 

 

 

 

 

 

 

I used the OVER() clause three times. This OVER() with PARTITION BY and ORDER BY sorted the rows by date in descending order in each ID. The I used the FIRST_VALUE() function to get the MAX date row. The first one gets the first value of the date, and the second one gets the first value of the status.

The last OVER() uses only a PARTITION by not the ORDER BY because the count() aggregate function does not care about any particular ordering.

The result has the same number of rows as the original table with each ID being repeated. By adding a DISTINCT word, the result is reduced to one ID per row.

Here is the final query:

pic2

 

 

 

 

Here is the query in text.

SELECT DISTINCT                 EventID ,       first_value(EventDate) OVER(PARTITION BY EventID ORDER BY EventDate DESC) AS max_EventDate ,       first_value(EventStatus) OVER(PARTITION BY EventID ORDER BY EventDate DESC) AS max_row_EventStatus ,       count(EventID) OVER(PARTITION BY EventID) AS row_count_for_the_EventID FROM   #test_max_row

Cannot recommend this

I didn’t have time to experiment with a large table to test the execution plan and query time against the other more conventional ways mentioned above, so I cannot recommend this yet. But I am not very optimistic about this query by just looking at it. The PARTITION BY was repeated three times. If you have more columns in your output, the PARTITION BY will be repeated even more times. SQL Server might have done some optimization in this kind of scenario, but I cannot be sure without further experimenting.

Another special thing about this query is that it didn’t use GROUP BY. PARTITION BY is essentially the same as GROUP BY, but without reducing the results. That’s why I had to use the DISTINCT in the SELECT. DISTINCT itself is also a GROUP BY in disguise. This is another reason I am not very optimistic about this query.

It’s certainly fun to experiment!

  1. libertascastegnato
    January 22, 2015 at 8:10 am

    Why not use a CTE like this?

    WITH EventDetail
    AS
    (
    SELECT EventID, EventDate, EventStatus, ROW_NUMBER() OVER (PARTITION BY EventID ORDER BY EventDate DESC) AS rn
    FROM #test_max_row
    )
    SELECT EventID, EventDate, EventStatus
    FROM EventDetail
    WHERE rn = 1;

    Like

  2. KF
    January 22, 2015 at 1:52 pm

    Can also be accomplished by just using a subquery with row_number partitioning (also side benefit the query processor seems to like the subquery method more in the execution plan).

    select x.EventID, x.EventDate, x.EventStatus
    from (
    select EventID, EventDate, EventStatus,
    row_number() over (partition by EventID order by EventDate desc) as RowRank
    from #test_max_row
    ) x
    where x.RowRank = 1

    Like

  3. ronmoses
    January 22, 2015 at 2:42 pm

    It wouldn’t help with the COUNT(), but for the rest of it you can try this:

    SELECT TOP 1 WITH TIES
    EventID, EventDate AS max_EventDate, EventStatus AS max_row_EventStatus
    FROM #test_max_row
    ORDER BY ROW_NUMBER() OVER(PARTITION BY EventID ORDER BY EventDate DESC)

    While it returns the same results as a subquery/join approach, unfortunately the execution plan appears to be the same. Still, it’s a one-liner!

    Like

  4. Tom
    January 22, 2015 at 7:55 pm

    Hi Sherry, nice blog entry. You could do something like the following. It is not exactly a one liner but is perhaps more flexible if there are more columns.

    create table TestEvent (EventID int, EventDate datetime, EventStatus int)
    go
    insert into TestEvent values(101, ’01/03/2013′, 1)
    insert into TestEvent values(101, ’04/06/2012′, 3)
    insert into TestEvent values(101, ’11/23/2014′, 2)
    insert into TestEvent values(108, ’06/16/2007′, 3)
    insert into TestEvent values(108, ’01/24/2003′, 1)
    insert into TestEvent values(109, ’05/21/2014′, 1)
    go

    select * from TestEvent
    go

    with anotherway as (select row_number() over (partition by eventid order by eventdate desc) as row, *, row_number() over (partition by eventid order by eventdate asc) as cnt from testevent)
    select EventID, EventDate, EventStatus, cnt from anotherway where row = 1

    Like

  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: