When ROW_NUMBER() OVER (PARTIION BY … ORDER BY …) function is not appropriate
ROW_NUMBER() function has been available since SQL Server 2005. There are many good articles out there about this function. There are many bad ones out there too. (So be careful about what you reading. Always create test cases for your data to make sure you get the results you expect.)
ROW_NUMBER() function can be used with or without the PARTITION BY clause, but it must have the ORDER BY clause. It’s the usage with the PARTITION BY clause that interests me the most. I use ROW_NUMBER() with the PARTITION BY clause in cleansing the raw data with unwanted duplicates.
Here is a simple example. In this example, the device is the same device, but with two different inventory date and database id. We need to clean this kind data by removing the one that has older inventory date or smaller database id.
Here is what you can do to achieve that:
from ( select (other columns)
, ranked = row_number() over
( partition by DeviceName
order by LastInventoryDate desc, DatabaseID desc )
from myRawTable ) a
where ranked = 1
With the above logic, the first record will be removed. This logic will be applied to every device.
Recently, while working on a SQL Reporting Services report, I realized that I need to use the RANK() function instead to meet the reporting requirement.
The requirement goes like this:
1) Data will come from 4 different sources. Using each asset’s operating system to determine which source file to use since these 4 sources files overlap assets.
2) However, within each source file, we want ALL the hits, not just the unique hits.
These two requirements cry out for the RANK() function with the PARTITION BY clause.
With the RANK() function, data from the different source file will get a different rank number, but will get the same rank number if they are from the same source file, which will cause them to be removed all together or be kept all together.