Finding Asymmetrical Duplicates with Transact-SQL

Finding duplicates over all involved columns with Transact-SQL is done with the query

select column1 (, column2...)
from table
group by column1 (, column2...)
having count(*) > 1

This technique is especially important for finding duplicates before inserting into tables with primary keys made up of multiple columns.

But what if the duplicate is only in one column, call it half- or semi-duplicate, incomplete duplicate or asymmetrical duplicate or whatever. For example, I tried to come up with a query that finds hotels with identical names at different destinations. This is what I came up with:

SELECT DISTINCT HotelName FROM
(SELECT DISTINCT HotelName, Destination FROM HotelTable) tmp
GROUP BY HotelName
HAVING COUNT(*) > 1

This query only works with ONE column that's different, but you can expand the selection of columns that you expect to be identical (duplicates).

So the abstracted query is:

SELECT DISTINCT expectedDuplicateColumn (, expectedDuplicateColumn2 ...) FROM
(SELECT DISTINCT expectedDuplicateColumn (,expectedDuplicateColumn2 ...), nonDuplicateColumn FROM MyTable) tmp
GROUP BY expectedDuplicateColumn, (, expectedDuplicateColumn2 ...)
HAVING COUNT(*) > 1

With just one expected duplicate column you can get (of course) the entire rows like this:

SELECT * FROM HotelTable WHERE HotelName IN

(

SELECT DISTINCT HotelName FROM
(SELECT DISTINCT HotelName, Destination FROM HotelTable) tmp
GROUP BY HotelName
HAVING COUNT(*) > 1

)

With multiple identical columns you can use nested subqueries like so:

SELECT * FROM HotelTable WHERE HotelName IN
(
SELECT HotelName FROM
(
SELECT DISTINCT HotelName, HotelChain FROM
(SELECT DISTINCT HotelName, HotelChain, Destination FROM HotelTable) tmp
GROUP BY HotelName, HotelChain
HAVING COUNT(*) > 1
)tmp2
)

This latter approach avoids the use of EXISTS, which can get really messy. I'm sure SQL experts will find a more elegant solution but this might work for you.

Posted: 17 October 2007

comments powered by Disqus