Dirk Bergmann

Developer, panographer. Does consultancy work from own company. Lives in Chiang Mai, Thailand.

Twitter

Chiang Mai

Hua Hin

Thailand Hotels

ListVote - List. Vote. Learn.

Partnerseek - Find Partners

Finding Asymmetrical Duplicates with Transact-SQL

Finding duplicates over all involved columns with Transact-SQL is done with the query

select column1 (, column2...)
from table
group by column1 (, column2...)
having count(*) > 1

This technique is especially important for finding duplicates before inserting into tables with primary keys made up of multiple columns.

But what if the duplicate is only in one column, call it half- or semi-duplicate, incomplete duplicate or asymmetrical duplicate or whatever. For example, I tried to come up with a query that finds hotels with identical names at different destinations. This is what I came up with:

SELECT  DISTINCT HotelName FROM
(SELECT DISTINCT HotelName, Destination FROM HotelTable) tmp
GROUP BY HotelName
HAVING COUNT(*) > 1

This query only works with ONE column that's different, but you can expand the selection of columns that you expect to be identical (duplicates).

So the abstracted query is:

SELECT  DISTINCT expectedDuplicateColumn (, expectedDuplicateColumn2 ...) FROM
(SELECT DISTINCT expectedDuplicateColumn (,expectedDuplicateColumn2 ...), nonDuplicateColumn FROM MyTable) tmp
GROUP BY expectedDuplicateColumn, (, expectedDuplicateColumn2 ...)
HAVING COUNT(*) > 1

With just one expected duplicate column you can get (of course) the entire rows like this: 

SELECT * FROM HotelTable WHERE HotelName  IN

(

SELECT  DISTINCT HotelName FROM
(SELECT DISTINCT HotelName, Destination FROM HotelTable) tmp
GROUP BY HotelName
HAVING COUNT(*) > 1

)

With multiple identical columns you can use nested subqueries like so:

SELECT * FROM HotelTable WHERE HotelName IN
(
SELECT HotelName FROM
(
SELECT  DISTINCT HotelName,  HotelChain FROM
(SELECT DISTINCT HotelName,  HotelChain, Destination FROM HotelTable) tmp
GROUP BY HotelName, HotelChain
HAVING COUNT(*) > 1
)tmp2
)

This latter approach avoids the use of EXISTS, which can get really messy. I'm sure SQL experts will find a more elegant solution but this might work for you.