Wednesday, October 8, 2008

Power of SQL: Find and Delete Duplicates

Most of us have run into (or eventually will run into) duplicate data that needs to be removed from a system. I hit this while moving data from a flat file / legacy system into SQL Server, and this is the solution I came up with after the data was loaded. If you have a more elegant way of dealing with it, feel free to comment. Running this entire process against ~200,000 records took less than 15 seconds on my particular database. Creating the index is helpful if you are going to run several queries against the data while working out the statement.

First, the background:
I have a table of notes. Zero or more notes are related to each foreign key. If there is more than one identical note for the same foreign key, the extra copies need to be removed.

Comparing one text field to another is difficult; at up to 2GB per value, it can be a daunting task. In my situation, most of the notes were under 8000 characters long and only a small number were longer. For the longer ones, it was safe to assume that if the first 8000 characters matched, the rest did too. You may choose to add additional constraints for your situation.
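If you want to see how many of your notes fall outside that 8000 character assumption, a quick check like this works (just a sketch; DATALENGTH returns the size in bytes, so double the threshold if your column is ntext):

-- Count notes whose text is longer than 8000 bytes
SELECT COUNT(*) AS long_notes
FROM notes
WHERE DATALENGTH(notetext) > 8000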

Basic Table layout:
noteId (identity field)
foreignId (integer / foreign key to other data)
noteText (text field)
... other unimportant data
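
For the examples below, picture something along these lines (a minimal sketch of the layout above - the exact types and constraints in your table will differ):

CREATE TABLE notes (
    noteId    int IDENTITY(1,1) PRIMARY KEY,   -- identity field
    foreignId int NOT NULL,                    -- foreign key to other data
    noteText  text NULL                        -- the note itself, up to 2GB
    -- ... other unimportant columns
)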

ALTER Statements Before Starting: These add a checksum and a length based on the note text. FYI - I did find duplicate checksums with different lengths. If you need to be more precise, I suggest using BINARY_CHECKSUM, which is more sensitive, but understand that a checksum CAN have a collision; that is, different data can produce the same checksum.
ALTER TABLE notes ADD len_notes AS LEN(CAST(notetext AS varchar(8000)))
ALTER TABLE notes ADD cs_notes AS CHECKSUM(CAST(notetext AS varchar(8000)))

CREATE INDEX temp_index ON notes (foreignId, cs_notes, len_notes)
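
If you are curious about the difference in sensitivity, here is a quick illustration (my own example, not part of the process): on a typical case-insensitive collation CHECKSUM treats these strings as equal, while BINARY_CHECKSUM looks at the underlying bytes and does not.

-- Same value from CHECKSUM on a case-insensitive collation...
SELECT CHECKSUM('duplicate note') AS cs_lower, CHECKSUM('DUPLICATE NOTE') AS cs_upper
-- ...different values from BINARY_CHECKSUM
SELECT BINARY_CHECKSUM('duplicate note') AS bcs_lower, BINARY_CHECKSUM('DUPLICATE NOTE') AS bcs_upper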

Now we have added a column that represents the length of each note and one that represents its checksum. Because these are non-persisted computed columns, SQL Server evaluates them when they are queried, so new and updated records stay in sync with the note text. If you instead decide to populate regular columns with a one-time UPDATE, keep in mind that new or changed rows will not be recalculated for you; in that case you may want to add criteria so you only touch rows where the checksum is still NULL or the stored length no longer matches the note.
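If you would rather go that regular-column route, a sketch of the alternative looks like this (the WHERE criteria are my suggestion for limiting the update to new or stale rows):

-- Alternative: plain columns populated by hand instead of computed columns
ALTER TABLE notes ADD len_notes int NULL
ALTER TABLE notes ADD cs_notes int NULL
GO

-- Only touch rows that are new or whose stored length is out of date
UPDATE notes
SET len_notes = LEN(CAST(notetext AS varchar(8000))),
    cs_notes  = CHECKSUM(CAST(notetext AS varchar(8000)))
WHERE cs_notes IS NULL
   OR len_notes <> LEN(CAST(notetext AS varchar(8000)))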

Select and Review part of your data that will be deleted:

SELECT TOP 100 n2.noteId, n2.cs_notes, n2.len_notes, n2.notetext
FROM notes n2
WHERE noteId IN (
    -- Every noteId that has a matching note (same foreignId, checksum,
    -- and length) with a lower noteId - i.e. the copies we will delete
    SELECT n2.noteId
    FROM notes n1
    JOIN notes n2
        ON (n1.foreignId = n2.foreignId
        AND n1.cs_notes = n2.cs_notes
        AND n1.len_notes = n2.len_notes)
    GROUP BY n2.noteId
    HAVING n2.noteId > MIN(n1.noteId)
) ORDER BY n2.noteId
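
As mentioned earlier, a checksum can collide. If that worries you, one additional constraint you could add (my own sketch, not something I needed for my data) is to compare the first 8000 characters of the text directly inside the join:

SELECT n2.noteId
FROM notes n1
JOIN notes n2
    ON (n1.foreignId = n2.foreignId
    AND n1.cs_notes = n2.cs_notes
    AND n1.len_notes = n2.len_notes
    -- extra guard: the text itself (first 8000 chars) must also match
    AND CAST(n1.notetext AS varchar(8000)) = CAST(n2.notetext AS varchar(8000)))
GROUP BY n2.noteId
HAVING n2.noteId > MIN(n1.noteId)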
Delete your duplicates: To be clear here, this will preserve the lowest noteId and delete those with higher ids where the checksum AND length AND foreignId match.
DELETE FROM notes WHERE noteId IN
(
    SELECT n2.noteId
    FROM notes n1
    JOIN notes n2
        ON (n1.foreignId = n2.foreignId
        AND n1.cs_notes = n2.cs_notes
        AND n1.len_notes = n2.len_notes)
    GROUP BY n2.noteId
    HAVING n2.noteId > MIN(n1.noteId)
)
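If you want a safety net, you can wrap the delete in a transaction and check the row count against what your review query showed before committing (a sketch of that approach):

BEGIN TRANSACTION

DELETE FROM notes WHERE noteId IN
(
    SELECT n2.noteId
    FROM notes n1
    JOIN notes n2
        ON (n1.foreignId = n2.foreignId
        AND n1.cs_notes = n2.cs_notes
        AND n1.len_notes = n2.len_notes)
    GROUP BY n2.noteId
    HAVING n2.noteId > MIN(n1.noteId)
)

-- How many rows did we just remove?
SELECT @@ROWCOUNT AS rows_deleted

-- COMMIT TRANSACTION    -- run this once the count looks right
-- ROLLBACK TRANSACTION  -- or this one if it does not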
Put the table back the way it was:
DROP INDEX notes.temp_index
ALTER TABLE notes DROP COLUMN len_notes
ALTER TABLE notes DROP COLUMN cs_notes
I hope this helped.
