First, the background:
I have a table of notes. Those zero or more of those notes are related to a foreign key. If there is more than one note that is identical, for the same foreign key, it needs to be removed.
It is extremely difficult to compare a text field to another text field. With up to 2GB in length, it can be a daunting task. For my situation, most of the notes were less than a length of 8000, only a small number were longer. If they were longer, we could safely assume if they were not different for the first 8000 chars, they wouldn't be. You may choose to add additional constraints for your situation.
Basic Table layout:
noteId (identity field)ALTER Statements Before Starting: This adds a checksum and length based on the note. FYI - I did find duplicate checksums with different lengths. If you need to be precise, I suggest using the binary checksum function which is more sensitive, but understand, a checksum CAN have a collision. That means you can have the same checksum for different data.
foreignId (integer / foreign key to other data)
noteText (text field)
... other unimportant data
ALTER TABLE notes ADD len_notes AS len(CAST(notetext AS Varchar(8000)))Now we have just added a column that represents the length of a note and the checksum of a note. If your data is changing, new records will contain NULL values and updated records will NOT recalculate the checksum, so keep that in mind. You may want to put criteria on your select statement that will only update the non-null values if your data changes, or the data that has a length <>
ALTER TABLE notes ADD cs_notes AS checksum(CAST(notetext AS Varchar(8000)))
CREATE INDEX temp_index ON notes (foreignId,cs_notes,len_notes)
Select and Review part of your data that will be deleted:
SELECT top 100 n2.noteId, n2.cs_notes, n2.len_notes, n2.notetextDelete your duplicates: To be clear here, this will preserve the lowest noteId and delete those with higher ids where the checksum AND length AND foreignId match.
FROM notes n2
WHERE noteId in (
SELECT n2.noteId
FROM notes n1
JOIN notes n2
ON (n1.foreignId = n2.foreignId
AND n1.cs_notes = n2.cs_notes
AND n1.len_notes = n2.len_notes)
GROUP BY n2.noteId
HAVING n2.noteId > min(n1.noteId)
) ORDER BY n2.noteId
DELETE from notes where note_id inPut the table back the way it was:
(
SELECT n2.noteId
FROM notes n1
JOIN notes n2
ON (n1.foreignId = n2.foreignId
AND n1.cs_notes = n2.cs_notes
AND n1.len_notes = n2.len_notes)
GROUP BY n2.noteId
HAVING n2.noteId > min(n1.noteId)
)
DROP INDEX notes.temp_indexI hope this helped.
ALTER TABLE notes DROP column [len_notes]
ALTER TABLE notes DROP column [cs_notes]
No comments:
Post a Comment