Removing Duplicate Messages in Apple's OS X Mail

I'm still running Snow Leopard (a.k.a. OS X 10.6.8) and using Apple's Mail version 4.6. I also tend to keep everything, so Mail is bloated with thousands of e-mails dating back to (believe it or not) October 14, 1990; checking, the total is actually just shy of 100,000 messages. Over the years, Mail has created duplicate messages now and then. I did a bit of research and found some tools for removing them. For instance, Bohemian Boomer's article on Remove Duplicate Messages refers to the so-named AppleScript that does what it says. While I gather it would eventually work, scripting Mail with AppleScript is not exactly fast: it was processing only thousands of messages per hour, and it is not the most robust architecture for such a huge project, so I stopped it.

Nick Shubin's article on Finding Duplicates in the Mail Database talks about how each message is stored in the ~/Library/Mail/Mailboxes tree in its own *.emlx file. He used a program called Find Duplicate Files by Araxis which would probably work fine.

My technique was similar, although I didn't clean up the attachment files at all. I had written a Python script that would scan files and find duplicates. The original purpose was to replace the duplicates with UNIX "hard links," where two or more files point to the same data on disk (as opposed to a "soft link," where one directory entry points to another, much like a Finder alias). There's a whole history about how the data on disk is just stored in chains and is referenced by a link to the data's inode, so the UNIX remove (rm) command is also called the less-ominous and more-accurate unlink. (Something like that anyway … it's been decades since I looked at file systems that closely.)
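The script itself isn't published, so what follows is only a minimal sketch of the general idea, not the actual code. It assumes that "duplicate" means byte-for-byte identical content, groups files by size first and then by a SHA-256 digest, and swaps each extra copy for a hard link. The function names (file_digest, find_duplicates, hard_link_duplicates) and the ".tmp-link" suffix are made up for illustration.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return sorted groups of paths under root whose contents are identical."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Only consider regular files; leave symbolic links alone.
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return sorted(groups)


def hard_link_duplicates(groups):
    """Replace every file after the first in each group with a hard link."""
    for keep, *extras in groups:
        for extra in extras:
            if os.path.samefile(keep, extra):
                continue  # already share an inode
            tmp = extra + ".tmp-link"  # hypothetical temporary name
            os.link(keep, tmp)         # make the new link first
            os.replace(tmp, extra)     # then swap it over the duplicate
```

Comparing sizes before hashing keeps the expensive reads to files that could actually match. Hard links only work within a single volume, so this assumes the whole tree lives on one file system.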

Anyway, I added a couple of options so I could delete the duplicate files instead (omitting the ln commands that build the links). I closed Mail and ran it on my tree. In half an hour, it found tens of thousands of duplicates. I had confidence in the program I wrote (having tested it in the past), so I went ahead and removed the duplicates.
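A delete mode along those lines might look roughly like the sketch below. This is an illustration under assumptions, not the script I ran: remove_duplicates and its dry_run flag are hypothetical names, and each group is assumed to be a list of paths already believed to be identical, with the first one kept. As a safety net, it re-checks every candidate byte for byte before unlinking it.

```python
import filecmp
import os


def remove_duplicates(groups, dry_run=True):
    """Delete every file after the first in each group of identical files.

    With dry_run=True nothing is deleted; the would-be deletions are
    only reported. Returns the list of paths removed (or to be removed).
    """
    removed = []
    for keep, *extras in groups:
        for extra in extras:
            if os.path.realpath(keep) == os.path.realpath(extra):
                continue  # same path listed twice; never delete the keeper
            # Full content comparison, not just size and timestamps.
            if not filecmp.cmp(keep, extra, shallow=False):
                continue  # not actually identical, so leave it alone
            if not dry_run:
                os.unlink(extra)
            removed.append(extra)
    return removed
```

Running it once with dry_run=True and skimming the returned list before running it for real is a cheap way to build the kind of confidence mentioned above.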

Back in Mail, I figured it would be wise to select all the affected mailboxes and Mailbox:Rebuild them. Once I was done, I noticed a few duplicates still appeared in Mail.

It turned out—probably because of some odd rule I had added to Mail—that one message would have a color key in its XML extension and the other would not. Specifically, the difference between two duplicates was only:

<key>color</key>
<string>000000</string>

So I used TextWrangler to search all the files and simply remove that combination (figuring it was not harmful, since it was setting some color to black, which was probably the default anyway). Running my script a second time, I found another few thousand duplicate messages and removed them. There may be more, but I haven't found them yet and I'm satisfied.
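The same effect could probably be had without editing any files, by ignoring that key when comparing. The sketch below is a different technique from the TextWrangler search-and-replace I actually used: it strips the color entry from the file's bytes in memory and hashes what is left. It assumes the key and its string are separated only by whitespace, as in the snippet above; strip_color_key and normalized_digest are names invented for this example.

```python
import hashlib
import re

# Matches the black color entry plus any whitespace that follows it.
COLOR_KEY = re.compile(rb"<key>color</key>\s*<string>000000</string>\s*")


def strip_color_key(data):
    """Return the message bytes with the black color entry removed."""
    return COLOR_KEY.sub(b"", data)


def normalized_digest(data):
    """Hash the message contents while ignoring the color entry."""
    return hashlib.sha256(strip_color_key(data)).hexdigest()
```

Two messages that differ only in that entry then produce the same digest, so a duplicate finder keyed on normalized_digest would pair them up on the first pass.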

And yeah, someday I'll put that Python script in open source somewhere or other. For now, it's not adequately commented so I'll hold off on publishing it.