Friday 30th December 2016: 4.16am. Link shared: https://gist.github.com/ned14/f261bfda5e376959ab3588242df0a1ef
Every Christmas I try to get some essential technical infrastructure maintenance done, and most years it turns into something technically tricky which isn't easy to fix and usually involves writing a complex Python program in an area far from my normal expertise. This is one such year. It's good training, and worth writing about here.
For fifteen plus years now, my email client has been good old Win3.1 era Pegasus Mail (http://www.pmail.com/) which now contains maybe 30k archived emails. Pegasus development has been very slow in recent years since its author retired, plus its search facilities are slow and poor. Ideally I'd like to get all my email out and into some other email client, and to do that I need to convert the Pegasus email store into something portable.
However it's not as easy as just running an export tool. Pegasus, like almost every email client not using sqlite, has inexpertly written file system code which has corrupted the store in a multitude of ways over a fifteen year period. One of the big attractions of Pegasus for me at the time of choosing it was that its simple mail store format is easily repairable by hand, so when it really ballsed itself up I could manually rescue it using a text editor. That easily beat everything else on the market back in the 1990s. Indeed, what brought me to the nice and simple Pegasus was Outlook corrupting itself irretrievably and, best of all, silently corrupting the backups I had been making in a way that made them unrestorable, which cost me three years of email.
So the tricky technical problem I'm facing this Christmas is that if you run the official Pegasus Mail store export tool or any of the third party ones I've tried, you get corrupted and quite useless output because the store itself is corrupt, and the conversion tools blindly assume it is not, so garbage in, garbage out. There are some blog posts by people who have solved the same problem using regex find and replace, but my mail store is in considerably worse shape than theirs: it's older, bigger, and also contains Pegasus v3 content mixed with v4, which added Unicode support and a raft of other changes. In short, I can find no quick alternative on the internet that avoids taking the long route with this conversion.
Long time readers may remember that some years ago I had the same problem with the content on nedprod.com, which is also a 1990s artefact. After more than a decade of unexpected power losses and many different tools being used, the HTML had gotten itself into a quite parlous state such that modern browsers were failing to cope with the broken tag soup it had become. I ended up writing a ton of Python which actively rewrote all the HTML to correct the many problems like bit flips, missing sections caused by a failed read-modify-write, bad or incorrect HTML and so on. The Python basically parsed the HTML using a very tolerant parser, then rewrote it using a strict XHTML generator, flagging in a log any parts needing human intervention. Everything was moved into a git repo regularly synced to a mirrored ZFS array, and since that remedial work all my nedprod.com silent data loss problems are gone.
I obviously need something similar for my email, so during the last few nights after everyone’s gone to bed I’ve written a Pegasus mail store converter, linked to by this post, which uses Python 3's excellent email and mailbox modules to loosely parse in the corrupted store of emails from the *.PMM files, take some guesses when facing corrupted email, and write out a fully RFC 2822 compliant mail store, with missing or incorrect headers repaired etc., in the universally compatible Unix mboxo format which is consumable by almost every email client or MTA out there.
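To give a flavour of the output side, here is a minimal sketch of writing raw messages into an mbox file with the standard library. It is not the linked script: the function name `write_mbox` and the idea of passing in a list of raw byte strings are my own assumptions for illustration. Python's `mailbox.mbox` class implements the mboxo variant, so body lines beginning with "From " get escaped to ">From " on the way out.

```python
import email
import mailbox


def write_mbox(path, raw_messages):
    """Append each raw RFC 2822 message (bytes) to the mbox file at `path`.

    `write_mbox` is a hypothetical helper, not a function from the linked script.
    """
    box = mailbox.mbox(path, create=True)
    try:
        box.lock()
        for raw in raw_messages:
            # Going via a Message object lets the generator normalise line
            # endings and apply the mboxo ">From " escaping to body lines.
            box.add(email.message_from_bytes(raw))
    finally:
        box.close()  # flushes pending writes and releases the lock
```

Each message gets a generated "From MAILER-DAEMON ..." separator line because the raw emails carry no Unix envelope line of their own.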
It works by loading each .PMM file in C:\PMAIL\MAIL in turn and writing out a Unix mbox edition. The .PMM files are where Pegasus stores the actual emails. The file format is simple: the first 128 bytes are the mailbox name or null characters. From offset 128 onwards, each email is stored as received, separated by an ASCII 26 (EOF) character. And that’s it (I did mention the file format is very simple).
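The layout just described can be sketched in a few lines of Python. This is an illustration rather than the linked script: `split_pmm` is a hypothetical name, and treating the mailbox name as the text before the first null byte of the header is my assumption.

```python
HEADER_SIZE = 128     # mailbox name padded with nulls, as described above
SEPARATOR = b"\x1a"   # ASCII 26 (EOF)


def split_pmm(data: bytes):
    """Return (mailbox_name, [raw_message_bytes, ...]) for one .PMM file's bytes."""
    header, body = data[:HEADER_SIZE], data[HEADER_SIZE:]
    # Assumption: the name is whatever precedes the first null in the header.
    name = header.split(b"\x00", 1)[0].decode("latin-1").strip()
    # Naive split: every ASCII 26 is taken to be a message boundary.
    messages = [m for m in body.split(SEPARATOR) if m.strip()]
    return name, messages
```

Usage would be something like `split_pmm(open(pmm_path, "rb").read())`, with `pmm_path` pointing at one of the .PMM files.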
Well, you might think that’s it, but there are obvious big problems with such a storage format when used in the real world. Firstly, Pegasus happily stores emails as they are received whether they are valid or not, so if those emails are not 7-bit clean it doesn’t care. This made introducing UTF-8 message support for Pegasus 4.0 easy, but it also resulted in a mail store of mixed Latin1 and UTF-8 and other invalid or illegal email. Far worse, if the email contains ASCII 26 characters in its body, Pegasus just goes ahead and writes that, and then proceeds to misparse its own store by treating the email as many emails split at the ASCII 26. Yay. It gets even more yay if you delete one of those fragments: it looks like Pegasus’ parsing logic has almost no error checking or corruption handling.
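Both of these problems lend themselves to simple heuristics. The sketch below is one plausible approach and not necessarily what the linked script does: `rejoin_fragments` glues a fragment back onto the previous message if it does not begin with something shaped like a header line, and `decode_lenient` tries UTF-8 first and falls back to Latin1. Both function names and the header-shape regex are assumptions of mine.

```python
import re

# A fragment "looks like" a new email if its first line is shaped like a header
# field, e.g. "Received: ..." or "From: ...". This is a guess, not a guarantee.
HEADER_LINE = re.compile(rb"^[!-9;-~]+:[ \t]")


def rejoin_fragments(fragments):
    """Merge pieces that were wrongly split at an ASCII 26 inside a body."""
    messages = []
    for frag in fragments:
        if messages and not HEADER_LINE.match(frag.lstrip(b"\r\n")):
            # Probably the tail of the previous email: restore the byte we split on.
            messages[-1] += b"\x1a" + frag
        else:
            messages.append(frag)
    return messages


def decode_lenient(raw: bytes) -> str:
    """Decode as UTF-8 if valid, otherwise as Latin1 (which never fails)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")
```

The heuristic will guess wrong for a body fragment that happens to start with text like "Note: ...", so anything rejoined or split this way is worth logging for a human to check.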
Secondly, Pegasus likes to sometimes store email with CRCRLF line endings instead of the correct CRLF endings. It looks like the author patched the parser when learning of this bug to auto filter out the spurious CR to hide the mistake. More yay.
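Filtering out the spurious CR is a one-line regex. This is a minimal sketch with a hypothetical function name, and it assumes that any run of CRs immediately before an LF should collapse to a single CRLF:

```python
import re


def fix_line_endings(raw: bytes) -> bytes:
    """Collapse CRCRLF (or longer runs of CR before LF) into a plain CRLF."""
    return re.sub(rb"\r+\n", b"\r\n", raw)
```

Bare LF line endings are left alone here; whether to upgrade those to CRLF as well is a separate decision.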
Thirdly, there are quite a few truncated messages in there, introduced by Pegasus trying to heal corrupted data, or perhaps simply by FAT32 or NTFS during unexpected power loss events over the past fifteen years, or by some other cause. The fronts of messages get truncated, as do the ends, and occasionally middle parts are missing. That means you might find a message starting half way through a header sequence, or ending half way through an HTML rich text sequence. That data is lost permanently obviously, but we still need to figure out what can be saved.
The linked script tries to parse each email using strict mode, and if that fails it tries various remedial measures to parse as much of the remaining valid email as it can. It’s not absolutely perfect: perhaps ten or so emails get mangled, but that’s okay for a 30k email store.
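The strict-then-lenient pattern looks roughly like this with Python 3's email module. It is a sketch of the general idea only: the single remedial measure shown (lenient reparse plus placeholder headers) is my own simplification, and the placeholder `From` and `Date` values are made up for illustration.

```python
import email
import email.errors
import email.policy

# Strict: any defect found while parsing raises instead of being recorded.
STRICT = email.policy.default.clone(raise_on_defect=True)
# Lenient: the legacy policy, which records defects and carries on.
LENIENT = email.policy.compat32


def parse_with_fallback(raw: bytes):
    """Return (message, repaired) where repaired is True if strict parsing failed."""
    try:
        return email.message_from_bytes(raw, policy=STRICT), False
    except email.errors.MessageDefect:
        pass
    msg = email.message_from_bytes(raw, policy=LENIENT)
    # RFC 2822 requires From and Date; fill in hypothetical placeholders if lost.
    if "From" not in msg:
        msg["From"] = "unknown@invalid"
    if "Date" not in msg:
        msg["Date"] = "Thu, 01 Jan 1970 00:00:00 +0000"
    return msg, True
```

A message whose front has been chopped off mid-header ends up with everything in the body and only the placeholder headers, which is ugly but at least importable.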
Finally, I have yet to decide on whether I’ll be exporting Unix mboxo or the vastly superior Unix maildir format, the latter of which is pretty solid so long as your filing system is journalled (any since about the year 2000 is). I’m intending to adopt good old Thunderbird as my new email client. It’s almost as old as Pegasus but is actively maintained, and it has only very recently gained maildir storage support. That makes it somewhat experimental code, but maildir format is very hard to screw up as each email lives in its own file, and fsync + atomic renames are used to modify data. That means the worst that can happen is duplicate emails, assuming you have a non-buggy journalled file system. Maildir is a lot slower than mboxo for searching and adding to though, and tens of thousands of individual files also waste disc space. I’m still pondering that decision. At least though I definitely have my Pegasus mail store extracted and intact!
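If I do go the maildir route, the same mailbox module covers it. A minimal sketch, with `export_maildir` being a hypothetical helper name: the standard library's `mailbox.Maildir` writes each message into the `tmp` subdirectory first and then moves it into `new`, which is the write-then-rename pattern mentioned above.

```python
import email
import mailbox


def export_maildir(path, raw_messages):
    """Write each raw message (bytes) as its own file in a maildir at `path`.

    `path` should not already exist, so the tmp/new/cur layout gets created.
    """
    box = mailbox.Maildir(path, create=True)
    try:
        # add() returns the unique key (file name stem) chosen for each message.
        return [box.add(email.message_from_bytes(raw)) for raw in raw_messages]
    finally:
        box.close()
```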