Sunday, June 15, 2008

Spell-checking using spam

I was spell-checking an OCR'd version of The Long Dark Tea Time Of The Soul (Douglas Adams). The problem with OCR errors is that they differ from human mistakes, and the spell checker (hunspell) is not designed for them. For instance: cl instead of d, or l instead of I. The obvious fix would be to write a different suggestion algorithm for hunspell (or your preferred open source spell checker). However, I was not in the mood for that, and something very interesting came out of it.
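For the curious, here is roughly what such an OCR-aware suggestion step might look like. This is only a minimal sketch in Python, not hunspell's actual algorithm: the confusion table holds just the two pairs mentioned above, and the names `OCR_CONFUSIONS`, `ocr_candidates` and `suggest`, as well as the tiny word set, are made up for illustration.

```python
# Hypothetical table of OCR confusions: (what the OCR produced, what was printed).
# Only the two pairs mentioned in the post; a real table would be much longer.
OCR_CONFUSIONS = [("cl", "d"), ("l", "I")]


def ocr_candidates(word, confusions=OCR_CONFUSIONS):
    """Return every word obtained by undoing one confusion at one position."""
    candidates = set()
    for seen, intended in confusions:
        start = word.find(seen)
        while start != -1:
            candidates.add(word[:start] + intended + word[start + len(seen):])
            start = word.find(seen, start + 1)
    return candidates


def suggest(word, dictionary, confusions=OCR_CONFUSIONS):
    """Keep the word if it is known, else list the known candidates."""
    if word in dictionary:
        return [word]
    return sorted(c for c in ocr_candidates(word, confusions) if c in dictionary)


# Illustrative word set standing in for a real dictionary.
words = {"dark", "tea", "It"}
print(suggest("clark", words))
print(suggest("lt", words))
```

Only one substitution is undone at a time, which keeps the candidate list small; words with two OCR errors would need a second pass.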

I thought I would use the power of the web and recover the correct text from quotations of the book scattered around the internet. Although people don't seem to quote The Long Dark Tea Time Of The Soul much, nearly every sentence is available! It seems that some spammer has decided to use this book as filler text for fake websites linking to their products. There are literally thousands of such pages, each containing a tiny fragment of the book. The entire book is available on the internet, but as unordered sentences mixed with links to commercial products!

OK, where it gets really interesting is that they seem to have used a much better OCR algorithm than mine (or they spent a lot of time fixing the errors), so the sentences are near perfect. So you can use this spam as a spell checker. For instance: "And where there is something which is not dealt with properly in your world," the old lady pranled on. "pranled on"? What the heck is this verb? You search Google for that sentence and immediately find out that the correct word is "prattled on". Thank you, spammers!
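Once you have pasted the matching sentence from a search result, the comparison itself can be automated. The following is a rough sketch, assuming the reference sentence has already been copied by hand (it does not query any search engine); `word_corrections` is a hypothetical helper name, and the word-level diff uses Python's standard `difflib` module.

```python
import difflib


def word_corrections(ocr_sentence, reference_sentence):
    """Return (ocr_text, reference_text) pairs where the two sentences differ."""
    ocr_words = ocr_sentence.split()
    ref_words = reference_sentence.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=ref_words, autojunk=False)
    corrections = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" marks a run of OCR words that maps to different reference words.
        if tag == "replace":
            corrections.append(
                (" ".join(ocr_words[i1:i2]), " ".join(ref_words[j1:j2]))
            )
    return corrections


ocr = "the old lady pranled on."
reference = "the old lady prattled on."  # pasted by hand from a search result
for wrong, right in word_corrections(ocr, reference):
    print(wrong, "->", right)
```

Comparing whole words rather than characters keeps the output readable, and each reported pair could also be fed back into a table of OCR confusions.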