An academic evaluation of the Office 2007 contextual spelling checker

 

A few days ago, I discovered an analysis of our Office 2007 contextual speller carried out by Prof. Graeme Hirst, from the University of Toronto:  An Evaluation of the Contextual Spelling Checker of Microsoft Office Word 2007.

We have discussed this new context-sensitive speller on several occasions on this blog (as well as here) and it is nice to see that it is attracting the attention of researchers in the academic world.

It’s an interesting paper and it provides some food for thought, especially with respect to how “aggressive” we should be in our approach to recall.

His conclusion nicely sums up our trade-offs and dilemmas (emphasis mine):

In an evaluation on 1400 examples, it is found to have high precision but low recall — that is, it fails to find most errors, but when it does flag a possible error, it is almost always correct.

The contextual spelling corrector in Microsoft Office Word 2007 is a cautious (low recall) but believable (high precision) system. However, its overall performance, as measured by F, is much poorer than that of the trigram method of Mays et al (1991).

The trade-off between the two systems is a difficult one. In simple terms, better performance is better; but believability is an important attribute for a consumer-level system (“if Word says it’s wrong then it’s wrong”) and could well be considered worth sacrificing performance for. The problem with this, however, is that as users become familiar with the system, their expectations will rise and believability will start to apply also to what Word fails to flag (“If Word says it’s right then it’s right”).

A system that is more visibly error-prone might actually serve users better.
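For readers who may not be familiar with the F measure mentioned in this conclusion: it is the harmonic mean of precision (the share of flags that are correct) and recall (the share of actual errors that get flagged). Below is a minimal sketch of the usual definition; the balanced form is one common convention, and the paper may weight precision and recall differently.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F measure).
    beta > 1 weights recall more heavily, beta < 1 weights precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_measure(0.5, 0.5))  # 0.5 when precision and recall are equal
```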

The methodology used by Prof. Hirst and his colleagues to evaluate the system deserves a few comments:

· They automatically induced real-word errors by replacing words with any spelling variation found in the lexicon of the ispell spelling checker, limiting each manipulation to an edit distance of 1. These errors are therefore artificially generated rather than naturally occurring mistakes (see the sketch after this list).

· They did not consider “malapropisms” (real-word mistakes) involving closed-class words, or words formed by the insertion or deletion of an apostrophe or by splitting a word: this means they exclude pairs which we have found to be extremely frequent in real texts (then/than; your/you’re; its/it’s; everyday/every day; to/too; their/there/they’re…). These pairs feature prominently in any analysis of real mistakes, especially in the literature devoted to English as a Second Language, and many native speakers of English also have a lot of difficulty mastering these confusables, which is why we decided to target them specifically.

· They did not include phonetic confusables such as cymbal/symbol, principle/principal, pear/pair, or there/their, which have an edit distance greater than 1.
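To make the first bullet concrete, here is a minimal sketch (not the authors’ code, and the tiny lexicon is purely a hypothetical stand-in for the ispell word list) of how real-word errors can be induced automatically: generate every string within edit distance 1 of a word and keep only the ones that are themselves valid words.

```python
import string

def edits1(word):
    """All strings at edit distance 1: deletions, transpositions,
    substitutions and insertions (the classic candidate generator)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts) - {word}

def real_word_variants(word, lexicon):
    """Spelling variations of `word` that are themselves valid words,
    i.e. candidates for inducing real-word ('malapropism') errors."""
    return edits1(word) & lexicon

# Toy stand-in for the ispell lexicon (assumption, for illustration only)
lexicon = {"money", "monkey", "there", "three", "from", "form"}
print(real_word_variants("money", lexicon))  # {'monkey'}
```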

The categories they did not include in their tests are precisely those we focused on, because flagging these real and frequent mistakes is very useful for users of Office and Word. Assessing the “performance” of a system while ignoring them may therefore be a bit unfair, at least if one equates “performance” with “usefulness”: will users find the system more useful if we flag “have not lost monkey” (instead of “money”), a rare and unnatural mistake, or if we flag “it is to expensive”, a mistake our data shows is very frequent and which we seem to be good at flagging?

Recall would also be a lot higher if pairs involving closed-class words and the standard phonetic confusables above were taken into account: our own metrics, based on a large corpus of real mistakes, show that our recall is in fact higher than the 20-25% found by Hirst, and is around 40%. The alternative methods he proposes have even higher recall (50%), but their precision (50%) is far lower than our system’s (96%).

Hirst clearly favors a recall-oriented view of performance. His assumption is: do people want to use a system like Microsoft’s, which only spots one mistake out of 5 (our metrics show it is in fact closer to 2 out of 5, i.e. 40%) but is right nearly all the time? Our assumption is: would users really want a system based on the trigram method advocated by Prof. Hirst, which flags 50% of the mistakes but is wrong in 50% of the cases? The feedback we generally get indicates that our users tend to prefer unobtrusive tools and switch off a tool they consider unreliable.
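Plugging the figures quoted above into the balanced F measure makes the trade-off explicit. This is only a back-of-the-envelope illustration with rounded numbers; the paper’s own F calculation may weight things differently.

```python
def f1(precision, recall):
    """Balanced F measure (harmonic mean of precision and recall)."""
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

print(f1(0.96, 0.225))  # Word 2007 speller with Hirst's 20-25% recall estimate: about 0.36
print(f1(0.96, 0.40))   # Word 2007 speller with our ~40% recall estimate:       about 0.56
print(f1(0.50, 0.50))   # trigram method (50% precision, 50% recall):            0.50
```

On the narrower test set the trigram method indeed comes out ahead on F, but once the frequent confusable pairs are counted the gap largely disappears; and F by itself says nothing about how costly false flags are for users.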

Interesting debate, isn’t it? I am really grateful to Prof. Hirst for making this discussion possible.

So, what do you think? We are interested in hearing your opinion. Do you prefer a tool which casts the net as wide as possible and catches many mistakes, at the risk of being frequently wrong and of creating many false flags (false positives), or do you prefer a tool which does not catch all possible mistakes, but which you can trust when it does catch one? Do not hesitate to leave your comments below…

Thierry Fontenelle – Program Manager