Spaces in SmartTag.Terms collection

When you add multi-word strings (strings containing spaces) to the SmartTag.Terms collection, do you expect them to be recognized? If you do, that expectation is false. The simplest code to add a SmartTag to a Word document looks like this:

    private void ThisDocument_Startup(object sender, System.EventArgs e)
    {
        SmartTag st = new SmartTag("https://www.microsoft.sample.com#foo", "Foo term");
        st.Terms.Add("foo");
        st.Terms.Add("One Two");
        Action action = new Action("Do nothing");
        st.Actions = new Action[] { action };
        this.VstoSmartTags.Add(st);
    }

If you run a VSTO solution with the code above and type foo followed by a space, you will notice that foo is "tagged" almost immediately. However, if you type One Two followed by a space, nothing happens. Why is that?

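This happens because the Terms collection should be used only when you need to recognize single tokens. The default implementation of SmartTag matches the members of the Terms collection against the TokenList collection that is passed into the ISmartTagRecognizer2.Recognize2 method. Because the text is broken down into individual words, "One Two" never shows up as a single token, so the term cannot match.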

    TokenList: Tokenized representation of the Text parameter. Strings, punctuation, and white space are broken down into actual words for use by the recognizer. This enables streams of tokens to be passed to the recognizer in addition to raw text.
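To see why a term with a space can never match, here is a minimal sketch of token-by-token matching. It is not the actual VSTO or Word code: the Tokenize helper is a hypothetical stand-in for the host's tokenizer (it simply splits on non-word characters), the class and method names are made up for illustration, and the case-sensitive comparison is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TokenMatchSketch
{
    // Hypothetical stand-in for the host tokenizer: every run of word
    // characters becomes one token; white space and punctuation are dropped.
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        foreach (Match m in Regex.Matches(text, @"\w+"))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }

    // Compares each term against each single token, which is the behavior
    // described above. The ordinal (case-sensitive) comparison is an assumption.
    public static List<string> MatchTerms(IEnumerable<string> terms, string text)
    {
        var termSet = new HashSet<string>(terms, StringComparer.Ordinal);
        var hits = new List<string>();
        foreach (string token in Tokenize(text))
        {
            if (termSet.Contains(token))
            {
                hits.Add(token);
            }
        }
        return hits;
    }
}
```

Calling TokenMatchSketch.MatchTerms(new[] { "foo", "One Two" }, "foo One Two ") returns only "foo" in this sketch: the text arrives as the tokens "foo", "One" and "Two", and none of them equals the term "One Two".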

The Terms collection is not of much use if you want to teach your smart tag a vocabulary of people's names, which usually consist of more than one word. But you can still use the SmartTag.Expressions collection, like this:

    st.Expressions.Add(new Regex(@"\bOne Two\b"));

This approach works pretty well unless you are dealing with very large vocabularies. We use a pretty dumb sequential algorithm: we just iterate over the collection of regular expressions and try to match each one against the text that is passed in. Smart tag recognition happens on a background thread, so the extra load on the processor will not be horribly noticeable, but the experience would still be suboptimal.
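The sequential scan can be pictured roughly as follows. This is only an illustration of the idea, not the VSTO source: the class and method names are hypothetical, and BuildExpressions assumes every phrase starts and ends with a word character so that the \b anchors behave as expected.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class SequentialScanSketch
{
    // Turns each multi-word phrase into its own whole-word regular expression.
    // Regex.Escape keeps characters such as '.' from being read as operators.
    public static List<Regex> BuildExpressions(IEnumerable<string> phrases)
    {
        var expressions = new List<Regex>();
        foreach (string phrase in phrases)
        {
            expressions.Add(new Regex(@"\b" + Regex.Escape(phrase) + @"\b"));
        }
        return expressions;
    }

    // Tries every expression against the text, one after another, so the work
    // grows with the number of expressions times the length of the text.
    public static List<string> Recognize(IList<Regex> expressions, string text)
    {
        var hits = new List<string>();
        foreach (Regex expression in expressions)
        {
            foreach (Match m in expression.Matches(text))
            {
                hits.Add(m.Value);
            }
        }
        return hits;
    }
}
```

With the phrases "One Two" and "Jane Doe" and the text "Call One Two and Jane Doe, not One Twofold", this sketch reports "One Two" and "Jane Doe" and skips "One Twofold". Every additional name in the vocabulary adds one more full pass over the text, which is where a large vocabulary starts to hurt.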

I haven't given this much thought yet, but it is probably possible to come up with a more efficient algorithm. If you know the characteristics of your vocabulary, that is an extra advantage. In that case you can pre-sort your vocabulary, define one single regular expression that matches the general shape of your terms, and on each regex match do a quick binary search to verify that the matched text is actually in the sorted array.
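Here is a rough, untested-in-Word sketch of that idea. It assumes a vocabulary in which every term is exactly two capitalized ASCII words separated by a single space (for example a first and a last name); the class name and the pattern are made up for illustration, and wiring the class into an actual smart tag is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

sealed class SortedVocabularyRecognizer
{
    // Assumption: every term is two capitalized ASCII words with one space
    // between them. The candidate sits inside a lookahead so that overlapping
    // candidates ("Call Jane" and "Jane Doe") are both examined.
    private static readonly Regex Candidate =
        new Regex(@"\b(?=([A-Z][a-z]+ [A-Z][a-z]+)\b)");

    private readonly string[] sortedTerms;

    public SortedVocabularyRecognizer(IEnumerable<string> terms)
    {
        // Pre-sort once so that each later lookup is a binary search.
        sortedTerms = new List<string>(terms).ToArray();
        Array.Sort(sortedTerms, StringComparer.Ordinal);
    }

    public List<string> Recognize(string text)
    {
        var hits = new List<string>();
        foreach (Match m in Candidate.Matches(text))
        {
            string candidate = m.Groups[1].Value;
            // The same comparer must be used for sorting and for searching.
            if (Array.BinarySearch(sortedTerms, candidate, StringComparer.Ordinal) >= 0)
            {
                hits.Add(candidate);
            }
        }
        return hits;
    }
}
```

For a vocabulary of "One Two" and "Jane Doe" and the text "Call Jane Doe or One Two, not John Smith", the sketch returns "Jane Doe" and "One Two". "Call Jane" and "John Smith" fit the pattern but fail the binary search. The text is scanned once regardless of vocabulary size, and each candidate costs one O(log n) lookup. A hash set would do the lookup just as well; the sorted array simply follows the approach described above.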

Has anyone done this, or does anyone have other ideas?