Inside the KN Client Analysis Process
This is Glen Anderson again, Group Program Manager for the KN team. In last week’s blog entry I compared and contrasted KN client analysis with desktop search indexing. This week I’m going to continue talking about the details behind KN client analysis and answer some additional common questions. Hopefully this will help you better understand how it works and why we made some of the decisions we made. I suggest you read last week’s blog entry describing KN and Desktop Search if you haven’t already, since it provides an overview of the key steps in the analysis process.
Why Not Analyze Sources Other Than Email?
First, let’s be clear about the sources of information that KN client analysis actually looks at in order to create the best possible profile recommendations for you. Some of these are optional: you select and control them through the KN UI settings, as we described in the Dispelling the Myths entry (for example, whether you import data from your Outlook Contacts folder). I won’t make that distinction here.
For contacts, KN mines the following sources:
· To, From, CC, and BCC fields in the headers of your actual email interactions with others
· Outlook Contacts
· IM Contacts from MSN Messenger, Windows Messenger, and Office Communicator
· Active Directory - SharePoint pulls your manager, your direct reports as well as your manager’s direct reports from the Active Directory profile import process and assigns these people as your default “Colleagues”
· SharePoint My Site “Colleagues” – SharePoint 2007 enables you to add and manage your list of Colleagues, so KN synchronizes that list with your recommended profile
For keywords, KN mines the subject and body of your email interactions with others. KN does not mine information in attachments. Many other sources of information come to mind that provide “digital clues” as to what you know, such as Instant Messaging conversations, documents authored and read, blogs, and other in-house expertise systems.
So why doesn’t KN include these sources in our keyword generation algorithms? The answer really comes down to a technical challenge and a practical perspective. We have spent over 2 years experimenting, tuning, and adjusting our keyword generation algorithms with email as the data source. I’m sure you can appreciate the variability and complexity of email when you start to consider how to extract “meaning” and “concepts” across different writing styles, dialects, subjects, and user habits. The way KN makes sense of all this is through a weighted relevance scheme – overall “relevance” of a keyword is computed by weighting many different factors for each instance of the keyword. Technically, as soon as you add a second data source, things start to get much more complicated. KN would need a different set of weightings for this other data source that when merged with the email weightings yields a better overall result. Not impossible, but definitely costly. Given that reality, we looked at what value most KN users might get for the additional cost (including time to market) of adding one or more data sources.
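To make the weighted relevance idea concrete, here is a minimal sketch in Python. It is not KN's actual algorithm: the factor names (in_subject, in_body, authored_by_user) and the weight values are assumptions invented for illustration, since this post does not list the real factors. It only shows the general shape of the idea: score each instance of a keyword as a weighted sum of factors, then add up the instances.

```python
from collections import defaultdict

# Hypothetical factors and weights for the email data source.
# The real KN factors and values are not described in this post.
EMAIL_WEIGHTS = {
    "in_subject": 2.0,
    "in_body": 1.0,
    "authored_by_user": 3.0,
}


def instance_score(factors, weights):
    """Score one occurrence of a keyword as a weighted sum of its factors."""
    return sum(weights.get(name, 0.0) * value for name, value in factors.items())


def keyword_relevance(instances, weights):
    """Sum the per-instance scores for each keyword.

    `instances` is an iterable of (keyword, factors) pairs, where `factors`
    maps a factor name to its strength for that occurrence.
    """
    totals = defaultdict(float)
    for keyword, factors in instances:
        totals[keyword] += instance_score(factors, weights)
    return dict(totals)
```

In a sketch like this, a second data source would need its own weight table plus a rule for merging its totals with the email totals, which is where the extra tuning cost comes from.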
The answer becomes pretty clear at this point. Email is by far the richest and most pervasive source of tacit knowledge in the enterprise today. IM doesn’t come close (it might some day) and furthermore, IM conversations are not consistently saved anywhere, which means they wouldn’t help at all in building the initial profile. Blogs don’t come close either – they are a relatively new phenomenon. Documents are a relatively good source of information, but they are nowhere near as pervasive as email, nor as universally authored by workers in general.
So the bottom line is what I like to call an instance of the 80/20 rule. Mining email may give you 80% of the value for 20% of the cost. And it provides you with potentially the most treasured tacit knowledge – the emerging knowledge of the enterprise.
That’s not to say KN won’t add other data sources in the future. The KN team will be considering the value provided and feedback from customers’ real experiences using the product.
Won’t Spam Affect My Profile?
We all get spam email, whether it is actual spam from outside the organization, endless discussions on distribution lists or mailing lists we are part of, or even people within our organizations CC’ing us on everything under the sun. A common question people have is: Won’t this affect the quality of my profile? Well, the answer is: not really. When determining which keywords and contacts to associate with you and recommend for your profile, KN is careful to account for and deemphasize these types of items. Following are some of the techniques KN uses:
· In order for people to be recommended as contacts of yours, they need to meet a minimum interaction (email) threshold both in the total number of emails and the number of emails in each direction. So if you’ve never replied to a particular person for example, she will never be considered a contact of yours.
· In order for a keyword to be recommended for your profile, the phrase must not only have occurred a minimum number of times in your emails, but must also have been “authored” (i.e., in a mail sent by you) a minimum number of times.
· Oh and by the way, we don’t look at your Outlook Junk E-Mail folder.
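The three techniques above can be sketched as simple threshold filters. This is an illustration only, not KN's code: the message layout (a dict with sender, recipients, folder, and keywords) and the threshold values are assumptions, since the actual minimums are not given here.

```python
from collections import Counter

JUNK_FOLDER = "Junk E-Mail"


def recommend_contacts(messages, me, min_total=5, min_each_way=2):
    """Return people who meet both the total and the per-direction minimums.

    The default thresholds are hypothetical placeholders.
    """
    sent, received = Counter(), Counter()
    for msg in messages:
        if msg["folder"] == JUNK_FOLDER:
            continue  # the junk folder is never looked at
        if msg["sender"] == me:
            sent.update(msg["recipients"])
        elif me in msg["recipients"]:
            received[msg["sender"]] += 1
    return sorted(
        person
        for person in set(sent) | set(received)
        if person != me
        and sent[person] + received[person] >= min_total
        and sent[person] >= min_each_way
        and received[person] >= min_each_way
    )


def recommend_keywords(messages, me, min_total=4, min_authored=2):
    """Return keywords seen often enough overall and in mail sent by `me`."""
    total, authored = Counter(), Counter()
    for msg in messages:
        if msg["folder"] == JUNK_FOLDER:
            continue
        total.update(msg["keywords"])
        if msg["sender"] == me:
            authored.update(msg["keywords"])
    return sorted(
        kw
        for kw in total
        if total[kw] >= min_total and authored[kw] >= min_authored
    )
```

With filters like these, a sender who mails you constantly but whom you never answer fails the per-direction check, and a phrase that only ever shows up in mail you receive fails the authored check.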
How do Outlook Rules Affect Analysis?
During the initial setup and configuration process, the KN client provides the ability to set exactly which folders should and shouldn’t be looked at during the analysis process. So for example, if you file personal emails, jokes, employee performance evaluation or review information, etc., in special folders you might naturally exclude them from analysis. But your email hits your inbox first and then is either moved manually or via Outlook rules to the folder of your choice. The question arises as to how this affects the profile that KN recommends for you.
The key point to understand here is that analysis happens at a point in time: one point in time for the initial analysis and another for each incremental analysis (which by default runs every 14 days). Whatever folder an item is in at that point in time is the folder we use to determine whether to include or exclude the item from analysis. For items that are moved via Outlook rules to specific folders, there is only a very small window when the message passes through the inbox. So practically speaking, those messages will not get analyzed unless their destination folders are part of the analysis. For items that are moved manually to specific folders, it depends on whether analysis ran before or after you moved them. The second key point to keep in mind is that KN analysis is about calculating statistics across all your emails (you need a keyword to occur X times before it is recommended; you need to communicate back and forth Y times with a person for her to be considered a contact). So even if one of those emails got caught in the inbox, it very likely would have no effect on the outcome of your profile.
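As a rough sketch of the point-in-time behavior, assume a mailbox snapshot is just a list of items, each recording the folder it sits in when analysis runs. The function names and data layout below are hypothetical. The 14-day interval is the default mentioned above.

```python
from datetime import date, timedelta

INCREMENTAL_INTERVAL_DAYS = 14  # default interval between incremental analyses


def next_analysis_date(last_run):
    """Date of the next incremental analysis after `last_run`."""
    return last_run + timedelta(days=INCREMENTAL_INTERVAL_DAYS)


def items_to_analyze(snapshot, included_folders):
    """Pick items by the folder they are in when the snapshot is taken.

    `snapshot` is a list of {"id": ..., "folder": ...} dicts (an assumed
    layout). Where an item was earlier, such as the inbox, is irrelevant.
    """
    return [item["id"] for item in snapshot if item["folder"] in included_folders]
```

A message that a rule has already moved to an excluded folder simply does not appear among the selected items, no matter how briefly it passed through the inbox.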
I think that’s enough for this week. Let us know if you have any questions or comments.
Comments
- Anonymous
August 17, 2006
Another great posting. Keep 'em coming :-)
http://www.mikeysgblog.com
- Anonymous
September 18, 2006
I'd like to see the ability to add SharePoint Lists or Document Libraries as sources of mining.
This seems to provide a somewhat consistent interface for the KN team to build from.
Perhaps add some Content types that by default integrate with KN.
- Anonymous
September 18, 2006
Great idea. We will evaluate that for a future release.