01/25/2016

May 2012

Volume 27 Number 05

Kinect - Multimodal Communication with Kinect

By Leland Holmquest | May 2012

In the April issue (msdn.microsoft.com/magazine/hh882450) I introduced you to “Lily,” a virtual assistant intended to help office workers in their daily tasks. I demonstrated how to use the Microsoft Kinect for Windows SDK to create context-aware dialogue, enabling Lily to listen and respond appropriately with respect to the intentions of the user.

Now I’ll take you through the next step in achieving a natural UI by leveraging the Kinect device’s skeletal tracking to facilitate user interaction through gestures. Then I’ll tie the two concepts together and demonstrate multimodal communication by having Lily’s output dependent on not only what gesture was made, but also what spoken command was issued. By combining these two modes of communication, the user comes away with a much richer experience and gets a step closer to ubiquitous computing. The presentation of Lily is in the form of a Windows Presentation Foundation (WPF) application.

Initializing Kinect

The first step in using the Kinect device is to construct a Runtime, setting various parameters. Figure 1 shows the configuration I chose for Lily.

Figure 1 Kinect Runtime Construction

// Initialize Kinect. With a single device attached, it sits at index 0.
nui = Runtime.Kinects[0]; // previously: new Runtime();
// Declare every capability up front; undeclared ones are unavailable later.
nui.Initialize(RuntimeOptions.UseDepthAndPlayerIndex |
  RuntimeOptions.UseDepth | RuntimeOptions.UseColor |
  RuntimeOptions.UseSkeletalTracking);
nuiInitialized = true;
// Turn on smoothing of the skeleton data and set its parameters.
nui.SkeletonEngine.TransformSmooth = true;
nui.SkeletonEngine.SmoothParameters = new TransformSmoothParameters
{
  Smoothing = 0.75f,
  Correction = 0.0f,
  Prediction = 0.0f,
  JitterRadius = 0.05f,
  MaxDeviationRadius = 0.04f
};
// Open the color and depth streams, each with a pool of two buffered frames.
nui.VideoStream.Open(ImageStreamType.Video, 2,
  ImageResolution.Resolution640x480, ImageType.Color);
nui.DepthStream.Open(ImageStreamType.Depth, 2,
  ImageResolution.Resolution320x240, ImageType.DepthAndPlayerIndex);

When setting up a Kinect device, a number of options are available. First, notice the first line of code in Figure 1. The Kinect for Windows SDK beta 2 has a different constructor for the Runtime. By referencing an index (Runtime.Kinects[0];), it’s simple to attach multiple Kinect units to the application. In this application I’ve limited it to a single Kinect device, so by definition the Runtime must be at location [0]. You can iterate through the collection of Runtime.Kinects to handle multiple Kinect units if available. Next, I need to tell the Kinect device what capabilities are going to be used. This is done by passing the desired capabilities into the Initialize method. There are four values from which to choose:

UseColor enables the application to process the color image information.
UseDepth enables the application to make use of the depth image information.
UseDepthAndPlayerIndex enables the application to make use of the depth image information as well as the player index generated by the skeleton tracking engine.
UseSkeletalTracking enables the application to use the skeleton tracking data.

Passing in these values tells the API what subsystems in the Kinect device are going to be used so the appropriate parts of the multistage pipeline of the Runtime can be started. It’s important to note that you can’t access capabilities later in the application that aren’t declared during the initialization. For example, if the only option selected was RuntimeOptions.UseColor and later using the depth information was required, it wouldn’t be available. Therefore, I’ve passed in all of the values available, indicating that I intend to use the full capabilities of the Kinect device.
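The declare-up-front rule can be pictured with an ordinary C# flags enumeration. The following is a minimal sketch, not the SDK’s RuntimeOptions type: the SensorOptions enum, its numeric values and the CapabilityCheck helper are all hypothetical names introduced only to show how a combined set of options can be tested for a required capability.

```csharp
using System;

// Hypothetical stand-in for the SDK's RuntimeOptions. The names mirror the
// article, but the numeric values are assumptions made for illustration.
[Flags]
public enum SensorOptions
{
    None = 0,
    UseColor = 1,
    UseDepth = 2,
    UseDepthAndPlayerIndex = 4,
    UseSkeletalTracking = 8
}

public static class CapabilityCheck
{
    // Returns true only if every flag in 'required' is present in 'declared'.
    public static bool IsDeclared(SensorOptions declared, SensorOptions required)
    {
        return (declared & required) == required;
    }
}
```

With only SensorOptions.UseColor declared, CapabilityCheck.IsDeclared returns false for SensorOptions.UseDepth, which mirrors the situation described above in which depth data is requested after initializing for color alone.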

Tracking Users

Before discussing the next section in the code, let’s look at what the Kinect device is really giving us. When using the skeleton tracking capability, the Kinect device can track up to two active humans interacting with the system. It achieves this by creating a collection of 20 joints and associating an ID with each. Figure 2 shows what joints are being modeled.

Figure 2 The 20 Joints that Are Modeled in Kinect

Figure 3 is an image of the joints being captured from two separate users.

Figure 3 Two Active Skeletons

In order for a skeleton to become active, the Kinect device must be able to see the user from head to foot. Once a skeleton is active, if a joint goes out of view, the Kinect device will try to interpolate where that part of the skeleton is. If you’re going to build Kinect-enabled applications, I strongly encourage you to create a simple application just to watch the skeleton streams and interact with the Kinect device. Make sure you have multiple users participate and set up scenarios where obstructions come between your users and the Kinect device—scenarios that mimic what your application will experience once deployed. This will give you an excellent understanding of how the skeleton tracking works and what it’s capable of, as well as what limitations you might want to address. You’ll quickly see how amazing the technology is and how creative it can be in the interpolation.

In some scenarios (like the one represented by Project Lily) the speed and choppiness of this interpolation can be distracting and unproductive. Therefore the API exposes the ability to control a level of smoothing. Referring to Figure 1 again, first use the SkeletonEngine on the Runtime to set the TransformSmooth to true. This tells the Kinect device that you want to affect the smoothness of the data being rendered. Then set the SmoothParameters. Here’s a brief description of each of the TransformSmoothParameters:

Correction controls the amount of correction with values ranging from 0 to 1.0 and a default value of .5.
JitterRadius controls the jitter-reduction radius. The value passed in represents the radius desired in meters. The default value is set to 0.05, which translates into 5 cm. Any jitter that goes beyond this radius is clamped to the radius.
MaxDeviationRadius controls the maximum radius (in meters) that corrected positions can deviate from the raw data. The default value is 0.04.
Prediction controls the number of predicted frames.
Smoothing controls the amount of smoothing with a range of 0 to 1.0. The documentation makes a specific point that smoothing has an impact on latency; increasing smoothing increases latency. The default value is 0.5. Setting the value to 0 causes the raw data to be returned.
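To get a feel for why more smoothing means more latency, it can help to experiment with a much simpler filter than the one in the Kinect pipeline. The sketch below uses plain exponential smoothing, which is a stand-in chosen for illustration and is not claimed to be the SDK’s algorithm; SmoothingSketch, Smooth and FramesToReach are hypothetical names. It counts how many frames a filtered value needs to catch up after a joint jumps from 0 to 1.

```csharp
using System;

public static class SmoothingSketch
{
    // One step of simple exponential smoothing. A smoothing value of 0
    // returns the raw sample unchanged; values near 1 favor the history.
    public static double Smooth(double previous, double raw, double smoothing)
    {
        return smoothing * previous + (1.0 - smoothing) * raw;
    }

    // Counts the frames needed for the filtered value to reach 'fraction'
    // of a sudden step from 0 to 1. More frames means more latency.
    public static int FramesToReach(double smoothing, double fraction)
    {
        if (smoothing < 0.0 || smoothing >= 1.0)
            throw new ArgumentOutOfRangeException("smoothing");
        if (fraction <= 0.0 || fraction >= 1.0)
            throw new ArgumentOutOfRangeException("fraction");

        double value = 0.0;
        int frames = 0;
        while (value < fraction)
        {
            value = Smooth(value, 1.0, smoothing);
            frames++;
        }
        return frames;
    }
}
```

In this simplified model, a smoothing value of 0 reaches the new position in a single frame, a value of 0.5 needs four frames to cover 90 percent of the jump, and 0.75 needs more still.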

Video and Depth Streams

You’ll want to experiment with the smoothing settings described in the previous section in your own application, depending on the requirements that you’re fulfilling. The last thing needed for this application is to open the VideoStream and the DepthStream. This facilitates viewing the video images coming from the color camera and the depth images coming from the depth camera, respectively. Later on I’ll show you how this gets connected to the WPF application.

The Open method requires four parameters. The first is streamType. It represents the type of stream that’s being opened (for example, Video). The second parameter is poolSize. This represents the number of frames that the Runtime is to buffer. The maximum value is 4. The third parameter is resolution, which represents the resolution of the desired images. The available values are 80x60, 320x240, 640x480 and 1280x1024. The last parameter indicates the desired type of image (for example, Color).

Kinect Events

With the Runtime successfully initialized, it’s time to wire up the events made available from the Runtime to the application. For Lily, the first two events that will be handled are used simply to give the end user a graphical view of the color images and the depth images. First, let’s look at the method that’s handling the Runtime.VideoFrameReady event. This event passes an ImageFrameReadyEventArgs as its event argument. The nui_VideoFrameReady method is where Lily handles the event, as shown in the following code:

void nui_VideoFrameReady(object sender, ImageFrameReadyEventArgs e)
{
  // Pull out the video frame from the eventargs and
  // load it into our image object.
  PlanarImage image = e.ImageFrame.Image;
  BitmapSource source =
    BitmapSource.Create(image.Width, image.Height, 96, 96,
    PixelFormats.Bgr32, null, image.Bits,
    image.Width * image.BytesPerPixel);
  colorImage.Source = source;
}

The Kinect for Windows API makes this method simple. The ImageFrameReadyEventArgs contains an ImageFrame.Image. I convert that to a BitmapSource and then pass that BitmapSource to an Image control in the WPF application. The frame coming from the Kinect device’s color camera is thus displayed on the application, like what you see in Figure 3.
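The last argument passed to BitmapSource.Create in the handler is the stride, that is, the number of bytes in one row of the image. The arithmetic is worth seeing on its own. This is a small sketch with hypothetical names (FrameLayout, Stride, BufferSize), and it assumes tightly packed rows with no padding, which is what the width-times-bytes-per-pixel expression in the handler implies.

```csharp
public static class FrameLayout
{
    // Bytes in one row of pixels, assuming rows are tightly packed.
    public static int Stride(int width, int bytesPerPixel)
    {
        return width * bytesPerPixel;
    }

    // Total bytes in one frame: one stride per row.
    public static int BufferSize(int width, int height, int bytesPerPixel)
    {
        return Stride(width, bytesPerPixel) * height;
    }
}
```

For a 640x480 frame at 4 bytes per pixel (the size of a Bgr32 pixel), the stride is 2,560 bytes and a whole frame occupies 1,228,800 bytes.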

The DepthFrameReady event, which is being handled by nui_DepthFrameReady, is similar but needs a little more work to get a useful presentation. You can look at this method in the code download, which is the same download that accompanied last month’s article (msdn.com/magazine/msdnmag0412). I didn’t create this method myself, but found it used in a number of examples online.

The event handler that really starts to get interesting is the nui_SkeletonFrameReady method. This method handles the SkeletonFrameReady event and gets passed in SkeletonFrameReadyEventArgs, as shown in Figure 4.

Figure 4 nui_SkeletonFrameReady

void nui_SkeletonFrameReady(object sender, SkeletonFrameReadyEventArgs e)
{
  renderSkeleton(sender, e);
  if (!trackHands)
    return;// If the user doesn't want to use the buttons, simply return.
  if (e.SkeletonFrame.Skeletons.Count() == 0)
    return;// No skeletons, don't bother processing.
  SkeletonFrame skeletonSet = e.SkeletonFrame;
  SkeletonData firstPerson = (from s in skeletonSet.Skeletons
                              where s.TrackingState ==
                              SkeletonTrackingState.Tracked
                              orderby s.UserIndex descending
                              select s).FirstOrDefault();
  if (firstPerson == null)
    return;// If no one is being tracked, no sense in continuing.
  JointsCollection joints = firstPerson.Joints;
  Joint righthand = joints[JointID.HandRight];
  Joint lefthand = joints[JointID.HandLeft];
  // Use the height of the hand to figure out which is being used.
  Joint joinCursorHand = (righthand.Position.Y > lefthand.Position.Y)
    ? righthand
    : lefthand;
  // Scale the hand position to screen coordinates once and reuse the result.
  Joint scaledHand = joinCursorHand.ScaleTo(
    (int)SystemParameters.PrimaryScreenWidth,
    (int)SystemParameters.PrimaryScreenHeight);
  float posX = scaledHand.Position.X;
  float posY = scaledHand.Position.Y;
  Joint scaledCursorJoint = new Joint
  {
    TrackingState = JointTrackingState.Tracked,
    Position = new Microsoft.Research.Kinect.Nui.Vector
    {
      X = posX,
      Y = posY,
      Z = joinCursorHand.Position.Z
    }
  };
  OnButtonLocationChanged(kinectButton, buttons, 
    (int)scaledCursorJoint.Position.X,
    (int)scaledCursorJoint.Position.Y);
}

One thing I found necessary to put into this application was that first conditional in Figure 4. Spoken commands set the trackHands variable, which in turn determines whether the user’s hands are tracked. If trackHands is set to false, the code simply returns out of this method. If Lily tracks the user’s hands when that isn’t the desired behavior, it quickly becomes tedious and tiring.

Similarly, if no skeletons are being tracked (either there are no users, or they’re out of the view range of the Kinect device), there’s no sense in continuing to evaluate the data, so the code returns out of the method. However, if there is a skeleton and the user wants the hands tracked, the code continues to evaluate. Most of this method came from the sample code that accompanies the HoverButton project (bit.ly/nUA2RC). One of the interesting things happening in this method is that the code checks which of the user’s hands is physically higher. It then assumes that the higher hand is the one being used to potentially select a button. The code goes on to determine whether a button is being hovered over, and renders a “hand” on the screen at the position that corresponds to the location of the user’s hand. In other words, as the user moves her hand, a graphical hand moves around the screen in like fashion. This gives the user a natural interface, no longer bound by the cord of the mouse. The user is the controller.
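The two ideas in that method, choosing the higher hand and mapping it onto the screen, can be tried out without a Kinect device attached. The following sketch is an illustration only. HandPosition and HandCursorSketch are hypothetical names, and the mapping assumes hand coordinates that run from -1 to 1 with Y growing upward; it is not the implementation behind the ScaleTo extension method used in Figure 4.

```csharp
using System;

// Minimal stand-in for a joint position; not the SDK's Vector type.
public struct HandPosition
{
    public float X;
    public float Y;
    public float Z;

    public HandPosition(float x, float y, float z)
    {
        X = x;
        Y = y;
        Z = z;
    }
}

public static class HandCursorSketch
{
    // Picks whichever hand is physically higher (larger Y), as in Figure 4.
    public static HandPosition PickHigherHand(HandPosition right, HandPosition left)
    {
        return right.Y > left.Y ? right : left;
    }

    // Maps X from the assumed range [-1, 1] to [0, screenWidth], clamping
    // positions that fall outside the range to the screen edge.
    public static int ToScreenX(float x, int screenWidth)
    {
        float scaled = (x + 1f) / 2f * screenWidth;
        return (int)Math.Max(0f, Math.Min(screenWidth, scaled));
    }

    // Maps Y the same way but flipped, because the hand's Y grows upward
    // while screen coordinates grow downward.
    public static int ToScreenY(float y, int screenHeight)
    {
        float scaled = (1f - y) / 2f * screenHeight;
        return (int)Math.Max(0f, Math.Min(screenHeight, scaled));
    }
}
```

Under these assumptions, a hand at the center of the range (0, 0) lands in the middle of a 1920x1080 screen at pixel (960, 540), and a hand that strays outside the range is pinned to the nearest edge rather than disappearing.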

The next item of interest is when the system determines that one of the HoverButtons is clicked. Lily has a total of eight buttons on the screen. Each has an on_click event handler wired in. At this point, I need to cover three special classes: ButtonActionEvaluator, LilyContext and MultiModalReactions.

The action of clicking a button has a corresponding event associated with it, but Lily takes this single action and checks if it can be coupled to a corresponding audio command to evaluate as a multimodal communication that would take on a higher level of meaning. For example, clicking one of the HoverButtons represents the intention of selecting a project. With that information, the only action required by the system is to note that the context, with respect to the project being worked on, has changed. No further action is desired. However, if the user either previously made an unsatisfied request to “open the project plan” or subsequently makes the same request, the application must put these two disparate pieces of data together to create a higher order of meaning (the communication coming from two separate modes makes this multimodal communication) and respond accordingly. To make this all occur in a seamless fashion, the following design was implemented.

The ButtonActionEvaluator class is implemented as a singleton and implements the INotifyPropertyChanged interface. The PropertyChanged event that this interface exposes is handled by the LilyContext class (also a singleton). The following code probably requires a bit of explaining, even though it looks innocuous enough:

void buttonActionEvaluator_PropertyChanged(
  object sender, System.ComponentModel.PropertyChangedEventArgs e)
{
  if (MultiModalReactions.ActOnMultiModalInput(
    buttonActionEvaluator.EvaluateCriteria()) == 
    PendingActionResult.ClearPendingAction)
    buttonActionEvaluator.ActionPending = 
      PendingAction.NoneOutstanding;
}

Evaluating Lily’s State

First, the preceding code calls the EvaluateCriteria method on the buttonActionEvaluator class. This method simply returns a numerical representation for the state as defined by the ActionPending and SelectedButton properties. This is at the heart of how the application is able to infer meaning through the use of multimodal communication. In traditional applications, the desired action is evaluated by looking at the state of a single event or property (for example, button1.clicked). But with Lily, the state being evaluated (from the multimodal perspective) is the combination of two otherwise separate properties. In other words, each property has significance and requires actions independently, but when evaluated together, they take on a new and higher level of meaning.
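To make the idea of a combined state concrete, here is a minimal sketch of a singleton that raises PropertyChanged and folds two properties into one number. It is an assumption-laden illustration rather than Lily’s actual class: the ButtonActionEvaluatorSketch and ProjectButton names, the OpenProjectPlan and ProjectLily members, and the "pending action times 100 plus button" encoding are all hypothetical. Only PendingAction.NoneOutstanding, ActionPending, SelectedButton and EvaluateCriteria are names taken from the article.

```csharp
using System;
using System.ComponentModel;

public enum PendingAction { NoneOutstanding = 0, OpenProjectPlan = 1 }
public enum ProjectButton { None = 0, ProjectLily = 1 }

public sealed class ButtonActionEvaluatorSketch : INotifyPropertyChanged
{
    // Singleton: one shared instance, created once.
    private static readonly ButtonActionEvaluatorSketch instance =
        new ButtonActionEvaluatorSketch();

    public static ButtonActionEvaluatorSketch Instance
    {
        get { return instance; }
    }

    private ButtonActionEvaluatorSketch() { }

    public event PropertyChangedEventHandler PropertyChanged;

    private PendingAction actionPending = PendingAction.NoneOutstanding;
    private ProjectButton selectedButton = ProjectButton.None;

    // Set by the speech side when a request can't be satisfied yet.
    public PendingAction ActionPending
    {
        get { return actionPending; }
        set { actionPending = value; OnPropertyChanged("ActionPending"); }
    }

    // Set by the gesture side when a button is selected.
    public ProjectButton SelectedButton
    {
        get { return selectedButton; }
        set { selectedButton = value; OnPropertyChanged("SelectedButton"); }
    }

    // Folds both properties into one number (assumed encoding).
    public int EvaluateCriteria()
    {
        return (int)actionPending * 100 + (int)selectedButton;
    }

    private void OnPropertyChanged(string propertyName)
    {
        PropertyChangedEventHandler handler = PropertyChanged;
        if (handler != null)
            handler(this, new PropertyChangedEventArgs(propertyName));
    }
}
```

Because either property change raises the event, a subscriber gets a chance to re-evaluate the combined state no matter which mode of communication arrived first.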

That numeric representation of the combined state is then passed into the ActOnMultiModalInput method on the MultiModalReactions class. This method implements a large switch statement that handles all of the permutations possible. (This is a rudimentary implementation that was used to illustrate the point. Future iterations of Lily will replace this implementation with more advanced techniques such as state machines and machine learning to enhance the overall experience and usability.) If this method results in the intention of the user being satisfied (for example, the user intends for the system to open the project plan for Project Lily), the return value is PendingActionResult.ClearPendingAction. This leaves the context of the system still in the frame of reference of Project Lily, but there’s no action waiting to be executed in the queue. If the user’s intention is still unsatisfied, PendingActionResult.LeavePendingActionInPlace is returned, telling the system that whatever action was taken hasn’t yet satisfied the user’s intention, and therefore not to clear the pending action.
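A switch of that kind might look like the following in miniature. This is a sketch under stated assumptions, not the code from the download: MultiModalReactionsSketch and the three state constants are hypothetical, and the state codes assume an encoding of pending action times 100 plus selected button. The two PendingActionResult values are the ones named in the article.

```csharp
public enum PendingActionResult
{
    ClearPendingAction,
    LeavePendingActionInPlace
}

public static class MultiModalReactionsSketch
{
    // Hypothetical state codes (pending action * 100 + selected button).
    public const int ButtonOnly = 1;        // A project was selected; nothing was requested.
    public const int SpeechOnly = 100;      // "Open the project plan," but no project yet.
    public const int SpeechAndButton = 101; // Both modes have supplied their piece.

    public static PendingActionResult ActOnMultiModalInput(int state)
    {
        switch (state)
        {
            case SpeechAndButton:
                // Request and project are both known, so the request
                // could be carried out here and the pending action cleared.
                return PendingActionResult.ClearPendingAction;

            case SpeechOnly:
                // The request is known but the project isn't; keep waiting.
                return PendingActionResult.LeavePendingActionInPlace;

            case ButtonOnly:
            default:
                // Only the context changed; there is nothing to act on or clear.
                return PendingActionResult.LeavePendingActionInPlace;
        }
    }
}
```

The order in which the two inputs arrive doesn’t matter in this sketch: the spoken request and the button selection each produce an intermediate state, and only the combined state clears the pending action.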

In the first article I showed how to create grammars that are specific to a given domain or context. The Kinect unit, leveraging the Speech Recognition Engine, used these grammars, loading and unloading them to meet the needs of the user. This created an application that doesn’t require the user to stick to a scripted interaction. The user can go in whatever direction she desires and change directions without having to reconfigure the application. This created a natural way of establishing dialogue between the human user and computer application.

Higher Level of Meaning

In this article I demonstrated how to couple actions resulting from context-aware grammars to a user’s physical gesturing in the form of selecting buttons by hovering one’s hand over a button. Each event (speechDetected and buttonClicked) can be handled individually and independently. But in addition, the two events can be correlated by the system, bringing a higher level of meaning to the events and acting accordingly.

I hope you’re as excited about the capabilities that Kinect puts into our hands as I am. I think Microsoft has brought us to the edge where human-computer interfaces can take leaps forward. As testimony to this, as I developed Lily, there were times when I was testing different components and sections of code. As the application matured and I was able to actually “talk” to Lily, I would find something wrong, switch to my second monitor and start looking up something in the documentation or elsewhere. But I would continue to verbally interact with Lily, asking her to execute tasks or even to shut down. I found that when Lily was unavailable, I became perturbed, because Lily had become a significant enabler, taking petty tasks off my hands through simple verbal communications.

And incorporating little “tricks” into the dialogue mechanism (for example, random but contextually and syntactically correct responses) made the adoption of the application intuitive and satisfying. Kinect truly makes your body the controller. Where you go with it is limited only by your imagination. What will you Kinect?

Leland Holmquest is an enterprise strategy consultant at Microsoft. Previously he worked for the Naval Surface Warfare Center Dahlgren. He’s working on his Ph.D. in Information Technology at George Mason University.

Thanks to the following technical expert for reviewing this article: Mark Schwesinger