Azure Cognitive Services: Bing Speech API and Language Understanding Intelligent Service (LUIS)
Setting the scene...
You are in command of the most advanced starship: the SS TechNetWiki. This is a critical mission for the space federation but, as with many IT projects, the coffers have run dry and the budget is limited. The starship is outfitted with the latest technology, but there was only enough cash left to hire you as the captain. Fortunately, there is Cognitive Services...
Cognitive Services
Cognitive Services is a collection of Azure-hosted intelligent algorithms offered as a service. There are five main categories: Vision, Speech, Language, Knowledge, and Search. For this wiki, the Bing Speech API (from Speech) and Language Understanding (from Language) will be used to turn spoken commands into a form that an application can act on.
Bing Speech API
The Bing Speech API can be used to convert human speech to text and to convert text to audio streams. There are two ways to use the API: REST API endpoints and client libraries.
Most features are available through both, but the client libraries have three distinct advantages:
- Converting longer audio (>15 seconds)
- Receiving interim results through an event while longer audio is being recognized
- Understanding the recognized text using LUIS (this will be shown in the scenario below)
Language Understanding Intelligent Service (LUIS)
Language Understanding Intelligent Service (LUIS) uses machine learning to extract the meaning of natural language. The key to LUIS is a domain model, which can range from a prebuilt domain to a custom one. There are three core concepts:
- Intents - An intent represents an action the user wants to perform. For example, fire photon torpedoes or engines on.
- Utterances - Text input from the user like "engines on maximum" or "all ahead full"
- Entities - An entity is a piece of information that is relevant to the utterance like "engine"
A LUIS application can be built using the Authoring APIs or the LUIS website. There are already resources on how to build a LUIS app, so this wiki will not repeat those steps. Please see Microsoft Bot Framework Basics: Building Intelligent Bots - Adding Language Understanding Capability (Part 2) by Chervine for more information.
Back to the scenario
A LUIS app, associated with US English, has been created to help the captain control the SS TechNetWiki:
Three intents have been defined to fire the photon torpedoes, turn the engines on and turn the engines off:
If we inspect the Torpedoes intent, we can see the different utterances used to train the model:
When testing the intent, we will try several phrases:
- "fire now" - Hopefully this will not match the intent as the entity is not clear. Is this for the lasers, shuttle pods or torpedoes?
- "fire torpedoes now" - This should be a good match as it is similar to our utterances.
- "shoot torpedoes" - This should also be a good match as shoot and fire are synonyms.
- "eat chocolate" - Hmmm. Shouldn't be a match.
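One way an application could act on these expectations is to accept the top-scoring intent only when its score clears a threshold. The following is a minimal Python sketch of that idea, not code from the sample project: the `choose_intent` helper and the 0.8 cut-off are hypothetical, and it assumes the app keeps the built-in "None" intent that LUIS uses for utterances that match nothing.

```python
from typing import Optional

# Hypothetical cut-off; tune it against real test results from the LUIS app.
MIN_SCORE = 0.8


def choose_intent(intent: str, score: float, min_score: float = MIN_SCORE) -> Optional[str]:
    """Return the intent name if it is safe to act on, otherwise None."""
    # Utterances such as "eat chocolate" should fall into the "None" intent.
    if intent == "None":
        return None
    # Reject weak matches, for example an ambiguous "fire now".
    if score < min_score:
        return None
    return intent


# 0.9988 is the score reported in the Summary for "fire torpedoes now".
print(choose_intent("Torpedoes", 0.9988))
# The remaining scores are placeholders, not measured results.
print(choose_intent("Torpedoes", 0.35))
print(choose_intent("None", 0.97))
```

Returning `None` for a rejected match lets the caller decide what to do next, for example asking the captain to repeat the command instead of firing on a guess.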
In the LUIS website there is a handy Test panel that allows for quick testing of the utterances:
Sample Project
The sample project can be used as a starting point to explore the Bing Speech API and LUIS. It includes an example of using the Cognitive Services client library and an example of calling the services through the web API.
Client Libraries
Both the Bing Speech API and LUIS offer client libraries:
Bing Speech API: C# desktop library, C# service library, JavaScript, Java library for Android, Objective-C library for iOS
LUIS: C#, Java, JavaScript, Node.js, Python, PHP, Ruby
The following shows a call to Bing Speech API:
There are a couple of things to note. First, an implementation of IAuthorizationProvider needs to be provided (see GitHub or the sample project for an example). Second, the client subscribes to events that carry the result of the processing. The result is either a PartialRecognitionResult or a RecognitionResult containing a status and a collection of RecognitionPhrases. The following illustrates how the result can be parsed and displayed in the console:
REST APIs
The Bing Speech REST API is a request/response API: the audio file is uploaded and the result is returned in JSON format. Please see the project for details. The response is worth inspecting as it contains a status, the offset from the start of the audio at which the phrase was recognized, and the best matches:
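As a rough illustration of inspecting such a response, the Python sketch below parses a JSON payload and picks the highest-confidence match. The field names (`RecognitionStatus`, `Offset`, `NBest`, `Confidence`, `Display`) are assumed to follow the service's detailed output format, the sample payload is invented for illustration, and `best_match` is a hypothetical helper. Check the response returned by your own call before relying on any of them.

```python
import json
from typing import Optional

# Invented sample payload in the assumed "detailed" response shape.
SAMPLE_RESPONSE = """
{
  "RecognitionStatus": "Success",
  "Offset": 5000000,
  "Duration": 14000000,
  "NBest": [
    {"Confidence": 0.62, "Display": "Fire torpedo is now."},
    {"Confidence": 0.91, "Display": "Fire torpedoes now."}
  ]
}
"""

# Assumes Offset is expressed in 100-nanosecond units.
TICKS_PER_SECOND = 10_000_000


def best_match(raw_json: str) -> Optional[dict]:
    """Return the top phrase and its start offset in seconds, or None on failure."""
    body = json.loads(raw_json)
    if body.get("RecognitionStatus") != "Success":
        return None
    candidates = body.get("NBest", [])
    if not candidates:
        return None
    # Pick the candidate with the highest confidence rather than trusting list order.
    top = max(candidates, key=lambda c: c["Confidence"])
    return {
        "text": top["Display"],
        "confidence": top["Confidence"],
        "offset_seconds": body["Offset"] / TICKS_PER_SECOND,
    }


print(best_match(SAMPLE_RESPONSE))
```

Checking the status before reading the matches matters because an unsuccessful recognition may not carry any candidate phrases at all.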
The result of the Bing Speech API can be submitted directly to LUIS and this is illustrated below:
The API is detailed in the documentation, but note that the service call can target the staging environment and can perform spell checking on the query.
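To make those two options concrete, here is a hedged Python sketch that builds a query URL and reads the top intent from a response. It assumes the v2.0 endpoint shape and the `staging`, `spellCheck`, and `bing-spell-check-subscription-key` query parameters; the region, app ID, and keys are placeholders, `build_luis_url` and `top_intent` are hypothetical helpers, and the response dictionary only mimics the assumed `topScoringIntent` shape. Confirm the exact names against the LUIS Endpoint API documentation.

```python
from typing import Optional, Tuple
from urllib.parse import urlencode

# Assumed v2.0 endpoint shape; region and app ID are filled in by the caller.
ENDPOINT_TEMPLATE = "https://{region}.api.cognitive.microsoft.com/luis/v2.0/apps/{app_id}"


def build_luis_url(region: str, app_id: str, subscription_key: str, utterance: str,
                   staging: bool = False, spell_check: bool = False,
                   spell_check_key: Optional[str] = None) -> str:
    """Build the GET URL for a LUIS query (parameter names are assumptions)."""
    params = {"subscription-key": subscription_key, "q": utterance}
    if staging:
        # Query the staging slot instead of production.
        params["staging"] = "true"
    if spell_check:
        params["spellCheck"] = "true"
        if spell_check_key:
            params["bing-spell-check-subscription-key"] = spell_check_key
    base = ENDPOINT_TEMPLATE.format(region=region, app_id=app_id)
    return base + "?" + urlencode(params)


def top_intent(response: dict) -> Tuple[Optional[str], Optional[float]]:
    """Extract the top intent name and score from an assumed response shape."""
    top = response.get("topScoringIntent", {})
    return top.get("intent"), top.get("score")


url = build_luis_url("westus", "YOUR-APP-ID", "YOUR-KEY", "fire torpedoes now",
                     staging=True, spell_check=True, spell_check_key="YOUR-SPELL-KEY")
print(url)

# Illustrative response; the score is the one quoted in the Summary.
sample = {"query": "fire torpedoes now",
          "topScoringIntent": {"intent": "Torpedoes", "score": 0.9988}}
print(top_intent(sample))
```

Building the query string with `urlencode` keeps the recognized text safe to send even when it contains spaces or punctuation.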
Summary
So how did LUIS do with the phrase "fire torpedoes now" spoken in an audio file? It was 99.88% certain that this was the command to launch the photon torpedoes. More testing is probably needed before launching, but it looks promising so far.
Additional information:
- Speech Test MSDN Project containing the source code
- Cognitive Services
- Speech Basic Concepts
- LUIS Endpoint API
- LUIS Documentation
- Speech Service Library project