Custom speech recognition

20 minutes

To do better than the default Windows Speech Recognition, we need to write code for an app-specific speech recognition system designed to handle full-sentence input.

This is a fairly large coding task, so it may be worth first improving the app's AutomationProperties.Name properties and testing again with Windows Speech Recognition. We already have a system that is accessible to this input method, although it is somewhat crude. For truly fluent speech input in a specialized context, however, we need to build this custom system.

References

For a full list of commands, see Windows Speech Recognition commands.

Get permissions to accept input from a microphone and run speech recognition on that input

Before we can start using custom speech recognition, several permissions and capabilities must be set.

In Visual Studio, with the calculator project loaded, open the Package.appxmanifest file and then select Capabilities. Turn on the Microphone capability.

Setting the microphone capability.

Setting this capability gives the app access to the microphone's audio feed. Save and close the manifest file.
That is all the app itself needs, but it is not all that is required for speech recognition to work. The user must also enable both the microphone and speech recognition for the app, and speech recognition is disabled by default. During testing the developer is also the user, so type "privacy settings" into the Windows search bar.

Setting the privacy settings.

Select Speech and make sure Online speech recognition is turned on. Select Microphone and make sure Allow apps to access your microphone is turned on. Close or minimize the settings window.

We'll turn these settings off later, simply to test that our app handles those situations correctly.

Add code to match words and phrases to UI elements

Supporting a custom speech recognition system takes a lot of code, so let's start with the using statements and global variables.

Add the following using statements to the top of your code.

using Windows.Media.SpeechRecognition;
using Windows.Media.Capture;

Add the following global variables and a new enumeration.

        enum eElements
        {
            Button,
            ToggleSwitch,
            Unknown
        }

        bool isRecognitionAvailable;
        SpeechRecognizer speechRecognizer;

To handle the permission issues described earlier, add the following class to your code. A call to RequestMicrophonePermission checks whether the app can use the microphone and reports any problem to the user. This is generic code that you can use in any app you develop for Windows 10 that needs microphone access for speech recognition, although it does not handle the Cortana or dictation privacy check.

        public class AudioCapturePermissions
        {
            // If no microphone is present, an exception is thrown with the following HResult value.
            private static readonly int NoCaptureDevicesHResult = -1072845856;

            /// <summary>
            ///  Note that this method only checks the Settings->Privacy->Microphone setting, it does not handle
            /// the Cortana/Dictation privacy check.
            /// </summary>
            /// <returns>True, if the microphone is available.</returns>
            public async static Task<bool> RequestMicrophonePermission()
            {
                try
                {
                    // Request access to the audio capture device.
                    var settings = new MediaCaptureInitializationSettings
                    {
                        StreamingCaptureMode = StreamingCaptureMode.Audio,
                        MediaCategory = MediaCategory.Speech,
                    };
                    var capture = new MediaCapture();

                    await capture.InitializeAsync(settings);
                }
                catch (TypeLoadException)
                {
                    // Thrown when a media player is not available.
                    var messageDialog = new Windows.UI.Popups.MessageDialog("Media player components are unavailable.");
                    await messageDialog.ShowAsync();
                    return false;
                }
                catch (UnauthorizedAccessException)
                {
                    // Thrown when permission to use the audio capture device is denied.
                    var messageDialog = new Windows.UI.Popups.MessageDialog("Permission to use the audio capture device is denied.");
                    await messageDialog.ShowAsync();
                    return false;
                }
                catch (Exception exception)
                {
                    // Thrown when an audio capture device is not present.
                    if (exception.HResult == NoCaptureDevicesHResult)
                    {
                        var messageDialog = new Windows.UI.Popups.MessageDialog("No Audio Capture devices are present on this system.");
                        await messageDialog.ShowAsync();
                        return false;
                    }
                    else
                    {
                        throw;
                    }
                }
                return true;
            }
        }

It's a good idea to have a setting that turns the speech recognition features on and off. Define another toggle switch in the MainPage.xaml file. Notice that we set the keyboard accelerator L (for "listener") to activate the toggle switch. Add this code just before the ListConstants entry.

        <ToggleSwitch x:Name="ToggleSpeechRecognition"
            Margin="685,409,0,0"
            HorizontalAlignment="Left"
            VerticalAlignment="Top"
            Header="Speech recognition"
            IsOn="False"
            Toggled="ToggleSpeechRecognition_Toggled">
            <ToggleSwitch.KeyboardAccelerators>
                <KeyboardAccelerator Key="L" Modifiers="None" />
            </ToggleSwitch.KeyboardAccelerators>
        </ToggleSwitch>

Now define the ToggleSpeechRecognition_Toggled event handler named in the XAML file, along with some helper methods, back in the MainPage.xaml.cs file.

        private async Task InitSpeechRecognition()
        {
            isRecognitionAvailable = await AudioCapturePermissions.RequestMicrophonePermission();

            if (isRecognitionAvailable)
            {
                // Create an instance of SpeechRecognizer.
                speechRecognizer = new SpeechRecognizer();

                // Compile the dictation grammar by default.
                await speechRecognizer.CompileConstraintsAsync();

                speechRecognizer.UIOptions.ShowConfirmation = true;
            }
            else
            {
                ToggleSpeechRecognition.IsOn = false;
                isRecognitionAvailable = false;
            }
        }

        private async void ToggleSpeechRecognition_Toggled(object sender, RoutedEventArgs e)
        {
            if (ToggleSpeechRecognition.IsOn)
            {
                await InitSpeechRecognition();
                await StartListening();
            }
            else
            {
                isRecognitionAvailable = false;
            }
        }

        private async Task StartListening()
        {
            if (isRecognitionAvailable)
            {
                try
                {
                    // Start recognition.
                    var speechRecognitionResult = await speechRecognizer.RecognizeWithUIAsync();
                    ParseSpokenCalculationAsync(speechRecognitionResult.Text);

                    // Turn off the Toggle each time.
                    ToggleSpeechRecognition.IsOn = false;
                }
                catch (Exception ex)
                {
                    var messageDialog = new Windows.UI.Popups.MessageDialog(ex.Message);
                    await messageDialog.ShowAsync();
                    ToggleSpeechRecognition.IsOn = false;
                    isRecognitionAvailable = false;
                }
            }
        }

The ParseSpokenCalculationAsync method is given a speech-recognized string as input. To process this string, we need to add a large chunk of app-specific code.

This code takes the spoken sentence and tries to match the words and phrases in it to buttons, toggle switches, or constants in our app. Words that do not match anything are skipped. The following code takes a brute-force approach to the problem.

Paste the following code into your app.

        private bool FindConstantFromSpeech(string spokenText, ref string value)
        {
            bool isLocated = false;
            int n = 0;
            string[] nameValue;

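            // Remove the word "constant" from the start of the spoken text.
            // If nothing follows it, there is no constant name to look up.
            int firstSpace = spokenText.IndexOf(' ');
            if (firstSpace < 0)
            {
                return false;
            }
            spokenText = spokenText.Remove(0, firstSpace).Trim();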

            while (n < ListConstants.Items.Count && !isLocated)
            {
                nameValue = ListConstants.Items[n].ToString().Split('=');

                if (spokenText == nameValue[0].Trim().ToLower())
                {
                    value = nameValue[1].Trim();
                    isLocated = true;
                }
                else
                {
                    ++n;
                }
            }
            return isLocated;
        }

        private async void SayCurrentCalculationAsync()
        {
            if (TextDisplay.Text.Length == 0)
            {
                await SayAsync("The current calculation is empty.");
            }
            else
            {
                await SayAsync($"The current calculation is: {TextDisplay.Text}.");
            }
        }

        private async void ParseSpokenCalculationAsync(string spokenText)
        {
            spokenText = spokenText.ToLower().Trim();
            if (spokenText.Length == 0)
            {
                return;
            }

            // First check for specific control phrases.
            if (spokenText == "say memory")
            {
                await SayAsync($"The current memory is: {TextMemory.Text}.");
            }
            else if (spokenText == "say calculation")
            {
                SayCurrentCalculationAsync();
            }
            else if (spokenText.StartsWith("const"))
            {
                string value = "";
                if (FindConstantFromSpeech(spokenText, ref value))
                {
                    MathEntry(value, "Number");
                    SayCurrentCalculationAsync();
                }
                else
                {
                    await SayAsync("Sorry, I did not recognize that constant.");
                }
            }
            else
            {
                // Ensure + is a word in its own right.
                // Sometimes the speech recognizer will enter "+N" and we need "+ N".
                spokenText = spokenText.Replace("+", "+ ");
                spokenText = spokenText.Replace("  ", " ");

                double d;
                string[] words = spokenText.Split(' ');
                int w = 0;
                ToggleSwitch ts;
                object obj;
                var eType = eElements.Unknown;

                while (w < words.Length)
                {
                    try
                    {
                        // Is the word a number?
                        d = double.Parse(words[w]);
                        MathEntry(d.ToString(), "Number");
                    }
                    catch
                    {
                        try
                        {
                            // Is the word a ratio?
                            string[] ratio = words[w].Split('/');
                            d = double.Parse(ratio[0]) / double.Parse(ratio[1]);
                            MathEntry(d.ToString(), "Number");
                        }
                        catch
                        {
                            // Check if a word or phrase refers to a button; test phrases up to 4 words long.
                            // There are only buttons in GridButtons, so no need to test for anything else.
                            obj = FindElementFromString(GridButtons.Children, words, w, 4, ref w, ref eType);
                            if (obj != null)
                            {
                                Button_Click(obj, null);
                            }
                            else
                            {
                                // Controls can be up to three words in our app.
                                obj = FindElementFromString(GridCalculator.Children, words, w, 3, ref w, ref eType);
                                if (obj != null)
                                {
                                    switch (eType)
                                    {
                                        case eElements.Button:
                                            Button_Click(obj, null);
                                            break;

                                        case eElements.ToggleSwitch:
                                            ts = (ToggleSwitch)obj;
                                            ts.IsOn = !ts.IsOn;
                                            break;

                                        default:
                                            break;
                                    }
                                }
                            }
                        }
                    }
                    ++w;
                }
                if (mode != Emode.CalculateDone)
                {
                    SayCurrentCalculationAsync();
                }
            }
        }

        private bool IsMatchingElementText(eElements elementType, object obj, string textToMatch)
        {
            string name = "";
            string accessibleName = "";

            switch (elementType)
            {
                case eElements.Button:
                    var b = (Button)obj;
                    name = b.Content.ToString().ToLower();
                    accessibleName = b.GetValue(AutomationProperties.NameProperty).ToString().ToLower();
                    break;

                case eElements.ToggleSwitch:
                    var ts = (ToggleSwitch)obj;
                    name = ts.Header.ToString().ToLower();
                    accessibleName = ts.GetValue(AutomationProperties.NameProperty).ToString().ToLower();
                    break;
            }

            // Return true if the name or accessibleName matches the spoken text.
            if ((textToMatch == name && name.Length > 0) || (textToMatch == accessibleName && accessibleName.Length > 0))
            {
                return true;
            }

            return false;
        }

        private object FindElementFromString(UIElementCollection elements, string[] words, int startIndex, int maxConcatenatedWords, ref int updatedIndex, ref eElements elementType)
        {
            // Return the first Button or ToggleSwitch whose text matches the spoken words, or null if none does.
            int n;
            Button b;
            ToggleSwitch ts;

            // Longer phrases take precedence over shorter ones, so start with the longest allowed and work down.
            for (int c = maxConcatenatedWords; c > 0; c--)
            {
                if (startIndex + c - 1 < words.Length)
                {
                    // Build the phrase from the following words.
                    string txt = words[startIndex];
                    for (n = 1; n < c; n++)
                    {
                        txt += " " + words[startIndex + n];
                    }

                    // Test the word or phrase against the content/tag/name of each button.
                    for (int i = 0; i < elements.Count; i++)
                    {
                        // Is the UI element a button?
                        try
                        {
                            b = (Button)elements[i];
                            if (IsMatchingElementText(eElements.Button, (object)b, txt))
                            {
                                updatedIndex = startIndex + c - 1;
                                elementType = eElements.Button;
                                return (object)b;
                            }
                        }
                        catch
                        {
                            // UI element is not a button, is it a ToggleSwitch?
                            try
                            {
                                ts = (ToggleSwitch)elements[i];
                                if (IsMatchingElementText(eElements.ToggleSwitch, (object)ts, txt))
                                {
                                    updatedIndex = startIndex + c - 1;
                                    elementType = eElements.ToggleSwitch;
                                    return (object)ts;
                                }
                            }
                            catch
                            {
                                // Ignore the UI element.
                            }
                        }
                    }
                }
            }
            updatedIndex = startIndex;
            return null;
        }
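
The longest-phrase-first search in FindElementFromString can be easier to follow without the UI types around it. The following is a minimal, standalone sketch of the same idea, not part of the calculator app: the PhraseMatcher class, its Match method, and the set of names are hypothetical stand-ins for the text that the real code reads from buttons and toggle switches.

```csharp
using System;
using System.Collections.Generic;

public static class PhraseMatcher
{
    // Returns the known names found in the spoken text, in order.
    // Words that match nothing are skipped.
    public static List<string> Match(string spokenText, ISet<string> knownNames, int maxWords)
    {
        var matches = new List<string>();
        string[] words = spokenText.ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        int w = 0;
        while (w < words.Length)
        {
            int consumed = 1; // An unmatched word is skipped on its own.

            // Longer phrases take precedence, so try the longest candidate first.
            for (int c = Math.Min(maxWords, words.Length - w); c > 0; c--)
            {
                string phrase = string.Join(" ", words, w, c);
                if (knownNames.Contains(phrase))
                {
                    matches.Add(phrase);
                    consumed = c; // Skip past every word in the matched phrase.
                    break;
                }
            }
            w += consumed;
        }
        return matches;
    }

    public static void Main()
    {
        // Hypothetical names; the real app takes them from its UI elements.
        var names = new HashSet<string> { "sine", "cosine", "memory store", "equals" };
        List<string> result = Match("what is the sine of memory store equals", names, 4);
        Console.WriteLine(string.Join(", ", result)); // prints: sine, memory store, equals
    }
}
```

Because the two-word name "memory store" is tested before the single word "memory", a multi-word control name is never split into unrelated shorter matches.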

Note:

To specify a constant, say "constant" and then the full name of the constant. "Say memory" announces the contents of memory. "Say calculation" announces the contents of the current calculation.
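
As an illustration of how a spoken constant is resolved, here is a small standalone sketch of the lookup, assuming the list entries are strings of the form "name = value". The ConstantLookup class, the TryFindConstant method, and the sample entries are hypothetical and are not part of the app.

```csharp
using System;
using System.Collections.Generic;

public static class ConstantLookup
{
    // Drops the leading word ("constant") and looks up the remaining name
    // among entries written as "name = value".
    public static bool TryFindConstant(string spokenText, IEnumerable<string> entries, out string value)
    {
        value = string.Empty;
        spokenText = spokenText.ToLowerInvariant().Trim();

        int firstSpace = spokenText.IndexOf(' ');
        if (firstSpace < 0)
        {
            return false; // Only one word was spoken, so there is no name to look up.
        }
        string spokenName = spokenText.Substring(firstSpace + 1).Trim();

        foreach (string entry in entries)
        {
            string[] nameValue = entry.Split('=');
            if (nameValue.Length == 2 && nameValue[0].Trim().ToLowerInvariant() == spokenName)
            {
                value = nameValue[1].Trim();
                return true;
            }
        }
        return false;
    }

    public static void Main()
    {
        // Hypothetical entries standing in for the items of the constants list.
        var entries = new List<string> { "Pi = 3.14159", "Speed of light = 299792458" };

        if (TryFindConstant("constant speed of light", entries, out string value))
        {
            Console.WriteLine(value); // prints: 299792458
        }
    }
}
```

The name must be spoken in full because the comparison is an exact match on the whole name, not a prefix or fuzzy match.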

Build and run the app, and turn on the speech recognition toggle switch.
With your microphone ready, say "What is 1.23456 times 2789?". What you say should appear in a dialog titled Listening, which closes quickly when you stop speaking. The calculation then appears in the display.

Speaking a natural addition.

Note:

If the Listening dialog shows "Sorry, didn't catch that", press the spacebar to bring the listener up again.

Try entering a range of simple calculations by voice.

Note:

To clear a calculation that has gone wrong, just press the L key and say "clear".

Calculations can be built up in pieces, since a slight pause is enough for the listener to close and the input to be parsed. For example, say "What is the sine of 30 times". Then press the L key and, when the listener appears, say "the cosine of 30". Press L again and say "equals". You should get the result.
Note that irrelevant words such as "what is", "of", and "the" can be included in the sentence and are correctly skipped.
Try building equations (it doesn't matter if they make no mathematical sense) that exercise every button and toggle switch, including the Clr and Del buttons, the memory store buttons, the constants, and all the others. That way you can be confident the code handles all of them correctly.
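
To see how a sentence separates into numbers and other words, the following standalone sketch classifies each word as a number, a ratio, or something else. It is an alternative formulation rather than the app's code: it uses double.TryParse instead of the nested try/catch blocks, assumes the invariant culture (the app's double.Parse calls use the current culture), and rejects a zero denominator. The SpokenNumber class and TryParseNumberOrRatio method are hypothetical names.

```csharp
using System;
using System.Globalization;

public static class SpokenNumber
{
    // Returns true if the word is a plain number ("1.5") or a ratio ("3/4").
    public static bool TryParseNumberOrRatio(string word, out double result)
    {
        // Is the word a number?
        if (double.TryParse(word, NumberStyles.Float, CultureInfo.InvariantCulture, out result))
        {
            return true;
        }

        // Is the word a ratio?
        string[] ratio = word.Split('/');
        if (ratio.Length == 2
            && double.TryParse(ratio[0], NumberStyles.Float, CultureInfo.InvariantCulture, out double numerator)
            && double.TryParse(ratio[1], NumberStyles.Float, CultureInfo.InvariantCulture, out double denominator)
            && denominator != 0)
        {
            result = numerator / denominator;
            return true;
        }

        result = 0;
        return false;
    }

    public static void Main()
    {
        foreach (string word in "what is 1.5 times 3/4".Split(' '))
        {
            if (TryParseNumberOrRatio(word, out double number))
            {
                Console.WriteLine($"{word}: number {number.ToString(CultureInfo.InvariantCulture)}");
            }
            else
            {
                // In the app, such a word is next tested against control names, then skipped.
                Console.WriteLine($"{word}: not a number");
            }
        }
    }
}
```

Avoiding exceptions for ordinary words keeps the common path cheap, since most words in a spoken sentence are not numbers.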
