Hi everyone,
I have been trying to set up an API in Python to do real-time speech recognition.
Context
I am using the azure.cognitiveservices.speech library. I followed the tutorial here, using continuous recognition.
Basically, my API receives an audio stream over a WebSocket connection. Each received buffer is pushed into a stream (azure.cognitiveservices.speech.audio.PushAudioInputStream), which is then used as the input of the AudioConfig.
Problem
The code seems to be working: the recognizer starts and the events fire. However, the transcript I get back is very inaccurate, and I don't know why. The received stream is PCM, 32 bits per sample, 1 channel, 48 kHz. I checked the integrity of the data by saving it to an audio file: nothing to report, the audio is perfectly audible and of good quality.
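For reference, this is roughly how I dumped the received buffers to a file to check them (a minimal sketch: the save_debug_wav helper and the way the frames are collected are simplified for illustration, but the parameters match the format described above):
import wave

def save_debug_wav(path, frames, sample_rate=48000, sample_width_bytes=4, channels=1):
    # Write the raw PCM buffers into a WAV container so the audio can be played back.
    # sample_width_bytes=4 corresponds to the 32-bit samples mentioned above.
    with wave.open(path, 'wb') as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width_bytes)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b''.join(frames))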
Code
This is the main code that I am using. The Transcription class is instantiated when the WebSocket connection opens, and the audio data is then pushed in a loop via the push_data() method (a simplified usage sketch follows the class).
import os
import azure.cognitiveservices.speech as speechsdk

speech_key = os.environ.get('SPEECH_KEY')
speech_region = os.environ.get('SPEECH_REGION')
language = "fr-FR"

class Transcription:
    def __init__(self):
        speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=speech_region)
        speech_config.speech_recognition_language = language

        # Set up the audio stream: 48 kHz, 32 bits per sample, mono, matching the incoming PCM data
        audio_format = speechsdk.audio.AudioStreamFormat(samples_per_second=48000, bits_per_sample=32, channels=1)
        self.stream = speechsdk.audio.PushAudioInputStream(audio_format)
        audio_config = speechsdk.audio.AudioConfig(stream=self.stream)

        # Instantiate the speech recognizer with push stream input
        self.speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

        # Connect callbacks to the events fired by the speech recognizer
        self.speech_recognizer.recognizing.connect(lambda evt: print('RECOGNIZING...: {}'.format(evt)))
        self.speech_recognizer.recognized.connect(lambda evt: print('RECOGNIZED: {}'.format(evt)))
        self.speech_recognizer.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
        self.speech_recognizer.session_stopped.connect(self.stop)
        self.speech_recognizer.canceled.connect(self.stop)

    def start(self):
        self.speech_recognizer.start_continuous_recognition()

    def stop(self, evt="forced stop"):
        print('CLOSING on {}'.format(evt))
        self.stream.close()
        self.speech_recognizer.stop_continuous_recognition()

    def push_data(self, frames):
        self.stream.write(frames)
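And this is roughly how the class is driven from the WebSocket side (a minimal sketch assuming the websockets package; the handler name, port and message handling are simplified compared to my real server code):
import asyncio
import websockets

async def handle_connection(websocket):
    # One Transcription instance per WebSocket connection
    transcription = Transcription()
    transcription.start()
    try:
        # Each incoming binary message is a chunk of raw PCM audio
        async for message in websocket:
            transcription.push_data(message)
    finally:
        transcription.stop()

async def main():
    async with websockets.serve(handle_connection, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())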
Call for help
Do you have any idea where this problem comes from? I would gladly take any suggestions to help me work through this issue.
Thank you in advance.