Use UDP channel as input for ConversationTranscriber

Question

Use UDP channel as input for ConversationTranscriber

NB 0

I have an audio stream accessible through UDP. I want to transcribe this audio using the ConversationTranscriber service from Azure Speech in real time (or as close to real time as possible).

I have implemented a custom PullAudioInputStreamCallback to read from one of the channels (it's a 2 channel stream) and return the data to be processed. I provide the code bellow.

When testing the code using the mic as input source, the system works well. Nevertheless, with the PullAudioStream as input, the system starts running but after some seconds the script stops without throwing any error or raising any exception. Is there a more appropriate way to connect the UDP source to the ConversationTranscriber service?

import socket
import time
import numpy as np
import azure.cognitiveservices.speech as speechsdk

key_ = <MY_KEY>
region_ = <MY_REGION>

class MyAudioStream(speechsdk.audio.PullAudioInputStreamCallback):
	def __init__(self, socket, channel):
		super().__init__()
		print("Init")
		self.socket = socket
		self.channel = channel  # Channel 0 for left, 1 for right
		self.buffer = bytearray()

	def read(self, buffer_size):
		# Fetch new data from socket if buffer is insufficient
		while len(self.buffer) < buffer_size * 2:  # x2 because stereo data
			data, _ = self.socket.recvfrom(buffer_size * 2)
			self.buffer.extend(data)

		# Extract the requested buffer size * 2 (to handle stereo data)
		data = self.buffer[:buffer_size * 2]
		del self.buffer[:buffer_size * 2]

		# Assuming 16-bit samples, split the stereo into mono
		audio_data = np.frombuffer(data, dtype=np.int16)  # Read data as 16-bit samples
		mono_data = audio_data[self.channel::2]  # Extract one channel
		# print(audio_data)
		return mono_data.tobytes()

	def close(self):
		print("Closed")
		self.socket.close()
        
def conversation_transcriber_recognition_canceled_cb(evt: speechsdk.SessionEventArgs):
    print('Canceled event')

def conversation_transcriber_session_stopped_cb(evt: speechsdk.SessionEventArgs):
    print('SessionStopped event')

def conversation_transcriber_transcribed_cb(evt: speechsdk.SpeechRecognitionEventArgs, file_handle):
	if evt.result.reason == speechsdk.ResultReason.RecognizedSpeech:
		s_id, text = evt.result.speaker_id, evt.result.text
		file_handle.write('Speaker ID={}: {}\n'.format(s_id, text))
		print('Speaker ID={}: {}\n'.format(s_id, text))
	elif evt.result.reason == speechsdk.ResultReason.NoMatch:
		file_handle.write('NOMATCH: Speech could not be TRANSCRIBED: {}\n'.format(evt.result.no_match_details))


def conversation_transcriber_session_started_cb(evt: speechsdk.SessionEventArgs):
    print('SessionStarted event')


def create_transcriber(sock, sample_rate, channel):
    
	speech_config = speechsdk.SpeechConfig(subscription=key_, region=region_)
	speech_config.speech_recognition_language="en-US"
	# Set the expected audio format (monaural)
	audio_format = speechsdk.audio.AudioStreamFormat(samples_per_second=sample_rate, bits_per_sample=16, channels=1)
	# Create an instance of PullAudioInputStream with the custom stream callback
	stream_callback = MyAudioStream(sock, channel)
	audio_input = speechsdk.audio.PullAudioInputStream(stream_callback, audio_format)
	audio_config = speechsdk.audio.AudioConfig(stream=audio_input)
	conversation_transcriber = speechsdk.transcription.ConversationTranscriber(speech_config=speech_config, audio_config=audio_config)

	return conversation_transcriber	

def main():
	# Setup the UDP socket common for both channels
	UDP_IP = "localhost"
	UDP_PORT = <MY_PORT>
	sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
	sock.bind((UDP_IP, UDP_PORT))
      
	transcribing_stop = False
	conversation_transcriber = create_transcriber(sock, 16000, 0)

	def stop_cb(evt: speechsdk.SessionEventArgs):
		#"""callback that signals to stop continuous recognition upon receiving an event `evt`"""
		print('CLOSING on {}'.format(evt))
		nonlocal transcribing_stop
		transcribing_stop = True

	output_file_path = "test.txt"
	with open(output_file_path, 'w') as file_handle:
		conversation_transcriber.transcribed.connect(lambda evt: conversation_transcriber_transcribed_cb(evt, file_handle))
		conversation_transcriber.session_started.connect(conversation_transcriber_session_started_cb)
		conversation_transcriber.session_stopped.connect(conversation_transcriber_session_stopped_cb)
		conversation_transcriber.canceled.connect(conversation_transcriber_recognition_canceled_cb)
		conversation_transcriber.session_stopped.connect(stop_cb)
		conversation_transcriber.canceled.connect(stop_cb)

		conversation_transcriber.start_transcribing_async()

		try:
			while not transcribing_stop:
				time.sleep(.5)
		except Exception as e:
			print(e)
			conversation_transcriber.stop_transcribing_async()
			conversation_transcriber.close()
			sock.close()

if __name__=="__main__":
      main()

navba-MSFT 27,540 Reputation points Microsoft Employee Moderator

2024-05-15T04:37:23.2333333+00:00
@NB Welcome to Microsoft Q&A Forum, Thank you for posting your query here!

.

The approach you’ve taken to connect the UDP source to the ConversationTranscriber service is generally correct. You’ve created a custom PullAudioInputStreamCallback to read from the UDP channel and feed it into the ConversationTranscriber. This is the recommended way to handle custom audio input sources in the Azure Speech SDK.

.

Sharing some of the recommendations here, which you can try:

Exception Logging: Since there is no exception raised, I would recommend you enable and check the Speech SDK logging as explained here.

Updated SDK: Ensure that you’re using the latest version of the Azure Cognitive Services Speech SDK, as there might have been bug fixes or improvements that could resolve your issue.

Buffer Size: Ensure that the buffer size you’re using in the read method of MyAudioStream matches the buffer size expected by the PullAudioInputStream. If there’s a mismatch, it could potentially cause issues.

Error Handling in read Method: Currently, your read method assumes that the socket will always return the required amount of data. Network issues or other factors could cause less data to be returned, which might cause issues. Consider adding error handling to manage these scenarios.

Thread Safety: If your read method is being called from multiple threads, you need to ensure that the operations on the shared buffer are thread-safe.

Socket Timeout: By default, the socket.recvfrom method blocks until it receives data. If the data source stops sending data, this could cause your script to hang. Consider setting a timeout on the socket to prevent this.

Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.

NB 0

Hi @navba-MSFT ! Thanks a lot for the answer.

I set up the logging system to try to understand better what happens.
My current concern is that the transcription connection seems to shut down before any data is even received.

When I debug the code, I can see it never enters the read method of my custom class MyAudioStream . The sequence goes as it starts the session (it prints "SessionStarted event" from conversation_transcriber_session_started_cb function), then it goes in the while loop but shortly after the script stops while the state of transcribing_stop variable never changed. (Not sure the explanation is very clear... sorry about that).

I did some test using the mic input directly vs the UDP custom class and I can see some differences in the logs of the system.

With the mic as direct input source:

[270095]: 239ms SPX_DBG_TRACE_FUNCTION:  audio_stream_session.cpp:721 CSpxAudioStreamSession::AddRecognizer
[270095]: 239ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:111 CSpxHandleTable::TrackHandle p=0x00000196454C3E20
[270095]: 240ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:121 CSpxHandleTable::TrackHandle class=ISpxRecognizer, h=0x00000196454C3E20, p=0x00000196454C3E20, tot=1
[270095]: 240ms SPX_DBG_TRACE_VERBOSE:  named_properties.h:479 ISpxPropertyBagImpl::SetStringValue: this=0x00000196454ECF78; name='IsConversationTranscriber_V2'; value='true'
[270095]: 240ms SPX_DBG_TRACE_SCOPE_EXIT:  speechapi_c_factory.cpp:659 recognizer_create_conversation_transcriber_from_config
[270095]: 241ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:111 CSpxHandleTable::TrackHandle p=0x00000196454C4368
[270095]: 241ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:121 CSpxHandleTable::TrackHandle class=ISpxNamedProperties, h=0x00000196454C4368, p=0x00000196454C4368, tot=3
[270095]: 242ms SPX_DBG_TRACE_VERBOSE:  named_properties.h:479 ISpxNamedProperties::GetStringValue: this=0x00000196454C4368; name='SPEECH-RecoMode'; value=''
[270095]: 242ms SPX_DBG_TRACE_VERBOSE:  named_properties.h:479 ISpxPropertyBagImpl::SetStringValue: this=0x00000196454ECF78; name='SPEECH-RecoMode'; value='CONVERSATION'
[270095]: 242ms SPX_DBG_TRACE_FUNCTION:  audio_stream_session.cpp:1116 CSpxAudioStreamSession::StartRecognitionAsync
[270095]: 242ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:111 CSpxHandleTable::TrackHandle p=0x00000196454F06F0
[37310]: 242ms SPX_DBG_TRACE_SCOPE_ENTER:  audio_stream_session.cpp:1119 *** CSpxAudioStreamSession::StartRecognitionAsync kicked-off THREAD started ***
[270095]: 243ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:121 CSpxHandleTable::TrackHandle class=CSpxAsyncOp<void>, h=0x00000196454F06F0, p=0x00000196454F06F0, tot=1
[37310]: 243ms SPX_TRACE_INFO:  thread_service.cpp:96 Started thread Background with ID [727428ll]
[727428]: 243ms SPX_DBG_TRACE_SCOPE_ENTER:  audio_stream_session.cpp:1176 CSpxAudioStreamSession::StartRecognizing
[727428]: 243ms SPX_DBG_TRACE_VERBOSE:  audio_stream_session.cpp:1177 [00000196454ECE90]CSpxAudioStreamSession::StartRecognizing
[727428]: 243ms SPX_DBG_TRACE_VERBOSE:  audio_stream_session.cpp:3584 [00000196454ECE90]CSpxAudioStreamSession::TryChangeState: recoKind/sessionState: 0/0 => 4/1
[727428]: 243ms SPX_DBG_TRACE_VERBOSE:  audio_stream_session.cpp:1219 [00000196454ECE90]CSpxAudioStreamSession::StartRecognizing:  Now WaitForPumpSetFormatStart ...
[727428]: 243ms SPX_DBG_TRACE_SCOPE_ENTER:  audio_stream_session.cpp:3163 CSpxAudioStreamSession::StartAudioPump
[727428]: 243ms SPX_DBG_TRACE_VERBOSE:  audio_stream_session.cpp:3164 [00000196454ECE90]CSpxAudioStreamSession::StartAudioPump:  RecognitionKind 4 | m_audioPump [00000196454A17B0]
[727428]: 243ms SPX_DBG_TRACE_VERBOSE:  audio_stream_session.cpp:1446 [00000196454ECE90]CSpxAudioStreamSession::FireSessionStartedEvent: ...
[727428]: 243ms SPX_DBG_TRACE_VERBOSE:  audio_stream_session.cpp:1457 [00000196454ECE90]CSpxAudioStreamSession::FireSessionStartedEvent: Firing SessionStarted event: SessionId: b90b4b4473094963b8057b738a7d33f2
[727428]: 244ms SPX_TRACE_INFO:  thread_service.cpp:96 Started thread User with ID [371217ll]
[727428]: 244ms SPX_DBG_TRACE_VERBOSE:  named_properties.h:479 ISpxNamedProperties::GetStringValue: this=0x00000196454ECF78; name='IsConversationTranscriber_V2'; value='true'
[727428]: 244ms SPX_DBG_TRACE_VERBOSE:  named_properties.h:479 ISpxNamedProperties::GetStringValue: this=0x00000196454ECF78; name='SPEECH-MaxBufferSizeMs'; value='66000'
[727428]: 244ms SPX_DBG_TRACE_VERBOSE:  named_properties.h:479 ISpxNamedProperties::GetStringValue: this=0x00000196454ECF78; name='SPEECH-BufferSizePercentSwitchToLowRate'; value='50'
[727428]: 244ms SPX_DBG_TRACE_VERBOSE:  named_properties.h:479 ISpxNamedProperties::GetStringValue: this=0x00000196454ECF78; name='SPEECH-BufferSizePercentSwitchToHighRate'; value='9'
[727428]: 244ms SPX_DBG_TRACE_VERBOSE:  audio_stream_session_throttle_logic.cpp:104 [00000196454BE480] Is VAD Gating = 0, Is RNNT reco engine = 0, Is conversation transcriber = 0, Is meeting transcriber = 0, Is async transcriber  = 0, fastLane = 5000 msec (160000 bytes), maxDuration = 66000 msec, highDuration = 33000 msec, lowDuration = 5940 msec, lowRate = 100%, highRate = 200%
[371217]: 244ms SPX_DBG_TRACE_SCOPE_ENTER:  audio_stream_session.cpp:1724 DispatchEvent task started...
[727428]: 244ms SPX_DBG_TRACE_FUNCTION:  audio_buffer.cpp:92 PcmAudioBuffer::NewTurn

With the UDP custom class as input source:

[534528]: 70ms SPX_DBG_TRACE_FUNCTION:  audio_stream_session.cpp:721 CSpxAudioStreamSession::AddRecognizer
[534528]: 70ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:111 CSpxHandleTable::TrackHandle p=0x00000190CD82A410
[534528]: 70ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:121 CSpxHandleTable::TrackHandle class=ISpxRecognizer, h=0x00000190CD82A410, p=0x00000190CD82A410, tot=1
[534528]: 70ms SPX_DBG_TRACE_VERBOSE:  named_properties.h:479 ISpxPropertyBagImpl::SetStringValue: this=0x00000190CD7E50C8; name='IsConversationTranscriber_V2'; value='true'
[534528]: 70ms SPX_DBG_TRACE_SCOPE_EXIT:  speechapi_c_factory.cpp:659 recognizer_create_conversation_transcriber_from_config
[534528]: 71ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:111 CSpxHandleTable::TrackHandle p=0x00000190CD82A958
[534528]: 71ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:121 CSpxHandleTable::TrackHandle class=ISpxNamedProperties, h=0x00000190CD82A958, p=0x00000190CD82A958, tot=3
[534528]: 71ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:180 CSpxHandleTable::StopTracking(h) h=0x00000190CD013498
[534528]: 72ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:195 CSpxHandleTable::StopTracking(h) class=ISpxSpeechConfig, h=0x00000190CD013498, p=0x00000190CD013498, tot=0
[534528]: 72ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:180 CSpxHandleTable::StopTracking(h) h=0x00000190CD013428
[534528]: 72ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:195 CSpxHandleTable::StopTracking(h) class=ISpxNamedProperties, h=0x00000190CD013428, p=0x00000190CD013428, tot=2
[534528]: 72ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:180 CSpxHandleTable::StopTracking(h) h=0x00000190CD68D730
[534528]: 72ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:195 CSpxHandleTable::StopTracking(h) class=SPXWAVEFORMATEX, h=0x00000190CD68D730, p=0x00000190CD68D730, tot=0
[534528]: 72ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:180 CSpxHandleTable::StopTracking(h) h=0x00000190CD81B540
[534528]: 72ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:195 CSpxHandleTable::StopTracking(h) class=ISpxAudioStream, h=0x00000190CD81B540, p=0x00000190CD81B540, tot=0
[534528]: 72ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:180 CSpxHandleTable::StopTracking(h) h=0x00000190CD7F8AF8
[534528]: 72ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:195 CSpxHandleTable::StopTracking(h) class=ISpxAudioConfig, h=0x00000190CD7F8AF8, p=0x00000190CD7F8AF8, tot=1
[534528]: 72ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:180 CSpxHandleTable::StopTracking(h) h=0x00000190CD7F8A88
[534528]: 72ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:195 CSpxHandleTable::StopTracking(h) class=ISpxNamedProperties, h=0x00000190CD7F8A88, p=0x00000190CD7F8A88, tot=1
[534528]: 73ms SPX_DBG_TRACE_VERBOSE:  named_properties.h:479 ISpxNamedProperties::GetStringValue: this=0x00000190CD82A958; name='SPEECH-RecoMode'; value=''
[534528]: 73ms SPX_DBG_TRACE_VERBOSE:  named_properties.h:479 ISpxPropertyBagImpl::SetStringValue: this=0x00000190CD7E50C8; name='SPEECH-RecoMode'; value='CONVERSATION'
[534528]: 73ms SPX_DBG_TRACE_FUNCTION:  audio_stream_session.cpp:1116 CSpxAudioStreamSession::StartRecognitionAsync
[534528]: 73ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:111 CSpxHandleTable::TrackHandle p=0x00000190CF4D5010
[534528]: 73ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:121 CSpxHandleTable::TrackHandle class=CSpxAsyncOp<void>, h=0x00000190CF4D5010, p=0x00000190CF4D5010, tot=1
[545392]: 73ms SPX_DBG_TRACE_SCOPE_ENTER:  audio_stream_session.cpp:1119 *** CSpxAudioStreamSession::StartRecognitionAsync kicked-off THREAD started ***
[545392]: 74ms SPX_TRACE_INFO:  thread_service.cpp:96 Started thread Background with ID [5370ll]
[534528]: 74ms SPX_DBG_TRACE_FUNCTION:  audio_stream_session.cpp:752 CSpxAudioStreamSession::SetDisposing
[534528]: 74ms SPX_DBG_TRACE_FUNCTION:  audio_stream_session.cpp:1138 CSpxAudioStreamSession::StopRecognitionAsync
[534528]: 74ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:111 CSpxHandleTable::TrackHandle p=0x00000190CF4D5550
[534528]: 74ms SPX_DBG_TRACE_VERBOSE:  handle_table.h:121 CSpxHandleTable::TrackHandle class=CSpxAsyncOp<void>, h=0x00000190CF4D5550, p=0x00000190CF4D5550, tot=2
[5370]: 75ms SPX_DBG_TRACE_SCOPE_ENTER:  audio_stream_session.cpp:1176 CSpxAudioStreamSession::StartRecognizing
[5370]: 75ms SPX_DBG_TRACE_VERBOSE:  audio_stream_session.cpp:1177 [00000190CD7E4FE0]CSpxAudioStreamSession::StartRecognizing
[5370]: 75ms SPX_DBG_TRACE_VERBOSE:  audio_stream_session.cpp:3584 [00000190CD7E4FE0]CSpxAudioStreamSession::TryChangeState: recoKind/sessionState: 0/0 => 4/1
[5370]: 75ms SPX_DBG_TRACE_VERBOSE:  audio_stream_session.cpp:1219 [00000190CD7E4FE0]CSpxAudioStreamSession::StartRecognizing:  Now WaitForPumpSetFormatStart ...
[5370]: 75ms SPX_DBG_TRACE_SCOPE_ENTER:  audio_stream_session.cpp:3163 CSpxAudioStreamSession::StartAudioPump
[5370]: 75ms SPX_DBG_TRACE_VERBOSE:  audio_stream_session.cpp:3164 [00000190CD7E4FE0]CSpxAudioStreamSession::StartAudioPump:  RecognitionKind 4 | m_audioPump [00000190CD81A8A0]
[5370]: 75ms SPX_DBG_TRACE_VERBOSE:  audio_stream_session.cpp:1446 [00000190CD7E4FE0]CSpxAudioStreamSession::FireSessionStartedEvent: ...
[5370]: 75ms SPX_DBG_TRACE_VERBOSE:  audio_stream_session.cpp:1457 [00000190CD7E4FE0]CSpxAudioStreamSession::FireSessionStartedEvent: Firing SessionStarted event: SessionId: e4e22be08156458db6b302cd1e2fa3e7
[5370]: 76ms SPX_DBG_TRACE_VERBOSE:  audio_stream_session.cpp:1704 [00000190CD7E4FE0]CSpxAudioStreamSession::FireEvent, recognizer is disposing, ignore events
[5370]: 76ms SPX_DBG_TRACE_VERBOSE:  named_properties.h:479 ISpxNamedProperties::GetStringValue: this=0x00000190CD7E50C8; name='IsConversationTranscriber_V2'; value='true'
[5370]: 76ms SPX_DBG_TRACE_VERBOSE:  named_properties.h:479 ISpxNamedProperties::GetStringValue: this=0x00000190CD7E50C8; name='SPEECH-MaxBufferSizeMs'; value='66000'
[5370]: 76ms SPX_DBG_TRACE_VERBOSE:  named_properties.h:479 ISpxNamedProperties::GetStringValue: this=0x00000190CD7E50C8; name='SPEECH-BufferSizePercentSwitchToLowRate'; value='50'
[5370]: 76ms SPX_DBG_TRACE_VERBOSE:  named_properties.h:479 ISpxNamedProperties::GetStringValue: this=0x00000190CD7E50C8; name='SPEECH-BufferSizePercentSwitchToHighRate'; value='9'
[5370]: 76ms SPX_DBG_TRACE_VERBOSE:  audio_stream_session_throttle_logic.cpp:104 [00000190CF4D65A0] Is VAD Gating = 0, Is RNNT reco engine = 0, Is conversation transcriber = 0, Is meeting transcriber = 0, Is async transcriber  = 0, fastLane = 5000 msec (10240 bytes), maxDuration = 66000 msec, highDuration = 33000 msec, lowDuration = 5940 msec, lowRate = 100%, highRate = 200%
[5370]: 76ms SPX_DBG_TRACE_FUNCTION:  audio_buffer.cpp:92 PcmAudioBuffer::NewTurn

From the logs, I can see that in the second case, there is an "StopAsyncRecognition" event just after it starts it. This doesn't happen on the mic case although the code is the same, just changing the audio config to alter the source.

Is there anything else I should add / modify to be able to use the streaming channel as source?
Just for completion, I'm using v1.37.0 of azure-cognitiveservices-speech SDK.

Thanks for the help :)

navba-MSFT 27,540 Reputation points Microsoft Employee Moderator

2024-05-15T09:47:30.8633333+00:00
@NB Thanks for your reply. The StopRecognitionAsync method is being called immediately after StartRecognitionAsync. This could be due to a few reasons:

Data Availability: The PullAudioInputStreamCallback might be trying to read data before it’s available. This could cause the stream to close prematurely. You could add a check in your read method to ensure data is available before trying to read it.

Socket Closure: If the socket is closed or loses connection, it could cause the stream to stop. Ensure that the socket remains open and connected for the duration of the audio streaming.

Thread Safety: If your read method is being called from multiple threads, there could be race conditions affecting the shared buffer variable. Consider adding thread synchronization mechanisms (like locks) to ensure thread safety.

Hope this helps.

Your answer

navba-MSFT 27,540 Reputation points Microsoft Employee Moderator

2024-05-15T04:37:23.2333333+00:00

@NB Welcome to Microsoft Q&A Forum, Thank you for posting your query here!

.

The approach you’ve taken to connect the UDP source to the ConversationTranscriber service is generally correct. You’ve created a custom PullAudioInputStreamCallback to read from the UDP channel and feed it into the ConversationTranscriber. This is the recommended way to handle custom audio input sources in the Azure Speech SDK.

.

Sharing some of the recommendations here, which you can try:

Exception Logging: Since there is no exception raised, I would recommend you enable and check the Speech SDK logging as explained here.

Updated SDK: Ensure that you’re using the latest version of the Azure Cognitive Services Speech SDK, as there might have been bug fixes or improvements that could resolve your issue.

Buffer Size: Ensure that the buffer size you’re using in the read method of MyAudioStream matches the buffer size expected by the PullAudioInputStream. If there’s a mismatch, it could potentially cause issues.

Error Handling in read Method: Currently, your read method assumes that the socket will always return the required amount of data. Network issues or other factors could cause less data to be returned, which might cause issues. Consider adding error handling to manage these scenarios.

Thread Safety: If your read method is being called from multiple threads, you need to ensure that the operations on the shared buffer are thread-safe.

Socket Timeout: By default, the socket.recvfrom method blocks until it receives data. If the data source stops sending data, this could cause your script to hang. Consider setting a timeout on the socket to prevent this.

Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.
navba-MSFT 27,540 Reputation points Microsoft Employee Moderator

2024-05-15T09:47:30.8633333+00:00

@NB Thanks for your reply. The StopRecognitionAsync method is being called immediately after StartRecognitionAsync. This could be due to a few reasons:

Data Availability: The PullAudioInputStreamCallback might be trying to read data before it’s available. This could cause the stream to close prematurely. You could add a check in your read method to ensure data is available before trying to read it.

Socket Closure: If the socket is closed or loses connection, it could cause the stream to stop. Ensure that the socket remains open and connected for the duration of the audio streaming.

Thread Safety: If your read method is being called from multiple threads, there could be race conditions affecting the shared buffer variable. Consider adding thread synchronization mechanisms (like locks) to ensure thread safety.

Hope this helps.

Share via

Use UDP channel as input for ConversationTranscriber

Your answer