How to create an AudioConfig object given a public link to an audio file

Question

Nikhil Kapse 0

I'm currently using Azure's Speech Pronunciation Assessment service, and I'm receiving my input audio via a public URL to an audio file (which is not in Azure Blob storage). Is there any way for me to create an AudioConfig without having to use local or Azure Blob storage? I'm aware I can't pass the URL directly to the AudioConfig constructor, but are there other ways to accomplish something like this?

2 answers

Answer 1

Hi @Nikhil Kapse ,

Thank you for using Microsoft Q&A.

To run Azure's Speech Pronunciation Assessment on audio from a public link without creating an AudioConfig object, you can stream the audio to the Speech REST API by following these steps.

Import the necessary libraries and configure your subscription key and region.

import requests
import base64
import json
import time
import azure.cognitiveservices.speech as speechsdk
subscriptionKey = "YOUR_SUBSCRIPTION_KEY"
region = "YOUR_REGION"
public_audio_url = "https://www.example.com/public_audio.wav"

Initialize uploadFinishTime and define the wave header.

uploadFinishTime = 0

# Common wave header, with zero audio length
WaveHeader16K16BitMono = bytes([82, 73, 70, 70, 78, 128, 0, 0, 87, 65, 86, 69, 102, 109, 116, 32, 18, 0, 0, 0, 1, 0, 1, 0, 128, 62, 0, 0, 0, 125, 0, 0, 2, 0, 16, 0, 0, 0, 100, 97, 116, 97, 0, 0, 0, 0])
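To see what the header bytes above actually declare, they can be unpacked with the standard struct module. This is a minimal sketch: the function name describe_wave_header and the field names are chosen here for illustration, and the layout assumed is a RIFF/WAVE header with an 18-byte fmt chunk followed by an empty data chunk, which is what the 46 bytes appear to encode.

```python
import struct

# The 46-byte header used in this thread, copied from the answer above.
WAVE_HEADER = bytes([82, 73, 70, 70, 78, 128, 0, 0, 87, 65, 86, 69,
                     102, 109, 116, 32, 18, 0, 0, 0, 1, 0, 1, 0,
                     128, 62, 0, 0, 0, 125, 0, 0, 2, 0, 16, 0, 0, 0,
                     100, 97, 116, 97, 0, 0, 0, 0])

# Field names are illustrative labels, not names from any SDK.
FIELD_NAMES = ("riff_id", "riff_size", "wave_id", "fmt_id", "fmt_size",
               "audio_format", "channels", "sample_rate", "byte_rate",
               "block_align", "bits_per_sample", "extension_size",
               "data_id", "data_size")


def describe_wave_header(header):
    """Unpack a little-endian RIFF/WAVE header with an 18-byte fmt chunk."""
    values = struct.unpack("<4sI4s4sIHHIIHHH4sI", header)
    return dict(zip(FIELD_NAMES, values))


info = describe_wave_header(WAVE_HEADER)
print(info["channels"], info["sample_rate"], info["bits_per_sample"], info["data_size"])
```

For this header the fields come out as one channel, 16000 Hz, 16 bits per sample, and a data length of zero, which matches the variable name and the "zero audio length" comment.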

Create a generator function to read audio data chunk by chunk from the URL.

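def get_chunk_from_url(audio_url, chunk_size=1024):
    global uploadFinishTime  # Updated here so the caller can compute latency
    yield WaveHeader16K16BitMono
    with requests.get(audio_url, stream=True) as response:
        response.raise_for_status()  # Fail early if the download did not succeed
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # Skip any empty chunks
                yield chunk
    # iter_content just stops at the end of the stream, so record the time here
    uploadFinishTime = time.time()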

Build pronunciation assessment parameters and request headers.

referenceText = "Perhaps"
pronAssessmentParamsJson = "{\"ReferenceText\":\"%s\",\"GradingSystem\":\"HundredMark\",\"Dimension\":\"Comprehensive\"}" % referenceText
pronAssessmentParamsBase64 = base64.b64encode(bytes(pronAssessmentParamsJson, 'utf-8'))
pronAssessmentParams = str(pronAssessmentParamsBase64, "utf-8")

url = "https://%s.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=en-us" % region
headers = {
    'Accept': 'application/json;text/xml',
    'Connection': 'Keep-Alive',
    'Content-Type': 'audio/wav; codecs=audio/pcm; samplerate=16000',
    'Ocp-Apim-Subscription-Key': subscriptionKey,
    'Pronunciation-Assessment': pronAssessmentParams,
    'Transfer-Encoding': 'chunked',
    'Expect': '100-continue'
}
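The Pronunciation-Assessment header value built above is Base64-encoded JSON. The sketch below builds the same value with json.dumps instead of string formatting, so that a reference text containing quotes is escaped correctly. The function name build_pron_assessment_header is an assumption made for this example; the three parameter names are the ones used in the code above.

```python
import base64
import json


def build_pron_assessment_header(reference_text):
    """Return the Base64-encoded JSON value for the Pronunciation-Assessment header."""
    params = {
        "ReferenceText": reference_text,
        "GradingSystem": "HundredMark",
        "Dimension": "Comprehensive",
    }
    # json.dumps escapes quotes and other special characters in the text.
    params_json = json.dumps(params)
    return base64.b64encode(params_json.encode("utf-8")).decode("ascii")


header_value = build_pron_assessment_header('She said "perhaps"')

# Round-trip check: decode the header value back into a dictionary.
decoded = json.loads(base64.b64decode(header_value))
print(decoded["ReferenceText"])
```

The round trip at the end is only there to show that the reference text survives encoding unchanged.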

Send the request with chunked data from the public audio URL.

response = requests.post(url=url, data=get_chunk_from_url(public_audio_url), headers=headers)
getResponseTime = time.time()

resultJson = json.loads(response.text)
print(json.dumps(resultJson, indent=4))

# Check if uploadFinishTime is still 0 (not updated) and calculate latency
if uploadFinishTime == 0:
    uploadFinishTime = time.time()  # Set it to the current time
latency = getResponseTime - uploadFinishTime
print("Latency = %sms" % int(latency * 1000))

This code allows you to perform pronunciation assessment with Azure's Speech Pronunciation Assessment service using a public URL for the audio file, without the need for local or Azure Blob storage. The audio data is streamed from the URL, and the pronunciation assessment is performed based on the provided parameters.

Output

{
    "RecognitionStatus": "Success",
    "Offset": 190300000,
    "Duration": 3300000,
    "NBest": [
        {
            "Confidence": 0.99992526,
            "Lexical": "Perhaps",
            "ITN": "Perhaps",
            "MaskedITN": "perhaps",
            "Display": "Perhaps.",
            "AccuracyScore": 25.0,
            "FluencyScore": 0.0,
            "CompletenessScore": 0.0,
            "PronScore": 5.0,
            "Words": [
                {
                    "Word": "Perhaps",
                    "Offset": 190300000,
                    "Duration": 3300000,
                    "Confidence": 0.0,
                    "AccuracyScore": 25.0,
                    "ErrorType": "Mispronunciation",
                    "Syllables": [
                        {
                            "Syllable": "paxr",
                            "Offset": 190300000,
                            "Duration": 1200000,
                            "AccuracyScore": 32.0
                        },
                        {
                            "Syllable": "haeps",
                            "Offset": 191600000,
                            "Duration": 2000000,
                            "AccuracyScore": 20.0
                        }
                    ],
                    "Phonemes": [
                        {
                            "Phoneme": "p",
                            "Offset": 190300000,
                            "Duration": 200000,
                            "AccuracyScore": 0.0
                        },
                        {
                            "Phoneme": "ax",
                            "Offset": 190600000,
                            "Duration": 500000,
                            "AccuracyScore": 31.0
                        },
                        {
                            "Phoneme": "r",
                            "Offset": 191200000,
                            "Duration": 300000,
                            "AccuracyScore": 57.0
                        },
                        {
                            "Phoneme": "h",
                            "Offset": 191600000,
                            "Duration": 500000,
                            "AccuracyScore": 35.0
                        },
                        {
                            "Phoneme": "ae",
                            "Offset": 192200000,
                            "Duration": 500000,
                            "AccuracyScore": 13.0
                        },
                        {
                            "Phoneme": "p",
                            "Offset": 192800000,
                            "Duration": 500000,
                            "AccuracyScore": 23.0
                        },
                        {
                            "Phoneme": "s",
                            "Offset": 193400000,
                            "Duration": 200000,
                            "AccuracyScore": 0.0
                        }
                    ]
                }
            ]
        }
    ],
    "DisplayText": "Perhaps."
}
Latency = 0ms
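To pull the headline scores out of a response like the one above, a small helper can be sketched as follows. The function name summarize_assessment and the threshold of 50 are assumptions for illustration, and the sample dictionary is a trimmed copy of the output above rather than a full response.

```python
def summarize_assessment(result, threshold=50.0):
    """Return the overall scores and the phonemes scoring below `threshold`."""
    best = result["NBest"][0]
    overall = {
        key: best[key]
        for key in ("AccuracyScore", "FluencyScore", "CompletenessScore", "PronScore")
    }
    weak_phonemes = []
    for word in best.get("Words", []):
        for phoneme in word.get("Phonemes", []):
            if phoneme["AccuracyScore"] < threshold:
                weak_phonemes.append(
                    (word["Word"], phoneme["Phoneme"], phoneme["AccuracyScore"])
                )
    return overall, weak_phonemes


# Trimmed from the output above; only the fields the helper reads are kept.
sample_result = {
    "RecognitionStatus": "Success",
    "NBest": [{
        "AccuracyScore": 25.0,
        "FluencyScore": 0.0,
        "CompletenessScore": 0.0,
        "PronScore": 5.0,
        "Words": [{
            "Word": "Perhaps",
            "Phonemes": [
                {"Phoneme": "p", "AccuracyScore": 0.0},
                {"Phoneme": "ax", "AccuracyScore": 31.0},
                {"Phoneme": "r", "AccuracyScore": 57.0},
            ],
        }],
    }],
}

overall, weak = summarize_assessment(sample_result)
print(overall)
print(weak)
```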

For more details, see the Speech-to-text documentation and the PronunciationAssessment sample.

Nikhil Kapse 0 Reputation points

2023-10-30T13:50:34.7766667+00:00

Thank you for your response! How might I go about accomplishing this if the public audio URL is a .ogg file? I saw that Azure primarily supports WAV audio files, but I would like to avoid using any local file storage while retrieving the pronunciation assessment.

Answer 2

dupammi 8,615 Microsoft External Staff

Hi @Nikhil Kapse ,

Thank you for the response.

If you have a public audio URL that points to a .ogg file and you want to perform pronunciation assessment without local file storage, you can convert the .ogg file to the required .wav format in memory and then run pronunciation assessment on the converted WAV audio data.

The process is similar to the previous code, but it handles .ogg audio files.

Pip install:

pip install soundfile

Sample Code below.

import requests
import base64
import json
import time
import azure.cognitiveservices.speech as speechsdk
import soundfile as sf  # Reads the .ogg audio and writes it as in-memory WAV audio
import io  # In-memory byte buffers
# Replace with your Azure subscription key and region
subscriptionKey = "YOUR_SUBSCRIPTION_KEY" 
region = "YOUR_REGION" 

# Replace with the URL of the public .ogg audio file
public_audio_url = "https://upload.wikimedia.org/wikipedia/commons/c/c8/Example.ogg"

# Convert the .ogg audio to WAV format in-memory
response = requests.get(public_audio_url)
wav_audio = io.BytesIO()

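with sf.SoundFile(io.BytesIO(response.content), 'rb') as ogg_audio:
    # Note: sf.write does not resample. Passing 16000 only labels the data as
    # 16 kHz, so this is correct only if the source audio is already 16 kHz.
    sf.write(wav_audio, ogg_audio.read(), 16000, format='WAV')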

# Initialize uploadFinishTime to 0
uploadFinishTime = 0

# a common wave header, with zero audio length
WaveHeader16K16BitMono = bytes([82, 73, 70, 70, 78, 128, 0, 0, 87, 65, 86, 69, 102, 109, 116, 32, 18, 0, 0, 0, 1, 0, 1, 0, 128, 62, 0, 0, 0, 125, 0, 0, 2, 0, 16, 0, 0, 0, 100, 97, 116, 97, 0, 0, 0, 0])

# A generator which reads audio data chunk by chunk from the in-memory WAV audio
def get_chunk_from_audio(wav_audio, chunk_size=1024):
    global uploadFinishTime  # Define uploadFinishTime as a global variable
    yield WaveHeader16K16BitMono
    wav_audio.seek(0)
    while True:
        chunk = wav_audio.read(chunk_size)
        if not chunk:
            uploadFinishTime = time.time()
            break
        yield chunk

# Build pronunciation assessment parameters and request headers
referenceText = "example"
pronAssessmentParamsJson = "{\"ReferenceText\":\"%s\",\"GradingSystem\":\"HundredMark\",\"Dimension\":\"Comprehensive\"}" % referenceText
pronAssessmentParamsBase64 = base64.b64encode(bytes(pronAssessmentParamsJson, 'utf-8'))
pronAssessmentParams = str(pronAssessmentParamsBase64, "utf-8")

url = "https://%s.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=en-us" % region
headers = {
    'Accept': 'application/json;text/xml',
    'Connection': 'Keep-Alive',
    'Content-Type': 'audio/wav; codecs=audio/pcm; samplerate=16000',
    'Ocp-Apim-Subscription-Key': subscriptionKey,
    'Pronunciation-Assessment': pronAssessmentParams,
    'Transfer-Encoding': 'chunked',
    'Expect': '100-continue'
}

# Send request with chunked data from the in-memory WAV audio
response = requests.post(url=url, data=get_chunk_from_audio(wav_audio), headers=headers)
getResponseTime = time.time()

resultJson = json.loads(response.text)
print(json.dumps(resultJson, indent=4))

# Check if uploadFinishTime is still 0 (not updated) and calculate latency
if uploadFinishTime == 0:
    uploadFinishTime = time.time()  # Set it to the current time
latency = getResponseTime - uploadFinishTime
print("Latency = %sms" % int(latency * 1000))

Hope this helps. Thanks!

If this answers your query, do click Accept Answer and Yes for was this answer helpful.

Nikhil Kapse 0

Hey, thanks for the response. I'm able to successfully convert from ogg to wav format using soundfile as you demonstrated, but I'm not sure how to create an AudioConfig object using the corresponding wav_audio. I tried the following, but received an error.

import azure.cognitiveservices.speech as speechsdk
import requests
import json
import soundfile as sf
import io

def pronunciation_assessment(audio_file_link, subscription_key, reference_text, region="eastus", language_code="en-US"):
    # Create an instance of a speech config with specified subscription key and service region.
    speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=region)

    # Get the audio file from the public link.
    response = requests.get(audio_file_link)
    # Convert the .ogg audio to WAV format in-memory
    wav_audio = io.BytesIO()

    with sf.SoundFile(io.BytesIO(response.content), 'rb') as ogg_audio:
        sample_rate = ogg_audio.samplerate  # Get the sample rate of the original audio
        sf.write(wav_audio, ogg_audio.read(), sample_rate, format='WAV')
        
    # Create an audio stream from the audio data.
    audio_stream = speechsdk.audio.AudioInputStream(wav_audio)

    # Create an audio configuration that points to an audio stream.
    # The script fails at the following line
    audio_config = speechsdk.audio.AudioConfig(stream=audio_stream)

The error seems to come from the final line in the code snippet above. Here's the stack trace:

File "pronunciation.py", line 107, in pronunciation_assessment
    audio_config = speechsdk.audio.AudioConfig(stream=audio_stream)
  File "lib/python3.9/site-packages/azure/cognitiveservices/speech/audio.py", line 382, in __init__
    _call_hr_fn(fn=_sdk_lib.audio_config_create_audio_input_from_stream, *[ctypes.byref(handle), stream._handle])
  File "lib/python3.9/site-packages/azure/cognitiveservices/speech/interop.py", line 61, in _call_hr_fn
    hr = fn(*args) if len(args) > 0 else fn()
ctypes.ArgumentError: argument 2: <class 'TypeError'>: Don't know how to convert parameter 2
Exception ignored in: <function _Handle.__del__ at 0x110823040>
Traceback (most recent call last):
  File "lib/python3.9/site-packages/azure/cognitiveservices/speech/interop.py", line 105, in __del__
    elif self.__test_fn(self.__handle):
ctypes.ArgumentError: argument 1: <class 'TypeError'>: Don't know how to convert parameter 1

dupammi 8,615 Reputation points Microsoft External Staff

2023-10-31T03:00:49.0966667+00:00

Hi @Nikhil Kapse ,

In the code provided, the error occurs when trying to create an AudioConfig object using the audio_stream.

The error indicates that there is an issue with parameter conversion.

To address the error, I can suggest the following changes to your code:

To create an AudioConfig object from an audio stream, you should use the from_stream_input method provided by the AudioConfig class.

Please try this; hopefully it resolves the issue.

Thank you!

Nikhil Kapse 0

Hello,

I looked through the source code, and I don't believe the FromStreamInput method that you linked exists in the Python library for speechsdk.audio.AudioConfig. Unsurprisingly, when I try using it like so:

wav_audio = io.BytesIO()

with sf.SoundFile(io.BytesIO(response.content), 'rb') as ogg_audio:
    sample_rate = ogg_audio.samplerate  # Get the sample rate of the original audio
    sf.write(wav_audio, ogg_audio.read(), sample_rate, format='WAV')

# Create an audio stream from the audio data.
audio_stream = speechsdk.audio.AudioInputStream(wav_audio)

# Create an audio configuration that points to an audio stream.
# The script fails at the following line
# audio_config = speechsdk.audio.AudioConfig(stream=audio_stream)
audio_config = speechsdk.audio.AudioConfig.FromStreamInput(audio_stream)

I get the attribute error: AttributeError: type object 'AudioConfig' has no attribute 'FromStreamInput'.

It seems like the real issue might be that the stream I'm passing into AudioConfig's stream argument is somehow incorrect, but please let me know what you think.

dupammi 8,615 Reputation points Microsoft External Staff

2023-10-31T05:45:23.1833333+00:00
Hi @Nikhil Kapse ,

I tried using AudioConfig as in the line below, and also checked whether the speech from the public .ogg URL is recognized.

audio_config = speechsdk.audio.AudioConfig.from_stream_input(audio_stream)

However, either the audio quality of the Wikimedia .ogg file I was using or the ogg-to-wav conversion may have a problem, because the code produced the following output on my end.

No speech could be recognized

With audio format conversions like this, the result depends on the quality of the audio and on the conversion process used.

If you are still facing issues with .ogg, please raise a support case through the Azure portal.
https://ms.portal.azure.com/#view/Microsoft_Azure_Support/HelpAndSupportBlade/~/overview

I hope you understand. Thank you.
Nikhil Kapse 0 Reputation points

2023-10-31T15:26:53.71+00:00
I'm a bit confused about how you're running this code in the first place:

audio_config = speechsdk.audio.AudioConfig.from_stream_input(audio_stream)

When I attempt this, I get
AttributeError: type object 'AudioConfig' has no attribute 'from_stream_input'

Could you describe any relevant imports/packages you used to run this code? I don't see any function with the signature

from_stream_input

within the AudioConfig source code.
Nikhil Kapse 0 Reputation points

2023-10-31T15:32:39.6033333+00:00
Hello, I'm still a bit confused about how you're successfully running this line:

audio_config = speechsdk.audio.AudioConfig.from_stream_input(audio_stream)

When I try to do the same, I get the following error:

AttributeError: type object 'AudioConfig' has no attribute 'from_stream_input'

Furthermore, there is no function with the signature 'from_stream_input' that I can find in the AudioConfig source code. Could you link me to the python documentation for the function or explain any relevant imports/packages you used to run the code you provided above?

dupammi 8,615 Microsoft External Staff

Hi @Nikhil Kapse ,

Below are the installed versions of azure-cognitiveservices-speech and soundfile in my Python environment:

azure-cognitiveservices-speech 1.32.1

soundfile 0.12.1

Below are the imports I used:

import azure.cognitiveservices.speech as speechsdk
import requests
import json
import soundfile as sf
import io

I was still not getting any error when using from_stream_input. Please see the screenshot below:

User's image

Because I could not identify why from_stream_input was accepted without error in my Python code, I modified my code to use AudioConfig directly with speechsdk.audio.AudioInputStream and an in-memory stream created with io.BytesIO.

This approach gave me RuntimeError: 5, indicating a failure in creating or using the audio input stream.

To work around this, I wrote the audio data to a temporary WAV file, "temp.wav", specified it as the audio_config for the speech recognizer, and then ran the pronunciation assessment. Please see below:

import azure.cognitiveservices.speech as speechsdk
import requests
import base64
import json
import time
import wave

subscriptionKey = "YOUR_SUBSCRIPTION_KEY"
region = "YOUR_REGION"
public_audio_url = "https://upload.wikimedia.org/wikipedia/commons/c/c8/Example.ogg"

# Initialize uploadFinishTime to 0
uploadFinishTime = 0

# A common wave header, with zero audio length.
# The streamed data doesn't contain a header, but the API requires one to fetch
# format information, so post this header as the first chunk of each query.
WaveHeader16K16BitMono = bytes([82, 73, 70, 70, 78, 128, 0, 0, 87, 65, 86, 69, 102, 109, 116, 32, 18, 0, 0, 0, 1, 0, 1, 0, 128, 62, 0, 0, 0, 125, 0, 0, 2, 0, 16, 0, 0, 0, 100, 97, 116, 97, 0, 0, 0, 0])


# Download the audio and prepend the WAV header.
# Note: this does not decode the Ogg data; the downloaded bytes are passed through unchanged.
response = requests.get(public_audio_url, stream=True)
wav_audio = WaveHeader16K16BitMono + response.content

# Create a temporary WAV file
with open("temp.wav", "wb") as wav_file:
    wav_file.write(wav_audio)

# Initialize a speech configuration
speech_config = speechsdk.SpeechConfig(subscription=subscriptionKey, region=region)
audio_config = speechsdk.audio.AudioConfig(filename="temp.wav")

# Create a speech recognizer
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

# Start speech recognition
result = speech_recognizer.recognize_once()

# Build pronunciation assessment parameters
referenceText = "example"
pronAssessmentParamsJson = {
    "ReferenceText": referenceText,
    "GradingSystem": "HundredMark",
    "Dimension": "Comprehensive"
}

# Convert the pronunciation assessment parameters to Base64
pronAssessmentParamsBase64 = base64.b64encode(bytes(json.dumps(pronAssessmentParamsJson), 'utf-8')).decode('utf-8')

# Build request
url = "https://%s.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=en-us" % region
headers = {
    'Accept': 'application/json;text/xml',
    'Content-Type': 'audio/wav; codecs=audio/pcm; samplerate=16000',
    'Ocp-Apim-Subscription-Key': subscriptionKey,
    'Pronunciation-Assessment': pronAssessmentParamsBase64,
    'Expect': '100-continue'
}

# Send the request
response = requests.post(url=url, data=wav_audio, headers=headers)
getResponseTime = time.time()

resultJson = json.loads(response.text)
print(json.dumps(resultJson, indent=4))

# Check if uploadFinishTime is still 0 (not updated) and calculate latency
if uploadFinishTime == 0:
    uploadFinishTime = time.time()
latency = getResponseTime - uploadFinishTime
print("Latency = %sms" % int(latency * 1000))

In this modified code, I wrote the audio data to the temporary WAV file "temp.wav", specified it as the audio_config for the speech recognizer, and then ran the assessment. If needed, the temp file can be removed from within Python using os.remove.
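The cleanup mentioned above can be sketched with a small context manager that creates the temporary file and removes it afterwards, even if an error occurs in between. The name temporary_wav_file is hypothetical, and the placeholder bytes stand in for real audio data; in the code above, the yielded path would be passed as the filename argument of AudioConfig.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temporary_wav_file(wav_bytes):
    """Write wav_bytes to a temporary .wav file and delete it on exit."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    try:
        with os.fdopen(fd, "wb") as wav_file:
            wav_file.write(wav_bytes)
        yield path
    finally:
        # Runs even if the body of the with-block raises.
        os.remove(path)


placeholder_audio = b"RIFF placeholder bytes, not real audio"

with temporary_wav_file(placeholder_audio) as wav_path:
    # A recognizer would be created and run here, for example with
    # speechsdk.audio.AudioConfig(filename=wav_path).
    size_on_disk = os.path.getsize(wav_path)

print(size_on_disk == len(placeholder_audio))
print(os.path.exists(wav_path))
```

One caveat: any object that still holds the file open when the block exits can prevent deletion on some platforms, so the recognizer should be finished with the file before the with-block ends.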

Below is the output of the pronunciation assessment with confidence score 0.9517101:

{
    "RecognitionStatus": "Success",
    "Offset": 26500000,
    "Duration": 2300000,
    "NBest": [
        {
            "Confidence": 0.9517101,
            "Lexical": "example",
            "ITN": "example",
            "MaskedITN": "example",
            "Display": "Example.",
            "AccuracyScore": 1.0,
            "FluencyScore": 0.0,
            "CompletenessScore": 0.0,
            "PronScore": 0.2,
            "Words": [
                {
                    "Word": "example",
                    "Offset": 26500000,
                    "Duration": 2300000,
                    "Confidence": 0.0,
                    "AccuracyScore": 1.0,
                    "ErrorType": "Mispronunciation",
                    "Syllables": [
                        {
                            "Syllable": "ihg",
                            "Offset": 26500000,
                            "Duration": 500000,
                            "AccuracyScore": 0.0
                        },
                        {
                            "Syllable": "zaem",
                            "Offset": 27100000,
                            "Duration": 800000,
                            "AccuracyScore": 1.0
                        },
                        {
                            "Syllable": "paxl",
                            "Offset": 28000000,
                            "Duration": 800000,
                            "AccuracyScore": 3.0
                        }
                    ],
                    "Phonemes": [
                        {
                            "Phoneme": "ih",
                            "Offset": 26500000,
                            "Duration": 200000,
                            "AccuracyScore": 0.0
                        },
                        {
                            "Phoneme": "g",
                            "Offset": 26800000,
                            "Duration": 200000,
                            "AccuracyScore": 0.0
                        },
                        {
                            "Phoneme": "z",
                            "Offset": 27100000,
                            "Duration": 200000,
                            "AccuracyScore": 0.0
                        },
                        {
                            "Phoneme": "ae",
                            "Offset": 27400000,
                            "Duration": 200000,
                            "AccuracyScore": 0.0
                        },
                        {
                            "Phoneme": "m",
                            "Offset": 27700000,
                            "Duration": 200000,
                            "AccuracyScore": 2.0
                        },
                        {
                            "Phoneme": "p",
                            "Offset": 28000000,
                            "Duration": 200000,
                            "AccuracyScore": 8.0
                        },
                        {
                            "Phoneme": "ax",
                            "Offset": 28300000,
                            "Duration": 200000,
                            "AccuracyScore": 0.0
                        },
                        {
                            "Phoneme": "l",
                            "Offset": 28600000,
                            "Duration": 200000,
                            "AccuracyScore": 0.0
                        }
                    ]
                }
            ]
        }
    ],
    "DisplayText": "Example."
}
Latency = 0ms

The in-memory stream we were writing to is somehow causing issues. However, by using the temporary WAV file in audio_config, we were able to successfully perform pronunciation assessment from the public .ogg URL.

Note: The code above can also be modified to read the temporary WAV file as bytes and use it in the POST request for pronunciation assessment.
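One possible reading of the errors earlier in this thread, offered as an assumption rather than a confirmed diagnosis: in the Python Speech SDK, AudioInputStream is a base class, and the stream classes meant to be constructed directly are PushAudioInputStream and PullAudioInputStream. The sketch below extracts raw PCM from in-memory WAV bytes with the standard wave module and writes it to a PushAudioInputStream, avoiding the temporary file. The helper names wav_to_pcm and make_audio_config are hypothetical, the SDK is imported inside the function so the parsing part runs without it, and make_audio_config has not been run against the service here.

```python
import io
import wave


def wav_to_pcm(wav_bytes):
    """Return (pcm_bytes, sample_rate, bits_per_sample, channels) for PCM WAV bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        bits_per_sample = wav_file.getsampwidth() * 8
        channels = wav_file.getnchannels()
        pcm = wav_file.readframes(wav_file.getnframes())
    return pcm, sample_rate, bits_per_sample, channels


def make_audio_config(wav_bytes):
    """Sketch: build an AudioConfig from in-memory WAV bytes via a push stream."""
    # Imported lazily so wav_to_pcm can be used without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    pcm, sample_rate, bits_per_sample, channels = wav_to_pcm(wav_bytes)
    stream_format = speechsdk.audio.AudioStreamFormat(
        samples_per_second=sample_rate,
        bits_per_sample=bits_per_sample,
        channels=channels,
    )
    push_stream = speechsdk.audio.PushAudioInputStream(stream_format=stream_format)
    push_stream.write(pcm)  # Raw PCM only; the WAV header is not written
    push_stream.close()     # Signals the end of the audio
    return speechsdk.audio.AudioConfig(stream=push_stream)


# Demonstration of the parsing step on 10 ms of generated 16 kHz mono silence.
demo_buffer = io.BytesIO()
with wave.open(demo_buffer, "wb") as demo_wav:
    demo_wav.setnchannels(1)
    demo_wav.setsampwidth(2)
    demo_wav.setframerate(16000)
    demo_wav.writeframes(b"\x00\x00" * 160)

pcm, rate, bits, channels = wav_to_pcm(demo_buffer.getvalue())
print(len(pcm), rate, bits, channels)
```

With the soundfile conversion shown earlier, wav_audio.getvalue() would supply the wav_bytes argument. Whether the service accepts a given sample rate or channel count is a separate question; this sketch only passes along whatever format the WAV data declares.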

I hope this helps!

Thank you!
