Azure Text to Speech produces an invalid WAV file

Dutchottie 20 Reputation points
2024-03-20T16:02:52.4533333+00:00

In Unity 2022.29 I am trying to convert a TTS stream into a valid WAV file for Unity.

I'm using:

speechConfig.SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Raw48Khz16BitMonoPcm);

audioConfig = AudioConfig.FromWavFileOutput(filepath); //path inside Resources

Then, after calling speechSynthesizer.StartSpeakingSsmlAsync(ssml).Result, I do the following (from the SDK examples):

var audioDataStream = AudioDataStream.FromResult(result);
var isFirstAudioChunk = true;
audioClip = AudioClip.Create("Speech",
    SampleRate * 120, // clip length in samples: 120 s here (600 would allow 10 minutes)
    1,
    SampleRate,
    true,
    (float[] audioChunk) =>
    { /* do the audio building */ });

And finally I want to do:

audioSource.clip = Resources.Load<AudioClip>("audio/outputaudio"); // Resources paths use forward slashes

audioSource.Play();

I want to control playback myself (rather than letting the SpeechSynthesizer play the audio) by creating a WAV file and playing it manually. However, Unity reports an error when importing the WAV file:

FSBTool Error: The format of the source file is invalid.

Can anyone help me get a correct WAV file?
Alternatively, how can I run the SpeechSynthesizer so that SaveToWaveFileAsync(filename) works? I tried this but failed.

Thanks!

Azure AI Speech

2 answers

  1. Sina Salam 7,441 Reputation points
    2024-03-21T14:03:57.29+00:00

    Hello @OtteMMarco-3696

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    Regarding your question: you ran into an issue while generating a valid WAV file from Azure Text to Speech output in Unity, and you asked how to generate the WAV file correctly.

    Firstly,

    Verify that the file path you pass to FromWavFileOutput is correct and writable from Unity, and that the folder structure exists. Also check the output format: your snippet sets Raw48Khz16BitMonoPcm, and the Raw formats produce headerless PCM data, whereas a WAV file needs a RIFF header. A Riff format such as Riff48Khz16BitMonoPcm is likely the better choice when the result must be a valid WAV file.

    Secondly,

    Instead of relying on FromWavFileOutput, you can save the generated audio to a WAV file with SaveToWaveFileAsync. This method is available on the AudioDataStream class, which you create from the synthesis result.

    You can modify your code along the lines of the example below:

    var speechSynthesisResult = await speechSynthesizer.SpeakTextAsync("Your text here");
    using var audioDataStream = AudioDataStream.FromResult(speechSynthesisResult);
    await audioDataStream.SaveToWaveFileAsync(filePath);
    

    This removes the need for FromWavFileOutput and may resolve the invalid WAV file format.

    Thirdly,

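    Ensure that you're correctly handling the audio data obtained from AudioDataStream.FromResult(result) and that the sample rate passed to AudioClip.Create matches the synthesis output format. Note that SampleRate * 120 is the clip length in samples (120 seconds of audio), not the sample rate; with a 48 kHz output format, SampleRate must be 48000.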
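
    As an illustration of handling the audio data: the callback passed to AudioClip.Create receives a float[] that expects samples in the range [-1, 1], while the 16-bit PCM formats deliver two bytes per sample. A minimal sketch of that conversion, assuming little-endian mono 16-bit PCM, could look like this. PcmConverter and Pcm16ToFloat are hypothetical names, not part of the Speech SDK or Unity.

    ```csharp
    using System;

    public static class PcmConverter
    {
        // Converts little-endian 16-bit PCM bytes to floats in [-1, 1).
        // byteCount is the number of valid bytes in pcm (for example, the
        // count returned by a stream read); an odd trailing byte is ignored.
        public static float[] Pcm16ToFloat(byte[] pcm, int byteCount)
        {
            if (pcm == null) throw new ArgumentNullException(nameof(pcm));
            if (byteCount < 0 || byteCount > pcm.Length)
                throw new ArgumentOutOfRangeException(nameof(byteCount));

            var samples = new float[byteCount / 2];
            for (int i = 0; i < samples.Length; i++)
            {
                // Low byte first, then high byte; reinterpret as signed 16-bit.
                short s = unchecked((short)(pcm[2 * i] | (pcm[2 * i + 1] << 8)));
                samples[i] = s / 32768f;
            }
            return samples;
        }
    }
    ```

    The resulting array can be copied into the audioChunk buffer of the callback; if fewer samples are available than requested, the remainder is usually filled with zeros (silence).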

    Finally,

    Check the import settings, because Unity might expect specific encoding settings or metadata in the WAV file. You can inspect and adjust these by selecting the WAV file in the Unity Editor and opening its Import Settings. Also implement proper error handling in your code; a sample of how to do this can be found in the GitHub link provided by @VasaviLankipalle-MSFT.

    You also have resources available by the right side of this page.

    In addition, regarding your second post:

    If you have looked at that GitHub sample and tried the suggestions above and the problem persists, try the following.

    After saving the WAV file using SaveToWaveFileAsync, call AssetDatabase.Refresh() so that Unity's asset database picks up the new file. This step is needed for Unity to recognize the newly saved WAV file as an asset. Note that AssetDatabase belongs to the UnityEditor namespace, so this only works inside the Editor, not in a built player.

    Instead of relying on AssetDatabase.LoadAssetAtPath, try loading the AudioClip directly from the file using UnityWebRequestMultimedia.GetAudioClip. This method loads audio files asynchronously and can be more reliable for dynamically loaded assets. You can modify your code to load the AudioClip from the WAV file like the below example:

    IEnumerator LoadAudioClip(string filePath, Action<AudioClip> callback)
    {
        // Requires: using System; using System.Collections; using UnityEngine; using UnityEngine.Networking;
        // The using statement disposes the request once the clip has been extracted.
        using (UnityWebRequest www = UnityWebRequestMultimedia.GetAudioClip("file://" + filePath, AudioType.WAV))
        {
            yield return www.SendWebRequest();
            if (www.result == UnityWebRequest.Result.Success)
            {
                AudioClip audioClip = DownloadHandlerAudioClip.GetContent(www);
                callback?.Invoke(audioClip);
            }
            else
            {
                Debug.LogError("Failed to load audio clip: " + www.error);
                callback?.Invoke(null);
            }
        }
    }
    

    Call this coroutine after saving the WAV file and pass the file path. Once the AudioClip is loaded, use the callback function to assign it to the AudioSource.
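
    One detail worth considering when building the URL: concatenating "file://" with a path that contains spaces or other special characters yields an unescaped URL. A small sketch using System.Uri to produce an escaped file URL is shown below. FileUrl and FromPath are hypothetical helper names, and whether the escaping is strictly required by UnityWebRequest on a given platform is an assumption, not something confirmed in this thread.

    ```csharp
    using System;
    using System.IO;

    public static class FileUrl
    {
        // Builds an escaped file:// URL from a local path.
        // Relative paths are resolved against the current directory first;
        // spaces in the path become %20.
        public static string FromPath(string path)
        {
            return new Uri(Path.GetFullPath(path)).AbsoluteUri;
        }
    }
    ```

    The result of FileUrl.FromPath(filePath) could then be passed to UnityWebRequestMultimedia.GetAudioClip in place of "file://" + filePath.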

    By implementing these suggestions and ensuring proper file handling and coroutine execution, you should be able to reliably save the Azure Text to Speech output to a WAV file and load it as an AudioClip in Unity for playback.

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.

    Please remember to "Accept Answer" if answer helped, so that others in the community facing similar issues can easily find the solution.

    Best Regards,

    Sina Salam


  2. Dutchottie 20 Reputation points
    2024-03-25T09:40:43.0733333+00:00

    I found a solution that works. Apart from using async/await correctly, I discovered that setting the SpeechSynthesisOutputFormat can produce an audio file that Unity does not recognize. Perhaps there is a correct setting, but leaving this option unset seems to work fine.
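
    A possible explanation, offered as an assumption rather than something confirmed in this thread: the Raw output formats deliver bare PCM samples without a RIFF/WAVE header, whereas the Riff formats (for example Riff48Khz16BitMonoPcm) include one. The sketch below shows how to check a file's bytes for that header and how headerless PCM could be wrapped in a canonical 44-byte header. WavHeader, HasRiffHeader and WrapPcm are hypothetical helper names, not part of the Speech SDK or Unity.

    ```csharp
    using System;
    using System.IO;
    using System.Text;

    public static class WavHeader
    {
        // True when the data starts with the "RIFF....WAVE" signature of a WAV container.
        public static bool HasRiffHeader(byte[] data)
        {
            return data != null
                && data.Length >= 12
                && Encoding.ASCII.GetString(data, 0, 4) == "RIFF"
                && Encoding.ASCII.GetString(data, 8, 4) == "WAVE";
        }

        // Prepends a canonical 44-byte PCM WAV header to headerless PCM samples.
        // BinaryWriter writes integers in little-endian order, as WAV requires.
        public static byte[] WrapPcm(byte[] pcm, int sampleRate, short channels, short bitsPerSample)
        {
            short blockAlign = (short)(channels * bitsPerSample / 8);
            using var ms = new MemoryStream();
            using var w = new BinaryWriter(ms, Encoding.ASCII);
            w.Write(Encoding.ASCII.GetBytes("RIFF"));
            w.Write(36 + pcm.Length);          // RIFF chunk size: all bytes after this field
            w.Write(Encoding.ASCII.GetBytes("WAVE"));
            w.Write(Encoding.ASCII.GetBytes("fmt "));
            w.Write(16);                       // size of the fmt chunk for PCM
            w.Write((short)1);                 // audio format 1 = uncompressed PCM
            w.Write(channels);
            w.Write(sampleRate);
            w.Write(sampleRate * blockAlign);  // byte rate
            w.Write(blockAlign);
            w.Write(bitsPerSample);
            w.Write(Encoding.ASCII.GetBytes("data"));
            w.Write(pcm.Length);               // size of the sample data in bytes
            w.Write(pcm);
            w.Flush();
            return ms.ToArray();
        }
    }
    ```

    For the format in the question, the call would be WavHeader.WrapPcm(pcmBytes, 48000, 1, 16). Running HasRiffHeader on File.ReadAllBytes(path) is a quick way to see whether a saved file is a WAV container at all.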

    This is the (summarized) Unity code I use:

    private const int SampleRate = 24000;

    static event Action<AudioClip> PlayAudio;

    void Start()
    {
        speechConfig = SpeechConfig.FromSubscription(speechKey, speechRegion);
        speechConfig.SpeechSynthesisVoiceName = "en-US-JennyNeural";
        // No SetSpeechSynthesisOutputFormat call: the SDK default output format is used.
        PlayAudio += AudioCreated;
    }

    void Update()
    {
        if (triggerTTS)
        {
            triggerTTS = false; // assumption: reset the flag so synthesis starts once per trigger
            _ = StartSpeech();  // fire and forget; the task continues across frames
        }
    }

    async Task StartSpeech()
    {
        speechSynthesizer = new SpeechSynthesizer(speechConfig, audioConfig);
        string ssml = File.ReadAllText("Assets\\Resources\\SSML\\ssml.xml");
        var getTTS = await speechSynthesizer.StartSpeakingSsmlAsync(ssml);
        using var stream = AudioDataStream.FromResult(getTTS);
        outputAudioFilePath = filePath + fileName + fileNumber.ToString() + ".wav";
        // Wait until the complete WAV file is on disk before loading it.
        await stream.SaveToWaveFileAsync(outputAudioFilePath);
        StartCoroutine(LoadAudioClip(outputAudioFilePath, PlayAudio));
    }

    IEnumerator LoadAudioClip(string filePath, Action<AudioClip> callback)
    {
        // The file:// URL needs an absolute path, so the argument is overridden here.
        filePath = "D:/Unity Projects/Talking Heads/Assets/Resources/audio/" + fileName + fileNumber + ".wav";
        using (UnityWebRequest www = UnityWebRequestMultimedia.GetAudioClip("file://" + filePath, AudioType.WAV))
        {
            yield return www.SendWebRequest();
            // Treat every non-success result as a failure, not only connection errors.
            if (www.result != UnityWebRequest.Result.Success)
            {
                Debug.LogError("Failed to load audio clip: " + www.error);
                callback?.Invoke(null);
            }
            else
            {
                yield return new WaitUntil(() => www.downloadHandler.isDone);
                AudioClip audioClip = ((DownloadHandlerAudioClip)www.downloadHandler).audioClip; //DownloadHandlerAudioClip.GetContent(www);
                callback?.Invoke(audioClip);
            }
        }
    }

    void AudioCreated(AudioClip thisClip)
    {
        audioSource.clip = thisClip;
        audioSource.Play();
    }
    