Best Practices for Structuring Multi-Speaker Audio and Transcripts in Custom Azure Speech Models
Kay Wiberg
Hi!
My team and I have a few questions regarding Azure Speech, specifically about how to structure datasets for training and testing custom speech models.
Training Data
- Please confirm the best format for audio and transcripts when training a custom speech model on multi-speaker data, including how the .zip file should be structured. Should the .txt transcript file contain only one line per audio file, without timestamps, even if that audio file contains conversation turns from two different speakers? Could you please provide an illustrative example of such a transcript?
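To make the question concrete, here is a sketch of the layout we currently assume from the docs: a flat .zip containing the audio files plus a single trans.txt, where each line maps one audio file name to its full transcript, tab-separated, with a two-speaker conversation collapsed into one line. The file names and transcript text below are made up for illustration.

```python
import io
import zipfile

# Assumed layout (please correct us if wrong): one trans.txt at the zip root,
# each line "<audio file name><TAB><full transcript>", no timestamps, and a
# two-speaker conversation concatenated into a single normalized line.
transcript_lines = [
    "call_001.wav\thello thanks for calling how can i help you i would like to check my order status",
    "call_002.wav\tgood morning i have a question about billing sure let me pull up your account",
]

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("trans.txt", "\n".join(transcript_lines) + "\n")
    # Audio files sit alongside trans.txt at the zip root; empty placeholders
    # here stand in for real RIFF WAV files.
    zf.writestr("call_001.wav", b"")
    zf.writestr("call_002.wav", b"")

with zipfile.ZipFile(buf) as zf:
    print(sorted(zf.namelist()))
```

Is this the structure you would recommend, or should multi-speaker audio be handled differently?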
Test Data
- I understand that training data transcripts for a custom speech model should be normalized (lower-cased, no punctuation). Should the test data transcripts instead be true-cased, punctuated, and de-tokenized?
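For reference, this is roughly the normalization step we currently apply to training transcripts; the function name and the exact rules (keep in-word apostrophes, drop other punctuation) are our own assumptions, not anything from the docs.

```python
import re

def normalize_training_transcript(text: str) -> str:
    """Lower-case a transcript and strip punctuation, keeping apostrophes
    inside words and collapsing runs of whitespace - our current assumption
    of what 'normalized' training text means."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", "", text)  # drop punctuation except apostrophes
    return re.sub(r"\s+", " ", text).strip()

print(normalize_training_transcript("Hello, thanks for calling! How can I help?"))
# → hello thanks for calling how can i help
```

The question is whether the test set should skip this step entirely and keep the original casing and punctuation.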
Thank you!
Azure AI Speech