Custom Speech: What happens if audio data for training exceeds 60 seconds?

Question

Anonymous

I am training custom speech models using audio + human-labeled transcript data.
According to the docs, "each training file can't exceed 60 seconds, or it will error out", but my data is longer than 60 seconds (about 5-10 minutes per file), and strangely I could upload it and train models with it. (Here's the link about the limitation; please scroll down to the "Note" section.)
So my question is: what happens if audio data for training exceeds 60 seconds? It looks fine on the console, but is something going wrong inside the training loop (for instance, is the audio being cut off somewhere)?

romungi-MSFT 49,101 Reputation points Microsoft Employee Moderator

2022-02-02T13:50:50.813+00:00

@MlyamaeYuichi-6843 Based on the errors I have encountered previously while setting up Custom Speech projects, training errors out if a file is not processed completely.
In your case, since training succeeded even with audio files longer than 60 seconds, the complete files should have been read. I am checking internally with the product team to confirm this and will get back to you.
Anonymous

2022-02-03T05:57:11.393+00:00

@romungi-MSFT Thank you so much for your comment! I'm waiting for your response after the confirmation, thanks in advance.

Answer accepted by question author

Answer 1

@MlyamaeYuichi-6843 Based on feedback from the product group, using larger files is not an issue: the process still uses these files to improve the custom terms in your data, and nothing beyond 60 seconds is ignored. The shorter files help in training the acoustic part of the model.

To summarize, the text files or transcripts play a bigger role in creating the model, so it is important to make sure the text is correct.
The audio files complement the transcripts by helping train the model on the audio quality or background that you will probably have in all your future files. For training the acoustic model, shorter audio is preferable.
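
Since shorter audio is preferable for the acoustic part, it may help to see which training files are longer than the 60 seconds quoted from the docs. Below is a minimal sketch using only Python's standard `wave` module; it assumes the audio is stored as PCM WAV files, and the function names and the `training_audio` folder mentioned afterwards are hypothetical, not part of Custom Speech or its SDK.

```python
import wave
from pathlib import Path

# The 60-second figure is the limit quoted from the docs in the question.
LIMIT_SECONDS = 60.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def files_over_limit(folder, limit=LIMIT_SECONDS):
    """Return (file name, duration) pairs for *.wav files longer than `limit`."""
    results = []
    for path in sorted(Path(folder).glob("*.wav")):
        duration = wav_duration_seconds(path)
        if duration > limit:
            results.append((path.name, duration))
    return results
```

For example, `files_over_limit("training_audio")` would list the files in that (hypothetical) folder that you might consider splitting into shorter segments, together with their matching transcript lines.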

I hope this helps.

If an answer is helpful, please accept or upvote it, which might help other community members reading this thread.

Anonymous

2022-02-08T11:50:20.697+00:00

@romungi-MSFT Thank you for your kind reply! It helped me a lot; thanks to it, I now know what to do next.
