How to estimate the time needed to train a custom STT model?

Bruno Goncalves Vaz (P) 20 Reputation points
2024-05-28T15:54:02.83+00:00

Hey!

I'm thinking about fine-tuning an STT model with Audio + human-labeled transcript data in Speech Studio. However, as I read through the docs, I can see that "If you switch to a base model that supports customization with audio data, the training time may increase from several hours to several days" (quotation taken from here). Given the STT pricing page, this could turn out to be very expensive. Is there a way I can estimate how many hours/days it takes to train a custom model based on the amount of data I have? Or is there any kind of "personalization" I'm allowed to do to decrease the training time, like reducing the number of epochs or something like that?

Thanks a lot! 🙂

(I have another question, but it's off topic: if I select the Whisper models in Speech Studio in order to train them, I get a small warning that says those models "can only be used in batch speech to text". Why is that? Why can't the Whisper models be served as real-time models, like they are if we use them via Azure OpenAI?)

Azure AI Speech
Accepted answer
  1. dupammi 7,745 Reputation points Microsoft Vendor
    2024-05-29T05:32:10.57+00:00

    Hi @Bruno Goncalves Vaz (P)

    Thank you for your question.

    Training a custom speech-to-text (STT) model can indeed vary significantly in terms of time and cost depending on several factors.

    Training time increases with the amount of audio and transcript data, and with model complexity: more complex models (or models that require more data for fine-tuning) naturally take longer to train. It also depends on the region: regions with dedicated hardware can process roughly 10 hours of audio data per day, while other regions process only about 1 hour per day, so training in a dedicated-hardware region is considerably faster. While the Speech to text FAQ entry "How long does it take to train a custom model with audio data?" doesn't specifically mention adjusting epochs, reducing the number of epochs or other training parameters might help speed up training; however, this could also affect model performance.
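    As a back-of-the-envelope check, you can turn those throughput figures into a rough estimate. This is just a sketch based on the ~10 hours/day vs. ~1 hour/day numbers above; actual training times vary with model complexity and service load, so treat it as an order-of-magnitude guide only:

    ```python
    def estimate_training_days(audio_hours: float, dedicated_hardware: bool = True) -> float:
        """Rough training-time estimate from the documented throughput figures.

        Assumes ~10 hours of audio processed per day in regions with
        dedicated hardware, and ~1 hour per day elsewhere.
        """
        throughput_per_day = 10.0 if dedicated_hardware else 1.0
        return audio_hours / throughput_per_day

    # For example, with 50 hours of labeled audio:
    print(estimate_training_days(50))          # dedicated-hardware region: ~5 days
    print(estimate_training_days(50, False))   # other regions: ~50 days
    ```

    Combined with the per-hour training price on the STT pricing page, this also gives you a rough cost ceiling before you commit to a full training run.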

    The Whisper models are optimized for batch processing to handle large volumes of audio efficiently. These models might not be optimized for the latency requirements needed for real-time processing, which is why they can only be used in batch speech-to-text.
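    For reference, batch jobs are submitted through the Speech to text REST API rather than the real-time SDK. The sketch below only builds the JSON request body for a batch transcription job (API v3.1); the field names follow that API, but the URLs and display name are placeholders you would replace with your own:

    ```python
    import json

    def build_batch_transcription_request(content_urls, model_url=None, locale="en-US"):
        """Build the JSON body for a batch transcription job (Speech to text
        REST API v3.1). `model_url` is the self-link of a custom model from
        Speech Studio; leave it as None to use the default base model.
        """
        body = {
            "displayName": "whisper-batch-job",   # placeholder name
            "locale": locale,
            "contentUrls": list(content_urls),    # audio files reachable by the service
            "properties": {"wordLevelTimestampsEnabled": False},
        }
        if model_url:
            body["model"] = {"self": model_url}
        return body

    # The job would then be POSTed (with your Ocp-Apim-Subscription-Key header) to:
    #   https://<region>.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions
    body = build_batch_transcription_request(["https://example.com/audio1.wav"])
    print(json.dumps(body, indent=2))
    ```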

    I hope this helps. Thank you.

    1 person found this answer helpful.

0 additional answers
