How to estimate the time needed to train a custom STT model?

Bruno Goncalves Vaz (P) 20 Reputation points


I'm thinking about fine-tuning an STT model with Audio + human-labeled transcript data in Speech Studio. However, as I read through the docs, I can see that "If you switch to a base model that supports customization with audio data, the training time may increase from several hours to several days" (quoted from the documentation). Given the STT pricing page, this could turn out to be very expensive. Is there a way to estimate how many hours or days it takes to train a custom model based on the amount of data I have? Or is there any kind of "personalization" I'm allowed to do to decrease the training time, such as reducing the number of epochs?

Thanks a lot! 🙂

(I have another question, though it is off topic. If I select the Whisper models in Speech Studio in order to train them, I get a small warning that says those models "can only be used in batch speech to text". Why is that? Why can't the Whisper models be served as real-time models, as they are when we use them via Azure OpenAI?)

Azure AI Speech
An Azure service that integrates speech processing into apps and services.

Accepted answer
  1. dupammi 7,745 Reputation points Microsoft Vendor

    Hi @Bruno Goncalves Vaz (P)

    Thank you for your question.

    Training a custom speech-to-text (STT) model can indeed vary significantly in terms of time and cost depending on several factors.

    Training time increases with the amount of audio and transcript data. Regions with dedicated training hardware process about 10 hours of audio data per day, while other regions process about 1 hour per day, so training is faster in regions with dedicated hardware. More complex models, or models that require more data for fine-tuning, also take longer to train. While the Speech to text FAQ entry "How long does it take to train a custom model with audio data?" doesn't specifically mention adjusting epochs, reducing the number of epochs or other training parameters might help speed up training. However, this could also affect model performance.
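As a rough illustration of the throughput figures above, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate, not an official formula: the helper names are hypothetical, the rates are the approximate values quoted in this answer, and the cost function assumes training is billed per hour of wall-clock training time at a price you look up yourself on the pricing page.

```python
# Approximate throughput quoted in the answer (hours of audio processed per day).
# These are rough figures; actual time also depends on the base model.
HOURS_PER_DAY_DEDICATED = 10.0
HOURS_PER_DAY_OTHER = 1.0


def estimate_training_days(audio_hours: float, dedicated_hardware: bool) -> float:
    """Rough training-time estimate in days (hypothetical helper)."""
    if audio_hours < 0:
        raise ValueError("audio_hours must be non-negative")
    rate = HOURS_PER_DAY_DEDICATED if dedicated_hardware else HOURS_PER_DAY_OTHER
    return audio_hours / rate


def estimate_training_cost(
    audio_hours: float, dedicated_hardware: bool, price_per_hour: float
) -> float:
    """Rough cost estimate.

    ASSUMPTION: billing is per hour of wall-clock training time;
    price_per_hour is a placeholder you supply from the pricing page.
    """
    days = estimate_training_days(audio_hours, dedicated_hardware)
    return days * 24 * price_per_hour


if __name__ == "__main__":
    # Example: 20 hours of audio in a region with dedicated hardware.
    print(estimate_training_days(20, dedicated_hardware=True))
    # Same dataset in a region without dedicated hardware.
    print(estimate_training_days(20, dedicated_hardware=False))
```

Under these assumptions, the region choice dominates the estimate: the same dataset takes roughly ten times longer in a region without dedicated hardware.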

    The Whisper models are optimized for batch processing to handle large volumes of audio efficiently. These models might not be optimized for the latency requirements needed for real-time processing, which is why they can only be used in batch speech-to-text.

    I hope this helps. Thank you.

    1 person found this answer helpful.