@Kohei Watanabe Azure Speech to Text offers two options for the model that runs behind the service:
- Baseline Model
- Custom Model
With the baseline model, you can call the API directly without any customization. Microsoft trains this model on audio with fairly typical background conditions, so if your scenario involves everyday speech or ordinary recordings, it should work for you right away.
The custom model is meant for scenarios where the accuracy of the baseline model does not meet your requirements. For example, your speech may contain custom vocabulary such as acronyms or phrases specific to your organization, or it may be recorded on a factory floor with a lot of background noise. The custom model is trained with your audio files on top of the baseline model, so the resulting endpoint keeps all the capabilities of the baseline model.
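As a rough sketch of how the two options differ in code, assuming you use the Python Speech SDK (the `azure-cognitiveservices-speech` package): the only change for a custom model is setting the endpoint ID of your deployed model on the speech configuration. The helper name `make_recognizer` and its parameters (`key`, `region`, `wav_path`, `custom_endpoint_id`) are placeholders I made up for illustration, not part of the SDK.

```python
def make_recognizer(key, region, wav_path, custom_endpoint_id=None):
    """Build a recognizer for a WAV file (hypothetical helper).

    With custom_endpoint_id=None the baseline model is used; otherwise
    requests go to the custom model deployed under that endpoint ID.
    """
    # Imported here so the SDK is only needed when the helper is called.
    # Assumes: pip install azure-cognitiveservices-speech
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    if custom_endpoint_id:
        # Route recognition to the deployed custom model.
        speech_config.endpoint_id = custom_endpoint_id

    audio_config = speechsdk.audio.AudioConfig(filename=wav_path)
    return speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config
    )
```

You would then call `recognize_once()` on the returned recognizer and read the `text` attribute of the result. Running the same file once without and once with `custom_endpoint_id` is a simple way to see whether the custom model helps in your scenario.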
To summarize, the results from the service depend on your scenario, and all of these factors affect them. If your scenario does not involve unusual background conditions, you can use the baseline model right away and evaluate its performance. Please check the FAQ document, which could help you with more details.
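If you want a number to go with that evaluation, one common choice is the word error rate (WER): the word-level edit distance between a human reference transcript and the recognized text, divided by the number of reference words. Below is a minimal sketch, assuming simple lowercase and whitespace tokenization; the function name `word_error_rate` and the sample sentences are made up for illustration and are not output from the service.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up transcripts: what was actually said vs. two hypothetical outputs.
reference = "please check the crm dashboard"
baseline_text = "please check the serum dashboard"
custom_text = "please check the crm dashboard"

print("baseline WER:", word_error_rate(reference, baseline_text))
print("custom WER:", word_error_rate(reference, custom_text))
```

A lower value is better. If the baseline WER on a representative set of your recordings is already acceptable, you probably do not need a custom model.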