Hi @Kynyk, Yaroslav
Thank you for reaching out about this issue.
Azure Communication Services (ACS) Call Automation provides Speech‑to‑Text (STT) and Text‑to‑Speech (TTS) through its integration with Foundry Tools. This integration is supported only when ACS is connected to a Multi‑service Azure AI Services (Cognitive Services) resource that includes Speech.
How speech works in ACS Call Automation
- ACS Call Automation exposes AI‑powered features such as:
  - Text‑to‑Speech (TTS) using plain text or SSML
  - Speech‑to‑Text (STT) for recognizing caller speech
- These features are implemented via the Foundry Tools integration, which internally relies on the Azure AI Speech capability.
- The integration only supports a Multi‑service Cognitive Service resource. When creating or connecting an Azure AI resource for ACS, Microsoft explicitly recommends using a Multi‑service Cognitive Service resource and ensuring that Speech is included and enabled.
Through this setup, ACS Call Automation can:
- Convert text or SSML into audio using Azure Text‑to‑Speech voices
- Recognize spoken responses using Azure Speech‑to‑Text
- Execute these capabilities via built‑in Play and Recognize actions without developers handling media streams directly
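To make the SSML option above more concrete, here is a minimal sketch of building an SSML string that could be supplied to a Play action. The helper name `build_ssml` and the default voice `en-US-JennyNeural` are assumptions for illustration and are not part of the ACS SDK. The sketch only builds the string and does not call any Azure service.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str, voice: str = "en-US-JennyNeural", lang: str = "en-US") -> str:
    """Return a minimal SSML document that speaks `text` with the given voice.

    Hypothetical helper for illustration; the voice name is an assumed default.
    """
    # Escape &, <, > so caller-supplied text cannot break the XML.
    body = escape(text)
    # quoteattr() returns the value already wrapped in quotes and escaped.
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>{body}</voice>"
        "</speak>"
    )


if __name__ == "__main__":
    print(build_ssml("Thanks for calling. How can I help?"))
```

Escaping the text before embedding it matters when the prompt contains dynamic values (for example, a customer name), since a stray `&` or `<` would otherwise produce invalid SSML.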
Role of Azure OpenAI and Azure AI Foundry
- Azure OpenAI and Azure AI Foundry models are not documented as supported speech backends for ACS Call Automation’s built‑in STT/TTS.
- These services are commonly used in sample architectures for:
  - Text reasoning
  - Natural language understanding
  - Generating responses after speech has already been converted to text
- However, the speech layer itself (TTS/STT) is always provided by the Speech capability within the connected Multi‑service Cognitive Service resource, accessed through Foundry Tools.
Correct and supported configuration
For STT and TTS to work correctly with ACS Call Automation, the supported and documented setup is:
- Connect ACS to an Azure AI Services – Multi‑service Cognitive Service resource that includes Speech
- Use that resource as the Cognitive Services endpoint for Call Automation
- Use Azure OpenAI separately in your backend logic if needed for text generation or decision‑making
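The setup above could be captured in application settings roughly as follows. This is only a sketch: the class name `CallAutomationConfig`, its field names, and the placeholder values are assumptions for illustration, and the validation is a basic sanity check rather than anything the service itself requires.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class CallAutomationConfig:
    """Hypothetical settings holder; names are illustrative, not an SDK type."""

    acs_connection_string: str
    # Endpoint of the Multi-service resource that includes Speech (used for STT/TTS).
    cognitive_services_endpoint: str
    # Optional: only used by your own backend logic for text generation.
    azure_openai_endpoint: Optional[str] = None

    def validate(self) -> None:
        """Raise ValueError if a required setting is missing or malformed."""
        if not self.acs_connection_string.strip():
            raise ValueError("ACS connection string is required")
        parsed = urlparse(self.cognitive_services_endpoint)
        if parsed.scheme != "https" or not parsed.netloc:
            raise ValueError("Cognitive Services endpoint must be an https URL")


if __name__ == "__main__":
    # Placeholder values only; replace with your own resources.
    config = CallAutomationConfig(
        acs_connection_string="endpoint=https://example.communication.azure.com/;accesskey=placeholder",
        cognitive_services_endpoint="https://example.cognitiveservices.azure.com/",
    )
    config.validate()
    print("Configuration looks well-formed")
```

Keeping the Azure OpenAI endpoint as a separate, optional field mirrors the point above: it is not what Call Automation uses for speech.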
This design cleanly separates responsibilities:
- ACS Call Automation + Foundry Tools → handle telephony, audio, STT, and TTS using Azure Speech
- Azure OpenAI / Foundry models → handle text‑based AI reasoning and response generation
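As a rough sketch of this separation in backend code, the function below works only with text: it takes what STT already recognized, asks a text model for a reply, and returns the text for the next Play action to speak. The function name `handle_recognized_speech` and the `generate_reply` callable are hypothetical. In a real app, `generate_reply` might wrap an Azure OpenAI call, and the input would come from the Recognize result.

```python
from typing import Callable


def handle_recognized_speech(
    recognized_text: str,
    generate_reply: Callable[[str], str],
    fallback: str = "Sorry, I didn't catch that. Could you repeat?",
) -> str:
    """Return the text that the next Play action should speak.

    `recognized_text` is what STT produced for the caller (already text).
    `generate_reply` is any text-in/text-out function; it never sees audio.
    """
    cleaned = recognized_text.strip()
    if not cleaned:
        # Nothing was recognized: skip the model and re-prompt the caller.
        return fallback
    reply = generate_reply(cleaned).strip()
    # Fall back if the model returned an empty reply.
    return reply or fallback


if __name__ == "__main__":
    # Stand-in for a real model call, so the sketch runs without any Azure resource.
    def echo_model(text: str) -> str:
        return f"You said: {text}"

    print(handle_recognized_speech("  check my order status ", echo_model))
```

Because the model is passed in as a plain callable, the text-generation backend can be swapped or tested without touching the call-handling side.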
References:
https://learn.microsoft.com/en-us/azure/communication-services/samples/call-automation-ai?pivots=programming-language-javascript
https://learn.microsoft.com/en-us/azure/communication-services/concepts/call-automation/azure-communication-services-azure-cognitive-services-integration
If this answer is helpful, please click "Accept the answer" and "Yes", as this can be beneficial to other community members. If you have any other questions, let me know in the comments and I would be happy to help.