As per
https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/models-featured#microsoft
Phi-4-multimodal-instruct chat-completion (with image and audio content)
- Input: text, images, and audio (131,072 tokens)
- Output: (4,096 tokens)
- Tool calling: No
- Response formats: Text
If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.
hth
Marcin