Phi-4-multimodal does not support text + audio + image inputs in the same prompt

Luca Piscitelli 5 Reputation points
2025-03-18T19:11:50.0433333+00:00

I am integrating the Phi-4-Multimodal-Instruct model via the Azure AI Inference SDK and successfully submitting text and image inputs. However, when I attempt to include audio as input, I receive the following error:

{'error': {'code': 'Bad Request', 'message': 'Phi4MMForCausalLM does not support text + audio + image inputs in the same prompt', 'status': 400}}

Isn't Phi-4-multimodal supposed to accept text + audio + image input?
How can I send text + audio + image input?
Why does the error mention the "Phi4MMForCausalLM" model when my code specifies the 'Phi-4-multimodal-instruct' model?

This is the format I use for the request:

# Request payload (image_b64, audio_b64, and stream are defined earlier):
payload_text = {
  "model": 'Phi-4-multimodal-instruct',
  "messages": [
    {
      "role": "system",
      # "content": 'What is the result of 1+1?'
      "content": [
        {"text": "What''s in this image?", "type": "text"},
        {"image_url": {"url": f"data:image/jpeg;base64,{image_b64}", "detail":"low"}, "type": "image_url"},
        {
          "audio_url": {
              "url": f"data:audio/mp3;base64,{audio_b64}",
              "format": "mp3"
          }, "type": "audio_url"
        }
      ]
    }
  ],
  "temperature": 0.10,
  "top_p": 0.70,
  "stream": stream
}


1 answer

  1. SriLakshmi C 6,010 Reputation points Microsoft External Staff Moderator
    2025-03-19T06:02:09.5766667+00:00

    Hello @Luca Piscitelli,

    To address your questions:

    Isn't Phi-4-multimodal supposed to accept text + audio + image input?

    The Phi-4-Multimodal-Instruct model is designed to process text, image, and audio inputs. However, while it can handle each input type individually or in certain combinations, the current implementation does not support processing text, audio, and image together in a single prompt. That is exactly the restriction the error message you received is reporting.

    Please refer to https://huggingface.co/microsoft/Phi-4-multimodal-instruct

    How can I send text + audio + image input?

    Given the current limitation, it is advisable to process inputs separately or in supported combinations. The model accepts either text and image together or text and audio together, but not all three at once. To work around this restriction, you can first send the audio with a text prompt to obtain a transcription, and then combine that transcribed text with the image in a second request. This keeps you within the model's capabilities while still making use of all three input types; a sketch of the approach is shown below.
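
    For illustration, here is a minimal two-step sketch using the same raw payload format as in your question. It assumes endpoint, api_key, audio_b64, and image_b64 are already defined, and that your deployment accepts a bearer token (some deployments use an api-key header instead), so adjust as needed.

    import requests

    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}

    # Step 1: text + audio -- ask the model to transcribe the audio clip.
    transcribe_payload = {
        "model": "Phi-4-multimodal-instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Transcribe this audio clip."},
                    {"type": "audio_url",
                     "audio_url": {"url": f"data:audio/mp3;base64,{audio_b64}", "format": "mp3"}},
                ],
            }
        ],
    }
    transcription = requests.post(endpoint, headers=headers, json=transcribe_payload) \
        .json()["choices"][0]["message"]["content"]

    # Step 2: text + image -- combine the transcription with the image question.
    vision_payload = {
        "model": "Phi-4-multimodal-instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": f"Audio transcription: {transcription}\nWhat's in this image?"},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}", "detail": "low"}},
                ],
            }
        ],
    }
    answer = requests.post(endpoint, headers=headers, json=vision_payload).json()
    print(answer["choices"][0]["message"]["content"])

    Each request stays within a supported input combination, while the final answer still takes both the audio content and the image into account.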

    Why does the error mention the "Phi4MMForCausalLM" model when my code specifies the 'Phi-4-multimodal-instruct' model?

    The mention of "Phi4MMForCausalLM" in the error message suggests a possible misconfiguration or a specific limitation within the model's implementation. This reference indicates that the 'Phi-4-multimodal-instruct' model is built on the Phi4MMForCausalLM architecture, and the error is highlighting a restriction within this underlying framework. The model name in the error message helps pinpoint where the limitation occurs, clarifying that simultaneous processing of text, audio, and image inputs may not be supported.

    I hope this helps. Do let me know if you have any further queries.


    If this answers your query, please click Accept Answer and Yes for "Was this answer helpful".

    Thank you!

