Phi-4-multimodal does not support text + audio + image inputs in the same prompt

Luca Piscitelli 5 Reputation points
2025-03-18T19:11:50.0433333+00:00

I am integrating the Phi-4-Multimodal-Instruct model via the Azure AI Inference SDK and successfully submitting text and image inputs. However, when I attempt to include audio as input, I receive the following error:

{'error': {'code': 'Bad Request', 'message': 'Phi4MMForCausalLM does not support text + audio + image inputs in the same prompt', 'status': 400}}

Isn't Phi-4-multimodal supposed to accept text + audio + image input?
How can I send text + audio + image input?
Why does the error mention the "Phi4MMForCausalLM" model when my code specifies the 'Phi-4-multimodal-instruct' model?

This is the format I use for the request:

# Request payload (image_b64, audio_b64, and stream are defined earlier):
payload_text = {
  "model": 'Phi-4-multimodal-instruct',
  "messages": [
    {
      "role": "system",
      # "content": 'What is the result of 1+1?'
      "content": [
        {"text": "What''s in this image?", "type": "text"},
        {"image_url": {"url": f"data:image/jpeg;base64,{image_b64}", "detail":"low"}, "type": "image_url"},
        {
          "audio_url": {
              "url": f"data:audio/mp3;base64,{audio_b64}",
              "format": "mp3"
          }, "type": "audio_url"
        }
      ]
    }
  ],
  "temperature": 0.10,
  "top_p": 0.70,
  "stream": stream
}


1 answer

  1. SriLakshmi C 6,010 Reputation points Microsoft External Staff Moderator
    2025-03-19T06:02:09.5766667+00:00

    Hello @Luca Piscitelli,

    To address your questions:

    Isn't Phi-4-multimodal supposed to accept text + audio + image input?

    The Phi-4-Multimodal-Instruct model is designed to process text, image, and audio inputs. However, while it can handle each input type individually or in certain combinations, the current implementation does not support processing text, audio, and image together in a single prompt. That is exactly the restriction the error message you received is reporting.

    Please refer to https://huggingface.co/microsoft/Phi-4-multimodal-instruct

    How can I send text + audio + image input?

    Given the current limitation, it is advisable to process inputs separately or in supported combinations. The model accepts either text and image together or text and audio together, but not all three at once. To work around this restriction, you can first send the audio with a text prompt to obtain a transcription, and then combine that transcribed text with the image in a second request. This keeps you within the model's capabilities while still making use of all three input types; a sketch of the approach is shown below.
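
    For illustration, here is a minimal two-step sketch using the same raw payload format as in your question. It assumes endpoint, api_key, audio_b64, and image_b64 are already defined, and that your deployment accepts a bearer token (some deployments use an api-key header instead), so adjust as needed.

    import requests

    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}

    # Step 1: text + audio -- ask the model to transcribe the audio clip.
    transcribe_payload = {
        "model": "Phi-4-multimodal-instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Transcribe this audio clip."},
                    {"type": "audio_url",
                     "audio_url": {"url": f"data:audio/mp3;base64,{audio_b64}", "format": "mp3"}},
                ],
            }
        ],
    }
    transcription = requests.post(endpoint, headers=headers, json=transcribe_payload) \
        .json()["choices"][0]["message"]["content"]

    # Step 2: text + image -- combine the transcription with the image question.
    vision_payload = {
        "model": "Phi-4-multimodal-instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": f"Audio transcription: {transcription}\nWhat's in this image?"},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}", "detail": "low"}},
                ],
            }
        ],
    }
    answer = requests.post(endpoint, headers=headers, json=vision_payload).json()
    print(answer["choices"][0]["message"]["content"])

    Each request stays within a supported input combination, while the final answer still takes both the audio content and the image into account.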

    Why does the error mention the "Phi4MMForCausalLM" model when my code specifies the 'Phi-4-multimodal-instruct' model?

    The mention of "Phi4MMForCausalLM" in the error message suggests a possible misconfiguration or a specific limitation within the model's implementation. This reference indicates that the 'Phi-4-multimodal-instruct' model is built on the Phi4MMForCausalLM architecture, and the error is highlighting a restriction within this underlying framework. The model name in the error message helps pinpoint where the limitation occurs, clarifying that simultaneous processing of text, audio, and image inputs may not be supported.

    I hope this helps. Do let me know if you have any further queries.


    If this answers your query, please click Accept Answer and Yes for "Was this answer helpful".

    Thank you!

