Hello @Luca Piscitelli,
To address your questions:
Shouldn't Phi-4-multimodal be supposed to accept text + audio + image input?
The Phi-4-multimodal-instruct model is designed to process text, image, and audio inputs. However, while it can handle each input type individually or in certain pairings, simultaneous processing of all three in a single prompt is not supported in the current implementation. This is exactly the restriction the error message you encountered is reporting.
Please refer to the model card: https://huggingface.co/microsoft/Phi-4-multimodal-instruct
How to send text+audio+image input?
Given the current limitations, it is advisable to process inputs separately or in supported combinations. The model accepts either text + image or text + audio in one prompt, but not all three at once. To work around this restriction, you can first process the audio input to produce a text transcription, then combine that transcription with your original text and the image input in a second call. This two-step approach stays within the model's capabilities while still making use of all three input types.
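As a rough illustration, the two-step workaround could look like the sketch below. It follows the usage pattern shown on the Phi-4-multimodal-instruct model card (chat tags such as <|user|>, <|audio_1|>, <|image_1|>, and a processor accepting `text`, `images`, and `audios`); please verify these details against the current model card before relying on them, and note that paths, prompts, and generation settings here are placeholders.

```python
def build_prompt(user_text: str) -> str:
    """Wrap user content in the chat format shown on the model card."""
    return f"<|user|>{user_text}<|end|><|assistant|>"


def two_step_inference(audio_path: str, image_path: str, question: str) -> str:
    """Step 1: text + audio -> transcription. Step 2: text + image -> answer.

    Heavy dependencies are imported locally so build_prompt stays usable
    without them.
    """
    import soundfile as sf
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    model_id = "microsoft/Phi-4-multimodal-instruct"
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, trust_remote_code=True, device_map="auto"
    )

    # Step 1: text + audio in one prompt, asking for a transcription.
    audio, sample_rate = sf.read(audio_path)
    prompt = build_prompt("<|audio_1|>Transcribe this audio clip.")
    inputs = processor(
        text=prompt, audios=[(audio, sample_rate)], return_tensors="pt"
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    transcript = processor.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]

    # Step 2: the transcription is now plain text, so it can accompany the
    # image in a supported text + image prompt.
    image = Image.open(image_path)
    prompt = build_prompt(
        f"<|image_1|>{question}\nAudio transcript: {transcript}"
    )
    inputs = processor(text=prompt, images=[image], return_tensors="pt").to(
        model.device
    )
    out = model.generate(**inputs, max_new_tokens=256)
    return processor.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]
```

You would call it as, for example, `two_step_inference("clip.wav", "photo.jpg", "Does the image match what is said in the audio?")`.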
Why is the error mentioning "Phi4MMForCausalLM" model when in my code I specify 'Phi-4-multimodal-instruct' model?
The mention of "Phi4MMForCausalLM" in the error message is expected: 'Phi-4-multimodal-instruct' is the repository name you pass to the library, while Phi4MMForCausalLM is the underlying model class (architecture) that implements it. Errors raised inside that class reference the class name rather than the repository name, so the message is pointing at a restriction in the implementation itself, namely that text, audio, and image inputs cannot all be processed in a single prompt.
I hope this helps. Do let me know if you have any further queries.
If this answers your query, do click Accept Answer and Yes for "Was this answer helpful".
Thank you!