Phi-4-Multimodal-Instruct - Unable to Send Audio Input to Phi-4-Multimodal-Instruct – "Invalid Input" Error

Giacomo Maccagni 0 Reputation points
2025-03-18T14:53:25.8033333+00:00

Description:

Issue Summary:

I am integrating the Phi-4-Multimodal-Instruct model via the Azure AI Inference SDK and successfully submitting text and image inputs. However, when I attempt to include audio as input, I receive the following error:

azure.core.exceptions.HttpResponseError: (Invalid input) invalid input error
Code: Invalid input
Message: invalid input error

Steps to Reproduce:

  1. I encode an MP3 file (what_do_you_see.mp3) to base64 and attempt to load it as an AudioContentItem.
  2. I submit the request using client.complete() with the audio file included.
  3. The API responds with an "Invalid input" error.
  4. Submitting only text and image works fine, confirming the issue is specific to audio.

Code Snippet:

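import base64

from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import (
    AudioContentFormat,
    AudioContentItem,
    InputAudio,
    SystemMessage,
    UserMessage,
)
from azure.core.credentials import AzureKeyCredential

# endpoint and token are defined elsewhere.

# Manual base64 encoding of the clip. It is not used below, because
# InputAudio.load() reads and encodes the file itself.
with open("what_do_you_see.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

client = ChatCompletionsClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(token),
    model_name="Phi-4-multimodal-instruct"
)

# This call raises HttpResponseError: (Invalid input) invalid input error
response = client.complete(
    messages=[
        SystemMessage("You are an AI assistant for translating and transcribing audio clips."),
        UserMessage(
            [
                AudioContentItem(
                    input_audio=InputAudio.load(
                        audio_file="what_do_you_see.mp3", audio_format=AudioContentFormat.MP3
                    )
                ),
            ],
        ),
    ]
)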

Additional Information:

  • The file exists and is properly encoded.
  • I have tried both MP3 and WAV formats with the same issue.
  • The model does not provide further debugging details beyond "Invalid input".
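
Since the error gives no detail, one thing worth ruling out is a mismatch between the file's actual container and the declared format. The following is a minimal sketch of such a check, not part of the original post: sniff_audio_format is a hypothetical helper that looks only at the leading bytes, and the MP3 frame-sync test is a heuristic that can also match other MPEG audio streams.

```python
def sniff_audio_format(data: bytes) -> str:
    """Guess 'wav' or 'mp3' from the leading bytes; return 'unknown' otherwise."""
    # WAV files are RIFF containers: "RIFF" <4-byte size> "WAVE".
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    # Many MP3 files begin with an ID3v2 tag.
    if data[:3] == b"ID3":
        return "mp3"
    # Otherwise look for an MPEG audio frame sync: eleven set bits.
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "mp3"
    return "unknown"
```

Calling it on the bytes read from the file before building the request shows whether the declared format ("mp3" or "wav") matches what is actually on disk.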

Request for Support:

  1. What is the correct format and encoding expected for audio input?
  2. Does Phi-4-Multimodal-Instruct currently support audio input via the Azure API?
  3. Are there any known limitations or documentation updates regarding audio input?

Thank you for your assistance!

Azure Machine Learning
An Azure machine learning service for building and deploying models.

2 answers

  1. Giacomo Maccagni 0 Reputation points
    2025-03-18T18:52:06.18+00:00

    OK, I solved the audio issue. The correct format to use for the request is:

    # image_b64, audio_b64 and stream are defined elsewhere.
    payload_text = {
      "model": "Phi-4-multimodal-instruct",
      "messages": [
        {
          "role": "system",
          "content": [
            {"text": "What's in this image?", "type": "text"},
            {"image_url": {"url": f"data:image/jpeg;base64,{image_b64}", "detail": "low"}, "type": "image_url"},
            {
              "audio_url": {
                  "url": f"data:audio/mp3;base64,{audio_b64}",
                  "format": "mp3"
              }, "type": "audio_url"
            }
          ]
        }
      ],
      "temperature": 0.10,
      "top_p": 0.70,
      "stream": stream
    }
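
The key point of the payload above is that the audio travels as an "audio_url" content part whose URL is a base64 data URL. A minimal sketch of building such a payload from raw bytes could look like the following. build_audio_payload is a hypothetical helper, not part of any SDK, and the default "user" role is an assumption (the payload above was posted with the "system" role).

```python
import base64


def build_audio_payload(audio_bytes: bytes, audio_format: str, prompt: str,
                        role: str = "user",
                        model: str = "Phi-4-multimodal-instruct") -> dict:
    """Build a chat payload with one text part and one audio_url part."""
    audio_b64 = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": role,  # assumption: "user"; the post used "system"
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "audio_url",
                        "audio_url": {
                            # Data URL: MIME type, then the base64 body.
                            "url": f"data:audio/{audio_format};base64,{audio_b64}",
                            "format": audio_format,
                        },
                    },
                ],
            }
        ],
    }
```

For example, build_audio_payload(open("what_do_you_see.mp3", "rb").read(), "mp3", "Transcribe this clip.") returns a dict shaped like the one in the answer, minus the image part and the sampling parameters.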
    

  2. Saideep Anchuri 5,955 Reputation points Microsoft External Staff
    2025-03-19T03:28:45.3433333+00:00

    Hi Giacomo Maccagni

    I'm glad that you were able to resolve your issue, and thank you for posting your solution so that others experiencing the same thing can easily reference it. Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others", I'll repost your solution in case you'd like to accept the answer.

    Ask: Phi-4-Multimodal-Instruct - Unable to Send Audio Input to Phi-4-Multimodal-Instruct – "Invalid Input" Error

    Solution: The issue is resolved. You solved the audio issue by using the following request format:

    # image_b64, audio_b64 and stream are defined elsewhere.
    payload_text = {
      "model": "Phi-4-multimodal-instruct",
      "messages": [
        {
          "role": "system",
          "content": [
            {"text": "What's in this image?", "type": "text"},
            {"image_url": {"url": f"data:image/jpeg;base64,{image_b64}", "detail": "low"}, "type": "image_url"},
            {
              "audio_url": {
                  "url": f"data:audio/mp3;base64,{audio_b64}",
                  "format": "mp3"
              }, "type": "audio_url"
            }
          ]
        }
      ],
      "temperature": 0.10,
      "top_p": 0.70,
      "stream": stream
    }
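
The thread does not show how this payload is sent. As a rough sketch only, it could be posted as JSON over HTTPS with the standard library. Everything deployment-specific here is an assumption: build_chat_request is a hypothetical helper, the "/chat/completions" route and the Bearer authorization header may differ for your endpoint, and no API version parameter is included.

```python
import json
import urllib.request


def build_chat_request(endpoint: str, api_key: str,
                       payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST request carrying the payload."""
    # Assumed route; check the endpoint details of your own deployment.
    url = endpoint.rstrip("/") + "/chat/completions"
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumed auth scheme; some endpoints expect a different header.
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

The prepared request can then be sent with urllib.request.urlopen(req) and the response body parsed with json.loads. With "stream" set to a true value the response arrives incrementally, so a non-streaming call is the simpler starting point.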
    

    If I missed anything please let me know and I'd be happy to add it to my answer, or feel free to comment below with any additional information.

    If you have any other questions, please let me know. Thank you again for your time and patience throughout this issue.

     

    Please don’t forget to click Accept Answer and Yes for "Was this answer helpful" wherever the information provided helps you; this can be beneficial to other community members.

    Thank you.

