The gpt-4o-audio-preview and gpt-4o-mini-audio-preview models introduce the audio modality into the existing /chat/completions API. The audio models expand the potential for AI applications in text and voice-based interactions and audio analysis. Modalities supported in the gpt-4o-audio-preview and gpt-4o-mini-audio-preview models include: text, audio, and text + audio.
Here's a table of the supported modalities with example use cases:
Modality input | Modality output | Example use case |
---|---|---|
Text | Text + audio | Text to speech, audio book generation |
Audio | Text + audio | Audio transcription, audio book generation |
Audio | Text | Audio transcription |
Text + audio | Text + audio | Audio book generation |
Text + audio | Text | Audio transcription |
By using audio generation capabilities, you can achieve more dynamic and interactive AI applications. Models that support audio inputs and outputs allow you to generate spoken audio responses to prompts and use audio inputs to prompt the model.
Currently, only gpt-4o-audio-preview and gpt-4o-mini-audio-preview version 2024-12-17 support audio generation.
For more information about region availability, see the models and versions documentation.
Currently the following voices are supported for audio out: Alloy, Echo, and Shimmer.
The maximum audio file size is 20 MB.
Note
The Realtime API uses the same underlying GPT-4o audio model as the completions API, but is optimized for low-latency, real-time audio interactions.
Support for audio completions was first added in API version 2025-01-01-preview.
To deploy the gpt-4o-mini-audio-preview model in the Azure AI Foundry portal:
Search for and select the gpt-4o-mini-audio-preview model.
Select the gpt-4o-mini-audio-preview model and then select Deploy to selected resource.
Select the 2024-12-17 model version and create the deployment.
Now that you have a deployment of the gpt-4o-mini-audio-preview model, you can interact with it in the Azure AI Foundry portal Chat playground or chat completions API.
To chat with your deployed gpt-4o-mini-audio-preview model in the Chat playground of Azure AI Foundry portal, follow these steps:
Go to the Azure OpenAI Service page in Azure AI Foundry portal. Make sure you're signed in with the Azure subscription that has your Azure OpenAI Service resource and the deployed gpt-4o-mini-audio-preview model.
Select the Chat playground from under Resource playground in the left pane.
Select your deployed gpt-4o-mini-audio-preview model from the Deployment dropdown.
Start chatting with the model and listen to the audio responses.
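You can also call the chat completions API for the deployment directly. The following is a minimal sketch that uses the Azure CLI to get a Microsoft Entra ID token and curl to send a request; it assumes your endpoint is in the AZURE_OPENAI_ENDPOINT environment variable and that your deployment is named gpt-4o-mini-audio-preview:
TOKEN=$(az account get-access-token --resource https://cognitiveservices.azure.com --query accessToken --output tsv)
curl "$AZURE_OPENAI_ENDPOINT/openai/deployments/gpt-4o-mini-audio-preview/chat/completions?api-version=2025-01-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini-audio-preview",
    "modalities": ["text", "audio"],
    "audio": { "voice": "alloy", "format": "wav" },
    "messages": [ { "role": "user", "content": "Is a golden retriever a good family dog?" } ]
  }'
The response includes base64-encoded audio in choices[0].message.audio.data, along with a transcript.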
Reference documentation | Library source code | Package (npm) | Samples
Deploy a gpt-4o-mini-audio-preview model with your Azure OpenAI resource. For more information, see Create a resource and deploy a model with Azure OpenAI.
For the recommended keyless authentication with Microsoft Entra ID, you need to assign the Cognitive Services User role to your user account. You can assign roles in the Azure portal under Access control (IAM) > Add role assignment.
Create a new folder audio-completions-quickstart to contain the application and open Visual Studio Code in that folder with the following command:
mkdir audio-completions-quickstart && cd audio-completions-quickstart
Create the package.json
with the following command:
npm init -y
Install the OpenAI client library for JavaScript with:
npm install openai
For the recommended keyless authentication with Microsoft Entra ID, install the @azure/identity
package with:
npm install @azure/identity
You need to retrieve the following information to authenticate your application with your Azure OpenAI resource:
Variable name | Value |
---|---|
AZURE_OPENAI_ENDPOINT | This value can be found in the Keys and Endpoint section when examining your resource from the Azure portal. |
AZURE_OPENAI_DEPLOYMENT_NAME | This value corresponds to the custom name you chose for your deployment when you deployed a model. This value can be found under Resource Management > Model Deployments in the Azure portal. |
OPENAI_API_VERSION | Learn more about API Versions. You can change the version in code or use an environment variable. |
Learn more about keyless authentication and setting environment variables.
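For example, in a bash shell you can set these as environment variables before running the samples (the values shown are placeholders; use your own endpoint and deployment name):
export AZURE_OPENAI_ENDPOINT="https://<your-resource-name>.openai.azure.com/"
export AZURE_OPENAI_DEPLOYMENT_NAME="gpt-4o-mini-audio-preview"
export OPENAI_API_VERSION="2025-01-01-preview"
On Windows, use setx (or set for the current session) instead of export.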
Caution
To use the recommended keyless authentication with the SDK, make sure that the AZURE_OPENAI_API_KEY
environment variable isn't set.
Create the to-audio.js
file with the following code:
require("dotenv").config();
const { AzureOpenAI } = require("openai");
const { DefaultAzureCredential, getBearerTokenProvider } = require("@azure/identity");
const { writeFileSync } = require("node:fs");
// Keyless authentication
const credential = new DefaultAzureCredential();
const scope = "https://cognitiveservices.azure.com/.default";
const azureADTokenProvider = getBearerTokenProvider(credential, scope);
// Set environment variables or edit the corresponding values here.
const endpoint = process.env.AZURE_OPENAI_ENDPOINT || "AZURE_OPENAI_ENDPOINT";
const deployment = process.env.AZURE_OPENAI_DEPLOYMENT_NAME || "gpt-4o-mini-audio-preview";
const apiVersion = process.env.OPENAI_API_VERSION || "2025-01-01-preview";
const client = new AzureOpenAI({
endpoint,
azureADTokenProvider,
apiVersion,
deployment
});
async function main() {
// Make the audio chat completions request
const response = await client.chat.completions.create({
model: "gpt-4o-mini-audio-preview",
modalities: ["text", "audio"],
audio: { voice: "alloy", format: "wav" },
messages: [
{
role: "user",
content: "Is a golden retriever a good family dog?"
}
]
});
// Inspect returned data
console.log(response.choices[0]);
// Write the output audio data to a file
writeFileSync(
"dog.wav",
Buffer.from(response.choices[0].message.audio.data, 'base64'),
{ encoding: "utf-8" }
);
}
main().catch((err) => {
console.error("Error occurred:", err);
});
module.exports = { main };
Sign in to Azure with the following command:
az login
Run the JavaScript file.
node to-audio.js
Wait a few moments to get the response.
The script generates an audio file named dog.wav in the same directory as the script. The audio file contains the spoken response to the prompt, "Is a golden retriever a good family dog?"
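To listen to the output, open dog.wav in any audio player. For example, if you have FFmpeg installed, you can play it from the terminal with:
ffplay -autoexit dog.wav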
Create the from-audio.js
file with the following code:
require("dotenv").config();
const { AzureOpenAI } = require("openai");
const { DefaultAzureCredential, getBearerTokenProvider } = require("@azure/identity");
const fs = require('fs').promises;
const { writeFileSync } = require("node:fs");
// Keyless authentication
const credential = new DefaultAzureCredential();
const scope = "https://cognitiveservices.azure.com/.default";
const azureADTokenProvider = getBearerTokenProvider(credential, scope);
// Set environment variables or edit the corresponding values here.
const endpoint = process.env.AZURE_OPENAI_ENDPOINT || "AZURE_OPENAI_ENDPOINT";
const apiVersion = "2025-01-01-preview";
const deployment = "gpt-4o-mini-audio-preview";
const client = new AzureOpenAI({
endpoint,
azureADTokenProvider,
apiVersion,
deployment
});
async function main() {
// Buffer the audio for input to the chat completion
const wavBuffer = await fs.readFile("dog.wav");
const base64str = Buffer.from(wavBuffer).toString("base64");
// Make the audio chat completions request
const response = await client.chat.completions.create({
model: "gpt-4o-mini-audio-preview",
modalities: ["text", "audio"],
audio: { voice: "alloy", format: "wav" },
messages: [
{
role: "user",
content: [
{
type: "text",
text: "Describe in detail the spoken audio input."
},
{
type: "input_audio",
input_audio: {
data: base64str,
format: "wav"
}
}
]
}
]
});
console.log(response.choices[0]);
// Write the output audio data to a file
writeFileSync(
"analysis.wav",
Buffer.from(response.choices[0].message.audio.data, 'base64'),
{ encoding: "utf-8" }
);
}
main().catch((err) => {
console.error("Error occurred:", err);
});
module.exports = { main };
Sign in to Azure with the following command:
az login
Run the JavaScript file.
node from-audio.js
Wait a few moments to get the response.
The script generates a transcript of the summary of the spoken audio input. It also generates an audio file named analysis.wav in the same directory as the script. The audio file contains the spoken response to the prompt.
Create the multi-turn.js
file with the following code:
require("dotenv").config();
const { AzureOpenAI } = require("openai");
const { DefaultAzureCredential, getBearerTokenProvider } = require("@azure/identity");
const fs = require('fs').promises;
// Keyless authentication
const credential = new DefaultAzureCredential();
const scope = "https://cognitiveservices.azure.com/.default";
const azureADTokenProvider = getBearerTokenProvider(credential, scope);
// Set environment variables or edit the corresponding values here.
const endpoint = process.env.AZURE_OPENAI_ENDPOINT || "AZURE_OPENAI_ENDPOINT";
const apiVersion = "2025-01-01-preview";
const deployment = "gpt-4o-mini-audio-preview";
const client = new AzureOpenAI({
endpoint,
azureADTokenProvider,
apiVersion,
deployment
});
async function main() {
// Buffer the audio for input to the chat completion
const wavBuffer = await fs.readFile("dog.wav");
const base64str = Buffer.from(wavBuffer).toString("base64");
// Initialize messages with the first turn's user input
const messages = [
{
role: "user",
content: [
{
type: "text",
text: "Describe in detail the spoken audio input."
},
{
type: "input_audio",
input_audio: {
data: base64str,
format: "wav"
}
}
]
}
];
// Get the first turn's response
const response = await client.chat.completions.create({
model: "gpt-4o-mini-audio-preview",
modalities: ["text", "audio"],
audio: { voice: "alloy", format: "wav" },
messages: messages
});
console.log(response.choices[0]);
// Add a history message referencing the previous turn's audio by ID
messages.push({
role: "assistant",
audio: { id: response.choices[0].message.audio.id }
});
// Add a new user message for the second turn
messages.push({
role: "user",
content: [
{
type: "text",
text: "Very concisely summarize the favorability."
}
]
});
// Send the follow-up request with the accumulated messages
const followResponse = await client.chat.completions.create({
model: "gpt-4o-mini-audio-preview",
messages: messages
});
console.log(followResponse.choices[0].message.content);
}
main().catch((err) => {
console.error("Error occurred:", err);
});
module.exports = { main };
Sign in to Azure with the following command:
az login
Run the JavaScript file.
node multi-turn.js
Wait a few moments to get the response.
The script generates a transcript of the summary of the spoken audio input. Then, it makes a multi-turn chat completion to briefly summarize the spoken audio input.
Library source code | Package | Samples
Use this guide to get started generating audio with the Azure OpenAI SDK for Python.
Deploy a gpt-4o-mini-audio-preview model with your Azure OpenAI resource. For more information, see Create a resource and deploy a model with Azure OpenAI.
For the recommended keyless authentication with Microsoft Entra ID, you need to assign the Cognitive Services User role to your user account. You can assign roles in the Azure portal under Access control (IAM) > Add role assignment.
Create a new folder audio-completions-quickstart to contain the application and open Visual Studio Code in that folder with the following command:
mkdir audio-completions-quickstart && cd audio-completions-quickstart
Create a virtual environment. If you already have Python 3.10 or higher installed, you can create a virtual environment using the following commands:
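For example, on Windows:
python -m venv .venv
.venv\scripts\activate
On macOS or Linux:
python3 -m venv .venv
source .venv/bin/activate
These are typical defaults; the exact activation command depends on your operating system and shell.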
Activating the Python environment means that when you run python or pip from the command line, you use the Python interpreter contained in the .venv folder of your application. You can use the deactivate command to exit the Python virtual environment and later reactivate it when needed.
Tip
We recommend that you create and activate a new Python environment to install the packages you need for this tutorial. Don't install packages into your global Python installation. Always use a virtual or conda environment when installing Python packages; otherwise you can break your global installation of Python.
Install the OpenAI client library for Python with:
pip install openai
For the recommended keyless authentication with Microsoft Entra ID, install the azure-identity
package with:
pip install azure-identity
You need to retrieve the following information to authenticate your application with your Azure OpenAI resource:
Variable name | Value |
---|---|
AZURE_OPENAI_ENDPOINT | This value can be found in the Keys and Endpoint section when examining your resource from the Azure portal. |
AZURE_OPENAI_DEPLOYMENT_NAME | This value corresponds to the custom name you chose for your deployment when you deployed a model. This value can be found under Resource Management > Model Deployments in the Azure portal. |
OPENAI_API_VERSION | Learn more about API Versions. You can change the version in code or use an environment variable. |
Learn more about keyless authentication and setting environment variables.
Create the to-audio.py
file with the following code:
import base64
import os
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
token_provider=get_bearer_token_provider(DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default")
# Set environment variables or edit the corresponding values here.
endpoint = os.environ['AZURE_OPENAI_ENDPOINT']
# Keyless authentication
client=AzureOpenAI(
azure_ad_token_provider=token_provider,
azure_endpoint=endpoint,
api_version="2025-01-01-preview"
)
# Make the audio chat completions request
completion=client.chat.completions.create(
model="gpt-4o-mini-audio-preview",
modalities=["text", "audio"],
audio={"voice": "alloy", "format": "wav"},
messages=[
{
"role": "user",
"content": "Is a golden retriever a good family dog?"
}
]
)
print(completion.choices[0])
# Write the output audio data to a file
wav_bytes=base64.b64decode(completion.choices[0].message.audio.data)
with open("dog.wav", "wb") as f:
f.write(wav_bytes)
Run the Python file.
python to-audio.py
Wait a few moments to get the response.
The script generates an audio file named dog.wav in the same directory as the script. The audio file contains the spoken response to the prompt, "Is a golden retriever a good family dog?"
Create the from-audio.py
file with the following code:
import base64
import os
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
token_provider=get_bearer_token_provider(DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default")
# Set environment variables or edit the corresponding values here.
endpoint = os.environ['AZURE_OPENAI_ENDPOINT']
# Keyless authentication
client=AzureOpenAI(
azure_ad_token_provider=token_provider,
azure_endpoint=endpoint,
api_version="2025-01-01-preview"
)
# Read and encode audio file
with open('dog.wav', 'rb') as wav_reader:
encoded_string = base64.b64encode(wav_reader.read()).decode('utf-8')
# Make the audio chat completions request
completion = client.chat.completions.create(
model="gpt-4o-mini-audio-preview",
modalities=["text", "audio"],
audio={"voice": "alloy", "format": "wav"},
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe in detail the spoken audio input."
},
{
"type": "input_audio",
"input_audio": {
"data": encoded_string,
"format": "wav"
}
}
]
},
]
)
print(completion.choices[0].message.audio.transcript)
# Write the output audio data to a file
wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("analysis.wav", "wb") as f:
f.write(wav_bytes)
Run the Python file.
python from-audio.py
Wait a few moments to get the response.
The script generates a transcript of the summary of the spoken audio input. It also generates an audio file named analysis.wav in the same directory as the script. The audio file contains the spoken response to the prompt.
Create the multi-turn.py
file with the following code:
import base64
import os
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
token_provider=get_bearer_token_provider(DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default")
# Set environment variables or edit the corresponding values here.
endpoint = os.environ['AZURE_OPENAI_ENDPOINT']
# Keyless authentication
client=AzureOpenAI(
azure_ad_token_provider=token_provider,
azure_endpoint=endpoint,
api_version="2025-01-01-preview"
)
# Read and encode audio file
with open('dog.wav', 'rb') as wav_reader:
encoded_string = base64.b64encode(wav_reader.read()).decode('utf-8')
# Initialize messages with the first turn's user input
messages = [
{
"role": "user",
"content": [
{ "type": "text", "text": "Describe in detail the spoken audio input." },
{ "type": "input_audio",
"input_audio": {
"data": encoded_string,
"format": "wav"
}
}
]
}]
# Get the first turn's response
completion = client.chat.completions.create(
model="gpt-4o-mini-audio-preview",
modalities=["text", "audio"],
audio={"voice": "alloy", "format": "wav"},
messages=messages
)
print("Get the first turn's response:")
print(completion.choices[0].message.audio.transcript)
print("Add a history message referencing the first turn's audio by ID:")
print(completion.choices[0].message.audio.id)
# Add a history message referencing the first turn's audio by ID
messages.append({
"role": "assistant",
"audio": { "id": completion.choices[0].message.audio.id }
})
# Add the next turn's user message
messages.append({
"role": "user",
"content": "Very briefly, summarize the favorability."
})
# Send the follow-up request with the accumulated messages
completion = client.chat.completions.create(
model="gpt-4o-mini-audio-preview",
messages=messages
)
print("Very briefly, summarize the favorability.")
print(completion.choices[0].message.content)
Run the Python file.
python multi-turn.py
Wait a few moments to get the response.
The script generates a transcript of the summary of the spoken audio input. Then, it makes a multi-turn chat completion to briefly summarize the spoken audio input.
The gpt-4o-audio-preview and gpt-4o-mini-audio-preview models introduce the audio modality into the existing /chat/completions API. The audio models expand the potential for AI applications in text and voice-based interactions and audio analysis. Modalities supported in the gpt-4o-audio-preview and gpt-4o-mini-audio-preview models include: text, audio, and text + audio.
Here's a table of the supported modalities with example use cases:
Modality input | Modality output | Example use case |
---|---|---|
Text | Text + audio | Text to speech, audio book generation |
Audio | Text + audio | Audio transcription, audio book generation |
Audio | Text | Audio transcription |
Text + audio | Text + audio | Audio book generation |
Text + audio | Text | Audio transcription |
By using audio generation capabilities, you can achieve more dynamic and interactive AI applications. Models that support audio inputs and outputs allow you to generate spoken audio responses to prompts and use audio inputs to prompt the model.
Currently, only gpt-4o-audio-preview and gpt-4o-mini-audio-preview version 2024-12-17 support audio generation.
For more information about region availability, see the models and versions documentation.
Currently the following voices are supported for audio out: Alloy, Echo, and Shimmer.
The maximum audio file size is 20 MB.
Note
The Realtime API uses the same underlying GPT-4o audio model as the completions API, but is optimized for low-latency, real-time audio interactions.
Support for audio completions was first added in API version 2025-01-01-preview.
Deploy a gpt-4o-mini-audio-preview model with your Azure OpenAI resource. For more information, see Create a resource and deploy a model with Azure OpenAI.
For the recommended keyless authentication with Microsoft Entra ID, you need to assign the Cognitive Services User role to your user account. You can assign roles in the Azure portal under Access control (IAM) > Add role assignment.
Create a new folder audio-completions-quickstart to contain the application and open Visual Studio Code in that folder with the following command:
mkdir audio-completions-quickstart && cd audio-completions-quickstart
Create a virtual environment. If you already have Python 3.10 or higher installed, you can create a virtual environment using the following commands:
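For example, on Windows:
python -m venv .venv
.venv\scripts\activate
On macOS or Linux:
python3 -m venv .venv
source .venv/bin/activate
These are typical defaults; the exact activation command depends on your operating system and shell.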
Activating the Python environment means that when you run python or pip from the command line, you use the Python interpreter contained in the .venv folder of your application. You can use the deactivate command to exit the Python virtual environment and later reactivate it when needed.
Tip
We recommend that you create and activate a new Python environment to install the packages you need for this tutorial. Don't install packages into your global Python installation. Always use a virtual or conda environment when installing Python packages; otherwise you can break your global installation of Python.
Install the OpenAI client library for Python with:
pip install openai
For the recommended keyless authentication with Microsoft Entra ID, install the azure-identity
package with:
pip install azure-identity
You need to retrieve the following information to authenticate your application with your Azure OpenAI resource:
Variable name | Value |
---|---|
AZURE_OPENAI_ENDPOINT | This value can be found in the Keys and Endpoint section when examining your resource from the Azure portal. |
AZURE_OPENAI_DEPLOYMENT_NAME | This value corresponds to the custom name you chose for your deployment when you deployed a model. This value can be found under Resource Management > Model Deployments in the Azure portal. |
OPENAI_API_VERSION | Learn more about API Versions. You can change the version in code or use an environment variable. |
Learn more about keyless authentication and setting environment variables.
Create the to-audio.py
file with the following code:
import requests
import base64
import os
from azure.identity import DefaultAzureCredential
# Set environment variables or edit the corresponding values here.
endpoint = os.environ['AZURE_OPENAI_ENDPOINT']
# Keyless authentication
credential = DefaultAzureCredential()
token = credential.get_token("https://cognitiveservices.azure.com/.default")
api_version = '2025-01-01-preview'
url = f"{endpoint}/openai/deployments/gpt-4o-mini-audio-preview/chat/completions?api-version={api_version}"
headers= { "Authorization": f"Bearer {token.token}", "Content-Type": "application/json" }
body = {
"modalities": ["audio", "text"],
"model": "gpt-4o-mini-audio-preview",
"audio": {
"format": "wav",
"voice": "alloy"
},
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Is a golden retriever a good family dog?"
}
]
}
]
}
# Make the audio chat completions request
completion = requests.post(url, headers=headers, json=body)
audio_data = completion.json()['choices'][0]['message']['audio']['data']
# Write the output audio data to a file
wav_bytes = base64.b64decode(audio_data)
with open("dog.wav", "wb") as f:
f.write(wav_bytes)
Run the Python file.
python to-audio.py
Wait a few moments to get the response.
The script generates an audio file named dog.wav in the same directory as the script. The audio file contains the spoken response to the prompt, "Is a golden retriever a good family dog?"
Create the from-audio.py
file with the following code:
import requests
import base64
import os
from azure.identity import DefaultAzureCredential
# Set environment variables or edit the corresponding values here.
endpoint = os.environ['AZURE_OPENAI_ENDPOINT']
# Keyless authentication
credential = DefaultAzureCredential()
token = credential.get_token("https://cognitiveservices.azure.com/.default")
# Read and encode audio file
with open('dog.wav', 'rb') as wav_reader:
encoded_string = base64.b64encode(wav_reader.read()).decode('utf-8')
api_version = '2025-01-01-preview'
url = f"{endpoint}/openai/deployments/gpt-4o-mini-audio-preview/chat/completions?api-version={api_version}"
headers= { "Authorization": f"Bearer {token.token}", "Content-Type": "application/json" }
body = {
"modalities": ["audio", "text"],
"model": "gpt-4o-mini-audio-preview",
"audio": {
"format": "wav",
"voice": "alloy"
},
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe in detail the spoken audio input."
},
{
"type": "input_audio",
"input_audio": {
"data": encoded_string,
"format": "wav"
}
}
]
},
]
}
completion = requests.post(url, headers=headers, json=body)
print(completion.json()['choices'][0]['message']['audio']['transcript'])
# Write the output audio data to a file
audio_data = completion.json()['choices'][0]['message']['audio']['data']
wav_bytes = base64.b64decode(audio_data)
with open("analysis.wav", "wb") as f:
f.write(wav_bytes)
Run the Python file.
python from-audio.py
Wait a few moments to get the response.
The script generates a transcript of the summary of the spoken audio input. It also generates an audio file named analysis.wav in the same directory as the script. The audio file contains the spoken response to the prompt.
Create the multi-turn.py
file with the following code:
import requests
import base64
import os
from azure.identity import DefaultAzureCredential
# Set environment variables or edit the corresponding values here.
endpoint = os.environ['AZURE_OPENAI_ENDPOINT']
# Keyless authentication
credential = DefaultAzureCredential()
token = credential.get_token("https://cognitiveservices.azure.com/.default")
api_version = '2025-01-01-preview'
url = f"{endpoint}/openai/deployments/gpt-4o-mini-audio-preview/chat/completions?api-version={api_version}"
headers= { "Authorization": f"Bearer {token.token}", "Content-Type": "application/json" }
# Read and encode audio file
with open('dog.wav', 'rb') as wav_reader:
encoded_string = base64.b64encode(wav_reader.read()).decode('utf-8')
# Initialize messages with the first turn's user input
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe in detail the spoken audio input."
},
{
"type": "input_audio",
"input_audio": {
"data": encoded_string,
"format": "wav"
}
}
]
}]
body = {
"modalities": ["audio", "text"],
"model": "gpt-4o-mini-audio-preview",
"audio": {
"format": "wav",
"voice": "alloy"
},
"messages": messages
}
# Get the first turn's response, including generated audio
completion = requests.post(url, headers=headers, json=body)
print("Get the first turn's response:")
print(completion.json()['choices'][0]['message']['audio']['transcript'])
print("Add a history message referencing the first turn's audio by ID:")
print(completion.json()['choices'][0]['message']['audio']['id'])
# Add a history message referencing the first turn's audio by ID
messages.append({
"role": "assistant",
"audio": { "id": completion.json()['choices'][0]['message']['audio']['id'] }
})
# Add the next turn's user message
messages.append({
"role": "user",
"content": "Very briefly, summarize the favorability."
})
body = {
"model": "gpt-4o-mini-audio-preview",
"messages": messages
}
# Send the follow-up request with the accumulated messages
completion = requests.post(url, headers=headers, json=body)
print("Very briefly, summarize the favorability.")
print(completion.json()['choices'][0]['message']['content'])
Run the Python file.
python multi-turn.py
Wait a few moments to get the response.
The script generates a transcript of the summary of the spoken audio input. Then, it makes a multi-turn chat completion to briefly summarize the spoken audio input.
Reference documentation | Library source code | Package (npm) | Samples
Deploy a gpt-4o-mini-audio-preview model with your Azure OpenAI resource. For more information, see Create a resource and deploy a model with Azure OpenAI.
For the recommended keyless authentication with Microsoft Entra ID, you need to assign the Cognitive Services User role to your user account. You can assign roles in the Azure portal under Access control (IAM) > Add role assignment.
Create a new folder audio-completions-quickstart to contain the application and open Visual Studio Code in that folder with the following command:
mkdir audio-completions-quickstart && cd audio-completions-quickstart
Create the package.json
with the following command:
npm init -y
Update the package.json
to ECMAScript with the following command:
npm pkg set type=module
Install the OpenAI client library for JavaScript with:
npm install openai
For the recommended keyless authentication with Microsoft Entra ID, install the @azure/identity
package with:
npm install @azure/identity
You need to retrieve the following information to authenticate your application with your Azure OpenAI resource:
Variable name | Value |
---|---|
AZURE_OPENAI_ENDPOINT | This value can be found in the Keys and Endpoint section when examining your resource from the Azure portal. |
AZURE_OPENAI_DEPLOYMENT_NAME | This value corresponds to the custom name you chose for your deployment when you deployed a model. This value can be found under Resource Management > Model Deployments in the Azure portal. |
OPENAI_API_VERSION | Learn more about API Versions. You can change the version in code or use an environment variable. |
Learn more about keyless authentication and setting environment variables.
Caution
To use the recommended keyless authentication with the SDK, make sure that the AZURE_OPENAI_API_KEY
environment variable isn't set.
Create the to-audio.ts
file with the following code:
import { writeFileSync } from "node:fs";
import { AzureOpenAI } from "openai/index.mjs";
import {
DefaultAzureCredential,
getBearerTokenProvider,
} from "@azure/identity";
// Set environment variables or edit the corresponding values here.
const endpoint: string = process.env.AZURE_OPENAI_ENDPOINT || "AZURE_OPENAI_ENDPOINT";
const deployment: string = process.env.AZURE_OPENAI_DEPLOYMENT_NAME || "gpt-4o-mini-audio-preview";
const apiVersion: string = process.env.OPENAI_API_VERSION || "2025-01-01-preview";
// Keyless authentication
const getClient = (): AzureOpenAI => {
const credential = new DefaultAzureCredential();
const scope = "https://cognitiveservices.azure.com/.default";
const azureADTokenProvider = getBearerTokenProvider(credential, scope);
const client = new AzureOpenAI({
endpoint: endpoint,
apiVersion: apiVersion,
azureADTokenProvider,
});
return client;
};
const client = getClient();
async function main(): Promise<void> {
// Make the audio chat completions request
const response = await client.chat.completions.create({
model: "gpt-4o-mini-audio-preview",
modalities: ["text", "audio"],
audio: { voice: "alloy", format: "wav" },
messages: [
{
role: "user",
content: "Is a golden retriever a good family dog?"
}
]
});
// Inspect returned data
console.log(response.choices[0]);
// Write the output audio data to a file
if (response.choices[0].message.audio) {
writeFileSync(
"dog.wav",
Buffer.from(response.choices[0].message.audio.data, 'base64'),
{ encoding: "utf-8" }
);
} else {
console.error("Audio data is null or undefined.");
}
}
main().catch((err: Error) => {
console.error("Error occurred:", err);
});
export { main };
Create the tsconfig.json
file to transpile the TypeScript code and copy the following code for ECMAScript.
{
"compilerOptions": {
"module": "NodeNext",
"target": "ES2022", // Supports top-level await
"moduleResolution": "NodeNext",
"skipLibCheck": true, // Avoid type errors from node_modules
"strict": true // Enable strict type-checking options
},
"include": ["*.ts"]
}
Transpile from TypeScript to JavaScript.
tsc
Sign in to Azure with the following command:
az login
Run the code with the following command:
node to-audio.js
Wait a few moments to get the response.
The script generates an audio file named dog.wav in the same directory as the script. The audio file contains the spoken response to the prompt, "Is a golden retriever a good family dog?"
Create the from-audio.ts
file with the following code:
import { AzureOpenAI } from "openai";
import { writeFileSync } from "node:fs";
import { promises as fs } from 'fs';
import {
DefaultAzureCredential,
getBearerTokenProvider,
} from "@azure/identity";
// Set environment variables or edit the corresponding values here.
const endpoint: string = process.env.AZURE_OPENAI_ENDPOINT || "AZURE_OPENAI_ENDPOINT";
const apiVersion: string = "2025-01-01-preview";
const deployment: string = "gpt-4o-mini-audio-preview";
// Keyless authentication
const getClient = (): AzureOpenAI => {
const credential = new DefaultAzureCredential();
const scope = "https://cognitiveservices.azure.com/.default";
const azureADTokenProvider = getBearerTokenProvider(credential, scope);
const client = new AzureOpenAI({
endpoint: endpoint,
apiVersion: apiVersion,
azureADTokenProvider,
});
return client;
};
const client = getClient();
async function main(): Promise<void> {
// Buffer the audio for input to the chat completion
const wavBuffer = await fs.readFile("dog.wav");
const base64str = Buffer.from(wavBuffer).toString("base64");
// Make the audio chat completions request
const response = await client.chat.completions.create({
model: "gpt-4o-mini-audio-preview",
modalities: ["text", "audio"],
audio: { voice: "alloy", format: "wav" },
messages: [
{
role: "user",
content: [
{
type: "text",
text: "Describe in detail the spoken audio input."
},
{
type: "input_audio",
input_audio: {
data: base64str,
format: "wav"
}
}
]
}
]
});
console.log(response.choices[0]);
// Write the output audio data to a file
if (response.choices[0].message.audio) {
writeFileSync("analysis.wav", Buffer.from(response.choices[0].message.audio.data, 'base64'), { encoding: "utf-8" });
}
else {
console.error("Audio data is null or undefined.");
}
}
main().catch((err: Error) => {
console.error("Error occurred:", err);
});
export { main };
Create the tsconfig.json
file to transpile the TypeScript code and copy the following code for ECMAScript.
{
"compilerOptions": {
"module": "NodeNext",
"target": "ES2022", // Supports top-level await
"moduleResolution": "NodeNext",
"skipLibCheck": true, // Avoid type errors from node_modules
"strict": true // Enable strict type-checking options
},
"include": ["*.ts"]
}
Transpile from TypeScript to JavaScript.
tsc
Sign in to Azure with the following command:
az login
Run the code with the following command:
node from-audio.js
Wait a few moments to get the response.
The script generates a transcript of the summary of the spoken audio input. It also generates an audio file named analysis.wav in the same directory as the script. The audio file contains the spoken response to the prompt.
Create the multi-turn.ts
file with the following code:
import { AzureOpenAI } from "openai/index.mjs";
import { promises as fs } from 'fs';
import { ChatCompletionMessageParam } from "openai/resources/index.mjs";
import {
DefaultAzureCredential,
getBearerTokenProvider,
} from "@azure/identity";
// Set environment variables or edit the corresponding values here.
const endpoint: string = process.env.AZURE_OPENAI_ENDPOINT || "AZURE_OPENAI_ENDPOINT";
const apiVersion: string = "2025-01-01-preview";
const deployment: string = "gpt-4o-mini-audio-preview";
// Keyless authentication
const getClient = (): AzureOpenAI => {
const credential = new DefaultAzureCredential();
const scope = "https://cognitiveservices.azure.com/.default";
const azureADTokenProvider = getBearerTokenProvider(credential, scope);
const client = new AzureOpenAI({
endpoint: endpoint,
apiVersion: apiVersion,
azureADTokenProvider,
});
return client;
};
const client = getClient();
async function main(): Promise<void> {
// Buffer the audio for input to the chat completion
const wavBuffer = await fs.readFile("dog.wav");
const base64str = Buffer.from(wavBuffer).toString("base64");
// Initialize messages with the first turn's user input
const messages: ChatCompletionMessageParam[] = [
{
role: "user",
content: [
{
type: "text",
text: "Describe in detail the spoken audio input."
},
{
type: "input_audio",
input_audio: {
data: base64str,
format: "wav"
}
}
]
}
];
// Get the first turn's response
const response = await client.chat.completions.create({
model: "gpt-4o-mini-audio-preview",
modalities: ["text", "audio"],
audio: { voice: "alloy", format: "wav" },
messages: messages
});
console.log(response.choices[0]);
// Add a history message referencing the previous turn's audio by ID
messages.push({
role: "assistant",
audio: response.choices[0].message.audio ? { id: response.choices[0].message.audio.id } : undefined
});
// Add a new user message for the second turn
messages.push({
role: "user",
content: [
{
type: "text",
text: "Very concisely summarize the favorability."
}
]
});
// Send the follow-up request with the accumulated messages
const followResponse = await client.chat.completions.create({
model: "gpt-4o-mini-audio-preview",
messages: messages
});
console.log(followResponse.choices[0].message.content);
}
main().catch((err: Error) => {
console.error("Error occurred:", err);
});
export { main };
Create the tsconfig.json
file to transpile the TypeScript code and copy the following code for ECMAScript.
{
"compilerOptions": {
"module": "NodeNext",
"target": "ES2022", // Supports top-level await
"moduleResolution": "NodeNext",
"skipLibCheck": true, // Avoid type errors from node_modules
"strict": true // Enable strict type-checking options
},
"include": ["*.ts"]
}
Transpile from TypeScript to JavaScript.
tsc
Sign in to Azure with the following command:
az login
Run the code with the following command:
node multi-turn.js
Wait a few moments to get the response.
The script generates a transcript of the summary of the spoken audio input. Then, it makes a multi-turn chat completion to briefly summarize the spoken audio input.
If you want to clean up and remove an Azure OpenAI resource, you can delete the resource. Before deleting the resource, you must first delete any deployed models.
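For example, one way to do this with the Azure CLI (the resource and resource group names are placeholders for your own values):
az cognitiveservices account deployment delete --name <your-resource-name> --resource-group <your-resource-group> --deployment-name gpt-4o-mini-audio-preview
az cognitiveservices account delete --name <your-resource-name> --resource-group <your-resource-group>
You can also delete the deployment and the resource from the Azure portal.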