Azure AI Speech Transcription client library for JavaScript - version 1.0.0-beta.1

The Azure AI Speech Transcription client library provides easy access to Azure's speech-to-text transcription service, enabling you to convert audio to text with high accuracy.

Use the client library to:

  • Transcribe audio files to text
  • Support multiple languages and locales
  • Enable speaker diarization to identify different speakers
  • Apply profanity filtering
  • Use custom speech models
  • Process both local files and remote URLs
  • Use Enhanced Mode for LLM-powered transcription and translation

Key links:

Getting started

Currently supported environments

See our support policy for more details.

Prerequisites

  • An Azure subscription
  • An Azure AI Speech resource

Install the @azure/ai-speech-transcription package

Install the Azure AI Speech Transcription client library for JavaScript with npm:

npm install @azure/ai-speech-transcription

Create and authenticate a TranscriptionClient

To create a client object to access the Transcription API, you need the endpoint of your Speech resource and a credential. You can find the endpoint for your Speech resource in the Azure Portal.

Option 1: API Key Authentication

You can find your Speech resource's API key in the Azure Portal.

import { TranscriptionClient } from "@azure/ai-speech-transcription";
import { AzureKeyCredential } from "@azure/core-auth";

const client = new TranscriptionClient("<endpoint>", new AzureKeyCredential("<api-key>"));

Option 2: Microsoft Entra ID Authentication

For production scenarios, we recommend Microsoft Entra ID authentication with managed identities or service principals. Install the @azure/identity package:

npm install @azure/identity

You will also need to assign the appropriate role (e.g., "Cognitive Services User") to your managed identity or service principal. For more information, see Azure AI Services authentication.

In Node.js and Node-like environments, you can use the DefaultAzureCredential class to authenticate the client.

import { TranscriptionClient } from "@azure/ai-speech-transcription";
import { DefaultAzureCredential } from "@azure/identity";

const client = new TranscriptionClient("<endpoint>", new DefaultAzureCredential());

For browser environments, use the InteractiveBrowserCredential from the @azure/identity package to authenticate.

import { InteractiveBrowserCredential } from "@azure/identity";
import { TranscriptionClient } from "@azure/ai-speech-transcription";

const credential = new InteractiveBrowserCredential({
  tenantId: "<YOUR_TENANT_ID>",
  clientId: "<YOUR_CLIENT_ID>",
});
const client = new TranscriptionClient("<endpoint>", credential);

Service API versions

The client library targets the latest service API version by default. You can select a specific supported API version when instantiating the client:

import { TranscriptionClient, KnownServiceApiVersions } from "@azure/ai-speech-transcription";
import { AzureKeyCredential } from "@azure/core-auth";

const client = new TranscriptionClient("<endpoint>", new AzureKeyCredential("<api-key>"), {
  serviceVersion: KnownServiceApiVersions.V20251015,
});

JavaScript Bundle

To use this client library in the browser, you first need a bundler. For details on how to do this, refer to our bundling documentation.

Key concepts

TranscriptionClient

TranscriptionClient is the primary interface for developers using the Azure AI Speech Transcription client library. It provides two overloaded transcribe methods — one for audio binary data and one for audio URLs.
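As a rough sketch of how the two call shapes differ (the helper below is hypothetical and not part of the library — binary audio is uploaded in the request body, while a string is treated as a remote audio URL for the service to fetch):

```typescript
// Hypothetical helper illustrating how the two transcribe overloads are
// distinguished: by the type of the first argument.
function describeTranscribeInput(input: Uint8Array | string): string {
  if (typeof input === "string") {
    // String input: the URL overload; the service fetches the audio itself.
    return `URL overload: the service fetches audio from ${input}`;
  }
  // Binary input (Buffer/Uint8Array): the upload overload.
  return `binary overload: uploads ${input.byteLength} bytes of audio`;
}

console.log(describeTranscribeInput("https://example.com/audio/sample.wav"));
console.log(describeTranscribeInput(new Uint8Array(1024)));
```

The Examples section below shows both real overloads in use.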

Audio Formats

The service supports various audio formats including WAV, MP3, OGG, FLAC, and more. Audio must be:

  • Shorter than 2 hours in duration
  • Smaller than 250 MB in size
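Because oversized requests are rejected by the service, it can help to validate inputs client-side before uploading. A minimal sketch that mirrors the limits above (the helper is illustrative, not part of the library):

```typescript
// Hypothetical pre-flight check against the documented limits:
// audio must be shorter than 2 hours and smaller than 250 MB.
const MAX_DURATION_MS = 2 * 60 * 60 * 1000;
const MAX_SIZE_BYTES = 250 * 1024 * 1024;

function validateAudio(sizeBytes: number, durationMs?: number): void {
  if (sizeBytes >= MAX_SIZE_BYTES) {
    throw new Error(`Audio is ${sizeBytes} bytes; must be smaller than 250 MB.`);
  }
  if (durationMs !== undefined && durationMs >= MAX_DURATION_MS) {
    throw new Error(`Audio is ${durationMs} ms long; must be shorter than 2 hours.`);
  }
}

// Example: a 5 MB file of unknown duration passes the size check.
validateAudio(5 * 1024 * 1024);
```

You could call such a check on the file's byte length (e.g., from fs.statSync) before invoking transcribe, to fail fast with a clear message instead of a service error.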

Transcription Options

You can customize transcription with options like:

  • Profanity filtering: Control how profanity is handled in transcriptions ("None", "Masked", "Removed", "Tags")
  • Speaker diarization: Identify different speakers in multi-speaker audio (up to 36 speakers)
  • Phrase lists: Provide domain-specific phrases to improve accuracy
  • Language detection: Automatically detect the spoken language, or specify known locales
  • Enhanced mode: Improve transcription quality with LLM-powered processing, translation, and prompt-based customization

Examples

Transcribe a local audio file

The most basic operation is to transcribe an audio file from your local filesystem:

import { TranscriptionClient } from "@azure/ai-speech-transcription";
import { AzureKeyCredential } from "@azure/core-auth";
import { readFileSync } from "node:fs";

const client = new TranscriptionClient("<endpoint>", new AzureKeyCredential("<api-key>"));
const audioFile = readFileSync("path/to/audio.wav");
const result = await client.transcribe(audioFile);
console.log(`Duration: ${result.durationInMs}ms`);
console.log("Transcription:", result.combinedPhrases[0]?.text);

Transcribe audio from a URL

You can transcribe audio directly from a publicly accessible URL without downloading the file first:

import { TranscriptionClient } from "@azure/ai-speech-transcription";
import { AzureKeyCredential } from "@azure/core-auth";

const client = new TranscriptionClient("<endpoint>", new AzureKeyCredential("<api-key>"));
const result = await client.transcribe("https://example.com/audio/sample.wav", {
  locales: ["en-US"],
});
console.log("Transcription:", result.combinedPhrases[0]?.text);

Access individual transcribed words

To access word-level details including timestamps, confidence scores, and individual words:

import { TranscriptionClient } from "@azure/ai-speech-transcription";
import { AzureKeyCredential } from "@azure/core-auth";
import { readFileSync } from "node:fs";

const client = new TranscriptionClient("<endpoint>", new AzureKeyCredential("<api-key>"));
const audioFile = readFileSync("path/to/audio.wav");
const result = await client.transcribe(audioFile);
for (const phrase of result.phrases) {
  console.log(`Phrase: ${phrase.text}`);
  console.log(
    `  Offset: ${phrase.offsetMilliseconds}ms | Duration: ${phrase.durationMilliseconds}ms`,
  );
  console.log(`  Confidence: ${phrase.confidence.toFixed(2)}`);
  // Access individual words in the phrase
  for (const word of phrase.words ?? []) {
    console.log(`    Word: '${word.text}' | Offset: ${word.offsetMilliseconds}ms`);
  }
}

Identify speakers with diarization

Speaker diarization identifies who spoke when in multi-speaker conversations:

import { TranscriptionClient } from "@azure/ai-speech-transcription";
import { AzureKeyCredential } from "@azure/core-auth";
import { readFileSync } from "node:fs";

const client = new TranscriptionClient("<endpoint>", new AzureKeyCredential("<api-key>"));
const audioFile = readFileSync("path/to/conversation.wav");
const result = await client.transcribe(audioFile, {
  diarizationOptions: {
    maxSpeakers: 4, // Expect up to 4 speakers in the conversation
  },
});
for (const phrase of result.phrases) {
  console.log(`Speaker ${phrase.speaker}: ${phrase.text}`);
}

Note: The total number of identified speakers will never exceed maxSpeakers. If the actual audio contains more speakers than specified, the service will consolidate them. Set a reasonable upper bound if you are unsure of the exact count.

Filter profanity

Control how profanity appears in your transcriptions using different filter modes:

import { TranscriptionClient, KnownProfanityFilterModes } from "@azure/ai-speech-transcription";
import { AzureKeyCredential } from "@azure/core-auth";
import { readFileSync } from "node:fs";

const client = new TranscriptionClient("<endpoint>", new AzureKeyCredential("<api-key>"));
const audioFile = readFileSync("path/to/audio.wav");
const result = await client.transcribe(audioFile, {
  profanityFilterMode: KnownProfanityFilterModes.Masked, // Default - profanity replaced with asterisks
});
console.log("Transcription:", result.combinedPhrases[0]?.text);

Available modes:

  • "None": No filtering — profanity appears as spoken
  • "Masked": Profanity replaced with asterisks (e.g., f***)
  • "Removed": Profanity completely removed from text
  • "Tags": Profanity wrapped in XML tags (e.g., <profanity>word</profanity>)

Improve accuracy with custom phrases

Add custom phrases to help the service correctly recognize domain-specific terms, names, and acronyms:

import { TranscriptionClient } from "@azure/ai-speech-transcription";
import { AzureKeyCredential } from "@azure/core-auth";
import { readFileSync } from "node:fs";

const client = new TranscriptionClient("<endpoint>", new AzureKeyCredential("<api-key>"));
const audioFile = readFileSync("path/to/audio.wav");
const result = await client.transcribe(audioFile, {
  phraseList: {
    phrases: ["Contoso", "Jessie", "Rehaan"],
  },
});
console.log("Transcription:", result.combinedPhrases[0]?.text);

Transcribe with a known language

When you know the language of the audio, specifying a single locale improves accuracy and reduces latency:

import { TranscriptionClient } from "@azure/ai-speech-transcription";
import { AzureKeyCredential } from "@azure/core-auth";
import { readFileSync } from "node:fs";

const client = new TranscriptionClient("<endpoint>", new AzureKeyCredential("<api-key>"));
const audioFile = readFileSync("path/to/english-audio.mp3");
const result = await client.transcribe(audioFile, {
  locales: ["en-US"],
});
console.log("Transcription:", result.combinedPhrases[0]?.text);

If you are not sure which language is spoken, specify multiple candidate locales and the service will automatically detect the language:

import { TranscriptionClient } from "@azure/ai-speech-transcription";
import { AzureKeyCredential } from "@azure/core-auth";
import { readFileSync } from "node:fs";

const client = new TranscriptionClient("<endpoint>", new AzureKeyCredential("<api-key>"));
const audioFile = readFileSync("path/to/audio.mp3");
const result = await client.transcribe(audioFile, {
  locales: ["en-US", "es-ES"],
});
for (const phrase of result.phrases) {
  console.log(`[${phrase.locale}] ${phrase.text}`);
}

Use Enhanced Mode for highest accuracy

Enhanced Mode uses LLM-powered processing for the highest accuracy transcription:

import { TranscriptionClient, KnownProfanityFilterModes } from "@azure/ai-speech-transcription";
import { AzureKeyCredential } from "@azure/core-auth";
import { readFileSync } from "node:fs";

const client = new TranscriptionClient("<endpoint>", new AzureKeyCredential("<api-key>"));
const audioFile = readFileSync("path/to/audio.wav");
const result = await client.transcribe(audioFile, {
  // Enhanced mode: LLM-powered speech recognition with prompt customization
  enhancedMode: {
    task: "transcribe",
    prompt: ["Output must be in lexical format."],
  },
  // Existing Fast Transcription options work alongside enhanced mode
  diarizationOptions: {
    maxSpeakers: 2,
  },
  profanityFilterMode: KnownProfanityFilterModes.Masked,
  activeChannels: [0, 1],
});
for (const phrase of result.phrases) {
  console.log(`[Speaker ${phrase.speaker}] ${phrase.text}`);
}

Translate with Enhanced Mode

Enhanced Mode also supports translating speech to a target language:

import { TranscriptionClient, KnownProfanityFilterModes } from "@azure/ai-speech-transcription";
import { AzureKeyCredential } from "@azure/core-auth";
import { readFileSync } from "node:fs";

const client = new TranscriptionClient("<endpoint>", new AzureKeyCredential("<api-key>"));
const audioFile = readFileSync("path/to/chinese-audio.wav");
const result = await client.transcribe(audioFile, {
  enhancedMode: {
    task: "translate",
    targetLanguage: "ko", // Translate to Korean
  },
  profanityFilterMode: KnownProfanityFilterModes.Masked,
});
console.log("Translated to Korean:", result.combinedPhrases[0]?.text);

Combine multiple options

You can combine multiple transcription features for complex scenarios:

import { TranscriptionClient, KnownProfanityFilterModes } from "@azure/ai-speech-transcription";
import { AzureKeyCredential } from "@azure/core-auth";
import { readFileSync } from "node:fs";

const client = new TranscriptionClient("<endpoint>", new AzureKeyCredential("<api-key>"));
const audioFile = readFileSync("path/to/meeting.wav");
const result = await client.transcribe(audioFile, {
  // Enable speaker diarization
  diarizationOptions: {
    maxSpeakers: 5,
  },
  // Mask profanity
  profanityFilterMode: KnownProfanityFilterModes.Masked,
  // Add custom phrases
  phraseList: {
    phrases: ["action items", "Q4", "KPIs"],
  },
});
console.log("Full Transcript:");
console.log(result.combinedPhrases[0]?.text);
for (const phrase of result.phrases) {
  console.log(`Speaker ${phrase.speaker}: ${phrase.text}`);
}

Troubleshooting

Common issues

  • Authentication failures: Verify your API key or Entra ID credentials are correct and that your Speech resource is active.
  • Unsupported audio format: Ensure your audio is in a supported format (WAV, MP3, OGG, FLAC, etc.). The service automatically handles format detection.
  • Slow transcription: For large files, ensure your network connection is stable.
  • Poor accuracy: Try specifying the correct locale, adding custom phrases for domain-specific terms, or using Enhanced Mode.
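The bullets above can be folded into a quick triage step when a transcribe call throws. A hedged sketch (the status-code mapping is illustrative; the exact error shape depends on the service response):

```typescript
// Hypothetical triage helper mapping common HTTP status codes from a failed
// transcribe call to the troubleshooting hints above.
function troubleshootingHint(statusCode: number): string {
  switch (statusCode) {
    case 401:
    case 403:
      return "Authentication failure: verify your API key or Entra ID credentials and that the Speech resource is active.";
    case 400:
      return "Bad request: check that the audio format is supported and within the size and duration limits.";
    case 429:
      return "Throttled: slow down the request rate or review your resource's quota.";
    default:
      return `Unexpected error (HTTP ${statusCode}): enable logging with AZURE_LOG_LEVEL=info for details.`;
  }
}

console.log(troubleshootingHint(401));
```

In practice you would call this from a catch block around client.transcribe, reading the status code off the thrown error.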

Logging

Enabling logging may help uncover useful information about failures. To see a log of HTTP requests and responses, set the AZURE_LOG_LEVEL environment variable to info. Alternatively, logging can be enabled at runtime by calling setLogLevel from the @azure/logger package:

import { setLogLevel } from "@azure/logger";

setLogLevel("info");

For more detailed instructions on enabling logs, see the @azure/logger package docs.

Next steps

Explore additional samples to learn more about advanced features.

Contributing

If you'd like to contribute to this library, please read the contributing guide to learn more about how to build and test the code.