Get media transcription, translation, and language identification insights

Warning

Over the past year, Azure AI Video Indexer (VI) announced the removal of its dependency on Azure Media Services (AMS) due to the AMS retirement. Feature adjustments and changes were announced, and a migration guide was provided.

The original deadline to complete migration was June 30, 2024. VI has extended the update/migration deadline: you can update your VI account and opt in to the AMS VI asset migration through July 15, 2024. To use the AMS VI asset migration, you must also extend your AMS account through July. Navigate to your AMS account in the Azure portal and select Click here to extend.

However, if you haven't updated your VI account by June 30, you can't index new videos or play any videos that haven't been migrated. If you update your account after June 30, you can resume indexing immediately, but you can't play videos indexed before the account update until they're migrated through the AMS VI migration.

Media transcription, translation, and language identification

Transcription, translation, and language identification detect, transcribe, and translate the speech in media files into more than 50 languages.

Azure AI Video Indexer (VI) processes the speech in the audio file to extract a transcription, which can then be translated into many languages. When you select translation into a specific language, both the transcription and insights such as keywords, topics, labels, and OCR are translated into that language. The transcription can be used as is or combined with speaker insights that map and assign each transcript line to a speaker. Multiple speakers can be detected in an audio file; an ID is assigned to each speaker and is displayed under their transcribed speech.

Language identification (LID) recognizes the supported dominant spoken language in the video file. For more information, see Applying LID.

Multi-language identification (MLID) automatically recognizes the spoken languages in different segments of the audio file and sends each segment to be transcribed in its identified language. At the end of this process, all transcriptions are combined into the same file. For more information, see Applying MLID. The resulting insights are generated as a categorized list in a JSON file that includes the ID, language, transcribed text, duration, and confidence score.

When indexing media files with multiple speakers, Azure AI Video Indexer performs speaker diarization that identifies each speaker in a video and attributes each transcribed line to a speaker. The speakers are given a unique identity such as Speaker #1 and Speaker #2. This allows for the identification of speakers during conversations and can be useful in various scenarios such as doctor-patient conversations, agent-customer interactions, and court proceedings.
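
As an illustration, here's a minimal sketch, assuming a downloaded insights file named insights.json whose top-level object contains the insights block shown in the example response later in this article, that summarizes the per-segment languages and groups the transcript lines by speaker:

    import json
    from collections import Counter, defaultdict

    # Load the insights JSON downloaded from the portal or returned by the API.
    # Adjust the lookup if your file nests the insights block differently.
    with open("insights.json", encoding="utf-8") as f:
        insights = json.load(f)["insights"]

    # With MLID, each transcript line carries the language it was transcribed in.
    languages = Counter(line["language"] for line in insights["transcript"])
    print("Source languages:", insights["sourceLanguages"])
    print("Transcript lines per language:", dict(languages))

    # Speaker diarization: attribute each transcribed line to its speaker ID.
    lines_by_speaker = defaultdict(list)
    for line in insights["transcript"]:
        lines_by_speaker[line["speakerId"]].append(line["text"])

    for speaker_id, lines in sorted(lines_by_speaker.items()):
        print(f"Speaker #{speaker_id}: {len(lines)} lines")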

Media transcription, translation, and language identification use cases

  • Promoting accessibility by making content available to people with hearing disabilities, using Azure AI Video Indexer to generate speech-to-text transcription and translation into multiple languages.
  • Improving content distribution to a diverse audience in different regions and languages by delivering content in multiple languages using Azure AI Video Indexer's transcription and translation capabilities.
  • Enhancing and improving manual closed caption and subtitle generation by using Azure AI Video Indexer's transcription and translation capabilities and by using the closed captions it generates in one of the supported formats (a conversion sketch follows this list).
  • Using language identification (LID) or multi-language identification (MLID) to transcribe videos in unknown languages, allowing Azure AI Video Indexer to automatically identify the languages appearing in the video and generate the transcription accordingly.
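
To make the captioning use case concrete, here's a minimal sketch, assuming the same insights.json file as above, that converts the transcript lines into SRT-style caption blocks. Video Indexer can also produce caption files for you, so this only illustrates the shape of the data, not the product's captioning pipeline:

    import json

    def to_srt_time(t: str) -> str:
        """Convert an insight timestamp such as 0:00:05.75 to SRT's 00:00:05,750."""
        hms, _, frac = t.partition(".")
        h, m, s = hms.split(":")
        ms = (frac + "000")[:3]  # pad or truncate the fraction to milliseconds
        return f"{int(h):02}:{int(m):02}:{int(s):02},{ms}"

    with open("insights.json", encoding="utf-8") as f:
        insights = json.load(f)["insights"]

    blocks = []
    for line in insights["transcript"]:
        span = line["instances"][0]  # the example response shows one span per line
        blocks.append(
            f"{line['id']}\n"
            f"{to_srt_time(span['start'])} --> {to_srt_time(span['end'])}\n"
            f"{line['text']}\n"
        )
    print("\n".join(blocks))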

View the insight JSON with the web portal

Once you have uploaded and indexed a video, insights are available in JSON format for download using the web portal.

  1. Select the Library tab.
  2. Select media you want to work with.
  3. Select Download, and then select Insights (JSON). The JSON file opens in a new browser tab.
  4. Look for the keys described in the example response.

Use the API

  1. Use the Get Video Index request, as sketched after these steps. We recommend passing &includeSummarizedInsights=false.
  2. Look for the keys described in the example response.
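
A minimal sketch of step 1, using Python's requests library; the location, account ID, video ID, and access token are placeholder values you must supply, and obtaining the access token isn't shown here:

    import requests

    # Placeholder values; supply your own. "trial" works for trial accounts;
    # otherwise use your account's Azure region, for example "eastus".
    LOCATION = "trial"
    ACCOUNT_ID = "<account-id>"
    VIDEO_ID = "<video-id>"
    ACCESS_TOKEN = "<access-token>"  # from the Get Video Access Token request

    url = (
        f"https://api.videoindexer.ai/{LOCATION}/Accounts/{ACCOUNT_ID}"
        f"/Videos/{VIDEO_ID}/Index"
    )
    params = {
        "accessToken": ACCESS_TOKEN,
        # As recommended above, skip the summarized-insights section.
        "includeSummarizedInsights": "false",
    }
    response = requests.get(url, params=params)
    response.raise_for_status()
    video_index = response.json()

The example response includes an insights block like the following: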
    "insights": {
      "version": "1.0.0.0",
      "duration": "0:01:50.486",
      "sourceLanguage": "en-US",
      "sourceLanguages": [
        "en-US"
      ],
      "language": "en-US",
      "languages": [
        "en-US"
      ],
      "transcript": [
        {
          "id": 1,
          "text": "Hi, I'm Doug from office. We're talking about new features that office insiders will see first and I have a program manager,",
          "confidence": 0.8879,
          "speakerId": 1,
          "language": "en-US",
          "instances": [
            {
              "adjustedStart": "0:00:00",
              "adjustedEnd": "0:00:05.75",
              "start": "0:00:00",
              "end": "0:00:05.75"
            }
          ]
        },
        {
          "id": 2,
          "text": "Emily Tran, with office graphics.",
          "confidence": 0.8879,
          "speakerId": 1,
          "language": "en-US",
          "instances": [
            {
              "adjustedStart": "0:00:05.75",
              "adjustedEnd": "0:00:07.01",
              "start": "0:00:05.75",
              "end": "0:00:07.01"
            }
          ]
        },