Azure AI Video Indexer insights

When a video is indexed, Azure AI Video Indexer analyzes the video and audio content by running more than 30 AI models and generates a JSON file containing the video insights, such as transcripts, optical character recognition (OCR) elements, faces, topics, and emotions. Each insight type includes instances with time ranges that show when the insight appears in the video.
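As a rough illustration of how instances and time ranges can be read from an insights file, the following minimal sketch walks a small, hand-written sample. The field names (`videos`, `insights`, `labels`, `instances`, `start`, `end`) and the `h:mm:ss` timestamp format are assumptions about the general shape of the JSON, and the helper names `to_seconds` and `list_instances` are hypothetical. Check them against a real index file before relying on them.

```python
import json

# Hypothetical, heavily trimmed sample. The field names are assumptions about
# the general shape of an insights file, not a complete or authoritative schema.
SAMPLE_INDEX = json.loads("""
{
  "videos": [
    {
      "insights": {
        "labels": [
          {"name": "sunglasses",
           "instances": [{"start": "0:00:05", "end": "0:00:09.5"}]},
          {"name": "swimming",
           "instances": [{"start": "0:01:00", "end": "0:01:30"},
                         {"start": "0:02:10", "end": "0:02:15"}]}
        ]
      }
    }
  ]
}
""")


def to_seconds(timestamp):
    """Convert an 'h:mm:ss[.fraction]' string (assumed format) to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def list_instances(index, insight_type):
    """Return (name, start_seconds, end_seconds) for every instance of one insight type."""
    rows = []
    for video in index.get("videos", []):
        for item in video.get("insights", {}).get(insight_type, []):
            for instance in item.get("instances", []):
                rows.append((item.get("name"),
                             to_seconds(instance["start"]),
                             to_seconds(instance["end"])))
    return rows


for name, start, end in list_instances(SAMPLE_INDEX, "labels"):
    print(f"{name}: {start:.1f}s to {end:.1f}s")
```

Insight types that are missing from the file simply produce an empty list, so the same helper can be pointed at other insight keys without extra checks.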

Use the links in the insights tables to learn how to get the JSON response for each insight, either in the web portal or by using the API.
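For the API route, a minimal sketch of requesting the insights JSON for one video might look like the following. The URL pattern (`https://api.videoindexer.ai/{location}/Accounts/{accountId}/Videos/{videoId}/Index` with an `accessToken` query parameter) is an assumption based on the public Get Video Index operation, and the helper names `build_index_url` and `fetch_index` are hypothetical. Confirm the exact route and authentication flow in the API reference for your account type.

```python
import json
import urllib.parse
import urllib.request

# Assumed API root; verify against the API reference for your account.
API_ROOT = "https://api.videoindexer.ai"


def build_index_url(location, account_id, video_id, access_token):
    """Build the (assumed) Get Video Index URL for a single video."""
    segments = (location, "Accounts", account_id, "Videos", video_id, "Index")
    path = "/".join(urllib.parse.quote(segment, safe="") for segment in segments)
    query = urllib.parse.urlencode({"accessToken": access_token})
    return f"{API_ROOT}/{path}?{query}"


def fetch_index(url):
    """Download and decode the insights JSON. Requires network access and a valid token."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


# Placeholder values for illustration only.
url = build_index_url("trial", "my-account-id", "my-video-id", "my-access-token")
print(url)
```

Only the URL is built here. Calling `fetch_index(url)` with real values would return the parsed JSON as a Python dictionary.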

Insights

Insight Description
Face detection Face detection detects faces in a media file, and then aggregates instances of similar faces into groups. Face detection insights are generated as a categorized list in a JSON file that includes a thumbnail and either a name or an ID for each face. In the web portal, selecting a face’s thumbnail displays information like the name of the person (if they were recognized), the percentage of the video in which the person appears, and the person's biography, if they're a celebrity. You can also scroll between instances in the video where the person appears.
Labels identification Labels identification is an Azure AI Video Indexer feature that identifies visual objects, like sunglasses, and actions, like swimming, that appear in the video footage of a media file. There are many label categories. Once extracted, label instances are displayed in the Insights tab and can be translated into over 50 languages. Selecting a label opens the instance in the media file. Select Play Previous or Play Next to see more instances.
Object detection Azure AI Video Indexer detects objects in videos, such as cars, handbags, backpacks, and laptops.
Observed people detection Observed people detection and matched faces automatically detect and match people in media files. The feature can be set to display insights on people, their clothing, and the exact timeframe of their appearance.
OCR OCR extracts text from images like pictures, street signs and products in media files to create insights.
Post-production: clapper board detection Clapper board detection detects clapper boards used during filming and provides the information detected on the clapper board as metadata, for example, production, roll, scene, and take. Clapper board detection is part of the post-production insights that you can select in the web portal advanced settings when you upload and index the file.
Post-production: digital patterns Digital patterns detection detects color bars used during filming. Digital patterns is part of the post-production insights that you can select in the web portal advanced settings when you upload and index the file.
Scenes, shots and keyframes Scene detection detects when a scene changes in a video based on visual cues. A scene depicts a single event and is composed of a series of related shots. Shots are a series of frames distinguished by visual cues, such as abrupt and gradual transitions in the color scheme of adjacent frames. The shot's metadata includes start and end time, as well as a list of keyframes included in the shot. A keyframe is the frame from a shot that best represents that shot.
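The scene, shot, and keyframe hierarchy described in the table above can be modeled as a small nested data structure. The sketch below is illustrative only: the `Scene` and `Shot` classes, the `group_shots_into_scenes` helper, and the sample time ranges are all hypothetical, and grouping shots by time containment is a simplification for illustration, not a description of how Azure AI Video Indexer detects scenes.

```python
from dataclasses import dataclass, field


@dataclass
class Shot:
    start: float  # seconds
    end: float    # seconds
    keyframes: list = field(default_factory=list)  # keyframe times in seconds


@dataclass
class Scene:
    start: float  # seconds
    end: float    # seconds
    shots: list = field(default_factory=list)


def group_shots_into_scenes(scenes, shots):
    """Attach each shot to the first scene whose time range fully contains it."""
    for shot in shots:
        for scene in scenes:
            if scene.start <= shot.start and shot.end <= scene.end:
                scene.shots.append(shot)
                break
    return scenes


# Hypothetical sample data: two scenes, three shots.
scenes = [Scene(0.0, 20.0), Scene(20.0, 45.0)]
shots = [
    Shot(0.0, 8.0, [3.0]),
    Shot(8.0, 20.0, [10.5, 17.0]),
    Shot(20.0, 45.0, [30.0]),
]

grouped = group_shots_into_scenes(scenes, shots)
for number, scene in enumerate(grouped, start=1):
    keyframe_count = sum(len(shot.keyframes) for shot in scene.shots)
    print(f"Scene {number}: {len(scene.shots)} shots, {keyframe_count} keyframes")
```

Keeping keyframes on the shot and shots on the scene mirrors the description: a scene is a series of related shots, and each shot carries its own start time, end time, and keyframe list.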

Audio insights

Insight Description
Audio effects detection Audio effects detection detects acoustic events and classifies them into categories such as laughter, crowd reactions, and alarms or sirens.
Keywords extraction Keywords extraction detects the different keywords discussed in media files. It extracts insights from both single-language and multi-language media files.
Named entities Named entities extraction uses Natural Language Processing (NLP) to extract insights on the locations, people, and brands appearing in audio and images in media files. The named entities extraction insight uses transcription and optical character recognition (OCR).
Text-based emotion detection Text-based emotion detection detects emotions in a video's transcript lines. Each sentence is classified as Anger, Fear, Joy, or Sad, or as None if no emotion was detected.
Topics inference Topics inference creates inferred insights derived from the transcribed audio, OCR content in visual text, and celebrities recognized in the video using the Video Indexer facial recognition model.
Transcription, translation, and language identification Transcription, translation, and language identification detects, transcribes, and translates the speech in media files into over 50 languages.