Transparency note for Azure Video Indexer

Article
06/25/2024

An AI system includes not only the technology, but also the people who will use it, the people who will be affected by it, and the environment in which it is deployed. Creating a system that is fit for its intended purpose requires an understanding of how the technology works, its capabilities and limitations, and how to achieve the best performance.

What is transparency?

Microsoft’s Transparency Notes are intended to help you understand how our AI technology works, the choices system owners can make that influence system performance and behavior, and the importance of thinking about the whole system, including the technology, the people, and the environment. You can use Transparency Notes when developing or deploying your own system or share them with the people who will use or be affected by your system.  

Microsoft’s Transparency Notes are part of a broader effort at Microsoft to put our AI principles into practice.

To find out more, see Microsoft AI principles.

Introduction to Azure AI Video Indexer

Azure AI Video Indexer (VI) is a cloud-based tool that processes and analyzes uploaded video and audio files to generate different types of insights. These insights include detected objects, people, faces, key frames and translations or transcriptions in at least 60 languages. The insights and their time frames are displayed in a categorized list on the Azure AI Video Indexer website where each insight can be seen by pressing its Play button.

While processing the files, the Azure AI Video Indexer employs a portfolio of Microsoft AI algorithms to analyze, categorize, and index the video footage. The resulting insights are then archived and can be comprehensively accessed, shared, and reused. For example, a news media outlet may implement a deep search for insights related to the Empire State Building and then reuse their findings in different movies, trailers, or promos.  

The basics of Azure AI Video Indexer

Azure AI Video Indexer is a cloud-based Azure AI services product that is integrated with Azure AI services. It allows you to upload video and audio files, process them (including running AI models on them), and then save the processed files and resulting data to a cloud-based Azure Media Services account.

To process the media files, Azure AI Video Indexer employs AI technologies like Optical Character Recognition (OCR), Natural Language Processing (NLP), and hierarchical ontology models with voice tonality analysis to extract insights like brands, keywords, topics, and text-based emotion detection.

Azure AI Video Indexer’s capabilities include searching for insights in archives, promoting content accessibility, content moderation and content editing.

Insights categories include:

Insight category	Description
Audio media	For example, transcriptions, translations, audio event detection like clapping and crowd laughter, gun shots and explosions
Video media	For example, faces, clothing detection
Video with audio media	For example, named entities in transcripts, and Optical Character Recognition (OCR), for example, names of locations, people, or brands

For more information, see Introduction to Azure AI Video Indexer.

Key terms and features

Term	Definition
Text-based emotion detection	Emotions such as joy, sadness, anger, and fear that were detected via transcript analysis.
Insight	The information and knowledge derived from processing and analyzing video and audio files. Insights can include detected objects, people, faces, key frames, and translations or transcriptions. To view and download insights via the API, use the Azure AI Video Indexer portal.
Object detection	The ability to identify and find objects in an image or video. For example, a table, chair, or window.
Facial detection	Finds human faces in an image and returns bounding boxes indicating their locations. Face detection models alone do not find individually identifying features, only a bounding box marking the entire face. Facial detection doesn't involve distinguishing one face from another face, predicting or classifying facial attributes, or creating a Face template.
Facial identification	"One-to-many" matching of a face in an unmanipulated image to a set of faces in a secure repository. An example is a touchless access control system in a building that replaces or augments physical cards and badges in which a smart camera captures the face of one person entering a secured door and attempts to find a match from a set of images of faces of individuals who are approved to access the building. This process is implemented by Azure AI Face service and involves the creation of Face templates.
Face template	Unique set of numbers generated from an image or video that represents the distinctive features of a face.
Observed people tracking and matched faces	Features that automatically detect and match people in media files. Observed people tracking and matched faces can be set to display insights on people, their clothing, and the exact time frame of their appearance.
Keyword extraction	The process of automatically detecting insights on the different keywords discussed in media files. Keyword extraction can extract insights in both single-language and multi-language media files.
Deep search	The ability to retrieve only relevant video and audio files from a video library by searching for specific terms within the extracted insights.
Labels	The identification of visual objects and actions appearing in a frame. For example, identifying an object such as a dog, or an action such as running.
Named entities	Feature that uses Natural Language Processing (NLP) to extract insights on the locations, people and brands appearing in audio and images in media files.
Natural Language Processing (NLP)	The processing of human language as it is spoken and written.
Optical Character Recognition (OCR)	Extracts text from images like pictures, street signs, and products in media files to create insights. For more information, see OCR technology.
Hierarchical Ontology Model	A set of concepts or categories in a subject area or domain that possess shared properties and relationships.
Audio effects detection	Feature that detects insights on a variety of acoustic events and classifies them into acoustic categories. Audio effect detection can detect and classify different categories such as laughter, crowd reactions, alarms and/or sirens.
Transcription, translation and language identification	Feature that automatically detects, transcribes, and translates the speech in media files into over 50 languages.
Topics inference	Feature that automatically creates inferred insights derived from the transcribed audio, OCR content in visual text, and celebrities recognized in the video.
Speaker diarization	Feature that identifies each speaker in a video and attributes each transcribed line to a speaker. This allows for the identification of speakers during conversations and can be useful in a variety of scenarios.
Bring Your Own Model	Feature that allows you to send insights and artifacts generated by Azure AI Video Indexer to external AI models.
Textual Video Summarization	Feature that uses artificial intelligence to summarize the content of a video.

Components of Azure AI Video Indexer

During the Azure AI Video Indexer procedure, a media file is processed using Azure APIs to extract different types of insights, as follows:

Component	Definition
Video uploader	The user uploads a media file to be processed by Azure AI Video Indexer.
Insights generation	Azure services APIs, such as Azure AI services OCR and Transcription, extract insights. Internal AI models are run to generate insights like Detected Audio Events, Observed People, Detected Clothing, and Topics.
Insights processing	Additional logic such as confidence level threshold filtering is applied to the output of Insights generation to create the final insights that are then displayed in the Azure AI Video Indexer portal and in the JSON file that can be downloaded from the portal.
Storage	Output from the processed media file is saved in: • Azure Storage • Azure Search, where users can search for videos using specific insights like an actor’s name, a location, or a brand.
Notification	The user receives notification that the indexing process has been completed.
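The confidence level threshold filtering mentioned in the Insights processing step can be pictured with a minimal sketch. The field names (`name`, `confidence`), the sample values, and the threshold of 0.5 are illustrative assumptions, not the service's actual internal logic or schema.

```python
from typing import Any


def filter_by_confidence(
    insights: list[dict[str, Any]], threshold: float
) -> list[dict[str, Any]]:
    """Keep only raw insights whose confidence meets the threshold.

    Items without a "confidence" field are treated as confidence 0.0
    (an assumption made for this sketch).
    """
    return [item for item in insights if item.get("confidence", 0.0) >= threshold]


# Hypothetical raw output of an insights generation step.
raw_labels = [
    {"name": "dog", "confidence": 0.92},
    {"name": "chair", "confidence": 0.41},
    {"name": "window", "confidence": 0.77},
]

final_labels = filter_by_confidence(raw_labels, threshold=0.5)
print([label["name"] for label in final_labels])  # ['dog', 'window']
```

A higher threshold hides more low-confidence results at the cost of possibly dropping correct ones, which is one reason displayed insights can differ from what the underlying models detected.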

Limited Access features of Azure AI Video Indexer

Facial recognition features of Azure AI Video Indexer (including facial detection, facial identification, facial templates, observed people tracking, and matched faces) are Limited Access and are only available to Microsoft managed customers and partners, and only for certain use cases selected at the time of registration. Access to the facial identification and celebrity recognition capabilities requires registration. Facial detection does not require registration. To learn more, visit Microsoft’s Limited Access policy.

Approved commercial use cases for Limited Access features

Facial Identification to search for a face in a media or entertainment video archive: to find a face within a video and generate metadata for media or entertainment use cases only.

Celebrity Recognition: to detect and identify celebrities within images or videos in digital asset management systems, for accessibility and/or media and entertainment use cases only.

Approved public sector use cases for Limited Access features

Facial identification for preservation and enrichment of public media archives: to identify individuals in public media or entertainment video archives for the purposes of preserving and enriching public media only. Examples of public media enrichment include identifying historical figures in video archives and generating descriptive metadata.

Facial identification to:

assist law enforcement or court officials in prosecution or defense of a criminal suspect who has already been apprehended, to the extent specifically authorized by a duly empowered government authority in a jurisdiction that maintains a fair and independent judiciary OR
assist officials of duly empowered international organizations in the prosecution of abuses of international criminal law, international human rights law, or international humanitarian law.

Facial identification for purposes of providing humanitarian aid, or identifying missing persons, deceased persons, or victims of crimes.

Example use cases for Azure AI Video Indexer

Azure AI Video Indexer can be used in multiple scenarios in a variety of industries, such as:  

Creating feature stories at news or media agencies by implementing deep searches for specific people and/or words to find what was said, by whom, where and when. Facial identification capabilities are Limited Access. For more information, visit Microsoft’s Limited Access policy.  
Creating promos and trailers using important moments previously extracted from videos. Azure AI Video Indexer can assist by adding keyframes, scene markers, timestamps and labelling so that content editors invest less time reviewing numerous files.
Promoting accessibility by translating and transcribing audio into multiple languages and adding captions, or by creating a verbal description of footage via OCR processing to enhance accessibility for the visually impaired.
Improving content distribution to a diverse audience in different regions and languages by delivering content in multiple languages using Azure AI Video Indexer’s transcription and translation capabilities.
Enhancing targeted advertising: industries like news media or social media can use Azure AI Video Indexer to extract insights that improve the relevance of targeted advertising.
Enhancing user engagement using metadata, tags, keywords, and embedded customer insights to filter and tailor media to customer preferences.  
Moderating inappropriate content such as banned words using textual and visual content control to tag media as child approved or for adults only.
Accurately and quickly detecting violent incidents by classifying gunshots, explosions, and glass shattering in a smart-city system or in other public environments that include cameras and microphones.
Enhancing compliance with local standards by extracting text in warnings in online instructions and then translating the text, for example, in e-learning instructions for using equipment.
Enhancing and improving manual closed captioning and subtitles generation by leveraging Azure AI Video Indexer’s transcription and translation capabilities and by using the closed captions generated by Azure AI Video Indexer in one of the supported formats.
Transcribing videos in unknown languages by using language identification (LID) or multi language identification (MLID) to allow Azure AI Video Indexer to automatically identify the languages appearing in the video and generate the transcription accordingly.

Considerations when choosing a use case

Avoid using Video Indexer for decisions that may have serious adverse impacts. Decisions based on incorrect output could have serious adverse impacts. Additionally, it is advisable to include human review of decisions that have the potential for serious impacts on individuals.
The Video Indexer text-based emotion detection was not designed to assess employee performance or the emotional state of an individual.
Bring Your Own Model
- Azure AI Video Indexer isn't responsible for the way you use an external AI model. It is your responsibility to ensure that your external AI models are compliant with Responsible Artificial Intelligence standards.
- Azure AI Video Indexer isn't responsible for the custom insights you create while using the Bring Your Own Model feature as they are not generated by Azure Video Indexer models.

Characteristics and limitations of Video Indexer

The intended use of Azure AI Video Indexer is to generate insights from recorded media and entertainment content. Extracted insights are created in a JSON file that lists the insights in categories. Each insight holds a list of unique elements, and each element has its own metadata and a list of its instances. For example, a face might have an ID, a name, a thumbnail, other metadata, and a list of its temporal instances. The output of some insights may also display a confidence score to indicate its accuracy level.
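As a rough illustration of the structure described above (categories, unique elements with metadata, and a list of instances per element), the following sketch walks a small, made-up insights document. The key names (`faces`, `labels`, `instances`, `start`, `end`) and all values are assumptions chosen for the example; check the downloaded JSON file for the actual schema.

```python
import json

# Hypothetical, heavily simplified insights JSON (not real service output).
sample = """
{
  "faces": [
    {
      "id": 1,
      "name": "Unknown #1",
      "confidence": 0.88,
      "instances": [
        {"start": "0:00:02", "end": "0:00:09"},
        {"start": "0:01:15", "end": "0:01:30"}
      ]
    }
  ],
  "labels": [
    {
      "id": 7,
      "name": "dog",
      "instances": [{"start": "0:00:05", "end": "0:00:12"}]
    }
  ]
}
"""


def list_instances(insights: dict) -> list[tuple[str, str, str, str]]:
    """Flatten category -> element -> instance into (category, name, start, end) rows."""
    rows = []
    for category, elements in insights.items():
        for element in elements:
            for instance in element.get("instances", []):
                rows.append(
                    (category, element["name"], instance["start"], instance["end"])
                )
    return rows


rows = list_instances(json.loads(sample))
for row in rows:
    print(row)
```

Flattening the hierarchy this way makes it easy to answer questions such as "when does this element appear?" without depending on any one category's extra metadata.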

A JSON file can be accessed in three ways:

Azure AI Video Indexer portal, an easy-to-use solution that lets you evaluate the product, manage the account, and customize models.  
API integration, via a REST API, which lets you integrate the solution into your apps and infrastructure.  
Embeddable widget, which lets you embed the Azure AI Video Indexer insights, player, and editor experiences into your app to customize the insights displayed in a web interface. For example, the list can be customized to display insights only about people appearing in a video. To find videos that include a specific celebrity, a content editor can implement a deep search using the name appearing in the Face or People insights categories.

Below are some considerations to keep in mind when using Azure AI Video Indexer:

Video

Azure AI Video Indexer only supports the processing of recorded footage, with a storage limit of 30GB and 4 hours for uploaded videos.
When uploading a file always use high-quality video content. The recommended maximum frame size is HD and frame rate is 30 FPS. A frame should contain no more than 10 people. When outputting frames from videos to AI models, only send around 2 or 3 frames per second. Processing 10 or more frames might delay the AI result.
People and faces in videos recorded by cameras that are high-mounted, down-angled or with a wide field of view (FOV) may have fewer pixels which may result in lower accuracy of the generated insights.
When uploading a file always use high quality audio content. At least 1 minute of spontaneous conversational speech is required to perform analysis. Audio effects are detected in non-speech segments only. The minimal duration of a non-speech section is 2 seconds. Voice commands and singing are not supported.
Typically, small people or objects under 200 pixels and people who are seated may not be detected. People wearing similar clothes or uniforms might be detected as being the same person and will be given the same ID number. People or objects that are obstructed may not be detected. Tracks of people with front and back poses may be split into different instances.
An observed person must first be detected and appear in the People category before they are matched. Tracks are optimized to handle observed people who frequently appear in the foreground. Obstructions like overlapping people or faces may cause mismatches between matched people and observed people. Mismatching may occur when different people appear in the same relative spatial position in the frame within a short period.
When detecting clothing, dresses and skirts are categorized as Dresses or Skirts, clothing the same color as a person’s skin is not detected, and a full view of the person is required. To optimize detection, both the upper and lower body should be included in the frame.
When extracting handwritten text, avoid using the OCR results of signatures which are hard to read for both humans and machines. A better way to use OCR is to use it for detecting the presence of a signature for further analysis.
Named entities only detects insights in audio and images. Logos in a brand name may not be detected.
Detectors may misclassify objects in videos that are in a "birds-eye" view as they were trained with a frontal view of objects.
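Some of the video considerations above lend themselves to a simple pre-upload check: the 30 GB and 4 hour limits, the recommended 30 FPS frame rate, and sending only around 2 or 3 frames per second to AI models. The sketch below is one way to express them. The `MediaInfo` type and both helper functions are hypothetical, and 30 GB is assumed to mean 30 × 10^9 bytes.

```python
from dataclasses import dataclass

MAX_SIZE_BYTES = 30 * 10**9       # 30 GB limit (decimal gigabytes assumed)
MAX_DURATION_SECONDS = 4 * 3600   # 4 hour limit
RECOMMENDED_MAX_FPS = 30          # recommendation, not a hard limit


@dataclass
class MediaInfo:
    """Hypothetical container for basic properties of a media file."""
    size_bytes: int
    duration_seconds: float
    fps: float


def check_upload(media: MediaInfo) -> list[str]:
    """Return human-readable problems; an empty list means no issue was found."""
    problems = []
    if media.size_bytes > MAX_SIZE_BYTES:
        problems.append("file is larger than 30 GB")
    if media.duration_seconds > MAX_DURATION_SECONDS:
        problems.append("video is longer than 4 hours")
    if media.fps > RECOMMENDED_MAX_FPS:
        problems.append("frame rate is above the recommended 30 FPS")
    return problems


def sample_frame_indices(
    total_frames: int, source_fps: float, target_fps: float = 2.0
) -> list[int]:
    """Pick evenly spaced frame indices so that about target_fps frames per second remain."""
    step = max(1, round(source_fps / target_fps))
    return list(range(0, total_frames, step))


print(check_upload(MediaInfo(size_bytes=31 * 10**9, duration_seconds=3600, fps=30)))
print(sample_frame_indices(total_frames=90, source_fps=30.0, target_fps=3.0))
```

For a 3 second clip at 30 FPS, sampling at 3 frames per second keeps every tenth frame, so 9 of the 90 frames are sent onward.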

Audio

Avoid using audio with very loud background music or music with repetitive and/or linearly scanned frequency. Audio effects detection is designed for non-speech audio only and therefore cannot classify events in loud music. Music with repetitive and/or linearly scanned frequency may be incorrectly classified as an alarm or siren.
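Because audio effects are detected only in non-speech segments of at least 2 seconds, it can help to estimate which parts of a file are even eligible. The sketch below derives those gaps from a list of speech intervals. The input format (start and end times in seconds) and the function name are assumptions for illustration, not part of the service.

```python
MIN_NON_SPEECH_SECONDS = 2.0  # minimal non-speech duration stated in this note


def non_speech_gaps(
    speech: list[tuple[float, float]],
    total_duration: float,
    min_gap: float = MIN_NON_SPEECH_SECONDS,
) -> list[tuple[float, float]]:
    """Return (start, end) gaps between speech intervals that last at least min_gap seconds."""
    gaps = []
    cursor = 0.0  # end of the latest speech seen so far
    for start, end in sorted(speech):
        if start - cursor >= min_gap:
            gaps.append((cursor, start))
        cursor = max(cursor, end)  # also handles overlapping speech intervals
    if total_duration - cursor >= min_gap:
        gaps.append((cursor, total_duration))
    return gaps


# Hypothetical speech intervals, in seconds, for a 23 second clip.
speech_segments = [(1.0, 4.5), (5.0, 9.0), (12.5, 20.0)]
print(non_speech_gaps(speech_segments, total_duration=23.0))
# [(9.0, 12.5), (20.0, 23.0)]
```

In this example the 1 second lead-in and the half-second pause are too short to qualify, so only two stretches of the clip could yield audio effect insights.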

Textual summarization

Important

When using textual summarization, it's important to note that the system is not intended to replace the full viewing experience, especially for content where details and nuances are crucial. It's also not designed for summarizing highly sensitive or confidential videos where context and privacy are paramount.

Non-English languages: The Textual Video Summary was primarily tested and optimized for the English language. However, it is compatible with all languages supported by the specific GenAI model being used, i.e. GPT3.5 Turbo or GPT4.0. Consequently, when applied to non-English languages, the accuracy and quality of the summaries might vary. To mitigate this limitation, users employing the feature for non-English languages should be extra careful and verify the generated summaries for accuracy and completeness.
Videos with multiple languages: If a video incorporates speech in multiple languages, the Textual Video Summary may struggle to accurately recognize all the languages featured in the video content. Users should be aware of this potential limitation when utilizing the Textual Video Summarization feature for multilingual videos.
Highly specialized or technical videos: Video Summary AI models are typically trained on a wide variety of videos, including news, movies, and other general content. If the video is highly specialized or technical, the model may not be able to accurately extract the summary of the video.
Videos with poor audio quality and no OCR: Textual Video Summary AI models rely on audio (among other insights) and on OCR of the text appearing on screen to extract the summary from the video. If the audio quality is poor and no OCR text is identified, the model may not be able to accurately extract the summary from the video.
Videos with low lighting or fast motion: Videos that are shot in low lighting or have fast motion might be difficult for the model to process, resulting in poor performance.
Videos with uncommon accents or dialects: AI models are typically trained on a wide variety of speech, including different accents and dialects. However, if the video contains speech with an accent or dialect that is not well represented in the training data, the model may struggle to accurately extract the transcript from the video. 
Videos containing harmful content: Videos with harmful or sensitive content may be filtered out and excluded, leading to a partial summary.
User choices and customization: The Textual Summarization feature has settings that allow users to tailor the summarization process to their needs. These include summary length, quality, output format, and formal, casual, short or long text styles. However, these settings also introduce variability in the system’s performance. While customization can enhance your experience, it may also influence the system’s accuracy and efficiency. It’s a balance between personalization and the system’s operational capabilities. The system is expected to be used responsibly, with an understanding of its limitations and the impact of your choices on the final output.

Textual summarization on an Edge device

If you are using the Edge extension, you can generate a summary from the video page in the web portal and use the same functionality such as customizations but there is no option to change the model deployment. Instead, every new extension created will include a local Phi-3-mini-4k-instruct model that is developed by Microsoft. There is no charge for requests to the model.

Specifications

Supported hardware: currently supports only Intel CPU and Nvidia GPU.
- CPU tested on: Standard_F64s_v2 (utilization: ~30-32 cores)
- GPU tested on: Standard_NC6s_v3
Average runtime ranges from 46% to 57% of the video length on CPU, or from 15% to 17% on GPU.
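The runtime figures above translate into a quick back-of-the-envelope estimate. The sketch below assumes the percentages apply linearly to the video duration and uses a hypothetical helper name. Actual runtimes depend on hardware and content, and the figures are averages measured on the listed machines.

```python
# Average runtime as a fraction of video length, taken from the figures above.
RUNTIME_FRACTION = {
    "cpu": (0.46, 0.57),
    "gpu": (0.15, 0.17),
}


def estimate_runtime_minutes(video_minutes: float, device: str) -> tuple[float, float]:
    """Return a (low, high) runtime estimate in minutes for the given device."""
    low, high = RUNTIME_FRACTION[device]
    return (video_minutes * low, video_minutes * high)


cpu_low, cpu_high = estimate_runtime_minutes(60, "cpu")
gpu_low, gpu_high = estimate_runtime_minutes(60, "gpu")
print(f"CPU: {cpu_low:.1f}-{cpu_high:.1f} minutes")  # CPU: 27.6-34.2 minutes
print(f"GPU: {gpu_low:.1f}-{gpu_high:.1f} minutes")  # GPU: 9.0-10.2 minutes
```

For a one-hour video, this suggests roughly half an hour on the tested CPU configuration versus about ten minutes on the tested GPU.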

Known Limitations and Known Issues

CPU: Currently, running VI on AMD CPUs may lead to significantly longer runtimes and is not supported at this time.
The summarization feature is created by an AI language model and serves to provide a general overview. Although we aim for accuracy and reliability, the content may not fully encapsulate the essence of the original material. We recommend that a human reviews and edits the summary before use. It should not be viewed as professional or personalized advice.
The summary results are generally consistent within each summarization setting. However, editing the transcript or reindexing the video may lead to different output results.
Disclaimer for product documentation: When utilizing summarization settings, the Neutral style might occasionally resemble the Formal style. The Casual style might include content-related hashtags. Additionally, in some instances, a “Medium” length summary might be shorter than a “Short” summary.
Videos that have little content (such as very short videos) are typically not summarized to mitigate the potential model inaccuracies that can happen when dealing with short input.
The summary might occasionally include or reference internal instructions provided to it (referred to as “meta-prompt”). This could encompass directives to exclude harmful content.
The length of the summary might influence the level of detail extracted from the video summary. Longer summaries might result in less specific details being included.
The generated summary might contain inaccuracies, such as incorrect identification of gender, age, and other personal characteristics.
If the original video contains inappropriate content, the video summarization output might be affected in the following ways: it might be incomplete, contain disclaimers regarding the inappropriate content, and in certain instances, it might include the actual inappropriate quotes, which may be presented with or without a disclaimer.

Respect privacy

When used responsibly and carefully Azure AI Video Indexer is a valuable tool for many industries. To respect the privacy and safety of others, we recommend the following:  

Always respect an individual’s right to privacy, and only ingest videos for lawful and justifiable purposes.  
Do not purposely disclose inappropriate media showing young children or family members of celebrities or other content that may be detrimental or pose a threat to an individual’s personal freedom.  
Commit to respecting and promoting human rights in the design and deployment of your analyzed media.  
When using 3rd party materials, be aware of any existing copyrights or required permissions before distributing content derived from them.
Always seek legal advice when using media from unknown sources.
Always obtain appropriate legal and professional advice to ensure that your uploaded videos are secured and have adequate controls to preserve the integrity of your content and to prevent unauthorized access.
Provide a feedback channel that allows users and individuals to report issues with the service.  
Be aware of any applicable laws or regulations that exist in your area regarding processing, analyzing, and sharing media containing people.
Keep a human in the loop. Do not use any solution as a replacement for human oversight and decision-making.  
Fully examine and review the potential of any AI model you are using to understand its capabilities and limitations.

For more information, see Microsoft Global Human Rights Statement. 

This article contains basic guidelines for how to use Azure Video Indexer responsibly. To learn more about how to use Video Indexer insights responsibly, jump to the specific article for each of the features below:

Azure AI Video Indexer insights

Next steps

Learn more about responsible AI

Contact us

VI Support visupport@microsoft.com
