What is a Transparency Note?
An AI system includes not only the technology, but also the people who will use it, the people who will be affected by it, and the environment in which it is deployed. Creating a system that is fit for its intended purpose requires an understanding of how the technology works, its capabilities and limitations, and how to achieve the best performance.
Microsoft’s Transparency Notes are intended to help you understand how our AI technology works, the choices system owners can make that influence system performance and behavior, and the importance of thinking about the whole system, including the technology, the people, and the environment. You can use Transparency Notes when developing or deploying your own system or share them with the people who will use or be affected by your system.
Microsoft’s Transparency Notes are part of a broader effort at Microsoft to put our AI principles into practice.
To find out more, see Microsoft AI principles.
Introduction to Azure Video Indexer
Azure Video Indexer (VI) is a cloud-based tool that processes and analyzes uploaded video and audio files to generate different types of insights. These insights include detected objects, people, faces, animated characters, key frames and translations or transcriptions in at least 60 languages. The insights and their time frames are displayed in a categorized list on the Azure Video Indexer website where each insight can be seen by pressing its Play button.
While processing the files, Azure Video Indexer employs a portfolio of Microsoft AI algorithms to analyze, categorize, and index the video footage. The resulting insights are then archived and can be comprehensively accessed, shared, and reused. For example, a news media outlet may implement a deep search for insights related to the Empire State Building and then reuse their findings in different movies, trailers, or promos.
The basics of Azure Video Indexer
Azure Video Indexer is a cloud-based Azure Applied AI Services product that is integrated with Cognitive Services. It allows you to upload video and audio files, process them (including running AI models on them), and then save the processed files and resulting data to a cloud-based Azure Media Services account.
To process the media files, Azure Video Indexer employs AI technologies like Optical Character Recognition (OCR), Natural Language Processing (NLP), and hierarchical ontology models with voice tonality analysis to extract insights like brands, keywords, topics, and emotions.
Azure Video Indexer’s capabilities include searching for insights in archives, promoting content accessibility, content moderation and content editing.
Insights categories include:
|Audio media||For example, transcriptions, translations, audio event detection like clapping and crowd laughter, gun shots and explosions|
|Video media||For example, faces, animated characters, and clothing detection|
|Video with audio media||For example, named entities in transcripts, and Optical Character Recognition (OCR), for example, names of locations, people, or brands|
For more information, see Introduction to Azure Video Indexer.
Key terms and features
|Insight||The information and knowledge derived from processing and analyzing video and audio files. Insights can include detected objects, people, faces, animated characters, key frames, and translations or transcriptions. Insights can be viewed and downloaded through the Azure Video Indexer portal or the API.|
|Object detection||The ability to identify and find objects in an image or video. For example, a table, chair, or window.|
|Facial detection||Finds human faces in an image and returns bounding boxes indicating their locations. Face detection models alone do not find individually identifying features, only a bounding box marking the entire face. Facial detection doesn't involve distinguishing one face from another face, predicting or classifying facial attributes, or creating a Face template.|
|Facial identification||"One-to-many" matching of a face in an unmanipulated image to a set of faces in a secure repository. An example is a touchless access control system in a building that replaces or augments physical cards and badges in which a smart camera captures the face of one person entering a secured door and attempts to find a match from a set of images of faces of individuals who are approved to access the building. This process is implemented by Azure Face service and involves the creation of Face templates.|
|Face template||Unique set of numbers generated from an image or video that represents the distinctive features of a face.|
|Observed people tracking and matched faces||Features that automatically detect and match people in media files. Observed people tracking and matched faces can be set to display insights on people, their clothing, and the exact time frame of their appearance.|
|Keyword extraction||The process of automatically detecting insights on the different keywords discussed in media files. Keyword extraction can extract insights in both single-language and multi-language media files.|
|Deep search||The ability to retrieve only relevant video and audio files from a video library by searching for specific terms within the extracted insights.|
|Labels||The identification of visual objects and actions appearing in a frame. For example, identifying an object such as a dog, or an action such as running.|
|Named entities||Feature that uses Natural Language Processing (NLP) to extract insights on the locations, people and brands appearing in audio and images in media files.|
|Natural Language Processing (NLP)||The processing of human language as it is spoken and written.|
|Optical Character Recognition (OCR)||Extracts text from images like pictures, street signs, and products in media files to create insights. For more information, see OCR technology.|
|Hierarchical Ontology Model||A set of concepts or categories in a subject area or domain that possess shared properties and relationships.|
|Voice Tonality Analysis||An acoustic voice tone spectrogram that detects the emotion of the speaker. For example, happiness, sadness, excitement, or fear in a speaker’s voice.|
|Audio effects detection||Feature that detects insights on a variety of acoustic events and classifies them into acoustic categories. Audio effect detection can detect and classify different categories such as laughter, crowd reactions, alarms and/or sirens.|
|Transcription, translation and language identification||Feature that automatically detects, transcribes, and translates the speech in media files into over 50 languages.|
|Topics inference||Feature that automatically creates inferred insights derived from the transcribed audio, OCR content in visual text, and celebrities recognized in the video.|
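The deep search entry above can be pictured with a small in-memory sketch. This is an illustration only, not the way Azure Video Indexer implements search: the `library` mapping, the video IDs, and the `deep_search` function are hypothetical names introduced here, and each video is assumed to carry a flat list of insight names.

```python
def deep_search(library: dict[str, list[str]], term: str) -> list[str]:
    """Return the IDs of videos whose extracted insight names contain the term.

    `library` is a hypothetical mapping of video ID -> list of insight names.
    Matching is a case-insensitive substring test, chosen for simplicity.
    """
    needle = term.casefold()
    return [
        video_id
        for video_id, insight_names in library.items()
        if any(needle in name.casefold() for name in insight_names)
    ]


# Hypothetical library of already-extracted insights (illustrative data only).
library = {
    "news_clip_01": ["Empire State Building", "skyline", "traffic"],
    "promo_02": ["beach", "sunset"],
    "trailer_03": ["empire state building", "night"],
}

matches = deep_search(library, "Empire State")
print(matches)
```

Only the videos whose insights mention the search term are returned, which is the behavior the definition describes: retrieving relevant files rather than the whole library.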
Components of Azure Video Indexer
During the Azure Video Indexer procedure, a media file is processed using Azure APIs to extract different types of insights, as follows:
|Video uploader||The user uploads a media file to be processed by Azure Video Indexer.|
|Insights generation||Azure service APIs, such as Cognitive Services OCR and Transcription, extract insights. Internal AI models are run to generate insights like Detected Audio Events, Observed People, Detected Clothing, and Topics.|
|Insights processing||Additional logic such as confidence level threshold filtering is applied to the output of Insights generation to create the final insights that are then displayed in the Azure Video Indexer portal and in the JSON file that can be downloaded from the portal.|
|Storage||Output from the processed media file is saved in Azure Storage and in Azure Search, where users can search for videos using specific insights like an actor’s name, a location, or a brand.|
|Notification||The user receives notification that the indexing process has been completed.|
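The confidence level threshold filtering mentioned under Insights processing can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the service's internal logic: the `confidence` and `name` fields, the threshold value, and the choice to keep insights that carry no score are all assumptions made for the example.

```python
from typing import Any


def filter_by_confidence(
    insights: list[dict[str, Any]], threshold: float
) -> list[dict[str, Any]]:
    """Keep insights whose confidence is at or above the threshold.

    Insights without a confidence score are kept unchanged; this is an
    assumption for the sketch, since not every insight reports a score.
    """
    kept = []
    for insight in insights:
        confidence = insight.get("confidence")
        if confidence is None or confidence >= threshold:
            kept.append(insight)
    return kept


# Hypothetical raw output of the insights generation step.
raw_insights = [
    {"name": "dog", "confidence": 0.92},
    {"name": "cat", "confidence": 0.41},
    {"name": "running"},  # no confidence score reported
]

final_insights = filter_by_confidence(raw_insights, threshold=0.5)
print([insight["name"] for insight in final_insights])
```

A higher threshold trades recall for precision: fewer insights survive, but those that do are more likely to be correct.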
Limited Access features of Azure Video Indexer
Facial recognition features of Azure Video Indexer (including facial detection, facial identification, facial templates, observed people tracking, and matched faces) are Limited Access and are only available to Microsoft managed customers and partners, and only for certain use cases selected at the time of registration. Access to the facial identification and celebrity recognition capabilities requires registration. Facial detection does not require registration. To learn more, visit Microsoft’s Limited Access policy.
Approved commercial use cases for Limited Access features
Facial Identification to search for a face in a media or entertainment video archive: to find a face within a video and generate metadata for media or entertainment use cases only.
Celebrity Recognition: to detect and identify celebrities within images or videos in digital asset management systems, for accessibility and/or media and entertainment use cases only.
Approved public sector use cases for Limited Access features
Facial identification for preservation and enrichment of public media archives: to identify individuals in public media or entertainment video archives for the purposes of preserving and enriching public media only. Examples of public media enrichment include identifying historical figures in video archives and generating descriptive metadata.
Facial identification to:
- assist law enforcement or court officials in prosecution or defense of a criminal suspect who has already been apprehended, to the extent specifically authorized by a duly empowered government authority in a jurisdiction that maintains a fair and independent judiciary OR
- assist officials of duly empowered international organizations in the prosecution of abuses of international criminal law, international human rights law, or international humanitarian law.
Facial identification for purposes of providing humanitarian aid, or identifying missing persons, deceased persons, or victims of crimes.
Example use cases for Azure Video Indexer
Azure Video Indexer can be used in multiple scenarios in a variety of industries, such as:
- Creating feature stories at news or media agencies by implementing deep searches for specific people and/or words to find what was said, by whom, where and when. Facial identification capabilities are Limited Access. For more information, visit Microsoft’s Limited Access policy.
- Creating promos and trailers using important moments previously extracted from videos. Azure Video Indexer can assist by adding keyframes, scene markers, timestamps and labelling so that content editors invest less time reviewing numerous files.
- Promoting accessibility by translating and transcribing audio into multiple languages and adding captions, or by creating a verbal description of footage via OCR processing to enhance accessibility for the visually impaired.
- Improving content distribution to a diverse audience in different regions and languages by delivering content in multiple languages using Azure Video Indexer’s transcription and translation capabilities.
- Enhancing targeted advertising: industries like news media or social media can use Azure Video Indexer to extract insights that improve the relevance of targeted advertising.
- Enhancing user engagement using metadata, tags, keywords, and embedded customer insights to filter and tailor media to customer preferences.
- Moderating inappropriate content such as banned words using textual and visual content control to tag media as child approved or for adults only.
- Accurately and quickly detecting violent incidents by classifying gunshots, explosions, and glass shattering in a smart-city system or in other public environments that include cameras and microphones.
- Enhancing compliance with local standards by extracting text in warnings in online instructions and then translating the text, for example, in e-learning instructions for using equipment.
- Enhancing and improving manual closed captioning and subtitles generation by leveraging Azure Video Indexer’s transcription and translation capabilities and by using the closed captions generated by Azure Video Indexer in one of the supported formats.
- Transcribing videos in unknown languages by using language identification (LID) or multi language identification (MLID) to allow Azure Video Indexer to automatically identify the languages appearing in the video and generate the transcription accordingly.
Considerations when choosing a use case
Avoid using Video Indexer for decisions that may have serious adverse impacts, because decisions based on incorrect output could cause such impacts. Additionally, it is advisable to include human review of decisions that have the potential for serious impacts on individuals.
Characteristics and limitations of Video Indexer
The intended use of Azure Video Indexer is to generate insights from recorded media and entertainment content. Extracted insights are created in a JSON file that lists the insights in categories. Each insight holds a list of unique elements, and each element has its own metadata and a list of its instances. For example, a face might have an ID, a name, a thumbnail, other metadata, and a list of its temporal instances. The output of some insights may also display a confidence score to indicate its accuracy level.
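The category, element, and instance layout described above can be illustrated with a short traversal. The JSON below is a hand-written sample, not real service output, and its field names (`faces`, `keywords`, `name`, `confidence`, `instances`, `start`, `end`) are assumptions chosen to mirror the description; check an actual downloaded file for the exact schema.

```python
import json

# Hand-written sample in the shape described above (illustrative only).
sample = json.loads("""
{
  "faces": [
    {
      "id": 1,
      "name": "Unknown #1",
      "confidence": 0.87,
      "instances": [
        {"start": "0:00:05", "end": "0:00:12"},
        {"start": "0:01:30", "end": "0:01:42"}
      ]
    }
  ],
  "keywords": [
    {
      "id": 1,
      "name": "skyline",
      "instances": [{"start": "0:00:20", "end": "0:00:25"}]
    }
  ]
}
""")


def list_instances(insights: dict) -> list[tuple]:
    """Flatten category -> element -> instance into (category, name, start, end) rows."""
    rows = []
    for category, elements in insights.items():
        for element in elements:
            for instance in element.get("instances", []):
                rows.append(
                    (category, element.get("name"), instance["start"], instance["end"])
                )
    return rows


rows = list_instances(sample)
for row in rows:
    print(row)
```

Flattening to one row per temporal instance makes it straightforward to sort by time or to jump to the moment in the video where an element appears.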
A JSON file can be accessed in three ways:
- Azure Video Indexer portal, an easy-to-use solution that lets you evaluate the product, manage the account, and customize models.
- API integration, via a REST API, which lets you integrate the solution into your apps and infrastructure.
- Embeddable widget, which lets you embed the Azure Video Indexer insights, player, and editor experiences into your app to customize the insights displayed in a web interface. For example, the list can be customized to display insights only about people appearing in a video. To find videos that include a specific celebrity, a content editor can implement a deep search using the name appearing in the Face or People insights categories.
Below are some considerations to keep in mind when using Azure Video Indexer:
- Azure Video Indexer only supports the processing of recorded footage, with a limit of 30 GB in size and 4 hours in duration for uploaded videos.
- When uploading a file, always use high-quality video content. The recommended maximum frame size is HD and the recommended frame rate is 30 FPS. A frame should contain no more than 10 people. When outputting frames from videos to AI models, only send around 2 or 3 frames per second. Processing 10 or more frames per second might delay the AI result.
- People and faces in videos recorded by cameras that are high-mounted, down-angled or with a wide field of view (FOV) may have fewer pixels which may result in lower accuracy of the generated insights.
- When uploading a file, always use high-quality audio content. At least 1 minute of spontaneous conversational speech is required to perform analysis. Audio effects are detected in non-speech segments only. The minimal duration of a non-speech section is 2 seconds. Voice commands and singing are not supported.
- Typically, small people or objects under 200 pixels and people who are seated may not be detected. People wearing similar clothes or uniforms might be detected as being the same person and will be given the same ID number. People or objects that are obstructed may not be detected. Tracks of people with front and back poses may be split into different instances.
- An observed person must first be detected and appear in the People category before they are matched. Tracks are optimized to handle observed people who frequently appear in the foreground. Obstructions like overlapping people or faces may cause mismatches between matched people and observed people. Mismatching may occur when different people appear in the same relative spatial position in the frame within a short period.
- When detecting clothing, dresses and skirts are categorized as Dresses or Skirts, clothing the same color as a person’s skin is not detected, and a full view of the person is required. To optimize detection, both the upper and lower body should be included in the frame.
- Avoid using audio with very loud background music or music with repetitive and/or linearly scanned frequency. Audio effects detection is designed for non-speech audio only and therefore cannot classify events in loud music. Music with repetitive and/or linearly scanned frequency may be incorrectly classified as an alarm or siren.
- When extracting handwritten text, avoid using the OCR results of signatures which are hard to read for both humans and machines. A better way to use OCR is to use it for detecting the presence of a signature for further analysis.
- Named entities only detects insights in audio and images. Logos in a brand name may not be detected.
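Some of the numeric limits in the list above lend themselves to a simple pre-upload check and a frame-sampling helper. The sketch below is an assumption-laden illustration, not part of any Azure SDK: the function and constant names are hypothetical, "30 GB" is read as decimal gigabytes, and sampling is done by taking every Nth frame.

```python
MAX_FILE_SIZE_BYTES = 30 * 10**9    # 30 GB, read as decimal gigabytes (assumption)
MAX_DURATION_SECONDS = 4 * 60 * 60  # 4 hours


def check_upload(size_bytes: int, duration_seconds: float) -> list[str]:
    """Return a list of limit violations; an empty list means the file passes."""
    problems = []
    if size_bytes > MAX_FILE_SIZE_BYTES:
        problems.append("file exceeds 30 GB")
    if duration_seconds > MAX_DURATION_SECONDS:
        problems.append("video exceeds 4 hours")
    return problems


def sample_frame_indices(
    total_frames: int, source_fps: float, target_fps: float = 3.0
) -> list[int]:
    """Pick frame indices so that roughly `target_fps` frames per second are kept."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Keep every `step`-th frame; never skip below one frame at a time.
    step = max(1, round(source_fps / target_fps))
    return list(range(0, total_frames, step))


print(check_upload(size_bytes=31 * 10**9, duration_seconds=5 * 3600))
print(sample_frame_indices(total_frames=30, source_fps=30.0, target_fps=3.0))
```

For a 30 FPS source and a target of 3 frames per second, the helper keeps every tenth frame, which stays within the suggested 2 to 3 frames per second.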
When used responsibly and carefully, Azure Video Indexer is a valuable tool for many industries. To respect the privacy and safety of others, we recommend the following:
- Always respect an individual’s right to privacy, and only ingest videos for lawful and justifiable purposes.
- Do not purposely disclose inappropriate media showing young children or family members of celebrities or other content that may be detrimental or pose a threat to an individual’s personal freedom.
- Commit to respecting and promoting human rights in the design and deployment of your analyzed media.
- When using third-party materials, be aware of any existing copyrights or required permissions before distributing content derived from them.
- Always seek legal advice when using media from unknown sources.
- Always obtain appropriate legal and professional advice to ensure that your uploaded videos are secured and have adequate controls to preserve the integrity of your content and to prevent unauthorized access.
- Provide a feedback channel that allows users and individuals to report issues with the service.
- Be aware of any applicable laws or regulations that exist in your area regarding processing, analyzing, and sharing media containing people.
- Keep a human in the loop. Do not use any solution as a replacement for human oversight and decision-making.
- Fully examine and review the potential of any AI model you are using to understand its capabilities and limitations.
For more information, see Microsoft Global Human Rights Statement.
Learn more about responsible AI
- Microsoft Responsible AI principles
- Microsoft Responsible AI resources
- Microsoft principles for developing and deploying facial recognition technology
- Microsoft Azure Learning courses on Responsible AI
- Face Service Transparency Note
Azure Video Indexer insights
- Audio effects detection
- Face detection
- Keywords extraction
- Transcription, translation & language identification
- Labels identification
- Named entities
- Observed people tracking & matched faces
- Topics inference