Redact faces with Azure Media Analytics
Overview
Azure Media Redactor is an Azure Media Analytics media processor (MP) that offers scalable face redaction in the cloud. Face redaction enables you to modify your video to blur the faces of selected individuals. You may want to use the face redaction service in public safety and news media scenarios. A few minutes of footage that contains multiple faces can take hours to redact manually, but with this service the process requires just a few simple steps.
This article gives details about Azure Media Redactor and shows how to use it with the Media Services SDK for .NET.
Face redaction modes
Face redaction works by detecting faces in every frame of video and tracking the face object both forward and backward in time, so that the same individual can be blurred from other angles as well. The automated redaction process is complex and does not always produce exactly the desired output; for this reason, Media Analytics provides a couple of ways to modify the final output.
In addition to a fully automatic mode, there is a two-pass workflow that allows the selection and de-selection of found faces via a list of IDs. To make arbitrary per-frame adjustments, the MP also uses a metadata file in JSON format. This workflow is split into Analyze and Redact modes. You can combine the two modes in a single pass that runs both tasks in one job; this mode is called Combined.
Note
The Face Detector media processor was deprecated in June 2020 as part of the Azure Media Services legacy components retirement. Consider using the Azure Media Services v3 API. There is no planned replacement for the China region.
Combined mode
Combined mode produces a redacted MP4 automatically, without any manual input.
Stage | File Name | Notes |
---|---|---|
Input asset | foo.bar | Video in WMV, MOV, or MP4 format |
Input config | Job configuration preset | {'version':'1.0', 'options': {'mode':'combined'}} |
Output asset | foo_redacted.mp4 | Video with blurring applied |
Analyze mode
The analyze pass of the two-pass workflow takes a video input and produces a JSON file of face locations, and jpg images of each detected face.
Stage | File Name | Notes |
---|---|---|
Input asset | foo.bar | Video in WMV, MOV, or MP4 format |
Input config | Job configuration preset | {'version':'1.0', 'options': {'mode':'analyze'}} |
Output asset | foo_annotations.json | Annotation data of face locations in JSON format. This can be edited by the user to modify the blurring bounding boxes. See sample below. |
Output asset | foo_thumb%06d.jpg [foo_thumb000001.jpg, foo_thumb000002.jpg] | A cropped jpg of each detected face, where the number indicates the labelId of the face |
Output example
{
"version": 1,
"timescale": 24000,
"offset": 0,
"framerate": 23.976,
"width": 1280,
"height": 720,
"fragments": [
{
"start": 0,
"duration": 48048,
"interval": 1001,
"events": [
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[
{
"index": 13,
"id": 1138,
"x": 0.29537,
"y": -0.18987,
"width": 0.36239,
"height": 0.80335
},
{
"index": 13,
"id": 2028,
"x": 0.60427,
"y": 0.16098,
"width": 0.26958,
"height": 0.57943
}
],
... truncated
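The coordinates in the annotation data are normalized to the 0.0-1.0 range, so they must be scaled by the frame dimensions from the root of the document to get pixel values. The following is a minimal Python sketch (not part of any official SDK; the `foo_annotations.json` file name and field names come from the tables and sample above) that converts the Analyze-mode bounding boxes to pixels:

```python
import json

def face_pixel_boxes(annotations):
    """Convert normalized face bounding boxes from an Analyze-mode
    annotation document to pixel coordinates."""
    w, h = annotations["width"], annotations["height"]
    boxes = []
    for fragment in annotations["fragments"]:
        for events in fragment["events"]:
            for face in events:  # an empty inner list means no faces detected
                boxes.append({
                    "id": face["id"],
                    "x": face["x"] * w,
                    "y": face["y"] * h,
                    "width": face["width"] * w,
                    "height": face["height"] * h,
                })
    return boxes

# Usage (file name from the Analyze-mode table above):
# with open("foo_annotations.json") as f:
#     for box in face_pixel_boxes(json.load(f)):
#         print(box)
```

Note that, as in the sample above, a normalized coordinate can be slightly negative when a bounding box extends past the frame edge.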
Redact mode
The second pass of the workflow takes a larger number of inputs that must be combined into a single asset.
This includes a list of IDs to blur, the original video, and the annotations JSON. This mode uses the annotations to apply blurring on the input video.
The output from the Analyze pass does not include the original video. The video needs to be uploaded into the input asset for the Redact mode task and selected as the primary file.
Stage | File Name | Notes |
---|---|---|
Input asset | foo.bar | Video in WMV, MOV, or MP4 format. Same video as in step 1. |
Input asset | foo_annotations.json | Annotations metadata file from phase one, with optional modifications. |
Input asset | foo_IDList.txt (Optional) | Optional newline-separated list of face IDs to redact. If left blank, all faces are blurred. |
Input config | Job configuration preset | {'version':'1.0', 'options': {'mode':'redact'}} |
Output asset | foo_redacted.mp4 | Video with blurring applied based on annotations |
Example output
When an IDList is supplied, only the faces with the listed IDs are blurred in the output.
Example foo_IDList.txt
1
2
3
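Rather than writing the IDList by hand, you can derive it from the annotation data. The helper below is a hypothetical sketch (the function name is illustrative; the file layout follows the tables above) that keeps only candidate IDs that actually appear in the annotations:

```python
def write_id_list(annotations, ids_to_blur, path="foo_IDList.txt"):
    """Write a newline-separated IDList for the Redact pass, keeping only
    IDs that actually appear in the annotation data."""
    found = {face["id"]
             for fragment in annotations["fragments"]
             for events in fragment["events"]
             for face in events}
    selected = [i for i in ids_to_blur if i in found]
    with open(path, "w") as f:
        f.write("\n".join(str(i) for i in selected))
    return selected
```

Filtering against the detected IDs avoids silently listing an ID that the Analyze pass never produced.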
Blur types
In Combined or Redact mode, there are five blur types you can choose from via the JSON input configuration: Low, Med, High, Box, and Black. By default, Med is used.
You can find samples of the blur types below.
Example JSON
{
'version':'1.0',
'options': {
'Mode': 'Combined',
'BlurType': 'High'
}
}
Blur type samples: Low, Med, High, Box, Black.
Elements of the output JSON file
The Redaction MP provides high-precision face location detection and tracking that can detect up to 64 human faces in a video frame. Frontal faces provide the best results, while side faces and small faces (less than or equal to 24x24 pixels) are challenging.
The job produces a JSON output file that contains metadata about detected and tracked faces. The metadata includes coordinates indicating the location of faces, as well as a face ID number indicating the tracking of that individual. Face ID numbers may reset when the frontal face is lost or overlapped in the frame, which can result in an individual being assigned multiple IDs.
The output JSON includes the following elements:
Root JSON elements
Element | Description |
---|---|
version | This refers to the version of the Video API. |
timescale | "Ticks" per second of the video. |
offset | This is the time offset for timestamps. In version 1.0 of Video APIs, this will always be 0. In future scenarios we support, this value may change. |
width, height | The width and height of the output video frame, in pixels. |
framerate | Frames per second of the video. |
Fragments JSON elements
Element | Description |
---|---|
start | The start time of the first event in "ticks." |
duration | The length of the fragment, in "ticks." |
index | (Applies to Azure Media Redactor only) defines the frame index of the current event. |
interval | The interval of each event entry within the fragment, in "ticks." |
events | Each event contains the faces detected and tracked within that time duration. It is an array of events. The outer array represents one interval of time. The inner array consists of 0 or more events that happened at that point in time. An empty bracket [] means no faces were detected. |
id | The ID of the face that is being tracked. This number may inadvertently change if a face becomes undetected. A given individual should have the same ID throughout the overall video, but this cannot be guaranteed due to limitations in the detection algorithm (occlusion, etc.). |
x, y | The upper-left X and Y coordinates of the face bounding box, in a normalized scale of 0.0 to 1.0. X and Y coordinates are always relative to the landscape orientation, so if you have a portrait video (or an upside-down one, in the case of iOS), you'll have to transpose the coordinates accordingly. |
width, height | The width and height of the face bounding box in a normalized scale of 0.0 to 1.0. |
facesDetected | This is found at the end of the JSON results and summarizes the number of faces that the algorithm detected during the video. Because the IDs can be reset inadvertently if a face becomes undetected (e.g., the face goes off screen, looks away), this number may not always equal the true number of faces in the video. |
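Putting the timing fields together: the approximate playback time of an event entry is (offset + start + position × interval) / timescale. A minimal sketch, with an illustrative helper name, assuming the field names from the tables above:

```python
def event_time_seconds(annotations, fragment, event_position):
    """Approximate playback time (in seconds) of the event entry at
    event_position within a fragment, computed from tick-based fields."""
    ticks = (annotations["offset"]
             + fragment["start"]
             + event_position * fragment["interval"])
    return ticks / annotations["timescale"]

# With the sample above (timescale 24000, interval 1001), event 13 of the
# first fragment falls at 13 * 1001 / 24000, roughly 0.542 seconds.
```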