Enable audio effects detection (preview)

Audio effects detection is one of Azure Video Indexer AI capabilities that detects various acoustics events and classifies them into different acoustic categories (such as dog barking, crowd reactions, laugher and more).

Some scenarios where this feature is useful:

  • Companies with a large set of video archives can easily improve accessibility with audio effects detection. The feature provides more context for persons who are hard of hearing, and enhances video transcription with non-speech effects.
  • In the Media & Entertainment domain, the detection feature can improve efficiency when creating raw data for content creators. Important moments in promos and trailers (such as laughter, crowd reactions, gunshot, or explosion) can be identified by using audio effects detection.
  • In the Public Safety & Justice domain, the feature can detect and classify gunshots, explosions, and glass shattering. It can be implemented in a smart-city system or in other public environments that include cameras and microphones to offer fast and accurate detection of violence incidents.

Supported audio categories

Audio effect detection can detect and classify 7 different categories. In the next table, you can find the different categories split in to the different presets, divided to Standard and Advanced. For more information, see pricing.

Indexing type Standard indexing Advanced indexing
Preset Name "Audio Only"
"Video + Audio"
"Advance Audio"
"Advance Video + Audio"
Appear in insights pane V
Crowd Reactions V
Silence V V
Gunshot or explosion V
Breaking glass V
Alarm or siren V
Laughter V
Dog barking V

Result formats

The audio effects are retrieved in the insights JSON that includes the category ID, type, name, and set of instances per category along with their specific timeframe and confidence score.

The name parameter will be presented in the language in which the JSON was indexed, while the type will always remain the same.

audioEffects: [{
        id: 0,
        type: "Gunshot or explosion",
        name: "Gunshot",
        instances: [{
                confidence: 0.649,
                adjustedStart: "0:00:13.9",
                adjustedEnd: "0:00:14.7",
                start: "0:00:13.9",
                end: "0:00:14.7"
            }, {
                confidence: 0.7706,
                adjustedStart: "0:01:54.3",
                adjustedEnd: "0:01:55",
                start: "0:01:54.3",
                end: "0:01:55"
    }, {
        id: 1,
        type: "CrowdReactions",
        name: "Crowd Reactions",
        instances: [{
                confidence: 0.6816,
                adjustedStart: "0:00:47.9",
                adjustedEnd: "0:00:52.5",
                start: "0:00:47.9",
                end: "0:00:52.5"
                confidence: 0.7314,
                adjustedStart: "0:04:57.67",
                adjustedEnd: "0:05:01.57",
                start: "0:04:57.67",
                end: "0:05:01.57"

How to index audio effects

In order to set the index process to include the detection of audio effects, select one of the Advanced presets under Video + audio indexing menu as can be seen below.

Index Audio Effects image

Closed Caption

When audio effects are retrieved in the closed caption files, they will be retrieved in square brackets the following structure:

Type Example
SRT 00:00:00,000 00:00:03,671
[Gunshot or explosion]
VTT 00:00:00.000 00:00:03.671
[Gunshot or explosion]
TTML Confidence: 0.9047
<p begin="00:00:00.000" end="00:00:03.671">[Gunshot or explosion]</p>
TXT [Gunshot or explosion]
CSV 0.9047,00:00:00.000,00:00:03.671, [Gunshot or explosion]

Audio Effects in closed captions file will be retrieved with the following logic employed:

  • Silence event type will not be added to the closed captions
  • Maximum duration to show an event I 5 seconds
  • Minimum timer duration to show an event is 700 milliseconds

Adding audio effects in closed caption files

Audio effects can be added to the closed captions files supported by Azure Video Indexer via the Get video captions API by choosing true in the includeAudioEffects parameter or via the video.ai website experience by selecting Download -> Closed Captions -> Include Audio Effects.

Audio Effects in CC


When using update transcript from closed caption files or update custom language model from closed caption files, audio effects included in those files will be ignored.

Limitations and assumptions

  • The audio effects are detected when present in non-speech segments only.
  • The model is optimized for cases where there is no loud background music.
  • Low quality audio may impact the detection results.
  • Minimal non-speech section duration is 2 seconds.
  • Music that is characterized with repetitive and/or linearly scanned frequency can be mistakenly classified as Alarm or siren.
  • The model is currently optimized for natural and non-synthetic gunshot and explosions sounds.
  • Door knocks and door slams can sometimes be mistakenly labeled as gunshot and explosions.
  • Prolonged shouting and human physical effort sounds can sometimes be mistakenly detected.
  • Group of people laughing can sometime be classified as both Laughter and Crowd reactions.

Next steps

Review overview