Audio effects detection

Important

Because of the Azure Media Services retirement announcement, Azure AI Video Indexer is adjusting some of its features. See Changes related to Azure Media Service (AMS) retirement to understand what this means for your Azure AI Video Indexer account, and see the Preparing for AMS retirement: VI update and migration guide.

Audio effects detection is an Azure AI Video Indexer feature that detects acoustic events in a media file and classifies them into acoustic categories such as laughter, crowd reactions, alarms, and sirens.

On the website, the detected instances are displayed in the Insights tab. They can also be generated as a categorized list in a JSON file that includes the category ID, type, name, and the instances per category, together with their specific timeframes and confidence scores.

Prerequisites

Review transparency note overview

General principles

This article discusses audio effects detection and the key considerations for making use of this technology responsibly. There are many things you need to consider when deciding how to use and implement an AI-powered feature:

  • Does this feature perform well in my scenario? Before deploying audio effects detection into your scenario, test how it performs using real-life data and make sure it can deliver the accuracy you need.
  • Are we equipped to identify and respond to errors? AI-powered products and features won't be 100% accurate, so consider how you'll identify and respond to any errors that may occur.

View the insight

To see the instances on the website, do the following:

  1. When uploading the media file, go to Video + Audio Indexing, or go to Audio Only or Video + Audio and select Advanced.
  2. After the file is uploaded and indexed, go to Insights and scroll to audio effects.

To display the JSON file, do the following:

  1. Select Download -> Insights (JSON).

  2. Copy the audioEffects element, under insights, and paste it into an online JSON viewer.

    "audioEffects": [
      {
        "id": 1,
        "type": "Silence",
        "instances": [
          {
            "confidence": 0,
            "adjustedStart": "0:01:46.243",
            "adjustedEnd": "0:01:50.434",
            "start": "0:01:46.243",
            "end": "0:01:50.434"
          }
        ]
      },
      {
        "id": 2,
        "type": "Speech",
        "instances": [
          {
            "confidence": 0,
            "adjustedStart": "0:00:00",
            "adjustedEnd": "0:01:43.06",
            "start": "0:00:00",
            "end": "0:01:43.06"
          }
        ]
      }
    ],
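To work with this element programmatically instead of in a viewer, a short script can walk the categories and instances. The following is a minimal Python sketch; the insights.json file name and the to_seconds helper are illustrative assumptions, and the videos[0].insights path reflects the standard index layout that contains the excerpt above.

    import json

    def to_seconds(ts: str) -> float:
        # Convert an index timestamp such as "0:01:46.243" to seconds.
        hours, minutes, seconds = ts.split(":")
        return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

    # "insights.json" is an assumed local copy of the downloaded Insights (JSON) file.
    with open("insights.json", encoding="utf-8") as f:
        index = json.load(f)

    # The audioEffects element sits under the video's insights, as in the excerpt above.
    for effect in index["videos"][0]["insights"].get("audioEffects", []):
        for instance in effect["instances"]:
            duration = to_seconds(instance["end"]) - to_seconds(instance["start"])
            print(f"{effect['type']}: {instance['start']} -> {instance['end']} "
                  f"({duration:.1f}s, confidence {instance['confidence']})")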
    

To download the JSON file via the API, use the Azure AI Video Indexer developer portal.
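The index can also be fetched directly over HTTP. The sketch below assumes you already have an account ID, video ID, and access token from the developer portal; it calls the Get Video Index operation, and the placeholder values are yours to replace.

    import requests

    # Assumed placeholders; obtain real values from your account and the developer portal.
    LOCATION = "trial"               # or your Azure region, for example "eastus"
    ACCOUNT_ID = "<account-id>"
    VIDEO_ID = "<video-id>"
    ACCESS_TOKEN = "<access-token>"  # issued through the developer portal

    url = (f"https://api.videoindexer.ai/{LOCATION}/Accounts/{ACCOUNT_ID}"
           f"/Videos/{VIDEO_ID}/Index")
    response = requests.get(url, params={"accessToken": ACCESS_TOKEN})
    response.raise_for_status()

    index = response.json()
    audio_effects = index["videos"][0]["insights"].get("audioEffects", [])
    print(f"Detected {len(audio_effects)} audio effect categories")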

Audio effects detection components

During the audio effects detection procedure, audio in a media file is processed, as follows:

Component | Definition
--------- | ----------
Source file | The user uploads the source file for indexing.
Segmentation | The audio is analyzed, nonspeech audio is identified, and the audio is then split into short overlapping intervals.
Classification | An AI process analyzes each segment and classifies its contents into event categories such as crowd reaction or laughter. A probability list is then created for each event category according to domain-specific rules.
Confidence level | The estimated confidence level of each audio effect is calculated on a scale of 0 to 1. The confidence score represents the certainty in the accuracy of the result. For example, an 82% certainty is represented as a 0.82 score.
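The segmentation step can be pictured as cutting each nonspeech span into short, overlapping windows that the classifier then scores. The window and hop sizes in the following Python sketch are illustrative assumptions, not the service's published parameters; the sketch only shows the windowing arithmetic.

    WINDOW_SECONDS = 2.0   # assumed analysis window length
    HOP_SECONDS = 1.0      # assumed hop size, giving 50% overlap

    def overlapping_windows(start: float, end: float):
        # Yield (start, end) pairs of overlapping windows covering a nonspeech span.
        t = start
        while t + WINDOW_SECONDS <= end:
            yield (t, t + WINDOW_SECONDS)
            t += HOP_SECONDS

    # A nonspeech span from 106.2s to 110.4s, roughly matching the Silence instance above.
    for w_start, w_end in overlapping_windows(106.2, 110.4):
        print(f"classify segment {w_start:.1f}s - {w_end:.1f}s")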

Example use cases

  • Companies with a large video archive can improve accessibility for hearing-impaired audiences by transcribing nonspeech effects and offering that extra context (a caption-generation sketch follows this list).
  • In media and entertainment, content creators can work more efficiently with raw footage because important moments in promos and trailers, such as laughter, crowd reactions, gunshots, or explosions, are identified automatically.
  • Detecting and classifying gunshots, explosions, and glass shattering in a smart-city system, or in other public environments that include cameras and microphones, offers fast and accurate detection of violent incidents.
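As a sketch of the accessibility scenario in the first bullet, detected effects can be turned into nonspeech captions. The WebVTT output format and the choice to skip the Speech and Silence categories are illustrative decisions, assuming audio_effects was extracted as in the earlier snippets.

    def to_vtt_time(ts: str) -> str:
        # Normalize an index timestamp such as "0:01:46.243" to WebVTT "HH:MM:SS.mmm".
        hours, minutes, seconds = ts.split(":")
        return f"{int(hours):02d}:{int(minutes):02d}:{float(seconds):06.3f}"

    def effects_to_vtt(audio_effects) -> str:
        # Build WebVTT cues such as "[Laughter]" from audioEffects instances.
        lines = ["WEBVTT", ""]
        for effect in audio_effects:
            if effect["type"] in ("Speech", "Silence"):
                continue  # caption only the nonspeech sound effects
            for inst in effect["instances"]:
                lines.append(f"{to_vtt_time(inst['start'])} --> {to_vtt_time(inst['end'])}")
                lines.append(f"[{effect['type']}]")
                lines.append("")
        return "\n".join(lines)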

Considerations and limitations when choosing a use case

  • Avoid use of short or low-quality audio. Audio effects detection provides probabilistic and partial data on detected nonspeech audio events, and for accuracy it requires at least 2 seconds of clear nonspeech audio. Voice commands and singing aren't supported.

  • Avoid use of audio with loud background music, or music with repetitive and/or linearly scanned frequencies. Audio effects detection is designed for nonspeech audio only and therefore can't classify events in loud music. Music with repetitive and/or linearly scanned frequencies may be incorrectly classified as an alarm or siren.

  • Carefully consider how this technology is used in law enforcement and similar institutions. To promote more accurate probabilistic data, review the following (a screening sketch follows this list):

    • Audio effects can be detected in nonspeech segments only.
    • The duration of a nonspeech section should be at least 2 seconds.
    • Low quality audio might impact the detection results.
    • Events in loud background music aren't classified.
    • Music with repetitive and/or linearly scanned frequency might be incorrectly classified as an alarm or siren.
    • Knocking on a door or slamming a door might be labeled as a gunshot or explosion.
    • Prolonged shouting or sounds of physical human effort might be incorrectly classified.
    • A group of people laughing might be classified as both laughter and crowd.
    • Natural, nonsynthetic gunshot and explosion sounds are supported.
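In an automated pipeline, the constraints above can be applied as a screening pass before anyone is alerted. A minimal sketch follows; the 0.7 confidence threshold is an assumption to tune against your own data, and flagged events should still go to a human reviewer rather than trigger action directly.

    MIN_DURATION_SECONDS = 2.0   # minimum nonspeech duration, per the guidance above
    MIN_CONFIDENCE = 0.7         # assumed threshold; tune against your own data

    def to_seconds(ts: str) -> float:
        hours, minutes, seconds = ts.split(":")
        return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

    def screen_detections(audio_effects):
        # Keep only detections long and confident enough to forward for human review.
        for effect in audio_effects:
            for inst in effect["instances"]:
                duration = to_seconds(inst["end"]) - to_seconds(inst["start"])
                if duration >= MIN_DURATION_SECONDS and inst["confidence"] >= MIN_CONFIDENCE:
                    yield effect["type"], inst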

When used responsibly and carefully, Azure AI Video Indexer is a valuable tool for many industries. To respect the privacy and safety of others, and to comply with local and global regulations, we recommend the following:  

  • Always respect an individual’s right to privacy, and only ingest audio for lawful and justifiable purposes.  
  • Don't purposely disclose inappropriate audio of young children or family members of celebrities or other content that may be detrimental or pose a threat to an individual’s personal freedom.  
  • Commit to respecting and promoting human rights in the design and deployment of your analyzed audio.  
  • When using third party materials, be aware of any existing copyrights or permissions required before distributing content derived from them. 
  • Always seek legal advice when using audio from unknown sources. 
  • Be aware of any applicable laws or regulations that exist in your area regarding processing, analyzing, and sharing audio containing people. 
  • Keep a human in the loop. Don't use any solution as a replacement for human oversight and decision-making.  
  • Fully examine and review the potential of any AI model you're using to understand its capabilities and limitations.