Different results for batch speech to text in Speech Studio and the API

Sidharth Ajul 40 Reputation points
2025-02-13T11:52:21.3666667+00:00

Hi, I ran a dual-channel Hindi audio file through batch speech to text in both Speech Studio and the API, and I consistently get different results. The Speech Studio output is the correct one: it recognises each phrase in a channel separately. The API, for which I copied the parameters from Speech Studio, returns all the recognised phrases of a channel merged together.

For example, take this conversation: Channel_1: "Hello" Channel_0: "Hi" Channel_1: "How are you" Channel_0: "Im good thanks". Speech Studio returns it like that, but the API returns: Channel_1: "Hello How are you" Channel_0: "Hi Im good thanks".

I tried to recombine the phrases using the offsets, but that did not work either; the results from Speech Studio and the API are clearly different. I tried both v3.2 and the new api-version=2024-11-15.

Speech Studio output:

  "recognizedPhrases": [
    {
      "recognitionStatus": "Success",
      "channel": 0,
      "offset": "PT1.71S",
      "duration": "PT0.36S",
      "offsetInTicks": 17100000,
      "durationInTicks": 3600000,
      "durationMilliseconds": 360,
      "offsetMilliseconds": 1710,
      "nBest": [
        {
          "confidence": 0.6302351,
          "lexical": "हैलो",
          "itn": "हैलो",
          "maskedITN": "हैलो।",
          "display": "हैलो।",
          "words": [
            {
              "word": "हैलो",
              "offset": "PT1.71S",
              "duration": "PT0.36S",
              "offsetInTicks": 17100000,
              "durationInTicks": 3600000,
              "durationMilliseconds": 360,
              "offsetMilliseconds": 1710,
              "confidence": 0.6302351
            }
          ],
          "displayWords": [
            {
              "displayText": "हैलो।",
              "offset": "PT1.71S",
              "duration": "PT0.36S",
              "offsetInTicks": 17100000,
              "durationInTicks": 3600000,
              "durationMilliseconds": 360,
              "offsetMilliseconds": 1710
            }
          ]
        },
        {
          "confidence": 0.6302351,
          "lexical": "हैलो",
          "itn": "हैलो",
          "maskedITN": "हैलो",
          "display": "हैलो",
          "words": [
            {
              "word": "हैलो",
              "offset": "PT1.71S",
              "duration": "PT0.36S",
              "offsetInTicks": 17100000,
              "durationInTicks": 3600000,
              "durationMilliseconds": 360,
              "offsetMilliseconds": 1710,
              "confidence": 0.6302351
            }
          ]
        },
        {
          "confidence": 0.6302351,
          "lexical": "हैलो",
          "itn": "हैलो",
          "maskedITN": "हैलो",
          "display": "हैलो",
          "words": [
            {
              "word": "हैलो",
              "offset": "PT1.71S",
              "duration": "PT0.36S",
              "offsetInTicks": 17100000,
              "durationInTicks": 3600000,
              "durationMilliseconds": 360,
              "offsetMilliseconds": 1710,
              "confidence": 0.6302351
            }
          ]
        },
        {
          "confidence": 0.6302351,
          "lexical": "हैलो",
          "itn": "हैलो",
          "maskedITN": "हैलो",
          "display": "हैलो",
          "words": [
            {
              "word": "हैलो",
              "offset": "PT0.23S",
              "duration": "PT0.16S",
              "offsetInTicks": 2300000,
              "durationInTicks": 1600000,
              "durationMilliseconds": 160,
              "offsetMilliseconds": 230,
              "confidence": 0.6302351
            }
          ]
        },
        {
          "confidence": 0.37863052,
          "lexical": "हैलो हैलो",
          "itn": "हैलो हैलो",
          "maskedITN": "हैलो हैलो",
          "display": "हैलो हैलो",
          "words": [
            {
              "word": "हैलो",
              "offset": "PT0.23S",
              "duration": "PT0.16S",
              "offsetInTicks": 2300000,
              "durationInTicks": 1600000,
              "durationMilliseconds": 160,
              "offsetMilliseconds": 230,
              "confidence": 0.0028616383
            },
            {
              "word": "हैलो",
              "offset": "PT1.71S",
              "duration": "PT0.36S",
              "offsetInTicks": 17100000,
              "durationInTicks": 3600000,
              "durationMilliseconds": 360,
              "offsetMilliseconds": 1710,
              "confidence": 0.7543994
            }
          ]
        }
      ]
    }

API output:

  "recognizedPhrases": [
    {
      "recognitionStatus": "Success",
      "channel": 0,
      "offset": "PT1.72S",
      "duration": "PT14.72S",
      "offsetInTicks": 17200000,
      "durationInTicks": 147200000,
      "durationMilliseconds": 14720,
      "offsetMilliseconds": 1720,
      "nBest": [
        {
          "confidence": 0.5169891,
          "lexical": "हैलो हैलो जी जी हैलो मैडम हैलो",
          "itn": "हैलो हैलो जी जी हैलो मैडम हैलो",
          "maskedITN": "हैलो हैलो जी जी हैलो मैडम हैलो",
          "display": "हैलो हैलो जी जी हैलो मैडम हैलो?",
          "displayWords": [
            {
              "displayText": "हैलो",
              "offset": "PT1.72S",
              "duration": "PT0.32S",
              "offsetInTicks": 17200000,
              "durationInTicks": 3200000,
              "durationMilliseconds": 320,
              "offsetMilliseconds": 1720
            },
            {
              "displayText": "हैलो",
              "offset": "PT4.52S",
              "duration": "PT0.52S",
              "offsetInTicks": 45200000,
              "durationInTicks": 5200000,
              "durationMilliseconds": 520,
              "offsetMilliseconds": 4520
            },
            {
              "displayText": "जी",
              "offset": "PT8.12S",
              "duration": "PT0.48S",
              "offsetInTicks": 81200000,
              "durationInTicks": 4800000,
              "durationMilliseconds": 480,
              "offsetMilliseconds": 8120
            },
            {
              "displayText": "जी",
              "offset": "PT11.28S",
              "duration": "PT0.44S",
              "offsetInTicks": 112800000,
              "durationInTicks": 4400000,
              "durationMilliseconds": 440,
              "offsetMilliseconds": 11280
            },
            {
              "displayText": "हैलो",
              "offset": "PT14.32S",
              "duration": "PT0.52S",
              "offsetInTicks": 143200000,
              "durationInTicks": 5200000,
              "durationMilliseconds": 520,
              "offsetMilliseconds": 14320
            },
            {
              "displayText": "मैडम",
              "offset": "PT14.84S",
              "duration": "PT0.24S",
              "offsetInTicks": 148400000,
              "durationInTicks": 2400000,
              "durationMilliseconds": 240,
              "offsetMilliseconds": 14840
            },
            {
              "displayText": "हैलो?",
              "offset": "PT15.92S",
              "duration": "PT0.52S",
              "offsetInTicks": 159200000,
              "durationInTicks": 5200000,
              "durationMilliseconds": 520,
              "offsetMilliseconds": 15920
            }
          ]
        }
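
If the merged API result has to be used as is, one possible workaround is to split each long phrase at the silences between its words and then order the pieces from both channels by start time. The following is a minimal sketch only: the helper names (split_by_gap, phrase_segments, interleave) are hypothetical, and the 1000 ms gap threshold is an arbitrary assumption, not a service setting. It reads the displayWords timings shown in the API output above.

```python
from typing import Dict, List, Tuple


def split_by_gap(display_words: List[Dict], max_gap_ms: int = 1000) -> List[List[Dict]]:
    """Group words; start a new group when the silence before a word exceeds max_gap_ms."""
    groups: List[List[Dict]] = []
    for word in sorted(display_words, key=lambda w: w["offsetMilliseconds"]):
        if groups:
            prev = groups[-1][-1]
            prev_end = prev["offsetMilliseconds"] + prev["durationMilliseconds"]
            if word["offsetMilliseconds"] - prev_end <= max_gap_ms:
                groups[-1].append(word)
                continue
        groups.append([word])
    return groups


def phrase_segments(phrase: Dict, max_gap_ms: int = 1000) -> List[Tuple[int, int, str]]:
    """Return (start_ms, channel, text) tuples for one recognizedPhrases entry."""
    words = phrase["nBest"][0]["displayWords"]
    return [
        (group[0]["offsetMilliseconds"], phrase["channel"],
         " ".join(w["displayText"] for w in group))
        for group in split_by_gap(words, max_gap_ms)
    ]


def interleave(phrases: List[Dict], max_gap_ms: int = 1000) -> List[Tuple[int, int, str]]:
    """Split every phrase, then sort all segments of all channels by start time."""
    segments: List[Tuple[int, int, str]] = []
    for phrase in phrases:
        segments.extend(phrase_segments(phrase, max_gap_ms))
    return sorted(segments)


# Word timings copied from the API output above (channel 0 only).
timings = [
    ("हैलो", 1720, 320), ("हैलो", 4520, 520), ("जी", 8120, 480), ("जी", 11280, 440),
    ("हैलो", 14320, 520), ("मैडम", 14840, 240), ("हैलो?", 15920, 520),
]
api_phrase = {
    "channel": 0,
    "nBest": [{"displayWords": [
        {"displayText": t, "offsetMilliseconds": o, "durationMilliseconds": d}
        for t, o, d in timings
    ]}],
}

for start_ms, channel, text in interleave([api_phrase]):
    print(f"{start_ms:>6} ms  Channel_{channel}: {text}")
```

With the timings above and the assumed 1000 ms threshold, the single channel 0 phrase is split into five segments. Passing the phrases of both channels to interleave would give a turn-by-turn ordering, but the segment boundaries depend entirely on the chosen threshold and will not necessarily match the Speech Studio segmentation.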
Azure AI Speech
An Azure service that integrates speech processing into apps and services.

Accepted answer
  1. Prashanth Veeragoni 1,755 Reputation points Microsoft External Staff
    2025-02-14T01:42:44.7933333+00:00

    Hi Sidharth Ajul,

    Thank you for reaching out to Microsoft Q&A forum!

    The root cause appears to be that Speech Studio and the API used different models for transcription.

    Speech Studio used model dc55170a-871c-4747-886c-39c385f30e38 (2024-12-11), while the API used model 22b2ceb8-baea-4e34-872c-db1dedbf0eef (2024-09-26 Batch Transcription).

    Speech Studio automatically selected the latest model (2024-12-11), whereas the API defaulted to an older model (2024-09-26 Batch Transcription) because none was specified. That is why you saw different transcription behaviour.

    The Azure Speech API does not always default to the latest model. Unless you override it, it typically selects the most stable version.

    To fix this issue, specify the latest model in your API calls:

    To make the API use the same model as Speech Studio, explicitly specify the latest model (2024-12-11) by adding the model parameter to your API request.

    Modify your batch transcription request JSON to include the latest model ID (dc55170a-871c-4747-886c-39c385f30e38).

    Re-run your API request and compare the results with Speech Studio.
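
As a rough illustration of the change described above, the request body could be built as follows. This is a sketch under assumptions, not the exact request used in this thread: the function name build_transcription_request, the region "eastus", and the audio URL are hypothetical placeholders, and the body layout follows the v3.2 batch transcription REST API as I understand it, so please verify the field names against the API reference for the version you call.

```python
import json
from typing import Dict, Iterable


def build_transcription_request(content_urls: Iterable[str], locale: str,
                                model_id: str, region: str) -> Dict:
    """Build a v3.2 batch transcription body that pins a specific base model."""
    # Assumed URL shape for a base model reference in the v3.2 API.
    model_url = (
        f"https://{region}.api.cognitive.microsoft.com"
        f"/speechtotext/v3.2/models/base/{model_id}"
    )
    return {
        "displayName": "Hindi dual-channel batch transcription",
        "locale": locale,
        "contentUrls": list(content_urls),
        # Without this entry the service chooses a default model on its own.
        "model": {"self": model_url},
        "properties": {
            "wordLevelTimestampsEnabled": True,
            "displayFormWordLevelTimestampsEnabled": True,
            "channels": [0, 1],  # transcribe both channels of the stereo file
        },
    }


body = build_transcription_request(
    content_urls=["https://example.com/audio/call.wav"],  # placeholder URL
    locale="hi-IN",
    model_id="dc55170a-871c-4747-886c-39c385f30e38",  # model ID named in this answer
    region="eastus",  # placeholder; use the region of your Speech resource
)
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of the POST request that creates the transcription, with your usual Ocp-Apim-Subscription-Key header. The sketch only builds the body and sends nothing.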

    Please refer below document - About Batch Transcription
    https://learn.microsoft.com/en-us/azure/ai-services/speech-service/batch-transcription

    Hope this helps. Do let us know if you have any further queries.

     

    ------------- 

    If this answers your query, please click "Accept Answer" and "Yes" for "Was this answer helpful".

    Thank you.

    1 person found this answer helpful.

0 additional answers
