Inquiry about APIs to provide voice notes capabilities

Sakshi Chhabra 40 Reputation points Microsoft Employee
2026-02-19T06:20:56.3633333+00:00

Hi, I need input on implementing voice notes with transcription in M365.

  1. Is any team at Microsoft currently using Whisper (directly or indirectly) in production or experimentation? Are there any internal references, docs, or examples comparing Whisper with Microsoft‑native options (e.g., Azure Speech‑to‑Text)?
  2. Are there any compliance, privacy, or scalability concerns when considering Whisper APIs?
  3. How does Microsoft Teams handle voice identification / speaker diarization, translation, and transcription? I need any references or APIs for these.

Thanks!

Microsoft Teams | Development
Building, integrating, or customizing apps and workflows within Microsoft Teams using developer tools and APIs

Answer accepted by question author
  1. Sayali-MSFT 5,191 Reputation points Microsoft External Staff Moderator
    2026-02-19T09:58:38.5133333+00:00

    Hello @Sakshi Chhabra,
    Whisper at Microsoft: Yes. Whisper is generally available (GA) on Azure (Azure OpenAI and Azure AI Speech) and is used by customers in production. Teams real‑time transcription does not use Whisper; it uses Azure Speech models.

    Whisper vs Azure Speech:

    1. Whisper → best for async/prerecorded voice notes, multilingual accuracy.
    2. Azure Speech → required for real‑time, Teams‑like, diarization, live captions/translation.
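The two rules above can be sketched as a small routing helper. This is only an illustration of the guidance in this answer: `VoiceNoteJob`, `choose_engine`, and the returned labels are hypothetical names, not part of any Microsoft SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceNoteJob:
    """Hypothetical description of one transcription request."""
    real_time: bool          # text is needed while the audio is still streaming
    needs_diarization: bool  # speaker labels are required in the output


def choose_engine(job: VoiceNoteJob) -> str:
    """Pick a service following the rule of thumb in this answer.

    Real-time or diarization work goes to Azure Speech; prerecorded
    voice notes without those needs go to Azure-hosted Whisper.
    """
    if job.real_time or job.needs_diarization:
        return "azure-speech"
    return "whisper"
```

For a typical voice note that is recorded first and transcribed afterwards, `choose_engine(VoiceNoteJob(real_time=False, needs_diarization=False))` selects Whisper.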

    Compliance & privacy: Use Azure‑hosted Whisper/Speech for enterprise controls (GDPR, SOC, data residency). Speaker diarization can involve biometric data, so consent and retention policies are required.

    Teams capabilities: Teams uses Azure Speech for real‑time transcription, speaker diarization, and speech translation; transcripts are retrievable via Microsoft Graph APIs.
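As a rough sketch of the Graph retrieval mentioned above, the helpers below build the meeting transcript URLs using only the standard library. The function names are made up for illustration, and the sketch assumes the Graph v1.0 transcript endpoints, an access token obtained elsewhere, and an app that has been granted a transcript read permission (for example `OnlineMeetingTranscript.Read.All`). Check the current Graph documentation before relying on the exact paths.

```python
from urllib.parse import quote
from urllib.request import Request

GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def transcripts_url(user_id: str, meeting_id: str) -> str:
    """URL that lists the transcripts of one online meeting."""
    user = quote(user_id, safe="")
    meeting = quote(meeting_id, safe="")
    return f"{GRAPH_BASE}/users/{user}/onlineMeetings/{meeting}/transcripts"


def transcript_content_url(user_id: str, meeting_id: str, transcript_id: str) -> str:
    """URL that downloads one transcript as WebVTT text."""
    transcript = quote(transcript_id, safe="")
    return f"{transcripts_url(user_id, meeting_id)}/{transcript}/content?$format=text/vtt"


def authorized_get(url: str, access_token: str) -> Request:
    """Build (but do not send) a GET request carrying a bearer token."""
    return Request(url, headers={"Authorization": f"Bearer {access_token}"}, method="GET")
```

Passing the returned `Request` to `urllib.request.urlopen` would perform the call; token acquisition is deliberately left out.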

    Reference documents:
    1. https://learn.microsoft.com/en-us/azure/ai-services/speech-service/whisper-overview
    2. https://azure.microsoft.com/en-us/blog/accelerate-your-productivity-with-the-whisper-model-in-azure-ai-now-generally-available/
    3. https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-stt-diarization?tabs=windows&pivots=programming-language-csharp


Answer accepted by question author
  1. TiNo-T 11,605 Reputation points Microsoft External Staff Moderator
    2026-02-19T06:58:11.1266667+00:00

    Dear @Sakshi Chhabra,

    Welcome to Microsoft Q&A Forum!

    Thanks for your interest in implementing voice notes with transcription in Microsoft 365. Based on my research, here is a summary comparing OpenAI’s Whisper model and Microsoft’s native Azure Speech-to-Text (STT), along with relevant documentation links for deeper exploration:

    1. Accuracy:

    • OpenAI Whisper (via Azure): High accuracy across 57 languages; excels in multilingual transcription and translation to English.
    • Azure Speech-to-Text: Strong accuracy, especially for supported languages; customizable with domain-specific data.

    2. Real-Time Support:

    • OpenAI Whisper (via Azure): Not designed for real-time; processes audio in ~30s chunks. Best for batch transcription.
    • Azure Speech-to-Text: Optimized for real-time streaming with low latency. Ideal for live captions and meetings.

    3. Speaker Diarization:

    • OpenAI Whisper (via Azure): Available only via Azure AI Speech batch transcription.
    • Azure Speech-to-Text: Fully supported with speaker labeling and timestamps.

    4. Customization:

    • OpenAI Whisper (via Azure): Fine-tuning support coming soon via Azure Custom Speech.
    • Azure Speech-to-Text: Supports Custom Speech for domain-specific vocabulary and accents.

    5. Scalability:

    • OpenAI Whisper (via Azure): Requires GPU for self-hosting; Azure-hosted version handles scale.
    • Azure Speech-to-Text: Enterprise-grade scalability with Azure infrastructure.

    6. Compliance & Privacy:

    • OpenAI Whisper (via Azure): Azure-hosted Whisper ensures enterprise-grade security and data residency.
    • Azure Speech-to-Text: Fully compliant with Microsoft’s enterprise security and privacy standards.
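To make the batch and diarization points in the comparison concrete, here is a minimal sketch that prepares an Azure AI Speech batch transcription request with speaker diarization enabled. It assumes the v3.2 Speech to text REST API and its `diarizationEnabled` / `diarization.speakers` properties; newer API versions use a different path and property layout, so treat the URL and field names as assumptions to verify. The function name and the `displayName` value are hypothetical.

```python
import json
from urllib.request import Request


def batch_transcription_request(region, key, audio_urls, locale="en-US", max_speakers=2):
    """Build (but do not send) a batch transcription job request."""
    # Assumed v3.2 endpoint; confirm against the API version you target.
    url = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions"
    body = {
        "displayName": "voice-note-batch",  # hypothetical job name
        "locale": locale,
        "contentUrls": list(audio_urls),    # URLs of prerecorded audio files
        "properties": {
            "diarizationEnabled": True,
            "diarization": {"speakers": {"minCount": 1, "maxCount": max_speakers}},
            "wordLevelTimestampsEnabled": True,
        },
    }
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The job runs asynchronously: sending this request only creates the transcription, and the results have to be polled for afterwards, which suits prerecorded voice notes rather than live captions.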

    Additionally, regarding how Microsoft Teams handles voice identification, transcription, and translation, here are my insights:

    1. Voice Identification / Speaker Diarization:

    Microsoft Teams supports speaker attribution in meeting transcripts using two approaches:

    • Standard Speaker attribution: For remote participants, Teams uses each participant’s unique audio stream to identify speakers in real-time and post-meeting transcripts.
    • Intelligent Speakers for In-Room Diarization: In Microsoft Teams Rooms, Intelligent Speakers can identify up to 10 in-room participants using pre-enrolled voice profiles. This enables diarization even when multiple people share a single microphone.

    2. Transcription:

    Teams provides both live transcription during meetings and post-meeting transcripts. These are powered by Microsoft’s Automatic Speech Recognition (ASR) technology from Azure Cognitive Services.

    • Live transcription displays real-time captions with speaker names.
    • Post-meeting transcripts are saved and may be downloaded or accessed via Microsoft Graph APIs.
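Once a post-meeting transcript has been downloaded, the speaker names can be pulled out of it. The sketch below assumes the transcript is WebVTT text in which each cue wraps the spoken text in a voice tag such as `<v Speaker Name>...</v>`; the sample text and the `speaker_turns` helper are made up for illustration, so check the format of a real transcript first.

```python
import re

# Matches a WebVTT voice span: <v Speaker Name>spoken text</v>
VOICE_RE = re.compile(r"<v\s+([^>]+)>(.*?)</v>", re.DOTALL)

SAMPLE_VTT = """WEBVTT

00:00:01.000 --> 00:00:04.000
<v Alice Example>Let's record a voice note.</v>

00:00:04.500 --> 00:00:06.000
<v Bob Example>Sounds good.</v>
"""


def speaker_turns(vtt_text: str) -> list[tuple[str, str]]:
    """Return (speaker, text) pairs in the order they appear."""
    return [
        (speaker.strip(), " ".join(text.split()))  # collapse line breaks inside a cue
        for speaker, text in VOICE_RE.findall(vtt_text)
    ]
```

Calling `speaker_turns(SAMPLE_VTT)` yields one pair per cue, which is usually enough to render a speaker-attributed voice note.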

    3. Translation:

    Teams supports live translated captions using Azure Speech Translation. Participants can view captions in their preferred language during meetings.

    • Supports one spoken language and multiple subtitle languages.
    • Requires Teams Premium for full access after the preview period.

    Note: This information is provided as a convenience to you. These sites are not controlled by Microsoft, and Microsoft cannot make any representations regarding the quality, safety, or suitability of any software or information found there. Please ensure that you fully understand the risks before using any suggestions from the above links.

    4. If you're looking to build custom solutions or integrate with Teams, here are some useful documents that you can consult:

    I hope this information gives you some insight into your question. Wish you a pleasant day!


    If the answer is helpful, please click "Accept Answer" and kindly upvote it. If you have extra questions about this answer, please click "Comment". 

    Note: Please follow the steps in our documentation to enable e-mail notifications if you want to receive the related email notification for this thread. 


1 additional answer

  1. Q&A Assist
    2026-02-19T06:22:09.2733333+00:00
    AI-generated content. This question contains an answer created with AI from Q&A Assist.

    Currently, the Whisper model in Azure OpenAI service is capable of transcribing speech audio files and translating languages, making it a potential option for implementing voice notes with transcription in Microsoft 365. However, the context does not provide specific information about whether any team at Microsoft is using Whisper in production or experimentation, nor does it include internal references or comparisons between Whisper and Microsoft-native options like Azure Speech-to-Text.

    Regarding compliance, privacy, or scalability concerns when considering Whisper APIs, the context emphasizes the importance of obtaining necessary permissions from end users before processing and storing audio input. It also notes that the real-time API does not store audio input and transcription output data separately, which may alleviate some privacy concerns.

    As for Microsoft Teams, the context does not provide specific details on how it handles voice identification, speaker diarization, translation, or transcription. Therefore, no references or APIs related to these functionalities in Microsoft Teams can be provided based on the available context.


