Use the fast transcription API with Azure AI Speech

2025-05-25

The fast transcription API transcribes audio files and returns the results synchronously and faster than real time. Use fast transcription when you need the transcript of an audio recording as quickly as possible with predictable latency, for example:

Quick audio or video transcription, subtitling, and editing.
Video translation.

Unlike the batch transcription API, the fast transcription API produces transcriptions only in the display form (not the lexical form). The display form is a more human-readable form of the transcription that includes punctuation and capitalization.

Prerequisites

An Azure AI Speech resource in one of the regions where the fast transcription API is available. The supported regions are: Australia East, Brazil South, Central India, East US, East US 2, France Central, Japan East, North Central US, North Europe, South Central US, Southeast Asia, Sweden Central, UK South, West Europe, West US, West US 2, and West US 3. For more information about regions supported for other Speech service features, see Speech service supported regions.
An audio file (less than 2 hours long and less than 300 MB in size) in one of the formats and codecs supported by the batch transcription API: WAV, MP3, OPUS/OGG, FLAC, WMA, AAC, ALAW in WAV container, MULAW in WAV container, WMV, WebM, and SPEEX. For more information, see supported audio formats.
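
The size and length limits above can be checked locally before uploading. The following is a minimal sketch, not an official tool: `check_audio_file` is a hypothetical helper name, "300 MB" is assumed to mean 300 × 1024 × 1024 bytes, and the duration is only read for plain PCM WAV files because the Python standard library cannot decode the other formats.

```python
import os
import wave

MAX_BYTES = 300 * 1024 * 1024  # assumption: "300 MB" interpreted as 300 MiB
MAX_SECONDS = 2 * 60 * 60      # 2 hours


def check_audio_file(path):
    """Return a list of problems; an empty list means no limit was exceeded."""
    problems = []
    if os.path.getsize(path) >= MAX_BYTES:
        problems.append("file is 300 MB or larger")

    # Duration is only checked for PCM WAV; other formats would need a
    # third-party decoder, so they are skipped here.
    if path.lower().endswith(".wav"):
        try:
            with wave.open(path, "rb") as wav:
                seconds = wav.getnframes() / wav.getframerate()
            if seconds >= MAX_SECONDS:
                problems.append("audio is 2 hours or longer")
        except wave.Error:
            pass  # not plain PCM (for example ALAW or MULAW); duration unknown
    return problems
```

An empty result only means the limits that could be checked were not exceeded; the service remains the authority on whether a file is accepted.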

Use the fast transcription API

Tip

Try out fast transcription in the Azure AI Foundry portal.

This guide shows how to use the fast transcription API (via Transcriptions - Transcribe) in the following scenarios:

Known locale specified: Transcribe an audio file with a specified locale. If you know the locale of the audio file, you can specify it to improve transcription accuracy and minimize latency.
Language identification on: Transcribe an audio file with language identification on. If you're not sure about the locale of the audio file, you can turn on language identification to let the Speech service identify the locale (one locale per audio).
Multi-lingual transcription (preview): Transcribe an audio file with the latest multi-lingual speech transcription model. If your audio contains multi-lingual content that you want to transcribe continuously and accurately, you can use the latest multi-lingual speech transcription model without specifying the locale codes.
Diarization on: Transcribe an audio file with diarization on. Diarization distinguishes between different speakers in the conversation. The Speech service provides information about which speaker spoke a particular part of the transcribed speech.
Multi-channel on: Transcribe an audio file that has one or two channels. Multi-channel transcriptions are useful for audio files with multiple channels, such as audio files with multiple speakers or audio files with background noise. By default, the fast transcription API merges all input channels into a single channel and then performs the transcription. If this isn't desirable, channels can be transcribed independently without merging.

Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.

The following example shows how to transcribe an audio file with a specified locale. If you know the locale of the audio file, you can specify it to improve transcription accuracy and minimize latency.

Replace YourSpeechResoureKey with your Speech resource key.
Replace YourServiceRegion with your Speech resource region.
Replace YourAudioFile with the path to your audio file.

Important

For the recommended keyless authentication with Microsoft Entra ID, replace --header 'Ocp-Apim-Subscription-Key: YourSpeechResoureKey' with --header "Authorization: Bearer YourAccessToken". For more information about keyless authentication, see the role-based access control how-to guide.

curl --location 'https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2024-11-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSpeechResoureKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition="{
    "locales":["en-US"]}"'

Construct the form definition according to the following instructions:

Set the optional (but recommended) locales property, which should match the expected locale of the audio data to transcribe. In this example, the locale is set to en-US. For more information about the supported locales, see speech to text supported languages.

For more information about locales and other properties of the fast transcription API, see the request configuration options section later in this guide.
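
For readers who call the API from code rather than from curl, the same multipart/form-data request can be assembled with the Python standard library. This is a sketch under stated assumptions, not an official client: `build_transcribe_request` is a hypothetical helper, the URL, form field names, and headers are copied from the curl example, and the request is only built, not sent.

```python
import json
import uuid
import urllib.request

API_VERSION = "2024-11-15"  # value taken from the curl example


def build_transcribe_request(region, audio_bytes, filename, locales,
                             key=None, token=None):
    """Build (but do not send) the multipart/form-data POST request."""
    url = (
        f"https://{region}.api.cognitive.microsoft.com"
        f"/speechtotext/transcriptions:transcribe?api-version={API_VERSION}"
    )
    boundary = uuid.uuid4().hex
    definition = json.dumps({"locales": locales})

    # Two form parts mirror the two --form arguments of the curl command:
    # "audio" carries the file bytes, "definition" carries the JSON options.
    audio_part = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="audio"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8") + audio_bytes + b"\r\n"
    definition_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="definition"\r\n\r\n'
        f"{definition}\r\n"
    ).encode("utf-8")
    closing = f"--{boundary}--\r\n".encode("utf-8")

    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    if token is not None:
        # Keyless authentication with a Microsoft Entra ID access token.
        headers["Authorization"] = f"Bearer {token}"
    else:
        headers["Ocp-Apim-Subscription-Key"] = key

    return urllib.request.Request(
        url, data=audio_part + definition_part + closing,
        headers=headers, method="POST",
    )
```

To send the request, pass the returned object to `urllib.request.urlopen` and decode the response body with `json.load`. Keeping the build step separate from the send step makes the request easy to inspect before any audio leaves the machine.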

The response includes durationMilliseconds, offsetMilliseconds, and more. The combinedPhrases property contains the full transcriptions for all speakers.

{
	"durationMilliseconds": 182439,
	"combinedPhrases": [
		{
			"text": "Good afternoon. This is Sam. Thank you for calling Contoso. How can I help? Hi there. My name is Mary. I'm currently living in Los Angeles, but I'm planning to move to Las Vegas. I would like to apply for a loan. Okay. I see you're currently living in California. Let me make sure I understand you correctly. Uh You'd like to apply for a loan even though you'll be moving soon. Is that right? Yes, exactly. So I'm planning to relocate soon, but I would like to apply for the loan first so that I can purchase a new home once I move there. And are you planning to sell your current home? Yes, I will be listing it on the market soon and hopefully it'll sell quickly. That's why I'm applying for a loan now, so that I can purchase a new house in Nevada and close on it quickly as well once my current home sells. I see. Would you mind holding for a moment while I take your information down? Yeah, no problem. Thank you for your help. Mm-hmm. Just one moment. All right. Thank you for your patience, ma'am. May I have your first and last name, please? Yes, my name is Mary Smith. Thank you, Ms. Smith. May I have your current address, please? Yes. So my address is 123 Main Street in Los Angeles, California, and the zip code is 90923. Sorry, that was a 90 what? 90923. 90923 on Main Street. Got it. Thank you. May I have your phone number as well, please? Uh Yes, my phone number is 504-529-2351 and then yeah. 2351. Got it. And do you have an e-mail address we I can associate with this application? uh Yes, so my e-mail address is mary.a.sm78@gmail.com. Mary.a, was that a S-N as in November or M as in Mike? M as in Mike. Mike78, got it. Thank you. Ms. Smith, do you currently have any other loans? Uh Yes, so I currently have two other loans through Contoso. So my first one is my car loan and then my other is my student loan. They total about 1400 per month combined and my interest rate is 8%. I see. And you're currently paying those loans off monthly, is that right? Yes, of course I do. OK, thank you. Here's what I suggest we do. Let me place you on a brief hold again so that I can talk with one of our loan officers and get this started for you immediately. In the meantime, it would be great if you could take a few minutes and complete the remainder of the secure application online at www.contosoloans.com. Yeah, that sounds good. I can go ahead and get started. Thank you for your help. Thank you."
		}
	],
	"phrases": [
		{
			"offsetMilliseconds": 960,
			"durationMilliseconds": 640,
			"text": "Good afternoon.",
			"words": [
				{
					"text": "Good",
					"offsetMilliseconds": 960,
					"durationMilliseconds": 240
				},
				{
					"text": "afternoon.",
					"offsetMilliseconds": 1200,
					"durationMilliseconds": 400
				}
			],
			"locale": "en-US",
			"confidence": 0.93554276
		},
		{
			"offsetMilliseconds": 1600,
			"durationMilliseconds": 640,
			"text": "This is Sam.",
			"words": [
				{
					"text": "This",
					"offsetMilliseconds": 1600,
					"durationMilliseconds": 240
				},
				{
					"text": "is",
					"offsetMilliseconds": 1840,
					"durationMilliseconds": 120
				},
				{
					"text": "Sam.",
					"offsetMilliseconds": 1960,
					"durationMilliseconds": 280
				}
			],
			"locale": "en-US",
			"confidence": 0.93554276
		},
		{
			"offsetMilliseconds": 2240,
			"durationMilliseconds": 1040,
			"text": "Thank you for calling Contoso.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 2240,
					"durationMilliseconds": 200
				},
				{
					"text": "you",
					"offsetMilliseconds": 2440,
					"durationMilliseconds": 80
				},
				{
					"text": "for",
					"offsetMilliseconds": 2520,
					"durationMilliseconds": 120
				},
				{
					"text": "calling",
					"offsetMilliseconds": 2640,
					"durationMilliseconds": 200
				},
				{
					"text": "Contoso.",
					"offsetMilliseconds": 2840,
					"durationMilliseconds": 440
				}
			],
			"locale": "en-US",
			"confidence": 0.93554276
		},
		{
			"offsetMilliseconds": 3280,
			"durationMilliseconds": 640,
			"text": "How can I help?",
			"words": [
				{
					"text": "How",
					"offsetMilliseconds": 3280,
					"durationMilliseconds": 120
				},
				{
					"text": "can",
					"offsetMilliseconds": 3440,
					"durationMilliseconds": 120
				},
				{
					"text": "I",
					"offsetMilliseconds": 3560,
					"durationMilliseconds": 40
				},
				{
					"text": "help?",
					"offsetMilliseconds": 3600,
					"durationMilliseconds": 320
				}
			],
			"locale": "en-US",
			"confidence": 0.93554276
		},
		{
			"offsetMilliseconds": 5040,
			"durationMilliseconds": 400,
			"text": "Hi there.",
			"words": [
				{
					"text": "Hi",
					"offsetMilliseconds": 5040,
					"durationMilliseconds": 240
				},
				{
					"text": "there.",
					"offsetMilliseconds": 5280,
					"durationMilliseconds": 160
				}
			],
			"locale": "en-US",
			"confidence": 0.93554276
		},
		{
			"offsetMilliseconds": 5440,
			"durationMilliseconds": 800,
			"text": "My name is Mary.",
			"words": [
				{
					"text": "My",
					"offsetMilliseconds": 5440,
					"durationMilliseconds": 80
				},
				{
					"text": "name",
					"offsetMilliseconds": 5520,
					"durationMilliseconds": 120
				},
				{
					"text": "is",
					"offsetMilliseconds": 5640,
					"durationMilliseconds": 80
				},
				{
					"text": "Mary.",
					"offsetMilliseconds": 5720,
					"durationMilliseconds": 520
				}
			],
			"locale": "en-US",
			"confidence": 0.93554276
		},
		// More transcription results...
	    // Redacted for brevity
		{
			"offsetMilliseconds": 180320,
			"durationMilliseconds": 680,
			"text": "Thank you for your help.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 180320,
					"durationMilliseconds": 160
				},
				{
					"text": "you",
					"offsetMilliseconds": 180480,
					"durationMilliseconds": 80
				},
				{
					"text": "for",
					"offsetMilliseconds": 180560,
					"durationMilliseconds": 120
				},
				{
					"text": "your",
					"offsetMilliseconds": 180680,
					"durationMilliseconds": 120
				},
				{
					"text": "help.",
					"offsetMilliseconds": 180800,
					"durationMilliseconds": 200
				}
			],
			"locale": "en-US",
			"confidence": 0.92022026
		},
		{
			"offsetMilliseconds": 181960,
			"durationMilliseconds": 280,
			"text": "Thank you.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 181960,
					"durationMilliseconds": 200
				},
				{
					"text": "you.",
					"offsetMilliseconds": 182160,
					"durationMilliseconds": 80
				}
			],
			"locale": "en-US",
			"confidence": 0.92022026
		}
	]
}
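
Because each phrase carries offsetMilliseconds and durationMilliseconds, the response maps naturally onto caption formats. The sketch below turns the phrases array into SubRip (SRT) text. It is illustrative only: `ms_to_timestamp` and `phrases_to_srt` are hypothetical helper names, and one caption per phrase is an assumption rather than a recommendation from the service.

```python
def ms_to_timestamp(ms):
    """Format milliseconds as the HH:MM:SS,mmm timestamp used by SRT."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def phrases_to_srt(result):
    """Build SRT caption text from a parsed fast transcription response."""
    blocks = []
    for index, phrase in enumerate(result["phrases"], start=1):
        start = phrase["offsetMilliseconds"]
        end = start + phrase["durationMilliseconds"]
        blocks.append(
            f"{index}\n"
            f"{ms_to_timestamp(start)} --> {ms_to_timestamp(end)}\n"
            f"{phrase['text']}\n"
        )
    # A blank line separates consecutive captions.
    return "\n".join(blocks)
```

For the first phrase in the sample response (offset 960 ms, duration 640 ms), this yields a caption that runs from 00:00:00,960 to 00:00:01,600.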

Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.

The following example shows how to transcribe an audio file with language identification on. If you're not sure about the locale, you can specify multiple locales. If you don't specify any locale, or if the locales that you specify aren't in the audio file, the Speech service tries to identify the locale.

Note

Language identification in fast transcription is designed to identify one main language locale per audio file. If you need to transcribe multi-lingual content in the audio, consider using multi-lingual transcription (preview).

Replace YourSpeechResoureKey with your Speech resource key.
Replace YourServiceRegion with your Speech resource region.
Replace YourAudioFile with the path to your audio file.

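Important

For the recommended keyless authentication with Microsoft Entra ID, replace --header 'Ocp-Apim-Subscription-Key: YourSpeechResoureKey' with --header "Authorization: Bearer YourAccessToken". For more information about keyless authentication, see the role-based access control how-to guide.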

curl --location 'https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2024-11-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSpeechResoureKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition="{
    "locales":["en-US","ja-JP"]}"'

Construct the form definition according to the following instructions:

Set the optional (but recommended) locales property, which should match the expected locale of the audio data to transcribe. In this example, the locales are set to en-US and ja-JP. The locales that you can specify are those listed under all supported languages.

For more information about locales and other properties of the fast transcription API, see the request configuration options section later in this guide.

The response includes durationMilliseconds, offsetMilliseconds, and more. The combinedPhrases property contains the full transcriptions for all speakers.

{
	"durationMilliseconds": 185079,
	"combinedPhrases": [
		{
			"text": "Hello, thank you for calling Contoso. Who am I speaking with today? Hi, my name is Mary Rondo. I'm trying to enroll myself with Contoso. Hi, Mary. Are you calling because you need health insurance? Yes. Yeah, I'm calling to sign up for insurance. Great. Uh If you can answer a few questions, we can get you signed up in a Jiffy. Okay. So what's your full name? uh So Mary Beth Rondo, last name is R like Romeo, O like Ocean, N like Nancy D, D like Dog, and O like Ocean again. Rondo. Got it. And what's the best callback number in case we get disconnected? I only have a cell phone, so I can give you that. Yep, that'll be fine. Sure. So it's 234-554 and then 9312. Got it. So to confirm, it's 234-554-9312. Yep, that's right. Excellent. Let's get some additional information for your application. Do you have a job? Uh Yes, I am self-employed. Okay, so then you have a social security number as well? Uh Yes, I do. Okay, and what is your social security number, please? Uh Sure, so it's 412-253-4931. 6789. Sorry, was that a 25 or a 225? You cut out for a bit. It's double two, so 412, then another two, then five. Thank you so much. And could I have your e-mail address, please? Yeah, it's maryrondo@gmail.com. So my first and last name at gmail.com. No periods, no dashes. Great. Uh That is the last question. So let me take your information and I'll be able to get you signed up right away. Thank you for calling Contoso and I'll be able to get you signed up immediately. One of our agents will call you back in about 24 hours or so to confirm your application. That sounds good. Thank you. Absolutely. If you need anything else, please give us a call at 1-800-555-5564, extension 123. Thank you very much for calling Contoso. Actually, so I have one more question. Yes, of course. I'm curious, will I be getting a physical card as proof of coverage? So the default is a digital membership card, but we can send you a physical card if you prefer. Uh Yes. Could you please mail it to me when it's ready? I'd like to have it shipped to, are you ready for my address? Uh Yeah. uh So it's 2660 Unit A on Maple Avenue, Southeast Lansing, and then zip code is 48823. Absolutely. I've made a note on your file. Awesome. Thanks so much. You're very welcome. Thank you for calling Contoso and have a great day."
		}
	],
	"phrases": [
		{
			"offsetMilliseconds": 720,
			"durationMilliseconds": 1600,
			"text": "Hello, thank you for calling Contoso.",
			"words": [
				{
					"text": "Hello,",
					"offsetMilliseconds": 720,
					"durationMilliseconds": 480
				},
				{
					"text": "thank",
					"offsetMilliseconds": 1200,
					"durationMilliseconds": 200
				},
				{
					"text": "you",
					"offsetMilliseconds": 1400,
					"durationMilliseconds": 80
				},
				{
					"text": "for",
					"offsetMilliseconds": 1480,
					"durationMilliseconds": 120
				},
				{
					"text": "calling",
					"offsetMilliseconds": 1600,
					"durationMilliseconds": 240
				},
				{
					"text": "Contoso.",
					"offsetMilliseconds": 1840,
					"durationMilliseconds": 480
				}
			],
			"locale": "en-US",
			"confidence": 0.93265927
		},
		{
			"offsetMilliseconds": 2320,
			"durationMilliseconds": 1120,
			"text": "Who am I speaking with today?",
			"words": [
				{
					"text": "Who",
					"offsetMilliseconds": 2320,
					"durationMilliseconds": 160
				},
				{
					"text": "am",
					"offsetMilliseconds": 2480,
					"durationMilliseconds": 80
				},
				{
					"text": "I",
					"offsetMilliseconds": 2560,
					"durationMilliseconds": 80
				},
				{
					"text": "speaking",
					"offsetMilliseconds": 2640,
					"durationMilliseconds": 320
				},
				{
					"text": "with",
					"offsetMilliseconds": 2960,
					"durationMilliseconds": 160
				},
				{
					"text": "today?",
					"offsetMilliseconds": 3120,
					"durationMilliseconds": 320
				}
			],
			"locale": "en-US",
			"confidence": 0.93265927
		},
		{
			"offsetMilliseconds": 4480,
			"durationMilliseconds": 1600,
			"text": "Hi, my name is Mary Rondo.",
			"words": [
				{
					"text": "Hi,",
					"offsetMilliseconds": 4480,
					"durationMilliseconds": 400
				},
				{
					"text": "my",
					"offsetMilliseconds": 4880,
					"durationMilliseconds": 120
				},
				{
					"text": "name",
					"offsetMilliseconds": 5000,
					"durationMilliseconds": 120
				},
				{
					"text": "is",
					"offsetMilliseconds": 5120,
					"durationMilliseconds": 160
				},
				{
					"text": "Mary",
					"offsetMilliseconds": 5280,
					"durationMilliseconds": 240
				},
				{
					"text": "Rondo.",
					"offsetMilliseconds": 5520,
					"durationMilliseconds": 560
				}
			],
			"locale": "en-US",
			"confidence": 0.93265927
		},
		{
			"offsetMilliseconds": 6120,
			"durationMilliseconds": 1800,
			"text": "I'm trying to enroll myself with Contoso.",
			"words": [
				{
					"text": "I'm",
					"offsetMilliseconds": 6120,
					"durationMilliseconds": 120
				},
				{
					"text": "trying",
					"offsetMilliseconds": 6240,
					"durationMilliseconds": 200
				},
				{
					"text": "to",
					"offsetMilliseconds": 6440,
					"durationMilliseconds": 80
				},
				{
					"text": "enroll",
					"offsetMilliseconds": 6520,
					"durationMilliseconds": 200
				},
				{
					"text": "myself",
					"offsetMilliseconds": 6720,
					"durationMilliseconds": 360
				},
				{
					"text": "with",
					"offsetMilliseconds": 7080,
					"durationMilliseconds": 120
				},
				{
					"text": "Contoso.",
					"offsetMilliseconds": 7200,
					"durationMilliseconds": 720
				}
			],
			"locale": "en-US",
			"confidence": 0.93265927
		},
		// More transcription results...
	    // Redacted for brevity
		{
			"offsetMilliseconds": 181520,
			"durationMilliseconds": 720,
			"text": "You're very welcome.",
			"words": [
				{
					"text": "You're",
					"offsetMilliseconds": 181520,
					"durationMilliseconds": 160
				},
				{
					"text": "very",
					"offsetMilliseconds": 181680,
					"durationMilliseconds": 200
				},
				{
					"text": "welcome.",
					"offsetMilliseconds": 181880,
					"durationMilliseconds": 360
				}
			],
			"locale": "en-US",
			"confidence": 0.90571773
		},
		{
			"offsetMilliseconds": 182320,
			"durationMilliseconds": 1840,
			"text": "Thank you for calling Contoso and have a great day.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 182320,
					"durationMilliseconds": 200
				},
				{
					"text": "you",
					"offsetMilliseconds": 182520,
					"durationMilliseconds": 80
				},
				{
					"text": "for",
					"offsetMilliseconds": 182600,
					"durationMilliseconds": 120
				},
				{
					"text": "calling",
					"offsetMilliseconds": 182720,
					"durationMilliseconds": 280
				},
				{
					"text": "Contoso",
					"offsetMilliseconds": 183000,
					"durationMilliseconds": 520
				},
				{
					"text": "and",
					"offsetMilliseconds": 183520,
					"durationMilliseconds": 160
				},
				{
					"text": "have",
					"offsetMilliseconds": 183680,
					"durationMilliseconds": 120
				},
				{
					"text": "a",
					"offsetMilliseconds": 183800,
					"durationMilliseconds": 40
				},
				{
					"text": "great",
					"offsetMilliseconds": 183840,
					"durationMilliseconds": 200
				},
				{
					"text": "day.",
					"offsetMilliseconds": 184040,
					"durationMilliseconds": 120
				}
			],
			"locale": "en-US",
			"confidence": 0.90571773
		}
	]
}
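
Each phrase in the response reports the locale that was identified for it, so a caller can confirm which of the candidate locales the service settled on. The following sketch counts phrases per locale. `dominant_locale` is a hypothetical helper name, and treating the most frequent locale as the identified one is an assumption based on the one-locale-per-audio behavior described above.

```python
from collections import Counter


def dominant_locale(result):
    """Return (locale, phrase_count) for the most frequent locale, or None."""
    counts = Counter(
        phrase["locale"]
        for phrase in result.get("phrases", [])
        if "locale" in phrase
    )
    if not counts:
        return None  # no phrases, or no locale information in the response
    return counts.most_common(1)[0]
```

For the sample response, where every phrase is tagged en-US, the helper returns en-US together with the number of phrases.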

Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.

The following example shows how to transcribe an audio file with the latest multi-lingual speech transcription model. If your audio contains multi-lingual content that you want to transcribe continuously and accurately, you can use the latest multi-lingual speech transcription model without specifying the locale codes.

Replace YourSpeechResoureKey with your Speech resource key.
Replace YourServiceRegion with your Speech resource region.
Replace YourAudioFile with the path to your audio file.

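Important

For the recommended keyless authentication with Microsoft Entra ID, replace --header 'Ocp-Apim-Subscription-Key: YourSpeechResoureKey' with --header "Authorization: Bearer YourAccessToken". For more information about keyless authentication, see the role-based access control how-to guide.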

curl --location 'https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2024-11-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSpeechResoureKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition="{
    "locales":[]}"'

Construct the form definition according to the following instructions:

You can leave the locales property empty (as shown in the previous example) or omit it.
The supported locales for audio input with the current multi-lingual model are: de-DE, en-AU, en-CA, en-GB, en-IN, en-US, es-ES, es-MX, fr-CA, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, and zh-CN.
The transcription result is distinguished at the language level and follows the "primary locale of that language" (for example, it always outputs the "en-US" locale code, even if the audio has a British English or Indian English accent).
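
Since the multi-lingual result is distinguished at the language level, it can be convenient to collapse consecutive phrases of the same language into segments. The sketch below does that with the phrases, locale, offsetMilliseconds, durationMilliseconds, and text fields used in the response samples in this article. `language_of` and `language_segments` are hypothetical helper names, and locale codes are compared case-insensitively on the assumption that their casing can vary in responses.

```python
from itertools import groupby


def language_of(locale):
    """Return the lowercase language subtag, e.g. 'en' for 'en-US'."""
    return locale.split("-")[0].lower()


def language_segments(phrases):
    """Merge consecutive phrases that share a language into one segment."""
    segments = []
    by_language = groupby(phrases, key=lambda p: language_of(p["locale"]))
    for language, group in by_language:
        group = list(group)
        last = group[-1]
        segments.append({
            "language": language,
            "startMs": group[0]["offsetMilliseconds"],
            "endMs": last["offsetMilliseconds"] + last["durationMilliseconds"],
            "text": " ".join(p["text"] for p in group),
        })
    return segments
```

Only adjacent phrases are merged, so audio that switches from English to Chinese and back to English produces three segments rather than two.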

For more information about locales and other properties of the fast transcription API, see the request configuration options section later in this guide.

The response includes durationMilliseconds, offsetMilliseconds, and more. The combinedPhrases property contains the full transcriptions for all speakers.

{
    "durationMilliseconds": 57187,
    "combinedPhrases": [
        {
            "text": "With custom speech,you can evaluate and improve the microsoft speech to text accuracy for your applications and products 现成的语音转文本,利用通用语言模型作为一个基本模型,使用microsoft自有数据进行训练,并反映常用的口语。此基础模型使用那些代表各常见领域的方言和发音进行了预先训练。 Quand vous effectuez une demande de reconnaissance vocale, le modèle de base le plus récent pour chaque langue prise en charge est utilisé par défaut. Le modèle de base fonctionne très bien dans la plupart des scénarios de reconnaissance vocale. A custom model can be used to augment the base model to improve recognition of domain specific vocabulary specified to the application by providing text data to train the model. It can also be used to improve recognition based for the specific audio conditions of the application by providing audio data with reference transcriptions."
        }
    ],
    "phrases": [
        {
            "offsetMilliseconds": 80,
            "durationMilliseconds": 6960,
            "text": "With custom speech,you can evaluate and improve the microsoft speech to text accuracy for your applications and products.",
            "words": [
                {
                    "text": "with",
                    "offsetMilliseconds": 80,
                    "durationMilliseconds": 160
                },
                {
                    "text": "custom",
                    "offsetMilliseconds": 240,
                    "durationMilliseconds": 480
                },
                {
                    "text": "speech",
                    "offsetMilliseconds": 720,
                    "durationMilliseconds": 360
                },
                {
                    "text": ",",
                    "offsetMilliseconds": 1080,
                    "durationMilliseconds": 10
                },
                {
                    "text": "you",
                    "offsetMilliseconds": 1200,
                    "durationMilliseconds": 240
                },
                {
                    "text": "can",
                    "offsetMilliseconds": 1440,
                    "durationMilliseconds": 160
                },
                {
                    "text": "evaluate",
                    "offsetMilliseconds": 1600,
                    "durationMilliseconds": 640
                },
                {
                    "text": "and",
                    "offsetMilliseconds": 2240,
                    "durationMilliseconds": 200
                },
                {
                    "text": "improve",
                    "offsetMilliseconds": 2440,
                    "durationMilliseconds": 280
                },
                {
                    "text": "the",
                    "offsetMilliseconds": 2720,
                    "durationMilliseconds": 160
                },
                {
                    "text": "microsoft",
                    "offsetMilliseconds": 2880,
                    "durationMilliseconds": 640
                },
                {
                    "text": "speech",
                    "offsetMilliseconds": 3520,
                    "durationMilliseconds": 320
                },
                {
                    "text": "to",
                    "offsetMilliseconds": 3840,
                    "durationMilliseconds": 200
                },
                {
                    "text": "text",
                    "offsetMilliseconds": 4040,
                    "durationMilliseconds": 360
                },
                {
                    "text": "accuracy",
                    "offsetMilliseconds": 4400,
                    "durationMilliseconds": 560
                },
                {
                    "text": "for",
                    "offsetMilliseconds": 4960,
                    "durationMilliseconds": 160
                },
                {
                    "text": "your",
                    "offsetMilliseconds": 5120,
                    "durationMilliseconds": 200
                },
                {
                    "text": "applications",
                    "offsetMilliseconds": 5320,
                    "durationMilliseconds": 760
                },
                {
                    "text": "and",
                    "offsetMilliseconds": 6080,
                    "durationMilliseconds": 200
                },
                {
                    "text": "products",
                    "offsetMilliseconds": 6280,
                    "durationMilliseconds": 680
                }
            ],
            "locale": "en-us",
            "confidence": 0.9539559
        },
        {
            "offsetMilliseconds": 8000,
            "durationMilliseconds": 8600,
            "text": "现成的语音转文本,利用通用语言模型作为一个基本模型,使用microsoft自有数据进行训练,并反映常用的口语。此基础模型使用那些代表各常见领域的方言和发音进行了预先训练。",
            "words": [
                {
                    "text": "现",
                    "offsetMilliseconds": 8000,
                    "durationMilliseconds": 40
                },
                {
                    "text": "成",
                    "offsetMilliseconds": 8040,
                    "durationMilliseconds": 40
                },
                {
                    "text": "的",
                    "offsetMilliseconds": 8160,
                    "durationMilliseconds": 40
                },
                {
                    "text": "语",
                    "offsetMilliseconds": 8200,
                    "durationMilliseconds": 40
                },
                {
                    "text": "音",
                    "offsetMilliseconds": 8240,
                    "durationMilliseconds": 40
                },
                {
                    "text": "转",
                    "offsetMilliseconds": 8280,
                    "durationMilliseconds": 40
                },
                {
                    "text": "文",
                    "offsetMilliseconds": 8320,
                    "durationMilliseconds": 40
                },
                {
                    "text": "本,",
                    "offsetMilliseconds": 8360,
                    "durationMilliseconds": 40
                },
                {
                    "text": "利",
                    "offsetMilliseconds": 8400,
                    "durationMilliseconds": 40
                },
                {
                    "text": "用",
                    "offsetMilliseconds": 8440,
                    "durationMilliseconds": 40
                },
                {
                    "text": "通",
                    "offsetMilliseconds": 8480,
                    "durationMilliseconds": 40
                },
                {
                    "text": "用",
                    "offsetMilliseconds": 8520,
                    "durationMilliseconds": 40
                },
                {
                    "text": "语",
                    "offsetMilliseconds": 8560,
                    "durationMilliseconds": 40
                },
                {
                    "text": "言",
                    "offsetMilliseconds": 8600,
                    "durationMilliseconds": 40
                },
                {
                    "text": "模",
                    "offsetMilliseconds": 8640,
                    "durationMilliseconds": 40
                },
                {
                    "text": "型",
                    "offsetMilliseconds": 8680,
                    "durationMilliseconds": 40
                },
                {
                    "text": "作",
                    "offsetMilliseconds": 8800,
                    "durationMilliseconds": 40
                },
                {
                    "text": "为",
                    "offsetMilliseconds": 8840,
                    "durationMilliseconds": 40
                },
                {
                    "text": "一",
                    "offsetMilliseconds": 9520,
                    "durationMilliseconds": 40
                },
                {
                    "text": "个",
                    "offsetMilliseconds": 9560,
                    "durationMilliseconds": 40
                },
                {
                    "text": "基",
                    "offsetMilliseconds": 9600,
                    "durationMilliseconds": 40
                },
                {
                    "text": "本",
                    "offsetMilliseconds": 9640,
                    "durationMilliseconds": 40
                },
                {
                    "text": "模",
                    "offsetMilliseconds": 9680,
                    "durationMilliseconds": 40
                },
                {
                    "text": "型,",
                    "offsetMilliseconds": 9720,
                    "durationMilliseconds": 40
                },
                {
                    "text": "使",
                    "offsetMilliseconds": 9760,
                    "durationMilliseconds": 40
                },
                {
                    "text": "用",
                    "offsetMilliseconds": 10080,
                    "durationMilliseconds": 320
                },
                {
                    "text": "microsoft",
                    "offsetMilliseconds": 10400,
                    "durationMilliseconds": 3600
                },
                {
                    "text": "自",
                    "offsetMilliseconds": 14000,
                    "durationMilliseconds": 40
                },
                {
                    "text": "有",
                    "offsetMilliseconds": 14040,
                    "durationMilliseconds": 40
                },
                {
                    "text": "数",
                    "offsetMilliseconds": 14160,
                    "durationMilliseconds": 40
                },
                {
                    "text": "据",
                    "offsetMilliseconds": 14200,
                    "durationMilliseconds": 40
                },
                {
                    "text": "进",
                    "offsetMilliseconds": 14320,
                    "durationMilliseconds": 40
                },
                {
                    "text": "行",
                    "offsetMilliseconds": 14360,
                    "durationMilliseconds": 40
                },
                {
                    "text": "训",
                    "offsetMilliseconds": 14400,
                    "durationMilliseconds": 40
                },
                {
                    "text": "练,",
                    "offsetMilliseconds": 14440,
                    "durationMilliseconds": 40
                },
                {
                    "text": "并",
                    "offsetMilliseconds": 14480,
                    "durationMilliseconds": 40
                },
                {
                    "text": "反",
                    "offsetMilliseconds": 14520,
                    "durationMilliseconds": 40
                },
                {
                    "text": "映",
                    "offsetMilliseconds": 14560,
                    "durationMilliseconds": 40
                },
                {
                    "text": "常",
                    "offsetMilliseconds": 14600,
                    "durationMilliseconds": 40
                },
                {
                    "text": "用",
                    "offsetMilliseconds": 14640,
                    "durationMilliseconds": 40
                },
                {
                    "text": "的",
                    "offsetMilliseconds": 14680,
                    "durationMilliseconds": 40
                },
                {
                    "text": "口",
                    "offsetMilliseconds": 14720,
                    "durationMilliseconds": 40
                },
                {
                    "text": "语",
                    "offsetMilliseconds": 14760,
                    "durationMilliseconds": 40
                },
                {
                    "text": "。",
                    "offsetMilliseconds": 14800,
                    "durationMilliseconds": 40
                },
                {
                    "text": "此",
                    "offsetMilliseconds": 14840,
                    "durationMilliseconds": 40
                },
                {
                    "text": "基",
                    "offsetMilliseconds": 14880,
                    "durationMilliseconds": 40
                },
                {
                    "text": "础",
                    "offsetMilliseconds": 14920,
                    "durationMilliseconds": 40
                },
                {
                    "text": "模",
                    "offsetMilliseconds": 14960,
                    "durationMilliseconds": 40
                },
                {
                    "text": "型",
                    "offsetMilliseconds": 15000,
                    "durationMilliseconds": 40
                },
                {
                    "text": "使",
                    "offsetMilliseconds": 15040,
                    "durationMilliseconds": 40
                },
                {
                    "text": "用",
                    "offsetMilliseconds": 15080,
                    "durationMilliseconds": 40
                },
                {
                    "text": "那",
                    "offsetMilliseconds": 15120,
                    "durationMilliseconds": 40
                },
                {
                    "text": "些",
                    "offsetMilliseconds": 15160,
                    "durationMilliseconds": 40
                },
                {
                    "text": "代",
                    "offsetMilliseconds": 15200,
                    "durationMilliseconds": 40
                },
                {
                    "text": "表",
                    "offsetMilliseconds": 15240,
                    "durationMilliseconds": 40
                },
                {
                    "text": "各",
                    "offsetMilliseconds": 15280,
                    "durationMilliseconds": 40
                },
                {
                    "text": "常",
                    "offsetMilliseconds": 15320,
                    "durationMilliseconds": 40
                },
                {
                    "text": "见",
                    "offsetMilliseconds": 15360,
                    "durationMilliseconds": 40
                },
                {
                    "text": "领",
                    "offsetMilliseconds": 15400,
                    "durationMilliseconds": 40
                },
                {
                    "text": "域",
                    "offsetMilliseconds": 15760,
                    "durationMilliseconds": 40
                },
                {
                    "text": "的",
                    "offsetMilliseconds": 15800,
                    "durationMilliseconds": 40
                },
                {
                    "text": "方",
                    "offsetMilliseconds": 15920,
                    "durationMilliseconds": 40
                },
                {
                    "text": "言",
                    "offsetMilliseconds": 15960,
                    "durationMilliseconds": 40
                },
                {
                    "text": "和",
                    "offsetMilliseconds": 16000,
                    "durationMilliseconds": 40
                },
                {
                    "text": "发",
                    "offsetMilliseconds": 16040,
                    "durationMilliseconds": 40
                },
                {
                    "text": "音",
                    "offsetMilliseconds": 16080,
                    "durationMilliseconds": 40
                },
                {
                    "text": "进",
                    "offsetMilliseconds": 16120,
                    "durationMilliseconds": 40
                },
                {
                    "text": "行",
                    "offsetMilliseconds": 16160,
                    "durationMilliseconds": 40
                },
                {
                    "text": "了",
                    "offsetMilliseconds": 16200,
                    "durationMilliseconds": 40
                },
                {
                    "text": "预",
                    "offsetMilliseconds": 16320,
                    "durationMilliseconds": 40
                },
                {
                    "text": "先",
                    "offsetMilliseconds": 16360,
                    "durationMilliseconds": 40
                },
                {
                    "text": "训",
                    "offsetMilliseconds": 16400,
                    "durationMilliseconds": 40
                },
                {
                    "text": "练",
                    "offsetMilliseconds": 16560,
                    "durationMilliseconds": 40
                },
            ],
            "locale": "zh-cn",
            "confidence": 0.9241725
        },
        {
            "offsetMilliseconds": 24320,
            "durationMilliseconds": 6640,
            "text": "Quand vous effectuez une demande de reconnaissance vocale, le modèle de base le plus récent pour chaque langue prise en charge est utilisé par défaut.",
            "words": [
                {
                    "text": "Quand",
                    "offsetMilliseconds": 24320,
                    "durationMilliseconds": 160
                },
                {
                    "text": "vous",
                    "offsetMilliseconds": 24480,
                    "durationMilliseconds": 80
                },
		// More transcription results...
	    // Redacted for brevity
                {
                    "text": "scénarios",
                    "offsetMilliseconds": 34200,
                    "durationMilliseconds": 400
                },
                {
                    "text": "de",
                    "offsetMilliseconds": 34600,
                    "durationMilliseconds": 120
                },
                {
                    "text": "reconnaissance",
                    "offsetMilliseconds": 34720,
                    "durationMilliseconds": 640
                },
                {
                    "text": "vocale.",
                    "offsetMilliseconds": 35360,
                    "durationMilliseconds": 480
                }
            ],
            "locale": "fr-fr",
            "confidence": 0.9308314
        },
        {
            "offsetMilliseconds": 36720,
            "durationMilliseconds": 10320,
            "text": "A custom model can be used to augment the base model to improve recognition of domain specific vocabulary spécifique to the application by providing text data to train the model.",
            "words": [
                {
                    "text": "A",
                    "offsetMilliseconds": 36720,
                    "durationMilliseconds": 80
                },
                {
                    "text": "custom",
                    "offsetMilliseconds": 36880,
                    "durationMilliseconds": 400
                },
                {
                    "text": "model",
                    "offsetMilliseconds": 37280,
                    "durationMilliseconds": 480
                },

		// More transcription results...
	    // Redacted for brevity
                {
                    "text": "with",
                    "offsetMilliseconds": 54720,
                    "durationMilliseconds": 200
                },
                {
                    "text": "reference",
                    "offsetMilliseconds": 54920,
                    "durationMilliseconds": 360
                },
                {
                    "text": "transcriptions.",
                    "offsetMilliseconds": 55280,
                    "durationMilliseconds": 1200
                }
            ],
            "locale": "en-us",
            "confidence": 0.92155737
        }
    ]
}
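
As a rough illustration of how a multilingual response like the one above might be consumed, the following Python sketch sums the phrase durations per detected locale. It assumes the response was already parsed into a dictionary (for example with json.loads); the function name speech_time_by_locale and the trimmed sample data are hypothetical and not part of the API.

```python
from collections import defaultdict


def speech_time_by_locale(response: dict) -> dict:
    """Sum phrase durations (in milliseconds) per detected locale.

    Hypothetical helper: only the field names come from the sample response.
    """
    totals = defaultdict(int)
    for phrase in response.get("phrases", []):
        # Locale casing differs between sample responses ("fr-fr" vs "en-US"),
        # so normalize it before grouping.
        totals[phrase["locale"].lower()] += phrase["durationMilliseconds"]
    return dict(totals)


# Two phrases trimmed from the sample response above.
sample = {
    "phrases": [
        {"locale": "fr-fr", "offsetMilliseconds": 24320, "durationMilliseconds": 6640},
        {"locale": "en-us", "offsetMilliseconds": 36720, "durationMilliseconds": 10320},
    ]
}
print(speech_time_by_locale(sample))
```

A summary like this can help verify that language identification picked up every language you expected in the recording.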

Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.

The following example shows how to transcribe an audio file with diarization enabled. Diarization distinguishes between the different speakers in a conversation. The Speech service provides information about which speaker spoke a particular part of the transcribed speech.

Replace YourSpeechResoureKey with your Speech resource key.
Replace YourServiceRegion with your Speech resource region.
Replace YourAudioFile with the path to your audio file.

curl --location 'https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2024-11-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSpeechResoureKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition="{
    "locales":["en-US"], 
    "diarization": {"maxSpeakers": 2,"enabled": true}}"'

Construct the form definition according to the following instructions:

Set the optional (but recommended) locales property to match the expected locale of the audio data to transcribe. In this example, the locale is set to en-US. The supported locales that you can specify are: de-DE, en-GB, en-IN, en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN.
Set the diarization property to recognize and separate multiple speakers in one audio channel. For example, specify "diarization": {"maxSpeakers": 2, "enabled": true}. The transcription then contains a speaker entry for each transcribed phrase.

For more information about locales, diarization, and other properties of the fast transcription API, see the Request configuration options section later in this guide.
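
If you assemble the definition programmatically rather than typing it into the curl command, a small helper can keep the JSON well formed. The following Python sketch is only an illustration: the function name build_definition and its parameters are hypothetical, and only the locales and diarization property names come from this guide.

```python
import json


def build_definition(locales, max_speakers=None):
    """Return the JSON string for the `definition` form field.

    Hypothetical helper; it sends nothing and only serializes the properties.
    """
    definition = {"locales": list(locales)}
    if max_speakers is not None:
        # Diarization is switched on only when a speaker limit is supplied.
        definition["diarization"] = {"maxSpeakers": max_speakers, "enabled": True}
    return json.dumps(definition)


print(build_definition(["en-US"], max_speakers=2))
```

The resulting string can be passed as the value of the definition form field in the multipart request.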

The response includes durationMilliseconds, offsetMilliseconds, and more. In this example, diarization is enabled, so the response includes speaker information for each transcribed phrase. The combinedPhrases property contains the full transcriptions for all speakers in a single channel.

{
	"durationMilliseconds": 182439,
	"combinedPhrases": [
		{
			"channel": 0,
			"text": "Good afternoon. This is Sam. Thank you for calling Contoso. How can I help? Hi there. My name is Mary. I'm currently living in Los Angeles, but I'm planning to move to Las Vegas. I would like to apply for a loan. Okay. I see you're currently living in California. Let me make sure I understand you correctly. Uh You'd like to apply for a loan even though you'll be moving soon. Is that right? Yes, exactly. So I'm planning to relocate soon, but I would like to apply for the loan first so that I can purchase a new home once I move there. And are you planning to sell your current home? Yes, I will be listing it on the market soon and hopefully it'll sell quickly. That's why I'm applying for a loan now, so that I can purchase a new house in Nevada and close on it quickly as well once my current home sells. I see. Would you mind holding for a moment while I take your information down? Yeah, no problem. Thank you for your help. Mm-hmm. Just one moment. All right. Thank you for your patience, ma'am. May I have your first and last name, please? Yes, my name is Mary Smith. Thank you, Ms. Smith. May I have your current address, please? Yes. So my address is 123 Main Street in Los Angeles, California, and the zip code is 90923. Sorry, that was a 90 what? 90923. 90923 on Main Street. Got it. Thank you. May I have your phone number as well, please? Uh. Yes, my phone number is 504-529-2351 and then yeah. 2351. Got it. And do you have an e-mail address we I can associate with this application? Uh Yes, so my e-mail address is mary.a.sm78@gmail.com. Mary.a, was that a S-N as in November or M as in Mike? M as in Mike. Mike78, got it. Thank you. Ms. Smith, do you currently have any other loans? Uh Yes, so I currently have two other loans through Contoso. So my first one is my car loan and then my other is my student loan. They total about 1400 per month combined and my interest rate is 8%. I see. And. You're currently paying those loans off monthly, is that right? 
Yes, of course I do. OK, thank you. Here's what I suggest we do. Let me place you on a brief hold again so that I can talk with one of our loan officers and get this started for you immediately. In the meantime, it would be great if you could take a few minutes and complete the remainder of the secure application online at www.contosoloans.com. Yeah, that sounds good. I can go ahead and get started. Thank you for your help. Thank you."
		}
	],
	"phrases": [
		{
			"channel": 0,
			"speaker": 1,
			"offsetMilliseconds": 960,
			"durationMilliseconds": 640,
			"text": "Good afternoon.",
			"words": [
				{
					"text": "Good",
					"offsetMilliseconds": 960,
					"durationMilliseconds": 240
				},
				{
					"text": "afternoon.",
					"offsetMilliseconds": 1200,
					"durationMilliseconds": 400
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"channel": 0,
			"speaker": 1,
			"offsetMilliseconds": 1600,
			"durationMilliseconds": 640,
			"text": "This is Sam.",
			"words": [
				{
					"text": "This",
					"offsetMilliseconds": 1600,
					"durationMilliseconds": 240
				},
				{
					"text": "is",
					"offsetMilliseconds": 1840,
					"durationMilliseconds": 120
				},
				{
					"text": "Sam.",
					"offsetMilliseconds": 1960,
					"durationMilliseconds": 280
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"channel": 0,
			"speaker": 1,
			"offsetMilliseconds": 2240,
			"durationMilliseconds": 1040,
			"text": "Thank you for calling Contoso.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 2240,
					"durationMilliseconds": 200
				},
				{
					"text": "you",
					"offsetMilliseconds": 2440,
					"durationMilliseconds": 80
				},
				{
					"text": "for",
					"offsetMilliseconds": 2520,
					"durationMilliseconds": 120
				},
				{
					"text": "calling",
					"offsetMilliseconds": 2640,
					"durationMilliseconds": 200
				},
				{
					"text": "Contoso.",
					"offsetMilliseconds": 2840,
					"durationMilliseconds": 440
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"channel": 0,
			"speaker": 1,
			"offsetMilliseconds": 3280,
			"durationMilliseconds": 640,
			"text": "How can I help?",
			"words": [
				{
					"text": "How",
					"offsetMilliseconds": 3280,
					"durationMilliseconds": 120
				},
				{
					"text": "can",
					"offsetMilliseconds": 3440,
					"durationMilliseconds": 120
				},
				{
					"text": "I",
					"offsetMilliseconds": 3560,
					"durationMilliseconds": 40
				},
				{
					"text": "help?",
					"offsetMilliseconds": 3600,
					"durationMilliseconds": 320
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"channel": 0,
			"speaker": 0,
			"offsetMilliseconds": 5040,
			"durationMilliseconds": 400,
			"text": "Hi there.",
			"words": [
				{
					"text": "Hi",
					"offsetMilliseconds": 5040,
					"durationMilliseconds": 240
				},
				{
					"text": "there.",
					"offsetMilliseconds": 5280,
					"durationMilliseconds": 160
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"channel": 0,
			"speaker": 0,
			"offsetMilliseconds": 5440,
			"durationMilliseconds": 800,
			"text": "My name is Mary.",
			"words": [
				{
					"text": "My",
					"offsetMilliseconds": 5440,
					"durationMilliseconds": 80
				},
				{
					"text": "name",
					"offsetMilliseconds": 5520,
					"durationMilliseconds": 120
				},
				{
					"text": "is",
					"offsetMilliseconds": 5640,
					"durationMilliseconds": 80
				},
				{
					"text": "Mary.",
					"offsetMilliseconds": 5720,
					"durationMilliseconds": 520
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		// More transcription results...
	    // Redacted for brevity
		{
			"channel": 0,
			"speaker": 0,
			"offsetMilliseconds": 180320,
			"durationMilliseconds": 680,
			"text": "Thank you for your help.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 180320,
					"durationMilliseconds": 160
				},
				{
					"text": "you",
					"offsetMilliseconds": 180480,
					"durationMilliseconds": 80
				},
				{
					"text": "for",
					"offsetMilliseconds": 180560,
					"durationMilliseconds": 120
				},
				{
					"text": "your",
					"offsetMilliseconds": 180680,
					"durationMilliseconds": 120
				},
				{
					"text": "help.",
					"offsetMilliseconds": 180800,
					"durationMilliseconds": 200
				}
			],
			"locale": "en-US",
			"confidence": 0.9314801
		},
		{
			"channel": 0,
			"speaker": 1,
			"offsetMilliseconds": 181960,
			"durationMilliseconds": 280,
			"text": "Thank you.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 181960,
					"durationMilliseconds": 200
				},
				{
					"text": "you.",
					"offsetMilliseconds": 182160,
					"durationMilliseconds": 80
				}
			],
			"locale": "en-US",
			"confidence": 0.9314801
		}
    ]
}
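
To turn the diarized phrases into a readable dialogue, consecutive phrases from the same speaker can be merged into turns. The following Python sketch assumes the response was parsed into a dictionary and that every phrase carries the speaker field shown above; the function name speaker_turns is hypothetical.

```python
def speaker_turns(phrases):
    """Merge consecutive phrases from the same speaker into (speaker, text) turns."""
    turns = []
    # Sort by start time so the turns follow the order of the conversation.
    for phrase in sorted(phrases, key=lambda p: p["offsetMilliseconds"]):
        if turns and turns[-1][0] == phrase["speaker"]:
            # Same speaker as the previous phrase: extend the current turn.
            turns[-1] = (phrase["speaker"], turns[-1][1] + " " + phrase["text"])
        else:
            turns.append((phrase["speaker"], phrase["text"]))
    return turns


# Phrases trimmed from the sample response above.
sample_phrases = [
    {"speaker": 1, "offsetMilliseconds": 960, "text": "Good afternoon."},
    {"speaker": 1, "offsetMilliseconds": 1600, "text": "This is Sam."},
    {"speaker": 0, "offsetMilliseconds": 5040, "text": "Hi there."},
    {"speaker": 0, "offsetMilliseconds": 5440, "text": "My name is Mary."},
]
for speaker, text in speaker_turns(sample_phrases):
    print(f"Speaker {speaker}: {text}")
```

The speaker numbers are labels assigned by the service; mapping them to names such as Sam or Mary is left to your application.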

Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.

The following example shows how to transcribe an audio file that has one or two channels. Multi-channel transcriptions are useful for audio files with multiple channels, such as audio files with multiple speakers or audio files with background noise. By default, the fast transcription API merges all input channels into a single channel and then performs the transcription. If this isn't desirable, the channels can be transcribed independently without merging.

Replace YourSpeechResoureKey with your Speech resource key.
Replace YourServiceRegion with your Speech resource region.
Replace YourAudioFile with the path to your audio file.

curl --location 'https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2024-11-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSpeechResoureKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition="{
    "locales":["en-US"], 
    "channels": [0,1]}"'

Construct the form definition according to the following instructions:

Set the optional (but recommended) locales property to match the expected locale of the audio data to transcribe. In this example, the locale is set to en-US. The supported locales that you can specify are: de-DE, en-GB, en-IN, en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN.
Set the channels property to specify the zero-based indices of the channels to be transcribed separately. Up to two channels are supported unless diarization is enabled. In this example, channels 0 and 1 are specified.

For more information about locales, channels, and other properties of the fast transcription API, see the Request configuration options section later in this guide.

The response includes durationMilliseconds, offsetMilliseconds, and more. The channel property identifies the channel when the audio file contains multiple channels. The combinedPhrases property contains the full transcriptions separated by audio channel. Search for "channel": 0,"text" and "channel": 1,"text" to find the full transcription for each channel.

{
	"durationMilliseconds": 185079,
	"combinedPhrases": [
		{
			"channel": 0,
			"text": "Hello. Thank you for calling Contoso. Who am I speaking with today? Hi, Mary. Are you calling because you need health insurance? Great. If you can answer a few questions, we can get you signed up in the Jiffy. So what's your full name? Got it. And what's the best callback number in case we get disconnected? Yep, that'll be fine. Got it. So to confirm, it's 234-554-9312. Excellent. Let's get some additional information for your application. Do you have a job? OK, so then you have a Social Security number as well. OK, and what is your Social Security number please? Sorry, what was that, a 25 or a 225? You cut out for a bit. Alright, thank you so much. And could I have your e-mail address please? Great. Uh That is the last question. So let me take your information and I'll be able to get you signed up right away. Thank you for calling Contoso and I'll be able to get you signed up immediately. One of our agents will call you back in about 24 hours or so to confirm your application. Absolutely. If you need anything else, please give us a call at 1-800-555-5564, extension 123. Thank you very much for calling Contoso. Uh Yes, of course. So the default is a digital membership card, but we can send you a physical card if you prefer. Uh, yeah. Absolutely. I've made a note on your file. You're very welcome. Thank you for calling Contoso and have a great day."
		},
		{
			"channel": 1,
			"text": "Hi, my name is Mary Rondo. I'm trying to enroll myself with Contuso. Yes, yeah, I'm calling to sign up for insurance. Okay. So Mary Beth Rondo, last name is R like Romeo, O like Ocean, N like Nancy D, D like Dog, and O like Ocean again. Rondo. I only have a cell phone so I can give you that. Sure, so it's 234-554 and then 9312. Yep, that's right. Uh Yes, I am self-employed. Yes, I do. Uh Sure, so it's 412256789. It's double two, so 412, then another two, then five. Yeah, it's maryrondo@gmail.com. So my first and last name at gmail.com. No periods, no dashes. That was quick. Thank you. Actually, so I have one more question. I'm curious, will I be getting a physical card as proof of coverage? uh Yes. Could you please mail it to me when it's ready? I'd like to have it shipped to, are you ready for my address? So it's 2660 Unit A on Maple Avenue SE, Lansing, and then zip code is 48823. Awesome. Thanks so much."
		}
	],
	"phrases": [
		{
			"channel": 0,
			"offsetMilliseconds": 720,
			"durationMilliseconds": 480,
			"text": "Hello.",
			"words": [
				{
					"text": "Hello.",
					"offsetMilliseconds": 720,
					"durationMilliseconds": 480
				}
			],
			"locale": "en-US",
			"confidence": 0.9177142
		},
		{
			"channel": 0,
			"offsetMilliseconds": 1200,
			"durationMilliseconds": 1120,
			"text": "Thank you for calling Contoso.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 1200,
					"durationMilliseconds": 200
				},
				{
					"text": "you",
					"offsetMilliseconds": 1400,
					"durationMilliseconds": 80
				},
				{
					"text": "for",
					"offsetMilliseconds": 1480,
					"durationMilliseconds": 120
				},
				{
					"text": "calling",
					"offsetMilliseconds": 1600,
					"durationMilliseconds": 240
				},
				{
					"text": "Contoso.",
					"offsetMilliseconds": 1840,
					"durationMilliseconds": 480
				}
			],
			"locale": "en-US",
			"confidence": 0.9177142
		},
		{
			"channel": 0,
			"offsetMilliseconds": 2320,
			"durationMilliseconds": 1120,
			"text": "Who am I speaking with today?",
			"words": [
				{
					"text": "Who",
					"offsetMilliseconds": 2320,
					"durationMilliseconds": 160
				},
				{
					"text": "am",
					"offsetMilliseconds": 2480,
					"durationMilliseconds": 80
				},
				{
					"text": "I",
					"offsetMilliseconds": 2560,
					"durationMilliseconds": 80
				},
				{
					"text": "speaking",
					"offsetMilliseconds": 2640,
					"durationMilliseconds": 320
				},
				{
					"text": "with",
					"offsetMilliseconds": 2960,
					"durationMilliseconds": 160
				},
				{
					"text": "today?",
					"offsetMilliseconds": 3120,
					"durationMilliseconds": 320
				}
			],
			"locale": "en-US",
			"confidence": 0.9177142
		},
		{
			"channel": 0,
			"offsetMilliseconds": 9520,
			"durationMilliseconds": 400,
			"text": "Hi, Mary.",
			"words": [
				{
					"text": "Hi,",
					"offsetMilliseconds": 9520,
					"durationMilliseconds": 80
				},
				{
					"text": "Mary.",
					"offsetMilliseconds": 9600,
					"durationMilliseconds": 320
				}
			],
			"locale": "en-US",
			"confidence": 0.9177142
		},
		// More transcription results...
	    // Redacted for brevity
		{
			"channel": 1,
			"offsetMilliseconds": 4480,
			"durationMilliseconds": 1600,
			"text": "Hi, my name is Mary Rondo.",
			"words": [
				{
					"text": "Hi,",
					"offsetMilliseconds": 4480,
					"durationMilliseconds": 400
				},
				{
					"text": "my",
					"offsetMilliseconds": 4880,
					"durationMilliseconds": 120
				},
				{
					"text": "name",
					"offsetMilliseconds": 5000,
					"durationMilliseconds": 120
				},
				{
					"text": "is",
					"offsetMilliseconds": 5120,
					"durationMilliseconds": 160
				},
				{
					"text": "Mary",
					"offsetMilliseconds": 5280,
					"durationMilliseconds": 240
				},
				{
					"text": "Rondo.",
					"offsetMilliseconds": 5520,
					"durationMilliseconds": 560
				}
			],
			"locale": "en-US",
			"confidence": 0.8989456
		},
		{
			"channel": 1,
			"offsetMilliseconds": 6080,
			"durationMilliseconds": 1920,
			"text": "I'm trying to enroll myself with Contuso.",
			"words": [
				{
					"text": "I'm",
					"offsetMilliseconds": 6080,
					"durationMilliseconds": 160
				},
				{
					"text": "trying",
					"offsetMilliseconds": 6240,
					"durationMilliseconds": 200
				},
				{
					"text": "to",
					"offsetMilliseconds": 6440,
					"durationMilliseconds": 80
				},
				{
					"text": "enroll",
					"offsetMilliseconds": 6520,
					"durationMilliseconds": 200
				},
				{
					"text": "myself",
					"offsetMilliseconds": 6720,
					"durationMilliseconds": 360
				},
				{
					"text": "with",
					"offsetMilliseconds": 7080,
					"durationMilliseconds": 120
				},
				{
					"text": "Contuso.",
					"offsetMilliseconds": 7200,
					"durationMilliseconds": 800
				}
			],
			"locale": "en-US",
			"confidence": 0.8989456
		},
		// More transcription results...
	    // Redacted for brevity
    ]
}
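
Because each channel is transcribed independently, the phrases of the two channels are not interleaved in the response. One way to reconstruct the order of the conversation is to sort all phrases by their offset. The following Python sketch assumes a parsed response; the function name interleave_channels is hypothetical.

```python
def interleave_channels(phrases):
    """Return (offset_ms, channel, text) tuples ordered by start time."""
    ordered = sorted(phrases, key=lambda p: (p["offsetMilliseconds"], p["channel"]))
    return [(p["offsetMilliseconds"], p["channel"], p["text"]) for p in ordered]


# Phrases trimmed from the sample response above.
sample_phrases = [
    {"channel": 0, "offsetMilliseconds": 720, "text": "Hello."},
    {"channel": 0, "offsetMilliseconds": 1200, "text": "Thank you for calling Contoso."},
    {"channel": 0, "offsetMilliseconds": 2320, "text": "Who am I speaking with today?"},
    {"channel": 0, "offsetMilliseconds": 9520, "text": "Hi, Mary."},
    {"channel": 1, "offsetMilliseconds": 4480, "text": "Hi, my name is Mary Rondo."},
    {"channel": 1, "offsetMilliseconds": 6080, "text": "I'm trying to enroll myself with Contuso."},
]
for offset_ms, channel, text in interleave_channels(sample_phrases):
    print(f"[{offset_ms / 1000:.2f}s] channel {channel}: {text}")
```

Sorting by offset works here because both channels share the same timeline; overlapping speech simply appears in start-time order.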

Note

The Speech service is an elastic service. If you receive a 429 error code (too many requests), follow the best practices to mitigate throttling during autoscaling.
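
A common general-purpose way to react to a 429 response is to retry with exponential backoff. The following Python sketch shows that idea only; it is not taken from the best-practices guidance referred to above. The send callable, which is assumed to return a (status_code, body) tuple, and all parameter names are hypothetical.

```python
import time


def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `send` until it returns a status other than 429 or attempts run out.

    `send` is a hypothetical callable that performs the request and returns
    a (status_code, body) tuple.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** attempt)
    return status, body


# Simulated service: throttled twice, then successful. No real waiting happens
# because the delays are recorded instead of slept.
responses = iter([(429, ""), (429, ""), (200, "ok")])
delays = []
print(call_with_backoff(lambda: next(responses), sleep=delays.append), delays)
```

In production code you would typically also cap the total wait time and add random jitter so that many clients don't retry at the same moment.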

Request configuration options

Here are some property options to configure a transcription when you call the Transcriptions - Transcribe operation.

Property	Description	Required or optional
`channels`	The list of zero-based indices of the channels to be transcribed separately. Up to two channels are supported unless diarization is enabled. By default, the fast transcription API merges all input channels into a single channel and then performs the transcription. If this isn't desirable, channels can be transcribed independently without merging. If you want to transcribe the channels of a stereo audio file separately, you need to specify `[0,1]`, `[0]`, or `[1]`. Otherwise, stereo audio is merged to mono and only a single channel is transcribed. If the audio is stereo and diarization is enabled, you can't set the `channels` property to `[0,1]`. The Speech service doesn't support diarization of multiple channels. For mono audio, the `channels` property is ignored and the audio is always transcribed as a single channel.	Optional
`diarization`	The diarization configuration. Diarization is the process of recognizing and separating multiple speakers in one audio channel. For example, specify `"diarization": {"maxSpeakers": 2, "enabled": true}`. The transcription then contains `speaker` entries (such as `"speaker": 0` or `"speaker": 1`) for each transcribed phrase.	Optional
`locales`	The list of locales that should match the expected locale of the audio data to transcribe. If you know the locale of the audio file, you can specify it to improve transcription accuracy and minimize latency. If a single locale is specified, that locale is used for transcription. If you're not sure about the locale, you can specify multiple locales to use language identification. Language identification might be more accurate with a more precise list of candidate locales. If you don't specify any locale, the Speech service uses the latest multilingual model to identify the locale and transcribe continuously. You can get the latest supported languages via the Transcriptions - List Supported Locales REST API (API version 2024-11-15 or later). For more information about locales, see the Speech service language support documentation.	Optional, but recommended if you know the expected locale.
`profanityFilterMode`	Specifies how to handle profanity in recognition results. Accepted values are `None` to disable profanity filtering, `Masked` to replace profanity with asterisks, `Removed` to remove all profanity from the result, or `Tags` to add profanity tags. The default value is `Masked`.	Optional
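
Some of the constraints in the table can be checked on the client before a request is sent. The following Python sketch encodes a few of them as read from the table above; the function name check_definition is hypothetical, and the service remains the authority on what it accepts.

```python
ALLOWED_PROFANITY_MODES = {"None", "Masked", "Removed", "Tags"}


def check_definition(definition: dict) -> list:
    """Return a list of problems found in a definition dictionary (empty if none).

    Hypothetical client-side check based on the property table; it does not
    replace validation by the service.
    """
    problems = []
    channels = definition.get("channels")
    diarization_on = definition.get("diarization", {}).get("enabled", False)
    if channels is not None:
        if len(channels) > 2:
            problems.append("at most two channels can be transcribed separately")
        if diarization_on and sorted(channels) == [0, 1]:
            problems.append("channels [0,1] cannot be combined with diarization")
    # The table gives Masked as the default when the property is omitted.
    mode = definition.get("profanityFilterMode", "Masked")
    if mode not in ALLOWED_PROFANITY_MODES:
        problems.append(f"unknown profanityFilterMode: {mode}")
    return problems


print(check_definition({"locales": ["en-US"], "channels": [0, 1]}))
print(check_definition({
    "channels": [0, 1],
    "diarization": {"maxSpeakers": 2, "enabled": True},
}))
```

Returning a list of problems instead of raising on the first one makes it possible to report every issue with a definition at once.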
