Yep, you're correct: “to create phonemes” is ambiguous and slightly misleading in that context. Phonemes are linguistic units representing sounds, so prosodic parsing doesn't create them. They are identified first and then given acoustic properties before synthesis.
Here’s a clearer revision of the relevant sentence that better reflects the process:
To synthesize speech, the system typically tokenizes the text into individual words and assigns phonetic sounds to each word. It then groups the phonetic transcription into prosodic units (such as phrases, clauses, or sentences) and adds acoustic properties to the phonemes. These phonemes, enriched with timing, intonation, and stress, are then synthesized as audio, which can be modulated with a particular voice, speaking rate, pitch, and volume.
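In case it helps to see where the voice, speaking rate, pitch, and volume come in, here is a minimal sketch that builds an SSML string using the standard `voice` and `prosody` elements. The helper name `build_ssml` is hypothetical and the voice name is only an example value. The snippet just produces the markup and does not call any speech service.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice="en-US-JennyNeural", rate="medium",
               pitch="medium", volume="medium"):
    """Wrap plain text in SSML that requests a voice and prosody settings.

    build_ssml is a hypothetical helper; the default voice name is an
    example value, not a recommendation.
    """
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        # quoteattr adds the surrounding quotes and escapes the value
        f"<voice name={quoteattr(voice)}>"
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)} "
        f"volume={quoteattr(volume)}>"
        # escape protects characters such as & and < in the input text
        f"{escape(text)}"
        "</prosody></voice></speak>"
    )


ssml = build_ssml("Hello, world & welcome.",
                  rate="slow", pitch="high", volume="loud")
print(ssml)
```

The tokenization, phonetic transcription, and prosodic grouping described above happen inside the synthesizer. The markup only expresses how the resulting audio should be modulated.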
This forum is monitored by Microsoft staff, so I'd expect them to take note of your suggestion and reach out to the team maintaining the MS Learn content about updating it.
If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.
hth
Marcin