Hi Clinton,
When a TMX file shows “broken” sentence pairs in the Custom Translator portal, it almost always means the file’s own segmentation isn’t what the service expects, so the service falls back on its built-in sentence breaker and produces the misalignment you see in the UI. Below is what’s happening and the quickest ways to fix it.
Why it happens
Root cause | What to look for in the TMX |
---|---|
Multi-sentence segments – a single <seg> contains two or more sentences | Hard line-breaks, <br>, or punctuation like “. ” inside one <seg> |
Paragraph-level segmentation (segtype="paragraph" or missing) | Each <tu> is a whole paragraph; the portal then re-segments on periods |
Mixed or missing language tags | xml:lang="en" in the source but xml:lang="es-ES" in some targets |
Invisible control characters | Copy-paste from Word leaves 0x0D/0x0B that split sentences on upload |
Important: Custom Translator skips sentence breaking/alignment only when the TMX is already clean; otherwise it applies its own aligner to try to rescue the data. ([learn.microsoft.com](https://learn.microsoft.com/en-us/azure/ai-services/translator/custom-translator/faq "Frequently asked questions - Azure AI Custom Translator - Azure AI services | Microsoft Learn"))
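The root causes in the table can be checked locally before uploading. Below is a minimal Python sketch using only the standard library. The function name `diagnose_tmx` is hypothetical, and the multi-sentence test is a rough heuristic (it will also flag abbreviations such as “Dr. Smith”), so treat its output as hints rather than a definitive verdict.

```python
import re
import xml.etree.ElementTree as ET

# Heuristic: sentence-ending punctuation, whitespace, then more text.
MULTI_SENTENCE = re.compile(r"[.!?]\s+\S")
# Control characters that XML 1.0 does not allow (includes 0x0B).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def diagnose_tmx(tmx_text):
    """Return a list of (tu_index, problem) tuples for a TMX document string.

    tu_index is None for file-level problems.
    """
    problems = []
    if CONTROL_CHARS.search(tmx_text):
        problems.append((None, "control character in raw file"))
        # Strip them so the XML parser does not reject the document.
        tmx_text = CONTROL_CHARS.sub("", tmx_text)
    root = ET.fromstring(tmx_text)
    header = root.find("header")
    segtype = header.get("segtype") if header is not None else None
    if segtype != "sentence":
        problems.append((None, f"segtype is {segtype!r}, not 'sentence'"))
    for i, tu in enumerate(root.iter("tu"), start=1):
        for seg in tu.iter("seg"):
            text = "".join(seg.itertext()).strip()
            if "\n" in text:
                problems.append((i, "line break inside <seg>"))
            if MULTI_SENTENCE.search(text):
                problems.append((i, "possible multiple sentences in <seg>"))
    return problems
```

Read the file with `open(path, encoding="utf-8").read()` and pass the string in; the TU indexes in the result tell you where to look in your CAT tool.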
How to resolve
- One sentence per <tu>/<seg>
  - Re-segment the TMX in a CAT tool (Trados, memoQ, Okapi CheckMate) so every TU is a single sentence.
  - Set segtype="sentence" in the TMX header if your tool supports it (keeps Translator from guessing). ([gala-global.org](https://www.gala-global.org/tmx-14b "TMX 1.4b | GALA Global"))
- Remove hard line breaks: strip line breaks (LF/CR) and <br> tags inside <seg>; Azure treats them as new sentences. The sentence-alignment article explicitly warns that newlines inside a sentence “cause poor alignments.” ([learn.microsoft.com](https://learn.microsoft.com/en-us/azure/ai-services/translator/custom-translator/concepts/sentence-alignment "Sentence pairing and alignment - Azure AI Custom Translator - Azure AI services | Microsoft Learn"))
- Validate language codes: make sure every <tuv> for a given language uses the exact same ISO tag throughout the file (en vs en-US counts as different). Mismatches trigger partial re-alignment.
- Force-skip alignment with .align (optional): if you have a clean sentence-per-line file but still see issues, convert it to two UTF-8 text files (source + target) with one line = one sentence, rename the pair to .align, and upload. Files with the .align extension tell Custom Translator to completely bypass its aligner. ([learn.microsoft.com](https://learn.microsoft.com/en-us/azure/ai-services/translator/custom-translator/concepts/document-formats-naming-convention "Document formats and naming conventions - Azure AI Custom Translator - Azure AI services | Microsoft Learn"))
- Run a TMX validator before upload: tools like Okapi CheckMate or SDL TMXValidator catch unbalanced tags, rogue control characters, and invalid XML that can confuse the importer.
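If you go the .align route, the conversion to two parallel text files can be scripted. The sketch below is one possible approach, not an official tool: the helper names and the output file names (`corpus_en.align`, `corpus_es.align`) are assumptions for illustration, so check the naming-conventions article for the exact names the portal expects. It collapses line breaks inside each <seg>, turns <br/> into a space, and skips any TU that lacks one of the two languages so the files stay line-for-line parallel.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def seg_text(seg):
    """Return <seg> text on one line, with runs of whitespace collapsed."""
    if seg is None:
        return ""
    # Joining the pieces with a space means an inline <br/> becomes a space;
    # text inside other inline tags is kept as plain text.
    return " ".join(" ".join(seg.itertext()).split())

def extract_pairs(root, src_lang, tgt_lang):
    """Return (source, target) sentence tuples for TUs that have both languages."""
    pairs = []
    for tu in root.iter("tu"):
        by_lang = {tuv.get(XML_LANG): seg_text(tuv.find("seg"))
                   for tuv in tu.iter("tuv")}
        src, tgt = by_lang.get(src_lang), by_lang.get(tgt_lang)
        if src and tgt:  # skip incomplete units so both files stay parallel
            pairs.append((src, tgt))
    return pairs

def write_align_pair(pairs, src_path, tgt_path):
    """Write one sentence per line to two UTF-8 files."""
    with open(src_path, "w", encoding="utf-8") as src_f, \
         open(tgt_path, "w", encoding="utf-8") as tgt_f:
        for src, tgt in pairs:
            src_f.write(src + "\n")
            tgt_f.write(tgt + "\n")

if __name__ == "__main__":
    # Hypothetical input and output names; adjust to your own files.
    root = ET.parse("memory.tmx").getroot()
    write_align_pair(extract_pairs(root, "en", "es"),
                     "corpus_en.align", "corpus_es.align")
```

Because whitespace is collapsed with a simple join, you may see a stray space next to inline formatting tags; spot-check a few lines before uploading.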
Does the training fail if the portal looks wrong?
Usually, yes: if the UI shows split sentences, the underlying alignment count drops, so you’ll get low “Aligned Sentences” metrics and the model quality will suffer. Fixing the segmentation before training is therefore worth the effort.
Quick checklist before your next upload
✔ Each TU contains exactly one source-sentence and one target-sentence
✔ xml:lang attributes are consistent and use ISO codes Azure recognises
✔ No embedded hard returns or HTML line-breaks inside <seg>
✔ TMX passes a validator (well-formed XML, TMX 1.4b compliant)
✔ (Optional) export to .align if you’re 100 % sure the sentences already match
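The language-tag item on the checklist is easy to verify with a few lines of Python. This is a rough sketch with hypothetical function names; it only reports which xml:lang values occur and which TUs deviate from the pair you expect, and it does not check whether Azure recognises a given code.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def lang_tag_counts(root):
    """Count xml:lang values across all <tuv> elements (None = attribute missing)."""
    return Counter(tuv.get(XML_LANG) for tuv in root.iter("tuv"))

def inconsistent_units(root, expected):
    """Return 1-based indexes of <tu> elements whose tags differ from `expected`."""
    expected = set(expected)
    return [
        i for i, tu in enumerate(root.iter("tu"), start=1)
        if {tuv.get(XML_LANG) for tuv in tu.iter("tuv")} != expected
    ]

if __name__ == "__main__":
    root = ET.parse("memory.tmx").getroot()  # hypothetical file name
    print(lang_tag_counts(root))
    print(inconsistent_units(root, ["en", "es"]))
```

More than two distinct keys in the counter (for example both en and en-US) usually points at the mixed-tag problem described earlier.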
Apply the above, re-upload the file, and the portal should display correctly aligned sentence pairs; training scores should improve too. If problems persist after cleaning, raise a support ticket with a small sample of the TMX so the team can reproduce the importer behaviour.
Hope that clears it up! Let me know your feedback.
Best Regards,
Jerald Felix