Source and target language misalignment in TMX files when uploading training data to Custom Translator

Clinton 25 Reputation points
2025-07-01T13:41:43.7966667+00:00

When I upload a .tmx training file, Azure sometimes breaks up the text incorrectly, so the source and target language snippets are no longer in alignment. For example, in the first attached screenshot the source and target are aligned. In the second screenshot (in Azure), the text is broken up incorrectly and no longer aligns with the correct translation.

Azure AI Translator
An Azure service to easily conduct machine translation with a simple REST API call.

Accepted answer
  1. Jerald Felix 2,410 Reputation points
    2025-07-01T16:28:54.8733333+00:00

    Hi Clinton,

    When a TMX file shows “broken” sentence pairs in the Custom Translator portal it almost always means the file’s own segmentation isn’t what the service expects, so it falls back on its built-in sentence-breaker and produces the mis-alignment you see in the UI. Below is what’s happening and the quickest ways to fix it.

    Why it happens

    | Root cause | What to look for in the TMX |
    | --- | --- |
    | Multi-sentence segments: a single <seg> contains two or more sentences | Hard line-breaks, <br>, or punctuation like “. ” inside one <seg> |
    | Paragraph-level segmentation (segtype="paragraph" or missing) | Each <tu> is a whole paragraph; the portal then re-segments on periods |
    | Mixed or missing language tags | xml:lang="en" in the source but xml:lang="es-ES" in some targets |
    | Invisible control characters | Copy-paste from Word leaves 0x0D/0x0B that split sentences on upload |

    Important: Custom Translator skips sentence breaking/alignment only when the TMX is already clean; otherwise it applies its own aligner to try to rescue the data. ([learn.microsoft.com](https://learn.microsoft.com/en-us/azure/ai-services/translator/custom-translator/faq "Frequently asked questions - Azure AI Custom Translator - Azure AI services | Microsoft Learn"))
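To see which of these causes applies to a given file, a quick audit script can help. The following is only a minimal sketch using Python's standard library: the function name `audit_tmx` is made up for illustration, and the set of "suspect" characters is an assumption, not the list Custom Translator actually uses. It tallies the language codes on every <tuv> and lists segments that contain line breaks or other control characters.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
# Assumption: treat line breaks and other control characters (tab is allowed)
# inside a segment as suspicious.
SUSPECT = re.compile(r"[\x00-\x08\x0a-\x1f\x7f\u2028\u2029]")


def audit_tmx(root):
    """Return (language-code counts, [(tu_number, lang, text)] for suspect segments)."""
    lang_counts = Counter()
    suspects = []
    for index, tu in enumerate(root.iter("tu"), start=1):
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang") or "(missing)"
            lang_counts[lang] += 1
            seg = tuv.find("seg")
            # Leading/trailing whitespace from pretty-printing is ignored.
            text = "".join(seg.itertext()).strip() if seg is not None else ""
            if SUSPECT.search(text):
                suspects.append((index, lang, text))
    return lang_counts, suspects
```

Typical use would be `counts, suspects = audit_tmx(ET.parse("training.tmx").getroot())`, where `training.tmx` is a placeholder file name. More than two distinct codes in `counts` (for example both `en` and `en-US`) points at the language-tag cause, and any entry in `suspects` points at embedded breaks.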

    How to resolve

    1. One sentence per <tu> / <seg>
      • Re-segment the TMX in a CAT tool (Trados, memoQ, Okapi CheckMate) so every TU is a single sentence.
      • Set segtype="sentence" in the TMX header if your tool supports it (keeps Translator from guessing). ([gala-global.org](https://www.gala-global.org/tmx-14b "TMX 1.4b | GALA Global"))
    2. Strip hard line-breaks inside segments
      • Search-and-replace &#10; or &#13; (LF/CR) and <br> tags inside <seg>; Azure treats them as new sentences.
      • The sentence-alignment article explicitly warns that newlines inside a sentence “cause poor alignments.” ([learn.microsoft.com](https://learn.microsoft.com/en-us/azure/ai-services/translator/custom-translator/concepts/sentence-alignment "Sentence pairing and alignment - Azure AI Custom Translator - Azure AI services | Microsoft Learn"))
    3. Validate language codes
      • Make sure every <tuv> has the exact same ISO tag throughout the file (en vs en-US counts as different). Mismatches trigger partial re-alignment.
    4. Force-skip alignment with .align (optional)
      • If you have a clean sentence-per-line file but still see issues, convert it to two UTF-8 text files (source + target) with one line = one sentence, rename the pair to .align, and upload.
      • Files with the .align extension tell Custom Translator to completely bypass its aligner. ([learn.microsoft.com](https://learn.microsoft.com/en-us/azure/ai-services/translator/custom-translator/concepts/document-formats-naming-convention "Document formats and naming conventions - Azure AI Custom Translator - Azure AI services | Microsoft Learn"))
    5. Run a TMX validator before upload
      • Tools like Okapi CheckMate or SDL TMXValidator catch unbalanced tags, rogue control characters, and invalid XML that can confuse the importer.
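If you would rather script the line-break cleanup and the .align export than do them by hand, something along these lines might work. It is a rough sketch, not an official tool: `flatten`, `tmx_to_pairs` and `write_align_pair` are hypothetical helper names, the output file names are placeholders (check the naming-convention article for what the portal expects), and it only handles <br> that appears as escaped text inside <seg>, not as a real child element.

```python
import re
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
# One or more line breaks or literal <br> markers, plus surrounding whitespace.
BREAKS = re.compile(r"\s*(?:<br\s*/?>|[\r\n\x0b\x0c\u2028\u2029])+\s*", re.IGNORECASE)


def flatten(seg):
    """Join a <seg>'s text and replace embedded breaks with a single space."""
    return BREAKS.sub(" ", "".join(seg.itertext())).strip()


def tmx_to_pairs(root, src_lang, tgt_lang):
    """Collect (source, target) pairs for units where both sides are non-empty."""
    pairs = []
    for tu in root.iter("tu"):
        by_lang = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                by_lang[lang] = flatten(seg)
        src, tgt = by_lang.get(src_lang.lower()), by_lang.get(tgt_lang.lower())
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


def write_align_pair(pairs, src_path, tgt_path):
    """Write two UTF-8 files in which line N of one matches line N of the other."""
    with open(src_path, "w", encoding="utf-8", newline="\n") as src_file, \
         open(tgt_path, "w", encoding="utf-8", newline="\n") as tgt_file:
        for src, tgt in pairs:
            src_file.write(src + "\n")
            tgt_file.write(tgt + "\n")
```

For example, `write_align_pair(tmx_to_pairs(ET.parse("training.tmx").getroot(), "en", "es"), "clean_en.align", "clean_es.align")`. Units missing either language are dropped rather than written as empty lines, so the two files always have the same number of lines.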

    Does the training fail if the portal looks wrong?

    Usually, yes—if the UI shows split sentences the underlying alignment count drops, so you’ll get low “Aligned Sentences” metrics and the model quality will suffer. Fixing the segmentation before training is therefore worth the effort.

    Quick checklist before your next upload

    ✔ Each TU contains exactly one source-sentence and one target-sentence

    ✔ xml:lang attributes are consistent and use ISO codes Azure recognises

    ✔ No embedded hard returns or HTML line-breaks inside <seg>

    ✔ TMX passes a validator (well-formed XML, TMX 1.4b compliant)

    ✔ (Optional) export to .align if you’re 100 % sure the sentences already match
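The first checklist item can be roughly spot-checked in code as well. This is a naive heuristic of my own, not how Custom Translator's sentence breaker works: it treats `.`, `!` or `?` followed by whitespace as a sentence boundary, so abbreviations such as "Dr. Smith" will be flagged as false positives. The names `count_sentences` and `flag_pairs` are illustrative only.

```python
import re

# Assumption: a sentence ends at ., ! or ? followed by whitespace and more text.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=\S)")


def count_sentences(text):
    """Rough sentence count for one segment (0 for empty text)."""
    text = text.strip()
    return len(BOUNDARY.split(text)) if text else 0


def flag_pairs(pairs):
    """Return (pair_number, source_count, target_count) for pairs that are not 1:1."""
    flagged = []
    for number, (src, tgt) in enumerate(pairs, start=1):
        n_src, n_tgt = count_sentences(src), count_sentences(tgt)
        if n_src != 1 or n_tgt != 1:
            flagged.append((number, n_src, n_tgt))
    return flagged
```

Anything it flags is worth opening in your CAT tool; an empty result does not prove the file is clean, it only means this simple rule found nothing.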

    Apply the above, re-upload the file, and the portal should display perfectly aligned snippets—training scores will improve too. If problems persist after cleaning, raise a support ticket with a small sample of the TMX so the team can reproduce the importer behaviour.

    Hope that clears it up! Let me know your feedback.

    Best Regards,

    Jerald Felix
