Is Custom Translator using a language specific tokenizer for Japanese?

Albert Llorens 20 Reputation points
2026-01-22T08:23:41.77+00:00

Hi,

I recently trained a Custom Translator model English to Japanese. The model was a new version of a model I trained about 1.5 years ago, now trained with additional data. See snapshot below.

[Screenshot: model versions and their BLEU scores in the Custom Translator portal]

Given the significant difference in the BLEU scores I see in the Custom Translator portal, I downloaded the test set of my newly trained model EnUsJaJp_02, and I recalculated the BLEU score with my own BLEU script.

In my script, I have an option to use a generic tokenizer (based on blanks and punctuation marks) or language-specific tokenizers for languages like Chinese or Japanese, which require specific rules since they don't use blanks between words in their writing. For Japanese, I'm using KyTea, from this repo: https://github.com/neubig/kytea.
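For reference, this is roughly what a blank/punctuation-based generic tokenizer does (the function name and regex here are illustrative, not the actual script):

```python
import re

def generic_tokenize(text: str) -> list[str]:
    """Sketch of a generic tokenizer: split on whitespace and
    treat each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

# Works as expected for languages that use spaces:
print(generic_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']

# But an unsegmented Japanese sentence comes back as a single token,
# which is why word-overlap metrics like BLEU need a real segmenter:
print(generic_tokenize("私は猫が好きです"))  # ['私は猫が好きです']
```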

After calculating the scores with my BLEU script, I see the following:

[Screenshots: BLEU scores from my script with the generic tokenizer vs. the KyTea tokenizer]

As you can see, when I calculate the scores with my BLEU script using a generic tokenizer, I get BLEU scores very similar to the ones shown in Custom Translator for the model I trained in Jan 2026. But if I calculate BLEU with a Japanese tokenizer (KyTea), the scores are much higher and reasonably similar to the scores I see in Custom Translator for the model I trained in 2024.

Given this data, can you clarify whether Microsoft Custom Translator changed the way the BLEU score is calculated for custom models? Is it possible that Microsoft's BLEU algorithm is no longer using a Japanese-specific tokenizer on the datasets (while, given the scores, it seems it was using one in May 2024)?

Thanks

Azure Translator in Foundry Tools

2 answers

  1. SRILAKSHMI C 13,830 Reputation points Microsoft External Staff Moderator
    2026-01-22T11:44:37.03+00:00

    Hello Albert Llorens,

    Welcome to Microsoft Q&A,

Thank you for the detailed investigation and for sharing your evaluation results. Your analysis is solid, and the conclusions you're drawing are well-founded.

    Yes, Microsoft Custom Translator currently reports BLEU scores using a language-agnostic (generic) tokenizer, including for Japanese. It does not apply a Japanese-specific tokenizer (such as KyTea or other morphological segmentation) when calculating the BLEU scores shown in the Custom Translator portal.

    Tokenization and BLEU in Custom Translator

    Custom Translator performs internal text preprocessing (such as character normalization and punctuation handling) as part of its pipeline. However, for BLEU score calculation, the evaluation now relies on a uniform, whitespace / punctuation-based tokenization approach across all languages, including non-whitespace languages like Japanese.

    Because BLEU is highly sensitive to tokenization, this has a noticeable impact on reported scores for Japanese.
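To make that sensitivity concrete, here is a toy illustration of BLEU's core ingredient (modified n-gram precision), where character-level segmentation stands in for a morphological tokenizer such as KyTea; this is not Microsoft's actual scoring code:

```python
from collections import Counter

def ngram_precision(hyp, ref, n=1):
    """Modified n-gram precision, the core ingredient of BLEU:
    clipped n-gram overlap divided by the hypothesis n-gram count."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    return overlap / max(sum(hyp_ngrams.values()), 1)

ref = "私は猫が好きです"        # "I like cats"
hyp = "私は猫がとても好きです"  # "I really like cats"

# Unsegmented: a whitespace tokenizer sees each sentence as ONE token,
# so nothing matches and the precision is 0.
print(ngram_precision([hyp], [ref]))  # 0.0

# Character-segmented (a crude proxy for KyTea/MeCab output): 8 of the
# 11 hypothesis characters appear in the reference.
print(round(ngram_precision(list(hyp), list(ref)), 3))  # 0.727
```

The same hypothesis/reference pair scores 0 under one tokenization and about 0.73 under the other, which is the magnitude of shift the question reports.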

    What Changed (and What Didn’t)

    There has been no indication of a model quality regression. Instead, there has been an evolution in how BLEU is standardized and reported:

    Earlier evaluations (circa 2023–mid-2024) used a more locale-aware internal evaluation pipeline, which for Japanese produced BLEU scores closer to those obtained with morphological tokenizers.

    Current evaluations (late 2024 onward) use a consistent, language-agnostic tokenizer, aligned with:

    • Cross-language comparability
    • Reproducibility at scale
    • Automated benchmarking across many language pairs

    This aligns precisely with what you observed:

    • Your BLEU script with a generic tokenizer closely matches the Custom Translator BLEU reported for the Jan 2026 model.
    • Your BLEU script with KyTea (Japanese tokenizer) yields significantly higher BLEU scores.
    • The May 2024 model’s portal BLEU appears closer to a tokenizer-aware evaluation, reflecting the earlier evaluation approach.

    Why Microsoft Uses a Generic Tokenizer

    Using a single tokenization strategy allows Microsoft to:

    • Maintain consistent evaluation across all languages
    • Avoid language-specific evaluation bias
    • Ensure stable relative comparisons between model versions

    As a result, the BLEU score in the portal is best interpreted as a relative metric (e.g., comparing two models trained and evaluated under the same methodology), rather than an absolute measure of Japanese linguistic quality.

    Clarification

    This change affects only the BLEU calculation methodology. It does not affect:

    • Model training
    • Inference or decoding
    • Actual translation quality

    Your KyTea-based BLEU evaluation remains more linguistically meaningful for Japanese and is entirely valid for:

    • Internal quality tracking
    • Regression analysis
    • Complementing human evaluation

    Best Practice

    For Japanese:

    • Use Custom Translator BLEU for relative comparisons between models evaluated under the same regime.
    • Use tokenizer-aware BLEU or human evaluation to assess real translation quality.
    • Avoid directly comparing BLEU scores across time periods where the evaluation methodology may differ.

    In summary:

    • Custom Translator does not currently use a Japanese-specific tokenizer for BLEU.
    • Your findings are accurate and expected.
    • The observed BLEU difference is methodological, not a quality regression.
    • Tokenizer-aware BLEU (e.g., KyTea) remains the better indicator of Japanese translation quality.


    I hope this helps. Do let me know if you have any further queries.

    Thank you!

    1 person found this answer helpful.

  2. Sina Salam 27,786 Reputation points Volunteer Moderator
    2026-01-26T11:48:02.1933333+00:00

    Hello Albert Llorens,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you are asking whether Custom Translator uses a language-specific tokenizer for Japanese.

    The BLEU score difference is caused by Microsoft switching from a Japanese-aware tokenizer to a generic tokenizer in late 2024. Your model quality did not regress, but the Azure-reported BLEU is no longer linguistically meaningful for Japanese. To track quality accurately, evaluate your models with SacreBLEU using a Japanese tokenizer (KyTea or ja-mecab) and maintain your own stable evaluation pipeline.

    For more details, see:

    • https://learn.microsoft.com/en-us/azure/ai-services/translator/custom-translator/concepts/bleu-score
    • https://pypi.org/project/sacrebleu
    • https://deepwiki.com/mjpost/sacrebleu/6-tokenization-system
    • https://lightning.ai/docs/torchmetrics/stable/text/sacre_bleu_score.htm
    • https://learn.microsoft.com/en-us/azure/ai-services/translator/custom-translator/how-to/test-your-model

    Azure's BLEU is valid only for relative comparisons between models evaluated under the same Azure evaluation version, not as an absolute measure. To evaluate Japanese MT quality, use SacreBLEU with Japanese tokenization (KyTea or ja-mecab):

    Use:

    sacrebleu reference.txt -i model_output.txt -l en-ja -tok ja-mecab
    

    OR with KyTea-tokenized references (note: `-notags` makes KyTea emit plain segmented text rather than word/POS pairs, and `-tok none` tells SacreBLEU not to re-tokenize the pre-segmented input):

    kytea -notags < ref.txt > ref.tok
    kytea -notags < hyp.txt > hyp.tok
    sacrebleu ref.tok -i hyp.tok -tok none
    

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.


    Please don't forget to close up the thread here by upvoting and accept it as an answer if it is helpful.

