Hello Albert Llorens,
Welcome to Microsoft Q&A.
Thank you for the detailed investigation and for sharing your evaluation results. Your analysis is solid, and the conclusions you are drawing are well-founded.
Yes, Microsoft Custom Translator currently reports BLEU scores using a language-agnostic (generic) tokenizer, including for Japanese. It does not apply a Japanese-specific tokenizer (such as KyTea or other morphological segmentation) when calculating the BLEU scores shown in the Custom Translator portal.
Tokenization and BLEU in Custom Translator
Custom Translator performs internal text preprocessing (such as character normalization and punctuation handling) as part of its pipeline. For BLEU score calculation, however, the evaluation now relies on a uniform, whitespace- and punctuation-based tokenization approach across all languages, including languages such as Japanese that do not delimit words with spaces.
Because BLEU is highly sensitive to tokenization, this has a noticeable impact on reported scores for Japanese.
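To make the sensitivity concrete, here is a minimal, illustrative sketch (pure Python, not Microsoft's actual pipeline): when a Japanese sentence contains no spaces, whitespace tokenization collapses it into a single token, so n-gram overlap between hypothesis and reference becomes all-or-nothing, while character-level tokenization still surfaces the partial match.

```python
from collections import Counter

def ngrams(tokens, n):
    """Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

hyp = "私は猫が好きです"        # hypothesis: written without spaces, as Japanese is
ref = "私は猫がとても好きです"  # reference: differs only by an inserted adverb

# Generic whitespace tokenization: each sentence becomes ONE token,
# so the unigram overlap is all-or-nothing.
ws_overlap = sum((ngrams(hyp.split(), 1) & ngrams(ref.split(), 1)).values())
print(ws_overlap)  # 0 -- the two single "tokens" differ, so nothing matches

# Character-level tokenization: the partial overlap is visible.
ch_overlap = sum((ngrams(list(hyp), 1) & ngrams(list(ref), 1)).values())
print(ch_overlap)  # 8 -- every character of the hypothesis appears in the reference
```

The same mechanism applies at higher n-gram orders, which is why a language-agnostic tokenizer systematically deflates BLEU for unsegmented Japanese text.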
What Changed (and What Didn’t)
There has been no indication of a model quality regression. Instead, there has been an evolution in how BLEU is standardized and reported:
- Earlier evaluations (circa 2023 to mid-2024) used a more locale-aware internal evaluation pipeline, which for Japanese produced BLEU scores closer to those obtained with morphological tokenizers.
- Current evaluations (late 2024 onward) use a consistent, language-agnostic tokenizer, chosen for:
  - Cross-language comparability
  - Reproducibility at scale
  - Automated benchmarking across many language pairs
This aligns precisely with what you observed:
- Your BLEU script with a generic tokenizer closely matches the Custom Translator BLEU reported for the Jan 2026 model.
- Your BLEU script with KyTea (Japanese tokenizer) yields significantly higher BLEU scores.
- The May 2024 model’s portal BLEU appears closer to a tokenizer-aware evaluation, reflecting the earlier evaluation approach.
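The gap you measured between the two scripts can be reproduced with a small self-contained sketch. The BLEU implementation below is a simplified, smoothed sentence-level version (not sacrebleu, and not Microsoft's evaluator), and the morphological segmentation is hand-written here purely as a stand-in for a tokenizer such as KyTea:

```python
import math
from collections import Counter

def bleu(hyp_tokens, ref_tokens, max_n=4):
    """Sentence-level BLEU with add-one smoothing on n-gram precisions
    and the standard brevity penalty (illustrative only)."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ng = Counter(tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1))
        ref_ng = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
        matches = sum((hyp_ng & ref_ng).values())
        total = max(sum(hyp_ng.values()), 1)
        precisions.append((matches + 1) / (total + 1))  # add-one smoothing
    bp = min(1.0, math.exp(1 - len(ref_tokens) / max(len(hyp_tokens), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# Hypothesis and reference differ only in one particle (は vs. が).
hyp = "私は昨日映画を見ました"
ref = "私が昨日映画を見ました"

# Generic tokenization: no spaces, so each sentence is a single token.
generic = bleu(hyp.split(), ref.split())

# Morphologically segmented input (hand-segmented here as an illustrative
# stand-in for a tokenizer such as KyTea).
segmented = bleu("私 は 昨日 映画 を 見 まし た".split(),
                 "私 が 昨日 映画 を 見 まし た".split())

print(f"generic: {generic:.3f}, segmented: {segmented:.3f}")
```

The segmented score comes out substantially higher for the same sentence pair, mirroring the direction of the difference you saw between the portal BLEU and your KyTea-based evaluation.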
Why Microsoft Uses a Generic Tokenizer
Using a single tokenization strategy allows Microsoft to:
- Maintain consistent evaluation across all languages
- Avoid language-specific evaluation bias
- Ensure stable relative comparisons between model versions
As a result, the BLEU score in the portal is best interpreted as a relative metric (e.g., comparing two models trained and evaluated under the same methodology), rather than an absolute measure of Japanese linguistic quality.
Clarification
This change affects only the BLEU calculation methodology. It does not affect:
- Model training
- Inference or decoding
- Actual translation quality
Your KyTea-based BLEU evaluation remains more linguistically meaningful for Japanese and is entirely valid for:
- Internal quality tracking
- Regression analysis
- Complementing human evaluation
Best Practice
For Japanese:
- Use Custom Translator BLEU for relative comparisons between models evaluated under the same regime.
- Use tokenizer-aware BLEU or human evaluation to assess real translation quality.
- Avoid directly comparing BLEU scores across time periods where the evaluation methodology may differ.
Summary
- Custom Translator does not currently use a Japanese-specific tokenizer for BLEU.
- Your findings are accurate and expected.
- The observed BLEU difference is methodological, not a quality regression.
- Tokenizer-aware BLEU (e.g., KyTea) remains the better indicator of Japanese translation quality.
Please refer to these resources:
- Custom Translator Overview
- Understanding Translation Accuracy in Azure AI Translator
- Custom Translator for Beginners
- Data Filtering
I hope this helps. Do let me know if you have any further queries.
Thank you!