What is a BLEU score?
BLEU (Bilingual Evaluation Understudy) is a measurement of the difference between an automatic translation and human-created reference translations of the same source sentence.
The BLEU algorithm compares consecutive phrases (n-grams) of the automatic translation with the consecutive phrases it finds in the reference translation, and counts the number of matches in a weighted fashion. These matches are position independent. A higher degree of matching indicates greater similarity to the reference translation, and yields a higher score. Intelligibility and grammatical correctness aren't taken into account.
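The phrase matching described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and a single reference translation, and the helper names (`ngrams`, `clipped_matches`) are hypothetical, not part of any particular BLEU tool. Each phrase in the automatic translation is credited at most as many times as it occurs in the reference, regardless of where it occurs.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_matches(candidate, reference, n):
    """Count candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the
    reference ("clipping"). Position in the sentence is ignored.
    Returns (matched, total_candidate_ngrams).
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched, sum(cand_counts.values())


# Illustrative sentences (assumed, whitespace-tokenized).
candidate = "the cat sat on the mat".split()
reference = "on the mat the cat sat".split()

print(clipped_matches(candidate, reference, 1))  # (6, 6): every word matches
print(clipped_matches(candidate, reference, 2))  # (4, 5): "sat on" is not in the reference
```

The two sentences share most of their phrases even though the word order differs, which shows why the matches are position independent and why a high match count alone says nothing about grammatical correctness.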
How does BLEU work?
The BLEU score's strength is that it correlates well with human judgment. BLEU averages out individual sentence judgment errors over a test corpus, rather than attempting to devise the exact human judgment for every sentence.
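As a rough illustration of scoring over a whole test corpus rather than sentence by sentence, the sketch below pools phrase matches across all sentences before computing a single score. It follows the commonly published BLEU formulation (phrases of 1 to 4 words, equal weights, a brevity penalty for short output); whether a given tool uses exactly these settings is an assumption. The function names, the whitespace tokenization, and the single reference per sentence are likewise assumptions made for brevity.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU sketch with one reference per sentence.

    Matches and totals are summed over the whole corpus first, so a
    poor match on one sentence is averaged out by the others.
    """
    matched = [0] * max_n
    total = [0] * max_n
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        cand_len += len(cand)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            cand_counts = ngram_counts(cand, n)
            ref_counts = ngram_counts(ref, n)
            # Clipped matches: credit each phrase at most as often
            # as it appears in the reference.
            matched[n - 1] += sum(min(count, ref_counts[gram])
                                  for gram, count in cand_counts.items())
            total[n - 1] += sum(cand_counts.values())
    if cand_len == 0 or min(matched) == 0:
        return 0.0  # no output, or no matches at some phrase length
    # Equal-weight geometric mean of the per-length precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: penalize output shorter than the reference.
    penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return penalty * math.exp(log_precision)


# Illustrative two-sentence test corpus (assumed data).
hyps = ["the cat sat on the mat".split(),
        "there is a dog in the garden".split()]
refs = ["the cat sat on the mat".split(),
        "there is a dog in the yard".split()]

score = corpus_bleu(hyps, refs)
print(round(score, 3))
```

The result is a value between 0 and 1; many tools report it multiplied by 100.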
A more extensive discussion of BLEU scores is here.
BLEU results depend strongly on the breadth of your domain; consistency of test, training and tuning data; and how much data you have available for training. If your models have been trained on a narrow domain, and your training data is consistent with your test data, you can expect a high BLEU score.
Comparing BLEU scores is only justifiable when the results come from the same test set, the same language pair, and the same MT engine. A BLEU score from a different test set is bound to be different.