How to use language detection

The Language Detection feature evaluates text and returns a language identifier that indicates the language a document was written in.

Language detection is useful for content stores that collect arbitrary text, where the language is unknown. You can parse the results of this analysis to determine which language is used in the input document. The response also returns a score between 0 and 1 that reflects the confidence of the model.

The Language Detection feature can detect a wide range of languages, variants, dialects, and some regional or cultural languages.

Development options

To use language detection, you submit raw unstructured text for analysis and handle the API output in your application. Analysis is performed as-is, with no additional customization to the model used on your data. There are three ways to use language detection:

| Development option | Description |
|---|---|
| Language Studio | Language Studio is a web-based platform that lets you try language detection with text examples without an Azure account, and with your own data when you sign up. For more information, see the Language Studio website or Language Studio quickstart. |
| REST API or client library (Azure SDK) | Integrate language detection into your applications by using the REST API, or the client library available in a variety of languages. For more information, see the language detection quickstart. |
| Docker container | Use the available Docker container to deploy this feature on-premises. These Docker containers enable you to bring the service closer to your data for compliance, security, or other operational reasons. |

Determine how to process the data (optional)

Specify the language detection model

By default, language detection will use the latest available AI model on your text. You can also configure your API requests to use a specific model version.
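
As a rough sketch of pinning a model version, the snippet below builds a request URL with a version selector. The route `/text/analytics/v3.1/languages`, the `model-version` query parameter, and the example endpoint host are assumptions based on the v3.x REST API, which matches the flat request and response shapes used in this article. Newer API versions may carry the model version in the request body instead, so check the API reference for the version you call. `build_languages_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def build_languages_url(endpoint: str, model_version: str = "latest") -> str:
    """Build a language detection URL that pins a model version.

    Assumption: v3.1-style route and "model-version" query parameter.
    """
    query = urlencode({"model-version": model_version})
    return f"{endpoint.rstrip('/')}/text/analytics/v3.1/languages?{query}"


# Hypothetical resource endpoint, used only for illustration.
url = build_languages_url(
    "https://example.cognitiveservices.azure.com/", "2022-10-01"
)
print(url)
```

Omitting the second argument falls back to `latest`, which mirrors the default behavior described above.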

Input languages

When you submit documents to be evaluated, language detection attempts to determine whether the text was written in any of the supported languages.

If you have content expressed in a less frequently used language, you can try the Language Detection feature to see if it returns a code. For languages that can't be detected, the response is (Unknown).

Submitting data

Tip

You can use a Docker container for language detection, so you can use the API on-premises.

Analysis is performed upon receipt of the request. Using the language detection feature synchronously is stateless. No data is stored in your account, and results are returned immediately in the response.

When you use this feature asynchronously, the API results are available for 24 hours from the time the request was ingested, as indicated in the response. After this time period, the results are purged and are no longer available for retrieval.
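
The 24-hour retention window can be checked on the client side before attempting retrieval. This is a minimal sketch: `results_available` is a hypothetical helper, and it assumes you recorded the ingestion time yourself. When the response reports its own expiry time, prefer that value over a locally computed one.

```python
from datetime import datetime, timedelta, timezone

# Retention window for asynchronous results, as stated above.
RETENTION = timedelta(hours=24)


def results_available(ingested_at: datetime, now: datetime) -> bool:
    """Return True while asynchronous results can still be retrieved."""
    return now < ingested_at + RETENTION


ingested = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(results_available(ingested, ingested + timedelta(hours=23)))  # True
print(results_available(ingested, ingested + timedelta(hours=25)))  # False
```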

Getting language detection results

When you get results from language detection, you can stream the results to an application or save the output to a file on the local system.

Language detection returns one predominant language for each document you submit, along with its ISO 639-1 code, a human-readable name, and a confidence score. A score of 1 indicates the highest possible confidence level of the analysis.
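
One way to save the output is to flatten each document's result into a CSV row. The sketch below assumes the flat response shape used in this article's examples (`documents`, `detectedLanguage`, `iso6391Name`, `name`, `confidenceScore`). `write_results_csv` is a hypothetical helper, and the sample response is hard-coded rather than fetched from the service.

```python
import csv
import io


def write_results_csv(response: dict, out) -> None:
    """Write one CSV row per analyzed document to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["id", "iso6391Name", "name", "confidenceScore"])
    for doc in response.get("documents", []):
        lang = doc["detectedLanguage"]
        writer.writerow(
            [doc["id"], lang["iso6391Name"], lang["name"], lang["confidenceScore"]]
        )


# Hard-coded sample in the response shape used by this article.
sample = {
    "documents": [
        {
            "id": "1",
            "detectedLanguage": {
                "name": "English",
                "iso6391Name": "en",
                "confidenceScore": 0.62,
            },
            "warnings": [],
        }
    ],
    "errors": [],
}

buffer = io.StringIO()
write_results_csv(sample, buffer)
print(buffer.getvalue())
```

To save to the local system instead of an in-memory buffer, pass a file opened with `open("results.csv", "w", newline="")`.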

Ambiguous content

In some cases, it may be hard to disambiguate languages based on the input. You can use the countryHint parameter to specify an ISO 3166-1 alpha-2 country/region code. By default, the API uses "US" as the country hint. To remove this behavior, reset the parameter by setting it to an empty string: countryHint = "".

For example, "communication" is common to both English and French. If it's submitted with limited context, the response is based on the "US" country/region hint. If the text is known to originate from France, you can provide that as a hint.

Input

{
    "documents": [
        {
            "id": "1",
            "text": "communication"
        },
        {
            "id": "2",
            "text": "communication",
            "countryHint": "fr"
        }
    ]
}

The language detection model now has additional context to make a better judgment:

Output

{
    "documents": [
        {
            "detectedLanguage": {
                "confidenceScore": 0.62,
                "iso6391Name": "en",
                "name": "English"
            },
            "id": "1",
            "warnings": []
        },
        {
            "detectedLanguage": {
                "confidenceScore": 1.0,
                "iso6391Name": "fr",
                "name": "French"
            },
            "id": "2",
            "warnings": []
        }
    ],
    "errors": [],
    "modelVersion": "2022-10-01"
}
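
The request in the example above can also be assembled in code. The sketch below is illustrative only: `make_document` is a hypothetical helper, and the third document is an addition that shows how an empty string clears the default hint. The field names follow the request shape used in this article.

```python
import json
from typing import Optional


def make_document(doc_id: str, text: str, country_hint: Optional[str] = None) -> dict:
    """Build one input document, adding countryHint only when it's given."""
    doc = {"id": doc_id, "text": text}
    if country_hint is not None:
        # An empty string is kept on purpose: it removes the default hint.
        doc["countryHint"] = country_hint
    return doc


payload = {
    "documents": [
        make_document("1", "communication"),        # default hint applies
        make_document("2", "communication", "fr"),  # hint: France
        make_document("3", "communication", ""),    # no hint at all
    ]
}
print(json.dumps(payload, indent=4))
```

Leaving `country_hint` as `None` omits the field entirely, so the service default applies, while passing `""` sends the field explicitly empty.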

If the analyzer can't parse the input, it returns (Unknown). For example, this happens when you submit a text string that consists solely of numbers.

{
    "documents": [
        {
            "id": "1",
            "detectedLanguage": {
                "name": "(Unknown)",
                "iso6391Name": "(Unknown)",
                "confidenceScore": 0.0
            },
            "warnings": []
        }
    ],
    "errors": [],
    "modelVersion": "2021-01-05"
}
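
An application usually needs to separate undetected documents from the rest. The following sketch checks for the `(Unknown)` marker shown in the response above. `undetected_ids` is a hypothetical helper, and the sample response is hard-coded for illustration.

```python
# Marker value the service returns when no language can be determined.
UNKNOWN = "(Unknown)"


def undetected_ids(response: dict) -> list:
    """Return the ids of documents whose language could not be detected."""
    return [
        doc["id"]
        for doc in response.get("documents", [])
        if doc["detectedLanguage"]["iso6391Name"] == UNKNOWN
    ]


# Hard-coded sample that mirrors the response above.
sample_unknown = {
    "documents": [
        {
            "id": "1",
            "detectedLanguage": {
                "name": "(Unknown)",
                "iso6391Name": "(Unknown)",
                "confidenceScore": 0.0,
            },
            "warnings": [],
        }
    ],
    "errors": [],
}

print(undetected_ids(sample_unknown))  # ['1']
```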

Mixed-language content

Mixed-language content within the same document returns the language with the largest representation in the content, but with a lower confidence score that reflects the marginal strength of the assessment. In the following example, the input is a blend of English, Spanish, and French. The analyzer counts characters in each segment to determine the predominant language.

Input

{
    "documents": [
        {
            "id": "1",
            "text": "Hello, I would like to take a class at your University. ¿Se ofrecen clases en español? Es mi primera lengua y más fácil para escribir. Que diriez-vous des cours en français?"
        }
    ]
}

Output

The resulting output consists of the predominant language, with a score of less than 1.0, which indicates a weaker level of confidence.

{
    "documents": [
        {
            "id": "1",
            "detectedLanguage": {
                "name": "Spanish",
                "iso6391Name": "es",
                "confidenceScore": 0.88
            },
            "warnings": []
        }
    ],
    "errors": [],
    "modelVersion": "2021-01-05"
}
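
Because a lower score can signal mixed-language input, an application might route such documents to manual review. This is only a sketch: the 0.9 threshold and the `needs_review` helper are assumptions chosen for illustration, not values recommended by the service, and a low score can have causes other than mixed content.

```python
# Hypothetical cutoff; tune it against your own data.
REVIEW_THRESHOLD = 0.9


def needs_review(response: dict, threshold: float = REVIEW_THRESHOLD) -> list:
    """Return (id, language code, score) for results below the threshold."""
    flagged = []
    for doc in response.get("documents", []):
        lang = doc["detectedLanguage"]
        if lang["confidenceScore"] < threshold:
            flagged.append((doc["id"], lang["iso6391Name"], lang["confidenceScore"]))
    return flagged


# Hard-coded sample that mirrors the mixed-language response above.
sample_mixed = {
    "documents": [
        {
            "id": "1",
            "detectedLanguage": {
                "name": "Spanish",
                "iso6391Name": "es",
                "confidenceScore": 0.88,
            },
            "warnings": [],
        }
    ],
    "errors": [],
}

print(needs_review(sample_mixed))  # [('1', 'es', 0.88)]
```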

Service and data limits

For information on the size and number of requests you can send per minute and second, see the service limits article.

See also