Detect Languages
Important
Support for Machine Learning Studio (classic) will end on 31 August 2024. We recommend you transition to Azure Machine Learning by that date.
Beginning 1 December 2021, you will not be able to create new Machine Learning Studio (classic) resources. Through 31 August 2024, you can continue to use the existing Machine Learning Studio (classic) resources.
- See information on moving machine learning projects from ML Studio (classic) to Azure Machine Learning.
- Learn more about Azure Machine Learning.
ML Studio (classic) documentation is being retired and may not be updated in the future.
Detects the language of each line in the input file.
Category: Text Analytics
Note
Applies to: Machine Learning Studio (classic) only
Similar drag-and-drop modules are available in Azure Machine Learning designer.
Module Overview
This article describes how to use the Detect Languages module in Machine Learning Studio (classic) to analyze text input and identify the language associated with each record in the input.
The language detection algorithm can identify many different languages. Specify the string column to analyze and the maximum number of languages to detect. The algorithm analyzes each row of text and assigns a probability score for each language. The language in the first result column is the one that received the highest score.
How to configure Detect Languages
Add the dataset containing the text you want to analyze to an experiment in Machine Learning Studio (classic). The column with the text to analyze must be the string data type.
The dataset need not contain a label column; the language detection algorithm works purely on linguistic features of the supported languages.
If you are importing new data, make sure that your data is saved in the UTF-8 format. Other Unicode formats are not supported.
Add the Detect Languages module to your experiment, and connect the dataset with the text for language detection.
For Text column, choose the column you want to analyze.
For Upper bound on number of languages to detect, indicate the maximum number of languages to detect.
Setting an upper bound on the number of languages can improve performance.
Run the experiment.
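The UTF-8 requirement in the steps above can be checked before you upload the data. The following Python snippet is a minimal sketch that runs outside Studio (classic); the function names `is_valid_utf8` and `reencode_to_utf8` are hypothetical, and the default source encoding of UTF-16 is an assumption about how the file was originally saved.

```python
from pathlib import Path


def is_valid_utf8(path: str) -> bool:
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def reencode_to_utf8(src: str, dst: str, source_encoding: str = "utf-16") -> None:
    """Read a text file in a known encoding and rewrite it as UTF-8.

    source_encoding is an assumption: pass the encoding the file
    was actually saved in.
    """
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
```

For example, if `is_valid_utf8("reviews.txt")` returns `False` and the file is known to be UTF-16, calling `reencode_to_utf8("reviews.txt", "reviews_utf8.txt")` produces a copy that can be uploaded as a dataset.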
Results
The Detect Languages module outputs a language identifier and score for each row.
For example, the following table contains a sample analysis on test data.
The first two columns col1 and language label are columns passed through from the input dataset. In this example, because the input dataset was designed for testing the module, the expected language was already known, and is provided in the label column.
The remaining columns are generated by the Detect Languages module. If several languages are equally probable matches, the module might list more than one language, each with its own score. In this example, the module predicts just one language for each row, together with the probability score for that language.
If the module fails to detect any language with a sufficiently high score, it outputs a result of (Unknown) with a score of 0. Note that the set of languages supported by the module can change over time as the API is updated.
Col1 | Language label | Col1 Language | Col1 Iso6391 Language | Col1 Iso6391 Language Score |
---|---|---|---|---|
It was a wonderful hotel with a friendly staff and good service | English | English | en | 100 |
Es war ein wunderbares Hotel mit freundlichem Personal und guter service | German | German | de | 100 |
C’est un magnifique hôtel avec un personnel sympathique et un service de qualité | French | French | fr | 100 |
Det var et dejligt hotel med et venligt personale og god service | Danish | Danish | da | 100 |
Va ser un magnífic hotel amb un personal amable i bon servei | Catalan | Catalan | ca | 92.30769348 |
とても素敵なホテルで、スタッフは親切で、サービスもよかった | Japanese | (Unknown) | | 0 |
qu mebpa'mey naQ friendly QaQ chavmoH je | Klingon | French | fr | 77.5 |
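When a test dataset carries a known label, as in the table above, the detected language can be compared against it. The following Python sketch does this for the seven rows shown; the tuples are transcribed from the table, and the helper name `mismatches` is hypothetical rather than part of the module.

```python
# (expected label, detected language, score) transcribed from the sample table.
rows = [
    ("English", "English", 100.0),
    ("German", "German", 100.0),
    ("French", "French", 100.0),
    ("Danish", "Danish", 100.0),
    ("Catalan", "Catalan", 92.30769348),
    ("Japanese", "(Unknown)", 0.0),
    ("Klingon", "French", 77.5),
]


def mismatches(rows):
    """Return the rows whose detected language differs from the expected label."""
    return [
        (expected, detected, score)
        for expected, detected, score in rows
        if expected != detected
    ]


for expected, detected, score in mismatches(rows):
    print(f"expected {expected}, detected {detected} (score {score})")
```

The two rows this reports illustrate different failure modes: the Japanese row is returned as (Unknown) with a score of 0, while the Klingon row is assigned to French with a fairly high score, so a high score alone does not guarantee a correct match.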
Examples
For examples of how the Detect Languages module is used in an experiment, see the Azure AI Gallery:
- Filter Movie Titles by Language: Detects the language used in movie names, and then uses the language identifier to split the dataset into English vs non-English movies.
Technical notes
For a general idea of the languages that can potentially be detected, refer to Bing Translator.
Many more languages can be detected than Machine Learning currently supports for advanced text analytics. We recommend that you use the results of Detect Languages to filter the results that you send to other modules that require language-specific processing.
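The filtering recommended above can be sketched in Python on an exported copy of the results. This is an illustration under assumptions, not the module's own behavior: the column names follow the sample output table (`Col1 Iso6391 Language`, `Col1 Iso6391 Language Score`), and the function name `split_by_language` and the `min_score` threshold of 90 are hypothetical choices.

```python
def split_by_language(
    records,
    iso_code="en",
    min_score=90.0,  # assumed threshold; tune for your data
    code_col="Col1 Iso6391 Language",
    score_col="Col1 Iso6391 Language Score",
):
    """Split result rows into (matching, other) by ISO 639-1 code and score.

    Each record is a dict keyed by column name. Rows with a missing or
    empty score are treated as score 0 and land in the 'other' group.
    """
    matching, other = [], []
    for rec in records:
        score = float(rec.get(score_col) or 0)
        if rec.get(code_col) == iso_code and score >= min_score:
            matching.append(rec)
        else:
            other.append(rec)
    return matching, other
```

Only the `matching` rows would then be passed to a downstream step that requires language-specific processing; the `other` group can be routed elsewhere or reviewed manually.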
The underlying linguistic services are also used by the Text Analytics service in Azure Cognitive Services.
Expected inputs
Name | Type | Description |
---|---|---|
Dataset | Data Table | The input |
Module parameters
Name | Type | Range | Optional | Default | Description |
---|---|---|---|---|---|
Upper bound on number of languages to detect | Integer | [1;184] | Required | 1 | Upper bound on number of languages to detect. |
Text column | ColumnSelection | | Required | | Name or one-based index of text column. |
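The constraints in the parameter table can be expressed as simple checks. The Python sketch below is only an illustration of the documented range [1;184] and of one-based column indexing; the function names are hypothetical and the code does not reproduce how Studio (classic) validates parameters internally.

```python
def check_upper_bound(n: int, low: int = 1, high: int = 184) -> int:
    """Validate the upper bound on the number of languages to detect."""
    if not isinstance(n, int) or not low <= n <= high:
        raise ValueError(f"upper bound must be an integer in [{low};{high}], got {n!r}")
    return n


def resolve_text_column(columns: list, selector) -> str:
    """Resolve a column name or a one-based index to a column name."""
    if isinstance(selector, int):
        if not 1 <= selector <= len(columns):
            raise IndexError(f"column index {selector} is out of range")
        return columns[selector - 1]  # convert one-based to zero-based
    if selector not in columns:
        raise KeyError(f"no column named {selector!r}")
    return selector
```

With columns `["Col1", "Language label"]`, both `resolve_text_column(columns, 1)` and `resolve_text_column(columns, "Col1")` select the same column.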
Outputs
Name | Type | Description |
---|---|---|
Results dataset | Data Table | The result |
Exceptions
Exception | Description |
---|---|
Error 0003 | Exception occurs if one or more inputs are null or empty. |
Error 0010 | Exception occurs if input datasets have column names that should match but do not. |
Error 0016 | Exception occurs if input datasets passed to the module should have compatible column types but do not. |
Error 0008 | Exception occurs if parameter is not in range. |
For a list of errors specific to Studio (classic) modules, see Machine Learning Error codes.
For a list of API exceptions, see Machine Learning REST API Error Codes.