Calculate similarity with the `ai.similarity` function

2025-03-05

The ai.similarity function uses generative AI to compare two string expressions and calculate a semantic similarity score, all with a single line of code. You can compare text values from one column of a DataFrame with a single common text value or with pairwise text values in another column.

AI functions turbocharge data engineering by putting the power of Fabric's built-in large language models into your hands. To learn more, visit this overview article.

Important

This feature is in preview, for use in the Fabric 1.3 runtime and higher.

Review the prerequisites in this overview article, including the library installations that are temporarily required to use AI functions.
By default, AI functions are currently powered by the gpt-3.5-turbo (0125) model. To learn more about billing and consumption rates, visit this article.
Although the underlying model can handle several languages, most of the AI functions are optimized for use on English-language texts.
During the initial rollout of AI functions, users are temporarily limited to 1,000 requests per minute with Fabric's built-in AI endpoint.
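
As a rough illustration of what the 1,000-requests-per-minute limit means for larger jobs, the sketch below estimates a lower bound on processing time. It assumes that each input row costs exactly one request, which this article doesn't state. The `min_minutes_for_rows` helper is hypothetical and isn't part of the AI functions library.

```python
import math

# Limit quoted above for Fabric's built-in AI endpoint during the initial rollout.
REQUESTS_PER_MINUTE = 1_000


def min_minutes_for_rows(row_count: int, requests_per_row: int = 1) -> int:
    """Lower bound, in whole minutes, on the time needed under the rate limit.

    Assumption: every row costs `requests_per_row` requests (default 1).
    """
    if row_count < 0 or requests_per_row < 1:
        raise ValueError("row_count must be >= 0 and requests_per_row >= 1")
    total_requests = row_count * requests_per_row
    return math.ceil(total_requests / REQUESTS_PER_MINUTE)


print(min_minutes_for_rows(25_000))  # 25
```

Actual run time also depends on model latency and on how requests are batched, so treat the result as a floor rather than an estimate.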

Use `ai.similarity` with pandas

The ai.similarity function extends the pandas Series class. Call the function on a pandas DataFrame text column to calculate the semantic similarity of each input row with respect to a single common text value. Alternatively, the function can calculate the semantic similarity of each row with respect to corresponding pairwise values in another column that has the same dimensions as the input column.

The function returns a pandas Series containing similarity scores, which can be stored in a new DataFrame column.

Syntax

Comparing with a single value

df["similarity"] = df["col1"].ai.similarity("value")

Comparing with pairwise values

df["similarity"] = df["col1"].ai.similarity(df["col2"])

Parameters

Name	Description
`other` Required	Either a string that contains a single common text value, which is used to compute similarity scores for each input row, OR another pandas Series with the same dimensions as the input, which contains text values that are used to compute pairwise similarity scores for each input row.
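
To make the two accepted shapes of `other` concrete, here is a minimal pure-Python sketch of the dispatch: a string is compared with every row, while a sequence is compared row by row. The `toy_score` function (character overlap via `difflib`) and `similarity_sketch` are hypothetical stand-ins for illustration only. They don't call Fabric's model and don't produce semantic scores.

```python
from difflib import SequenceMatcher
from typing import Sequence, Union


def toy_score(a: str, b: str) -> float:
    """Stand-in scorer: character-level overlap in [0, 1], NOT semantic similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity_sketch(inputs: Sequence[str], other: Union[str, Sequence[str]]) -> list:
    """Mimic the two modes of `other`: one common value, or pairwise values."""
    if isinstance(other, str):
        # Single common value: compare it with every input row.
        return [toy_score(text, other) for text in inputs]
    if len(other) != len(inputs):
        # Pairwise mode needs one comparison value per input row.
        raise ValueError("`other` must have the same length as the inputs")
    # Pairwise values: compare row i of the inputs with row i of `other`.
    return [toy_score(text, o) for text, o in zip(inputs, other)]


names = ["Bill Gates", "Satya Nadella", "Joan of Arc"]
print(similarity_sketch(names, "Microsoft"))                      # one value for all rows
print(similarity_sketch(names, ["Microsoft", "Toyota", "Nike"]))  # row-by-row pairs
```

In both modes the result has one score per input row, which is why it can be assigned directly to a new DataFrame column.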

Returns

A pandas Series that contains similarity scores for each input text row. The output similarity scores are relative, and they're best used for ranking. Scores can range from -1 (opposites) to 1 (identical). A score of 0 indicates that the values are unrelated in meaning.
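
Because the scores are relative, a typical use is to sort rows by score rather than apply a fixed cutoff. The sketch below ranks candidates by a similarity value. The scores are invented placeholders for illustration, not real output from `ai.similarity`, and `rank_by_similarity` is a hypothetical helper.

```python
# Hypothetical (text, score) pairs; the numbers are invented placeholders,
# not output from the model.
scored = [
    ("Joan of Arc", 0.12),
    ("Bill Gates", 0.55),
    ("Satya Nadella", 0.48),
]


def rank_by_similarity(pairs, top_n=None):
    """Return pairs sorted from most to least similar, optionally truncated."""
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return ranked if top_n is None else ranked[:top_n]


for position, (text, score) in enumerate(rank_by_similarity(scored), start=1):
    print(position, text, score)
```

On a pandas DataFrame that already has a `similarity` column, the same ordering can be obtained with `df.sort_values("similarity", ascending=False)`.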

Example

Comparing with a single value

# This code uses AI. Always review output for mistakes.
# Read terms: https://azure.microsoft.com/support/legal/preview-supplemental-terms/

# Assumes the AI functions prerequisites from the overview article are already in place.
import pandas as pd

df = pd.DataFrame([
        "Bill Gates",
        "Satya Nadella",
        "Joan of Arc"
    ], columns=["name"])

# Score each name against one common value.
df["similarity"] = df["name"].ai.similarity("Microsoft")
display(df)

Comparing with pairwise values

# This code uses AI. Always review output for mistakes.
# Read terms: https://azure.microsoft.com/support/legal/preview-supplemental-terms/

# Assumes the AI functions prerequisites from the overview article are already in place.
import pandas as pd

df = pd.DataFrame([
        ("Bill Gates", "Microsoft"),
        ("Satya Nadella", "Toyota"),
        ("Joan of Arc", "Nike")
    ], columns=["names", "companies"])

# Score each name against the company in the same row.
df["similarity"] = df["names"].ai.similarity(df["companies"])
display(df)

Use `ai.similarity` with PySpark

The ai.similarity function is also available for Spark DataFrames. You must specify the name of an existing input column as a parameter. You must also specify a single common text value for comparisons, or the name of another column for pairwise comparisons.

The function returns a new DataFrame, with similarity scores for each row of input text stored in an output column.

Syntax

Comparing with a single value

df.ai.similarity(input_col="col1", other="value", output_col="similarity")

Comparing with pairwise values

df.ai.similarity(input_col="col1", other_col="col2", output_col="similarity")

Parameters

Name	Description
`input_col` Required	A string that contains the name of an existing column with input text values to be used for computing similarity scores.
`other` or `other_col` Required	Only one of these parameters is required. The `other` parameter is a string that contains a single common text value used to compute similarity scores with respect to each row of input. The `other_col` parameter is a string that designates the name of a second existing column, with text values used to compute pairwise similarity scores.
`output_col` Optional	A string that contains the name of a new column to store calculated similarity scores for each input text row. If this parameter isn't set, a default name is generated for the output column.
`error_col` Optional	A string that contains the name of a new column that stores any OpenAI errors that result from processing each input text row. If this parameter isn't set, a default name is generated for the error column. If an input row has no errors, this column has a `null` value.
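
The `error_col` behavior described above suggests a simple post-processing pattern: keep the rows whose error value is null and set the rest aside for review. The sketch below shows that split on plain Python dictionaries standing in for collected rows. The column name `similarity_error`, the sample scores and error text, and the `split_by_error` helper are all hypothetical.

```python
def split_by_error(rows, error_col):
    """Separate rows into (succeeded, failed) based on a null error column."""
    succeeded = [row for row in rows if row.get(error_col) is None]
    failed = [row for row in rows if row.get(error_col) is not None]
    return succeeded, failed


# Placeholder rows; scores and the error text are invented for illustration.
rows = [
    {"names": "Bill Gates", "similarity": 0.55, "similarity_error": None},
    {"names": "Satya Nadella", "similarity": None, "similarity_error": "placeholder error"},
    {"names": "Joan of Arc", "similarity": 0.12, "similarity_error": None},
]

ok, failed = split_by_error(rows, error_col="similarity_error")
print(len(ok), "rows succeeded;", len(failed), "row(s) need review")
```

Setting `error_col` explicitly, rather than relying on the generated default name, makes this kind of filtering easier because the column name is known in advance.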

Returns

A Spark DataFrame with a new column that contains generated similarity scores for each input text row. The output similarity scores are relative, and they're best used for ranking. Scores can range from -1 (opposites) to 1 (identical). A score of 0 indicates that the values are unrelated in meaning.

Example

Comparing with a single value

# This code uses AI. Always review output for mistakes.
# Read terms: https://azure.microsoft.com/support/legal/preview-supplemental-terms/

df = spark.createDataFrame([
        ("Bill Gates",),
        ("Satya Nadella",),
        ("Joan of Arc",)
    ], ["names"])

# Score each name against one common value.
similarity = df.ai.similarity(input_col="names", other="Microsoft", output_col="similarity")
display(similarity)

Comparing with pairwise values

# This code uses AI. Always review output for mistakes.
# Read terms: https://azure.microsoft.com/support/legal/preview-supplemental-terms/

df = spark.createDataFrame([
        ("Bill Gates", "Microsoft"),
        ("Satya Nadella", "Toyota"),
        ("Joan of Arc", "Nike")
    ], ["names", "companies"])

# Score each name against the company in the same row.
similarity = df.ai.similarity(input_col="names", other_col="companies", output_col="similarity")
display(similarity)

Related content

Categorize text with ai.classify.
Detect sentiment with ai.analyze_sentiment.
Extract entities with ai.extract.
Fix grammar with ai.fix_grammar.
Summarize text with ai.summarize.
Translate text with ai.translate.
Answer custom user prompts with ai.generate_response.
Learn more about the full set of AI functions here.
Learn how to customize the configuration of AI functions here.
Did we miss a feature you need? Suggest it on the Fabric Ideas forum.
