Collation/encoding error when reading from a Dedicated SQL pool view into a dataframe in a Synapse (Spark pool) notebook

Aldrin Malin (YIASC) ext 40 Reputation points
2024-03-26T14:17:03.9733333+00:00

Hi! I'm running a Synapse notebook connected to an Apache Spark pool and use this code to read from a SQL view in a Dedicated SQL pool:

# Add the imports required by the Dedicated SQL pool connector
import com.microsoft.spark.sqlanalytics
from com.microsoft.spark.sqlanalytics.Constants import Constants

# Read from existing internal table
dfToReadFromTable2 = (spark.read
                     # If `Constants.SERVER` is not provided, the `<database_name>` from the three-part table name
                     # argument to the `synapsesql` method is used to infer the Synapse Dedicated SQL End Point.
                     .option(Constants.SERVER, "xxx.sql.azuresynapse.net")
                     # Defaults to storage path defined in the runtime configurations
                     .option(Constants.TEMP_FOLDER, "abfss://******@xxx.dfs.core.windows.net/xxx")
                     # Three-part table name from where data will be read.
                     .synapsesql("database.schema.view")
                     # Fetch a sample of 10 records
                     .limit(10))

# Show contents of the dataframe
dfToReadFromTable2.show()

This works for other tables/views in the database, but for one specific view I receive an error message that I don't understand. It appears when I try to show the dataframe.

Py4JJavaError: An error occurred while calling o3667.showString. : com.microsoft.spark.sqlanalytics.SQLAnalyticsConnectorException: com.microsoft.sqlserver.jdbc.SQLServerException: Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed. Column ordinal: 27, Expected data type: VARCHAR(7) collate SQL_Latin1_General_CP1_CI_AS NOT NULL.

Is it a collation error between the SQL pool and the dataframe in the Spark pool, or something else? I have checked the column the error refers to, to make sure it does not contain null values, values longer than 7 characters, and so on. The column does contain values that include a space character, if that can affect the collation.

Accepted answer
  Smaran Thoomu 24,750 Reputation points · Microsoft External Staff Moderator
    2024-03-27T08:47:30.2733333+00:00

    Hi @Aldrin Malin (YIASC) ext

    Thank you for reaching out to the community forum with your query.

    Based on the error message you received, this looks like a collation/encoding mismatch between the dedicated SQL pool and the Spark pool. The connector transfers data through the staging folder and reads it back as an external source; the message says that the column at ordinal 27 is expected to be VARCHAR(7) with collation SQL_Latin1_General_CP1_CI_AS and NOT NULL, and that the single row processed was rejected because its value could not be converted to that type.
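
    One possible reason a value fails that conversion even though it looks short enough: Spark stages data as UTF-8, while a VARCHAR(7) column under a single-byte collation such as SQL_Latin1_General_CP1_CI_AS holds at most 7 bytes. Here is a small illustration (my own sketch, not connector code) of how a 7-character value can exceed 7 bytes once encoded:

    # A string that is 7 characters long but 8 bytes in UTF-8.
    # Non-ASCII characters (here "ö", or e.g. a non-breaking space) take
    # 2 bytes in UTF-8, so the value can overflow a VARCHAR(7) that counts bytes.
    s = "Malmö12"
    print(len(s))                   # 7 characters
    print(len(s.encode("utf-8")))   # 8 bytes -- no longer fits in VARCHAR(7)

    Since you mention the column contains a space character, it may be worth checking whether it is actually a non-breaking space or another non-ASCII character rather than a plain space.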

    To resolve this, you can try changing the collation or widening the type of the column in the SQL view so that the values convert cleanly, or you can try casting the column to a suitable data type in the Spark code.

    Here's an example of how you can cast the column to the correct data type in the Spark code:

    from pyspark.sql.functions import col
    # Imports required by the Dedicated SQL pool connector
    import com.microsoft.spark.sqlanalytics
    from com.microsoft.spark.sqlanalytics.Constants import Constants

    # Read from existing internal table
    dfToReadFromTable2 = (spark.read
                         # If `Constants.SERVER` is not provided, the `<database_name>` from the three-part table name
                         # argument to the `synapsesql` method is used to infer the Synapse Dedicated SQL End Point.
                         .option(Constants.SERVER, "xxx.sql.azuresynapse.net")
                         # Defaults to storage path defined in the runtime configurations
                         .option(Constants.TEMP_FOLDER, "abfss://******@xxx.dfs.core.windows.net/xxx")
                         # Three-part table name from where data will be read.
                         .synapsesql("database.schema.view")
                         # Cast the column at ordinal 27 to string; replace `column_name`
                         # with the actual name of that column in your view.
                         .withColumn("column_name", col("column_name").cast("string"))
                         # Fetch a sample of 10 records
                         .limit(10))

    # Show contents of the dataframe
    dfToReadFromTable2.show()
    

    If the issue persists, you can also check the encoding of the data in the column and ensure that it matches what the Spark pool expects.
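
    Another option, if your Spark pool's connector version supports query-based reads (the Constants.QUERY option), is to push a cast into the dedicated SQL pool so the data is already in a compatible shape when it is staged. This is a sketch under that assumption; problem_column is a placeholder for the name of the column at ordinal 27:

    import com.microsoft.spark.sqlanalytics
    from com.microsoft.spark.sqlanalytics.Constants import Constants

    dfDiagnostic = (spark.read
                    .option(Constants.SERVER, "xxx.sql.azuresynapse.net")
                    # Query-based reads take the database name as an option
                    # instead of a three-part table name
                    .option(Constants.DATABASE, "database")
                    .option(Constants.TEMP_FOLDER, "abfss://******@xxx.dfs.core.windows.net/xxx")
                    # Widen the suspect column server-side and return its length in
                    # characters and bytes, to see which values overflow VARCHAR(7)
                    .option(Constants.QUERY,
                            "SELECT CAST(problem_column AS VARCHAR(100)) AS problem_column, "
                            "LEN(problem_column) AS char_len, "
                            "DATALENGTH(problem_column) AS byte_len "
                            "FROM schema.view")
                    .synapsesql())

    dfDiagnostic.show()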

    I hope this helps! Let me know if you have any further questions.

