Collation/encoding error when reading from a Dedicated SQL pool view into a dataframe in a Synapse (Spark pool) notebook

Aldrin Malin (YIASC) ext 40 Reputation points
2024-03-26T14:17:03.9733333+00:00

Hi! I'm running a Synapse notebook connected to an Apache Spark pool and use this code to read from a SQL view in a Dedicated SQL pool:

# Add the imports required by the Dedicated SQL pool connector
import com.microsoft.spark.sqlanalytics
from com.microsoft.spark.sqlanalytics.Constants import Constants

# Read from existing internal table
dfToReadFromTable2 = (spark.read
                     # If `Constants.SERVER` is not provided, the `<database_name>` from the three-part table name
                     # argument to the `synapsesql` method is used to infer the Synapse Dedicated SQL End Point.
                     .option(Constants.SERVER, "xxx.sql.azuresynapse.net")
                     # Defaults to storage path defined in the runtime configurations
                     .option(Constants.TEMP_FOLDER, "abfss://******@xxx.dfs.core.windows.net/xxx")
                     # Three-part table name from where data will be read.
                     .synapsesql("database.schema.view")
                     # Fetch a sample of 10 records
                     .limit(10))

# Show contents of the dataframe
dfToReadFromTable2.show()

This works for other tables/views in the database, but for one specific view I receive an error message that I don't understand. It appears when I try to show the dataframe.

Py4JJavaError: An error occurred while calling o3667.showString. : com.microsoft.spark.sqlanalytics.SQLAnalyticsConnectorException: com.microsoft.sqlserver.jdbc.SQLServerException: Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed. Column ordinal: 27, Expected data type: VARCHAR(7) collate SQL_Latin1_General_CP1_CI_AS NOT NULL.

Is it a collation error between the SQL pool and the dataframe in the Spark pool, or something else? I have checked the column the error refers to, to make sure it does not contain null values, values longer than 7 characters, and so on. The column does contain values that include a space character, if that can affect the collation.

Accepted answer
  Smaran Thoomu 24,750 Reputation points · Microsoft External Staff Moderator
    2024-03-27T08:47:30.2733333+00:00

    Hi @Aldrin Malin (YIASC) ext

    Thank you for reaching out to the community forum with your query.

    Based on the error message you received, this looks like a collation/encoding mismatch between the dedicated SQL pool and the Spark pool. The connector transfers data through the staging folder and reads it back as an external source; the message says that the column at ordinal 27 is expected to be VARCHAR(7) with collation SQL_Latin1_General_CP1_CI_AS and NOT NULL, and that the single row processed was rejected because its value could not be converted to that type.
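
    One possible reason a value fails that conversion even though it looks short enough: Spark stages data as UTF-8, while a VARCHAR(7) column under a single-byte collation such as SQL_Latin1_General_CP1_CI_AS holds at most 7 bytes. Here is a small illustration (my own sketch, not connector code) of how a 7-character value can exceed 7 bytes once encoded:

    # A string that is 7 characters long but 8 bytes in UTF-8.
    # Non-ASCII characters (here "ö", or e.g. a non-breaking space) take
    # 2 bytes in UTF-8, so the value can overflow a VARCHAR(7) that counts bytes.
    s = "Malmö12"
    print(len(s))                   # 7 characters
    print(len(s.encode("utf-8")))   # 8 bytes -- no longer fits in VARCHAR(7)

    Since you mention the column contains a space character, it may be worth checking whether it is actually a non-breaking space or another non-ASCII character rather than a plain space.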

    To resolve this, you can try changing the collation or widening the type of the column in the SQL view so that the values convert cleanly, or you can try casting the column to a suitable data type in the Spark code.

    Here's an example of how you can cast the column to the correct data type in the Spark code:

    from pyspark.sql.functions import col
    # Imports required by the Dedicated SQL pool connector
    import com.microsoft.spark.sqlanalytics
    from com.microsoft.spark.sqlanalytics.Constants import Constants

    # Read from existing internal table
    dfToReadFromTable2 = (spark.read
                         # If `Constants.SERVER` is not provided, the `<database_name>` from the three-part table name
                         # argument to the `synapsesql` method is used to infer the Synapse Dedicated SQL End Point.
                         .option(Constants.SERVER, "xxx.sql.azuresynapse.net")
                         # Defaults to storage path defined in the runtime configurations
                         .option(Constants.TEMP_FOLDER, "abfss://******@xxx.dfs.core.windows.net/xxx")
                         # Three-part table name from where data will be read.
                         .synapsesql("database.schema.view")
                         # Cast the column at ordinal 27 to string; replace `column_name`
                         # with the actual name of that column in your view.
                         .withColumn("column_name", col("column_name").cast("string"))
                         # Fetch a sample of 10 records
                         .limit(10))

    # Show contents of the dataframe
    dfToReadFromTable2.show()
    

    If the issue persists, you can also check the encoding of the data in the column and ensure that it matches what the Spark pool expects.
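
    Another option, if your Spark pool's connector version supports query-based reads (the Constants.QUERY option), is to push a cast into the dedicated SQL pool so the data is already in a compatible shape when it is staged. This is a sketch under that assumption; problem_column is a placeholder for the name of the column at ordinal 27:

    import com.microsoft.spark.sqlanalytics
    from com.microsoft.spark.sqlanalytics.Constants import Constants

    dfDiagnostic = (spark.read
                    .option(Constants.SERVER, "xxx.sql.azuresynapse.net")
                    # Query-based reads take the database name as an option
                    # instead of a three-part table name
                    .option(Constants.DATABASE, "database")
                    .option(Constants.TEMP_FOLDER, "abfss://******@xxx.dfs.core.windows.net/xxx")
                    # Widen the suspect column server-side and return its length in
                    # characters and bytes, to see which values overflow VARCHAR(7)
                    .option(Constants.QUERY,
                            "SELECT CAST(problem_column AS VARCHAR(100)) AS problem_column, "
                            "LEN(problem_column) AS char_len, "
                            "DATALENGTH(problem_column) AS byte_len "
                            "FROM schema.view")
                    .synapsesql())

    dfDiagnostic.show()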

    I hope this helps! Let me know if you have any further questions.

