Azure Synapse pyspark translates STRING datatype into varchar(8000) for external table

Question

Grützmacher, Sven 5

Hi there,

we are trying to load some external tables in Azure Synapse using a PySpark notebook, but the datatypes seem to mismatch. When we initialize the table, we execute (stripped-down example):

CREATE OR REPLACE TABLE LANDING_DB.Opportunity (
 AccountId varchar(255),
 Id varchar(255),
 Name STRING,
 Description STRING,
 dwh_deleted BOOLEAN)
USING DELTA LOCATION 'abfss://******@XXX.dfs.core.windows.net/Landing/ABC/Opportunity/'

via spark.sql().

Yet the auto-generated table uses 'varchar(8000)' as the datatype for the Name and Description columns. We would like to use varchar(max), especially since the Description column can hold a lot of text.

Are we missing some settings or is this a bug?

Help would be appreciated,

best, Sven

PRADEEPCHEEKATLA 90,646 Reputation points Moderator

2023-06-21T05:08:21.88+00:00
@Grützmacher, Sven - Thanks for the question and for using the MS Q&A platform.

It seems that you are facing a datatype mismatch issue while loading external tables in Azure Synapse using a PySpark notebook.

When you create an external table in Azure Synapse using PySpark, the STRING datatype is translated into varchar(8000) by default. This is because 8000 is the largest explicit length that varchar(n) accepts in SQL Server; anything longer requires varchar(max).
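One detail worth keeping in mind: the length in varchar(n) is counted in bytes, not characters, so under a UTF-8 collation a text with fewer than 8000 characters can still exceed the limit. A minimal sketch for checking values before loading, assuming the text is stored as UTF-8 (the helper names `utf8_size` and `fits_varchar_8000` are hypothetical and not part of any Synapse or Spark API):

```python
# Assumption: the target column is varchar(8000) and values are stored as UTF-8.
VARCHAR_N_LIMIT_BYTES = 8000  # largest n accepted by varchar(n)


def utf8_size(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits_varchar_8000(text: str) -> bool:
    """Return True if the UTF-8 encoded text fits into 8000 bytes."""
    return utf8_size(text) <= VARCHAR_N_LIMIT_BYTES


# 8000 ASCII characters take one byte each.
ascii_text = "a" * 8000
# U+00E9 (e with acute accent) takes two bytes in UTF-8.
accented_text = "\u00e9" * 4001

print(fits_varchar_8000(ascii_text))
print(fits_varchar_8000(accented_text))
```

The same check could be applied to a sample of the Description values to see whether varchar(8000) would be too small in practice.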

To use the VARCHAR(MAX) datatype instead of varchar(8000), you can explicitly define the schema in a WITH clause with the larger VARCHAR(MAX) column type.

Here's an example:

CREATE OR REPLACE TABLE LANDING_DB.Opportunity (
 AccountId varchar(255),
 Id varchar(255),
 Name varchar(max),
 Description varchar(max),
 dwh_deleted BOOLEAN )
USING DELTA LOCATION 'abfss://******@XXX.dfs.core.windows.net/Landing/ABC/Opportunity/'
OPTIONS ( 'schema', ' AccountId string, Id string, Name varchar(max), Description varchar(max), dwh_deleted boolean ' )

In this example, we explicitly define the schema in the OPTIONS clause with the larger VARCHAR(MAX) column type for the Name and Description columns.

For more details, refer to https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/resources-self-help-sql-on-demand?tabs=x80070002#string-or-binary-data-would-be-truncated

Hope this helps. Do let us know if you have any further queries.
Ben Eady 6 Reputation points

2024-08-06T12:48:19.36+00:00

Could you provide more details on this approach? We have not been successful in using this in our environment to try and convert Delta String values to Varchar Max. Can this be executed from the serverless sql endpoint to interact with tables in the lake database or does it apply to some other scenario?

The link provided describes "If you are getting this error, explicitly define the schema in a WITH clause with the larger VARCHAR(MAX) column type to resolve this error" but does not give an example of this syntax.
Sumesh 0 Reputation points

2024-12-03T20:30:33.4333333+00:00

Do we have example code for using the WITH clause?
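
Regarding the WITH clause asked about above: in the linked troubleshooting page it refers to the OPENROWSET syntax of the serverless SQL pool (T-SQL), not to Spark SQL. A minimal sketch, assuming the query is run against the serverless SQL endpoint directly on the Delta folder. The helper `build_openrowset_query`, the alias `opportunity_rows`, and the T-SQL types chosen for the non-string columns are illustrative assumptions, and the storage path is kept elided exactly as in the question:

```python
def build_openrowset_query(location: str, columns: dict) -> str:
    """Build a T-SQL OPENROWSET query with an explicit WITH schema.

    Hypothetical helper: it only assembles the statement text and does not
    connect to Synapse or execute anything.
    """
    # Escape single quotes in the path and closing brackets in column names.
    safe_location = location.replace("'", "''")
    column_lines = ",\n".join(
        "    [{}] {}".format(name.replace("]", "]]"), sql_type)
        for name, sql_type in columns.items()
    )
    return (
        "SELECT *\n"
        "FROM OPENROWSET(\n"
        "    BULK '{}',\n"
        "    FORMAT = 'DELTA'\n"
        ") WITH (\n"
        "{}\n"
        ") AS opportunity_rows;"
    ).format(safe_location, column_lines)


# Column names from the question; varchar(max) replaces the default varchar(8000).
# Assumption: bit is used as the T-SQL counterpart of the Spark BOOLEAN column.
opportunity_columns = {
    "AccountId": "varchar(255)",
    "Id": "varchar(255)",
    "Name": "varchar(max)",
    "Description": "varchar(max)",
    "dwh_deleted": "bit",
}

query = build_openrowset_query(
    "abfss://******@XXX.dfs.core.windows.net/Landing/ABC/Opportunity/",
    opportunity_columns,
)
print(query)
```

The printed statement would be executed from the serverless SQL endpoint rather than through spark.sql(). Whether this covers the lake database scenario described above would need to be verified in your own environment.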
