Unable to write data from Databricks into the DWH (dedicated SQL pool) when a column in the table is longer than 8000 characters

MadhuVamsi-2459 32 Reputation points
2022-09-22T06:37:58.98+00:00

Hi team ,

We are writing data into the DWH from Databricks after reading it into a dataframe.

We create an empty table with datatypes and columns matching the dataframe, but when a column's length is greater than 8000 we are not able to load that particular column; when we drop that column from the dataframe and load, it works fine. Please let me know if there is another way, or whether I am doing something wrong.
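
For reference, here is a simplified sketch of the kind of write we are doing (table, storage, and credential values are placeholders, assuming the standard Databricks Synapse connector, com.databricks.spark.sqldw):

```python
# Illustrative sketch only -- server, storage, credential, and table names are placeholders.
# Assumes the standard Databricks -> Synapse connector (com.databricks.spark.sqldw).
(df.write
   .format("com.databricks.spark.sqldw")
   .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<dwh>;user=<user>;password=<pwd>")
   .option("tempDir", "abfss://<container>@<storage-account>.dfs.core.windows.net/tmp")
   .option("forwardSparkAzureStorageCredentials", "true")
   .option("dbTable", "dbo.my_target_table")   # pre-created table matching the dataframe
   .mode("append")
   .save())
```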

@PRADEEPCHEEKATLA-MSFT

Thanks and Regards
Bannuru Sri Madhu Vamsi


1 answer

  1. PRADEEPCHEEKATLA-MSFT 76,921 Reputation points Microsoft Employee
    2022-10-10T04:06:34.57+00:00

    Hello @MadhuVamsi-2459 ,

    Apologies for the delay in response.

    This is a problem that goes back 15 or more years, and reviewing your table design may be more effective.
    When designing your table, you want to aim for rows of less than 8,060 bytes so that a row fits on a page.
    If you exceed the in-row (IN_ROW_DATA) limit, SQL Server has to use the ROW_OVERFLOW_DATA allocation unit.

    For the long column you need to use one of the large-value types: varchar(max), nvarchar(max), or varbinary(max). SQL Server then stores the oversized value on a separate page and keeps a 24-byte pointer in the original row.
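
    As a minimal sketch, pre-creating the target table with a large-value type might look like this, here via pyodbc (server, database, table, and column names are placeholders; a heap is used because clustered columnstore tables in dedicated SQL pool do not support the (max) types):

```python
# Sketch only: placeholder server/database/credential values, hypothetical table and column names.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=<dwh>;UID=<user>;PWD=<pwd>",
    autocommit=True,
)
# NVARCHAR(MAX) for the long column; heap distribution because clustered
# columnstore tables in dedicated SQL pool do not support the (max) types.
conn.cursor().execute("""
    CREATE TABLE dbo.my_target_table (
        id INT NOT NULL,
        big_text_col NVARCHAR(MAX) NULL
    )
    WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);
""")
conn.close()
```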

    The 2 GB limit for these types applies per value, not to the table as a whole. According to the docs, the total size of a columnstore table is unlimited... so a table of 80 GB should be fine.

    For the load itself I would suggest trying a different approach: shoving that volume of data through an ancient JDBC connection will be troublesome. The recommended pattern for moving data from Databricks to Azure Synapse is to use the Azure Synapse Dedicated SQL Pool Connector for Apache Spark.
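
    A sketch of what a write through that connector looks like in PySpark, based on the connector's documentation for Synapse Spark pools (exact imports and options can vary by version, and the workspace, database, and table names below are placeholders):

```python
# Sketch based on the documented PySpark API of the Azure Synapse Dedicated SQL Pool
# Connector for Apache Spark (runs on Synapse Spark pools); names are placeholders.
import com.microsoft.spark.sqlanalytics
from com.microsoft.spark.sqlanalytics.Constants import Constants

(df.write
   .option(Constants.SERVER, "<workspace-name>.sql.azuresynapse.net")
   .mode("overwrite")
   .synapsesql("<database>.<schema>.<table>"))
```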

    I would start by looking at the physical length of the data in the column and setting the type appropriately if you can.
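
    For instance, a quick way to check the longest value in the problematic column from the Databricks side (the column name is a placeholder):

```python
# Check the maximum physical string length in the offending column (name is a placeholder).
from pyspark.sql import functions as F

df.select(F.max(F.length(F.col("big_text_col"))).alias("max_len")).show()
```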

    Hope this helps.