Synapse - How to store images in SQL DW

Dheeraj 351 Reputation points
2021-11-22T15:16:43.89+00:00

I have some *.png image files in Azure Data Lake that I want to move to SQL DW as a base64 string or binary data.
I can read the images into a dataframe, but Synapse Spark gives an error when writing it to SQL.
Here's the code:

# read the *.png files from the data lake with Spark's image data source
image_df = spark.read.format("image").load(imagePath)

# register a temp view so the dataframe can be queried from the Scala cell below
image_df.createOrReplaceTempView("Mta_ImageTable")
image_df.printSchema()

The above line prints the schema below:

root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)

Now when I try to insert this dataframe into SQL as below, it gives an error:

%%spark  
import org.apache.spark.sql.DataFrame  
import com.microsoft.spark.sqlanalytics.utils.Constants  
import org.apache.spark.sql.SqlAnalyticsConnector._  

// write the selected column to the dedicated SQL pool with the built-in connector
val tb_mtaImage = spark.sql("select data from Mta_ImageTable")  
tb_mtaImage.write.  
    option(Constants.SERVER, "wrk-pr-us-datawarehouse01.sql.azuresynapse.net").  
    synapsesql("Natus_Pool1.dbo.SampleImages_Temp", Constants.INTERNAL)  

The error is:
java.sql.SQLException: com.microsoft.sqlserver.jdbc.SQLServerException: The statement failed. Column 'data' has a data type that cannot participate in a columnstore index.

OR

Can't get JDBC type for struct<origin:string,height:int,width:int,nChannels:int,mode:int,data:binary>
at com.microsoft.spark.sqlanalytics.utils.SQLAnalyticsJDBCWrapper
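
The second error indicates the whole image struct was selected, and a struct has no JDBC equivalent. A minimal sketch, using only standard Spark SQL functions, of flattening the nested columns into plain string/binary columns before the write (the aliases and view name here are illustrative):

import org.apache.spark.sql.functions.{base64, col}

// flatten the nested image struct so every column maps to a plain JDBC type;
// base64() produces the string form of the bytes mentioned in the question
val flat = spark.table("Mta_ImageTable").select(
  col("image.origin").as("origin"),         // source file path (string)
  col("image.data").as("data"),             // raw bytes (binary)
  base64(col("image.data")).as("data_b64")  // base64-encoded string
)
flat.createOrReplaceTempView("Mta_ImageFlat")

Flattening addresses the type-mapping error; the columnstore restriction on the binary column is what the answers below deal with.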


2 answers

  1. Bradley Schacht 76 Reputation points Microsoft Employee
    2021-11-23T05:40:28.71+00:00

    Hi @Dheeraj,

    I'm not sure about the second error at the moment, but as far as the first error goes, I believe the table being created is a round-robin distributed table with a clustered columnstore index. The issue there is that the varbinary data type is not supported in a clustered columnstore index. You would need the table to be a HEAP, but unfortunately, as far as I am aware, the built-in connector does not give you the option to set any options on the final table.

    A workaround would be to create the table ahead of time (manually, through code in your notebook, or via a call to a T-SQL script), ensuring it is a HEAP, and then load the data using a different method such as JDBC, as sketched below. Unfortunately, that will be a slower data transfer. If the volume of data makes the process too slow, you may consider alternative approaches that require some additional work: perform your data processing in the notebook, write the results to the data lake, then use PolyBase or COPY INTO to load the table. If using PolyBase, the table would not need to exist ahead of time (similar to your current Scala code). If using the COPY statement, the table would need to exist ahead of time.
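
    A minimal sketch of that workaround, assuming SQL authentication against the server and table named in the question; the user and password values are placeholders, and the select assumes the nested image.data field from the question's schema:

    import java.sql.DriverManager
    import java.util.Properties

    val url = "jdbc:sqlserver://wrk-pr-us-datawarehouse01.sql.azuresynapse.net:1433;database=Natus_Pool1"
    val props = new Properties()
    props.setProperty("user", "<sql-user>")          // placeholder credentials
    props.setProperty("password", "<sql-password>")

    // 1) pre-create the target table as a HEAP so varbinary(max) is allowed
    val conn = DriverManager.getConnection(url, props)
    try {
      conn.createStatement().execute(
        """IF OBJECT_ID('dbo.SampleImages_Temp') IS NULL
          |CREATE TABLE dbo.SampleImages_Temp (data VARBINARY(MAX))
          |WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)""".stripMargin)
    } finally {
      conn.close()
    }

    // 2) load through Spark's generic JDBC writer (slower than the Synapse connector)
    spark.sql("select image.data as data from Mta_ImageTable")
      .write.mode("append")
      .jdbc(url, "dbo.SampleImages_Temp", props)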

    I hope this helps get you going in the right direction.


  2. MartinJaffer-MSFT 26,236 Reputation points
    2021-11-23T18:35:47.277+00:00

    I agree with @Bradley Schacht's assessment that the problem is caused by using a variable-length data type in the index.

    However, I have a simpler suggestion, @Dheeraj: change the SQL table's schema to use something else as the index. Use the Height or Width column, or add an ID column. Do not use the Origin or Data columns in the index.
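
    A minimal sketch of that table definition, assuming an added identity ID column; it reuses the JDBC connection pattern from the previous answer, and the column names and sizes are illustrative rather than taken from the question:

    import java.sql.DriverManager
    import java.util.Properties

    val url = "jdbc:sqlserver://wrk-pr-us-datawarehouse01.sql.azuresynapse.net:1433;database=Natus_Pool1"
    val props = new Properties()
    props.setProperty("user", "<sql-user>")          // placeholder credentials
    props.setProperty("password", "<sql-password>")

    // keep the clustered index on a fixed-width ID column so the
    // varbinary column never participates in the index
    val conn = DriverManager.getConnection(url, props)
    try {
      conn.createStatement().execute(
        """CREATE TABLE dbo.SampleImages_Temp (
          |  id     INT IDENTITY(1, 1) NOT NULL,
          |  origin VARCHAR(1000)      NULL,
          |  data   VARBINARY(MAX)     NULL
          |)
          |WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED INDEX (id))""".stripMargin)
    } finally {
      conn.close()
    }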

