Synapse - How to store images in SQL DW

Dheeraj 351 Reputation points
2021-11-22T15:16:43.89+00:00

I have some *.png image files in Azure Data Lake that I want to move to SQL DW as a base64 string or binary data.
I can read the images into a dataframe, but Synapse Spark gives an error when writing it to SQL.
Here's the code:

# read the *.png files from the data lake with Spark's image data source
image_df = spark.read.format("image").load(imagePath)

# register a temp view so the dataframe can be queried from the Scala cell below
image_df.createOrReplaceTempView("Mta_ImageTable")
image_df.printSchema()

The above line prints the schema below:

root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)

Now when I try to insert this dataframe into SQL as below, it gives an error:

%%spark  
import org.apache.spark.sql.DataFrame  
import com.microsoft.spark.sqlanalytics.utils.Constants  
import org.apache.spark.sql.SqlAnalyticsConnector._  

// write the selected column to the dedicated SQL pool with the built-in connector
val tb_mtaImage = spark.sql("select data from Mta_ImageTable")  
tb_mtaImage.write.  
    option(Constants.SERVER, "wrk-pr-us-datawarehouse01.sql.azuresynapse.net").  
    synapsesql("Natus_Pool1.dbo.SampleImages_Temp", Constants.INTERNAL)  

The error is:
java.sql.SQLException: com.microsoft.sqlserver.jdbc.SQLServerException: The statement failed. Column 'data' has a data type that cannot participate in a columnstore index.

OR

Can't get JDBC type for struct<origin:string,height:int,width:int,nChannels:int,mode:int,data:binary>
at com.microsoft.spark.sqlanalytics.utils.SQLAnalyticsJDBCWrapper
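
The second error indicates the whole image struct was selected, and a struct has no JDBC equivalent. A minimal sketch, using only standard Spark SQL functions, of flattening the nested columns into plain string/binary columns before the write (the aliases and view name here are illustrative):

import org.apache.spark.sql.functions.{base64, col}

// flatten the nested image struct so every column maps to a plain JDBC type;
// base64() produces the string form of the bytes mentioned in the question
val flat = spark.table("Mta_ImageTable").select(
  col("image.origin").as("origin"),         // source file path (string)
  col("image.data").as("data"),             // raw bytes (binary)
  base64(col("image.data")).as("data_b64")  // base64-encoded string
)
flat.createOrReplaceTempView("Mta_ImageFlat")

Flattening addresses the type-mapping error; the columnstore restriction on the binary column is what the answers below deal with.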


2 answers

  1. Bradley Schacht 76 Reputation points Microsoft Employee
    2021-11-23T05:40:28.71+00:00

    Hi @Dheeraj,

    I'm not sure about the second error at the moment, but as far as the first error goes, I believe the table being created is a round-robin distributed table with a clustered columnstore index. The issue there is that the varbinary data type is not supported in a clustered columnstore index. You would need the table to be a HEAP, but unfortunately, as far as I am aware, the built-in connector does not give you the option to set any options on the final table.

    A workaround would be to create the table ahead of time (manually, through code in your notebook, or via a call to a T-SQL script), ensuring it is a HEAP, and then load the data using a different method such as JDBC, as sketched below. Unfortunately, that will be a slower data transfer. If the volume of data makes the process too slow, you may consider alternative approaches that require some additional work: perform your data processing in the notebook, write the results to the data lake, then use PolyBase or COPY INTO to load the table. If using PolyBase, the table would not need to exist ahead of time (similar to your current Scala code). If using the COPY statement, the table would need to exist ahead of time.
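
    A minimal sketch of that workaround, assuming SQL authentication against the server and table named in the question; the user and password values are placeholders, and the select assumes the nested image.data field from the question's schema:

    import java.sql.DriverManager
    import java.util.Properties

    val url = "jdbc:sqlserver://wrk-pr-us-datawarehouse01.sql.azuresynapse.net:1433;database=Natus_Pool1"
    val props = new Properties()
    props.setProperty("user", "<sql-user>")          // placeholder credentials
    props.setProperty("password", "<sql-password>")

    // 1) pre-create the target table as a HEAP so varbinary(max) is allowed
    val conn = DriverManager.getConnection(url, props)
    try {
      conn.createStatement().execute(
        """IF OBJECT_ID('dbo.SampleImages_Temp') IS NULL
          |CREATE TABLE dbo.SampleImages_Temp (data VARBINARY(MAX))
          |WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)""".stripMargin)
    } finally {
      conn.close()
    }

    // 2) load through Spark's generic JDBC writer (slower than the Synapse connector)
    spark.sql("select image.data as data from Mta_ImageTable")
      .write.mode("append")
      .jdbc(url, "dbo.SampleImages_Temp", props)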

    I hope this helps get you going in the right direction.


  2. MartinJaffer-MSFT 26,236 Reputation points
    2021-11-23T18:35:47.277+00:00

    I agree with @Bradley Schacht's assessment that the problem is caused by using a variable-length data type in the index.

    However, I have a simpler suggestion, @Dheeraj: change the SQL table's schema to use something else as the index. Use the Height or Width column, or add an ID column. Do not use the Origin or Data columns in the index.
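
    A minimal sketch of that table definition, assuming an added identity ID column; it reuses the JDBC connection pattern from the previous answer, and the column names and sizes are illustrative rather than taken from the question:

    import java.sql.DriverManager
    import java.util.Properties

    val url = "jdbc:sqlserver://wrk-pr-us-datawarehouse01.sql.azuresynapse.net:1433;database=Natus_Pool1"
    val props = new Properties()
    props.setProperty("user", "<sql-user>")          // placeholder credentials
    props.setProperty("password", "<sql-password>")

    // keep the clustered index on a fixed-width ID column so the
    // varbinary column never participates in the index
    val conn = DriverManager.getConnection(url, props)
    try {
      conn.createStatement().execute(
        """CREATE TABLE dbo.SampleImages_Temp (
          |  id     INT IDENTITY(1, 1) NOT NULL,
          |  origin VARCHAR(1000)      NULL,
          |  data   VARBINARY(MAX)     NULL
          |)
          |WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED INDEX (id))""".stripMargin)
    } finally {
      conn.close()
    }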

