Hi all,
I save a DataFrame in Delta format with PySpark and create a managed table in a Lake Database, so that it can also be accessed with SQL scripts over the serverless SQL endpoint:
# continuously append the raw stream to the bronze Delta location
raw_stream \
    .writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", bronze_path_chk_path) \
    .start(bronze_table_path)
spark.sql(f'CREATE TABLE IF NOT EXISTS bronze_layer.{entity} USING DELTA LOCATION \'{bronze_table_path}/\'')
As some of the columns contain nested JSON with a length of more than 10,000 characters, I receive an error when querying them over the serverless SQL endpoint:
Error: Column 'JSON_column' of type 'NVARCHAR(4000)' is not compatible with external data type 'JSON string. (underlying parquet nested/repeatable column must be read as VARCHAR or CHAR)'.
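For reference, this is roughly the direct query I would like to work against the managed table (bronze_layer.entity and the column names are just the placeholders used in this post); right now it fails with the error quoted above:

-- Sketch of the intended direct query over the serverless SQL endpoint,
-- using the placeholder table/column names from this post.
SELECT
    column1,
    column2,
    JSON_VALUE(JSON_column, '$.property1') AS property1
FROM bronze_layer.entity;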
Primary question: How can I define the column type as VARCHAR(MAX) from within a notebook, so that the table can be queried directly?
Secondary question: If it is not possible from within a notebook, how can this be set up in a Lake Database? A serverless SQL database is not an option, as I need the managed Delta tables.
One more hint: the following query returns proper results when run as a SQL script on the Lake Database:
SELECT
    column1,
    column2,
    JSON_VALUE(JSON_column, '$.property1') AS property1,
    JSON_VALUE(JSON_column, '$.property2') AS property2
FROM
    OPENROWSET(
        BULK 'https://***.dfs.core.windows.net/bronze/entity/',
        FORMAT = 'DELTA'
    )
    WITH
    (
        column1,
        column2,
        JSON_column VARCHAR(MAX)
    ) AS query
Thanks