Error reading delta parquet with new field added

arkiboys 9,686 Reputation points
2022-05-11T06:24:10.03+00:00

Hello,
I am loading daily delta parquet files into a separate folder for each day,
i.e.
...
/year=2022/month=05/day=09
/year=2022/month=05/day=10

Today I added one more column to the load, so the new field should be present in day=11.

This is what I use to read the delta parquet files in each day folder.
It works fine for every previous day but not for today, and I suspect it has to do with the new field in today's load.
Do you know how to solve this?

delta_split_delivery_folder_path = "/prints/dloads/*"

df_delta_split = spark.read.parquet(f"abfss://{marketing_container_name}@{storage_account_name}.dfs.core.windows.net{delta_split_delivery_folder_path}")

yearNo = 2022
monthNo = 5  # written without a leading zero: the literal 05 is a syntax error in Python 3

With dayNo=11 the read fails with the error shown below; every earlier day works fine.

dayNo=11
df_today = df_delta_split.filter(f"year = {yearNo} and month = {monthNo} and day = {dayNo}")

display(df_today)

The display call gives this error:

UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
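
An exception like this commonly means Spark tried to decode a Parquet column as a different type than the one physically stored in the file, so one way to narrow it down is to compare the schema of the failing day folder with that of a day that reads fine. The helper below is only a sketch: the column names and types in `example` are hypothetical, and in a notebook the per-folder schemas would have to be collected from Spark first (see the note after the code).

```python
from collections import defaultdict


def find_type_conflicts(schemas):
    """Return {column: {type: [folders]}} for columns whose type differs between folders.

    `schemas` maps a folder label to {column_name: type_string}.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for folder, columns in schemas.items():
        for name, dtype in columns.items():
            seen[name][dtype].append(folder)
    # Keep only the columns that were seen with more than one type.
    return {name: dict(by_type) for name, by_type in seen.items() if len(by_type) > 1}


def find_new_columns(schemas, baseline, candidate):
    """Return the columns present in `candidate` but missing from `baseline`."""
    return sorted(set(schemas[candidate]) - set(schemas[baseline]))


# Hypothetical schemas, standing in for what Spark would report per day folder.
example = {
    "day=10": {"id": "int", "amount": "string"},
    "day=11": {"id": "int", "amount": "int", "channel": "string"},
}

conflicts = find_type_conflicts(example)
new_columns = find_new_columns(example, "day=10", "day=11")
print(conflicts)
print(new_columns)
```

In Databricks, each inner dictionary could be built by reading one day folder on its own and calling `dict(spark.read.parquet(day_path).dtypes)`, where `day_path` is a placeholder for the full abfss path of that folder. A column that shows up in `conflicts` is a likely candidate for the failure; a column that only appears in `new_columns` is new but does not by itself clash with older files.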

Azure Databricks

1 answer

  1. ShaikMaheer-MSFT 38,326 Reputation points Microsoft Employee
    2022-05-12T15:42:13.707+00:00

    Hi @arkiboys ,

    Thanks for posting your query on the Microsoft Q&A platform.

    Could you please try using StringType when writing to parquet and see if that helps?

    A similar error was discussed in the post below, where the user confirms that writing the data as StringType resolved the issue. Please let us know how it goes. Thank you.
    https://stackoverflow.com/questions/41133327/spark-error-reading-datetype-columns-in-partitioned-parquet-data
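
As a rough illustration of the StringType suggestion, the sketch below builds Spark SQL select expressions that cast selected columns to STRING before the data is written. The column names are hypothetical, the helper is not part of any library, and it assumes column names contain no backticks.

```python
def string_cast_exprs(columns, cast_to_string):
    """Build Spark SQL select expressions, casting the chosen columns to STRING."""
    exprs = []
    for name in columns:
        quoted = f"`{name}`"  # backticks guard names with spaces or dots
        if name in cast_to_string:
            exprs.append(f"CAST({quoted} AS STRING) AS {quoted}")
        else:
            exprs.append(quoted)
    return exprs


# Hypothetical column list; in a notebook this would be df.columns.
exprs = string_cast_exprs(["id", "amount", "channel"], {"amount"})
print(exprs)
```

In a notebook the expressions could be applied with `df.selectExpr(*exprs).write.parquet(output_path)`, where `df` and `output_path` are placeholders for the daily DataFrame and its target day folder. Files written earlier with the old type would still need to be rewritten, or read separately, for all days to share one schema.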