Error reading delta parquet with new field added

Question

Hello,
I am loading daily delta parquet files into a separate folder for each day,
i.e.
...
/year=2022/month=05/day=09
/year=2022/month=05/day=10

Today I added one more column to the load, so the new field should be present in day=11.

This is what I use to read the delta parquet files in each day folder.
It works fine for every previous day but not for today, and I suspect that is because of the new field in today's load.
Do you know how to solve this?

delta_split_delivery_folder_path = "/prints/dloads/*"

df_delta_split = spark.read.parquet(f"abfss://{marketing_container_name}@{storage_account_name}.dfs.core.windows.net{delta_split_delivery_folder_path}")

yearNo = 2022
monthNo = 5  # written as 5 because Python 3 rejects integer literals with a leading zero such as 05

Setting day=11 gives the error shown below, but every previous day works fine:

dayNo=11
df_today = df_delta_split.filter("year=" + str(yearNo) + " and month=" + str(monthNo) + " and day =" + str(dayNo))

display(df_today)

The display call gives this error:

UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary

Answer

Hi @arkiboys,

Thanks for posting your query on the Microsoft Q&A platform.

Could you please try using StringType when writing to parquet and see if that helps?
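
As a rough sketch of what that could look like, assuming the daily load is available as a DataFrame before it is written: the helper names string_cast_exprs and write_as_strings, and the output_path and mode parameters, are made up for illustration and are not part of the original load. Only DataFrame.selectExpr and DataFrame.write.mode(...).parquet(...) are existing PySpark calls.

```python
def string_cast_exprs(columns):
    """Build one SQL expression per column that casts it to STRING and keeps its name."""
    return [f"CAST(`{c}` AS STRING) AS `{c}`" for c in columns]


def write_as_strings(df, output_path, mode="append"):
    """Hypothetical helper: cast every column of df to string, then write parquet.

    df is assumed to be a PySpark DataFrame and output_path the day folder
    that the load writes to. Column names containing backticks are not handled.
    """
    df_str = df.selectExpr(*string_cast_exprs(df.columns))
    df_str.write.mode(mode).parquet(output_path)
```

Writing every column as a string avoids a physical type mismatch between day folders, at the cost of having to cast values back to numeric or date types after reading.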

A similar error was discussed in the post below, where a user confirms that writing the data as StringType resolved the issue. Please let us know how it goes. Thank you.
https://stackoverflow.com/questions/41133327/spark-error-reading-datetype-columns-in-partitioned-parquet-data
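
To check whether the day=11 files really differ from the earlier days, one option is to read a single day folder on its own and compare its schema with another day. The following is only a sketch under assumptions: base_path stands for the full abfss://... path up to /prints/dloads, and day_folder, read_day_schema, and schema_diff are hypothetical helper names. The basePath read option and DataFrame.dtypes are existing PySpark features.

```python
def day_folder(base_path, year, month, day):
    """Build the path of one day folder, e.g. .../year=2022/month=05/day=11."""
    return f"{base_path}/year={year}/month={month:02d}/day={day:02d}"


def read_day_schema(spark, base_path, year, month, day):
    """Read only one day folder and return its schema as {column name: type string}.

    spark is assumed to be an existing SparkSession. Setting basePath keeps the
    year/month/day partition columns in the result.
    """
    df = spark.read.option("basePath", base_path).parquet(
        day_folder(base_path, year, month, day)
    )
    return dict(df.dtypes)


def schema_diff(old_fields, new_fields):
    """Return (columns only in new_fields, columns whose type changed)."""
    added = sorted(set(new_fields) - set(old_fields))
    changed = sorted(
        name
        for name in old_fields
        if name in new_fields and old_fields[name] != new_fields[name]
    )
    return added, changed
```

For example, schema_diff(read_day_schema(spark, base_path, 2022, 5, 10), read_day_schema(spark, base_path, 2022, 5, 11)) would list the new column and any existing column whose type changed between the two days.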
