Question

arkiboys asked

Parquet column cannot be converted.

In an ADF data flow, using the Derived Column transformation, I convert the columns to the appropriate data types,
e.g. string to date or to decimal.
The sink is Delta (Parquet).
Then in Databricks I try to read the Delta table, but there is an error:

Parquet column cannot be converted. Column: [title_date], Expected: StringType, Found: INT32

Note that I do not convert anything to int.
Any suggestions?

Thank you
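For reference, a minimal sketch of the kind of Databricks read that hits this error; the path below is a placeholder, not the real one from the pipeline:

     # Sketch only: read the Delta output written by the ADF data flow.
     # The path is a placeholder for the actual container/folder.
     df = spark.read.format("delta").load("/mnt/containername/folder1/data")
     df.show(5)   # the conversion error surfaces once rows are actually read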

azure-databricks


1 Answer

KranthiPakala-MSFT answered

Hello @arkiboys,

Thanks for the question and for using the MS Q&A platform.

Is your cluster Databricks Runtime 7.3 LTS or above?

Cause:
If that is the case, this error usually occurs when the vectorized Parquet reader is decoding a decimal type column to a binary format.

The vectorized Parquet reader is enabled by default in Databricks Runtime 7.3 and above for reading datasets in Parquet files. The read schema uses atomic data types: binary, boolean, date, string, and timestamp.

Note: This error only occurs if you have decimal type columns in the source data.


Resolution:

If you have decimal type columns in your source data, you should disable the vectorized Parquet reader.

Set spark.sql.parquet.enableVectorizedReader to false in the cluster’s Spark configuration to disable the vectorized Parquet reader at the cluster level.
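For example, in the cluster's Spark configuration (typically under Advanced options > Spark config) the entry would look like this, using the standard key value format:

     spark.sql.parquet.enableVectorizedReader false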


You can also disable the vectorized Parquet reader at the notebook level by running:

     spark.conf.set("spark.sql.parquet.enableVectorizedReader","false")

Important Note: The vectorized Parquet reader enables native record-level filtering using push-down filters, improving memory locality and cache utilization. If you disable the vectorized Parquet reader, there may be a minor performance impact. You should only disable it if you have decimal type columns in your source data.


For detailed info please refer to this document: Apache Spark job fails with Parquet column cannot be converted error

Hope this helps. Please let us know if you have any further queries.




arkiboys commented

Hi,
Yes, the cluster is Databricks Runtime 7.3,
and yes, I am converting to decimal in ADF.
I read this in the documentation, and in the notebook I do run the line:
spark.conf.set("spark.sql.parquet.enableVectorizedReader","false")

But in Databricks, when I read the Delta Parquet files, I get the same error.
Note that in my daily folders, up until yesterday I used string and did no data type conversion;
today (day=18) I have converted the data types.
The current structure of the storage hierarchy is as follows:

containername/folder1/data/year=2002/month=05/day=1
containername/folder1/data/year=2002/month=05/day=2
...
containername/folder1/data/year=2002/month=05/day=17
containername/folder1/data/year=2002/month=05/day=18

I am reading the Delta table like: /containername/folder1/*
and so I get the error mentioned previously.
BUT, if I keep only day=18 and remove all the previous days, there is no error and it works fine.
So the question is: why does it not let me read the Delta Parquet files when all the days are in the one data folder as mentioned above?

Thank you
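
As a diagnostic sketch (my assumption, using the folder paths above): reading an old day folder and the new day folder directly as plain Parquet and comparing their schemas should show whether title_date changed its physical type between day=17 and day=18:

     # Hedged diagnostic: bypass the Delta table and read two day folders as plain Parquet,
     # then compare their schemas; the paths follow the layout shown above.
     old_df = spark.read.parquet("/containername/folder1/data/year=2002/month=05/day=17")
     new_df = spark.read.parquet("/containername/folder1/data/year=2002/month=05/day=18")
     old_df.printSchema()   # title_date likely string here, per the error message
     new_df.printSchema()   # title_date likely written as a non-string (INT32) here

If the two schemas disagree on title_date, that would match the mix of older string files and newer converted files described above.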
