Update: We have been able to bypass this issue for now by altering the SQL queries in our upstream Parquet source (Google BigQuery) to export datetimes/timestamps as strings using UNIX_MILLIS(TIMESTAMP(datecolumn)). The ADF Data Flow Parquet sources were able to cast these values back to timestamps, requiring no modifications to our transformation layer.
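For illustration, here is a minimal sketch of the kind of BigQuery export query change described above. Only the UNIX_MILLIS(TIMESTAMP(...)) expression comes from the post; the project, dataset, table, and column names are placeholders, and the CAST to STRING is an assumption based on the description of exporting the values as strings.

```sql
-- Hypothetical BigQuery export query illustrating the workaround described above.
-- `my_project.my_dataset.my_table`, `id`, and `datecolumn` are placeholder names.
SELECT
  id,
  -- Export epoch milliseconds instead of a native DATETIME/TIMESTAMP column;
  -- cast to STRING per the description above (assumption, not confirmed).
  CAST(UNIX_MILLIS(TIMESTAMP(datecolumn)) AS STRING) AS datecolumn
FROM
  `my_project.my_dataset.my_table`;
```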
Has the Dataflow IR upgrade to Spark 3.4 broken timestamps stored in Parquet?
It appears that over the past couple of days, the Dataflow Integration Runtimes have migrated to Spark 3.4. When the source is a Parquet file, ADF Dataflows are now nulling all source timestamps stored as Parquet INT64 where the field metadata has isAdjustedToUTC=false. Looking at the Spark migration guide (https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-33-to-34), we can see the relevant change:
"Since Spark 3.4, when schema inferring on external Parquet files, INT64 timestamps with annotation isAdjustedToUTC=false will be inferred as TimestampNTZ type instead of Timestamp type. To restore the legacy behavior, set spark.sql.parquet.inferTimestampNTZ.enabled to false."
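For context, this is the Spark-level setting the guide refers to, shown here as plain Spark SQL purely to illustrate the behavior change; whether ADF Data Flows expose any way to set this configuration is not something I can confirm, and the Parquet path is a placeholder.

```sql
-- Spark SQL illustration of the legacy-behavior flag from the migration guide.
-- Whether ADF Data Flows let users set this is unknown; path is a placeholder.
SET spark.sql.parquet.inferTimestampNTZ.enabled=false;

-- With the flag disabled, INT64 timestamps annotated isAdjustedToUTC=false are
-- again inferred as Timestamp rather than TimestampNTZ.
SELECT * FROM parquet.`/path/to/exported_file.parquet`;
```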
How should we proceed here? Should we try to modify the metadata upstream? I think it would be prudent to roll back to Spark 3.3, as it seems clear this behavior change hasn't been tested against existing Data Flows.