Update: We have been able to bypass this issue for now by altering the SQL queries in our upstream Parquet source (Google BigQuery) to export datetimes/timestamps as strings using UNIX_MILLIS(TIMESTAMP(datecolumn)). The ADF Data Flow Parquet sources were able to cast these values back to timestamps, requiring no modifications to our transformation layer.
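For illustration, here is a minimal sketch of the kind of BigQuery export query change described above. Only the UNIX_MILLIS(TIMESTAMP(...)) expression comes from the post; the project, dataset, table, and column names are placeholders, and the CAST to STRING is an assumption based on the description of exporting the values as strings.

```sql
-- Hypothetical BigQuery export query illustrating the workaround described above.
-- `my_project.my_dataset.my_table`, `id`, and `datecolumn` are placeholder names.
SELECT
  id,
  -- Export epoch milliseconds instead of a native DATETIME/TIMESTAMP column;
  -- cast to STRING per the description above (assumption, not confirmed).
  CAST(UNIX_MILLIS(TIMESTAMP(datecolumn)) AS STRING) AS datecolumn
FROM
  `my_project.my_dataset.my_table`;
```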
Has the Dataflow IR upgrade to Spark 3.4 broken timestamps stored in Parquet?
It appears that over the past couple of days, the Dataflow Integration Runtimes have migrated to Spark 3.4. When the source is a Parquet file, ADF Dataflows are now nulling all source timestamps stored as Parquet INT64 where the field metadata has isAdjustedToUTC=false. Looking at the Spark migration guide (https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-33-to-34), we can see the relevant change:
"Since Spark 3.4, when schema inferring on external Parquet files, INT64 timestamps with annotation isAdjustedToUTC=false will be inferred as TimestampNTZ type instead of Timestamp type. To restore the legacy behavior, set spark.sql.parquet.inferTimestampNTZ.enabled to false."
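For context, this is the Spark-level setting the guide refers to, shown here as plain Spark SQL purely to illustrate the behavior change; whether ADF Data Flows expose any way to set this configuration is not something I can confirm, and the Parquet path is a placeholder.

```sql
-- Spark SQL illustration of the legacy-behavior flag from the migration guide.
-- Whether ADF Data Flows let users set this is unknown; path is a placeholder.
SET spark.sql.parquet.inferTimestampNTZ.enabled=false;

-- With the flag disabled, INT64 timestamps annotated isAdjustedToUTC=false are
-- again inferred as Timestamp rather than TimestampNTZ.
SELECT * FROM parquet.`/path/to/exported_file.parquet`;
```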
How should we proceed here? Should we try to modify the metadata upstream? I think it would be prudent to roll back to Spark 3.3, as it seems clear this behavior change hasn't been tested against existing Data Flows.