Thanks for reaching out to Microsoft Q&A.
You're right, it can be confusing when Synapse treats timestamps differently from PySpark. Here's a breakdown of why you're seeing UTC timestamps in Synapse:
PySpark Behavior:
- PySpark lets you configure the session time zone ("spark.sql.session.timeZone"). When you read the data, PySpark renders each stored timestamp in that time zone.
- In your case, setting it to "Europe/Berlin" displays the stored instant in CET (UTC+1), so you see "2024-02-07T00:00:00".
- Setting it to "UTC" displays the instant as-is: "2024-02-06T23:00:00". A short sketch of this behavior follows below.
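As a quick illustration, here is a minimal PySpark sketch of that behavior. The Parquet path is a hypothetical placeholder; adjust it to your ADLS location:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical path - replace with your actual ADLS Parquet location.
path = "abfss://container@account.dfs.core.windows.net/data/events"

# Render timestamps in Berlin time: the stored UTC instant
# 2024-02-06T23:00:00 displays as 2024-02-07T00:00:00 (CET, UTC+1).
spark.conf.set("spark.sql.session.timeZone", "Europe/Berlin")
spark.read.parquet(path).show(truncate=False)

# Render the same stored instant as-is in UTC: 2024-02-06T23:00:00.
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.read.parquet(path).show(truncate=False)
```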
Synapse Serverless SQL Behavior:
- Synapse Serverless SQL currently treats all timestamps as UTC. When you read the Parquet data, it interprets the raw value (which carries no time zone marker) as UTC, which is why you always see "2024-02-06T23:00:00" regardless of your local settings.
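For instance, a serverless query along these lines (the storage URL and column name are hypothetical) returns the raw stored instant no matter what the client's local time zone is:

```sql
-- Hypothetical ADLS path and column name; replace with your own.
SELECT TOP 10 event_ts
FROM OPENROWSET(
    BULK 'https://account.dfs.core.windows.net/container/data/events/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;
-- event_ts comes back as the raw stored value, interpreted as UTC:
-- 2024-02-06 23:00:00
```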
The Challenge:
The issue arises because Parquet stores timestamps as plain instants without the original time zone, and Synapse interprets them as UTC. This means you have to handle any conversion explicitly.
Possible Solutions:
- Convert in PySpark: Before storing the data in ADLS, use PySpark to normalize the timestamps to a single time zone (e.g., UTC) and store them that way, so everyone accessing the data interprets them consistently (see the first sketch after this list).
- Convert in Synapse: Use T-SQL functions like `AT TIME ZONE` within your Synapse queries to convert the UTC timestamps to the desired time zone based on user needs (see the second sketch after this list).
- Power BI: If connecting to Synapse from Power BI, you might need to configure Power BI to interpret the timestamps correctly. Look for options related to "data source time zone" or similar settings.
- User Awareness: Make sure your users know that timestamps in Synapse are stored in UTC and may need converting for display.
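A rough sketch of the first option, assuming the source values were captured as Berlin wall-clock times (the paths and the event_ts column are hypothetical placeholders); to_utc_timestamp reinterprets a wall-clock value from a named zone as a UTC instant before writing:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical paths and column name; adjust to your data.
src = "abfss://container@account.dfs.core.windows.net/data/raw"
dst = "abfss://container@account.dfs.core.windows.net/data/normalized"

df = spark.read.parquet(src)

# Reinterpret wall-clock values captured in Berlin time as UTC instants,
# so every consumer (including Synapse) reads the same moment in time.
df = df.withColumn("event_ts", F.to_utc_timestamp(F.col("event_ts"), "Europe/Berlin"))

df.write.mode("overwrite").parquet(dst)
```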
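And a sketch of the second option, again with a placeholder storage URL and column name. In T-SQL, the first `AT TIME ZONE` tags the time-zone-less value as UTC, and the second converts it to the target zone:

```sql
-- Hypothetical ADLS path and column name; replace with your own.
SELECT
    event_ts AS event_ts_utc,
    -- Tag the stored value as UTC, then convert it to Berlin time.
    (event_ts AT TIME ZONE 'UTC') AT TIME ZONE 'Central European Standard Time'
        AS event_ts_berlin
FROM OPENROWSET(
    BULK 'https://account.dfs.core.windows.net/container/data/events/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;
```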
Hope this helps. Do let us know if you have any further queries.