Thanks for reaching out to Microsoft Q&A.
You're right, it can be confusing when Synapse treats timestamps differently from PySpark. Here's a breakdown of why you're seeing UTC timestamps in Synapse:
PySpark Behavior:
- PySpark lets you configure the session time zone ("spark.sql.session.timeZone"). When you read the data, PySpark renders each stored timestamp in that time zone.
- In your case, setting it to "Europe/Berlin" displays the stored instant in CET (UTC+1), so you see "2024-02-07T00:00:00".
- Setting it to "UTC" displays the instant as-is: "2024-02-06T23:00:00". A short sketch of this behavior follows below.
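As a quick illustration, here is a minimal PySpark sketch of that behavior. The Parquet path is a hypothetical placeholder; adjust it to your ADLS location:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical path - replace with your actual ADLS Parquet location.
path = "abfss://container@account.dfs.core.windows.net/data/events"

# Render timestamps in Berlin time: the stored UTC instant
# 2024-02-06T23:00:00 displays as 2024-02-07T00:00:00 (CET, UTC+1).
spark.conf.set("spark.sql.session.timeZone", "Europe/Berlin")
spark.read.parquet(path).show(truncate=False)

# Render the same stored instant as-is in UTC: 2024-02-06T23:00:00.
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.read.parquet(path).show(truncate=False)
```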
Synapse Serverless SQL Behavior:
- Synapse Serverless SQL currently treats all timestamps as UTC. When you read the Parquet data, it interprets the raw value (which carries no time zone marker) as UTC, which is why you always see "2024-02-06T23:00:00" regardless of your local settings.
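For instance, a serverless query along these lines (the storage URL and column name are hypothetical) returns the raw stored instant no matter what the client's local time zone is:

```sql
-- Hypothetical ADLS path and column name; replace with your own.
SELECT TOP 10 event_ts
FROM OPENROWSET(
    BULK 'https://account.dfs.core.windows.net/container/data/events/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;
-- event_ts comes back as the raw stored value, interpreted as UTC:
-- 2024-02-06 23:00:00
```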
The Challenge:
The issue arises because Parquet stores timestamps as plain instants without the original time zone, and Synapse interprets them as UTC. This means you have to handle any conversion explicitly.
Possible Solutions:
- Convert in PySpark: Before storing the data in ADLS, use PySpark to normalize the timestamps to a single time zone (e.g., UTC) and store them that way, so everyone accessing the data interprets them consistently (see the first sketch after this list).
- Convert in Synapse: Use T-SQL functions like `AT TIME ZONE` within your Synapse queries to convert the UTC timestamps to the desired time zone based on user needs (see the second sketch after this list).
- Power BI: If connecting to Synapse from Power BI, you might need to configure Power BI to interpret the timestamps correctly. Look for options related to "data source time zone" or similar settings.
- User Awareness: Make sure your users know that timestamps in Synapse are stored in UTC and may need converting for display.
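A rough sketch of the first option, assuming the source values were captured as Berlin wall-clock times (the paths and the event_ts column are hypothetical placeholders); to_utc_timestamp reinterprets a wall-clock value from a named zone as a UTC instant before writing:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical paths and column name; adjust to your data.
src = "abfss://container@account.dfs.core.windows.net/data/raw"
dst = "abfss://container@account.dfs.core.windows.net/data/normalized"

df = spark.read.parquet(src)

# Reinterpret wall-clock values captured in Berlin time as UTC instants,
# so every consumer (including Synapse) reads the same moment in time.
df = df.withColumn("event_ts", F.to_utc_timestamp(F.col("event_ts"), "Europe/Berlin"))

df.write.mode("overwrite").parquet(dst)
```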
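And a sketch of the second option, again with a placeholder storage URL and column name. In T-SQL, the first `AT TIME ZONE` tags the time-zone-less value as UTC, and the second converts it to the target zone:

```sql
-- Hypothetical ADLS path and column name; replace with your own.
SELECT
    event_ts AS event_ts_utc,
    -- Tag the stored value as UTC, then convert it to Berlin time.
    (event_ts AT TIME ZONE 'UTC') AT TIME ZONE 'Central European Standard Time'
        AS event_ts_berlin
FROM OPENROWSET(
    BULK 'https://account.dfs.core.windows.net/container/data/events/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;
```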
Hope this helps. Do let us know if you have any further queries.