Trying to understand how timestamps are handled by Synapse Serverless SQL and PowerBI

Victor Seifert 151 Reputation points
2024-03-15T16:04:06.24+00:00

Maybe I have a knot in my brain because it's Friday evening, but I can't wrap my head around this:

I have Parquet data in ADLS that contains a timestamp column.

In the raw file, it is stored as '2024-02-07T00:00:00.000**+0100**'. It has a definite timezone.

This means,

  • sitting in Central Europe (CET) the time would read as 2024-02-07T00:00:00
  • sitting in UTC the time would read as 2024-02-06T23:00:00

When I read the data with PySpark ("spark.sql.session.timeZone" == "Europe/Berlin"), it shows up as "2024-02-07T00:00:00".

When I read the data with PySpark ("spark.sql.session.timeZone" == "UTC"), it shows up as "2024-02-06T23:00:00".

Wonderful. So far I understand it. Now what I do not understand:

When I query the data over the Synapse serverless SQL pool (either directly in Synapse or via an external table from PowerBI), I ALWAYS get the timestamp as "2024-02-06T23:00:00", no matter what my local time setup is.

Why? What is the reason behind this? Now, either:

  1. Any user that retrieves the data from the Synapse serverless SQL pool has to be aware that all timestamps are in UTC and needs to do the timezone conversion themselves (which they might not be able to)
  2. Alternatively: I need to change the underlying data by the timezone offset of my users (e.g. UTC -> CET) so all users in Berlin/Germany have their usual time. But this also means any other user who directly reads from ADLS will read wrong data ("wrong" by the offset of UTC-CET).

For new people who connect to Synapse with PowerBI, it is not trivial to convert the timestamp to the local time as PowerBI will continue to display it in UTC, even if the local computer setup is Europe/Berlin.

Also, anyone else reading from the Synapse SQL pool suddenly has to explicitly convert it from UTC.

But when my users read the data from ADLS, they don't have to do the conversion.

Why would Synapse get rid of the timezone information in the timestamp and deliver it as UTC instead of delivering it as-is with the timezone information?

Example of the query in Synapse, reading from the ADLS:

SELECT
    TOP 100
    id,
    my_timestamp,
    DATEADD(hh, 1, my_timestamp) AS my_modified_timestamp_in_cet  -- only during winter time O.o
FROM
    OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/mydata/**',
        FORMAT = 'PARQUET'
    ) AS [result]
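
(For what it's worth, I assume something like AT TIME ZONE would at least get rid of the hard-coded offset and handle summer time, untested sketch below with a literal value, but it doesn't answer why the offset is dropped in the first place.)

-- Untested idea: let T-SQL apply the CET/CEST rules instead of a fixed +1 hour
DECLARE @utc datetime2 = '2024-02-06T23:00:00';  -- what the serverless pool gives me back

SELECT
    @utc AT TIME ZONE 'UTC'
         AT TIME ZONE 'Central European Standard Time' AS as_cet  -- 2024-02-07 00:00:00.0000000 +01:00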

Accepted answer
  1. phemanth 15,765 Reputation points Microsoft External Staff Moderator
    2024-03-18T09:24:35.16+00:00

    @Victor Seifert

    Thanks for reaching out to Microsoft Q&A.

    You're right, it can be confusing when Synapse treats timestamps differently than PySpark. Here's the breakdown of why you're seeing UTC timestamps in Synapse:

    PySpark Behavior:

    • PySpark lets you configure the session time zone ("spark.sql.session.timeZone"). The stored instant does not change; the setting controls how the timestamp is rendered when you read it.
    • In your case, setting it to "Europe/Berlin" renders the instant in CET and displays it as "2024-02-07T00:00:00".
    • Setting it to "UTC" renders the same instant in UTC and displays it as "2024-02-06T23:00:00".

    Synapse Serverless SQL Behavior:

    • Synapse serverless SQL has no session time-zone setting. It maps the Parquet timestamp to datetime2, a type with no time-zone component, so what you get back is the UTC instant stored in the file. This explains why you always see "2024-02-06T23:00:00" regardless of your local settings.

    The Challenge:

    The issue arises because the common Parquet timestamp types normalize the value to a UTC instant and do not preserve the original +01:00 offset, and serverless SQL returns that instant as datetime2 without any offset. This means you have to handle the conversion back to local time explicitly.
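
    For example, you can make the mapping explicit with a WITH clause; the column comes back as datetime2, which carries no offset at all (the sketch below reuses the placeholder path from your question):

        SELECT
            TOP 10
            [result].my_timestamp  -- datetime2: the UTC instant from the file, no time-zone component
        FROM
            OPENROWSET(
                BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/mydata/**',
                FORMAT = 'PARQUET'
            )
            WITH (my_timestamp datetime2) AS [result]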

    Possible Solutions:

    1. Convert in PySpark: Before storing the data in ADLS, use PySpark to convert the timestamps to a consistent format (e.g., UTC) and store them that way. This ensures everyone accessing the data interprets it consistently.
    2. Convert in Synapse: Use T-SQL functions like AT TIME ZONE within your Synapse queries to convert the UTC timestamps to the desired timezone based on user needs (see the sketch after this list).
    3. Power BI: Power BI does not re-apply your computer's local time zone to datetime columns, so the values stay in UTC on screen. If you connect from Power BI, convert in Power Query (for example with DateTime.AddZone / DateTimeZone.SwitchZone) or push the conversion into the source query.
    4. User Awareness: Make sure your users are aware that timestamps in Synapse are stored in UTC and need conversion if necessary.
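
    As a rough sketch of option 2 (untested against your data, and reusing the placeholder path from your question), AT TIME ZONE can reinterpret the stored UTC value and convert it to Central European time, including daylight saving:

        SELECT
            TOP 100
            id,
            my_timestamp,  -- UTC, as returned by the serverless pool
            my_timestamp AT TIME ZONE 'UTC'
                AT TIME ZONE 'Central European Standard Time' AS my_timestamp_cet  -- datetimeoffset: +01:00 in winter, +02:00 in summer
        FROM
            OPENROWSET(
                BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/mydata/**',
                FORMAT = 'PARQUET'
            ) AS [result]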

    Hope this helps. Do let us know if you have any further queries.

    1 person found this answer helpful.

0 additional answers
