Reading datetime from parquet

Leonid Repno 1 Reputation point
2020-12-01T14:46:36.997+00:00

Hello!

I am having difficulties reading parquet Timestamp into a Synapse SQL.

I am using a straightforward query:

SELECT
    BIRTHDAY
FROM
    OPENROWSET(
        BULK 'https://{account}.dfs.core.windows.net/root/parquet/part-00000-tid-7858592301442936858-0ac91229-4635-4c53-92ed-a5e1be35e33c-0-1-c000.snappy.parquet',
        FORMAT='PARQUET'
    ) AS [result]

And getting the following error:

Error handling external file: 'Inserting value to batch for column type DATETIME2 failed. Invalid argument provided.'. File: 'https://{account}.dfs.core.windows.net/root/parquet/part-00000-tid-7858592301442936858-0ac91229-4635-4c53-92ed-a5e1be35e33c-0-1-c000.snappy.parquet'.

I have checked the data and I didn't see any problems, some records do have Timestamp as of '0001-01-01 00:00:00' but I consider this is acceptable for datetime2 data type.

I am able to read the file without this column or with top 100 including birthday, so it is more or less about the data I think.

Can you provide any suggestions or tips on this question?

Thanks!

EDIT:

The problem persist with some other operations (as of COPY bulk from parquet/ORC to internal table).

following datetime is an issue and raises error:

0001-01-01 02:00:00

Please check on your side

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
4,362 questions
{count} votes

3 answers

Sort by: Most helpful
  1. Joe Edwards 11 Reputation points
    2021-09-15T12:33:09.66+00:00

    I have had a similar issue with Azure Data Factory data flow created parquet files being read via Synapse. In my case it was with a DATE data type, rather than DATETIME2, but I think the same issue could apply.

    If in ADF you insert a date as toDate('0001-01-01'), what actually gets written to the parquet file is the date '0000-12-30'. Since there was no year 0 (we went straight from year 1 BC to year 1 AD) this is not a real date. Whether or not this can be read depends on the tool. ADF data flows will happily read it (as '0000-12-30') but Synapse throws "Inserting value to batch for column type DATE failed".

    If however when writing it you use toDate('0001-01-03'), this gets written to the parquet file as '0001-01-01' and Synapse will read it just fine.

    The reason for this error is that ADF is not correctly handling the transition from the Julian to the Gregorian calender from 1582 onwards (https://en.wikipedia.org/wiki/Gregorian_calendar). In my opinion this is a bug. The most logical way to store a dateless time is 0001-01-01 HH:mm:ss, so support for this would be greatly appreciated. You will find that toDate('1582-01-01') -> '1582-01-11', but toDate('1583-01-01') -> '1583-01-01'

    Joe

    2 people found this answer helpful.
    0 comments No comments

  2. HimanshuSinha-msft 19,376 Reputation points Microsoft Employee
    2020-12-08T01:23:04.82+00:00

    Hello ,
    I am assuming as well that some date is not in correct format, casting it as varchar will help us see what date it is , below query shoul help.

    SELECT
    BIRTHDAY, count(*) as c
    FROM
    OPENROWSET(
    BULK 'https://{account}.dfs.core.windows.net/root/parquet/part-00000-tid-7858592301442936858-0ac91229-4635-4c53-92ed-a5e1be35e33c-0-1-c000.snappy.parquet',
    FORMAT='PARQUET'
    ) WITH (
    BIRTHDAY varchar(30)
    )AS [result]
    GROUP BY
    BIRTHDAY

    Thanks
    Himanshu


  3. GB 6 Reputation points
    2022-05-11T11:05:38.33+00:00

    Thanks so much @Joe Edwards for the workaround.
    Is there a Synapse step that would help identify parquet data issues that have yet to be found or a blog site for such?
    How did you interrogate the snappy.parquet file outside of Synapse to be able to show this 201032-image.png?
    Any tool recommendations for snappy.parquet schema viewing?
    Al the best , Grayham

    0 comments No comments