Reading datetime from parquet

Question

Hello!

I am having difficulties reading parquet Timestamp into a Synapse SQL.

I am using a straightforward query:

SELECT
    BIRTHDAY
FROM
    OPENROWSET(
        BULK 'https://{account}.dfs.core.windows.net/root/parquet/part-00000-tid-7858592301442936858-0ac91229-4635-4c53-92ed-a5e1be35e33c-0-1-c000.snappy.parquet',
        FORMAT='PARQUET'
    ) AS [result]

And getting the following error:

Error handling external file: 'Inserting value to batch for column type DATETIME2 failed. Invalid argument provided.'. File: 'https://{account}.dfs.core.windows.net/root/parquet/part-00000-tid-7858592301442936858-0ac91229-4635-4c53-92ed-a5e1be35e33c-0-1-c000.snappy.parquet'.

I have checked the data and I didn't see any problems, some records do have Timestamp as of '0001-01-01 00:00:00' but I consider this is acceptable for datetime2 data type.

I am able to read the file without this column or with top 100 including birthday, so it is more or less about the data I think.

Can you provide any suggestions or tips on this question?

Thanks!

EDIT:

The problem persist with some other operations (as of COPY bulk from parquet/ORC to internal table).

following datetime is an issue and raises error:

0001-01-01 02:00:00

Please check on your side

Answer

I have had a similar issue with Azure Data Factory data flow created parquet files being read via Synapse. In my case it was with a DATE data type, rather than DATETIME2, but I think the same issue could apply.

If in ADF you insert a date as toDate('0001-01-01'), what actually gets written to the parquet file is the date '0000-12-30'. Since there was no year 0 (we went straight from year 1 BC to year 1 AD) this is not a real date. Whether or not this can be read depends on the tool. ADF data flows will happily read it (as '0000-12-30') but Synapse throws "Inserting value to batch for column type DATE failed".

If however when writing it you use toDate('0001-01-03'), this gets written to the parquet file as '0001-01-01' and Synapse will read it just fine.

The reason for this error is that ADF is not correctly handling the transition from the Julian to the Gregorian calender from 1582 onwards (https://en.wikipedia.org/wiki/Gregorian_calendar). In my opinion this is a bug. The most logical way to store a dateless time is 0001-01-01 HH:mm:ss, so support for this would be greatly appreciated. You will find that toDate('1582-01-01') -> '1582-01-11', but toDate('1583-01-01') -> '1583-01-01'

Joe

Answer

Hello ,
I am assuming as well that some date is not in correct format, casting it as varchar will help us see what date it is , below query shoul help.

SELECT
BIRTHDAY, count(*) as c
FROM
OPENROWSET(
BULK 'https://{account}.dfs.core.windows.net/root/parquet/part-00000-tid-7858592301442936858-0ac91229-4635-4c53-92ed-a5e1be35e33c-0-1-c000.snappy.parquet',
FORMAT='PARQUET'
) WITH (
BIRTHDAY varchar(30)
)AS [result]
GROUP BY
BIRTHDAY

Thanks
Himanshu

Answer

Thanks so much @Joe Edwards for the workaround.
Is there a Synapse step that would help identify parquet data issues that have yet to be found or a blog site for such?
How did you interrogate the snappy.parquet file outside of Synapse to be able to show this ?
Any tool recommendations for snappy.parquet schema viewing?
Al the best , Grayham

Reading datetime from parquet

3 answers