question

LeonidRepno-4744 avatar image
0 Votes"
LeonidRepno-4744 asked GB-4428 edited

Reading datetime from parquet

Hello!

I am having difficulties reading parquet Timestamp into a Synapse SQL.

I am using a straightforward query:

 SELECT
     BIRTHDAY
 FROM
     OPENROWSET(
         BULK 'https://{account}.dfs.core.windows.net/root/parquet/part-00000-tid-7858592301442936858-0ac91229-4635-4c53-92ed-a5e1be35e33c-0-1-c000.snappy.parquet',
         FORMAT='PARQUET'
     ) AS [result]

And getting the following error:

Error handling external file: 'Inserting value to batch for column type DATETIME2 failed. Invalid argument provided.'. File: 'https://{account}.dfs.core.windows.net/root/parquet/part-00000-tid-7858592301442936858-0ac91229-4635-4c53-92ed-a5e1be35e33c-0-1-c000.snappy.parquet'.

I have checked the data and I didn't see any problems, some records do have Timestamp as of '0001-01-01 00:00:00' but I consider this is acceptable for datetime2 data type.


I am able to read the file without this column or with top 100 including birthday, so it is more or less about the data I think.

Can you provide any suggestions or tips on this question?

Thanks!


EDIT:

The problem persist with some other operations (as of COPY bulk from parquet/ORC to internal table).

following datetime is an issue and raises error:

 0001-01-01 02:00:00

Please check on your side

azure-synapse-analytics
· 2
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Hello @LeonidRepno-4744 ,
Thanks for the ask and I did tried to repro the issue and it was no repro for me .

Created a table with dummy data and then generated the parquet file .

create table footest100
(
_datetime datetime
,_datetime2 datetime2
,name varchar(100)

)
INSERT INTO footest100(_datetime,_datetime2,name) VALUES (GETDATE(),GETDATE(),'Himanshu')
select * from footest100

Run the below command and it just worked fine .
44213-paraquetissue.png

In you case the error is realted to INSERT , is there more part of the script ? if yes can you please share that ?
One more thing you mentioned that top 100 records works fine , can you please lets us know how the data looks on this records .

Thanks
Himanshu


0 Votes 0 ·
paraquetissue.png (32.2 KiB)

Hello Himanshu!

Thank you for your response.

I believe this is a parquet-datetime issue but I still have no proofs to that.
On your example, you might not have some non-standard datetimes that I have, here is the sample of the data I have in my parquet file:

44233-capture.png

In my understanding, I have top 100 query working well because these 100 records have "appropriate" datetimes, but when I am trying to read the whole file I get 0 records because of this error.


0 Votes 0 ·
capture.png (19.9 KiB)
HimanshuSinha-MSFT avatar image
0 Votes"
HimanshuSinha-MSFT answered MartinJaffer-MSFT commented

Hello ,
I am assuming as well that some date is not in correct format, casting it as varchar will help us see what date it is , below query shoul help.

SELECT
BIRTHDAY, count(*) as c
FROM
OPENROWSET(
BULK 'https://{account}.dfs.core.windows.net/root/parquet/part-00000-tid-7858592301442936858-0ac91229-4635-4c53-92ed-a5e1be35e33c-0-1-c000.snappy.parquet',
FORMAT='PARQUET'
) WITH (
BIRTHDAY varchar(30)
)AS [result]
GROUP BY
BIRTHDAY

Thanks
Himanshu


· 6
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Hello ,
We have not heard back from you on this and was just following up if you can eloborate more on the ask as requested before .
Incase if you have resolution , request you to share the same here , so that other community members can benefit from that .
Thanks
Himanshu

0 Votes 0 ·

Hello Himanshu!

Thank you for still being with me at this point.
But your query returns all BIRTHDAY as NULL.

I 46380-capture2.png


0 Votes 0 ·
capture2.png (31.7 KiB)

My sincere apoloziges for the delay in reply from my side .
I must accept that I am running out of ideas here at this point I
I did saw your edit and I think there is some more to it .
I am able to query the date 0001-01-01 02:00:00.
49284-12-17-2020.png


0 Votes 0 ·
12-17-2020.png (32.4 KiB)

Hello ,
We have not heard back from you on this and was just following up .
Incase if you have resolution , request you to share the same here , so that other community members can benefit from that .
Thanks
Himanshu

0 Votes 0 ·
Show more comments
JoeEdwards-2965 avatar image
2 Votes"
JoeEdwards-2965 answered JoeEdwards-2965 published

I have had a similar issue with Azure Data Factory data flow created parquet files being read via Synapse. In my case it was with a DATE data type, rather than DATETIME2, but I think the same issue could apply.

If in ADF you insert a date as toDate('0001-01-01'), what actually gets written to the parquet file is the date '0000-12-30'. Since there was no year 0 (we went straight from year 1 BC to year 1 AD) this is not a real date. Whether or not this can be read depends on the tool. ADF data flows will happily read it (as '0000-12-30') but Synapse throws "Inserting value to batch for column type DATE failed".

If however when writing it you use toDate('0001-01-03'), this gets written to the parquet file as '0001-01-01' and Synapse will read it just fine.

The reason for this error is that ADF is not correctly handling the transition from the Julian to the Gregorian calender from 1582 onwards (https://en.wikipedia.org/wiki/Gregorian_calendar). In my opinion this is a bug. The most logical way to store a dateless time is 0001-01-01 HH:mm:ss, so support for this would be greatly appreciated. You will find that toDate('1582-01-01') -> '1582-01-11', but toDate('1583-01-01') -> '1583-01-01'

Joe




5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

GB-4428 avatar image
0 Votes"
GB-4428 answered GB-4428 edited

Thanks so much @JoeEdwards-2965 for the workaround.
Is there a Synapse step that would help identify parquet data issues that have yet to be found or a blog site for such?
How did you interrogate the snappy.parquet file outside of Synapse to be able to show this 201032-image.png?
Any tool recommendations for snappy.parquet schema viewing?
Al the best , Grayham



image.png (4.2 KiB)
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.