Can't parse JSON files stored as octet-stream

wayne 1

I am storing data in Azure by using the data reader GetStream method against SQL Server. This lets me upload the stream to Azure Blob Storage from C# without running into memory issues. The file is a gzipped JSON file formatted as a list of JSON objects, and I can read it through Databricks. I am now trying to move this work off a Databricks cluster so that I can transform the files into Parquet with Data Factory. The issue is that the Data Factory file parsers do not seem to like JSON files written as an octet-stream. It throws a "Malformed records are detected in schema inference. Parse Mode:" error message.

How can I get Data Factory to read these files? There seems to be a difference between how Databricks and Data Factory read them, even though both have Spark backends.

Saurabh Sharma 23,791 Reputation points Microsoft Employee

2021-09-21T18:11:32.827+00:00

Hi @wayne ,

Are you using a Data Flow in Data Factory to load the JSON file? What is the encoding of the JSON file: UTF-8 or UTF-8 BOM? You could get this error if your JSON file is encoded as UTF-8 BOM, because UTF-8 BOM is not supported in Data Flow. The special characters EF BB BF (in hex) that the UTF-8 BOM encoding places at the start of the JSON file could lead to this error.
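One way to check for those three bytes locally is sketched below in Python. It is only an illustration: the function names `has_utf8_bom` and `strip_utf8_bom` and the file paths are hypothetical, and it assumes the blob has been downloaded as a gzip file. It looks at the first three decompressed bytes and, if needed, writes a copy without the BOM.

```python
import gzip
import shutil

UTF8_BOM = b"\xef\xbb\xbf"  # the EF BB BF marker mentioned above


def has_utf8_bom(gz_path):
    """Return True if the decompressed payload starts with EF BB BF."""
    with gzip.open(gz_path, "rb") as src:
        return src.read(3) == UTF8_BOM


def strip_utf8_bom(src_path, dst_path):
    """Copy a gzipped file to dst_path, dropping a leading UTF-8 BOM if present."""
    with gzip.open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        head = src.read(3)
        if head != UTF8_BOM:
            dst.write(head)  # not a BOM, so keep these bytes
        shutil.copyfileobj(src, dst)  # stream the remainder in chunks
```

Because the copy is streamed, the whole file never has to fit in memory, which matches the streaming approach described in the question.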

Thanks
Saurabh
Saurabh Sharma 23,791 Reputation points Microsoft Employee

2021-09-24T18:20:20.96+00:00

Hi @wayne ,
We haven't heard back from you. Just wanted to check whether you are still facing the issue. If you have already found a solution, would you please share it here with the community? Otherwise, let us know and we will continue to engage with you on the issue.

Thanks
Saurabh
Wayne Theron 1 Reputation point

2021-09-26T06:52:42.157+00:00

Hi,

Sorry for the late response. I am unsure what format it is, because there is nowhere to specify the format. It is a query issued to SQL Server that wraps a FOR JSON query in a SQL COMPRESS call. The MSSQL .NET client issues this query, and the reader calls GetStream. If this process creates a UTF-8 BOM file, then can I suggest that support for it be added? This is a very good way to extract large amounts of data from SQL Server in a compressed format, using streams instead of a data table that consumes memory. The Azure Blob client accepts that stream and uploads the file to a blob. The only thing I noticed when opening the JSON file after decompressing it is that the file is not in text format; it is an octet stream, which might be the BOM format you are referring to. It does seem odd that there is no support for that format, because extracting compressed JSON data from SQL Server instead of streaming uncompressed data makes a lot of sense.

In any case, I decided the better path was to produce Parquet files locally instead and move those to Azure, saving the computation cost of converting them in the cloud.
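For anyone who wants to check what the SQL Server export described above actually contains, here is a rough Python sketch. It rests on an assumption that nobody in this thread confirmed: FOR JSON returns nvarchar text, and COMPRESS applied to nvarchar input gzips the UTF-16 bytes, so the decompressed payload may be UTF-16 LE rather than UTF-8. Note also that application/octet-stream is a blob content-type label and says nothing about text encoding. The names `guess_encoding` and `reencode_as_utf8` are hypothetical, and the zero-byte check is a heuristic that only works when the text begins with ASCII characters such as `[` or `{`.

```python
import codecs
import gzip


def guess_encoding(head):
    """Guess the text encoding from the first bytes of a decompressed payload."""
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # UTF-8 with BOM; this codec drops the BOM on read
    if head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"  # BOM present; this codec picks the byte order from it
    # Heuristic: ASCII text stored as UTF-16 LE has a zero byte after each character
    if len(head) >= 4 and head[1] == 0 and head[3] == 0:
        return "utf-16-le"
    return "utf-8"


def reencode_as_utf8(src_path, dst_path, chunk_chars=1 << 20):
    """Rewrite a gzipped text file as gzipped UTF-8 without BOM; return the guess."""
    with gzip.open(src_path, "rb") as probe:
        encoding = guess_encoding(probe.read(4))
    with gzip.open(src_path, "rt", encoding=encoding, newline="") as src, \
            gzip.open(dst_path, "wt", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)  # read a bounded number of characters
            if not chunk:
                break
            dst.write(chunk)
    return encoding
```

If the guess comes back as anything other than `utf-8`, the rewritten file is what a UTF-8 JSON parser would expect to see; whether that resolves the Data Factory error is untested here.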
Saurabh Sharma 23,791 Reputation points Microsoft Employee

2021-10-18T18:55:33.193+00:00

Hi @wayne ,

I suggest that you open a support ticket so this can be looked into. If you have any limitations in opening a support ticket, please let me know.

Thanks
Saurabh
