Data Flow not able to read API response JSON file stored in Blob Storage

Kunal Kumar Sinha 171 Reputation points
2020-12-04T05:50:13.463+00:00

Data Flow is not able to read an API response JSON file stored in Blob Storage. If the same file is manually placed in the same location it works fine, but for the JSON file written from the API response, Data Flow says the file is corrupted. In the dataset I'm able to preview the file, but in the Data Flow it doesn't work.

Azure Data Factory

Accepted answer
  1. MartinJaffer-MSFT 26,061 Reputation points
    2020-12-21T19:55:23.417+00:00

    @Kunal Kumar Sinha I remember you originally mentioned a corrupted file; are you sure it wasn't a corrupt record?

    If, in the Data Flow source data preview, you see a column named _corrupt_record like this...
    50123-image.png
    then you need to go to the source options and change the JSON setting from Single document to Document per line.
    50009-image.png
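
    If you want to double-check which form the file is in before changing the setting, a small local check like the sketch below can help. This is only an illustration, not anything built into ADF, and the file path is a placeholder.

    ```python
    import json

    def classify_json_file(path):
        """Guess whether a file is one JSON document or one document per line (JSON Lines)."""
        with open(path, encoding="utf-8") as f:
            text = f.read()

        try:
            # A file that parses as a whole is a single document.
            json.loads(text)
            return "single document -> use 'Single document' in the Data Flow source options"
        except json.JSONDecodeError:
            pass

        lines = [line for line in text.splitlines() if line.strip()]
        try:
            # If every non-empty line parses on its own, it is document-per-line.
            for line in lines:
                json.loads(line)
            return "document per line -> use 'Document per line' in the Data Flow source options"
        except json.JSONDecodeError:
            return "neither form parses - the file content itself may be malformed"

    print(classify_json_file("api_response.json"))  # placeholder path
    ```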


3 additional answers

  1. MartinJaffer-MSFT 26,061 Reputation points
    2020-12-08T00:30:09.667+00:00

    Hello @Kunal Kumar Sinha and thank you for bringing this to our attention.

    If I understand correctly, you have the same file/data written to blob storage by two different mechanisms, after which Mapping Data Flow cannot read one of them.
    In the happy case, you made a call to some API and then manually uploaded the blob.
    In the other case, some service called the same API, got the same response, and wrote it to a blob.

    First, can you compare the file sizes or MD5s, or do a diff on them to find the difference?
    Also share a few properties of the blobs, specifically the blob type and content-type (a quick way to pull those is sketched at the end of this answer).

    Second, can you share what service is writing the bad blob?

    45962-image.png
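
    For the first point, the relevant properties can be pulled for both blobs with something like the rough sketch below (azure-storage-blob Python SDK; the connection string, container, and blob names are placeholders you would replace).

    ```python
    from azure.storage.blob import BlobClient

    CONN_STR = "<storage-account-connection-string>"  # placeholder
    CONTAINER = "<container-name>"                    # placeholder

    def describe(blob_name):
        """Fetch the blob properties relevant to this comparison."""
        blob = BlobClient.from_connection_string(CONN_STR, CONTAINER, blob_name)
        props = blob.get_blob_properties()
        return {
            "name": blob_name,
            "blob_type": props.blob_type,
            "content_type": props.content_settings.content_type,
            "content_md5": props.content_settings.content_md5,
            "size_bytes": props.size,
        }

    # Placeholder blob names: one written by the service, one uploaded manually.
    print(describe("api-response.json"))
    print(describe("manual-upload.json"))
    ```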


  2. MartinJaffer-MSFT 26,061 Reputation points
    2020-12-09T21:01:57.683+00:00

    @Kunal Kumar Sinha
    As a test, I located a blob whose content-type is application/octet-stream and whose text appears to be JSON, then created a new data flow and dataset pointing to it. No schema was imported into either the dataset or the data flow; all other settings were left at their defaults.
    The data was taken from the public source reqres.in.
    46713-image.png
    46628-image.png
    46600-image.png

    As you can see in the above image, I was able to preview the data in the source transformation in Data Flow.
    This suggests the cause is not the content-type, but rather an issue in your data. Please visually inspect your data.
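
    If you want to reproduce this test, a rough outline is below (Python with requests and azure-storage-blob; the connection string, container, blob name, and the exact reqres.in endpoint are only examples).

    ```python
    import requests
    from azure.storage.blob import BlobClient, ContentSettings

    CONN_STR = "<storage-account-connection-string>"  # placeholder
    CONTAINER = "<container-name>"                    # placeholder

    # Pull some sample JSON from the public reqres.in API.
    payload = requests.get("https://reqres.in/api/users?page=1").text

    # Upload it with content-type application/octet-stream, mirroring the test above.
    blob = BlobClient.from_connection_string(CONN_STR, CONTAINER, "reqres-users.json")
    blob.upload_blob(
        payload,
        overwrite=True,
        content_settings=ContentSettings(content_type="application/octet-stream"),
    )
    ```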


  3. MartinJaffer-MSFT 26,061 Reputation points
    2020-12-21T20:45:24.01+00:00

    In the Data Flow preview, if you get a message saying the results have incompatible types, like in the image below:
    50132-image.png

    I have determined that to be rooted in how Data Flow guesses the schema. It looked at the first X records, found the field was an empty array in all of them, and so guessed it was an array of strings. There are a few ways to fix this. The quickest is to go to the source settings and enable Infer drifted column types, then go to Projection and import the projection again. You can also disable Validate schema in the source settings.

    Another, more rigorous way to fix it is to go to the dataset and import the schema from a file. The file should be an example that has all fields filled out, with no incomplete objects or empty arrays.
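
    If you are not sure which fields are causing the bad guess, a small check like the sketch below (plain Python; the file path and record count are placeholders) lists fields that are an empty array in every one of the first N records of a document-per-line file - those are the ones likely to be inferred as an array of strings.

    ```python
    import json

    def empty_array_fields(path, first_n=100):
        """Fields that are an empty array in every one of the first N JSON-lines records."""
        suspects = None
        checked = 0
        with open(path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                record = json.loads(line)
                empties = {key for key, value in record.items() if value == []}
                suspects = empties if suspects is None else suspects & empties
                checked += 1
                if checked >= first_n:
                    break
        return sorted(suspects or [])

    print(empty_array_fields("api_response.json"))  # placeholder path
    ```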
