"Malformed records are detected in schema inference" error when parsing JSON

maneesh.d 101 Reputation points
2021-11-18T18:21:17.203+00:00

I am facing an error while trying to parse a JSON file from Blob Storage using a dataset within ADF.
The JSON file is a list of objects, and I have selected the appropriate document form option in the data flow source; I have even tried all three available options, yet I don't know the reason behind the issue.
I have also tried changing the encoding, but that resulted in a change in the data.

The JSON file is created by merging several JSON files, each containing one record, using a copy activity as shown below:

[screenshot: copy activity configuration]

I have verified the merged output JSON file and it looks correct; I also checked its structure with an online JSON viewer, and everything looks good.

But when this file is read in a data flow, as shown in the snips below, it fails with the following error:

{"StatusCode":"DF-JSON-WrongDocumentForm","Message":"Job failed due to reason: Malformed records are detected in schema inference. Parse Mode: FAILFAST","Details":"org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, 10.90.0.5, executor 0): org.apache.spark.SparkException: Malformed records are detected in schema inference. Parse Mode: FAILFAST.\n\tat org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1.apply(JsonInferSchema.scala:66)\n\tat org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1.apply(JsonInferSchema.scala:53)\n\tat scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)\n\tat scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)\n\tat scala.collection.Iterator$class.isEmpty(Iterator.scala:331)\n\tat scala.collection.AbstractIterator.isEmpty(Iterator.scala:1334)\n\tat scala.collection.TraversableOnce$class.reduceLeftOption(TraversableOnce.scala:203)\n\tat scala.collection.AbstractIterator.reduceLeftOption(Iterator.scala:1334)\n\tat scala.collection.TraversableOnce$class.reduceOption(TraversableOnce.scal"}  

[screenshot: data flow source configuration]

[screenshot: data flow source data preview showing the error]
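
For reference, the same kind of strict parsing can be checked on a local copy of the merged file with Python's json module (a minimal sketch; "merged.json" is a placeholder for a downloaded copy of the blob):

```python
import json

# Read the raw bytes of a local copy of the merged file
# ("merged.json" is a placeholder path).
with open("merged.json", "rb") as f:
    raw = f.read()

# Decode as plain UTF-8 and parse strictly -- loosely what the data flow's
# FAILFAST mode does: the first malformed record aborts the whole job.
try:
    records = json.loads(raw.decode("utf-8"))
    print(f"parsed OK: {len(records)} records")
except (UnicodeDecodeError, json.JSONDecodeError) as e:
    print("parse failed:", e)
```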


Accepted answer
  1. maneesh.d 101 Reputation points
    2021-11-19T05:23:41.067+00:00

    The cause of the issue is that, by default, the copy activity writes the JSON file to the blob as UTF-8 with BOM, and although the ADF data flow also reads the file as UTF-8 by default, it does not handle the BOM character and raises this error.
    So when creating the file in blob storage, specify the encoding as "UTF-8 without BOM" on the sink dataset (in my case), and it worked.
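
    The behavior can be reproduced outside ADF with Python's json module, using the "utf-8-sig" codec as a stand-in for "UTF-8 with BOM" (a minimal sketch; the file names are illustrative):

    ```python
    import json

    doc = '[{"id": 1}, {"id": 2}]'

    # "utf-8-sig" prepends a BOM, like the copy activity's default sink encoding.
    with open("with_bom.json", "w", encoding="utf-8-sig") as f:
        f.write(doc)

    raw = open("with_bom.json", "rb").read()
    print(raw[:3])  # b'\xef\xbb\xbf' -- the three UTF-8 BOM bytes

    # A reader that assumes plain UTF-8 sees U+FEFF before the '[' and fails,
    # which mirrors the data flow error above.
    try:
        json.loads(raw.decode("utf-8"))
    except json.JSONDecodeError as e:
        print("strict parse failed:", e)

    # Plain "utf-8" writes no BOM -- the equivalent of "UTF-8 without BOM".
    with open("no_bom.json", "w", encoding="utf-8") as f:
        f.write(doc)
    print(json.loads(open("no_bom.json", "rb").read().decode("utf-8")))
    ```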

    Note: For some reason, a dataset with "UTF-8 without BOM" encoding can't be used in a data flow. In that case, create two datasets: one with the default UTF-8 encoding (to use in the data flow) and one with "UTF-8 without BOM" (to use in the copy activity sink when creating the file).

    Thank you.

    1 person found this answer helpful.

1 additional answer

  1. Nishil Jain 1 Reputation point
    2021-11-18T22:23:45.733+00:00

    The problem is in the JSON file being read by the data flow.

    Try uploading the raw, unmodified JSON file to blob storage manually and use the "Data preview" option on the source in the data flow.

    I got the same error after I made a minor modification to the JSON file in a text editor and uploaded it back to blob storage.
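
    A byte-level comparison of the two files shows what the editor changed, for example whether it added or dropped a BOM or switched line endings (a minimal sketch; both file names are placeholders):

    ```python
    # Compare the original file with the hand-edited copy at the byte level
    # ("original.json" and "edited.json" are placeholder names).
    raw = open("original.json", "rb").read()
    edited = open("edited.json", "rb").read()

    print("identical bytes:", raw == edited)
    print("original has UTF-8 BOM:", raw[:3] == b"\xef\xbb\xbf")
    print("edited   has UTF-8 BOM:", edited[:3] == b"\xef\xbb\xbf")
    print("original uses CRLF:", b"\r\n" in raw)
    print("edited   uses CRLF:", b"\r\n" in edited)
    ```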