ADF - Json Data schema comparison | Spark OOM error

Anonymous
2023-10-16T17:30:32.2866667+00:00

Hi. I need to load a JSON file (approx. 4 GB) into Cosmos DB. Before loading the file, I want to compare its data schema (column structure) against a defined data structure. I am trying to do this with a Data Flow, but I hit an OOM exception (complete message below) even after selecting the highest core count and Memory Optimized compute in the pipeline settings for the Data Flow. Is there a better way to achieve this? Or, if Data Flow is the only way, how can I overcome the Spark OOM error?

Error code

DF-Executor-OutOfMemoryError

"Job failed due to reason: at Sink 'sink1': Cluster ran into out of memory issue during execution, please retry using an integration runtime with bigger core count and/or memory optimized compute type"

Azure Data Factory

An Azure service for ingesting, preparing, and transforming data at scale.


1 answer

  1. Bhargava-MSFT 31,361 Reputation points Microsoft Employee Moderator
    2023-10-18T16:56:37.21+00:00

    Hello Azharudheen r, Mohamed,

    Yes, you are correct. Partitioning is set at the Data Flow level, whereas the cluster configuration can be defined only in the pipeline settings.

    As a next step, can you please check whether you can break the 4 GB file down into smaller files?

    Also, to take advantage of partitioning, the JSON needs to be structured as one document per line.

    By the way, Spark has a 2 GB limit on block size, so a single row cannot be larger than 2 GB.
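As a rough illustration of the two suggestions above (smaller files, one document per line), here is a minimal Python sketch that runs outside ADF, for example on a VM or in a notebook. It assumes the source file is already in one-document-per-line form. The field names in `EXPECTED_SCHEMA`, the function names, and the `part-00000.jsonl` naming are all hypothetical and only for illustration; replace them with your own defined structure. It streams the file line by line, so memory use stays small regardless of the file size.

```python
import json
from pathlib import Path

# Hypothetical expected schema: field name -> accepted Python type(s).
EXPECTED_SCHEMA = {"id": str, "name": str, "amount": (int, float)}


def schema_errors(doc, expected=EXPECTED_SCHEMA):
    """Return a sorted list of mismatches between one document and the schema."""
    if not isinstance(doc, dict):
        return ["document is not a JSON object"]
    errors = []
    for field, types in expected.items():
        if field not in doc:
            errors.append(f"missing field: {field}")
        elif not isinstance(doc[field], types):
            errors.append(f"wrong type for field: {field}")
    for field in doc.keys() - expected.keys():
        errors.append(f"unexpected field: {field}")
    return sorted(errors)


def split_and_check(src, out_dir, lines_per_file=100_000):
    """Stream a one-document-per-line JSON file, validate every line, and
    write the valid lines into numbered chunk files.

    Returns (chunk_paths, bad_lines), where bad_lines is a list of
    (line_number, errors) tuples for lines that failed the check.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths, bad_lines = [], []
    out, written = None, 0
    try:
        with open(src, encoding="utf-8") as f:
            for line_no, line in enumerate(f, start=1):
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                try:
                    errors = schema_errors(json.loads(line))
                except json.JSONDecodeError:
                    errors = ["invalid JSON"]
                if errors:
                    bad_lines.append((line_no, errors))
                    continue
                # Start a new chunk file when none is open or the current one is full.
                if out is None or written == lines_per_file:
                    if out is not None:
                        out.close()
                    path = out_dir / f"part-{len(chunk_paths):05d}.jsonl"
                    chunk_paths.append(path)
                    out = open(path, "w", encoding="utf-8")
                    written = 0
                out.write(line + "\n")
                written += 1
    finally:
        if out is not None:
            out.close()
    return chunk_paths, bad_lines
```

Calling `split_and_check("input.jsonl", "chunks")` would write the valid documents into the `chunks` folder and return the line numbers that did not match, so the mismatches can be reviewed before anything is loaded into Cosmos DB. The resulting smaller files could then be read by the Data Flow with partitioning enabled.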
