ADF - Json Data schema comparison | Spark OOM error

Question

Anonymous

Hi. I need to load a JSON file (approx. 4 GB) into Cosmos DB. Before loading the file into the DB, I want to compare its data schema (column structure) against a defined data structure. I am trying to use Dataflow to achieve this, but I end up with an OOM exception (complete message below) even when I select the highest core count and Memory optimized compute in the pipeline settings for the Dataflow. I am looking for suggestions for a better way to achieve this or, if Dataflow is the only way, how to overcome the Spark OOM error.

Error code

DF-Executor-OutOfMemoryError

"Job failed due to reason: at Sink 'sink1': Cluster ran into out of memory issue during execution, please retry using an integration runtime with bigger core count and/or memory optimized compute type"

Danilo Ribeiro 0 Reputation points

2023-10-16T17:52:49.96+00:00

https://learn.microsoft.com/en-us/azure/data-factory/data-flow-troubleshoot-guide

https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance

Explanation of the cause and solution of the error.
Bhargava-MSFT 31,361 Reputation points Microsoft Employee Moderator

2023-10-16T19:51:05.5866667+00:00

Thank you, Danilo Ribeiro.

Hello Azharudheen r, Mohamed,

Welcome to the Microsoft Q&A forum.

Please follow the above links to troubleshoot the issue you are facing.

In addition to Danilo's comment:

Is this a new pipeline, or did an existing pipeline suddenly start failing?

Dataflows/Spark divide the data into partitions and transform each partition in a separate process. If a partition holds more data than its process can keep in memory, that process fails with an OOM error.

Can you please try increasing the number of partitions using 'Set partitioning' and see if it helps?
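
To make the partition reasoning concrete, here is a rough back-of-the-envelope sketch in Python. It only divides the on-disk file size by the partition count and assumes the data is spread evenly across partitions, which may not hold if the source file cannot be split on read. The function names and the 128 MiB target are hypothetical, chosen for illustration, and the in-memory size of a partition may differ from its on-disk size.

```python
import math

GIB = 1024 ** 3
MIB = 1024 ** 2


def bytes_per_partition(file_bytes: int, partitions: int) -> float:
    """Average on-disk bytes per partition, assuming an even split."""
    return file_bytes / partitions


def partitions_for_target(file_bytes: int, target_bytes: int) -> int:
    """Smallest partition count that keeps the average at or below target_bytes."""
    return math.ceil(file_bytes / target_bytes)


file_size = 4 * GIB  # the approx. 4 GB file from the question

for n in (16, 30):
    per_part = bytes_per_partition(file_size, n) / MIB
    print(f"{n} partitions -> {per_part:.1f} MiB each")
# 16 partitions -> 256.0 MiB each
# 30 partitions -> 136.5 MiB each

# Hypothetical target of 128 MiB per partition (an assumption, not an ADF setting).
print(partitions_for_target(file_size, 128 * MIB))  # 32
```

If the whole file ends up in a single partition anyway, raising the partition count will not reduce the per-process load, so this estimate is only a starting point.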
Anonymous

2023-10-18T15:22:10.7166667+00:00

Thanks for your response, Danilo & @Bhargava-MSFT. I tried setting up partitioning with 16 and then 30 partitions but ended up with the same error. I tried multiple times. Is there any other suggestion? One more clarification: we can set partitioning at the Dataflow level, but we can define the cluster configuration only in the pipeline settings, correct?

Would you recommend any other way to achieve this data schema comparison within ADF (with or without using Dataflow)?
Bhargava-MSFT 31,361 Reputation points Microsoft Employee Moderator

2023-10-24T15:58:23.6433333+00:00

Hello Azharudheen r, Mohamed,

I am checking to see if you got a chance to look into my above response.
Bhargava-MSFT 31,361 Reputation points Microsoft Employee Moderator

2023-11-08T20:21:01.2166667+00:00

Hello Azharudheen r, Mohamed,

I am checking to see if you need any further assistance here.

Answer 1

Bhargava-MSFT 31,361 Microsoft Employee Moderator

Hello Azharudheen r, Mohamed,

Yes, you are correct. You can set partitioning at the Dataflow level, but you can define the cluster configuration only in the pipeline settings.

As a next step, can you please check whether you can break the 4 GB file into smaller files?

Also, to take advantage of partitioning, the JSON needs to be structured as one document per line.

By the way, Spark has a 2 GB limit on block size, so a single row cannot be larger than 2 GB.
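
As an illustration of the two suggestions above (smaller files, one document per line), here is a minimal Python sketch that could run outside of Dataflow, for example as a pre-processing step. It assumes the file is already in one-document-per-line form, streams it line by line so the whole 4 GB never sits in memory, checks each document's top-level fields against an expected structure, and writes the valid documents into smaller part files. The schema in `EXPECTED_SCHEMA`, the function names, and the part file naming are all hypothetical; this is not an ADF feature, only one possible approach.

```python
import json
from pathlib import Path

# Hypothetical expected structure: top-level field name -> accepted Python type(s).
EXPECTED_SCHEMA = {"id": str, "name": str, "amount": (int, float)}


def schema_errors(doc, expected=EXPECTED_SCHEMA):
    """Return a list of mismatches between one document and the expected schema."""
    if not isinstance(doc, dict):
        return ["document is not a JSON object"]
    errors = []
    for field, types in expected.items():
        if field not in doc:
            errors.append(f"missing field: {field}")
        elif not isinstance(doc[field], types):
            errors.append(f"wrong type for {field}: {type(doc[field]).__name__}")
    for field in sorted(doc.keys() - expected.keys()):
        errors.append(f"unexpected field: {field}")
    return errors


def validate_and_split(src, out_dir, docs_per_file=100_000):
    """Stream a JSON Lines file, validate each line, and write valid docs to part files.

    Returns (number_of_part_files, list_of_(line_number, errors)).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    bad, written, parts, out = [], 0, 0, None
    try:
        with open(src, encoding="utf-8") as f:
            for line_no, line in enumerate(f, start=1):
                if not line.strip():
                    continue  # skip blank lines
                try:
                    doc = json.loads(line)
                except json.JSONDecodeError as exc:
                    bad.append((line_no, [f"invalid JSON: {exc.msg}"]))
                    continue
                errs = schema_errors(doc)
                if errs:
                    bad.append((line_no, errs))
                    continue
                if written % docs_per_file == 0:
                    # Start a new part file every docs_per_file valid documents.
                    if out is not None:
                        out.close()
                    parts += 1
                    out = open(out_dir / f"part-{parts:05d}.jsonl", "w", encoding="utf-8")
                out.write(json.dumps(doc) + "\n")
                written += 1
    finally:
        if out is not None:
            out.close()
    return parts, bad
```

A call such as `validate_and_split("input.jsonl", "parts")` (both paths are placeholders) would return the number of part files written and the list of rejected lines. The list of rejected lines is kept in memory, so for a file with many mismatches it may be better to write them to a log file instead. Only top-level fields are checked here; nested structures would need a recursive comparison.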
