Unable to process Excel file through Databricks

Chayan Upadhyay 96 Reputation points
2021-08-06T10:10:39.03+00:00

Hello Experts,

I am trying to read an Excel file (247 MB, 549,628 rows) in Databricks, convert it to a Parquet file, and write it to ADLS Gen1 (already mounted), but I get the below error even while reading the file:
"The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached."

Below is the code:

import org.apache.spark._

val Data = spark.read.format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/mnt/adls/folder/file.xlsx")

Below is the configuration of the cluster:

121211-image.png

I have also tried a higher cluster configuration:

121170-image.png

But I am still getting this issue, and if I check the event log I always see the below details:

121194-image.png

If I break the same file into smaller files of around 100k rows each, then I am able to process them.
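For reference, that split-file workaround can be sketched like this (the part-file names below are hypothetical; adjust them to however the workbook was actually split):

```scala
// Hypothetical names for the split workbooks (~100k rows each).
val parts = Seq(
  "/mnt/adls/folder/file_part1.xlsx",
  "/mnt/adls/folder/file_part2.xlsx"
)

// Read each smaller workbook separately and union the results,
// so no single Excel read has to hold all 549k rows at once.
val Data = parts
  .map { path =>
    spark.read.format("com.crealytics.spark.excel")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(path)
  }
  .reduce(_ union _)
```

This requires a running Databricks/Spark cluster with the spark-excel library attached, so it is only a sketch of the approach, not a standalone program.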

Please let me know if I am not choosing an appropriate configuration, but it seems odd, since Databricks should scale the processing and be able to handle a 247 MB file even with a basic configuration.

Appreciate your time and effort, Thanks !!

Azure Databricks

Accepted answer
  1. Chayan Upadhyay 96 Reputation points
    2021-08-09T12:48:02.587+00:00

I was finally able to process the file with just a single-node configuration; I needed to add the maxRowsInMemory option to the code:

    val Data = spark.read.format("com.crealytics.spark.excel")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("maxRowsInMemory", 10) // use the streaming reader, keeping only 10 rows in memory at a time
      .load("/mnt/adls/folder/file.xlsx")

    Reference: https://stackoverflow.com/questions/50789369/construct-a-dataframe-from-excel-using-scala

    There are many other optional parameters which might be helpful in other use cases.
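    For example, a fuller read plus the Parquet write from the original question might look like the sketch below. The sheet name and output path are assumptions; the options shown (dataAddress, treatEmptyValuesAsNulls, maxRowsInMemory) come from the spark-excel option list.

    ```scala
    val Data = spark.read.format("com.crealytics.spark.excel")
      .option("dataAddress", "'Sheet1'!A1")      // which sheet/cell to start reading from (assumed sheet name)
      .option("header", "true")
      .option("inferSchema", "true")
      .option("treatEmptyValuesAsNulls", "true") // map empty cells to nulls
      .option("maxRowsInMemory", 10)             // stream rows instead of loading the whole sheet
      .load("/mnt/adls/folder/file.xlsx")

    // Write the result as Parquet to the mounted ADLS folder (output path is assumed).
    Data.write.mode("overwrite").parquet("/mnt/adls/folder/output_parquet")
    ```

    Note that this needs a Spark cluster with the spark-excel library installed, so treat it as an illustrative sketch rather than a tested program.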


1 additional answer

  1. ShaikMaheer-MSFT 38,546 Reputation points Microsoft Employee Moderator
    2021-08-09T05:50:52.337+00:00

    Hi @Chayan Upadhyay ,

    Welcome to Microsoft Q&A Platform, and thank you for posting your query here.

    Could you please try the below and see if that works:
    spark.catalog.clearCache()

    Also, please go through the below link, which covers this error:
    https://kb.databricks.com/jobs/driver-unavailable.html

    To understand what GC (garbage collection) is, please check this answer:
    https://forums.databricks.com/questions/14725/how-to-resolve-spark-full-gc-on-cluster-startup.html

    Hope this will help. Please let us know if any further queries.

