Unable to process Excel file through Databricks

Chayan Upadhyay 96 Reputation points
2021-08-06T10:10:39.03+00:00

Hello Experts,

I am trying to read an Excel file (247 MB, 549,628 rows) in Databricks, convert it to a Parquet file, and write it to ADLS Gen1 (already mounted), but I get the below error even while reading the file:
"The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached."

Below is the code:

import org.apache.spark._

val Data = spark.read.format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/mnt/adls/folder/file.xlsx")

Below is the configuration of the cluster:

121211-image.png

I have also tried a higher cluster configuration:

121170-image.png

But I am still getting this issue, and if I check the event log I always see the below details:

121194-image.png

If I break the same file into smaller files of around 100k rows each, then I am able to process them.
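For reference, that split-file workaround can be sketched like this (the part-file names below are hypothetical; adjust them to however the workbook was actually split):

```scala
// Hypothetical names for the split workbooks (~100k rows each).
val parts = Seq(
  "/mnt/adls/folder/file_part1.xlsx",
  "/mnt/adls/folder/file_part2.xlsx"
)

// Read each smaller workbook separately and union the results,
// so no single Excel read has to hold all 549k rows at once.
val Data = parts
  .map { path =>
    spark.read.format("com.crealytics.spark.excel")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(path)
  }
  .reduce(_ union _)
```

This requires a running Databricks/Spark cluster with the spark-excel library attached, so it is only a sketch of the approach, not a standalone program.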

Please let me know if I am not choosing an appropriate configuration, but it seems odd, since Databricks should scale the processing and be able to handle a 247 MB file even with a basic configuration.

Appreciate your time and effort, Thanks !!

Azure Databricks

Accepted answer
  1. Chayan Upadhyay 96 Reputation points
    2021-08-09T12:48:02.587+00:00

I was finally able to process the file with just a single-node configuration; I needed to add the maxRowsInMemory option to the code:

    val Data = spark.read.format("com.crealytics.spark.excel")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("maxRowsInMemory", 10) // use the streaming reader, keeping only 10 rows in memory at a time
      .load("/mnt/adls/folder/file.xlsx")

    Reference: https://stackoverflow.com/questions/50789369/construct-a-dataframe-from-excel-using-scala

    There are many other optional parameters which might be helpful in other use cases.
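    For example, a fuller read plus the Parquet write from the original question might look like the sketch below. The sheet name and output path are assumptions; the options shown (dataAddress, treatEmptyValuesAsNulls, maxRowsInMemory) come from the spark-excel option list.

    ```scala
    val Data = spark.read.format("com.crealytics.spark.excel")
      .option("dataAddress", "'Sheet1'!A1")      // which sheet/cell to start reading from (assumed sheet name)
      .option("header", "true")
      .option("inferSchema", "true")
      .option("treatEmptyValuesAsNulls", "true") // map empty cells to nulls
      .option("maxRowsInMemory", 10)             // stream rows instead of loading the whole sheet
      .load("/mnt/adls/folder/file.xlsx")

    // Write the result as Parquet to the mounted ADLS folder (output path is assumed).
    Data.write.mode("overwrite").parquet("/mnt/adls/folder/output_parquet")
    ```

    Note that this needs a Spark cluster with the spark-excel library installed, so treat it as an illustrative sketch rather than a tested program.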


1 additional answer

  1. ShaikMaheer-MSFT 38,546 Reputation points Microsoft Employee Moderator
    2021-08-09T05:50:52.337+00:00

    Hi @Chayan Upadhyay ,

    Welcome to Microsoft Q&A Platform, and thank you for posting your query here.

    Could you please try the below and see if that works:
    spark.catalog.clearCache()

    Also, please go through the below link, which covers this error:
    https://kb.databricks.com/jobs/driver-unavailable.html

    To understand what GC (garbage collection) is, please check this answer:
    https://forums.databricks.com/questions/14725/how-to-resolve-spark-full-gc-on-cluster-startup.html

    Hope this will help. Please let us know if any further queries.

