spark.read excel with formula

braxx 456 Reputation points
2022-01-10T10:47:54.137+00:00

For some reason spark is not reading the data correctly from xlsx file in the column with a formula. I am reading it from a blob storage.

Consider this simple data set
163586-image.png

The column "color" has formulas for all the cells like
=VLOOKUP(A4,C3:D5,2,0)

In cases where the formula could not return a value it is read differently by excel and spark:
excel - #N/A
spark - =VLOOKUP(A4,C3:D5,2,0)

Here is my code:

df= spark.read\  
  .format("com.crealytics.spark.excel")\  
  .option("header", "true")\  
  .load(input_path + input_folder_general + "test1.xlsx")  
  
display(df)  

And here is how the above dataset is read:
163529-image.png

How to get #N/A instead of a formula?

Azure Databricks
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
2,514 questions
0 comments No comments
{count} votes

1 answer

Sort by: Most helpful
  1. HimanshuSinha-msft 19,486 Reputation points Microsoft Employee Moderator
    2022-01-11T22:39:11.35+00:00

    Hello @braxx ,
    Thanks for the ask and using Microsoft Q&A platform .
    Have you tried the options "setErrorCellsToFallbackValues" to check if will help ? I think its will not give you "N/A" as asked but may toggle beyween 0 and 1 . You can read about this more here https://github.com/crealytics/spark-excel#create-a-dataframe-from-an-excel-file

    spark.read
    .format("com.crealytics.spark.excel") // Or .format("excel") for V2 implementation
    .option("dataAddress", "'My Sheet'!B3:C35") // Optional, default: "A1"
    .option("header", "true") // Required
    .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
    .option("setErrorCellsToFallbackValues", "true") // Optional, default: false, where errors will be converted to null. If true, any ERROR cell values (e.g. #N/A) will be converted to the zero values of the column's data type.
    .option("usePlainNumberFormat", "false") // Optional, default: false, If true, format the cells without rounding and scientific notations
    .option("inferSchema", "false") // Optional, default: false
    .option("addColorColumns", "true") // Optional, default: false
    .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
    .option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files (will fail if used with xls format files)
    .option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
    .option("workbookPassword", "pass") // Optional, default None. Requires unlimited strength JCE for older JVMs
    .schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
    .load("Worktime.xlsx")

    My apoloziges I was not able to test it as I am struggling to install the library on the cluster .

    Please do let me know how it goes .
    Thanks
    Himanshu

    -------------------------------------------------------------------------------------------------------------------------

    • Please don't forget to click on 130616-image.png or upvote 130671-image.png button whenever the information provided helps you. Original posters help the community find answers faster by identifying the correct answer. Here is how
    • Want a reminder to come back and check responses? Here is how to subscribe to a notification
    • If you are interested in joining the VM program and help shape the future of Q&A: Here is how you can be part of Q&A Volunteer Moderators

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.