Read and Transform not tabular xlsx/xlsm (Excel) file in Azure Databricks from Azure Blob Storage

Alessandro Di Tocco 0 Reputation points
2023-09-21T08:24:01.3633333+00:00

Hi, I have to read an Excel file (xlsx/xlsm) that is not in a tabular form (inside there are merged columns, images, formulas, buttons, etc..), I have to read the cells I need and create a dataframe from it.

My Excel file is in an Azure Blob Storage, I have to import, read and transform it in Azure Databricks.

I cannot convert it immediately in spark/pandas dataframe due to the complex nature of the Excel file.

In my local machine I used openpyxl with success but in Azure Databricks I'm not able to read the excel file without transforming it immediately in a dataframe.

How can I do it?

I have Azure Databricks Community Edition, but if needed I can upgrade it.

Thanks to all.

Azure Blob Storage
Azure Blob Storage
An Azure service that stores unstructured data in the cloud as blobs.
2,849 questions
Azure Databricks
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
2,175 questions
Excel
Excel
A family of Microsoft spreadsheet software with tools for analyzing, charting, and communicating data.
1,890 questions
{count} votes

1 answer

Sort by: Most helpful
  1. Amira Bedhiafi 24,551 Reputation points
    2023-09-22T15:33:56.3933333+00:00

    Have you tried the openpyxl library ?

    You can install it :

    %pip install openpyxl
    

    Then, mount Azure Blob Storage to Azure Databricks :

    dbutils.fs.mount(
        source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
        mount_point = "/mnt/<mount-point-name>",
        extra_configs = {"fs.azure.account.key.<storage-account-name>": "<storage-account-key>"}
    )
    

    And then read the Excel file using openpyxl :

    import openpyxl
    # Get the path to the Excel file in Azure Blob Storage
    excel_file_path = "/mnt/<mount-point-name>/<excel-file-name>.xlsx"
    # Open the Excel file using openpyxl
    wb = openpyxl.load_workbook(excel_file_path)
    # Get the worksheet you want to read
    ws = wb.worksheets[0]
    # Iterate over the cells in the worksheet and read the data you need
    for row in ws.rows:
        for cell in row.cells:
            # Do something with the cell value
            cell_value = cell.value
    

    And then you will need to create a Pandas df from the data you read and write the transformed df to a new Excel file in Azure Blob Storage :

    # Get the path to the new Excel file in Azure Blob Storage
    new_excel_file_path = "/mnt/<mount-point-name>/<new-excel-file-name>.xlsx"
    
    df.to_excel(new_excel_file_path, index=False)
    
    1 person found this answer helpful.

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.