Read and transform a non-tabular xlsx/xlsm (Excel) file in Azure Databricks from Azure Blob Storage

Question

Alessandro Di Tocco 0

Hi, I have to read an Excel file (xlsx/xlsm) that is not in tabular form (it contains merged columns, images, formulas, buttons, etc.). I need to read specific cells and create a dataframe from them.

My Excel file is in Azure Blob Storage, and I have to import, read and transform it in Azure Databricks.

I cannot convert it directly into a Spark/pandas dataframe because of the complex layout of the Excel file.

On my local machine I used openpyxl successfully, but in Azure Databricks I am not able to read the Excel file without immediately converting it into a dataframe.

How can I do it?

I have Azure Databricks Community Edition, but if needed I can upgrade it.

Thanks to all.

ShaikMaheer-MSFT 38,551 Reputation points Microsoft Employee Moderator

2023-10-03T08:49:32.82+00:00

Hi Alessandro Di Tocco, Just checking if below answer helps. If yes, please consider hitting Accept Answer button. Accepted answers help community as well. Please let me know if any further queries. Thank you.


Answer 1

Have you tried the openpyxl library?

You can install it:

%pip install openpyxl

Then mount Azure Blob Storage to Azure Databricks:

dbutils.fs.mount(
    source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
    mount_point = "/mnt/<mount-point-name>",
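    # For wasbs:// sources the account key setting uses the full blob endpoint host name
    extra_configs = {"fs.azure.account.key.<storage-account-name>.blob.core.windows.net": "<storage-account-key>"}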
)
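
If the mounted path cannot be opened through Python's local file API on your cluster, another option is to load the workbook from raw bytes in memory. Below is a minimal sketch, assuming you already have the file content as bytes; `load_workbook_from_bytes` is a hypothetical helper name, and the demo builds a small workbook in memory only so there is something to load:

```python
import io

import openpyxl


def load_workbook_from_bytes(content, keep_vba=False):
    """Open an xlsx/xlsm workbook from raw bytes instead of a file path."""
    # openpyxl accepts any file-like object, so wrap the bytes in BytesIO.
    return openpyxl.load_workbook(io.BytesIO(content), keep_vba=keep_vba)


# Demo only: create a tiny workbook in memory to stand in for the real file.
# In Databricks the bytes could come from, for example, Spark's binaryFile
# reader (an assumption, not something shown in this thread).
buffer = io.BytesIO()
demo_wb = openpyxl.Workbook()
demo_wb.active["B2"] = "hello"
demo_wb.save(buffer)

wb_from_bytes = load_workbook_from_bytes(buffer.getvalue())
value = wb_from_bytes.worksheets[0]["B2"].value
```

For .xlsm files, passing keep_vba=True asks openpyxl to preserve the macros when the workbook is saved again.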

And then read the Excel file using openpyxl:

import openpyxl
# Path to the Excel file on the mounted Blob Storage container.
# openpyxl uses Python's local file API, so the mount is addressed through the /dbfs prefix.
excel_file_path = "/dbfs/mnt/<mount-point-name>/<excel-file-name>.xlsx"
# Open the Excel file using openpyxl (for .xlsm files, keep_vba=True preserves macros)
wb = openpyxl.load_workbook(excel_file_path)
# Get the worksheet you want to read
ws = wb.worksheets[0]
# Iterate over the cells in the worksheet and read the data you need
for row in ws.iter_rows():
    for cell in row:  # each row is a tuple of cells
        # Do something with the cell value
        cell_value = cell.value
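
Since the file is not tabular, you will probably want to pick specific cells rather than loop over everything. Here is a rough sketch of that idea, including merged ranges (openpyxl stores the value of a merged range only in its top-left cell). The helper names `cell_value` and `cells_to_dataframe`, the field names, and the cell addresses are all made up for illustration; the demo workbook is created in memory so the snippet runs on its own:

```python
import openpyxl
import pandas as pd


def cell_value(ws, coordinate):
    """Return a cell's value, resolving merged ranges to their top-left cell."""
    cell = ws[coordinate]
    for rng in ws.merged_cells.ranges:
        inside_rows = rng.min_row <= cell.row <= rng.max_row
        inside_cols = rng.min_col <= cell.column <= rng.max_col
        if inside_rows and inside_cols:
            # Only the top-left cell of a merged range holds the value.
            return ws.cell(row=rng.min_row, column=rng.min_col).value
    return cell.value


def cells_to_dataframe(ws, field_map):
    """Build a one-row pandas DataFrame from a {column_name: cell_address} map."""
    record = {name: cell_value(ws, coord) for name, coord in field_map.items()}
    return pd.DataFrame([record])


# Demo workbook standing in for the real file (hypothetical layout).
demo_wb = openpyxl.Workbook()
demo_ws = demo_wb.active
demo_ws["A1"] = "Report 2023"
demo_ws.merge_cells("A1:C1")  # title spans three columns
demo_ws["B3"] = 42

df = cells_to_dataframe(demo_ws, {"title": "C1", "amount": "B3"})
```

With a real file you would pass the worksheet obtained from load_workbook instead of the demo one. Note that formula cells return the formula text unless the workbook is loaded with data_only=True, in which case openpyxl returns the values last cached by Excel.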

Then build a pandas DataFrame from the values you read and write the transformed DataFrame to a new Excel file in Azure Blob Storage:

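# Path to the new Excel file in Azure Blob Storage; the /dbfs prefix exposes the mount to Python's local file API
new_excel_file_path = "/dbfs/mnt/<mount-point-name>/<new-excel-file-name>.xlsx"

# df is the pandas DataFrame built from the cell values read above
df.to_excel(new_excel_file_path, index=False)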
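
If writing straight to the mounted path fails (as far as I know, the /dbfs file interface does not support random writes, which Excel writers may need), one workaround is to write to the driver's local disk first and then copy the finished file. A minimal sketch of that pattern; `write_excel_via_local_disk` is a hypothetical helper, and the demo writes into a temporary folder where you would use a path under /dbfs/mnt/<mount-point-name>/:

```python
import os
import shutil
import tempfile

import pandas as pd


def write_excel_via_local_disk(df, dest_path):
    """Write df to a local temporary .xlsx file, then copy it to dest_path."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, "output.xlsx")
        # The Excel writer works against ordinary local disk here.
        df.to_excel(local_path, index=False)
        # The copy is a plain sequential write to the destination.
        shutil.copyfile(local_path, dest_path)
    return dest_path


# Demo: a small DataFrame and a temporary destination folder.
demo_df = pd.DataFrame({"title": ["Report 2023"], "amount": [42]})
demo_dest = os.path.join(tempfile.mkdtemp(), "new-file.xlsx")
written_path = write_excel_via_local_disk(demo_df, demo_dest)
round_trip = pd.read_excel(written_path)
```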

ShaikMaheer-MSFT 38,551 Reputation points Microsoft Employee Moderator

2023-09-27T09:23:16.3033333+00:00

Hi Alessandro Di Tocco, Just checking if above answer helps. If yes, please consider hitting Accept Answer button. Accepted answers help community as well. Please let me know if any further queries. Thank you.
