How to run Azure Databricks notebooks in parallel using PySpark and print the notebooks that failed during execution

2024-02-08T13:56:54.55+00:00

We need to run Databricks notebooks in parallel using PySpark, and if any notebook fails during execution, we have to print the failed notebooks.

Azure Databricks

Accepted answer
  Bhargava-MSFT (Microsoft Employee, Moderator)
    2024-02-08T18:47:13.3466667+00:00

    Hello SaiSekhar, MahasivaRavi (Philadelphia),

    As described in the documents below, you can use the dbutils.notebook.run() function together with a thread pool to run multiple notebooks in parallel.

    https://www.codesexplorer.com/2020/03/run-databricks-notebooks-in-parallel-python.html

    https://learn.microsoft.com/en-us/azure/databricks/notebooks/notebook-workflows

    I have copied the code from the links above and modified it to print an error message when a notebook fails to run.

    Please try and let me know.

    from concurrent.futures import ThreadPoolExecutor
    
    class NotebookData:
      def __init__(self, path, timeout, parameters=None, retry=0):
        self.path = path
        self.timeout = timeout
        self.parameters = parameters
        self.retry = retry
    
      @staticmethod
      def submitNotebook(notebook):
        print("Running notebook %s" % notebook.path)
        try:
          if notebook.parameters:
            return dbutils.notebook.run(notebook.path, notebook.timeout, notebook.parameters)
          else:
            return dbutils.notebook.run(notebook.path, notebook.timeout)
        except Exception as e:
          # Print the failed notebook and its error, as asked in the question
          print(f"Notebook {notebook.path} failed with error: {e}")
          if notebook.retry < 1:
            raise
          print("Retrying notebook %s" % notebook.path)
          notebook.retry = notebook.retry - 1
          return NotebookData.submitNotebook(notebook)
    
    def parallelNotebooks(notebooks, numInParallel):
      # Each dbutils.notebook.run call blocks its worker thread, so
      # max_workers caps how many notebooks run at the same time
      with ThreadPoolExecutor(max_workers=numInParallel) as ec:
        return [ec.submit(NotebookData.submitNotebook, notebook) for notebook in notebooks]
    
    # Array of instances of the NotebookData class (timeout is in seconds)
    notebooks = [
      NotebookData("../path/to/Notebook1", 1200),
      NotebookData("../path/to/Notebook2", 1200, {"Name": "Abhay"}),
      NotebookData("../path/to/Notebook3", 1200, retry=2)
    ]
    
    res = parallelNotebooks(notebooks, 2)
    result = [i.result(timeout=3600) for i in res]  # This is a blocking call.
    print(result)
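
    One caveat with the last two lines: future.result() re-raises the exception from a failed notebook, so the list comprehension stops at the first failure before the summary is printed. If you also want a consolidated list of all failed notebooks at the end, you can collect each future's outcome individually. A minimal sketch (the failed and succeeded names are just for illustration; it relies on parallelNotebooks returning the futures in the same order as notebooks):

    failed, succeeded = [], []
    for notebook, future in zip(notebooks, res):
      try:
        # result() re-raises whatever submitNotebook raised for this notebook
        succeeded.append((notebook.path, future.result(timeout=3600)))
      except Exception as e:
        failed.append((notebook.path, str(e)))
    
    print("Failed notebooks:", [path for path, error in failed])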
    
    
