How to monitor REST API "run submit" job on Azure Databricks?

philmarius-new 126

We're moving away from notebooks and putting our pyspark workflows into .py files that are being uploaded to DBFS via CI/CD pipelines and then being run by the run submit API endpoint. However, we're struggling to monitor these jobs for failures.

We've setup the spark-monitoring scripts on our Azure Databricks instance and it's successfully feeding logs back to Azure Monitor. However, when running deliberately failing jobs, we cannot tell via the logs whether the job has failed or not, making monitoring the ETL workflows we have problematic.

Has anyone done something similar to this before and how did they do it?

KranthiPakala-MSFT 46,422 Reputation points Microsoft Employee

2021-03-30T17:16:47.96+00:00

Hi @philmarius-new ,

Thanks for reaching out. We are reaching out to internal team to get assistance on this query and will keep you posted as soon as I have a update.

Thank you for your patience.
philmarius-new 126 Reputation points

2021-03-30T17:22:26.363+00:00

Hi Kranthi

We're working on an alternative solution where we use Durable Azure Functions to continuously poll the API to retrieve the state of the run
KranthiPakala-MSFT 46,422 Reputation points Microsoft Employee

2021-03-30T17:55:48.77+00:00

Hi @philmarius-new ,

Thanks for the update. Do let us know know how it goes so that it can be beneficial to other community members reading this thread.

Also curious to know the reason for switching from python notebook to .py ? Appreciate if you could share specific reason for this change :)

Thank you
philmarius-new 126 Reputation points

2021-03-31T09:28:24.867+00:00
There's a number of reasons:

Notebooks encourage bad coding practices

Development / stagging / production environment separation is non-existent

Can't test code easily

Little to no version control (and the version control that currently exists is bad)

%run makes it really hard to follow the order of execution of notebooks

They're painfully slow to develop on when they get big

No intellicode / linting / code formatting

Don't play nice with external systems (can ONLY communicate via API) and it's highly encouraged to "lock in" all aspects of a pipeline into Databricks

Changing to having simple PySpark .py files and firing them at Databricks allows us to test code, integrate it with CI/CD, and add version control to name a few
philmarius-new 126 Reputation points

2021-03-31T09:29:05.65+00:00
https://datapastry.com/blog/why-i-dont-use-jupyter-notebooks-and-you-shouldnt-either/#:~:text=Notebooks%20are%20the%20junk%20food,ll%20end%20up%20morbidly%20obese.&text=Jupyter%20notebooks%20are%20pretty%20much,when%20approaching%20a%20new%20problem.

https://towardsdatascience.com/5-reasons-why-jupyter-notebooks-suck-4dc201e27086

https://medium.com/skyline-ai/jupyter-notebook-is-the-cancer-of-ml-engineering-70b98685ee71
KranthiPakala-MSFT 46,422 Reputation points Microsoft Employee

2021-03-31T17:18:42.25+00:00

Hi @philmarius-new , thanks for your response and details. It is very helpful.
Please keep us posted about how it goes with alternative solution using Azure Durable Functions to retrieve the state of the job run.

When I checked with the internal team about the initial query, below is the suggestion:

For ETL monitoring, it is suggested either :

Use https://docs.databricks.com/dev-tools/api/latest/jobs.html#run-now to get track of job_ids more easily.

Use ADF as a scheduler. ADF offers good capabilities for job monitoring
philmarius-new 126 Reputation points

2021-04-01T11:08:50.19+00:00
We've tried ADF and it's just as bad as notebooks:

Slow development time

No version control outside of on-prem Github enterprise and Azure DevOps

EXTREMELY expensive (our costs have skyrocketed since implementing it)

The "preview" window only ever shows a sample of the data which makes joining datasets impossible to monitor without heavy filtering at the beginning

Doesn't partition delta lakes at all even when enabling partitioning on data flows, this makes using the deltas outside of ADF inconvenient

Hard to work out why pipelines fail, just presented with an obscure error and, as we don't know what's running under the hood, we can't work out why pipelines have failed

Event driven pipelines are easy to setup, but hard to configure, ours kept failing when we tried running concurrent pipelines

Cannot edit triggers easily

Staying away from notebooks and ADF gives us more control
KranthiPakala-MSFT 46,422 Reputation points Microsoft Employee

2021-04-02T01:17:54.29+00:00

Thanks for sharing your observations @philmarius-new .

Does the alternative solution using Durable Azure Functions to continuously retrieve the state of the run worked as expected? Please do let us know how it goes, as it can be beneficial to others reading this.

Thank you
philmarius-new 126 Reputation points

2021-04-02T09:44:10.8+00:00

Hi Kranthi

I'm still working on it! I am going to be doing a writeup on the company blog about our solution and will paste that here :)
KranthiPakala-MSFT 46,422 Reputation points Microsoft Employee

2021-04-05T23:17:49.7+00:00

Thanks much for the update @philmarius-new . Please keep this thread posted once your blog is published as it would be beneficial to other community members :)

Have a good day.