Spark U/I is not loading for recently "terminated" job clusters

David Beavon 971 Reputation points
2021-04-07T20:49:06.167+00:00

I've been having trouble with a feature in Azure Databricks. The Spark U/I will not be shown for job clusters that have recently completed ("terminated"). There is a link to the Spark U/I in the Databricks portal, but when you click the link, you are presented with a status message that just says "Loading":

Loading old UI for cluster "whatever"... This may take a few minutes.

This screen will remain the same for a long period of time (hours), and I eventually lose patience and close the window. I haven't yet tried to wait overnight ... and even if I did, I'm not sure it would be reasonable to wait that long for the U/I to respond.

When I previously encountered this issue and opened a support ticket with Databricks/Azure Databricks, they were not able to confirm any outages during the period in question. So far we have established that there is a "Spark History Server UI", a shared resource that can become congested with requests from multiple customers. I'm assuming this implies that the issue simultaneously affects multiple customers ... although we haven't yet established that for certain.

I've been using Azure Databricks in production for a few months now, and I'm not familiar enough with it to know whether the issue is specific to us, or whether it might be a chronic issue that affects others in the same region as well. I googled the problem and found no results, so I thought it would be good to start a new discussion about it here in the Q&A.

Please let me know if anyone has an explanation, or has experienced this themselves. I'm also eager to hear if there are any tricks to get the workspace working properly again. Whenever I've encountered this issue, the problem wouldn't go away on its own until a day or two had passed. I haven't yet gotten any acknowledgement of these outages from Microsoft.

Azure Databricks

7 answers

  1. David Beavon 971 Reputation points
    2021-05-21T19:25:22.583+00:00

    It has been over a month, but the documentation is worth the wait:

    https://kb.databricks.com/clusters/replay-cluster-spark-events.html

    If the Spark U/I is not working in the Databricks portal, then you can use this as a "Plan B".

    The new documentation explains the full workaround. This is a solution that Databricks support engineers have been recommending for a long time; previously you had to open a support case before they would share it.

    Hope it helps.

    1 person found this answer helpful.

  2. David Beavon 971 Reputation points
    2021-04-07T23:17:05.387+00:00

    Over the past week I've been able to learn a bit more about this issue from tech support.

    There is apparently a workaround available to customers if/when the "Spark History Server" isn't showing the Spark UI successfully in a Databricks workspace.

    The workaround is only possible if you start by configuring a cluster to deliver its logs to a DBFS location. Within those logs is an "eventlog" file, which is what is used to render the Spark UI.
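
    For illustration, log delivery is configured on the cluster itself. A minimal sketch of the relevant fragment of a cluster definition using the `cluster_log_conf` setting (the destination path here is just a hypothetical example):

        {
          "cluster_log_conf": {
            "dbfs": {
              "destination": "dbfs:/cluster-logs"
            }
          }
        }

    With this in place, the cluster's logs (including the eventlog) are delivered under the destination path, typically in a subdirectory named after the cluster ID.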

    This workaround allows us to render the eventlog file found in the delivered logs. The eventlog can only be replayed on a freshly-started all-purpose cluster. Once the events are replayed, the Spark UI can be reviewed as normal. The folks in Databricks engineering said they would make this workaround available in a KB article once it has been tested by a sufficient number of customers. This approach is called "replaying" the events. The signature of their method looks like so:

    def replaySparkEvents(pathToEventLogs: String): Unit = { ... }

    If/when you are unable to use the Spark UI in the Azure Databricks workspace, you should contact tech support. They are likely to provide you with this workaround, especially since the problems with the Spark UI seem to be persistent, recurring, and unpredictable.


  3. David Beavon 971 Reputation points
    2021-04-08T12:44:51.797+00:00

    @PRADEEPCHEEKATLA-MSFT
    It's not a resolution; it's a workaround. I'm still working with tech support to understand the circumstances in which the "normal" Spark UI functionality stops working. As I mentioned, it seems to be a recurring and unpredictable problem.

    Given the existence of the workaround (replaySparkEvents), and given that it was delivered to me within a day of reporting the problem, it is pretty clear that this is not a new topic for the folks at Azure Databricks.

    However, I think they need to refocus their efforts on the root cause that is preventing the "Spark UI" from working reliably in the first place. The workaround takes quite a bit more effort, and involves more configuration, than simply clicking the link in the Azure Databricks workspace. If that link worked consistently, it would save everyone a lot of time and avoid future support cases. I am still waiting on the public KB for "replaySparkEvents" and will post it here when I have a link.


  4. David Beavon 971 Reputation points
    2021-04-28T16:55:16.127+00:00

    There should be documentation shortly for the workaround (replaySparkEvents).

    Also, the underlying bug in the Spark History Server should be fixed in the next year or so. You can inquire about the details by contacting Azure Databricks support and providing the improvement ID for the upcoming History Server enhancements (DB-I-3506).


  5. David Beavon 971 Reputation points
    2022-11-07T22:52:13.623+00:00

    My CSS support engineer, Hira, says that Azure Databricks has finally fixed this issue with the Spark U/I:

    "I have confirmed that yesterday (10/26/22), a fix has been deployed."

    In theory that means the workaround (replaySparkEvents) is no longer necessary.
