Spark job freezes without any progress for a long period of time

Pavlo Vitynskyi 46 Reputation points
2021-03-29T10:30:26.047+00:00

I have multiple Spark jobs deployed on Azure Databricks. Each of them usually takes less than an hour to process its data, and they are scheduled to run every hour.
I am facing the following issue:
Sometimes a job runs for an extremely long period of time (a few hours or even days) without making any progress, until I cancel it.
The Databricks Runtime version is 7.5 (includes Apache Spark 3.0.1, Scala 2.12).
During the inactivity period, the only messages in the driver logs are:

21/03/26 22:34:38 INFO HiveMetaStore: 1: get_database: default
21/03/26 22:34:38 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
21/03/26 22:34:38 INFO DriverCorral: Metastore health check ok
21/03/26 22:39:29 INFO DriverCorral: DBFS health check ok
21/03/26 22:39:30 WARN MetastoreMonitor: Failed to connect to the metastore InternalMysqlMetastore(DbMetastoreConfig{host=consolidated-centralus-prod-metastore-addl.mysql.database.azure.com, port=3306, dbName=organization123456789, user=[REDACTED]}). (timeSinceLastSuccess=16500028)
java.lang.IllegalArgumentException: A health check named database already exists
 at com.codahale.metrics.health.HealthCheckRegistry.register(HealthCheckRegistry.java:101)
 at com.databricks.instrumentation.Instrumented$Dsl.instrumentJdbi(Instrumented.scala:242)
 at com.databricks.common.database.DatabaseUtils$.createDBI(DatabaseUtils.scala:170)
 at com.databricks.common.database.DatabaseUtils$.withDBI(DatabaseUtils.scala:497)
 at com.databricks.backend.daemon.driver.MetastoreMonitor.checkMetastore(MetastoreMonitor.scala:177)
 at com.databricks.backend.daemon.driver.MetastoreMonitor.$anonfun$doMonitor$1(MetastoreMonitor.scala:154)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
 at com.databricks.logging.UsageLogging.$anonfun$recordOperation$4(UsageLogging.scala:432)
 at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:240)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
 at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:235)
 at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:232)
 at com.databricks.threading.NamedTimer$$anon$1.withAttributionContext(NamedTimer.scala:94)
 at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:277)
 at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:270)
 at com.databricks.threading.NamedTimer$$anon$1.withAttributionTags(NamedTimer.scala:94)
 at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:413)
 at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:339)
 at com.databricks.threading.NamedTimer$$anon$1.recordOperation(NamedTimer.scala:94)
 at com.databricks.threading.NamedTimer$$anon$1.$anonfun$run$2(NamedTimer.scala:103)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
 at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:240)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
 at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:235)
 at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:232)
 at com.databricks.threading.NamedTimer$$anon$1.withAttributionContext(NamedTimer.scala:94)
 at com.databricks.logging.UsageLogging.disableTracing(UsageLogging.scala:833)
 at com.databricks.logging.UsageLogging.disableTracing$(UsageLogging.scala:832)
 at com.databricks.threading.NamedTimer$$anon$1.disableTracing(NamedTimer.scala:94)
 at com.databricks.threading.NamedTimer$$anon$1.$anonfun$run$1(NamedTimer.scala:102)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
 at com.databricks.util.UntrustedUtils$.tryLog(UntrustedUtils.scala:100)
 at com.databricks.threading.NamedTimer$$anon$1.run(NamedTimer.scala:101)
 at java.util.TimerThread.mainLoop(Timer.java:555)
 at java.util.TimerThread.run(Timer.java:505)
21/03/26 22:39:38 INFO HiveMetaStore: 1: get_database: default
21/03/26 22:39:38 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
21/03/26 22:39:38 INFO DriverCorral: Metastore health check ok
21/03/26 22:44:29 INFO DriverCorral: DBFS health check ok
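As a stopgap while investigating, I am considering setting a job-level timeout so that a hung run is cancelled automatically instead of blocking until someone notices it. A sketch of the relevant fields in a Databricks Jobs API 2.0 job settings payload (the job name, notebook path, and cron expression below are placeholders for my actual job):

```json
{
  "name": "hourly-processing-job",
  "notebook_task": {
    "notebook_path": "/Jobs/process-data"
  },
  "schedule": {
    "quartz_cron_expression": "0 0 * * * ?",
    "timezone_id": "UTC"
  },
  "timeout_seconds": 5400,
  "max_retries": 1
}
```

With `timeout_seconds` set to 5400, any run exceeding 90 minutes is cancelled by the scheduler, and `max_retries` allows one automatic retry, so an occasional hang would not block subsequent hourly runs. This only mitigates the symptom, though, not the underlying cause.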

It would be good to know the exact root cause of this problem.

Thanks
