Spark job freezes without any progress for a long period of time

Pavlo Vitynskyi 46 Reputation points
2021-03-29T10:30:26.047+00:00

I have multiple Spark jobs deployed on Azure Databricks. Each of them usually takes less than an hour to process its data, and they are scheduled to run every hour.
I am facing the following issue:
Sometimes a job runs for an extremely long period of time (a few hours or even days) without making any progress, until I cancel it.
The Databricks Runtime version is 7.5 (includes Apache Spark 3.0.1, Scala 2.12).
During the inactivity period, the only messages in the driver logs are:

21/03/26 22:34:38 INFO HiveMetaStore: 1: get_database: default
21/03/26 22:34:38 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
21/03/26 22:34:38 INFO DriverCorral: Metastore health check ok
21/03/26 22:39:29 INFO DriverCorral: DBFS health check ok
21/03/26 22:39:30 WARN MetastoreMonitor: Failed to connect to the metastore InternalMysqlMetastore(DbMetastoreConfig{host=consolidated-centralus-prod-metastore-addl.mysql.database.azure.com, port=3306, dbName=organization123456789, user=[REDACTED]}). (timeSinceLastSuccess=16500028)
java.lang.IllegalArgumentException: A health check named database already exists
 at com.codahale.metrics.health.HealthCheckRegistry.register(HealthCheckRegistry.java:101)
 at com.databricks.instrumentation.Instrumented$Dsl.instrumentJdbi(Instrumented.scala:242)
 at com.databricks.common.database.DatabaseUtils$.createDBI(DatabaseUtils.scala:170)
 at com.databricks.common.database.DatabaseUtils$.withDBI(DatabaseUtils.scala:497)
 at com.databricks.backend.daemon.driver.MetastoreMonitor.checkMetastore(MetastoreMonitor.scala:177)
 at com.databricks.backend.daemon.driver.MetastoreMonitor.$anonfun$doMonitor$1(MetastoreMonitor.scala:154)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
 at com.databricks.logging.UsageLogging.$anonfun$recordOperation$4(UsageLogging.scala:432)
 at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:240)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
 at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:235)
 at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:232)
 at com.databricks.threading.NamedTimer$$anon$1.withAttributionContext(NamedTimer.scala:94)
 at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:277)
 at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:270)
 at com.databricks.threading.NamedTimer$$anon$1.withAttributionTags(NamedTimer.scala:94)
 at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:413)
 at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:339)
 at com.databricks.threading.NamedTimer$$anon$1.recordOperation(NamedTimer.scala:94)
 at com.databricks.threading.NamedTimer$$anon$1.$anonfun$run$2(NamedTimer.scala:103)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
 at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:240)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
 at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:235)
 at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:232)
 at com.databricks.threading.NamedTimer$$anon$1.withAttributionContext(NamedTimer.scala:94)
 at com.databricks.logging.UsageLogging.disableTracing(UsageLogging.scala:833)
 at com.databricks.logging.UsageLogging.disableTracing$(UsageLogging.scala:832)
 at com.databricks.threading.NamedTimer$$anon$1.disableTracing(NamedTimer.scala:94)
 at com.databricks.threading.NamedTimer$$anon$1.$anonfun$run$1(NamedTimer.scala:102)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
 at com.databricks.util.UntrustedUtils$.tryLog(UntrustedUtils.scala:100)
 at com.databricks.threading.NamedTimer$$anon$1.run(NamedTimer.scala:101)
 at java.util.TimerThread.mainLoop(Timer.java:555)
 at java.util.TimerThread.run(Timer.java:505)
21/03/26 22:39:38 INFO HiveMetaStore: 1: get_database: default
21/03/26 22:39:38 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
21/03/26 22:39:38 INFO DriverCorral: Metastore health check ok
21/03/26 22:44:29 INFO DriverCorral: DBFS health check ok
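As a stopgap while investigating, I am considering setting a job-level timeout so that a hung run is cancelled automatically instead of blocking until someone notices it. A sketch of the relevant fields in a Databricks Jobs API 2.0 job settings payload (the job name, notebook path, and cron expression below are placeholders for my actual job):

```json
{
  "name": "hourly-processing-job",
  "notebook_task": {
    "notebook_path": "/Jobs/process-data"
  },
  "schedule": {
    "quartz_cron_expression": "0 0 * * * ?",
    "timezone_id": "UTC"
  },
  "timeout_seconds": 5400,
  "max_retries": 1
}
```

With `timeout_seconds` set to 5400, any run exceeding 90 minutes is cancelled by the scheduler, and `max_retries` allows one automatic retry, so an occasional hang would not block subsequent hourly runs. This only mitigates the symptom, though, not the underlying cause.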

It would be good to know the exact root cause of this problem.

Thanks
