Spark job freezes without any progress for a long period of time
Pavlo Vitynskyi
I have multiple Spark jobs deployed on Azure Databricks. Each usually takes less than an hour to process its data and is scheduled to run every hour.
I am facing the following issue:
Sometimes a job runs for an extremely long time (a few hours or even days) without making any progress until I cancel it.
The Databricks Runtime version is 7.5 (includes Apache Spark 3.0.1, Scala 2.12).
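As a stop-gap I am considering a job-level timeout, so a frozen run gets killed instead of blocking the next hourly run. A minimal sketch, assuming the Jobs API `timeout_seconds` job setting applies here; the job ID below is hypothetical:

```python
import json

# Hypothetical job ID; the real one comes from the Jobs UI or API.
JOB_ID = 123

# Jobs API "update" payload: kill any run that exceeds 90 minutes,
# so a stuck job cannot block the hourly schedule indefinitely.
payload = {
    "job_id": JOB_ID,
    "new_settings": {
        "timeout_seconds": 90 * 60,
    },
}

# This JSON would be POSTed to the Jobs API update endpoint with a
# bearer token; printed here instead of sent, since it is only a sketch.
print(json.dumps(payload, indent=2))
```

This only limits the damage, of course; it does not explain why the jobs freeze in the first place.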
During the inactivity period, the only messages in the driver logs are:
21/03/26 22:34:38 INFO HiveMetaStore: 1: get_database: default
21/03/26 22:34:38 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
21/03/26 22:34:38 INFO DriverCorral: Metastore health check ok
21/03/26 22:39:29 INFO DriverCorral: DBFS health check ok
21/03/26 22:39:30 WARN MetastoreMonitor: Failed to connect to the metastore InternalMysqlMetastore(DbMetastoreConfig{host=consolidated-centralus-prod-metastore-addl.mysql.database.azure.com, port=3306, dbName=organization123456789, user=[REDACTED]}). (timeSinceLastSuccess=16500028)
java.lang.IllegalArgumentException: A health check named database already exists
at com.codahale.metrics.health.HealthCheckRegistry.register(HealthCheckRegistry.java:101)
at com.databricks.instrumentation.Instrumented$Dsl.instrumentJdbi(Instrumented.scala:242)
at com.databricks.common.database.DatabaseUtils$.createDBI(DatabaseUtils.scala:170)
at com.databricks.common.database.DatabaseUtils$.withDBI(DatabaseUtils.scala:497)
at com.databricks.backend.daemon.driver.MetastoreMonitor.checkMetastore(MetastoreMonitor.scala:177)
at com.databricks.backend.daemon.driver.MetastoreMonitor.$anonfun$doMonitor$1(MetastoreMonitor.scala:154)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.logging.UsageLogging.$anonfun$recordOperation$4(UsageLogging.scala:432)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:240)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:235)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:232)
at com.databricks.threading.NamedTimer$$anon$1.withAttributionContext(NamedTimer.scala:94)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:277)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:270)
at com.databricks.threading.NamedTimer$$anon$1.withAttributionTags(NamedTimer.scala:94)
at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:413)
at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:339)
at com.databricks.threading.NamedTimer$$anon$1.recordOperation(NamedTimer.scala:94)
at com.databricks.threading.NamedTimer$$anon$1.$anonfun$run$2(NamedTimer.scala:103)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:240)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:235)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:232)
at com.databricks.threading.NamedTimer$$anon$1.withAttributionContext(NamedTimer.scala:94)
at com.databricks.logging.UsageLogging.disableTracing(UsageLogging.scala:833)
at com.databricks.logging.UsageLogging.disableTracing$(UsageLogging.scala:832)
at com.databricks.threading.NamedTimer$$anon$1.disableTracing(NamedTimer.scala:94)
at com.databricks.threading.NamedTimer$$anon$1.$anonfun$run$1(NamedTimer.scala:102)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.util.UntrustedUtils$.tryLog(UntrustedUtils.scala:100)
at com.databricks.threading.NamedTimer$$anon$1.run(NamedTimer.scala:101)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
21/03/26 22:39:38 INFO HiveMetaStore: 1: get_database: default
21/03/26 22:39:38 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
21/03/26 22:39:38 INFO DriverCorral: Metastore health check ok
21/03/26 22:44:29 INFO DriverCorral: DBFS health check ok
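For what it's worth, if the `timeSinceLastSuccess` value in the MetastoreMonitor warning above is in milliseconds (my assumption), the monitor had already been failing to reach the metastore for several hours by the time these lines were logged:

```python
# timeSinceLastSuccess from the MetastoreMonitor warning above.
time_since_last_success_ms = 16_500_028

# Assuming the unit is milliseconds, convert to hours.
hours = time_since_last_success_ms / 3_600_000
print(f"metastore unreachable for ~{hours:.1f} hours")  # ~4.6 hours
```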
It would be good to know the exact reason for this problem.
Thanks