The errors described are internal Databricks SQL warehouse issues rather than problems with the dbt models or query syntax.
The messages
- Query failed because the execution engine did not respond.
- [INTERNAL_ERROR] Query could not be scheduled: HTTP Response code: 503. Please try again later. SQLSTATE: XX000
indicate that the Databricks backend could not start or schedule an execution environment or engine instance for the query. HTTP 503 and INTERNAL_ERROR at scheduling time are transient service-side conditions (capacity, control-plane, or engine startup issues) and are not caused by query complexity or data size, which matches the observation that both very small and large models fail intermittently.
There is no additional public error-class entry in the provided documentation for these specific INTERNAL_ERROR / scheduling / 503 conditions; they are not mapped to a documented Databricks SQL error class like the ones listed for data sources, stateful streaming, or SQL script limits. Because of that, there is no further detail available from documentation beyond what the warehouse UI already shows.
Given that, the recommended steps are:
- Treat these as transient platform errors, not model errors.
- Implement retry logic on the client side (dbt) if possible, so that failed models are retried when the error text includes INTERNAL_ERROR or HTTP 503.
- Since the failures started suddenly at specific times and affect multiple regions and arbitrary queries, open a support case with Azure Databricks/Microsoft, providing:
  - Workspace ID and region(s)
  - SQL warehouse name and type (serverless)
  - Approximate timestamps and request IDs / operation handles from the error messages
  - Example queries (even simple ones that fail)
Only the service team can investigate the internal cause (capacity, control-plane regression, or regional incident) and apply a platform fix or advise on any required configuration changes.
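While the support case is open, the client-side retry suggested above could look roughly like the following. This is a minimal sketch, not a dbt or Databricks API: `run_query` is a hypothetical zero-argument callable standing in for whatever submits the statement, and the marker strings, attempt count, and delays are assumptions taken from the error text quoted in this thread. It retries only when the error text matches a transient marker and re-raises everything else immediately.

```python
import random
import time

# Substrings copied from the error messages quoted in this thread.
# Assumption: matching on text is good enough; adjust to your own logs.
TRANSIENT_MARKERS = (
    "INTERNAL_ERROR",
    "HTTP Response code: 503",
    "execution engine did not respond",
)


def is_transient(message: str) -> bool:
    """Return True if the error text looks like a transient platform failure."""
    return any(marker in message for marker in TRANSIENT_MARKERS)


def run_with_retry(run_query, max_attempts=4, base_delay=5.0, sleep=time.sleep):
    """Call run_query(), retrying with exponential backoff on transient errors.

    run_query is a hypothetical callable supplied by the caller; it is not
    part of dbt or the Databricks SDK.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except Exception as exc:
            # Give up on the last attempt, or at once for non-transient errors
            # (for example, a genuine SQL or model error).
            if attempt == max_attempts or not is_transient(str(exc)):
                raise
            # Backoff of 5s, 10s, 20s, ... plus up to 1s of random jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, 1))
```

If the models run through dbt Core 1.6 or later, the `dbt retry` command, which re-runs the previous invocation from the point of failure, may be a simpler way to get the same effect at the job level.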