Failure provisioning databricks workspace

Taylor, Russell 26 Reputation points
2024-08-08T19:14:38.5933333+00:00

Hello, we are encountering an issue provisioning a new Azure Databricks workspace. We are using Terraform to deploy the workspace, and the workspace is to be configured with private networking.

This is the terraform plan for the workspace being created, please note I have obscured some values for security reasons:

  + resource "azurerm_databricks_workspace" "this" {
      + customer_managed_key_enabled = false
      + id = (known after apply)
      + infrastructure_encryption_enabled = false
      + location = "eastus2"
      + managed_resource_group_id = (known after apply)
      + managed_resource_group_name = "AZUT-XXXXXXX-DATABRICKS-WS-RG"
      + name = "AZUT-XXXXXXX-DATABRICKS-WS"
      + network_security_group_rules_required = "NoAzureDatabricksRules"
      + public_network_access_enabled = false
      + resource_group_name = "AZUT-EDH-CORE-RG"
      + sku = "premium"
      + storage_account_identity = (known after apply)
      + tags = {
          + "applicationName" = "EDH"
          + "costCenter" = "XXXXXXX"
          + "department" = "EDH-ENG"
          + "environment" = "QA"
          + "lineOfBusiness" = "EDH-CORE"
          + "systemOwner" = "XXXXXXX"
          + "wbsCode" = "X-XXXXXX.01"
        }
      + workspace_id = (known after apply)
      + workspace_url = (known after apply)

      + custom_parameters {
          + nat_gateway_name = (known after apply)
          + no_public_ip = true
          + private_subnet_name = "AZUT-XXXXXX-SUB-DATABRICKS-PRIV"
          + private_subnet_network_security_group_association_id = "/subscriptions/ed736821-7d46-XXXXXXXXX-c8659a4b95fd/resourceGroups/AZUT-XXXXXX-RG/providers/Microsoft.Network/networkSecurityGroups/AZUT-XXXXXX-NSG-DATABRICKS-PE"
          + public_ip_name = (known after apply)
          + public_subnet_name = "AZUT-XXXXXX-SUB-DATABRICKS-PUBL"
          + public_subnet_network_security_group_association_id = "/subscriptions/ed736821-7d46-XXXXXXX-c8659a4b95fd/resourceGroups/AZUT-XXXXXX-RG/providers/Microsoft.Network/networkSecurityGroups/AZUT-XXXXXX-NSG-DATABRICKS-PE"
          + storage_account_name = (known after apply)
          + storage_account_sku_name = (known after apply)
          + virtual_network_id = "/subscriptions/ed736821-7d46-XXXXXXXX-c8659a4b95fd/resourceGroups/AZUT-XXXXXX-RG/providers/Microsoft.Network/virtualNetworks/AZUT-XXXXXX-VNT"
          + vnet_address_prefix = (known after apply)
        }
    }

Our Terraform configuration is valid and the workspace attempts to deploy. However, even though terraform apply succeeds, the following error is present when we review the workspace in the Azure portal after deployment:

The workspace 'AZUT-XXXXXXX-DATABRICKS-WS' is in a failed state and cannot be launched. Please review error details in the activity log tab and retry your operation. 

After reviewing the activity log we have found the operation "Create Databricks Workspace" in a failed state with the following error:

Failed to prepare subnet 'AZUT-XXXXXXXX-SUB-DATABRICKS-PRIV'. Please try again later. Error details: 'Gateway authentication failed for 'Microsoft.Network'. Diagnostic information: timestamp '20240808T184051Z', tracking id '2164dd04-897e-4951-a464-0b8b5c1bfe03', request correlation id '2164dd04-897e-4951-a464-0b8b5c1bfe03'

We have reviewed our VNET, NSGs, NSG associations, subnet delegations and route table associations. All appear to be correct and in line with our other Databricks environments, which deployed without issues. The only significant difference is that we are creating this new workspace in an Azure subscription we have not deployed to before.

I have ensured that the service principal used to apply the Terraform configuration has the Contributor and User Access Administrator RBAC roles on the resource group the workspace is being deployed to. These permissions match the other subscriptions and resource groups in which we have successfully provisioned workspaces.

I have also ensured the Microsoft.Network resource provider is enabled in the subscription.

I am hoping someone in the community works for Azure and can review internal logs, which are not available to customers, that may contain a more meaningful error message to help identify the root cause of the problem.

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

Accepted answer
  1. Konstantinos Passadis 19,591 Reputation points MVP
    2024-08-14T17:44:25.9266667+00:00

    Hello @Taylor Russell

    Great! That was deep enough, but glad you got it!

    I will repeat the problem and the solution here. You may accept this answer if you are OK with it, so that anyone with a similar issue will find it marked as resolved.

    PROBLEM

    Provisioning a new Azure Databricks workspace fails. Terraform is used to deploy the workspace, and the workspace is to be configured with private networking.

    The Terraform configuration is valid and the workspace attempts to deploy. However, even though terraform apply succeeds, the following error is present when reviewing the workspace in the Azure portal after deployment:

    The workspace 'AZUT-XXXXXXX-DATABRICKS-WS' is in a failed state and cannot be launched. Please review error details in the activity log tab and retry your operation. 

    After reviewing the activity log we have found the operation "Create Databricks Workspace" in a failed state with the following error:

    Failed to prepare subnet 'AZUT-XXXXXXXX-SUB-DATABRICKS-PRIV'. Please try again later. Error details: 'Gateway authentication failed for 'Microsoft.Network'. Diagnostic information: timestamp '20240808T184051Z', tracking id '2164dd04-897e-4951-a464-0b8b5c1bfe03', request correlation id '2164dd04-897e-4951-a464-0b8b5c1bfe03'

    SOLUTION

    There was an Azure policy applied at the subscription scope which checked that 5 specific tags are present on every resource and resource group within the subscription. This policy was evaluating and returning a failure stating that one of the mandatory tags did not contain a valid value.

    Unfortunately, the only way we could ascertain this was to delete the failed Databricks workspace, delete the preconfigured NSG and subnets created within our VNET for Databricks, and then attempt to create the workspace manually via the Azure portal. The deletion was necessary because, when you select the option to create the workspace in your own VNET, the Azure portal deployment requires you to define new subnets and an NSG rather than letting you reference pre-existing ones.

    It was at this stage, during the portal's pre-deployment validation, that the Azure policy check failed, which identified the policy causing the issue. Once the policy issue was resolved, we recreated the Databricks subnets within our VNET, created the NSG with the rules required for Databricks, and lastly recreated the NSG associations with the subnets. All of this was done using the same Terraform configurations we used initially.

    With the VNET configured back to what it was before the deletions, we applied our Terraform configuration again to recreate the workspace, and this time it succeeded.

    @Taylor Russell, you can accept this answer now, and if you have something to add please do!

    Regards


2 additional answers

  1. Taylor, Russell 26 Reputation points
    2024-08-14T15:00:29.82+00:00

    Posting this here as a solution for anyone else that may come across this issue.

    The error posted in the activity log appears to be a generic error which could occur for any number of reasons, and is not necessarily related to RBAC permissions or VNET configuration.

    Due to concerns with live applications being impacted we did not try the suggestion posted about de-registering the Microsoft.Network provider.

    In our particular case, there was an Azure policy applied at the subscription scope which checked that we have 5 specific tags on every resource and resource group within the subscription. This policy was evaluating and returning a failure stating that one of the mandatory tags did not contain a valid value.

    Unfortunately, the only way we could ascertain this was to delete the failed Databricks workspace, delete the preconfigured NSG and subnets created within our VNET for Databricks, and then attempt to create the workspace manually via the Azure portal. The deletion was necessary because, when you select the option to create the workspace in your own VNET, the Azure portal deployment requires you to define new subnets and an NSG rather than letting you reference pre-existing ones.

    It was at this stage, during the portal's pre-deployment validation, that the Azure policy check failed, which identified the policy causing the issue. Once the policy issue was resolved, we recreated the Databricks subnets within our VNET, created the NSG with the rules required for Databricks, and lastly recreated the NSG associations with the subnets. All of this was done using the same Terraform configurations we used initially.

    With the VNET configured back to what it was before the deletions, we applied our Terraform configuration again to recreate the workspace, and this time it succeeded.
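
    A tag problem of this kind can sometimes be caught before apply by checking the planned tags locally. The following is a minimal Python sketch of that idea, not something that was done in this thread. It assumes the plan has been exported with terraform show -json and reads the tags from resource_changes[].change.after. The tag names and allowed values in REQUIRED_TAGS are hypothetical placeholders, since the thread does not say which five tags or which values the policy enforced.

```python
# Hypothetical stand-in for the tag policy: tag name -> set of allowed values,
# or None if any non-empty value is accepted. Replace with your real policy rules.
REQUIRED_TAGS = {
    "applicationName": None,
    "costCenter": None,
    "department": None,
    "environment": {"DEV", "QA", "PROD"},
    "systemOwner": None,
}


def find_tag_violations(plan, required_tags=REQUIRED_TAGS):
    """Return (resource address, tag name, problem) tuples for a parsed plan.

    `plan` is the dict obtained by loading the output of `terraform show -json`.
    """
    violations = []
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after") or {}
        # Resources being destroyed, or types without a tags attribute, are skipped.
        if "tags" not in after:
            continue
        tags = after["tags"] or {}
        for name, allowed in required_tags.items():
            value = tags.get(name)
            if not value:
                violations.append((rc["address"], name, "missing or empty"))
            elif allowed is not None and value not in allowed:
                violations.append(
                    (rc["address"], name, f"value {value!r} not allowed")
                )
    return violations
```

    To use it, run terraform plan -out=tfplan, then terraform show -json tfplan > plan.json, load plan.json with json.load and pass the result to find_tag_violations. A check like this only sees resources that appear in the plan, so it would not cover anything Azure creates on your behalf during workspace provisioning, and it does not replace the policy evaluation itself.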

    TL;DR - The error handling in the Azure platform when a Databricks workspace fails to provision is not robust enough. The generic error in the activity log gave no hint at all that the issue lay in an Azure policy which was preventing successful provisioning.


  2. Konstantinos Passadis 19,591 Reputation points MVP
    2024-08-08T22:04:51.82+00:00

    Hello @Taylor Russell

    The error message "GatewayAuthenticationFailed" indicates that there was a problem with authenticating your request to create a private endpoint in Azure.

    First, try unregistering and re-registering the Microsoft.Network provider.

    I would change the Terraform role to Owner/Contributor and then start narrowing down the privileges.

    Is Terraform a Contributor?

    Start with these and let us know please!

    --

    I hope this helps!

    Kindly mark the answer as Accepted and Upvote in case it helped!

    Regards

