How to ship Azure Databricks artifacts from Dev->QA->Prod through Azure DevOps Pipelines?

Cataster 641 Reputation points
2024-04-29T21:17:03.8233333+00:00

We have an Azure Databricks workspace and Dev/QA/Prod environments. Every time the data engineers ship artifacts from nonprod -> prod (e.g. Python notebooks, config modules, etc.), they have to copy the artifacts manually to the next environment and change the paths in the config files so they work correctly in the new environment.

It's a hassle, and they see an opportunity to reduce toil here by leveraging Azure DevOps pipelines.

How do we create a pipeline around this to migrate these artifacts? What is the best practice? We've considered, for example, storing the environment-specific values in Azure Key Vault and generalizing the config paths so they get replaced dynamically at pipeline runtime. But is that really the best way? Is there a better, perhaps newer, method that makes this even easier? That's what I'm trying to understand here, so we can do this the best way possible in 2024.

Here's a sample Databricks workspace structure for the current setup, as well as an example config file with the hardcoded paths that always have to change for each new environment.

Databricks Workspace structure:

Workspace
->Shared
-->Demo
-->Metrics Engine
-->modules
--->_resources
--->test
--->example.ipynb
--->mod_config.ipynb
--->mod-schema.ipynb

mod_config.ipynb has some hardcoded paths like this (they start with 'abfss'):

config = {
    ConfigurationKeys.ROOT_PATH + Constants.FileFormats.CSV: ConfigEntry(None, 'abfss://companyxyzanalyticsdev@companyxyzdatageneraldev.dfs.core.windows.net/source/companyxyz/extracts/sampledb'),
    ConfigurationKeys.ROOT_PATH + Constants.FileFormats.PARQUET: ConfigEntry(None, 'abfss://companyxyzanalyticsdev@companyxyzdatageneraldev.dfs.core.windows.net/sink/tables/parquet/'),
    ConfigurationKeys.OUTPUT_PATH: ConfigEntry(None, 'abfss://companyxyzinternal@companyxyzdatageneraldev.dfs.core.windows.net/data-projects/internal/data-regression-analysis/resultset'),
    ConfigurationKeys.RELATIVE_PATH_DATALAKE_TABLES_TRANSACTIONS: ConfigEntry(Constants.FileFormats.CSV, DataLake.RelativePaths.SourceTables.Transactions),
    ConfigurationKeys.RELATIVE_PATH_DATALAKE_TABLES_REGIONS: ConfigEntry(Constants.FileFormats.CSV, DataLake.RelativePaths.SourceTables.Regions),
    .............

So ideally, at pipeline runtime the paths would be changed to the QA ones and ultimately the Prod ones, because right now, as you can see, they're Dev-specific and normally have to be updated by hand after the artifacts are copied manually to the other environments, which, as mentioned, is a hassle. Maybe a transformation could be done in the notebooks, or the paths could be generalized and the environment-specific values stored in Key Vault, etc.; whatever the best approach would be.
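
For example, something along these lines is roughly what I imagine a generalized config could look like (just a sketch; the DEPLOY_ENV variable and the assumption that our resource names differ only by an environment suffix are placeholders, not our actual code):

```python
# Sketch only: the environment name would be injected at runtime (cluster
# environment variable, Databricks widget, or a value set by the pipeline)
# instead of being hardcoded per workspace.
import os

env = os.environ.get("DEPLOY_ENV", "dev")        # "dev", "qa" or "prod"
storage_account = f"companyxyzdatageneral{env}"  # assumes names differ only by the suffix
analytics_container = f"companyxyzanalytics{env}"

csv_root = (
    f"abfss://{analytics_container}@{storage_account}"
    ".dfs.core.windows.net/source/companyxyz/extracts/sampledb"
)
```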

FYI, the Databricks repo is hosted in Bitbucket (though it could be changed if that's needed and makes things easier).


1 answer

  1. PRADEEPCHEEKATLA-MSFT 78,986 Reputation points Microsoft Employee
    2024-04-30T09:48:10.91+00:00

    @Cataster - Thanks for the question and for using the MS Q&A platform.

    To migrate artifacts from Dev to QA to Prod environments in Azure Databricks, you can leverage Azure DevOps pipelines. Here are some steps you can follow:

    • Create a new Azure DevOps pipeline and configure it to connect to your Bitbucket repository (see the pipeline sketch after this list).
    • In the pipeline, add a task that copies the required artifacts from the Dev environment to the QA workspace. You can use the Databricks CLI or the workspace REST API to do the copy.
    • Add another task (or stage) that promotes the artifacts from QA to Prod.
    • To handle the hardcoded paths in the config files, you can store the environment-specific values in Azure Key Vault and have the pipeline substitute them for the hardcoded paths at runtime.
    • You can also consider using a Databricks notebook (or a small script) to transform the config files, for example one that takes the Dev config file as input and outputs the QA or Prod config file with the environment-specific values (see the rewrite sketch further below).
    • Finally, use the Databricks CLI or REST API to import the updated config files into the target Databricks workspace.
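
    As a rough sketch of how the pieces above could fit together, a multi-stage YAML pipeline can check out the Bitbucket repo through a service connection, pull per-environment values from a Key Vault-backed variable group, and import the notebooks with the Databricks CLI. The service connection, variable group, secret names, and paths below are placeholders, not values from your setup:

    ```yaml
    # Sketch only: adjust names, paths, and triggers to your environment.
    trigger: none   # e.g. run manually or from a release branch

    resources:
      repositories:
        - repository: databricks_code
          type: bitbucket
          endpoint: bitbucket-service-connection   # service connection to Bitbucket Cloud
          name: yourteam/databricks-repo

    stages:
      - stage: DeployQA
        variables:
          - group: databricks-qa   # variable group linked to the QA Key Vault
        jobs:
          - job: Deploy
            pool:
              vmImage: ubuntu-latest
            steps:
              - checkout: databricks_code
              - script: pip install databricks-cli
                displayName: Install Databricks CLI
              - script: python ci/rewrite_config.py qa   # placeholder script; see the sketch further below
                displayName: Rewrite Dev paths for QA
              - script: |
                  databricks workspace import_dir --overwrite \
                    ./modules /Shared/modules
                displayName: Import notebooks into the QA workspace
                env:
                  DATABRICKS_HOST: $(databricks-qa-host)    # from the Key Vault-backed variable group
                  DATABRICKS_TOKEN: $(databricks-qa-token)  # secret; never hardcode tokens
      # A DeployProd stage would repeat the same job with the prod variable group,
      # optionally gated behind an environment approval.
    ```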

    Overall, the best approach will depend on your specific requirements and constraints. However, using Azure DevOps pipelines and Azure Key Vault to automate the migration of artifacts and handle environment-specific variables is a common and effective approach.
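
    For the config transformation step specifically, a small script that the pipeline runs before the import could rewrite the Dev-specific parts of the abfss paths for the target environment. A minimal sketch, assuming the storage account and container names differ only by an environment suffix (the file path and mapping are illustrative, not your actual repo layout):

    ```python
    # rewrite_config.py -- sketch of a pre-import step run by the pipeline.
    # Assumes Dev and target resources differ only by an environment suffix.
    import pathlib
    import sys

    def rewrite_config(path: pathlib.Path, replacements: dict) -> None:
        """Swap Dev-specific substrings in the config notebook for the target ones."""
        text = path.read_text()
        for old, new in replacements.items():
            text = text.replace(old, new)
        path.write_text(text)

    if __name__ == "__main__":
        target_env = sys.argv[1]  # "qa" or "prod", passed in by the pipeline
        replacements = {
            "companyxyzdatageneraldev": f"companyxyzdatageneral{target_env}",
            "companyxyzanalyticsdev": f"companyxyzanalytics{target_env}",
        }
        rewrite_config(pathlib.Path("modules/mod_config.ipynb"), replacements)
    ```

    Alternatively, the environment-specific values could be kept out of the notebooks entirely and read at runtime from an Azure Key Vault-backed secret scope, so nothing needs to be rewritten during deployment.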

    You mentioned that the Databricks repo is hosted in Bitbucket (though it could be changed if needed).

    In that case, you can still use Azure DevOps pipelines to automate the migration of artifacts from Dev to QA to Prod environments in Azure Databricks. You can configure the pipeline to connect to your Bitbucket repository and use the Databricks CLI or REST API to copy the artifacts and perform the necessary transformations. You can also use Azure Key Vault to store the environment-specific variables and replace the hardcoded paths in the config files during pipeline runtime.

    Hope this helps. Do let us know if you have any further queries.