Understanding the DevOps for data capability

Note

Data playbook capabilities: The data playbook defines a set of capabilities, which are conceptual building blocks used to build data-related solutions. See Defining data capabilities for the full set of capabilities defined in the playbook.

DevOps can be defined as the union of people, process, and products to enable continuous delivery of value to the business. It's an iterative process of "Developing", "Building & Testing", "Deploying", "Operating", "Monitoring and Learning" and "Planning and Tracking".

The application of DevOps principles to data can be understood through two concepts: data pipelines and CI/CD pipelines.

Understanding data pipelines vs CI/CD pipelines

  • Data pipeline: Also called "value" pipelines, data pipelines convert raw data into meaningful information, thus delivering value to the business. Data engineers generally own the data ingestion, transformation, and sharing processes that are part of data pipelines, and are responsible for coding the business requirements into them.

  • CI/CD pipeline: CI/CD pipelines are one of the most critical parts of DevOps in general. In the context of data systems, CI/CD pipelines continuously update data pipelines in different environments as new ideas are developed, tested, and deployed to Production. The CI/CD pipeline is often called an "innovation" pipeline because it enables the change process. The Platform automation and operations team typically owns the maintenance of CI/CD pipelines.

Here we have two different orchestrators to maintain: one for the data pipeline and one for the CI/CD pipeline. The testing strategy for the two pipelines differs as well. For the data pipeline, the data changes while the pipeline code is fixed. Conversely, for the CI/CD pipeline, the data is fixed while the code being delivered changes.

How to use Infrastructure as Code (IaC) for data services

Automated provisioning:

  • Automate the provisioning of data infrastructure and platforms using IaC.
  • Enable the dynamic scaling of resources based on data processing demands.

Environment consistency:

  • Ensure consistency across development, testing, and production environments using IaC.
  • Automate the creation and configuration of databases, data warehouses, and analytics platforms.

How to use CI/CD for data

Continuous Integration (CI) is a DevOps practice of merging the code changes from different contributors to a centralized repository, where the automated builds and tests are then run. CI helps in identifying bugs or issues early in the development lifecycle when they're easier and faster to fix.

Continuous delivery (CD) builds upon continuous integration (CI). CD is the process of taking the build artifacts and deploying them to different environments, such as QA and Staging. CD helps in testing the new changes for stability, performance, and security.

Figure: Continuous Delivery following Continuous Integration.

Automated data pipelines:

  • Implement automated data pipelines for seamless extract, transform, and load (ETL) processes.
  • Use CI/CD practices to version-control and deploy changes to data pipelines.

Version control for data artifacts:

  • Apply version control to data-related artifacts, including models, transformations, and analytics code.
  • Ensure traceability and repeatability of changes to data artifacts.
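
One way to make traceability concrete is to record a content fingerprint for every artifact alongside the commit that produced it. The following Python sketch is an illustration only, not part of the playbook's samples: `fingerprint`, `build_manifest`, and the manifest layout are hypothetical names chosen for this example.

```python
import hashlib
import json


def fingerprint(content: bytes) -> str:
    """Return the SHA-256 hex digest of an artifact's bytes."""
    return hashlib.sha256(content).hexdigest()


def build_manifest(artifacts: dict[str, bytes], commit: str) -> str:
    """Build a deterministic JSON manifest that ties artifact hashes to a commit.

    `artifacts` maps a (hypothetical) artifact name to its raw bytes.
    Sorting the keys keeps the output identical for identical inputs.
    """
    entries = {name: fingerprint(data) for name, data in sorted(artifacts.items())}
    return json.dumps({"commit": commit, "artifacts": entries}, indent=2, sort_keys=True)
```

Storing such a manifest with each release lets a later run confirm that the deployed transformation code or model is byte-for-byte the one that was tested.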

Automated testing:

  • Implement automated testing for data processes to ensure data quality and reliability.
  • Incorporate testing into the CI/CD pipeline for efficient and reliable releases.
  • Include testing in both the value pipelines (test for data) and the innovation/delivery pipelines (test for code).
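
The difference between testing code and testing data can be sketched in Python. This is a minimal illustration under stated assumptions: `clean_readings`, `check_quality`, and the record fields are hypothetical and do not come from the playbook's samples. The first test runs the code against fixed input, as a CI/CD pipeline would. The second function validates each batch that the deployed code produces, as a data pipeline would.

```python
from typing import Iterable


def clean_readings(rows: Iterable[dict]) -> list[dict]:
    """Hypothetical transformation: drop rows without a sensor id, normalise status."""
    cleaned = []
    for row in rows:
        if not row.get("sensor_id"):
            continue
        cleaned.append({**row, "status": str(row.get("status", "")).strip().lower()})
    return cleaned


def test_clean_readings_drops_rows_without_sensor_id() -> None:
    """Test for code: the input is fixed, so the expected output is known exactly."""
    rows = [
        {"sensor_id": "a1", "status": " Present "},
        {"sensor_id": None, "status": "x"},
    ]
    assert clean_readings(rows) == [{"sensor_id": "a1", "status": "present"}]


def check_quality(rows: list[dict], allowed_status: set[str]) -> list[str]:
    """Test for data: the code is fixed, so each batch is checked against rules."""
    problems = []
    if not rows:
        problems.append("batch is empty")
    bad = [r for r in rows if r["status"] not in allowed_status]
    if bad:
        problems.append(f"{len(bad)} row(s) with unexpected status")
    return problems
```

A non-empty result from `check_quality` would typically fail the run or raise an alert, while a failing unit test would block the release.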

Implementing workflow orchestration

Workflow automation:

  • Orchestrate end-to-end data workflows, integrating various tools and processes seamlessly. See Data: Data Orchestration for details.
  • Implement workflow automation for scheduling and coordinating data processing tasks.
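
At its core, coordinating data processing tasks means running each task only after the tasks it depends on. The sketch below shows that idea with Python's standard `graphlib` module. It is not how any particular orchestrator is implemented, and `run_workflow` and the task names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable


def run_workflow(
    tasks: dict[str, Callable[[], None]],
    depends_on: dict[str, set[str]],
) -> list[str]:
    """Run every task after all of its dependencies and return the run order.

    `depends_on` maps a task name to the names it must wait for. Every name
    it mentions is assumed to be a key of `tasks`. A dependency cycle raises
    graphlib.CycleError before any task runs.
    """
    graph = {name: depends_on.get(name, set()) for name in tasks}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        tasks[name]()
    return order
```

For example, with tasks named `ingest`, `transform`, and `publish`, where `transform` depends on `ingest` and `publish` depends on `transform`, the function runs them in that order regardless of how the dictionary is written. Production orchestrators add scheduling, retries, and parallel execution on top of this ordering.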

Using monitoring and feedback

Real-time monitoring:

  • Implement tools for continuous monitoring of data pipelines, databases, and analytics platforms.
  • Monitor data quality, data integrity, and data security in real-time.

Alerting and feedback mechanisms:

  • Set up automated alerts to notify teams of anomalies, errors, or performance degradation.
  • Create feedback loops to aid communication between development, operations, and other relevant teams.
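
An automated alert usually comes down to comparing run metrics against thresholds. The following Python sketch assumes a hypothetical `RunMetrics` record and threshold values chosen by the team. How the resulting messages reach people (email, chat, incident tooling) is left out.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    """Hypothetical metrics collected after one pipeline run."""
    pipeline: str
    rows_in: int
    rows_out: int
    duration_s: float


def evaluate(metrics: RunMetrics, max_duration_s: float, max_drop_ratio: float) -> list[str]:
    """Return one alert message per threshold that the run exceeded."""
    alerts = []
    if metrics.duration_s > max_duration_s:
        alerts.append(
            f"{metrics.pipeline}: run took {metrics.duration_s:.0f}s "
            f"(limit {max_duration_s:.0f}s)"
        )
    if metrics.rows_in > 0:
        dropped = 1 - metrics.rows_out / metrics.rows_in
        if dropped > max_drop_ratio:
            alerts.append(f"{metrics.pipeline}: {dropped:.0%} of rows dropped")
    return alerts
```

An empty list means the run stayed within limits. Keeping the thresholds as parameters lets each environment tune them without changing the check itself.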

How to use configuration management

Parameterization:

  • Use parameterized configurations to make IaC adaptable to different scenarios, environments, and data service requirements.
  • For data pipelines, consider metadata-driven approaches to make the pipelines more flexible and adaptable.
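
Both ideas can be sketched in a few lines of Python. The settings, environment names, and source entries below are hypothetical values for illustration. In a real project the resolved settings would be handed to the IaC tool, and the metadata would typically live in a configuration file or control table.

```python
from copy import deepcopy


def merge(base: dict, override: dict) -> dict:
    """Return `base` with `override` applied recursively; inputs are not mutated."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


# Hypothetical shared defaults plus per-environment overrides.
BASE = {"location": "westeurope", "sql": {"tier": "Basic", "backup_days": 7}}
OVERRIDES = {
    "dev": {},
    "prod": {"sql": {"tier": "Standard", "backup_days": 35}},
}


def settings_for(env: str) -> dict:
    """Resolve the parameters for one environment."""
    return merge(BASE, OVERRIDES[env])


# Metadata-driven pipeline: one generic routine driven by a list of entries.
SOURCES = [
    {"name": "sensors", "path": "raw/sensors", "format": "json"},
    {"name": "locations", "path": "raw/locations", "format": "csv"},
]


def plan_ingestion(sources: list[dict], env: str) -> list[str]:
    """Describe the load step for each source entry instead of hard-coding it."""
    return [
        f"{env}: load {s['format']} from {s['path']} into {s['name']}"
        for s in sources
    ]
```

Adding a new source then means adding one metadata entry rather than writing a new pipeline, and adding an environment means adding one override block.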

Handling sensitive data:

  • Keep secrets in separate configuration files that aren't checked in to the repo.
  • Add such files to .gitignore to prevent them from being checked in.
  • Where possible, use Azure Key Vault to store and manage secrets.
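
Whichever store holds the secrets, application code should read them at run time rather than embed them. The sketch below assumes the secret has been exposed to the process as an environment variable, for example by the CI/CD system or by a local file that is excluded from the repo. The names `get_secret` and `MissingSecretError` are hypothetical, and retrieving the value from Azure Key Vault itself is not shown.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret has not been provided to the process."""


def get_secret(name: str) -> str:
    """Read a secret from the environment and fail loudly if it is absent.

    The error message names the variable but never includes a secret value.
    """
    value = os.environ.get(name)
    if not value:
        raise MissingSecretError(f"secret {name!r} is not set")
    return value
```

Failing at startup when a secret is missing is usually preferable to falling back to a default, because a default value tends to end up checked in.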

Implementing security and compliance

Automated security measures:

  • Implement automated security measures, including access controls, encryption, and identity management.
  • Integrate security checks into the CI/CD pipeline for proactive security measures.

Compliance checks:

  • Automate compliance checks to ensure that data processing adheres to regulatory requirements.
  • Implement automated audits and reporting for compliance purposes.
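
As an illustration of an automated compliance check, the sketch below scans table metadata for columns flagged as personal data that are neither masked nor encrypted. The policy, the metadata layout, and the function names are assumptions made for this example. Real regulatory requirements would determine the actual rules.

```python
def find_violations(tables: list[dict]) -> list[str]:
    """Return 'table.column' for each PII column lacking masking or encryption."""
    violations = []
    for table in tables:
        for col in table["columns"]:
            protected = col.get("masked") or col.get("encrypted")
            if col.get("pii") and not protected:
                violations.append(f"{table['name']}.{col['name']}")
    return violations


def audit_report(tables: list[dict]) -> dict:
    """Summarise the check in a form that can be stored or published."""
    violations = find_violations(tables)
    return {
        "checked_tables": len(tables),
        "violations": violations,
        "compliant": not violations,
    }
```

Running such a check as a CI/CD step and archiving the report gives both an automated gate and an audit trail.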

Learn more about DevOps for data in Microsoft Fabric

Implementations

Parking sensors sample

The MDW Repo: Parking Sensors sample covers the end-to-end implementation of various characteristics of "DevOps for Data". It demonstrates how DevOps principles can be applied to end-to-end data pipeline solutions built according to the Modern Data Warehouse pattern. The sample covers the following areas:

  • Bicep-based IaC deployment of Azure data services.
  • GitHub integration with Azure Data Factory (ADF).
  • Build and Release (CI/CD) pipelines.
  • Environment variables and parameterization using Azure DevOps variable groups.
  • Automated testing of data pipelines, including unit and integration tests.
  • Manual approval gates for release pipelines.

ADF CI/CD auto publish

The MDW Repo: ADF CI/CD auto-publish sample demonstrates how to deploy Azure Data Factory (ADF) using the automated publish method. Usually, ADF deployment requires a manual publish step, in which the developer publishes the ADF changes from the portal. This step generates the ARM templates that are used in the deployment steps. The sample eliminates the manual publish by using the publicly available npm package @microsoft/azure-data-factory-utilities for automated publishing. Also see the official documentation on CI/CD in ADF and Automated publishing for CI/CD.

IaC deployment samples for secured networks

The following table contains IaC samples for deploying various Azure data services within a secure network configuration. The code samples are authored in Bicep, but they can be adapted for use with Terraform.

Azure Data Service | IaC Code Sample
Azure Synapse Analytics | Azure Synapse VNet recipe
Azure Databricks | Azure Databricks VNet recipe
Microsoft Purview | Microsoft Purview VNet recipe
Azure Data Factory | Azure Data Factory VNet recipe

Examples

Review the following options for preparing your sandbox environments.

Using sandbox environment options

For data solutions, sandbox environments generally need extra preparation steps, which depend on the Azure services involved.

Data Service | Sandbox Environment Options
Data Lake Gen2 Storage | A common sandbox file system can be created, and each developer can then create their own folder within this file system.
Azure SQL or SQL Data Warehouse | A transient database (restored from DEV) can be spun up per developer on demand.
Azure Synapse Analytics | Git integration allows developers to make changes to their own branches and debug runs independently.

How to use version control

Azure data services vary in their approaches to version control. The following table outlines available options for several commonly used Azure data services.

Azure Data Service | Documentation Link
Azure Data Factory | DevOps in ADF
Azure Synapse Analytics | DevOps in Synapse Analytics
Azure Databricks | DevOps in Azure Databricks
Azure SQL or SQL Data Warehouse | DevOps in Azure SQL

Learn how to publish data artifacts

Azure Artifacts can be used alongside Azure Pipelines for deploying packages, publishing build artifacts, or integrating files across pipeline stages. The following table contains links to various options for publishing data artifacts.

Artifact Name | Applicable To | Example
SQL DACPAC | Azure SQL Database | Publishing SQL DACPAC
Python Wheel | Python | Creating and publishing wheel distribution package
Databricks Notebook | Azure Databricks | Publishing Databricks notebook

Perform unit testing

  • Apache Spark: The MDW Repo: Azure Databricks and MDW Repo: Azure Synapse samples showcase how to execute unit tests for data transformation code written in Apache Spark. They encapsulate the business logic in a Python wheel package and keep the data access code in the notebook that loads the package.
  • Azure Stream Analytics (ASA): The MDW Repo: Azure Stream Analytics sample showcases how to execute unit tests for ASA.
  • Data Factory testing framework: This stand-alone Data Factory testing framework allows writing unit tests for Data Factory pipelines on Microsoft Fabric and Azure Data Factory.
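
The separation described for the Apache Spark samples, with business logic in a package and data access in the notebook, can be sketched as follows. To stay self-contained, the example uses plain Python records instead of Spark DataFrames, and `occupancy_rate` and `run` are hypothetical names rather than code from the samples.

```python
from typing import Callable, Iterable


# Would live in the wheel package: pure logic, no I/O, easy to unit test.
def occupancy_rate(rows: Iterable[dict]) -> float:
    """Return the share of rows whose status is 'present' (0.0 for no rows)."""
    rows = list(rows)
    if not rows:
        return 0.0
    occupied = sum(1 for r in rows if r["status"] == "present")
    return occupied / len(rows)


# Would live in the notebook: data access only, delegating to the package.
def run(read: Callable[[], Iterable[dict]], write: Callable[[float], None]) -> None:
    """Read input, apply the packaged logic, and write the result."""
    write(occupancy_rate(read()))
```

Because the logic takes its input as an argument, a unit test can pass a small in-memory list and assert on the result without any storage account or cluster. The thin `run` wrapper is then covered by integration tests.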

Perform integration testing

For more information