Modern Data Warehouse backlog guidance

This content covers a set of best practices, strategies, and recommendations for managing and prioritizing the backlog of tasks and projects related to the development and maintenance of a modern data warehouse.

Understanding objectives

Here is guidance on how to support creation of a foundational backlog based on the Parking Sensor sample. The backlog can be extended to different services and other requirements that target other use cases.

The aim is NOT to create a detailed task list, but a high-level breakdown of the main areas to target. These areas may be extended to other areas that are not implemented in the Modern Data Warehouse main sample.

The elements that are included in the Parking Sensor sample are identified with a sample tag while the others are tagged with extension referring to possible extensions.

Understand the project scope

Define the scope of your project from the below options:

Determine the location

  • sample - Cloud
  • extension - Edge computing
  • extension - Cloud and edge computing

Decide on the mode

  • sample - Batch
  • extension - NRT (Near Real Time)
  • extension - Batch and NRT (Near Real Time)

The scope of the sample is Cloud + Batch, however there is an opportunity to extend to other fields applying similar techniques and concepts.

Identify the operational foundation

The areas that concern the sample are listed as shown below, and these areas can be extended as identified by the ext tag.

Define the architectural components

The architecture should include all or a subset of the components described in the following sections.

Define the goal for architectural components

The goal of the Architectural Components area in the sample solution is to mirror the necessary architecture components into the deployment for the scope defined.

Understand the architectural components considerations

Data lake Storage Service aspects to consider:

  • sample/extension: Plan for the storage of the Data Lake by considering the structure your lake will have to adopt.
  • sample: The need for one or more Storage accounts for the Data Lake function. The split into different accounts or containers might depend on the environment, security, cross-charge, or organizational requirements.
  • extension: Define the partition strategy in the Data Lake that most benefits usage of the data in the use case at hand.
  • sample/extension: Define what is the default file format that is going to be used to ingest the data and land in the Data Lake (for example: parquet, delta, json, etc.).
  • sample: Consider the need of a storage account for internal use of some services (Azure Synapse for example).

Databases

Other databases can be considered in the process, either as a data source or as a Data Warehouse database to serve the final reporting layer.

  • sample: Azure Synapse SQL Dedicated Pool, used to feed the reporting layer.
  • extension: Any other database (operational, historical, business, etc.) that needs to be ingested to contextualize or transform the data that lives in the data lake.
  • extension: Use another database that would be used in the serving layer for reporting purposes.

Data Governance can be composed of different tools and integrations:

  • extension: Microsoft Purview service and artifacts like: Glossary initial import, Metadata model assets initial import, Custom entities initial import.
  • extension: Other Data Catalogs might or might not include integration/federation with Microsoft Purview.

Data Transformation

  • sample: Azure Synapse service (Synapse workspace and selected pools), and artifacts (Data sources, Linked Services, Scripts, Pipelines, Notebooks, Triggers, Packages (wheel/jar)).
  • *sample: Azure Databricks service (Workspace, cluster) and artifacts (Notebooks, Scripts, Packages).
  • *sample: Azure Data Factory and Azure Synapse pipelines can also be used as a mechanism to perform data transformation beyond notebook calls.
  • extension: Spark Job definitions could be included in the artifact lists if there is a requirement or need for it.

Data pipeline Orchestrator

  • sample: Azure Data Factory service
  • sample: Azure Synapse pipelines
  • extension: ADB--Jobs
  • extension: ADB--Delta Live Tables

Observability

  • sample: Infrastructure observability
    • Monitoring default metrics with Azure Monitor
      • Synapse: pipeline and pool metrics
  • sample: Data pipeline observability
    • More observability can be implemented using the integration of certain libraries like Great Expectations and App Insights at the notebook level.

Secret Management

  • sample: Azure Key vault, a secret management service, is necessary to store all the sensitive information at deployment time that can be later accessed for the CI/CD process or for the data pipelines functionality.
  • sample: In order to simplify the security layer, Integrated Security should be considered whenever possible.

Serving Data Layer

  • extension: Power BI service (chances are that PBI is already enabled in the organization).
  • sample: Synapse artifacts like scripts or exploration notebooks.
  • sample: DataBricks scripts or exploration notebooks.
  • extension: Third-party tools as required per the Customer.

Security

  • extension: Collect Security Customer requirements. Be sure to cover aspects such as:
    • Access control requirements.
    • Privacy requirements (or other compliance requirements).
    • Network requirements.
  • After collecting all security requirements, consider each architecture component and:
    • sample/extension: Implement access management automation through a combination of control plane, ACLs, and data plane (where appropriate) of each. Note: in sample, just some RBAC roles are assigned in the context of the user that deploys the sample, but ACLs are not part of the implementation.
    • extension: Implement vNet integration automation when network isolation of the components is required.
    • extension: Implement specific data protection mechanisms when required like data encryption or data obfuscation.

Services integration

Consider how to integrate the deployed services in the IaC process. Examples might include:

  • extension: Azure Synapse and Microsoft Purview integration.
  • extension: Azure Synapse and Power BI integration.

Define success criteria

  • Successfully deploy all the components from the architecture and initial artifacts in an automated way and end-2-end.
  • Cover all the environments needed - DEV, STG, PROD.
  • Successfully configure integration points between services in an automated way.

Design the Data Lake structure

This section refers to the design of a Data Lake.

Define the goal for Data Lake design

In Modern Data Warehouse architectures, data coming from other systems can land in a Data Lake within a pre-defined structure.

Understand the design considerations for Data Lake

Organize your data lake, according to these considerations:

  • sample: Define how many layers you need in the medallion architecture (normally three layers: Bronze, Silver, and Gold; it needs to fit your specific use case. Alternative zones can be used that might not need any transformation or pre-processing before they are used for the reporting logic).
  • extension: Explore the data formats and expected usage to decide which partitioning strategies to explore.
  • extension: Explore privacy and security requirements (CLS, RLS, Access Management, Control and Data Plane) and map the data lake design to the security need as required.
  • extension: Explore Data Quality requirements that might be impacted?affected? by the Data Lake design, like Data duplication.

Define success criteria

  • Benchmark ingestion and query times based on the selected partitioning strategies (or without it).
  • Agree on a final design for the data lake.
  • Successfully integrate the data lake initial deployment and configurations on the IaC process following the design decisions and structure made for the current use case.

Design the data warehouse structure

This section refers to the design of a classic data warehouse.

Define the design goals for data warehouse structure

In a modern Data Warehouse architecture, data coming from other systems can land in a Data Lake that must be well structured. As the data progresses through the layers, it might be required that it finally lands on a classic Data Warehouse. The Data Warehouse can be used as a serving layer for reporting or custom applications beyond the gold layer. The Data Warehouse needs to be designed to receive the transformed data into a required schema (star schema or snowflake schema for example) that successfully serves the end users’ needs.

Understand the design considerations for data warehouse

  • sample/extension: Understand the flow of the data from the data source to the Data Warehouse final sink/sync? and understand the transformations that need to be applied at each stage.

Define success criteria

  • Model the Data Warehouse schema that can effectively store and organize the transformed data.
  • Embed the schema creation into the IaC process.

Include CI/CD

Completely implement and integrate CI/CD.

Define the goals for CI/CD implementation

The goal of the CI/CD area in the sample solution is to fully implement the CI/CD process integrating with each and every component of the defined architecture.

Understand the design considerations for CI/CD

  • sample: Implement the Git Integration with each applicable component of the solution (only on DEV).
  • sample: Define the branching strategy for the development of the overall solution. Consider each component of the solution individually as different services might work differently and need a different approach. For example, in the sample solution ADF and Azure Synapse use the concept of collaboration and publish branches to promote the artifacts between environments.
  • sample: Consider what validations and builds need to be triggered with a PR for each environment stage. For example: unit tests, linting, code wheel or jar builds and dacpac builds.
  • sample: Define the approval gates between environments; they might be manual.
  • sample: Define necessary Perform testing between environments:
    • sample: Unit testing (at the package level that is imported in the notebook)
    • sample: Integration testing (triggered between STG and PROD environment)
    • extension: Notebook testing
    • extension: Loading testing
    • extension: Performance testing
  • extension: Consider implementing CI/CD for the reporting layer (report development), it might not be in scope.
  • extension: Consider implementing CI/CD for the Governance layer (if in scope).

Define success criteria

Successfully implement the automation of the CI/CD process (devops pipeline) that:

  • Deploys and updates all the services in the architecture and all the included artifacts.
  • Updates services and artifacts between all the environments needed (from DEV to STG, from STG to PROD).
  • Implements the necessary approval gates (from DEV to STG, from STG to PROD).
  • Incorporates data schema changes and versioning handling into the CI/CD process. Note: this step might be done later in the process, or it might be omitted if versioning is not in scope.
  • Successfully test the end-2-end deployment and promotion between environments.

This section refers to different data related operations/stages.

Design data ingestion

This section refers to the data ingestion goals and design considerations.

Understand and achieve goals

To achieve your goal understanding the data at the source, ask the customer the following questions:

  • Which sources exist?
  • What are the expected formats? Parquet, Delta, csv, avro, delta, txt, other?
  • Where is the data located? On-premises or in the cloud?
  • What is the latency requirement for ingestion? Batch, NRT?
  • Are there Privacy constraints or requirements to be aware of? What are they?
  • Are there other constraints or requirements to consider?

NOTE: It is imperative that the customer is available and fully onboard as early as possible in the process to answer these questions.

Understand design considerations

  • sample/extension: Based on format, location, and latency requirements, you might use different ingestion mechanisms.

  • Consider writing an ADR to explore different ingestion options and document pros and cons. Examples of ingestion mechanisms are:

    • sample ADF (Azure Data Factory),
    • sample Azure Synapse pipelines, Azure Data Share, Microsoft Purview In-place sharing, Azure Copy, etc.
  • extension: Consider whether the ingestion process is required to handle versioning of the schemas or datasets and make the appropriate changes in the data lake design or [Data Warehouse](#design-the-Data Warehouse-structure) design.

Define success criteria

  • Write an ADR and document the decision on the Ingestion option or options.
  • Equip the solution with one or a combination of ingestion mechanisms as decided in the ADR. Consider all formats and latencies from the Data Source list which aligned with the customer requirements.
  • Successfully automate the ingestion from all data sources to bronze layer as designed in data lake design.

Design data transformation

Data transformation involves data moving from unprocessed/raw state to processed/transformed data. This section refers to the goal and design considerations for data transformation.

Understand and achieve goals

The goal is to understand the data journey after landing in the bronze layer. Understand the final Data product and the personas using or iterating with the data in the silver and gold layers. In other words, define your data desired state for each layer and how is the data going to be used/explored.

Understand the design considerations for data transformation

  • sample/extension: Consider what format is going to be primarily used for the transformations (for example: parquet, delta, other). If the source data is in another format, you might consider transforming the original format to the expected file type and format for the silver and gold layers.
  • sample: If Data pipelines are being considered for the ingestion, consider the following:
  • extension: Consider investigating what optimization techniques could be applied (cluster, code, configuration).

Define success criteria

  • Decision on the format you want to land the data in the silver layer.
  • Decision on the format you want to land the data in the gold layer.
  • Define which personas will use the silver layer: data scientists, others and how the layer needs to be structured for such use cases.
  • Define which personas will use the gold layer, for example: business users, product owners, end users, and how the layer needs to be structured for such use case: Power BI report, REST API, Custom App, other.
  • Write transformation logic through notebooks and/or pipelines that implement current to desired state.
  • Unit test the packages and incorporate them in the DevOps pipeline.
  • Test the transformation logic end-2-end.
  • Benchmark results and overall configurations before and after applying isolated or combined optimization techniques.
  • Incorporate artifacts deployment (pipelines, packages, Data Sources definitions) into IaC and CI/CD processes.

Design data quality checks

Data quality is standard and included in the different stages of the data pipeline.

Define the goals for data quality

Incorporating data quality into the solution is a standard activity that should be incorporated in different stages of the data pipeline. They are driven by two different pillars:

  • extension: Standard DQ validations
  • extension: Business validations

Understand design considerations for data quality

The Standard dimensions of Data Quality (DQ) are summarized as follows:

  • Completeness
  • Uniqueness
  • Timeliness
  • Correctness
  • Consistency
  • Validity
  • Conformity

At any point of the Data pipeline lifecycle, all or a subset of these dimensions can be checked with rules.

Business validations can be considered as Data Quality validations in certain aspects. They are related to a use case. Consider some examples to think about the use case at hand and how it can benefit from business rules validations.

Consider for example:

  • extension: A data pipeline is working with unstructured image data and for the final report to have a certain level of quality, the images are required to have a minimum resolution. This property can be checked when ingesting the data, or later during the transformation phase.
  • extension: Large data gaps during data ingestion will affect the quality of the business Machine Learning models, leading to poor decision making.

Define success criteria

  • Decide which Data quality metrics should be implemented in the solution.
  • Implement the rules considering a DQ existing tool that can integrate with your pipeline or developing custom rules and checks in the pipeline.
  • Decide with the customer what remediation actions can respond to such rules and if they can be automated or not.

Design data monitoring and observability

This section refers to the goals and design considerations for data monitoring and observability.

Understand and achieve goals

Incorporate observability in the Data pipeline lifecycle. The aim is to report on health and progress of the ingestion, transformations, and serving processes.

Understand the design considerations

The approach to observability should be as comprehensive as possible by including:

  • sample: Infrastructure monitoring
  • sample: Data Pipelines health and progress metrics
  • sample: Notebook observability

Define success criteria

  • Identify and prioritize which metrics and failure points need to be monitored.
  • Implement/configure monitoring and observability for all components of the architecture (infrastructure). Azure Monitor can be applied for Azure native and first party components.
  • Implement observability using Application Insights existing integrations (with Synapse Spark for example).
  • Implement custom observability at the notebook level using appropriate libraries and frameworks. Application insights can also be used.
  • Write kql targeted queries that can be used to query the metrics and custom logs. The aim is to reveal useful information about the workloads that are being run.

Perform testing

Testing is usually performed at different points of the project.

Understand and achieve goals

The goal is to incorporate testing across the entire solution. Different types of testing might be included in different stages of the project.

Understand the design considerations

Consider if the solution you are designing might benefit from the following testing types. If so, include the relevant tasks in earlier activities:

  • extension: IaC testing
  • sample: Unit testing
  • extension: Notebook testing
  • sample: Integration testing
  • extension: Loading testing
  • extension: Performance testing
  • extension: End-To-End testing

Define success criteria

  • IaC testing (related to Define the architectural components)
    • Perform an end-2-end deployment of the IaC pipeline and guarantee that the results are consistent with the expected results: services deployed, access control and initial artifacts. It is not enough that the deployment runs without errors, it is necessary to test against the access.
  • Unit testing related to Include CI/CD.
    • Successfully incorporate unit testing against the wheel and jar packages to be included in the solution.
  • Notebook testing (related to Design data transformation)
    • Perform an ADR to investigate notebook testing options, weigh pros and cons and document the decision.
  • Integration testing related to Include CI/CD and Design data Transformation.
    • Successfully run integration tests under the scope of the release pipeline to test integration points between components.
  • Loading testing related to Perform testing
    • If ingestion times are required to meet certain thresholds, you are advised to benchmark and test the loading time for different partitioning strategies.
    • Agree on an acceptable threshold for the ingestion of the data sources.
    • Agree on the data volume that the benchmark should run against per data source.
    • Implement monitoring and alerting that detects when the threshold of the ingestion process exceeds the agreed upon limit.
  • Performance testing
    • If query times are required to meet certain thresholds, benchmarking and testing different types of queries is advised. The recommendation for benchmarking is to execute 3 query runs and use the average execution time. The runs should be executed against an agreed data set with a relevant data volume.
    • Agree on an acceptable threshold for the queries per query type.
    • Agree on the data volume that the benchmark should run against.
    • If the threshold of the ingestion process exceeds the agreed value, proceed with query, engine, or configuration optimizations. Testing should be repeated until the threshold is achieved.
  • End-2-End Testing
    • Successfully test the entire solution:
      • IaC deployment followed by,
      • a data Ingestion pipeline run, where observability is active and triggers alerts based on default thresholds. Select a low threshold to force the trigger to fired, followed by:
      • a release pipeline where unit, notebook and integration tests are triggered, followed by,
      • a performance test on pre-defined queries that log the query execution time and triggers an alert when performance is degrading from initial thresholds.