Making data pipelines work reliably

This article provides best practices for designing and building analytical data pipelines that work reliably.

Best practices for designing and building data pipelines

This section covers cross-cutting concerns and best practices for designing data pipelines for analytical data workloads.

Validate your data early in the pipeline

To ensure that your data meets the expected schema and quality standards before further processing, validate it between the bronze and silver zones of your data pipeline. Early validation helps prevent pipeline failures when the input data changes unexpectedly.

Don't validate data before it lands in the bronze zone of the data lake. The bronze datasets preserve a copy that is as close as possible to the source system data. You can use that copy to rerun pipelines when testing validation logic and to recover data (for example, when a bug in the transformation logic corrupts data).
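
As a minimal sketch of validation between the bronze and silver zones, the following example checks each raw record against an expected schema and separates valid rows from rejected ones instead of failing the whole run. The schema, column names, and the `validate_bronze_records` helper are assumptions made for illustration; a real pipeline would likely use its processing engine's own validation tooling.

```python
from dataclasses import dataclass, field

# Hypothetical schema for illustration: column name -> expected Python type.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "country": str}


@dataclass
class ValidationResult:
    valid: list = field(default_factory=list)
    rejected: list = field(default_factory=list)  # (record, reason) pairs


def validate_bronze_records(records, schema=EXPECTED_SCHEMA):
    """Split raw bronze records into valid rows and rejected rows with reasons."""
    result = ValidationResult()
    for record in records:
        missing = [col for col in schema if col not in record]
        if missing:
            result.rejected.append((record, f"missing columns: {missing}"))
            continue
        wrong_type = [
            col for col, expected in schema.items()
            if not isinstance(record[col], expected)
        ]
        if wrong_type:
            result.rejected.append((record, f"unexpected types: {wrong_type}"))
            continue
        result.valid.append(record)
    return result


# The bronze copy stays untouched; only valid rows move on to silver.
bronze = [
    {"order_id": 1, "amount": 9.5, "country": "DE"},
    {"order_id": 2, "amount": "n/a", "country": "FR"},  # type changed upstream
    {"order_id": 3, "country": "US"},  # column dropped upstream
]
result = validate_bronze_records(bronze)
```

Because the rejected rows are kept together with a reason, they can be inspected or replayed later from the bronze copy once the validation logic or the source is fixed.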

Make your data pipelines reproducible

When your data pipelines are reproducible, you can recover from errors by deploying code fixes and replaying the pipelines. If the pipelines are also idempotent, meaning that a rerun produces the same result as the first run, replaying them doesn't duplicate data.
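
One common way to make a replay safe is to overwrite a whole partition rather than append to it. The following sketch illustrates that idea under stated assumptions: a plain dictionary stands in for a partitioned table, and `write_partition` and `run_daily_load` are hypothetical names, not part of any specific product.

```python
def write_partition(target, partition_key, rows):
    """Replace the whole partition instead of appending to it.

    `target` is a dict standing in for a partitioned table (hypothetical).
    """
    target[partition_key] = list(rows)


def run_daily_load(source_rows, target, run_date):
    """Rebuild the partition for `run_date` from the source rows."""
    rows = [r for r in source_rows if r["date"] == run_date]
    write_partition(target, run_date, rows)


source = [
    {"date": "2024-01-01", "value": 10},
    {"date": "2024-01-01", "value": 20},
    {"date": "2024-01-02", "value": 30},
]
silver = {}
run_daily_load(source, silver, "2024-01-01")
run_daily_load(source, silver, "2024-01-01")  # replay, for example after a code fix
```

After the second run, the partition still holds two rows. An append-based write would have produced four.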

Make sure your data transformation code is testable

To effectively test your data transformation code, separate the logic from the code that accesses source datasets. This allows you to move data transformation code from notebooks into packages, and run tests more quickly and effectively.
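
The separation described above can be sketched as follows. The transformation is a pure function over in-memory rows, and reading the source is isolated in its own function. The column names and the function names `add_total_price` and `read_orders` are assumptions chosen for illustration.

```python
import csv
import io


def add_total_price(rows):
    """Pure transformation: no file, database, or network access."""
    return [
        {**row, "total": round(row["quantity"] * row["unit_price"], 2)}
        for row in rows
    ]


def read_orders(stream):
    """I/O boundary: parse CSV text into typed rows (hypothetical columns)."""
    return [
        {"quantity": int(r["quantity"]), "unit_price": float(r["unit_price"])}
        for r in csv.DictReader(stream)
    ]


def test_add_total_price():
    # The transformation is tested with in-memory rows; no source dataset is needed.
    rows = [{"quantity": 3, "unit_price": 1.5}]
    assert add_total_price(rows) == [
        {"quantity": 3, "unit_price": 1.5, "total": 4.5}
    ]


test_add_total_price()

# In the pipeline, the same function is composed with the real reader.
orders = add_total_price(read_orders(io.StringIO("quantity,unit_price\n2,4.25\n")))
```

Because `add_total_price` doesn't depend on where the rows come from, it can live in a package and run under a unit test framework without access to the data lake.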

Use metadata-based data pipelines

Instead of building and maintaining individual data pipelines for every permutation of data sources and transformation requirements, consider a configuration-based design (metadata) that dynamically specifies data ingestion and transformation requirements.
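
A minimal sketch of this design, assuming a small list of configuration entries and a registry of named transformations. The configuration shape, source names, and transformation names are all hypothetical; in practice the metadata might live in JSON, YAML, or a control table.

```python
# Hypothetical pipeline metadata: one entry per source, no per-source code.
PIPELINE_CONFIG = [
    {"source": "orders", "transforms": ["drop_nulls", "uppercase_country"]},
    {"source": "customers", "transforms": ["drop_nulls"]},
]


def drop_nulls(rows):
    """Remove rows that contain any None value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def uppercase_country(rows):
    """Normalize the country code to upper case."""
    return [{**r, "country": r["country"].upper()} for r in rows]


# Registry that maps names used in the metadata to transformation functions.
TRANSFORMS = {"drop_nulls": drop_nulls, "uppercase_country": uppercase_country}


def run_pipeline(entry, load_source):
    """Run one generic pipeline as described by a metadata entry."""
    rows = load_source(entry["source"])
    for name in entry["transforms"]:
        rows = TRANSFORMS[name](rows)
    return rows


# In-memory stand-in for the real source systems.
SAMPLE_DATA = {
    "orders": [{"id": 1, "country": "de"}, {"id": 2, "country": None}],
    "customers": [{"id": 7, "name": "Ada"}, {"id": 8, "name": None}],
}
outputs = {
    entry["source"]: run_pipeline(entry, SAMPLE_DATA.__getitem__)
    for entry in PIPELINE_CONFIG
}
```

Onboarding a new source then means adding a configuration entry rather than writing a new pipeline.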

Optimize serving layers for specific consumers

Different consumers have different data consumption requirements, ranging from interactive reports to APIs to direct access through a relational database management system (RDBMS). Instead of providing a single consumption mechanism, consider creating multiple serving points, each tailored to the needs of a specific group of consumers.

Choose the right orchestrator for your data pipeline

An orchestrator is a tool or system that coordinates the execution of tasks and manages the flow of data through the pipeline. It is responsible for scheduling the different stages of the pipeline, such as data extraction, processing, and loading. An orchestrator can also manage error handling and recovery, and provide monitoring and reporting capabilities.
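
To make these responsibilities concrete, here is a toy sketch of what an orchestrator does at its core: it runs tasks in dependency order, retries failures, and records a status for each task. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The `run_tasks` function and the task names are hypothetical, and a real orchestrator adds scheduling, persistence, and monitoring on top.

```python
from graphlib import TopologicalSorter


def run_tasks(tasks, dependencies, max_attempts=2):
    """Run tasks in dependency order, retrying failures and recording status."""
    status = {}
    for name in TopologicalSorter(dependencies).static_order():
        # Skip a task if anything it depends on did not succeed.
        if any(status.get(dep) != "succeeded" for dep in dependencies.get(name, ())):
            status[name] = "skipped"
            continue
        for _attempt in range(max_attempts):
            try:
                tasks[name]()
                status[name] = "succeeded"
                break
            except Exception:
                status[name] = "failed"
    return status


calls = {"transform": 0}


def extract():
    pass


def transform():
    # Simulate a transient failure on the first attempt only.
    calls["transform"] += 1
    if calls["transform"] == 1:
        raise RuntimeError("transient failure")


def load():
    pass


# Each key maps a task to the set of tasks that it depends on.
dependencies = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
status = run_tasks(
    {"extract": extract, "transform": transform, "load": load}, dependencies
)
```

In this run, `transform` fails once, succeeds on the retry, and `load` runs only afterward. If `transform` had failed on every attempt, `load` would be marked as skipped.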

Some examples of orchestrators include Azure Data Factory, Apache Airflow, Argo Workflows, SQL Server Integration Services (SSIS), and Apache NiFi. These tools provide a way to define and schedule pipelines and to monitor and manage pipeline execution via a web-based user interface.

Here are some key features to consider when choosing an orchestrator:

  • Scheduling: the ability to run pipeline tasks on a recurring schedule or in response to events.
  • Workflow management: the ability to define and manage complex pipeline workflows, including branching and conditional logic.
  • Error handling: the ability to detect and handle errors that occur during pipeline execution, and provide recovery options.
  • Monitoring and reporting: the ability to monitor the status of pipeline tasks and provide detailed reporting on pipeline performance and errors.
  • Data lineage: the ability to track the flow of data through the pipeline and provide information on where data originated and how it has been transformed.

For more information, see Choose a data pipeline orchestration technology in Azure.