Choose a batch processing technology in Azure

Big data solutions often consist of discrete batch processing tasks that contribute to the overall data processing solution. You can use batch processing for workloads that don't require immediate access to insights. Batch processing can complement real-time processing requirements. You can also use batch processing to balance complexity and reduce cost for your overall implementation.

The fundamental requirement of batch processing engines is to scale out computations to handle a large volume of data. Unlike real-time processing, batch processing has a latency (the time between data ingestion and computing a result) of minutes or hours.
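The scale-out idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not how any of the services below implement it: the function names (`split_into_partitions`, `count_events`, `run_batch`) and the sample events are hypothetical, and a local thread pool stands in for a cluster of worker nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Divide the input into roughly equal partitions, one per worker."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_events(partition):
    """Per-partition work: count events by type."""
    return Counter(event_type for event_type, _ in partition)


def run_batch(records, num_partitions=4):
    """Process partitions in parallel, then merge the partial results."""
    partitions = split_into_partitions(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_events, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return dict(total)


# Hypothetical input: (event_type, payload) tuples accumulated since the last run.
events = [
    ("click", 1), ("view", 2), ("click", 3),
    ("purchase", 4), ("view", 5), ("click", 6),
]
result = run_batch(events, num_partitions=3)
```

Because each partition is processed independently and the partial results are merged at the end, adding workers lets the same logic handle more data. A distributed engine such as Spark applies the same pattern across machines instead of threads.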

Choose a technology for batch processing

Microsoft offers several services that you can use to do batch processing.

Microsoft Fabric

Microsoft Fabric is an all-in-one analytics and data platform for organizations. It's a software as a service offering that simplifies how you provision, manage, and govern an end-to-end analytics solution. Fabric handles data movement, processing, ingestion, transformation, and reporting. Fabric features that you use for batch processing include data engineering, data warehouses, lakehouses, and Apache Spark processing. Azure Data Factory in Fabric also supports lakehouses. To simplify and accelerate development, you can enable AI-driven Copilot.

  • Languages: R, Python, Java, Scala, and SQL

  • Security: Managed virtual network and OneLake role-based access control (RBAC)

  • Primary storage: OneLake, which has shortcuts and mirroring options

  • Spark: A prehydrated starter pool and a custom Spark pool with predefined node sizes

Azure Synapse Analytics

Azure Synapse Analytics is an enterprise analytics service that brings together SQL and Spark technologies in a single workspace. Azure Synapse Analytics simplifies security, governance, and management. Every workspace has integrated data pipelines that you can use to author end-to-end workflows. You can also provision a dedicated SQL pool (formerly a SQL data warehouse) for large-scale analytics, a serverless SQL endpoint that you can use to directly query the lake, and a Spark runtime for distributed data processing.

  • Languages: Python, Java, Scala, and SQL

  • Security: Managed virtual network, RBAC and access control, and storage access control lists on Azure Data Lake Storage

  • Primary storage: Data Lake Storage, with integration for other sources

  • Spark: Custom Spark configuration setup with predefined node sizes

Azure Databricks

Azure Databricks is a Spark-based analytics platform. It provides premium Spark capabilities that are built on top of open-source Spark. Azure Databricks is a Microsoft service that integrates with the rest of the Azure services. It features extra configurations for Spark cluster deployments, and Unity Catalog helps simplify the governance of Azure Databricks Spark objects.

  • Languages: R, Python, Java, Scala, and Spark SQL.

  • Security: User authentication with Microsoft Entra ID.

  • Primary storage: Built-in integration with Azure Blob Storage, Data Lake Storage, Azure Synapse Analytics, and other services. For more information, see Data sources.

Other benefits include:

  • Web-based notebooks for collaboration and data exploration.

  • Fast cluster start times, automatic termination, and autoscaling.

  • Support for GPU-enabled clusters.

Key selection criteria

To choose your technology for batch processing, consider the following questions:

  • Do you want a managed service, or do you want to manage your own servers?

  • Do you want to author batch processing logic declaratively or imperatively?

  • Do you perform batch processing in bursts? If yes, consider options that can automatically terminate a cluster or that offer per-job pricing.

  • Do you need to query relational data stores along with your batch processing, for example to look up reference data? If yes, consider options that provide the ability to query external relational stores.
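As an illustration of the last question, the following minimal sketch enriches batch records with reference data read from a relational store. It rests on assumptions that aren't part of this article: an in-memory SQLite database stands in for the external relational store, and the table name `country_region`, the function `enrich_orders`, and the sample orders are hypothetical.

```python
import sqlite3


def enrich_orders(orders, connection):
    """Attach a region to each order by looking it up in a reference table."""
    rows = connection.execute(
        "SELECT country_code, region FROM country_region"
    ).fetchall()
    region_by_country = dict(rows)
    return [
        {**order, "region": region_by_country.get(order["country_code"], "unknown")}
        for order in orders
    ]


# An in-memory SQLite database stands in for the external relational store.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE country_region (country_code TEXT PRIMARY KEY, region TEXT)"
)
connection.executemany(
    "INSERT INTO country_region VALUES (?, ?)",
    [("DE", "EMEA"), ("US", "AMER"), ("JP", "APAC")],
)

# Hypothetical batch input collected since the previous run.
orders = [
    {"order_id": 1, "country_code": "DE"},
    {"order_id": 2, "country_code": "JP"},
    {"order_id": 3, "country_code": "BR"},
]
enriched = enrich_orders(orders, connection)
```

Loading the reference table once per batch and joining in memory suits small lookup tables. For large reference data, an engine that can push the join down to the relational store, or join it in a distributed fashion, is likely a better fit.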

Capability matrix

The following tables summarize key differences in capabilities between services.

General capabilities

Capability | Fabric | Azure Synapse Analytics | Azure Databricks
Software as a service | Yes [1] | No | No
Managed service | No | Yes | Yes
Relational data store | Yes | Yes | Yes
Pricing model | Capacity units | SQL pool or cluster hour | Azure Databricks unit [2] and cluster hour

[1] Assigned Fabric capacity.

[2] An Azure Databricks unit is the processing capability per hour.

Other capabilities

Capability | Fabric | Azure Synapse Analytics | Azure Databricks
Autoscaling | No | No | Yes
Scale-out granularity | Per Fabric SKU | Per cluster or per SQL pool | Per cluster
In-memory caching of data | No | Yes | Yes
Query from external relational stores | Yes | No | Yes
Authentication | Microsoft Entra ID | SQL or Microsoft Entra ID | Microsoft Entra ID
Auditing | Yes | Yes | Yes
Row-level security | Yes | Yes [1] | Yes
Supports firewalls | Yes | Yes | Yes
Dynamic data masking | Yes | Yes | Yes

[1] Filter predicates only. For more information, see Row-level security.
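One way to apply the tables above is to encode a few rows as data and filter the services by the capabilities that a workload requires. The following sketch does that for three rows of the "Other capabilities" table. The dictionary keys and the `shortlist` function are hypothetical names chosen for illustration, and the values only mirror the table as stated in this article.

```python
# Three rows from the "Other capabilities" table, encoded as booleans.
CAPABILITIES = {
    "Fabric": {
        "autoscaling": False,
        "in_memory_caching": False,
        "external_relational_query": True,
    },
    "Azure Synapse Analytics": {
        "autoscaling": False,
        "in_memory_caching": True,
        "external_relational_query": False,
    },
    "Azure Databricks": {
        "autoscaling": True,
        "in_memory_caching": True,
        "external_relational_query": True,
    },
}


def shortlist(required):
    """Return the services that support every capability in `required`."""
    return [
        service
        for service, capabilities in CAPABILITIES.items()
        if all(capabilities.get(name, False) for name in required)
    ]


# Example: a workload that must look up reference data in an external store.
candidates = shortlist(["external_relational_query"])
```

A shortlist like this only narrows the options. Pricing model, existing skills, and the other selection criteria still need to be weighed separately.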

Contributors

This article is maintained by Microsoft. It was originally written by the following contributors.

Principal authors:
