Choose a batch processing technology in Azure
Big data solutions often consist of discrete batch processing tasks that contribute to the overall data processing solution. You can use batch processing for workloads that don't require immediate access to insights. Batch processing can complement real-time processing requirements. You can also use batch processing to balance complexity and reduce cost for your overall implementation.
The fundamental requirement of batch processing engines is to scale out computations to handle a large volume of data. Unlike real-time processing, batch processing typically has latencies of minutes or hours, where latency is the time between data ingestion and a computed result.
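To make the scale-out idea concrete, the following is a minimal sketch in plain Python rather than in any of the services described in this article. It splits a dataset into partitions, aggregates each partition independently, and then merges the partial results. The function names and the sample records are hypothetical, and the thread pool only stands in for the worker nodes that a distributed engine such as Spark would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into num_partitions lists."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_events(partition):
    """Map step: aggregate one partition without looking at the others."""
    return Counter(record["event"] for record in partition)


def run_batch(records, num_partitions=4):
    """Process every partition in parallel, then merge the partial counts."""
    partitions = split_into_partitions(records, num_partitions)
    # Threads are a stand-in (assumption) for the workers of a real cluster.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_events, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # Reduce step: combine partial results.
    return total


# Hypothetical sample data: 3,000 event records.
records = [{"event": "click"}, {"event": "view"}, {"event": "click"}] * 1000
totals = run_batch(records)
```

Because each partition is aggregated independently, the result doesn't depend on the number of partitions. That independence is what lets a batch engine add workers as the data volume grows.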
Choose a technology for batch processing
Microsoft offers several services that you can use to do batch processing.
Microsoft Fabric
Microsoft Fabric is an all-in-one analytics and data platform for organizations. It's a software as a service offering that simplifies how you provision, manage, and govern an end-to-end analytics solution. Fabric handles data movement, processing, ingestion, transformation, and reporting. Fabric features that you use for batch processing include data engineering, data warehouses, lakehouses, and Apache Spark processing. Azure Data Factory in Fabric also supports lakehouses. To simplify and accelerate development, you can enable AI-driven Copilot.
Languages: R, Python, Java, Scala, and SQL
Security: Managed virtual network and OneLake role-based access control (RBAC)
Primary storage: OneLake, which has shortcuts and mirroring options
Spark: A prehydrated starter pool and a custom Spark pool with predefined node sizes
Azure Synapse Analytics
Azure Synapse Analytics is an enterprise analytics service that brings together both SQL and Spark technologies under a single construct of a workspace. Azure Synapse Analytics simplifies security, governance, and management. Every workspace has integrated data pipelines that you can use to author end-to-end workflows. You can also provision a dedicated SQL pool for large-scale analytics, a serverless SQL endpoint that you can use to directly query the lake, and a Spark runtime for distributed data processing.
Languages: Python, Java, Scala, and SQL
Security: Managed virtual network, RBAC and access control, and storage access control lists on Azure Data Lake Storage
Primary storage: Data Lake Storage, with integration for other sources
Spark: Custom Spark configuration setup with predefined node sizes
Azure Databricks
Azure Databricks is a Spark-based analytics platform. It offers premium Spark features that are built on top of open-source Spark. Azure Databricks is a Microsoft service that integrates with the rest of the Azure services, and it provides extra configuration options for Spark cluster deployments. Unity Catalog helps simplify the governance of Azure Databricks Spark objects.
Languages: R, Python, Java, Scala, and Spark SQL.
Security: User authentication with Microsoft Entra ID.
Primary storage: Built-in integration with Azure Blob Storage, Data Lake Storage, Azure Synapse Analytics, and other services. For more information, see Data sources.
Other benefits include:
Web-based notebooks for collaboration and data exploration.
Fast cluster start times, automatic termination, and autoscaling.
Support for GPU-enabled clusters.
Key selection criteria
To choose your technology for batch processing, consider the following questions:
Do you want a managed service, or do you want to manage your own servers?
Do you want to author batch processing logic declaratively or imperatively?
Do you perform batch processing in bursts? If yes, consider options that provide the ability to automatically terminate a cluster or that have pricing models for each batch job.
Do you need to query relational data stores along with your batch processing, for example to look up reference data? If yes, consider options that provide the ability to query external relational stores.
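Two of these questions, declarative versus imperative authoring and reference-data lookups against a relational store, can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in for a relational store. The table names, columns, and values are hypothetical, and none of the code is specific to Fabric, Azure Synapse Analytics, or Azure Databricks.

```python
import sqlite3

# An in-memory SQLite database stands in (assumption) for a relational store.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regions (region_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (region_id INTEGER, amount REAL);
    INSERT INTO regions VALUES (1, 'East'), (2, 'West');
    INSERT INTO sales VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

# Declarative: state the result that you want and let the engine plan the work.
declarative = dict(conn.execute("""
    SELECT r.name, SUM(s.amount)
    FROM sales AS s
    JOIN regions AS r ON r.region_id = s.region_id
    GROUP BY r.name
"""))

# Imperative: load the reference data, then spell out each processing step.
region_names = dict(conn.execute("SELECT region_id, name FROM regions"))
imperative = {}
for region_id, amount in conn.execute("SELECT region_id, amount FROM sales"):
    name = region_names[region_id]  # Reference-data lookup.
    imperative[name] = imperative.get(name, 0.0) + amount

conn.close()
```

Both approaches produce the same totals per region. The declarative form leaves execution decisions to the engine, while the imperative form gives you explicit control over each step.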
Capability matrix
The following tables summarize key differences in capabilities between services.
General capabilities
Capability | Fabric | Azure Synapse Analytics | Azure Databricks |
---|---|---|---|
Software as a service | Yes [1] | No | No |
Managed service | No | Yes | Yes |
Relational data store | Yes | Yes | Yes |
Pricing model | Capacity units | SQL pool or cluster hour | Azure Databricks unit [2] and cluster hour |
[1] Assigned Fabric capacity.
[2] An Azure Databricks unit is the processing capability per hour.
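For workloads that run in bursts, per-hour pricing makes automatic termination worth estimating. The following arithmetic sketch compares a cluster that stays on all month with one that runs only while jobs execute. The hourly rate, job duration, and run count are hypothetical placeholders, not published prices, and the function name is illustrative only.

```python
def monthly_cost(hourly_rate, hours_per_run, runs_per_month,
                 always_on, hours_in_month=730):
    """Estimate the monthly compute cost under simple per-hour pricing.

    All inputs are hypothetical. Real pricing depends on the service,
    SKU, region, and any startup or minimum-billing overhead.
    """
    if always_on:
        # The cluster is billed for every hour of the month.
        return hourly_rate * hours_in_month
    # The cluster is billed only for the hours that the batch jobs run.
    return hourly_rate * hours_per_run * runs_per_month


# Hypothetical workload: one 1.5-hour job per day at a rate of 2.0 per hour.
always_on_cost = monthly_cost(2.0, 1.5, 30, always_on=True)
auto_terminate_cost = monthly_cost(2.0, 1.5, 30, always_on=False)
```

Under these assumed numbers, the automatically terminated cluster is billed for 45 hours instead of 730.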
Other capabilities
Capability | Fabric | Azure Synapse Analytics | Azure Databricks |
---|---|---|---|
Autoscaling | No | No | Yes |
Scale-out granularity | Per Fabric SKU | Per cluster or per SQL pool | Per cluster |
In-memory caching of data | No | Yes | Yes |
Query from external relational stores | Yes | No | Yes |
Authentication | Microsoft Entra ID | SQL or Microsoft Entra ID | Microsoft Entra ID |
Auditing | Yes | Yes | Yes |
Row-level security | Yes | Yes [1] | Yes |
Supports firewalls | Yes | Yes | Yes |
Dynamic data masking | Yes | Yes | Yes |
[1] Filter predicates only. For more information, see Row-level security.
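One way to apply the preceding tables is to encode the rows that matter for your workload as data and filter the services by required capability. The following sketch copies three rows from the table as written. The dictionary layout, key names, and function name are hypothetical, and the values should be checked against current service documentation before you rely on them.

```python
# Values copied from three rows of the "Other capabilities" table.
CAPABILITIES = {
    "Fabric": {
        "autoscaling": False,
        "in_memory_caching": False,
        "external_relational_query": True,
    },
    "Azure Synapse Analytics": {
        "autoscaling": False,
        "in_memory_caching": True,
        "external_relational_query": False,
    },
    "Azure Databricks": {
        "autoscaling": True,
        "in_memory_caching": True,
        "external_relational_query": True,
    },
}


def services_supporting(*required):
    """Return the services that offer every capability named in required."""
    return [
        name
        for name, caps in CAPABILITIES.items()
        if all(caps[capability] for capability in required)
    ]


# Example: which services can query external relational stores?
candidates = services_supporting("external_relational_query")
```

A filter like this only narrows the list. The key selection criteria, such as your preferred authoring style and pricing model, still determine the final choice.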
Contributors
This article is maintained by Microsoft. It was originally written by the following contributors.
Principal authors:
- Zoiner Tejada | CEO and Architect
- Pratima Valavala | Principal Solutions Architect
Next steps
- What is Fabric?
- Fabric decision guide
- Training: Introduction to Azure Synapse Analytics
- What is Azure HDInsight?
- What is Azure Databricks?