Choose a batch processing technology in Azure

10/09/2023

Big data solutions often use long-running batch jobs to filter, aggregate and otherwise prepare the data for analysis. Usually, these jobs involve reading source files from scalable storage (like HDFS, Azure Data Lake Store, and Azure Storage), processing them, and writing the output to new files in scalable storage.

The fundamental requirement of such batch processing engines is to scale out computations to handle a large volume of data. Unlike real-time processing, batch processing is expected to have latencies (the time between data ingestion and computing a result) measured in minutes to hours.
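The read, process, and write pattern described above can be sketched on a single machine. The following is a minimal illustration only, not how any of the Azure services below implement it: the function name `run_batch`, the `category` and `amount` column names, and the use of local CSV files in place of scalable storage are all assumptions made for the example.

```python
import csv
from collections import defaultdict
from pathlib import Path


def run_batch(input_dir: Path, output_path: Path, min_amount: float = 0.0) -> dict[str, float]:
    """Read every CSV file in input_dir, filter and aggregate rows, write one output file.

    Hypothetical schema: each source file has 'category' and 'amount' columns.
    """
    totals: dict[str, float] = defaultdict(float)

    # Read phase: iterate over all source files in a stable order.
    for source in sorted(input_dir.glob("*.csv")):
        with source.open(newline="") as handle:
            for row in csv.DictReader(handle):
                amount = float(row["amount"])
                # Filter phase: drop rows below the threshold.
                if amount < min_amount:
                    continue
                # Aggregate phase: sum amounts per category.
                totals[row["category"]] += amount

    # Write phase: emit the prepared data as a new file.
    with output_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["category", "total"])
        for category in sorted(totals):
            writer.writerow([category, totals[category]])

    return dict(totals)
```

A real batch engine distributes the read and aggregate phases across many nodes; this sketch only shows the shape of the job.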

Technology choices for batch processing

Azure Synapse Analytics

Azure Synapse is a distributed system designed to perform analytics on large volumes of data. It supports massively parallel processing (MPP), which makes it suitable for running high-performance analytics. Consider Azure Synapse when you have large amounts of data (more than 1 TB) and are running an analytics workload that benefits from parallelism.

Azure Data Lake Analytics

Data Lake Analytics is an on-demand analytics job service. It's optimized for distributed processing of large data sets stored in Azure Data Lake Store.

Languages: U-SQL (including Python, R, and C# extensions).
Integrates with Azure Data Lake Store, Azure Storage blobs, Azure SQL Database, and Azure Synapse.
Pricing model is per-job.

HDInsight

HDInsight is a managed Hadoop service. Use it to deploy and manage Hadoop clusters in Azure. For batch processing, you can use Spark, Hive, Hive LLAP, or MapReduce.

Languages: R, Python, Java, Scala, SQL.
Kerberos authentication with Active Directory, Apache Ranger-based access control.
Gives you complete control of the Hadoop cluster.

Azure Databricks

Azure Databricks is an Apache Spark-based analytics platform. You can think of it as "Spark as a service." It's the easiest way to use Spark on the Azure platform.

Languages: R, Python, Java, Scala, Spark SQL.
Fast cluster start times, autotermination, autoscaling.
Manages the Spark cluster for you.
Built-in integration with Azure Blob Storage, Azure Data Lake Storage (ADLS), Azure Synapse, and other services. See Data Sources.
User authentication with Microsoft Entra ID.
Web-based notebooks for collaboration and data exploration.
Supports GPU-enabled clusters.

Key selection criteria

To narrow the choices, start by answering these questions:

Do you want a managed service rather than managing your own servers?
Do you want to author batch processing logic declaratively or imperatively?
Will you perform batch processing in bursts? If yes, consider options that let you auto-terminate the cluster or whose pricing model is per batch job.
Do you need to query relational data stores along with your batch processing, for example, to look up reference data? If yes, consider the options that enable the querying of external relational stores.
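The question about bursty workloads comes down to simple arithmetic: a cluster that runs all month is billed for every hour, while an auto-terminating cluster or a per-job service is billed only while jobs run. The sketch below illustrates that comparison. The hourly rate, job count, and job duration are hypothetical placeholders, not Azure prices, and the model ignores storage, startup time, and other charges.

```python
def monthly_cost_always_on(hourly_rate: float, hours_in_month: float = 730.0) -> float:
    """Cost of a cluster that is never shut down (hypothetical flat hourly rate)."""
    return hourly_rate * hours_in_month


def monthly_cost_on_demand(hourly_rate: float, jobs_per_month: int, hours_per_job: float) -> float:
    """Cost when compute is billed only while batch jobs are running."""
    return hourly_rate * jobs_per_month * hours_per_job


# Hypothetical workload: one 2-hour job per day at an assumed rate of 4.0 per hour.
always_on = monthly_cost_always_on(4.0)
on_demand = monthly_cost_on_demand(4.0, jobs_per_month=30, hours_per_job=2.0)
print(always_on, on_demand)  # prints: 2920.0 240.0
```

Under these assumed numbers, the on-demand option costs a small fraction of the always-on cluster. The gap narrows as jobs run more often or for longer.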

Capability matrix

The following tables summarize the key differences in capabilities.

General capabilities

Capability	Azure Data Lake Analytics	Azure Synapse	HDInsight	Azure Databricks
Managed service	Yes	Yes	Yes [1]	Yes
Relational data store	Yes	Yes	No	Yes
Pricing model	Per batch job	By cluster hour	By cluster hour	Databricks Unit [2] + cluster hour

[1] With manual configuration.

[2] A Databricks Unit (DBU) is a unit of processing capability per hour.

Capabilities

Capability	Azure Data Lake Analytics	Azure Synapse	HDInsight with Spark	HDInsight with Hive	HDInsight with Hive LLAP	Azure Databricks
Autoscaling	No	No	Yes	Yes	Yes	Yes
Scale-out granularity	Per job	Per cluster	Per cluster	Per cluster	Per cluster	Per cluster
In-memory caching of data	No	Yes	Yes	No	Yes	Yes
Query from external relational stores	Yes	No	Yes	No	No	Yes
Authentication	Microsoft Entra ID	SQL / Microsoft Entra ID	No	Microsoft Entra ID [1]	Microsoft Entra ID [1]	Microsoft Entra ID
Auditing	Yes	Yes	No	Yes [1]	Yes [1]	Yes
Row-level security	No	Yes [2]	No	Yes [1]	Yes [1]	Yes
Supports firewalls	Yes	Yes	Yes	Yes [3]	Yes [3]	Yes
Dynamic data masking	No	Yes	No	Yes [1]	Yes [1]	Yes

[1] Requires using a domain-joined HDInsight cluster.

[2] Filter predicates only. See Row-Level Security.

[3] Supported when used within an Azure Virtual Network.
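One way to apply the matrix is to encode it as data and filter on the capabilities a workload requires. The sketch below transcribes four rows of the table above. The dictionary layout, the key names, and the `candidates` helper are illustrative assumptions, and footnoted "Yes" entries are treated as plain `True`, so the footnote conditions still need to be checked by hand.

```python
# Subset of the capability matrix above. Footnoted "Yes" values are recorded
# as True; their conditions (for example, a domain-joined cluster) are not modeled.
CAPABILITIES = {
    "Azure Data Lake Analytics": {
        "autoscaling": False, "in_memory_caching": False,
        "external_relational_query": True, "row_level_security": False,
    },
    "Azure Synapse": {
        "autoscaling": False, "in_memory_caching": True,
        "external_relational_query": False, "row_level_security": True,
    },
    "HDInsight with Spark": {
        "autoscaling": True, "in_memory_caching": True,
        "external_relational_query": True, "row_level_security": False,
    },
    "HDInsight with Hive": {
        "autoscaling": True, "in_memory_caching": False,
        "external_relational_query": False, "row_level_security": True,
    },
    "HDInsight with Hive LLAP": {
        "autoscaling": True, "in_memory_caching": True,
        "external_relational_query": False, "row_level_security": True,
    },
    "Azure Databricks": {
        "autoscaling": True, "in_memory_caching": True,
        "external_relational_query": True, "row_level_security": True,
    },
}


def candidates(required: set[str]) -> list[str]:
    """Return the options that support every capability in `required`."""
    return [
        name
        for name, caps in CAPABILITIES.items()
        if all(caps[capability] for capability in required)
    ]


print(candidates({"autoscaling", "external_relational_query"}))
# prints: ['HDInsight with Spark', 'Azure Databricks']
```

A filter like this only narrows the list. Pricing model, language support, and the footnoted conditions still decide between the remaining options.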

Contributors

This article is maintained by Microsoft. It was originally written by the following contributors.

Principal author:

Zoiner Tejada | CEO and Architect
