Most big data solutions consist of repeated data processing operations, encapsulated in workflows. A pipeline orchestrator is a tool that helps to automate these workflows. An orchestrator can schedule jobs, execute workflows, and coordinate dependencies among tasks.
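To make the idea of coordinating dependencies among tasks concrete, here is a minimal sketch in Python. It is not how any of the Azure services discussed here work internally; the task names (`extract`, `transform`, `load`) and the `run_workflow` helper are hypothetical, and scheduling, retries, and parallelism are left out. It only shows tasks running in an order that respects their declared dependencies, using the standard library's `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies):
    """Run every task after all of its dependencies have finished.

    tasks:        maps a task name to a zero-argument callable.
    dependencies: maps a task name to the set of task names it depends on.
    Returns the list of task names in the order they were executed.
    """
    executed = []
    # static_order() yields each node only after all of its predecessors,
    # and raises graphlib.CycleError if the dependencies contain a cycle.
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        executed.append(name)
    return executed


# Hypothetical three-step workflow; each task just records that it ran.
log = []
tasks = {
    "extract": lambda: log.append("extracted"),
    "transform": lambda: log.append("transformed"),
    "load": lambda: log.append("loaded"),
}
dependencies = {
    "transform": {"extract"},  # transform needs extract to finish first
    "load": {"transform"},     # load needs transform to finish first
}

order = run_workflow(tasks, dependencies)
print(order)
```

A production orchestrator adds the pieces this sketch omits, such as time-based or event-based triggers, failure handling, and monitoring.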
In Azure, the following services and tools meet the core requirements for pipeline orchestration, control flow, and data movement: Azure Data Factory, SQL Server Integration Services (SSIS), and Oozie on HDInsight.
You can use these services and tools independently of one another, or combine them in a hybrid solution. For example, the Integration Runtime (IR) in Azure Data Factory V2 can natively execute SSIS packages in a managed Azure compute environment. Although the services overlap in functionality, there are a few key differences.
To narrow the choices, start by answering these questions:
Do you need big data capabilities for moving and transforming your data? Usually this means multiple gigabytes to terabytes of data. If yes, narrow your options to those that are best suited for big data.
Do you require a managed service that can operate at scale? If yes, select one of the cloud-based services that aren't limited by your local processing power.
Are some of your data sources located on-premises? If yes, look for options that can work with both cloud and on-premises data sources or destinations.
Is your source data stored in Blob storage on an HDFS filesystem? If so, choose an option that supports Hive queries.
The following tables summarize the key differences in capabilities.
General capabilities:

Capability | Azure Data Factory | SQL Server Integration Services (SSIS) | Oozie on HDInsight |
---|---|---|---|
Managed | Yes | No | Yes |
Cloud-based | Yes | No (local) | Yes |
Prerequisite | Azure Subscription | SQL Server | Azure Subscription, HDInsight cluster |
Management tools | Azure Portal, PowerShell, CLI, .NET SDK | SSMS, PowerShell | Bash shell, Oozie REST API, Oozie web UI |
Pricing | Pay per usage | Licensing / pay for features | No additional charge on top of running the HDInsight cluster |

Pipeline capabilities:

Capability | Azure Data Factory | SQL Server Integration Services (SSIS) | Oozie on HDInsight |
---|---|---|---|
Copy data | Yes | Yes | Yes |
Custom transformations | Yes | Yes | Yes (MapReduce, Pig, and Hive jobs) |
Azure Machine Learning scoring | Yes | Yes (with scripting) | No |
HDInsight On-Demand | Yes | No | No |
Azure Batch | Yes | No | No |
Pig, Hive, MapReduce | Yes | No | Yes |
Spark | Yes | No | No |
Execute SSIS Package | Yes | Yes | No |
Control flow | Yes | Yes | Yes |
Access on-premises data | Yes | Yes | No |

Scalability capabilities:

Capability | Azure Data Factory | SQL Server Integration Services (SSIS) | Oozie on HDInsight |
---|---|---|---|
Scale up | Yes | No | No |
Scale out | Yes | No | Yes (by adding worker nodes to cluster) |
Optimized for big data | Yes | No | Yes |
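
The narrowing questions and the tables above can be combined into a small lookup. The following Python sketch is only an illustration: the `CAPABILITIES` dictionary transcribes a subset of the table rows, while the `candidates` function, its parameter names, and the mapping from each question to table rows are assumptions made for this example, not part of any Azure SDK.

```python
# Subset of the capability tables above (True = "Yes", False = "No").
CAPABILITIES = {
    "Azure Data Factory": {
        "managed": True,
        "cloud_based": True,
        "on_premises_data": True,
        "pig_hive_mapreduce": True,
        "optimized_for_big_data": True,
    },
    "SQL Server Integration Services (SSIS)": {
        "managed": False,
        "cloud_based": False,
        "on_premises_data": True,
        "pig_hive_mapreduce": False,
        "optimized_for_big_data": False,
    },
    "Oozie on HDInsight": {
        "managed": True,
        "cloud_based": True,
        "on_premises_data": False,
        "pig_hive_mapreduce": True,
        "optimized_for_big_data": True,
    },
}


def candidates(big_data=False, managed_at_scale=False,
               on_premises_sources=False, hdfs_hive=False):
    """Return the options whose table entries satisfy every 'yes' answer.

    Assumed mapping from each question to table rows:
      big_data            -> "Optimized for big data"
      managed_at_scale    -> "Managed" and "Cloud-based"
      on_premises_sources -> "Access on-premises data"
      hdfs_hive           -> "Pig, Hive, MapReduce"
    """
    required = set()
    if big_data:
        required.add("optimized_for_big_data")
    if managed_at_scale:
        required.update({"managed", "cloud_based"})
    if on_premises_sources:
        required.add("on_premises_data")
    if hdfs_hive:
        required.add("pig_hive_mapreduce")
    return [name for name, caps in CAPABILITIES.items()
            if all(caps[key] for key in required)]


# Example: big data workload with some sources located on-premises.
print(candidates(big_data=True, on_premises_sources=True))
```

With no requirements set, the function returns all three options. Each "yes" answer only removes options, which mirrors how the questions are meant to narrow the choice rather than rank it.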
This article is maintained by Microsoft. It was originally written by the following contributors.
Principal author: