APPLIES TO:
Azure Data Factory
Azure Synapse Analytics
Tip
Try out Data Factory in Microsoft Fabric, an all-in-one analytics solution for enterprises. Microsoft Fabric covers everything from data movement to data science, real-time analytics, business intelligence, and reporting. Learn how to start a new trial for free!
Important
Support for Azure Machine Learning Studio (classic) will end on August 31, 2024. We recommend that you transition to Azure Machine Learning by that date.
As of December 1, 2021, you can't create new Machine Learning Studio (classic) resources (workspace and web service plan). Through August 31, 2024, you can continue to use the existing Machine Learning Studio (classic) experiments and web services. For more information, see:
Machine Learning Studio (classic) documentation is being retired and might not be updated in the future.
This article explains the data transformation activities in Azure Data Factory and Synapse pipelines that you can use to transform and process raw data into predictions and insights at scale. A transformation activity executes in a computing environment such as Azure Databricks or Azure HDInsight. The article also links to detailed information on each transformation activity.
The service supports the following data transformation activities that can be added to pipelines either individually or chained with another activity.
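As a rough sketch of what "chained with another activity" can look like in a pipeline definition, the Python snippet below builds the JSON for two activities where the second waits for the first to succeed. It assumes the `dependsOn` and `dependencyConditions` properties from the pipeline JSON schema. The pipeline and activity names are hypothetical, and the activity-specific settings are left empty.

```python
import json


def activity(name, activity_type, depends_on=None):
    """Build a minimal activity definition (names are hypothetical)."""
    return {
        "name": name,
        "type": activity_type,
        # Each dependsOn entry names an upstream activity and the
        # outcome that allows this activity to start.
        "dependsOn": [
            {"activity": upstream, "dependencyConditions": ["Succeeded"]}
            for upstream in (depends_on or [])
        ],
        "typeProperties": {},  # activity-specific settings omitted in this sketch
    }


pipeline = {
    "name": "TransformPipeline",  # hypothetical pipeline name
    "properties": {
        "activities": [
            activity("PrepareWithHive", "HDInsightHive"),
            activity(
                "ScoreWithNotebook",
                "DatabricksNotebook",
                depends_on=["PrepareWithHive"],
            ),
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

An activity with an empty `dependsOn` list has no upstream dependency, which corresponds to adding it to the pipeline individually.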
Mapping data flows are visually designed data transformations in Azure Data Factory and Azure Synapse. Data flows allow data engineers to develop graphical data transformation logic without writing code. The resulting data flows are executed as activities within pipelines that use scaled-out Spark clusters. Data flow activities can be operationalized via the existing scheduling, control flow, and monitoring capabilities within the service. For more information, see mapping data flows.
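To illustrate how a data flow runs "as an activity within a pipeline", here is a minimal sketch of a data flow activity definition, built as a Python dictionary and printed as JSON. It assumes the `ExecuteDataFlow` activity type and the `dataflow` and `compute` properties described in the Data Flow activity reference. The activity name, the data flow name, and the compute sizing are hypothetical.

```python
import json

# The data flow itself is designed visually; the pipeline only references it.
data_flow_activity = {
    "name": "TransformSales",  # hypothetical activity name
    "type": "ExecuteDataFlow",
    "typeProperties": {
        "dataflow": {
            "referenceName": "SalesCleanupFlow",  # hypothetical data flow name
            "type": "DataFlowReference",
        },
        # Size of the Spark cluster that executes the data flow (assumed values).
        "compute": {"computeType": "General", "coreCount": 8},
    },
}

print(json.dumps(data_flow_activity, indent=2))
```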
Power Query in Azure Data Factory enables cloud-scale data wrangling, which lets you prepare data iteratively without writing code. Data wrangling integrates with Power Query Online and makes Power Query M functions available at cloud scale via Spark execution. For more information, see data wrangling in Azure Data Factory.
Note
Power Query is currently only supported in Azure Data Factory, and not in Azure Synapse. For a list of specific features supported in each service, see Available features in Azure Data Factory & Azure Synapse Analytics pipelines.
Optionally, you can hand-code transformations and manage the external compute environment yourself.
The HDInsight Hive activity in a pipeline executes Hive queries on your own or on-demand Windows/Linux-based HDInsight cluster. See Hive activity article for details about this activity.
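A Hive activity definition might look roughly like the following sketch, which assembles the JSON in Python. It assumes the `HDInsightHive` activity type with the `scriptLinkedService`, `scriptPath`, and `defines` properties from the Hive activity reference. The linked service names, script path, and parameter values are hypothetical.

```python
import json


def linked_service_ref(name):
    """Reference an existing linked service by name."""
    return {"referenceName": name, "type": "LinkedServiceReference"}


hive_activity = {
    "name": "SummarizeLogs",  # hypothetical activity name
    "type": "HDInsightHive",
    # The HDInsight cluster (your own or on-demand) that runs the query.
    "linkedServiceName": linked_service_ref("MyHDInsightLinkedService"),
    "typeProperties": {
        # Storage account that holds the Hive script file.
        "scriptLinkedService": linked_service_ref("MyStorageLinkedService"),
        "scriptPath": "scripts/summarize_logs.hql",
        # Key/value pairs that the Hive script can reference.
        "defines": {"inputFolder": "raw/logs", "outputFolder": "curated/logs"},
    },
}

print(json.dumps(hive_activity, indent=2))
```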
The HDInsight Pig activity in a pipeline executes Pig queries on your own or on-demand Windows/Linux-based HDInsight cluster. See Pig activity article for details about this activity.
The HDInsight MapReduce activity in a pipeline executes MapReduce programs on your own or on-demand Windows/Linux-based HDInsight cluster. See MapReduce activity article for details about this activity.
The HDInsight Streaming activity in a pipeline executes Hadoop Streaming programs on your own or on-demand Windows/Linux-based HDInsight cluster. See HDInsight Streaming activity for details about this activity.
The HDInsight Spark activity in a pipeline executes Spark programs on your own HDInsight cluster. For details, see Invoke Spark programs with Azure Data Factory or Azure Synapse Analytics.
The service enables you to easily create pipelines that use a published ML Studio (classic) web service for predictive analytics. Using the Batch Execution activity in a pipeline, you can invoke a Studio (classic) web service to make predictions on the data in batch.
Over time, the predictive models in the Studio (classic) scoring experiments need to be retrained using new input datasets. After retraining, you can use the Update Resource activity to update the scoring web service with the newly trained model.
See Use ML Studio (classic) activities for details about these Studio (classic) activities.
You can use the SQL Server Stored Procedure activity in a Data Factory pipeline to invoke a stored procedure in one of the following data stores: Azure SQL Database, Azure Synapse Analytics, or a SQL Server database in your enterprise or on an Azure VM. See Stored Procedure activity article for details.
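As an illustration, the sketch below builds a Stored Procedure activity definition in Python. It assumes the `SqlServerStoredProcedure` activity type with the `storedProcedureName` and `storedProcedureParameters` properties from the Stored Procedure activity reference. The linked service name, procedure name, and parameters are hypothetical.

```python
import json

stored_procedure_activity = {
    "name": "RefreshDailyTotals",  # hypothetical activity name
    "type": "SqlServerStoredProcedure",
    # Linked service pointing at the database that owns the procedure.
    "linkedServiceName": {
        "referenceName": "MyAzureSqlLinkedService",  # hypothetical
        "type": "LinkedServiceReference",
    },
    "typeProperties": {
        "storedProcedureName": "usp_refresh_daily_totals",  # hypothetical
        # Each parameter carries a value and, optionally, a type.
        "storedProcedureParameters": {
            "batchId": {"value": "1", "type": "Int"},
            "region": {"value": "emea"},
        },
    },
}

print(json.dumps(stored_procedure_activity, indent=2))
```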
The Data Lake Analytics U-SQL activity runs a U-SQL script on an Azure Data Lake Analytics cluster. See Data Lake Analytics U-SQL activity article for details.
The Azure Synapse Notebook Activity in a Synapse pipeline runs a Synapse notebook in your Azure Synapse workspace. See Transform data by running an Azure Synapse notebook.
The Azure Databricks Notebook Activity in a pipeline runs a Databricks notebook in your Azure Databricks workspace. Azure Databricks is a managed platform for running Apache Spark. See Transform data by running a Databricks notebook.
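A Databricks Notebook activity could be defined along these lines. The sketch assumes the `DatabricksNotebook` activity type with the `notebookPath` and `baseParameters` properties from the Databricks Notebook activity reference. The linked service name, notebook path, and parameter names are hypothetical.

```python
import json

notebook_activity = {
    "name": "ScoreWithNotebook",  # hypothetical activity name
    "type": "DatabricksNotebook",
    # Linked service for the Azure Databricks workspace.
    "linkedServiceName": {
        "referenceName": "MyDatabricksLinkedService",  # hypothetical
        "type": "LinkedServiceReference",
    },
    "typeProperties": {
        # Absolute path of the notebook inside the workspace.
        "notebookPath": "/Shared/transform/score_customers",
        # Key/value pairs passed to the notebook as parameters.
        "baseParameters": {
            "inputPath": "input/folder1/",
            "outputPath": "output/",
        },
    },
}

print(json.dumps(notebook_activity, indent=2))
```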
The Azure Databricks Jar Activity in a pipeline runs a Spark Jar in your Azure Databricks cluster. Azure Databricks is a managed platform for running Apache Spark. See Transform data by running a Jar activity in Azure Databricks.
The Azure Databricks Python Activity in a pipeline runs a Python file in your Azure Databricks cluster. Azure Databricks is a managed platform for running Apache Spark. See Transform data by running a Python activity in Azure Databricks.
If you need to transform data in a way that is not supported by Data Factory, you can create a custom activity with your own data processing logic and use the activity in the pipeline. You can configure the custom .NET activity to run using either an Azure Batch service or an Azure HDInsight cluster. See Use custom activities article for details.
You can create a custom activity to run R scripts on your HDInsight cluster with R installed. See Run R Script using Azure Data Factory and Synapse pipelines.
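You create a linked service for the compute environment and then use the linked service when defining a transformation activity. There are two supported types of compute environments: on-demand, where the service creates and manages the compute environment for you, and bring your own, where you register an existing compute environment as a linked service.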
See Compute Linked Services article to learn about supported compute services.
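The relationship between a compute linked service and a transformation activity can be sketched as follows. The example assumes a bring-your-own HDInsight linked service (type `HDInsight` with `clusterUri`, `userName`, and `password` properties, as in the compute linked services reference). All names are hypothetical, and the angle-bracket values are placeholders to fill in.

```python
import json

# Linked service describing an existing HDInsight cluster (placeholder values).
hdinsight_linked_service = {
    "name": "MyHDInsightLinkedService",  # hypothetical
    "properties": {
        "type": "HDInsight",
        "typeProperties": {
            "clusterUri": "https://<cluster-name>.azurehdinsight.net/",
            "userName": "<user-name>",
            "password": {"type": "SecureString", "value": "<password>"},
        },
    },
}


def activity_on(linked_service, name, activity_type):
    """Build an activity skeleton that runs on the given compute linked service."""
    return {
        "name": name,
        "type": activity_type,
        # The activity refers to the linked service by name only.
        "linkedServiceName": {
            "referenceName": linked_service["name"],
            "type": "LinkedServiceReference",
        },
        "typeProperties": {},  # activity-specific settings omitted in this sketch
    }


pig_activity = activity_on(hdinsight_linked_service, "CleanWithPig", "HDInsightPig")
print(json.dumps(pig_activity, indent=2))
```

Because the activity holds only a reference, several activities can share one linked service, and the compute details can change without editing each activity.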
See the following tutorial for an example of using a transformation activity: Tutorial: transform data using Spark