Understand Apache Spark for U-SQL developers

Important

Azure Data Lake Analytics retired on 29 February 2024. Learn more with this announcement.

For data analytics, your organization can use Azure Synapse Analytics or Microsoft Fabric.

Microsoft supports several analytics services, such as Azure Databricks, Azure HDInsight, and Azure Data Lake Analytics. We hear from developers that they have a clear preference for open-source solutions as they build analytics pipelines. To help U-SQL developers understand Apache Spark, and how to transform U-SQL scripts to Apache Spark, we've created this guidance.

It includes the steps you can take, and several alternatives.

Steps to transform U-SQL to Apache Spark

  1. Transform your job orchestration pipelines.

    If you use Azure Data Factory to orchestrate your Azure Data Lake Analytics scripts, you have to adjust your pipelines to orchestrate the new Spark programs.

  2. Understand the differences between how U-SQL and Spark manage data.

    Azure Data Lake Analytics only supports Azure Data Lake Storage Gen1. If you want to move your data from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2, you have to copy both the file data and the catalog-maintained data. For more information, see Understand Spark data formats.

  3. Transform your U-SQL scripts to Spark.

    Before transforming your U-SQL scripts, you have to choose an analytics service. Some of the available compute services are:

    • Azure Data Factory DataFlow: Mapping data flows are visually designed data transformations that let data engineers develop graphical data transformation logic without writing code. While not suited to executing complex user code, they can easily represent traditional SQL-like dataflow transformations.
    • Azure HDInsight Hive: Apache Hive on HDInsight is suited to Extract, Transform, and Load (ETL) operations. This means you're going to translate your U-SQL scripts to Apache Hive.
    • Apache Spark engines such as Azure HDInsight Spark or Azure Databricks: This means you're going to translate your U-SQL scripts to Spark. For more information, see Understand Spark data formats.
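For step 1, it can help to first inventory which pipeline activities still run U-SQL. The following is a minimal sketch, assuming you have exported your Azure Data Factory pipeline definitions as JSON. The sample pipeline, its activity names, and the helper `find_usql_activities` are hypothetical and for illustration only. The sketch only inspects top-level activities, not activities nested inside control-flow containers.

```python
import json

# Activity type string that Azure Data Factory uses for U-SQL activities.
USQL_ACTIVITY_TYPE = "DataLakeAnalyticsU-SQL"


def find_usql_activities(pipeline: dict) -> list:
    """Return the names of top-level U-SQL activities in a pipeline definition.

    Hypothetical helper: nested activities (for example, inside a ForEach)
    are not inspected in this sketch.
    """
    activities = pipeline.get("properties", {}).get("activities", [])
    return [
        activity.get("name", "<unnamed>")
        for activity in activities
        if activity.get("type") == USQL_ACTIVITY_TYPE
    ]


# Hypothetical, trimmed-down pipeline export used only for illustration.
SAMPLE_PIPELINE = json.loads("""
{
  "name": "DailyAggregation",
  "properties": {
    "activities": [
      {"name": "CopyRawLogs", "type": "Copy"},
      {"name": "AggregateLogs", "type": "DataLakeAnalyticsU-SQL"},
      {"name": "ScoreSessions", "type": "DataLakeAnalyticsU-SQL"}
    ]
  }
}
""")

if __name__ == "__main__":
    for name in find_usql_activities(SAMPLE_PIPELINE):
        print(f"Needs a Spark replacement: {name}")
```

Each activity the sketch reports is a candidate for replacement with an activity that runs the equivalent Spark program on the service you choose in step 3.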

Caution

Both Azure Databricks and Azure HDInsight Spark are cluster services and not serverless jobs like Azure Data Lake Analytics. You will have to consider how to provision the clusters to get the appropriate cost/performance ratio and how to manage their lifetime to minimize your costs. These services have different performance characteristics with user code written in .NET, so you will have to either write wrappers or rewrite your code in a supported language. For more information, see Understand Spark data formats, Understand Apache Spark code concepts for U-SQL developers, and .NET for Apache Spark.
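To illustrate why cluster lifetime matters for cost, here is a rough back-of-the-envelope sketch. The node count, hourly price, and running hours are hypothetical placeholders, not actual Azure prices, and the function `monthly_cluster_cost` is an assumption made for this example. It ignores storage, cluster startup time, and any service-specific charges.

```python
def monthly_cluster_cost(nodes: int, price_per_node_hour: float,
                         hours_per_day: float, days: int = 30) -> float:
    """Estimate monthly compute cost for a cluster that runs a fixed number of hours per day."""
    return nodes * price_per_node_hour * hours_per_day * days


# Hypothetical figures for illustration only; substitute your own pricing.
NODES = 4
PRICE_PER_NODE_HOUR = 0.50  # placeholder currency units per node per hour

# Cluster left running around the clock.
always_on = monthly_cluster_cost(NODES, PRICE_PER_NODE_HOUR, hours_per_day=24)

# Cluster created for a 3-hour daily batch window and deleted afterwards.
batch_window = monthly_cluster_cost(NODES, PRICE_PER_NODE_HOUR, hours_per_day=3)

if __name__ == "__main__":
    print(f"Always-on cluster:    {always_on:.2f}")
    print(f"Batch-window cluster: {batch_window:.2f}")
```

Under these assumed numbers, the always-on cluster costs eight times as much as the cluster that exists only for the batch window. This is why cluster lifetime needs explicit planning once you leave a serverless job model.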

Next steps