Migration from AWS EMR to Azure

BoazD 0 Reputation points

We are trying to move our Spark steps code from an AWS EMR cluster to Azure. We are using the add-steps option with command-runner.jar in EMR. Each step launches a Python script that reads a large text file from S3 storage and manipulates it with Spark.

Example for EMR step (from bash script):

aws emr add-steps --cluster-id $cluster_id --steps '[{"Args":["spark-submit","--deploy-mode","cluster","'$python_code_path'","--param1","'$param1'"],"Type":"CUSTOM_JAR","ActionOnFailure":"TERMINATE_CLUSTER","Jar":"command-runner.jar","Properties":"","Name":"'$app_name'"}]'

Is there a similar way in Azure to run these Spark jobs using only Python/PySpark scripts, the way EMR does with its built-in command-runner.jar? (The command-runner.jar is delivered automatically by Amazon.)

Azure HDInsight
An Azure managed cluster service for open-source analytics.
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

1 answer

Sort by: Most helpful
  1. PRADEEPCHEEKATLA-MSFT 76,361 Reputation points Microsoft Employee

    Hello @BoazD, thanks for the question and for using the MS Q&A platform.

    Yes, Azure provides a comparable way to run Spark jobs from Python scripts. Azure Databricks is a managed Spark service on Azure, and it fills the role that EMR's "command-runner.jar" steps play in your current setup.

    To migrate your Spark steps code from AWS EMR to Azure Databricks, you would need to follow these high-level steps:

    1. Create an Azure Databricks workspace: You can create an Azure Databricks workspace in the Azure portal, and you can configure the workspace to use your existing Azure subscription.
    2. Import your Spark code: You can import your existing Spark code into the Azure Databricks workspace. Azure Databricks supports various data sources, including S3 storage, which means you should be able to use the same text files stored in S3 that you used in AWS EMR.
    3. Create a Databricks cluster: You can create a Databricks cluster in the Azure portal, which provides the compute resources needed to run your Spark jobs. You can choose the cluster size and configuration based on your requirements.
    4. Create a Databricks notebook: You can create a Databricks notebook in the Azure Databricks workspace, which allows you to run Spark jobs using Python scripts. You can write your Spark code in the notebook, configure the cluster, and execute the notebook.
    5. Schedule the notebook: You can use Azure Databricks' scheduling feature to schedule the notebook to run at regular intervals, similar to AWS EMR's add-steps option. You can configure the scheduling settings to specify the frequency, start time, and end time of the job.
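
    The scheduling step above covers recurring runs; for the one-off, add-steps-style submission in your bash script, a closer analogue is a one-time run through the Databricks Jobs API. Below is a minimal sketch, assuming the Jobs API 2.0 `runs/submit` endpoint with a `spark_python_task`; the workspace URL, personal access token, DBFS script path, runtime version, and cluster sizing are placeholders, not values from this thread:

    ```python
    import json
    import urllib.request

    def build_run_payload(python_file, params, run_name):
        """Build a runs/submit payload roughly equivalent to the EMR
        add-steps spark-submit step in the question."""
        return {
            "run_name": run_name,
            "new_cluster": {
                # Example runtime and node type -- choose ones that fit your workload.
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
            },
            "spark_python_task": {
                "python_file": python_file,   # e.g. dbfs:/scripts/my_job.py
                "parameters": params,         # e.g. ["--param1", "value1"]
            },
        }

    def submit_run(host, token, payload):
        """POST the one-time run to <host>/api/2.0/jobs/runs/submit."""
        req = urllib.request.Request(
            f"{host}/api/2.0/jobs/runs/submit",
            data=json.dumps(payload).encode(),
            headers={
                "Authorization": f"Bearer {token}",
                "Content-Type": "application/json",
            },
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    # Example usage (all values are placeholders):
    # payload = build_run_payload("dbfs:/scripts/my_job.py",
    #                             ["--param1", "value1"], "my_app")
    # submit_run("https://adb-1234567890123456.7.azuredatabricks.net",
    #            "<personal-access-token>", payload)
    ```

    The cluster is created for the run and terminated afterwards, which mirrors the transient-cluster pattern many EMR add-steps pipelines use.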

    Azure Databricks provides a similar experience to AWS EMR, and it is a popular choice for running Spark jobs in the cloud. With Azure Databricks, you can take advantage of Azure's scalable and flexible compute resources and integrate with other Azure services, such as Azure Storage and Azure Event Hubs.
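
    On the S3 point from step 2: inside a Databricks notebook or job you can read the same files through the s3a connector once AWS credentials are configured. A minimal sketch follows; the configuration keys are the standard Hadoop s3a ones, but the bucket, path, and credentials are placeholders (cluster-level `spark.hadoop.fs.s3a.*` settings or a Databricks secret scope are common alternatives to setting them in code):

    ```python
    # Standard Hadoop s3a credential keys; the values are placeholders.
    S3A_CONF = {
        "fs.s3a.access.key": "<AWS_ACCESS_KEY_ID>",
        "fs.s3a.secret.key": "<AWS_SECRET_ACCESS_KEY>",
    }

    def apply_s3a_conf(spark, conf=S3A_CONF):
        """Apply the s3a credential settings to a running SparkSession."""
        for key, value in conf.items():
            spark.conf.set(key, value)

    # Inside a Databricks notebook or job, a SparkSession named `spark`
    # already exists, so usage would look like:
    #
    # apply_s3a_conf(spark)
    # df = spark.read.text("s3a://my-bucket/path/large_input.txt")  # hypothetical path
    ```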

    Hope this helps. Do let us know if you have any further queries.

    If this answers your query, do click Accept Answer and Yes for "was this answer helpful".