Apache Spark guidelines

This article points to guidance for running, monitoring, tuning, connecting, and storing data for Apache Spark on Azure HDInsight.

How do I run or submit Spark jobs?

| Option | Documents |
| --- | --- |
| VSCode | Use Spark & Hive Tools for Visual Studio Code |
| Jupyter Notebooks | Tutorial: Load data and run queries on an Apache Spark cluster in Azure HDInsight |
| IntelliJ | Tutorial: Use Azure Toolkit for IntelliJ to create Apache Spark applications for an HDInsight cluster |
| IntelliJ | Tutorial: Create a Scala Maven application for Apache Spark in HDInsight using IntelliJ |
| Zeppelin notebooks | Use Apache Zeppelin notebooks with Apache Spark cluster on Azure HDInsight |
| Remote job submission with Livy | Use Apache Spark REST API to submit remote jobs to an HDInsight Spark cluster |
| Apache Oozie | Oozie is a workflow and coordination system that manages Hadoop jobs. |
| Apache Livy | You can use Livy to run interactive Spark shells or submit batch jobs to be run on Spark. |
| Azure Data Factory for Apache Spark | The Spark activity in a Data Factory pipeline executes a Spark program on your own or on-demand HDInsight cluster. |
| Azure Data Factory for Apache Hive | The HDInsight Hive activity in a Data Factory pipeline executes Hive queries on your own or on-demand HDInsight cluster. |
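
As a rough illustration of the remote submission option, the sketch below builds (but does not send) a batch request for the Livy REST API. The `/livy/batches` endpoint pattern, basic authentication, and the `X-Requested-By` header reflect a typical HDInsight setup and are assumptions here; the cluster name, credentials, jar path, and class name are hypothetical placeholders.

```python
import base64
import json
import urllib.request


def build_livy_batch_request(cluster_name, user, password, jar_path, class_name, args=None):
    """Build, without sending, a POST request that submits a Spark batch job via Livy.

    Assumes the cluster exposes Livy at https://<cluster>.azurehdinsight.net/livy
    and accepts HTTP basic authentication (both assumptions, not taken from this article).
    """
    url = f"https://{cluster_name}.azurehdinsight.net/livy/batches"

    # Livy's batch API expects the application file; className and args are optional.
    payload = {"file": jar_path, "className": class_name}
    if args:
        payload["args"] = list(args)

    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Content-Type": "application/json",
        # Livy rejects POSTs without this header when CSRF protection is enabled.
        "X-Requested-By": user,
        "Authorization": f"Basic {token}",
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    req = build_livy_batch_request(
        "mycluster", "admin", "secret",
        "wasbs:///example/jars/app.jar", "com.example.App", ["10"],
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` would actually submit the job; keeping construction separate from sending makes the payload easy to inspect first.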

How do I monitor and debug Spark jobs?

| Option | Documents |
| --- | --- |
| Azure Toolkit for IntelliJ | Failure spark job debugging with Azure Toolkit for IntelliJ (preview) |
| Azure Toolkit for IntelliJ through SSH | Debug Apache Spark applications locally or remotely on an HDInsight cluster with Azure Toolkit for IntelliJ through SSH |
| Azure Toolkit for IntelliJ through VPN | Use Azure Toolkit for IntelliJ to debug Apache Spark applications remotely in HDInsight through VPN |
| Job graph on Apache Spark History Server | Use extended Apache Spark History Server to debug and diagnose Apache Spark applications |

How do I make my Spark jobs run more efficiently?

| Option | Documents |
| --- | --- |
| IO Cache | Improve performance of Apache Spark workloads using Azure HDInsight IO Cache (Preview) |
| Configuration options | Optimize Apache Spark jobs |

How do I connect to other Azure Services?

| Option | Documents |
| --- | --- |
| Apache Hive on HDInsight | Integrate Apache Spark and Apache Hive with the Hive Warehouse Connector |
| Apache HBase on HDInsight | Use Apache Spark to read and write Apache HBase data |
| Apache Kafka on HDInsight | Tutorial: Use Apache Spark Structured Streaming with Apache Kafka on HDInsight |
| Azure Cosmos DB | Azure Synapse Link for Azure Cosmos DB |

What are my storage options?

| Option | Documents |
| --- | --- |
| Azure Data Lake Storage Gen2 | Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters |
| Azure Data Lake Storage Gen1 | Use Azure Data Lake Storage Gen1 with Azure HDInsight clusters |
| Azure Blob Storage | Use Azure storage with Azure HDInsight clusters |
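
Each of these storage options is usually addressed from Spark through its own URI scheme. The helper below is a minimal sketch of the commonly used forms (`abfs` for Data Lake Storage Gen2, `adl` for Gen1, `wasbs` for Blob Storage); the function name, the `kind` labels, and the sample account and container names are hypothetical, and the linked documents remain the authority on the exact scheme a given cluster expects.

```python
def storage_uri(kind, account, path, container=None):
    """Return a storage URI for one of the three storage options.

    kind: "gen2", "gen1", or "blob" (labels invented for this sketch).
    """
    path = path.lstrip("/")
    if kind == "gen2":
        # Data Lake Storage Gen2: container (file system) plus the dfs endpoint.
        if not container:
            raise ValueError("Gen2 URIs need a container (file system) name")
        return f"abfs://{container}@{account}.dfs.core.windows.net/{path}"
    if kind == "gen1":
        # Data Lake Storage Gen1: account-level endpoint, no container.
        return f"adl://{account}.azuredatalakestore.net/{path}"
    if kind == "blob":
        # Blob Storage over TLS: container plus the blob endpoint.
        if not container:
            raise ValueError("Blob URIs need a container name")
        return f"wasbs://{container}@{account}.blob.core.windows.net/{path}"
    raise ValueError(f"unknown storage kind: {kind!r}")


if __name__ == "__main__":
    # Hypothetical account and container names.
    print(storage_uri("gen2", "mystore", "data/events.parquet", container="raw"))
    print(storage_uri("gen1", "mylake", "data/events.parquet"))
    print(storage_uri("blob", "mystore", "data/events.parquet", container="raw"))
```

A string built this way could then be handed to a Spark reader, for example as the path argument of `spark.read.parquet`, assuming the cluster is already configured with access to that account.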
