Run Apache Spark from the Spark Shell

An interactive Apache Spark Shell provides a REPL (read-evaluate-print loop) environment for running Spark commands one at a time and seeing the results immediately, which is useful for development and debugging. Spark provides shells for Scala, Python, and R.

Run an Apache Spark Shell

  1. Use the ssh command to connect to your cluster. Edit the following command by replacing CLUSTERNAME with the name of your cluster, and then enter it:

    ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.net
    
  2. Spark provides shells for Scala (spark-shell) and Python (pyspark). In your SSH session, enter one of the following commands:

    spark-shell
    
    # Optional configurations
    # spark-shell --num-executors 4 --executor-memory 4g --executor-cores 2 --driver-memory 8g --driver-cores 4
    
    pyspark
    
    # Optional configurations
    # pyspark --num-executors 4 --executor-memory 4g --executor-cores 2 --driver-memory 8g --driver-cores 4
    

    If you intend to use any of the optional configurations, first review "OutOfMemoryError exception for Apache Spark".

  3. Try a few basic example commands. Choose the block for your language:

    // Scala (spark-shell)
    val textFile = spark.read.textFile("/example/data/fruits.txt")
    textFile.first()
    textFile.filter(line => line.contains("apple")).show()
    
    # Python (pyspark)
    textFile = spark.read.text("/example/data/fruits.txt")
    textFile.first()
    textFile.filter(textFile.value.contains("apple")).show()
    
  4. Query a CSV file. The following command works in both spark-shell and pyspark:

    spark.read.csv("/HdiSamples/HdiSamples/SensorSampleData/building/building.csv").show()
    
  5. Query a CSV file and store the results in a variable:

    // Scala (spark-shell)
    val data = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/HdiSamples/HdiSamples/SensorSampleData/building/building.csv")
    
    # Python (pyspark)
    data = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/HdiSamples/HdiSamples/SensorSampleData/building/building.csv")
    
  6. Display the results:

    // Scala (spark-shell)
    data.show()
    data.select($"BuildingID", $"Country").show(10)
    
    # Python (pyspark)
    data.show()
    data.select("BuildingID", "Country").show(10)
    
  7. Exit the shell. Enter :q in spark-shell or exit() in pyspark:

    :q
    
    exit()
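
The optional configuration shown in step 2 sets memory and cores for the driver and for each executor, and asking for more than the cluster can provide is one way to run into memory problems. As a rough illustration, the following Python sketch totals what those flags ask for. The helper name requested_resources is hypothetical, and the totals cover only the values passed on the command line; they exclude any additional memory overhead that Spark reserves per container, so treat them as a lower bound.

```python
def requested_resources(num_executors, executor_memory_gb, executor_cores,
                        driver_memory_gb, driver_cores):
    """Total memory (GB) and cores named by the shell flags (hypothetical helper)."""
    total_memory_gb = num_executors * executor_memory_gb + driver_memory_gb
    total_cores = num_executors * executor_cores + driver_cores
    return total_memory_gb, total_cores


# Values from the optional configuration in step 2:
#   --num-executors 4 --executor-memory 4g --executor-cores 2
#   --driver-memory 8g --driver-cores 4
memory_gb, cores = requested_resources(4, 4, 2, 8, 4)
print(f"{memory_gb} GB memory, {cores} cores")  # 24 GB memory, 12 cores
```

Compare these totals with the resources available on your cluster before you start the shell with custom values.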
    

SparkSession and SparkContext instances

When you start the Spark Shell, it automatically creates instances of SparkSession and SparkContext for you.

To access the SparkSession instance, enter spark. To access the SparkContext instance, enter sc.

Important shell parameters

The Spark Shell commands (spark-shell and pyspark) support many command-line parameters. To see the full list, start the Spark Shell with the --help switch. Some of these parameters may apply only to spark-submit, which the Spark Shell wraps.

| Switch | Description | Example |
|---|---|---|
| --master MASTER_URL | Specifies the master URL. In HDInsight, this value is always yarn. | --master yarn |
| --jars JAR_LIST | Comma-separated list of local jars to include on the driver and executor classpaths. In HDInsight, this list is composed of paths to the default filesystem in Azure Storage or Data Lake Storage. | --jars /path/to/examples.jar |
| --packages MAVEN_COORDS | Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths. Searches the local Maven repository, then Maven Central, then any additional remote repositories specified with --repositories. The format for the coordinates is groupId:artifactId:version. | --packages "com.microsoft.azure:azure-eventhubs:0.14.0" |
| --py-files LIST | For Python only, a comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH. | --py-files "samples.py" |
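
To make these switches concrete, here is a minimal Python sketch that assembles a shell command line from a few of them. The build_shell_command helper is hypothetical, and the jar path and package coordinate are simply the example values listed above, not recommendations. The sketch only builds and prints the command; you would still run the printed line in your SSH session.

```python
import shlex


def build_shell_command(shell="spark-shell", master="yarn",
                        jars=(), packages=(), py_files=()):
    """Assemble a Spark Shell command line (hypothetical helper)."""
    args = [shell, "--master", master]
    if jars:
        args += ["--jars", ",".join(jars)]  # comma-separated list
    if packages:
        args += ["--packages", ",".join(packages)]  # groupId:artifactId:version
    if py_files:
        if shell != "pyspark":
            raise ValueError("--py-files applies to Python only")
        args += ["--py-files", ",".join(py_files)]
    # shlex.join quotes any argument that needs it for the shell
    return shlex.join(args)


command = build_shell_command(
    jars=["/path/to/examples.jar"],
    packages=["com.microsoft.azure:azure-eventhubs:0.14.0"],
)
print(command)
```

Joining each list with commas mirrors the comma-separated format that --jars, --packages, and --py-files expect.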

Next steps