Tutorial: Create a Scala Maven application for Apache Spark in HDInsight using IntelliJ

In this tutorial, you learn how to create an Apache Spark application written in Scala using Apache Maven with IntelliJ IDEA. The article uses Apache Maven as the build system and starts with an existing Maven archetype for Scala provided by IntelliJ IDEA. Creating a Scala application in IntelliJ IDEA involves the following steps:

  • Use Maven as the build system.
  • Update the Project Object Model (POM) file to resolve Spark module dependencies.
  • Write your application in Scala.
  • Generate a jar file that can be submitted to HDInsight Spark clusters.
  • Run the application on an HDInsight Spark cluster using Livy.

In this tutorial, you learn how to:

  • Install the Scala plugin for IntelliJ IDEA
  • Use IntelliJ to develop a Scala Maven application
  • Create a standalone Scala project

Prerequisites

Install the Scala plugin for IntelliJ IDEA

Do the following steps to install the Scala plugin:

  1. Open IntelliJ IDEA.

  2. On the welcome screen, navigate to Configure>Plugins to open the Plugins window.

    Screenshot showing IntelliJ Welcome Screen.

  3. Select Install for Azure Toolkit for IntelliJ.

    Screenshot showing IntelliJ Azure Tool Kit.

  4. Select Install for the Scala plugin that is featured in the new window.

    Screenshot showing IntelliJ Scala Plugin.

  5. After the plugin installs successfully, you must restart the IDE.

Use IntelliJ to create the application

  1. Start IntelliJ IDEA, and select Create New Project to open the New Project window.

  2. Select Apache Spark/HDInsight from the left pane.

  3. Select Spark Project (Scala) from the main window.

  4. From the Build tool drop-down list, select one of the following values:

    • Maven for Scala project-creation wizard support.
    • SBT for managing the dependencies and building for the Scala project.

    Screenshot showing create application.

  5. Select Next.

  6. In the New Project window, provide the following information:

    | Property | Description |
    |----------|-------------|
    | Project name | Enter a name. |
    | Project location | Enter the location to save your project. |
    | Project SDK | This field will be blank on your first use of IDEA. Select New... and navigate to your JDK. |
    | Spark Version | The creation wizard integrates the proper version for the Spark SDK and Scala SDK. If the Spark cluster version is earlier than 2.0, select Spark 1.x. Otherwise, select Spark 2.x. This example uses Spark 2.3.0 (Scala 2.11.8). |

    Screenshot showing IntelliJ IDEA selecting the Spark SDK.

  7. Select Finish.

Create a standalone Scala project

  1. Start IntelliJ IDEA, and select Create New Project to open the New Project window.

  2. Select Maven from the left pane.

  3. Specify a Project SDK. If blank, select New... and navigate to the Java installation directory.

  4. Select the Create from archetype checkbox.

  5. From the list of archetypes, select org.scala-tools.archetypes:scala-archetype-simple. This archetype creates the right directory structure and downloads the required default dependencies to write a Scala program.

    Screenshot shows the selected archetype in the New Project window.

  6. Select Next.

  7. Expand Artifact Coordinates. Provide relevant values for GroupId and ArtifactId. Name and Location will autopopulate. The following values are used in this tutorial:

    • GroupId: com.microsoft.spark.example
    • ArtifactId: SparkSimpleApp

    Screenshot shows the Artifact Coordinates option in the New Project window.

  8. Select Next.

  9. Verify the settings and then select Next.

  10. Verify the project name and location, and then select Finish. The project will take a few minutes to import.

  11. Once the project has imported, from the left pane navigate to SparkSimpleApp>src>test>scala>com>microsoft>spark>example. Right-click MySpec, and then select Delete.... You don't need this file for the application. Select OK in the dialog box.

  12. In the later steps, you update the pom.xml to define the dependencies for the Spark Scala application. For those dependencies to be downloaded and resolved automatically, you must configure Maven.

  13. From the File menu, select Settings to open the Settings window.

  14. From the Settings window, navigate to Build, Execution, Deployment>Build Tools>Maven>Importing.

  15. Select the Import Maven projects automatically checkbox.

  16. Select Apply, and then select OK. You'll then be returned to the project window.

    :::image type="content" source="./media/apache-spark-create-standalone-application/configure-maven-download.png" alt-text="Configure Maven for automatic downloads." border="true":::
    
  17. From the left pane, navigate to src>main>scala>com.microsoft.spark.example, and then double-click App to open App.scala.

  18. Replace the existing sample code with the following code and save the changes. This code reads the data from HVAC.csv (available on all HDInsight Spark clusters), retrieves the rows that have only one digit in the seventh column, and writes the output to /HVACOut under the default storage container for the cluster. A small standalone illustration of the filter logic follows the listing.

    package com.microsoft.spark.example
    
    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    
    /**
      * Test IO to wasb
      */
    object WasbIOTest {
        def main (arg: Array[String]): Unit = {
            val conf = new SparkConf().setAppName("WASBIOTest")
            val sc = new SparkContext(conf)
    
            val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
    
            //find the rows which have only one digit in the 7th column in the CSV
            val rdd1 = rdd.filter(s => s.split(",")(6).length() == 1)
    
            rdd1.saveAsTextFile("wasb:///HVACout")
        }
    }
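
    To make the filter step concrete, the short standalone snippet below applies the same split-and-filter check to a single CSV row. The row contents are made up for illustration and aren't taken from the real HVAC.csv.

    // Minimal sketch of the filter logic on one hypothetical row
    object FilterExample {
        def main(args: Array[String]): Unit = {
            val row = "6/1/13,0:00:01,66,58,13,20,4"   // hypothetical sensor reading
            val seventhColumn = row.split(",")(6)      // index 6 is the seventh column: "4"
            println(seventhColumn.length == 1)         // prints true, so this row would be kept
        }
    }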
    
  19. In the left pane, double-click pom.xml.

  20. Within <project>\<properties> add the following segments:

    <scala.version>2.11.8</scala.version>
    <scala.compat.version>2.11.8</scala.compat.version>
    <scala.binary.version>2.11</scala.binary.version>
    
  21. Within <project>\<dependencies> add the following segments:

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>2.3.0</version>
    </dependency>
    
    Save changes to pom.xml.
    
  22. Create the .jar file. IntelliJ IDEA enables creation of JAR as an artifact of a project. Do the following steps.

    1. From the File menu, select Project Structure....

    2. From the Project Structure window, navigate to Artifacts>the plus symbol +>JAR>From modules with dependencies....

      Screenshot showing IntelliJ IDEA project structure add jar.

    3. In the Create JAR from Modules window, select the folder icon in the Main Class text box.

    4. In the Select Main Class window, select the class that appears by default and then select OK.

      Screenshot showing IntelliJ IDEA project structure select class.

    5. In the Create JAR from Modules window, ensure the extract to the target JAR option is selected, and then select OK. This setting creates a single JAR with all dependencies.

      Screenshot showing IntelliJ IDEA project structure jar from module.

    6. The Output Layout tab lists all the jars that are included as part of the Maven project. You can select and delete the ones on which the Scala application has no direct dependency. For the application you're creating here, you can remove all but the last one (SparkSimpleApp compile output). Select the jars to delete and then select the negative symbol -.

      Screenshot showing IntelliJ IDEA project structure delete output.

      Ensure the Include in project build checkbox is selected. This option ensures that the jar is created every time the project is built or updated. Select Apply, and then select OK.

    7. To create the jar, navigate to Build>Build Artifacts>Build. The project will compile in about 30 seconds. The output jar is created under \out\artifacts.

      Screenshot showing IntelliJ IDEA project artifact output.

Run the application on the Apache Spark cluster

To run the application on the cluster, copy the application jar to the storage account associated with the cluster, and then submit it as a batch job through the cluster's Livy endpoint.
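
For example, the sketch below submits the jar to the cluster's Livy REST endpoint as a batch job. It's a minimal sketch rather than the only way to submit the job: the cluster name, login account, and storage path are placeholder values you must replace, and it assumes SparkSimpleApp.jar has already been copied to the cluster's default storage at wasb:///example/jars/SparkSimpleApp.jar.

    import java.net.URI
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}
    import java.nio.charset.StandardCharsets
    import java.util.Base64

    // Sketch only: replace the cluster name, login, and jar path with your own values.
    object SubmitWithLivy {
        def main(args: Array[String]): Unit = {
            val cluster  = "mysparkcluster"                            // placeholder cluster name
            val user     = "admin"                                     // cluster login account
            val password = sys.env.getOrElse("HDINSIGHT_PASSWORD", "") // read the password from an environment variable

            // Livy batch request: "file" must point to storage the cluster can read,
            // for example the default container where you copied SparkSimpleApp.jar.
            val payload =
                """{
                  |  "file": "wasb:///example/jars/SparkSimpleApp.jar",
                  |  "className": "com.microsoft.spark.example.WasbIOTest"
                  |}""".stripMargin

            val auth = Base64.getEncoder.encodeToString(
                s"$user:$password".getBytes(StandardCharsets.UTF_8))

            val request = HttpRequest.newBuilder()
                .uri(URI.create(s"https://$cluster.azurehdinsight.net/livy/batches"))
                .header("Content-Type", "application/json")
                .header("Authorization", s"Basic $auth")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build()

            val response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
            println(s"Livy responded with ${response.statusCode()}: ${response.body()}")
        }
    }

If the request succeeds, Livy returns a JSON description of the new batch, and once the job finishes the application's output appears in the cluster's default storage container.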

Clean up resources

If you're not going to continue to use this application, delete the cluster that you created with the following steps:

  1. Sign in to the Azure portal.

  2. In the Search box at the top, type HDInsight.

  3. Select HDInsight clusters under Services.

  4. In the list of HDInsight clusters that appears, select the ... next to the cluster that you created for this tutorial.

  5. Select Delete. Select Yes.

Screenshot showing how to delete an HDInsight cluster via the Azure portal.

Next step

In this article, you learned how to create an Apache Spark Scala application. Advance to the next article to learn how to run this application on an HDInsight Spark cluster using Livy.