Quickstart: Create an Apache Spark cluster in Azure HDInsight using an ARM template

2024-09-06

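In this quickstart, you use an Azure Resource Manager template (ARM template) to create an Apache Spark cluster in Azure HDInsight. You then create a Jupyter Notebook file and use it to run Spark SQL queries against Apache Hive tables. Azure HDInsight is a managed, full-spectrum, open-source analytics service for enterprises. The Apache Spark framework for HDInsight uses in-memory processing to speed up data analytics and cluster computing. Jupyter Notebook lets you interact with your data, combine code with Markdown text, and perform simple visualizations.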

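If you're using multiple clusters together, you'll want to create a virtual network, and if you're using a Spark cluster you'll also want to use the Hive Warehouse Connector. For more information, see Plan a virtual network for Azure HDInsight and Integrate Apache Spark and Apache Hive with the Hive Warehouse Connector.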

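An Azure Resource Manager template is a JavaScript Object Notation (JSON) file that defines the infrastructure and configuration for your project. The template uses declarative syntax: you describe the intended deployment without writing the sequence of programming commands that creates it.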

If your environment meets the prerequisites and you're familiar with ARM templates, select the [Deploy to Azure] button. The template opens in the Azure portal.

Prerequisites

If you don't have an Azure subscription, create a free account before you begin.

Review the template

The template used in this quickstart is from Azure Quickstart Templates.

{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "metadata": {
    "_generator": {
      "name": "bicep",
      "version": "0.5.6.12127",
      "templateHash": "4742950082151195489"
    }
  },
  "parameters": {
    "clusterName": {
      "type": "string",
      "metadata": {
        "description": "The name of the HDInsight cluster to create."
      }
    },
    "clusterLoginUserName": {
      "type": "string",
      "maxLength": 20,
      "minLength": 2,
      "metadata": {
        "description": "These credentials can be used to submit jobs to the cluster and to log into cluster dashboards. The username must consist of digits, upper or lowercase letters, and/or the following special characters: (!#$%&'()-^_`{}~)."
      }
    },
    "clusterLoginPassword": {
      "type": "secureString",
      "minLength": 10,
      "metadata": {
        "description": "The password must be at least 10 characters in length and must contain at least one digit, one upper case letter, one lower case letter, and one non-alphanumeric character except (single-quote, double-quote, backslash, right-bracket, full-stop). Also, the password must not contain 3 consecutive characters from the cluster username or SSH username."
      }
    },
    "sshUserName": {
      "type": "string",
      "minLength": 2,
      "metadata": {
        "description": "These credentials can be used to remotely access the cluster. The sshUserName can only consist of digits, upper or lowercase letters, and/or the following special characters (%&'^_`{}~). Also, it cannot be the same as the cluster login username or a reserved word"
      }
    },
    "sshPassword": {
      "type": "secureString",
      "maxLength": 72,
      "minLength": 6,
      "metadata": {
        "description": "SSH password must be 6-72 characters long and must contain at least one digit, one upper case letter, and one lower case letter.  It must not contain any 3 consecutive characters from the cluster login name"
      }
    },
    "location": {
      "type": "string",
      "defaultValue": "[resourceGroup().location]",
      "metadata": {
        "description": "Location for all resources."
      }
    },
    "headNodeVirtualMachineSize": {
      "type": "string",
      "defaultValue": "Standard_E8_v3",
      "allowedValues": [
        "Standard_A4_v2",
        "Standard_A8_v2",
        "Standard_E2_v3",
        "Standard_E4_v3",
        "Standard_E8_v3",
        "Standard_E16_v3",
        "Standard_E20_v3",
        "Standard_E32_v3",
        "Standard_E48_v3"
      ],
      "metadata": {
        "description": "This is the headnode Azure Virtual Machine size, and will affect the cost. If you don't know, just leave the default value."
      }
    },
    "workerNodeVirtualMachineSize": {
      "type": "string",
      "defaultValue": "Standard_E8_v3",
      "allowedValues": [
        "Standard_A4_v2",
        "Standard_A8_v2",
        "Standard_E2_v3",
        "Standard_E4_v3",
        "Standard_E8_v3",
        "Standard_E16_v3",
        "Standard_E20_v3",
        "Standard_E32_v3",
        "Standard_E48_v3"
      ],
      "metadata": {
        "description": "This is the workernode Azure Virtual Machine size, and will affect the cost. If you don't know, just leave the default value."
      }
    }
  },
  "resources": [
    {
      "type": "Microsoft.Storage/storageAccounts",
      "apiVersion": "2021-08-01",
      "name": "[format('storage{0}', uniqueString(resourceGroup().id))]",
      "location": "[parameters('location')]",
      "sku": {
        "name": "Standard_LRS"
      },
      "kind": "StorageV2"
    },
    {
      "type": "Microsoft.HDInsight/clusters",
      "apiVersion": "2021-06-01",
      "name": "[parameters('clusterName')]",
      "location": "[parameters('location')]",
      "properties": {
        "clusterVersion": "4.0",
        "osType": "Linux",
        "tier": "Standard",
        "clusterDefinition": {
          "kind": "spark",
          "configurations": {
            "gateway": {
              "restAuthCredential.isEnabled": true,
              "restAuthCredential.username": "[parameters('clusterLoginUserName')]",
              "restAuthCredential.password": "[parameters('clusterLoginPassword')]"
            }
          }
        },
        "storageProfile": {
          "storageaccounts": [
            {
              "name": "[replace(replace(reference(resourceId('Microsoft.Storage/storageAccounts', format('storage{0}', uniqueString(resourceGroup().id)))).primaryEndpoints.blob, 'https://', ''), '/', '')]",
              "isDefault": true,
              "container": "[parameters('clusterName')]",
              "key": "[listKeys(resourceId('Microsoft.Storage/storageAccounts', format('storage{0}', uniqueString(resourceGroup().id))), '2021-08-01').keys[0].value]"
            }
          ]
        },
        "computeProfile": {
          "roles": [
            {
              "name": "headnode",
              "targetInstanceCount": 2,
              "hardwareProfile": {
                "vmSize": "[parameters('headNodeVirtualMachineSize')]"
              },
              "osProfile": {
                "linuxOperatingSystemProfile": {
                  "username": "[parameters('sshUserName')]",
                  "password": "[parameters('sshPassword')]"
                }
              }
            },
            {
              "name": "workernode",
              "targetInstanceCount": 2,
              "hardwareProfile": {
                "vmSize": "[parameters('workerNodeVirtualMachineSize')]"
              },
              "osProfile": {
                "linuxOperatingSystemProfile": {
                  "username": "[parameters('sshUserName')]",
                  "password": "[parameters('sshPassword')]"
                }
              }
            }
          ]
        }
      },
      "dependsOn": [
        "[resourceId('Microsoft.Storage/storageAccounts', format('storage{0}', uniqueString(resourceGroup().id)))]"
      ]
    }
  ],
  "outputs": {
    "storage": {
      "type": "object",
      "value": "[reference(resourceId('Microsoft.Storage/storageAccounts', format('storage{0}', uniqueString(resourceGroup().id))))]"
    },
    "cluster": {
      "type": "object",
      "value": "[reference(resourceId('Microsoft.HDInsight/clusters', parameters('clusterName')))]"
    }
  }
}

The template defines two Azure resources:

Microsoft.Storage/storageAccounts: creates an Azure storage account.
Microsoft.HDInsight/clusters: creates an HDInsight cluster.
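
To see how the two resources relate, it can help to read the template programmatically. The following is a minimal sketch, assuming a trimmed stand-in for the template (only the `type`, `apiVersion`, and `dependsOn` fields are kept, and the `dependsOn` entry is a placeholder string rather than the real expression). The function name `summarize_resources` is hypothetical.

```python
import json

# Trimmed stand-in for the template above: only the fields this sketch reads.
# The dependsOn entry is a placeholder, not the real resourceId() expression.
TEMPLATE_TEXT = """
{
  "resources": [
    {"type": "Microsoft.Storage/storageAccounts", "apiVersion": "2021-08-01"},
    {
      "type": "Microsoft.HDInsight/clusters",
      "apiVersion": "2021-06-01",
      "dependsOn": ["<storage account resource ID expression>"]
    }
  ]
}
"""


def summarize_resources(template_text):
    """Return (type, apiVersion, has_dependencies) for each template resource."""
    template = json.loads(template_text)
    return [
        (res["type"], res["apiVersion"], bool(res.get("dependsOn")))
        for res in template.get("resources", [])
    ]


summary = summarize_resources(TEMPLATE_TEXT)
for resource_type, api_version, has_deps in summary:
    print(f"{resource_type} (API {api_version}), has dependsOn: {has_deps}")
```

The cluster is the only resource with a `dependsOn` entry, which is why the storage account is deployed first.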

Deploy the template

Select the [Deploy to Azure] button below to sign in to Azure and open the ARM template.

Enter or select the following values:

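Property	Description
Subscription	From the drop-down list, select the Azure subscription used for the cluster.
Resource group	From the drop-down list, select an existing resource group, or select [Create new].
Location	The value is autopopulated with the location used for the resource group.
Cluster Name	Enter a globally unique name. For this template, use only lowercase letters and numbers.
Cluster Login User Name	Provide the username. The default is `admin`.
Cluster Login Password	Provide a password. The password must be at least 10 characters long and must contain at least one digit, one uppercase letter, one lowercase letter, and one non-alphanumeric character (except the characters ' ` ").
SSH User Name	Provide the username. The default is `sshuser`.
SSH Password	Provide the password.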
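
Checking values locally before deploying can save a failed 20-minute deployment. The sketch below covers only the length and character-class rules stated in the template's parameter constraints and the table above. It does not check the excluded special characters or the rule about three consecutive characters from the username, and Azure may enforce further rules. The function names and the sample values are hypothetical.

```python
import re


def missing_classes(password, require_symbol):
    """Return the names of required character classes absent from password."""
    checks = {
        "digit": r"[0-9]",
        "uppercase letter": r"[A-Z]",
        "lowercase letter": r"[a-z]",
    }
    if require_symbol:
        checks["non-alphanumeric character"] = r"[^A-Za-z0-9]"
    return [name for name, pattern in checks.items()
            if not re.search(pattern, password)]


def check_parameters(params):
    """Return a list of problems found in the deployment parameter values."""
    problems = []
    if not re.fullmatch(r"[a-z0-9]+", params["clusterName"]):
        problems.append("clusterName: use only lowercase letters and numbers")
    if not 2 <= len(params["clusterLoginUserName"]) <= 20:
        problems.append("clusterLoginUserName: must be 2-20 characters")
    login_pw = params["clusterLoginPassword"]
    if len(login_pw) < 10:
        problems.append("clusterLoginPassword: must be at least 10 characters")
    for name in missing_classes(login_pw, require_symbol=True):
        problems.append(f"clusterLoginPassword: missing {name}")
    if len(params["sshUserName"]) < 2:
        problems.append("sshUserName: must be at least 2 characters")
    ssh_pw = params["sshPassword"]
    if not 6 <= len(ssh_pw) <= 72:
        problems.append("sshPassword: must be 6-72 characters")
    for name in missing_classes(ssh_pw, require_symbol=False):
        problems.append(f"sshPassword: missing {name}")
    return problems


# Hypothetical sample values, for illustration only.
good = {
    "clusterName": "sparkquick01",
    "clusterLoginUserName": "admin",
    "clusterLoginPassword": "Str0ng!Passw0rd",
    "sshUserName": "sshuser",
    "sshPassword": "Ssh4Cluster",
}
bad = dict(good, clusterName="Spark-01", sshPassword="abc")

print(check_parameters(good))
for problem in check_parameters(bad):
    print(problem)
```

An empty list means the values pass these particular checks, not that the deployment is guaranteed to accept them.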


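Review the terms and conditions. Select [I agree to the terms and conditions stated above], and then select [Purchase]. You receive a notification that your deployment is in progress. It takes about 20 minutes to create a cluster.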

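If you run into a problem creating the HDInsight cluster, you might not have the right permissions to do so. For more information, see Access control requirements.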

Review deployed resources

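After the cluster is created, you receive a deployment succeeded notification with a [Go to resource] link. The [Resource group] page lists your new HDInsight cluster and the default storage associated with the cluster. Each cluster has an Azure Storage or Azure Data Lake Storage Gen2 dependency, referred to as the default storage account. The HDInsight cluster and its default storage account must be colocated in the same Azure region. Deleting the cluster doesn't delete the storage account.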
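
The template links the cluster to its default storage account through the `storageProfile` entry, whose `name` expression is `replace(replace(<blob endpoint>, 'https://', ''), '/', '')`. The sketch below mirrors that expression in Python to show what value the cluster receives. The endpoint is a made-up example, since the real account name comes from `uniqueString(resourceGroup().id)` and can't be predicted here.

```python
def storage_host_from_blob_endpoint(blob_endpoint):
    """Mirror replace(replace(endpoint, 'https://', ''), '/', '') from the template."""
    return blob_endpoint.replace("https://", "").replace("/", "")


# Hypothetical endpoint for illustration; the real one is produced at deployment.
example_endpoint = "https://storageexample123.blob.core.windows.net/"
host = storage_host_from_blob_endpoint(example_endpoint)
print(host)  # storageexample123.blob.core.windows.net
```

Because both resources take their `location` from the same `location` parameter, the template also keeps the cluster and its default storage account in one region.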

Create a Jupyter Notebook file

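Jupyter Notebook is an interactive notebook environment that supports various programming languages. You can use a Jupyter Notebook file to interact with your data, combine code with Markdown text, and perform simple visualizations.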

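Open the Azure portal.
Select [HDInsight clusters], and then select the cluster you created.
From the portal, in the [Cluster dashboards] section, select [Jupyter Notebook]. If prompted, enter the cluster login credentials for the cluster.
Select [New] > [PySpark] to create a notebook.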

A new notebook is created and opened with the name Untitled (Untitled.ipynb).

Run Apache Spark SQL statements

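SQL (Structured Query Language) is the most common and widely used language for querying and transforming data. Spark SQL works as an extension to Apache Spark that lets you process structured data using familiar SQL syntax.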

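Verify that the kernel is ready. The kernel is ready when you see a hollow circle next to the kernel name in the notebook. A solid circle means the kernel is busy.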


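When you start the notebook for the first time, the kernel performs some tasks in the background. Wait for the kernel to be ready.
Paste the following code into an empty cell, and then press SHIFT + ENTER to run the code. The command lists the Hive tables on the cluster: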
```
%%sql
SHOW TABLES
```
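When you use a Jupyter Notebook file with your HDInsight cluster, you get a preset spark session that you can use to run Hive queries using Spark SQL. %%sql tells Jupyter Notebook to use the preset spark session to run the Hive query. Every HDInsight cluster includes a sample Hive table, hivesampletable, by default; a later step retrieves its first 10 rows. The first time you submit a query, Jupyter creates a Spark application for the notebook, which takes about 30 seconds to complete. Once the Spark application is ready, the query runs in about a second and displays the results in the notebook.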


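Every time you run a query in Jupyter, your web browser window title shows a (Busy) status along with the notebook title. You also see a solid circle next to the PySpark text in the top-right corner.
Run another query to see the data in hivesampletable.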
```
%%sql
SELECT * FROM hivesampletable LIMIT 10
```
The screen refreshes to display the query output.

From the [File] menu on the notebook, select [Close and Halt]. Shutting down the notebook releases the cluster resources, including the Spark application.
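
The `%%sql` cells in the steps above can also be written as ordinary PySpark code, which is handy when the result feeds further Python processing. The following is a minimal sketch, assuming the notebook kernel exposes a Spark session object with a `sql` method. The helper names are hypothetical, and `_RecordingSession` is only a stand-in so the example runs without a cluster.

```python
import re


def build_preview_query(table, limit=10):
    """Build a SELECT ... LIMIT query, rejecting anything but a plain table name."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"unexpected table name: {table!r}")
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return f"SELECT * FROM {table} LIMIT {int(limit)}"


def preview_table(session, table, limit=10):
    """Run the preview query through any object that offers a sql(query) method."""
    return session.sql(build_preview_query(table, limit))


class _RecordingSession:
    """Stand-in for a Spark session: records queries instead of running them."""

    def __init__(self):
        self.queries = []

    def sql(self, query):
        self.queries.append(query)
        return query


demo = _RecordingSession()
preview_table(demo, "hivesampletable")
print(demo.queries[0])  # SELECT * FROM hivesampletable LIMIT 10
```

In a PySpark notebook cell, the call would look like `preview_table(spark, "hivesampletable").show()`, assuming the kernel provides the session under the name `spark`.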