Transform data using the Hadoop MapReduce activity in Azure Data Factory or Synapse Analytics

Article
12/09/2023

Applies to: Azure Data Factory, Azure Synapse Analytics

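Tip

Try out Data Factory in Microsoft Fabric, an all-in-one analytics solution for enterprises. Microsoft Fabric covers everything from data movement to data science, real-time analytics, business intelligence, and reporting. Learn how to start a new trial for free!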

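The HDInsight MapReduce activity in an Azure Data Factory or Synapse Analytics pipeline invokes a MapReduce program on your own or an on-demand HDInsight cluster. This article builds on the data transformation activities article, which presents a general overview of data transformation and the supported transformation activities.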

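If you are new to these services, read the introduction articles for Azure Data Factory and Synapse Analytics and complete the tutorial "Transform data" before reading this article.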

For details about running Pig/Hive scripts on an HDInsight cluster from a pipeline by using the HDInsight Pig and Hive activities, see Pig and Hive.

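Add an HDInsight MapReduce activity to a pipeline with the UI

To add an HDInsight MapReduce activity to a pipeline, complete the following steps:

1. Search for MapReduce in the pipeline Activities pane, and drag a MapReduce activity to the pipeline canvas.
2. Select the new MapReduce activity on the canvas if it is not already selected.
3. Select the HDI Cluster tab to select or create a linked service to the HDInsight cluster that will run the MapReduce activity.
4. Select the Jar tab to select or create a Jar linked service to the Azure Storage account that will host your script. Specify the name of the class to run and the file path within the storage location. You can also configure advanced details, including a Jar libs location, the debug configuration, and the arguments and parameters to pass to the script.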

Syntax

{
    "name": "Map Reduce Activity",
    "description": "Description",
    "type": "HDInsightMapReduce",
    "linkedServiceName": {
        "referenceName": "MyHDInsightLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "className": "org.myorg.SampleClass",
        "jarLinkedService": {
            "referenceName": "MyAzureStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "jarFilePath": "MyAzureStorage/jars/sample.jar",
        "getDebugInfo": "Failure",
        "arguments": [
            "-SampleHadoopJobArgument1"
        ],
        "defines": {
            "param1": "param1Value"
        }
    }
}
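
When several activities share the same linked services, the definition above can also be generated rather than written by hand. The following is a minimal Python sketch under stated assumptions: the helper name build_mapreduce_activity is hypothetical, it is not part of the service or of any Azure SDK, and it only assembles the properties shown in the syntax above with the standard json module.

```python
import json


def build_mapreduce_activity(name, hdi_linked_service, class_name, jar_file_path,
                             jar_linked_service=None, arguments=None,
                             defines=None, get_debug_info=None, description=None):
    """Return a dict shaped like the HDInsightMapReduce activity JSON.

    Hypothetical helper for illustration only: it assembles the documented
    properties and does not call any Azure API.
    """
    # className and jarFilePath are the required type properties.
    type_properties = {"className": class_name, "jarFilePath": jar_file_path}
    if jar_linked_service is not None:
        type_properties["jarLinkedService"] = {
            "referenceName": jar_linked_service,
            "type": "LinkedServiceReference",
        }
    if get_debug_info is not None:
        type_properties["getDebugInfo"] = get_debug_info
    if arguments:
        type_properties["arguments"] = list(arguments)
    if defines:
        type_properties["defines"] = dict(defines)

    activity = {
        "name": name,
        "type": "HDInsightMapReduce",
        "linkedServiceName": {
            "referenceName": hdi_linked_service,
            "type": "LinkedServiceReference",
        },
        "typeProperties": type_properties,
    }
    if description is not None:
        activity["description"] = description
    return activity


# Same values as the syntax sample above.
activity = build_mapreduce_activity(
    name="Map Reduce Activity",
    description="Description",
    hdi_linked_service="MyHDInsightLinkedService",
    class_name="org.myorg.SampleClass",
    jar_linked_service="MyAzureStorageLinkedService",
    jar_file_path="MyAzureStorage/jars/sample.jar",
    get_debug_info="Failure",
    arguments=["-SampleHadoopJobArgument1"],
    defines={"param1": "param1Value"},
)
print(json.dumps(activity, indent=4))
```

Optional properties are left out of the output when they are not supplied, so the generated JSON stays as small as a hand-written definition.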

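Syntax details

Property	Description	Required
name	Name of the activity	Yes
description	Text describing what the activity is used for	No
type	For the MapReduce activity, the activity type is HDInsightMapReduce	Yes
linkedServiceName	Reference to the HDInsight cluster registered as a linked service. To learn about this linked service, see the compute linked services article.	Yes
className	Name of the class to run	Yes
jarLinkedService	Reference to an Azure Storage linked service used to store the Jar files. Only Azure Blob Storage and ADLS Gen2 linked services are supported here. If you don't specify this linked service, the Azure Storage linked service defined in the HDInsight linked service is used.	No
jarFilePath	Path to the Jar file stored in the Azure Storage referred to by jarLinkedService. The file name is case-sensitive.	Yes
jarLibs	String array of paths to the Jar library files referenced by the job, stored in the Azure Storage defined in jarLinkedService. The file names are case-sensitive.	No
getDebugInfo	Specifies when the log files are copied to the Azure Storage used by the HDInsight cluster or specified by jarLinkedService. Allowed values: None, Always, or Failure. Default value: None.	No
arguments	Specifies an array of arguments for the Hadoop job. The arguments are passed as command-line arguments to each task.	No
defines	Specifies parameters as key/value pairs for referencing within the Hive script.	No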
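
The required properties and the allowed getDebugInfo values in the table above can be checked before a definition is published. The following Python sketch is an illustration only: check_mapreduce_activity is a hypothetical helper, not a service feature, and it assumes the nesting shown in the syntax section (className, jarFilePath, and getDebugInfo inside typeProperties).

```python
# Allowed values for getDebugInfo, as listed in the table above.
ALLOWED_DEBUG_INFO = {"None", "Always", "Failure"}


def check_mapreduce_activity(activity):
    """Return a list of problems; an empty list means none were found.

    Hypothetical local check based only on the property table above.
    """
    problems = []
    # Required top-level properties.
    for key in ("name", "type", "linkedServiceName"):
        if key not in activity:
            problems.append(f"missing required property: {key}")
    if "type" in activity and activity["type"] != "HDInsightMapReduce":
        problems.append("type must be HDInsightMapReduce")

    # Required properties nested under typeProperties.
    type_properties = activity.get("typeProperties", {})
    for key in ("className", "jarFilePath"):
        if key not in type_properties:
            problems.append(f"missing required property: typeProperties.{key}")

    # getDebugInfo is optional and defaults to None (the string).
    debug_info = type_properties.get("getDebugInfo", "None")
    if debug_info not in ALLOWED_DEBUG_INFO:
        problems.append("getDebugInfo must be None, Always, or Failure")
    return problems


good = {
    "name": "Map Reduce Activity",
    "type": "HDInsightMapReduce",
    "linkedServiceName": {
        "referenceName": "MyHDInsightLinkedService",
        "type": "LinkedServiceReference",
    },
    "typeProperties": {
        "className": "org.myorg.SampleClass",
        "jarFilePath": "MyAzureStorage/jars/sample.jar",
    },
}
bad = {
    "name": "Broken Activity",
    "type": "HDInsightMapReduce",
    "typeProperties": {"className": "org.myorg.SampleClass",
                       "getDebugInfo": "Sometimes"},
}

print(check_mapreduce_activity(good))
for problem in check_mapreduce_activity(bad):
    print(problem)
```

A check like this catches only structural mistakes; whether the referenced linked services and Jar files exist can be confirmed only by the service at run time.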

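Example

You can use the HDInsight MapReduce activity to run any MapReduce jar file on an HDInsight cluster. In the following sample JSON definition of a pipeline, the HDInsight activity is configured to run a Mahout JAR file.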

{
    "name": "MapReduce Activity for Mahout",
    "description": "Custom MapReduce to generate Mahout result",
    "type": "HDInsightMapReduce",
    "linkedServiceName": {
        "referenceName": "MyHDInsightLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "className": "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
        "jarLinkedService": {
            "referenceName": "MyStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "jarFilePath": "adfsamples/Mahout/jars/mahout-examples-0.9.0.2.2.7.1-34.jar",
        "arguments": [
            "-s",
            "SIMILARITY_LOGLIKELIHOOD",
            "--input",
            "wasb://adfsamples@spestore.blob.core.windows.net/Mahout/input",
            "--output",
            "wasb://adfsamples@spestore.blob.core.windows.net/Mahout/output/",
            "--maxSimilaritiesPerItem",
            "500",
            "--tempDir",
            "wasb://adfsamples@spestore.blob.core.windows.net/Mahout/temp/mahout"
        ]
    }
}

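You can specify any arguments for the MapReduce program in the arguments section. At run time, you also see a few extra arguments from the MapReduce framework (for example, mapreduce.job.tags). To distinguish your arguments from the MapReduce arguments, consider passing both the option and its value as arguments, as in the preceding example, where -s, --input, --output, and so on are options, each immediately followed by its value.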
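
The option-then-value convention used in the Mahout arguments can be kept consistent by generating the flat array from explicit pairs. The following Python sketch is an illustration under stated assumptions: to_arguments and to_pairs are hypothetical helper names, and the values are copied from the Mahout example.

```python
def to_arguments(options):
    """Flatten (option, value) pairs into the flat list used by "arguments"."""
    arguments = []
    for option, value in options:
        # Every value is emitted as a string, right after its option.
        arguments.extend([option, str(value)])
    return arguments


def to_pairs(arguments):
    """Group a flat argument list back into (option, value) pairs."""
    if len(arguments) % 2 != 0:
        raise ValueError("expected every option to be followed by a value")
    return list(zip(arguments[0::2], arguments[1::2]))


# Options and values taken from the Mahout example.
mahout_options = [
    ("-s", "SIMILARITY_LOGLIKELIHOOD"),
    ("--input", "wasb://adfsamples@spestore.blob.core.windows.net/Mahout/input"),
    ("--output", "wasb://adfsamples@spestore.blob.core.windows.net/Mahout/output/"),
    ("--maxSimilaritiesPerItem", 500),
    ("--tempDir", "wasb://adfsamples@spestore.blob.core.windows.net/Mahout/temp/mahout"),
]

arguments = to_arguments(mahout_options)
for option, value in to_pairs(arguments):
    print(option, value)
```

Keeping the pairs explicit makes it harder to drop a value by accident, and to_pairs can be applied to your own portion of the arguments when you inspect a run.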

For other ways to transform data, see the articles on the other supported transformation activities.
