PowerShell script - Transform data in the cloud by using Azure Data Factory

03/05/2024

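This sample PowerShell script creates a pipeline that transforms data in the cloud by running a Spark program on an Azure HDInsight Spark cluster.

Note

We recommend that you use the Azure Az PowerShell module to interact with Azure. To get started, see Install Azure PowerShell. To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

This sample requires Azure PowerShell. Run Get-Module -ListAvailable Az to find the installed version. If you need to install or upgrade, see Install Azure PowerShell module.

Run the Connect-AzAccount cmdlet to connect to Azure.

Prerequisites

Azure Storage account. You create a Python script and an input file, and upload them to Azure Storage. The output from the Spark program is stored in this storage account. The on-demand Spark cluster uses the same storage account as its primary storage.

Upload the Python script to your Blob storage account

Create a Python file named WordCount_Spark.py with the following content: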

import sys
from operator import add

from pyspark.sql import SparkSession

def main():
    spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()

    lines = spark.read.text("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/inputfiles/minecraftstory.txt").rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda x: x.split(' ')) \
        .map(lambda x: (x, 1)) \
        .reduceByKey(add)
    counts.saveAsTextFile("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/outputfiles/wordcount")

    spark.stop()

if __name__ == "__main__":
    main()

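Replace <storageaccountname> with the name of your Azure Storage account. Then save the file.
In Azure Blob storage, create a container named adftutorial if it does not exist.
Create a folder named spark.
Under the spark folder, create a subfolder named script.
Upload the WordCount_Spark.py file to the script subfolder.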
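
The folder layout above is what the pipeline definition in the sample script refers to through its rootPath and entryFilePath properties. The following is a minimal sketch of that mapping, not part of the original sample. The helper names wasbs_uri and split_entry_file are hypothetical and exist only for illustration.

```python
def wasbs_uri(storage_account, container, blob_path):
    """Build a wasbs:// URI in the same form used inside WordCount_Spark.py."""
    return f"wasbs://{container}@{storage_account}.blob.core.windows.net/{blob_path}"


def split_entry_file(container, folder, subfolder, file_name):
    """Return (rootPath, entryFilePath) for a script stored at
    <container>/<folder>/<subfolder>/<file_name>."""
    return f"{container}/{folder}", f"{subfolder}/{file_name}"


# Values taken from the steps above; the account name stays a placeholder.
root_path, entry_file_path = split_entry_file(
    "adftutorial", "spark", "script", "WordCount_Spark.py"
)
script_uri = wasbs_uri(
    "<storageaccountname>", "adftutorial", "spark/script/WordCount_Spark.py"
)

print(root_path)        # adftutorial/spark
print(entry_file_path)  # script/WordCount_Spark.py
print(script_uri)
```

If you choose a different container or folder name, the rootPath and entryFilePath values in the pipeline JSON would need to change to match.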

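Upload the input file

Create a file named minecraftstory.txt with some text. The Spark program counts the number of words in this text.
In the spark folder of the blob container, create a subfolder named inputfiles.
Upload minecraftstory.txt to the inputfiles subfolder.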
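
To check what the Spark program should produce for your input before you run the pipeline, you can reproduce its counting logic locally. This is a plain-Python sketch that mirrors the flatMap, map, and reduceByKey steps of WordCount_Spark.py without needing Spark. The function name word_count and the sample lines are assumptions made for illustration.

```python
from collections import Counter


def word_count(lines):
    """Count words the way WordCount_Spark.py does: split each line on a
    single space, emit one count per token, then sum the counts per token."""
    counts = Counter()
    for line in lines:
        for word in line.split(' '):
            counts[word] += 1
    return dict(counts)


# Hypothetical sample input standing in for the contents of minecraftstory.txt.
sample_lines = ["the quick brown fox", "the lazy dog"]
result = word_count(sample_lines)
print(result)
```

Because the split is on a single space, consecutive spaces produce empty-string tokens and punctuation stays attached to words. The Spark script behaves the same way, so expect such entries in its output as well.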

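Sample script

Important

This script creates JSON files on your hard drive in the c:\ folder. The files define the Data Factory entities (linked services and a pipeline) and the pipeline parameters.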

Set-ExecutionPolicy Unrestricted -Scope CurrentUser

# Set variables with your own values
$resourceGroupName = "<Azure resource group name>"
$dataFactoryName = "<Data factory name. Must be globally unique.>"
$dataFactoryRegion = "East US" 
$storageAccountName = "<Azure Storage account name>"
$storageAccountKey = "<Azure Storage account key>"
$subscriptionID = "<Azure subscription ID>"
$tenantID = "<tenant ID>"
$servicePrincipalID = "<Active directory service principal ID>"
$servicePrincipalKey = "<Active directory service principal key>"

$pipelineName = "SparkTransformPipeline"

# Create a resource group
New-AzResourceGroup -Name $resourceGroupName -Location $dataFactoryRegion

# Create a data factory
$df = Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location $dataFactoryRegion -Name $dataFactoryName

# Create an Azure Storage linked service in the data factory

## JSON definition of the linked service. 
$storageLinkedServiceDefinition = @"
{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": {
                "value": "DefaultEndpointsProtocol=https;AccountName=$storageAccountName;AccountKey=$storageAccountKey",
                "type": "SecureString"
            }
        }
    }
}
"@

## IMPORTANT: store the JSON definition in a file that will be used by the Set-AzDataFactoryV2LinkedService command. 
$storageLinkedServiceDefinition | Out-File c:\AzureStorageLinkedService.json

## Create the Azure Storage linked service
Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureStorageLinkedService" -File c:\AzureStorageLinkedService.json

# Create on-demand Spark linked service in the data factory

## JSON definition of the linked service. 
$sparkLinkedServiceDefinition = @"
{
    "name": "OnDemandSparkLinkedService",
    "properties": {
      "type": "HDInsightOnDemand",
      "typeProperties": {
        "clusterSize": 2,
        "clusterType": "spark",
        "timeToLive": "00:15:00",
        "hostSubscriptionId": "$subscriptionID",
        "servicePrincipalId": "$servicePrincipalID",
        "servicePrincipalKey": {
          "value": "$servicePrincipalKey",
          "type": "SecureString"
        },
        "tenant": "$tenantID",
        "clusterResourceGroup": "$resourceGroupName",
        "version": "3.6",
        "osType": "Linux",
        "clusterNamePrefix":"ADFSparkSample",
        "linkedServiceName": {
          "referenceName": "AzureStorageLinkedService",
          "type": "LinkedServiceReference"
        }
      }
    }
}
"@

## IMPORTANT: store the JSON definition in a file that will be used by the Set-AzDataFactoryV2LinkedService command. 
$sparkLinkedServiceDefinition | Out-File c:\OnDemandSparkLinkedService.json

## Create the on-demand Spark linked service
Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "OnDemandSparkLinkedService" -File "c:\OnDemandSparkLinkedService.json"

# Create a pipeline in the data factory

## JSON definition of the pipeline
$pipelineDefinition = @"
{
  "name": "SparkTransformPipeline",
  "properties": {
    "activities": [
      {
        "name": "MySparkActivity",
        "type": "HDInsightSpark",
        "linkedServiceName": {
            "referenceName": "OnDemandSparkLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
          "rootPath": "adftutorial/spark",
          "entryFilePath": "script/WordCount_Spark.py",
          "getDebugInfo": "Failure",
          "sparkJobLinkedService": {
            "referenceName": "AzureStorageLinkedService",
            "type": "LinkedServiceReference"
          }
        }
      }
    ]
  }
}
"@

## IMPORTANT: store the JSON definition in a file that will be used by the Set-AzDataFactoryV2Pipeline command.
$pipelineDefinition | Out-File c:\SparkTransformPipeline.json

## Create a pipeline with Spark Activity in the data factory
Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SparkTransformPipeline" -File "c:\SparkTransformPipeline.json"

# Create a pipeline run 

## JSON definition for dummy pipeline parameters
$pipelineParameters = @"
{
    "dummy":  "b"
}
"@

## IMPORTANT: store the JSON definition in a file that will be used by the Invoke-AzDataFactoryV2Pipeline command. 
$pipelineParameters | Out-File c:\PipelineParameters.json

# Create a pipeline run by using parameters
$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineName $pipelineName -ParameterFile c:\PipelineParameters.json

# Check the pipeline run status until it finishes
Start-Sleep -Seconds 30
while ($True) {
    $result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)

    if (($result | Where-Object { $_.Status -eq "InProgress" } | Measure-Object).count -ne 0) {
        Write-Host "Pipeline run status: In Progress" -foregroundcolor "Yellow"
        Start-Sleep -Seconds 300
    }
    else {
        Write-Host "Pipeline $pipelineName run finished. Result:" -foregroundcolor "Yellow"
        $result
        break
    }
}

# Get the activity run details 
$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName `
    -PipelineRunId $runId `
    -RunStartedAfter (Get-Date).AddMinutes(-30) `
    -RunStartedBefore (Get-Date).AddMinutes(30) `
    -ErrorAction Stop

$result

if ($result.Status -eq "Succeeded") {
    $result.Output -join "`r`n"
}
else {
    $result.Error -join "`r`n"
}

# To remove the data factory from the resource group
# Remove-AzDataFactoryV2 -Name $dataFactoryName -ResourceGroupName $resourceGroupName
# 
# To remove the whole resource group
# Remove-AzResourceGroup  -Name $resourceGroupName
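
The status check in the script is a simple polling loop: fetch the activity runs, wait while any of them is still InProgress, and stop otherwise. The sketch below restates that logic in Python so it can be tested without Azure. The get_activity_runs callable is a hypothetical stand-in for the activity-run query, and max_polls is an added safeguard that the original script does not have.

```python
import time


def wait_for_pipeline(get_activity_runs, poll_seconds=300,
                      sleep=time.sleep, max_polls=None):
    """Poll until no activity run reports the status "InProgress".

    get_activity_runs: hypothetical callable returning a list of dicts
    that each carry a "Status" key.
    """
    polls = 0
    while True:
        runs = get_activity_runs()
        if not any(run["Status"] == "InProgress" for run in runs):
            return runs
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError("activity runs still in progress")
        sleep(poll_seconds)


# Simulated responses: two polls in progress, then a finished run.
_responses = iter([
    [{"Status": "InProgress"}],
    [{"Status": "InProgress"}],
    [{"Status": "Succeeded"}],
])
final_runs = wait_for_pipeline(lambda: next(_responses), sleep=lambda s: None)
print(final_runs)
```

As in the script, only InProgress keeps the loop going. An empty result or any other status ends the wait, so inspect the returned runs before you treat the pipeline as successful.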

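Clean up deployment

After you run the sample script, you can use the following command to remove the resource group and all resources associated with it:

Remove-AzResourceGroup -ResourceGroupName $resourceGroupName

To remove only the data factory from the resource group, run the following command:

Remove-AzDataFactoryV2 -Name $dataFactoryName -ResourceGroupName $resourceGroupName

Script explanation

This script uses the following commands: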

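Command	Notes
New-AzResourceGroup	Creates a resource group in which all resources are stored.
Set-AzDataFactoryV2	Creates a data factory.
Set-AzDataFactoryV2LinkedService	Creates a linked service in the data factory. A linked service links a data store or compute to a data factory.
Set-AzDataFactoryV2Pipeline	Creates a pipeline in the data factory. A pipeline contains one or more activities that perform a specific operation. In this pipeline, a Spark activity transforms data by running a program on an Azure HDInsight Spark cluster.
Invoke-AzDataFactoryV2Pipeline	Creates a run for the pipeline. In other words, runs the pipeline.
Get-AzDataFactoryV2ActivityRun	Gets details about the run of an activity (activity run) in the pipeline.
Remove-AzResourceGroup	Deletes a resource group, including all nested resources.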
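
The linked service definition that the script hands to Set-AzDataFactoryV2LinkedService is ordinary JSON. The script assembles it by string interpolation, which can yield invalid JSON if a key contains a quote or backslash. As a hedged alternative sketch, the same definition can be built as a dictionary and serialized so that escaping is handled for you. The function name storage_linked_service and the sample key are made up for illustration.

```python
import json


def storage_linked_service(account_name, account_key):
    """Build the AzureStorageLinkedService definition used in the sample."""
    connection_string = (
        "DefaultEndpointsProtocol=https;"
        f"AccountName={account_name};AccountKey={account_key}"
    )
    return {
        "name": "AzureStorageLinkedService",
        "properties": {
            "type": "AzureStorage",
            "typeProperties": {
                "connectionString": {
                    "value": connection_string,
                    "type": "SecureString",
                }
            },
        },
    }


# Placeholder account name and a deliberately awkward, made-up key.
definition = storage_linked_service("<storage account name>", 'abc"def\\ghi==')
definition_json = json.dumps(definition, indent=4)
print(definition_json)
```

The serialized text can be written to a file and passed through the -File parameter in the same way as the files the script creates.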

For more information on Azure PowerShell, see the Azure PowerShell documentation.

You can find additional Azure Data Factory PowerShell script samples in Azure Data Factory PowerShell samples.
