Interact with Azure Cosmos DB using Apache Spark 3 in Azure Synapse Link

Article
08/15/2024

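In this article, you learn how to interact with Azure Cosmos DB using Synapse Apache Spark 3. With full support for Scala, Python, SparkSQL, and C#, Synapse Apache Spark 3 is central to analytics, data engineering, data science, and data exploration scenarios in Azure Synapse Link for Azure Cosmos DB.

The following capabilities are supported when interacting with Azure Cosmos DB: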

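- Synapse Apache Spark 3 lets you analyze data in Azure Cosmos DB containers that are enabled with Azure Synapse Link, in near real time and without affecting the performance of your transactional workloads. Two options are available to query the Azure Cosmos DB analytical store from Spark:
  - Load to Spark DataFrame
  - Create Spark table
- Synapse Apache Spark also lets you ingest data into Azure Cosmos DB. Data is always ingested into Azure Cosmos DB containers through the transactional store. When Synapse Link is enabled, any new inserts, updates, and deletes are then automatically synced to the analytical store.
- Synapse Apache Spark also supports Spark structured streaming with Azure Cosmos DB as both a source and a sink.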

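The following sections walk you through the syntax of these capabilities. You can also check out the Learn module on how to query Azure Cosmos DB with Apache Spark for Azure Synapse Analytics.

Gestures in the Azure Synapse Analytics workspace are designed to provide an easy out-of-the-box experience to get started. Gestures are visible when you right-click an Azure Cosmos DB container in the Data tab of the Synapse workspace. With gestures, you can quickly generate code and tailor it to your needs. Gestures are also well suited to exploring data with a single click.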

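Important

Be aware of some constraints in the analytical schema that can lead to unexpected behavior in data loading operations. For example, only the first 1,000 properties from the transactional schema are available in the analytical schema, and properties with spaces in their names aren't available. If you see unexpected results, check the analytical store schema constraints for details.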
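
As a rough illustration of the two constraints named above, the following sketch scans a sample document before you rely on it in the analytical store. The function name `find_schema_risks` and the sample document are hypothetical, and the check is a simplification: it looks only at top-level property names and covers only these two constraints, not the full list in the schema constraints documentation.

```python
MAX_ANALYTICAL_PROPERTIES = 1000  # limit quoted in the note above


def find_schema_risks(document):
    """Return warnings for top-level properties that may not surface in the analytical schema."""
    warnings = []
    names = list(document)
    for name in names:
        # Properties whose names contain spaces aren't available.
        if " " in name:
            warnings.append(f"property name contains a space: {name!r}")
    # Only the first 1,000 properties are available.
    if len(names) > MAX_ANALYTICAL_PROPERTIES:
        warnings.append(
            f"{len(names)} properties; only the first {MAX_ANALYTICAL_PROPERTIES} are available"
        )
    return warnings


# Hypothetical sample document
sample = {"id": "1", "customer name": "Contoso", "total": 42}
print(find_schema_risks(sample))
```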

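Query Azure Cosmos DB analytical store

Before you learn about the two options to query the Azure Cosmos DB analytical store, loading to a Spark DataFrame and creating a Spark table, it's worth exploring how the experiences differ so that you can choose the option that fits your needs.

The difference is whether changes to the underlying data in the Azure Cosmos DB container should be automatically reflected in the analysis performed in Spark. When a Spark DataFrame is registered or a Spark table is created against the analytical store of a container, metadata about the current snapshot of data in the analytical store is fetched to Spark so that subsequent analysis can be pushed down efficiently. Because Spark follows a lazy evaluation policy, no actual data is fetched from the container's analytical store until an action is invoked on the Spark DataFrame or a SparkSQL query is executed against the Spark table.

When you load to a Spark DataFrame, the fetched metadata is cached for the lifetime of the Spark session. Subsequent actions invoked on the DataFrame are therefore evaluated against the snapshot of the analytical store as it was when the DataFrame was created.

When you create a Spark table, on the other hand, the metadata of the analytical store state isn't cached in Spark. It's reloaded on every SparkSQL query execution against the Spark table.

So choose loading to a Spark DataFrame when you want your analysis evaluated against a fixed snapshot of the analytical store, and choose creating a Spark table when you want it evaluated against the latest snapshot.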
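
The difference can be pictured with a toy model in plain Python. This is not Spark or Azure code: the classes `AnalyticalStoreModel`, `DataFrameModel`, and `TableModel` are hypothetical stand-ins that only mimic when snapshot metadata is captured.

```python
class AnalyticalStoreModel:
    """Toy stand-in for an analytical store; NOT a Spark or Azure API."""

    def __init__(self, rows):
        self.rows = list(rows)


class DataFrameModel:
    """Mimics 'load to DataFrame': snapshot metadata is captured once."""

    def __init__(self, store):
        self._store = store
        self._visible = len(store.rows)  # cached when the handle is created

    def count(self):
        # Later rows stay invisible because the snapshot is fixed.
        return len(self._store.rows[: self._visible])


class TableModel:
    """Mimics 'create Spark table': snapshot metadata is reloaded per query."""

    def __init__(self, store):
        self._store = store

    def count(self):
        # Every query looks at the latest state.
        return len(self._store.rows)


store = AnalyticalStoreModel(["order-1", "order-2"])
df_like = DataFrameModel(store)
table_like = TableModel(store)

store.rows.append("order-3")  # a change synced after both handles exist
print(df_like.count(), table_like.count())  # prints "2 3"
```

In this model the DataFrame-style handle keeps reporting two rows, while the table-style handle picks up the third row on its next query.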

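Note

To query Azure Cosmos DB for MongoDB accounts, learn more about the full fidelity schema representation in the analytical store and the extended property names to use.

Note

All options in the following commands are case sensitive.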
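
Because a wrongly cased option name is easy to miss, a small check like the following can flag it before the options reach Spark. This is a minimal sketch: the helper `find_miscased_options` is hypothetical, and `KNOWN_OPTIONS` lists only the option names that appear in this article, not every option the connector accepts.

```python
# Option names exactly as they are spelled in this article.
KNOWN_OPTIONS = {
    "spark.synapse.linkedService",
    "spark.cosmos.container",
    "spark.cosmos.preferredRegions",
    "spark.cosmos.autoSchemaMerge",
    "spark.cosmos.changeFeed.startFrom",
    "spark.cosmos.changeFeed.mode",
    "checkpointLocation",
}


def find_miscased_options(options):
    """Map each wrongly cased option name to the expected spelling."""
    by_lower = {name.lower(): name for name in KNOWN_OPTIONS}
    fixes = {}
    for key in options:
        expected = by_lower.get(key.lower())
        if expected is not None and expected != key:
            fixes[key] = expected
    return fixes


candidate = {
    "spark.synapse.linkedservice": "<enter linked service name>",  # wrong case
    "spark.cosmos.container": "<enter container name>",
}
print(find_miscased_options(candidate))
```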

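Load to Spark DataFrame

In this example, you create a Spark DataFrame that points to the Azure Cosmos DB analytical store. You can then perform additional analysis by invoking Spark actions against the DataFrame. This operation doesn't affect the transactional store.

The syntax in Python is as follows: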

# To select a preferred list of regions in a multi-region Azure Cosmos DB account, add .option("spark.cosmos.preferredRegions", "<Region1>,<Region2>")

df = spark.read.format("cosmos.olap")\
    .option("spark.synapse.linkedService", "<enter linked service name>")\
    .option("spark.cosmos.container", "<enter container name>")\
    .load()

The equivalent syntax in Scala is as follows:

// To select a preferred list of regions in a multi-region Azure Cosmos DB account, add option("spark.cosmos.preferredRegions", "<Region1>,<Region2>")

val df_olap = spark.read.format("cosmos.olap").
    option("spark.synapse.linkedService", "<enter linked service name>").
    option("spark.cosmos.container", "<enter container name>").
    load()
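
If you load from several containers, it can help to build the options once in plain Python and pass them with `options(**...)`. The helper `olap_read_options` below is a hypothetical convenience, not part of the connector. The option names come from the samples above, and the region names are placeholder values.

```python
def olap_read_options(linked_service, container, preferred_regions=None):
    """Build the option dictionary for a cosmos.olap read."""
    options = {
        "spark.synapse.linkedService": linked_service,
        "spark.cosmos.container": container,
    }
    if preferred_regions:
        # The connector expects one comma-separated string of regions.
        options["spark.cosmos.preferredRegions"] = ",".join(preferred_regions)
    return options


opts = olap_read_options(
    "<enter linked service name>",
    "<enter container name>",
    preferred_regions=["<Region1>", "<Region2>"],
)
print(opts)

# In a Synapse notebook, where `spark` is predefined, the dictionary could be used as:
# df = spark.read.format("cosmos.olap").options(**opts).load()
```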

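Create Spark table

In this example, you create a Spark table that points to the Azure Cosmos DB analytical store. You can then perform additional analysis by invoking SparkSQL queries against the table. This operation doesn't affect the transactional store and doesn't incur any data movement. If you decide to delete this Spark table, the underlying Azure Cosmos DB container and the corresponding analytical store aren't affected.

This scenario is convenient for reusing Spark tables through third-party tools and for making the underlying data accessible at run time.

The syntax to create a Spark table is as follows: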

%%sql
-- To select a preferred list of regions in a multi-region Azure Cosmos DB account, add spark.cosmos.preferredRegions '<Region1>,<Region2>' in the config options

create table call_center using cosmos.olap options (
    spark.synapse.linkedService '<enter linked service name>',
    spark.cosmos.container '<enter container name>'
)

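Note

If the schema of the underlying Azure Cosmos DB container changes over time, and you want the updated schema to be reflected automatically in queries against the Spark table, set the spark.cosmos.autoSchemaMerge option to true in the Spark table options.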
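
As one way to see where that option goes, the sketch below assembles the create table statement as a string. The helper `create_table_sql` is hypothetical, and writing the value as the quoted string `'true'`, like the other options, is an assumption. The helper does no escaping, so it isn't suitable for untrusted input.

```python
def create_table_sql(table, linked_service, container, auto_schema_merge=False):
    """Assemble a SparkSQL statement that creates a table over cosmos.olap."""
    options = [
        f"spark.synapse.linkedService '{linked_service}'",
        f"spark.cosmos.container '{container}'",
    ]
    if auto_schema_merge:
        # Assumption: the value is passed as a quoted string like the other options.
        options.append("spark.cosmos.autoSchemaMerge 'true'")
    body = ",\n    ".join(options)
    return f"create table {table} using cosmos.olap options (\n    {body}\n)"


statement = create_table_sql(
    "call_center",
    "<enter linked service name>",
    "<enter container name>",
    auto_schema_merge=True,
)
print(statement)

# In a Synapse notebook, where `spark` is predefined:
# spark.sql(statement)
```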

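Write Spark DataFrame to Azure Cosmos DB container

In this example, you write a Spark DataFrame into an Azure Cosmos DB container. This operation affects the performance of transactional workloads and consumes request units provisioned on the Azure Cosmos DB container or the shared database.

The syntax in Python is as follows: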

# Write a Spark DataFrame into an Azure Cosmos DB container
# To select a preferred list of regions in a multi-region Azure Cosmos DB account, add .option("spark.cosmos.preferredRegions", "<Region1>,<Region2>")

YOURDATAFRAME.write.format("cosmos.oltp")\
    .option("spark.synapse.linkedService", "<enter linked service name>")\
    .option("spark.cosmos.container", "<enter container name>")\
    .mode('append')\
    .save()

The equivalent syntax in Scala is as follows:

// To select a preferred list of regions in a multi-region Azure Cosmos DB account, add option("spark.cosmos.preferredRegions", "<Region1>,<Region2>")

import org.apache.spark.sql.SaveMode

df.write.format("cosmos.oltp").
    option("spark.synapse.linkedService", "<enter linked service name>").
    option("spark.cosmos.container", "<enter container name>").
    mode(SaveMode.Append).
    save()

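Load streaming DataFrame from container

In this gesture, you use the Spark Streaming capability to load data from a container into a DataFrame. The data is stored in the primary data lake account (and file system) that you connected to the workspace.

Note

To reference external libraries in Synapse Apache Spark, consult the corresponding Synapse documentation. For instance, to ingest a Spark DataFrame into a container of Azure Cosmos DB for MongoDB, you can use the MongoDB connector for Spark.

Load streaming DataFrame from Azure Cosmos DB container

In this example, you use Spark's structured streaming capability to load data from an Azure Cosmos DB container into a Spark streaming DataFrame, using the change feed functionality in Azure Cosmos DB. The checkpoint data used by Spark is stored in the primary data lake account (and file system) that you connected to the workspace.

The syntax in Python is as follows: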

# To select a preferred list of regions in a multi-region Azure Cosmos DB account, add .option("spark.cosmos.preferredRegions", "<Region1>,<Region2>")

dfStream = spark.readStream\
    .format("cosmos.oltp.changeFeed")\
    .option("spark.synapse.linkedService", "<enter linked service name>")\
    .option("spark.cosmos.container", "<enter container name>")\
    .option("spark.cosmos.changeFeed.startFrom", "Beginning")\
    .option("spark.cosmos.changeFeed.mode", "Incremental")\
    .load()

The equivalent syntax in Scala is as follows:

// To select a preferred list of regions in a multi-region Azure Cosmos DB account, add .option("spark.cosmos.preferredRegions", "<Region1>,<Region2>")

val dfStream = spark.readStream.
    format("cosmos.oltp.changeFeed").
    option("spark.synapse.linkedService", "<enter linked service name>").
    option("spark.cosmos.container", "<enter container name>").
    option("spark.cosmos.changeFeed.startFrom", "Beginning").
    option("spark.cosmos.changeFeed.mode", "Incremental").
    load()
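
The change feed options in the samples above can also be collected in a dictionary and passed with `options(**...)`. The helper `change_feed_options` is a hypothetical convenience. Its defaults, `"Beginning"` and `"Incremental"`, are simply the values used in this article, and it doesn't validate other values.

```python
def change_feed_options(linked_service, container,
                        start_from="Beginning", mode="Incremental"):
    """Build the option dictionary for a cosmos.oltp.changeFeed streaming read."""
    return {
        "spark.synapse.linkedService": linked_service,
        "spark.cosmos.container": container,
        "spark.cosmos.changeFeed.startFrom": start_from,
        "spark.cosmos.changeFeed.mode": mode,
    }


stream_opts = change_feed_options(
    "<enter linked service name>", "<enter container name>"
)
print(stream_opts)

# In a Synapse notebook, where `spark` is predefined:
# dfStream = (
#     spark.readStream.format("cosmos.oltp.changeFeed")
#     .options(**stream_opts)
#     .load()
# )
```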

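Write streaming DataFrame to Azure Cosmos DB container

In this example, you write a streaming DataFrame into an Azure Cosmos DB container. This operation affects the performance of transactional workloads and consumes request units provisioned on the Azure Cosmos DB container or the shared database. If the checkpoint folder (/tmp/myRunId/ in the following examples) doesn't exist, it's created automatically.

The syntax in Python is as follows: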

# To select a preferred list of regions in a multi-region Azure Cosmos DB account, add .option("spark.cosmos.preferredRegions", "<Region1>,<Region2>")

streamQuery = dfStream\
    .writeStream\
    .format("cosmos.oltp")\
    .option("spark.synapse.linkedService", "<enter linked service name>")\
    .option("spark.cosmos.container", "<enter container name>")\
    .option("checkpointLocation", "/tmp/myRunId/")\
    .outputMode("append")\
    .start()

streamQuery.awaitTermination()

The equivalent syntax in Scala is as follows:

// To select a preferred list of regions in a multi-region Azure Cosmos DB account, add .option("spark.cosmos.preferredRegions", "<Region1>,<Region2>")

val query = dfStream.
            writeStream.
            format("cosmos.oltp").
            outputMode("append").
            option("spark.synapse.linkedService", "<enter linked service name>").
            option("spark.cosmos.container", "<enter container name>").
            option("checkpointLocation", "/tmp/myRunId/").
            start()

query.awaitTermination()
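
Spark Structured Streaming tracks each query's progress in its checkpoint location, so separate queries shouldn't share one folder. The sketch below derives a folder per run in the `/tmp/<run id>/` shape used above. The helper `checkpoint_location` and the choice of a random identifier are assumptions for illustration.

```python
import uuid


def checkpoint_location(run_id=None, root="/tmp"):
    """Build a checkpoint folder path such as /tmp/<run_id>/."""
    # Fall back to a random identifier when no run id is supplied.
    run_id = run_id or uuid.uuid4().hex
    return f"{root.rstrip('/')}/{run_id}/"


print(checkpoint_location("myRunId"))  # prints "/tmp/myRunId/"

# The result could then be passed to the writer, for example:
# .option("checkpointLocation", checkpoint_location("myRunId"))
```

Reuse the same run identifier when you want a restarted query to resume from its earlier checkpoint, and use a new one to start fresh.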

Next steps

- Samples to get started with Azure Synapse Link on GitHub
- Learn what is supported in Azure Synapse Link for Azure Cosmos DB
- Connect to Synapse Link for Azure Cosmos DB
- Check out the Learn module on how to query Azure Cosmos DB with Apache Spark for Azure Synapse Analytics.
