SQL Server Big Data Clusters Spark 3 upgrade guide

Article
05/04/2023

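Applies to: SQL Server 2019 (15.x)

Important

The Microsoft SQL Server 2019 Big Data Clusters add-on will be retired. Support for SQL Server 2019 Big Data Clusters ends on February 28, 2025. Until then, all existing users of SQL Server 2019 with Software Assurance are fully supported on the platform, and the software continues to be maintained through SQL Server cumulative updates. For more information, see the announcement blog post and Big data options on the Microsoft SQL Server platform.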

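This article contains important information and guidance for migrating Apache Spark 2.4 workloads to Spark 3.1. Completing this migration is required to upgrade SQL Server Big Data Clusters from CU12 to CU13 or later.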

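Introduction of Apache Spark 3 on SQL Server Big Data Clusters

Up to cumulative update 12 (CU12), Big Data Clusters relied on the Apache Spark 2.4 line, which reached end of life in May 2021. In keeping with our commitment to continuously improve the big data and machine learning capabilities provided by the Apache Spark engine, CU13 brings in the current release of Apache Spark, version 3.1.2.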

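A new performance baseline

This new version of Apache Spark brings performance benefits for big data processing workloads. In our tests with the reference TPC-DS 10 TB workload, total runtime dropped from 4.19 hours to 2.96 hours. That 29.36% improvement came solely from switching engines, using the same hardware and configuration profile on SQL Server Big Data Clusters, with no additional application optimizations. The mean improvement in individual query runtime is 36%.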
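The headline percentage follows directly from the two runtimes quoted above. A minimal sketch of the arithmetic in Python; the variable names are illustrative and not part of the benchmark itself:

```python
# Total runtimes quoted above for the 10 TB reference workload, in hours.
spark2_hours = 4.19
spark3_hours = 2.96

# Relative reduction in total runtime, expressed as a percentage.
improvement_pct = (spark2_hours - spark3_hours) / spark2_hours * 100

print(f"Runtime reduction: {improvement_pct:.2f}%")  # prints "Runtime reduction: 29.36%"
```

The 36% figure is an average over individual queries, so it does not have to match the reduction in total runtime.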

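Upgrade guidance

Spark 3 is a major release and contains breaking changes. Following the same established best practices used across the SQL Server universe, we recommend that you:

Review this article entirely.
Review the official Apache Spark 3 Migration Guide.
Perform a side-by-side deployment of the new Big Data Clusters version CU13 alongside your current environment.
(Optional) Use the new azdata HDFS distributed copy capability to obtain the subset of your data needed for validation.
Validate your current workloads with Spark 3 before upgrading.
Reassess the Spark optimizations enforced in your code and table definition strategies. Spark 3 brings new shuffling, partitioning, and Adaptive Query Execution enhancements. This is a good opportunity to reevaluate earlier decisions and try the out-of-the-box features of the newer engine.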

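What happens during a cluster upgrade?

The cluster upgrade process deploys Spark pods with the new version and the refreshed runtime for Apache Spark. After the upgrade, no Spark 2.4 components remain.

Persistent configuration changes made through the configuration framework are preserved.

User libraries and artifacts loaded directly into HDFS are preserved. However, make sure that those libraries and artifacts are compatible with Spark 3.

Warning

Customizations made directly to the pods are lost. Validate those customizations and reapply them if they still apply to Spark 3.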

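Breaking changes

Spark 3 is not fully backward compatible with Spark 2.4. The breaking changes come mainly from three areas:

Scala 2.12, used by Spark 3, is incompatible with Scala 2.11, used by Spark 2.4
Spark 3 API changes and deprecations
SQL Server Big Data Clusters runtime for Apache Spark library updates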

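Scala 2.12 used by Spark 3 is incompatible with Scala 2.11

If you run Spark jobs based on Scala 2.11 jars, you must rebuild them with Scala 2.12. Scala 2.11 and 2.12 are mostly source compatible, but not binary compatible. For more information, see Scala 2.12.0.

The following changes are required:

Change the Scala version for all Scala dependencies.
Change the Spark version for all Spark dependencies.
Give all Spark dependencies the provided scope, except external dependencies such as spark-sql-kafka-0-10.

Here is a sample pom.xml: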

  <properties>
    <spark.version>3.1.2</spark.version>
    <scala.version.major>2.12</scala.version.major>
    <scala.version.minor>10</scala.version.minor>
    <scala.version>${scala.version.major}.${scala.version.minor}</scala.version>
  </properties>
 
  <dependencies>
 
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
      <scope>provided</scope>
    </dependency>
 
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.version.major}</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_${scala.version.major}</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>
 
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql-kafka-0-10_${scala.version.major}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    
  </dependencies>
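The three changes listed above can be checked mechanically before a rebuild. The following is a minimal sketch using only the Python standard library, not an official tool. The `check_pom` helper, the `SAMPLE_POM` text, and the `BUNDLED_PREFIXES` list are hypothetical names introduced for illustration. It does not resolve Maven property placeholders such as `${scala.version.major}`, so it only catches hard-coded suffixes.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: a Spark dependency still built for Scala 2.11 and
# lacking the provided scope, plus a correctly bundled Kafka connector.
SAMPLE_POM = """
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.4.5</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
      <version>3.1.2</version>
    </dependency>
  </dependencies>
</project>
"""

# External connectors that ship with the job instead of being provided.
BUNDLED_PREFIXES = ("spark-sql-kafka-0-10",)


def check_pom(pom_text):
    """Return a list of findings for the dependencies in a pom.xml string."""
    root = ET.fromstring(pom_text)
    # Drop XML namespaces so plain tag names can be used below.
    for element in root.iter():
        if "}" in element.tag:
            element.tag = element.tag.split("}", 1)[1]

    findings = []
    for dep in root.iter("dependency"):
        group = dep.findtext("groupId", default="")
        artifact = dep.findtext("artifactId", default="")
        scope = dep.findtext("scope", default="compile")
        if artifact.endswith("_2.11"):
            findings.append(f"{artifact}: still targets Scala 2.11")
        is_spark = group == "org.apache.spark"
        if is_spark and not artifact.startswith(BUNDLED_PREFIXES) and scope != "provided":
            findings.append(f"{artifact}: expected provided scope, found {scope}")
    return findings


findings = check_pom(SAMPLE_POM)
for finding in findings:
    print(finding)
```

For the sample input, both findings concern `spark-core_2.11`; the Kafka connector passes because it is expected to be bundled with the job.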

Spark 3 API changes and deprecations

Review the official Apache Spark 3 Migration Guide, which covers all API changes in detail.

Some highlights are:

Breaking change	Action
The `spark-submit` master values `yarn-client` and `yarn-cluster` have been removed in Spark 3	Use `spark-submit --master yarn --deploy-mode client` or `--deploy-mode cluster` instead. For details, see https://spark.apache.org/docs/latest/running-on-yarn.html
The `HiveContext` class has been removed	Use `SparkSession.builder.enableHiveSupport()` instead
The argument order of the TRIM method is reversed	Use `TRIM(str, trimStr)` instead of `TRIM(trimStr, str)`
Because of the upgrade to Scala 2.12, `DataStreamWriter.foreachBatch` is not source compatible for Scala programs	Update your Scala source code to distinguish between a Scala function and a Java lambda.
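Job launchers that build `spark-submit` command lines often carry the removed master values. As a sketch of the first row of the table, the hypothetical helper below rewrites an argument list into the Spark 3 form. The class name `com.example.App` and the jar name `app.jar` are placeholders, and the helper assumes the space-separated `--master <value>` form only.

```python
# Master values removed in Spark 3, mapped to the equivalent deploy mode.
LEGACY_MASTERS = {"yarn-client": "client", "yarn-cluster": "cluster"}


def modernize_submit_args(args):
    """Rewrite '--master yarn-client' or '--master yarn-cluster' for Spark 3."""
    result = []
    i = 0
    while i < len(args):
        has_value = i + 1 < len(args)
        if args[i] == "--master" and has_value and args[i + 1] in LEGACY_MASTERS:
            mode = LEGACY_MASTERS[args[i + 1]]
            result += ["--master", "yarn", "--deploy-mode", mode]
            i += 2  # skip the flag and its legacy value
        else:
            result.append(args[i])
            i += 1
    return result


old = ["spark-submit", "--master", "yarn-cluster",
       "--class", "com.example.App", "app.jar"]
new = modernize_submit_args(old)
print(" ".join(new))
```

Arguments that already use `--master yarn` pass through unchanged, so the helper is safe to apply to every job definition.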

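SQL Server Big Data Clusters runtime for Apache Spark library updates

As covered by the SQL Server Big Data Clusters runtime for Apache Spark specification, all default Python, R, and Scala libraries were updated in the CU13 release. In addition, many libraries were added to provide a better out-of-the-box experience.

Make sure that your workloads work with the newer library set.
Check whether your custom-loaded libraries are now part of the default package baseline, and adjust your job specifications to remove the custom libraries so that the jobs use the shipped ones.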

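Frequently asked questions

How to solve an unexpected java.lang.NoSuchMethodError or java.lang.ClassNotFoundException

This error is most likely caused by a Spark or Scala version conflict. Double-check the following items and rebuild your project.

Make sure that all Scala versions are updated.
Make sure that all Spark dependencies are updated with the correct Scala version and Spark version.
Make sure that all Spark dependencies have the provided scope, except spark-sql-kafka-0-10.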
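A quick way to spot a Scala version conflict is to look at the jars a job ships. The sketch below assumes the jars follow the usual `artifact_<scalaBinaryVersion>-<version>.jar` naming convention. The `scala_versions` helper and the jar names are hypothetical.

```python
import re

# Matches the Scala binary-version suffix in names such as "foo_2.11-1.0.jar".
SCALA_SUFFIX = re.compile(r"_(2\.\d+)(?=[-.])")


def scala_versions(jar_names):
    """Group jar file names by the Scala binary version in their name."""
    by_version = {}
    for name in jar_names:
        match = SCALA_SUFFIX.search(name)
        if match:
            by_version.setdefault(match.group(1), []).append(name)
    return by_version


# Hypothetical jar list; replace it with the jars your job actually ships.
jars = [
    "spark-sql-kafka-0-10_2.12-3.1.2.jar",
    "my-udfs_2.11-1.4.0.jar",
    "commons-lang3-3.10.jar",
]
found = scala_versions(jars)
mixed = len(found) > 1  # more than one Scala version suggests a conflict

for version, names in sorted(found.items()):
    print(version, names)
```

Jars without a Scala suffix, such as plain Java libraries, are ignored. Any jar reported under 2.11 needs to be rebuilt or replaced with its 2.12 counterpart.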

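SparkUpgradeException due to calendar model change

The calendar model changed in Spark 3.0. When you write date or timestamp columns in Spark SQL, you might see an exception like the following: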

Caused by: org.apache.spark.SparkUpgradeException: 
You may get a different result due to the upgrading of Spark 3.0:
writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet INT96 files can be dangerous,
as the files may be read by Spark 2.x or legacy versions of Hive later, 
which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. 
See more details in SPARK-31404.
You can set spark.sql.legacy.parquet.int96RebaseModeInWrite to 'LEGACY' to 
rebase the datetime values w.r.t. the calendar difference during writing, to get maximum interoperability. 
Or set spark.sql.legacy.parquet.int96RebaseModeInWrite to 'CORRECTED' to 
write the datetime values as it is, if you are 100% sure that the written files 
will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar.

Solution: Set the configuration spark.sql.legacy.parquet.int96RebaseModeInWrite to LEGACY or CORRECTED, as explained in the message above. Here is a possible solution in PySpark code:

spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite","CORRECTED")
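Before choosing between LEGACY and CORRECTED, it can help to know whether your data contains values in the range the message warns about. The sketch below is plain Python, not a Spark API. It uses the two thresholds quoted in the exception text; the `is_affected` helper is hypothetical, and it assumes that timestamps without a time zone are in UTC.

```python
from datetime import date, datetime, timezone

# Thresholds quoted in the exception message above.
DATE_CUTOVER = date(1582, 10, 15)
TIMESTAMP_CUTOVER = datetime(1900, 1, 1, tzinfo=timezone.utc)


def is_affected(value):
    """Return True if the value falls in the range the message warns about."""
    # datetime is a subclass of date, so it has to be tested first.
    if isinstance(value, datetime):
        if value.tzinfo is None:
            # Assumption: naive timestamps are interpreted as UTC.
            value = value.replace(tzinfo=timezone.utc)
        return value < TIMESTAMP_CUTOVER
    return value < DATE_CUTOVER


samples = [date(1500, 1, 1), date(2020, 6, 1), datetime(1899, 12, 31, 23, 59)]
flags = [is_affected(v) for v in samples]
print(flags)
```

If no values are affected, both settings produce the same files, so the choice only matters for historical dates and timestamps.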

Next steps

For more information, see Introducing SQL Server Big Data Clusters.
