Run tests with pytest using the Databricks extension for Visual Studio Code

Article
10/16/2024

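This article describes how to run tests with pytest using the Databricks extension for Visual Studio Code. See "What is the Databricks extension for Visual Studio Code?".

You can run pytest on local code that does not need a connection to a cluster in a remote Azure Databricks workspace. For example, you might use pytest to test functions that accept and return PySpark DataFrames in local memory. To get started with pytest and run it locally, see Get Started in the pytest documentation.

To run pytest on code in a remote Azure Databricks workspace, do the following in your Visual Studio Code project:

Step 1: Create the tests

Add a Python file with the following code, which contains the tests to run. This example assumes that the file is named spark_test.py and is at the root of your Visual Studio Code project. The file contains a pytest fixture, which makes the cluster's SparkSession (the entry point to Spark functionality on the cluster) available to the tests. The file contains a single test that checks whether the specified cell in a table contains the specified value. You can add your own tests to this file as needed.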

from pyspark.sql import SparkSession
import pytest

@pytest.fixture
def spark() -> SparkSession:
  # Create a SparkSession (the entry point to Spark functionality) on
  # the cluster in the remote Databricks workspace. Unit tests do not
  # have access to this SparkSession by default.
  return SparkSession.builder.getOrCreate()

# Now add your unit tests.

# For example, here is a unit test that must be run on the
# cluster in the remote Databricks workspace.
# This example determines whether the specified cell in the
# specified table contains the specified value. For example,
# the third column in the first row should contain the word "Ideal":
#
# +----+-------+-------+-------+---------+-------+-------+-------+------+-------+------+
# |_c0 | carat | cut   | color | clarity | depth | table | price | x    | y     | z    |
# +----+-------+-------+-------+---------+-------+-------+-------+------+-------+------+
# | 1  | 0.23  | Ideal | E     | SI2     | 61.5  | 55    | 326   | 3.95 | 3.98  | 2.43 |
# +----+-------+-------+-------+---------+-------+-------+-------+------+-------+------+
# ...
#
def test_spark(spark):
  spark.sql('USE default')
  data = spark.sql('SELECT * FROM diamonds')
  assert data.collect()[0][2] == 'Ideal'
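
One way to keep checks like this easy to extend is to separate the cell lookup from the assertion. The following is only a sketch: `cell_value` is a hypothetical helper, not part of pytest or PySpark, and it assumes that the rows can be indexed by position like tuples, as the rows returned by `data.collect()` in the test above are. Because the helper does not touch Spark, it can also be exercised locally with plain tuples.

```python
from typing import Any, Sequence


def cell_value(rows: Sequence[Sequence[Any]], row_index: int, column_index: int) -> Any:
    """Return the value at the given zero-based row and column position."""
    if not rows:
        raise ValueError("no rows were returned")
    return rows[row_index][column_index]


def test_cell_value_with_plain_tuples():
    # Stand-in row that mirrors the first row of the sample table above.
    rows = [(1, 0.23, "Ideal", "E", "SI2", 61.5, 55, 326, 3.95, 3.98, 2.43)]
    assert cell_value(rows, 0, 2) == "Ideal"
```

With such a helper in place, the remote test could assert `cell_value(data.collect(), 0, 2) == 'Ideal'` instead of indexing the collected rows inline.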

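Step 2: Create the pytest runner

Add a Python file with the following code, which instructs pytest to run your tests from the previous step. This example assumes that the file is named pytest_databricks.py and is at the root of your Visual Studio Code project.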

import pytest
import os
import sys

# Run all tests in the connected directory in the remote Databricks workspace.
# By default, pytest searches through all files with filenames ending with
# "_test.py" for tests. Within each of these files, pytest runs each function
# with a function name beginning with "test_".

# Get the path to the directory for this file in the workspace.
dir_root = os.path.dirname(os.path.realpath(__file__))
# Switch to the root directory.
os.chdir(dir_root)

# Skip writing .pyc files to the bytecode cache on the cluster.
sys.dont_write_bytecode = True

# Now run pytest from the root directory, using the
# arguments that are supplied by your custom run configuration in
# your Visual Studio Code project. In this case, the custom run
# configuration JSON must contain these unique "program" and
# "args" objects:
#
# ...
# {
#   ...
#   "program": "${workspaceFolder}/path/to/this/file/in/workspace",
#   "args": ["/path/to/_test.py-files"]
# }
# ...
#
retcode = pytest.main(sys.argv[1:])
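
The runner stores the value returned by `pytest.main` in `retcode` but does not use it further. As a minimal sketch of what could be done with that value, the hypothetical helper `describe_exit_code` below maps pytest's documented exit codes (0 through 5) to short messages. The helper name and the message wording are assumptions for illustration and are not part of the runner above.

```python
# Exit codes documented by pytest (see pytest.ExitCode).
PYTEST_EXIT_CODES = {
    0: "all tests were collected and passed",
    1: "tests were collected and run, but some failed",
    2: "test execution was interrupted by the user",
    3: "an internal error occurred while running tests",
    4: "pytest command-line usage error",
    5: "no tests were collected",
}


def describe_exit_code(retcode: int) -> str:
    """Return a short description for a pytest exit code."""
    code = int(retcode)  # pytest.ExitCode is an IntEnum, so int() is safe.
    return PYTEST_EXIT_CODES.get(code, f"unknown exit code {code}")
```

For example, adding `print(describe_exit_code(retcode))` as the last line of pytest_databricks.py would make a run that collected no tests easier to tell apart from a run in which tests failed.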

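Step 3: Create a custom run configuration

To instruct pytest to run your tests, you must create a custom run configuration. Use the existing Databricks cluster-based run configuration to create your own custom run configuration, as follows:

On the main menu, click Run > Add configuration.
In the Command Palette, select Databricks.

Visual Studio Code adds a .vscode/launch.json file to your project, if this file does not already exist.
Change the starter run configuration as follows, and then save the file:
- Change this run configuration's name from Run on Databricks to some unique display name for this configuration, in this example Unit Tests (on Databricks).
- Change program from ${file} to the path in the project that contains the test runner, in this example ${workspaceFolder}/pytest_databricks.py.
- Change args from [] to the path in the project that contains the files with your tests, in this example ["."].
Your launch.json file should look like this: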
```
{
  // Use IntelliSense to learn about possible attributes.
  // Hover to view descriptions of existing attributes.
  // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
  "version": "0.2.0",
  "configurations": [
    {
      "type": "databricks",
      "request": "launch",
      "name": "Unit Tests (on Databricks)",
      "program": "${workspaceFolder}/pytest_databricks.py",
      "args": ["."],
      "env": {}
    }
  ]
}
```
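
If you maintain several projects, the same configuration can be produced programmatically. The sketch below builds the structure shown above with Python's standard `json` module. `build_launch_config` is a hypothetical helper, not something the Databricks extension provides. The sketch only generates JSON; it does not read an existing launch.json, because that file may contain `//` comments that the `json` module cannot parse.

```python
import json
from typing import Any, Dict, List


def build_launch_config(
    runner_path: str,
    test_paths: List[str],
    name: str = "Unit Tests (on Databricks)",
) -> Dict[str, Any]:
    """Build a launch.json structure with a single Databricks configuration."""
    return {
        "version": "0.2.0",
        "configurations": [
            {
                "type": "databricks",
                "request": "launch",
                "name": name,
                "program": runner_path,     # the pytest runner from step 2
                "args": list(test_paths),   # passed to pytest.main via sys.argv
                "env": {},
            }
        ],
    }


config = build_launch_config("${workspaceFolder}/pytest_databricks.py", ["."])
print(json.dumps(config, indent=2))
```

The printed JSON matches the file above, minus the comments.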

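Step 4: Run the tests

Make sure that pytest is already installed on the cluster first. For example, with the cluster's settings page open in your Azure Databricks workspace, do the following:

On the Libraries tab, if pytest is visible, then pytest is already installed. If pytest is not visible, click Install new.
For Library Source, click PyPI.
For Package, enter pytest.
Click Install.
Wait until Status changes from Pending to Installed.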
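
As an alternative check, a short Python snippet can report whether a package is importable in the environment where it runs. This is a sketch that assumes you can run Python on the cluster, for example in a notebook attached to it. `is_installed` is a hypothetical helper built on the standard library's `importlib.util.find_spec`, and it is intended for top-level package names only.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package_name) is not None


print("pytest installed:", is_installed("pytest"))
```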

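To run the tests, do the following from your Visual Studio Code project:

On the main menu, click View > Run.
In the Run and Debug list, click Unit Tests (on Databricks), if it is not already selected.
Click the green arrow (Start Debugging) icon.

The pytest results display in the Debug Console (View > Debug Console on the main menu). For example, these results show that at least one test was found in the spark_test.py file, and a dot (.) means that a single test was found and passed. (A failing test would show an F.)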

<date>, <time> - Creating execution context on cluster <cluster-id> ...
<date>, <time> - Synchronizing code to /Workspace/path/to/directory ...
<date>, <time> - Running /pytest_databricks.py ...
============================= test session starts ==============================
platform linux -- Python <version>, pytest-<version>, pluggy-<version>
rootdir: /Workspace/path/to/directory
collected 1 item

spark_test.py .                                                          [100%]

============================== 1 passed in 3.25s ===============================
<date>, <time> - Done (took 10818ms)
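
If you want to post-process this output, for example in a script that archives test runs, the final summary line can be parsed with a regular expression. The sketch below is an assumption about how one might do that. `parse_summary` is a hypothetical helper, and the pattern is written against the summary format shown above, so it may need adjusting for other pytest versions or options.

```python
import re
from typing import Dict

# Matches lines such as "===== 1 passed in 3.25s =====".
SUMMARY_LINE = re.compile(r"^=+ (?P<body>.+?) in [\d.]+s\b.*=+$")
# Matches each "<count> <outcome>" pair, such as "1 passed" or "2 failed".
OUTCOME_COUNT = re.compile(r"(\d+) (\w+)")


def parse_summary(line: str) -> Dict[str, int]:
    """Return a mapping of outcome name to count from a pytest summary line."""
    match = SUMMARY_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a pytest summary line")
    return {
        outcome: int(count)
        for count, outcome in OUTCOME_COUNT.findall(match.group("body"))
    }
```

Applied to the summary line in the output above, `parse_summary` returns `{'passed': 1}`.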
