Run tests with pytest using the Databricks extension for Visual Studio Code

03/01/2024

This article describes how to run tests with pytest using the Databricks extension for Visual Studio Code. See "What is the Databricks extension for Visual Studio Code?".

This information assumes that you have already installed and set up the Databricks extension for Visual Studio Code. See "Install the Databricks extension for Visual Studio Code".

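You can run pytest on local code that does not need a connection to a cluster in a remote Azure Databricks workspace. For example, you can use pytest to test functions that accept and return PySpark DataFrames in local memory. To get started with pytest and run it locally, see "Get Started" in the pytest documentation.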

To run pytest on code in a remote Azure Databricks workspace, do the following in your Visual Studio Code project.

Step 1: Create the tests

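Add a Python file with the following code, which contains the tests to run. This example assumes that the file is named spark_test.py and is at the root of your Visual Studio Code project. The file contains a pytest fixture, which makes the cluster's SparkSession (the entry point to Spark functionality on the cluster) available to the tests. The file also contains a single test that checks whether the specified cell in a table contains the specified value. You can add your own tests to this file as needed.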

from pyspark.sql import SparkSession
import pytest

@pytest.fixture
def spark() -> SparkSession:
  # Create a SparkSession (the entry point to Spark functionality) on
  # the cluster in the remote Databricks workspace. Unit tests do not
  # have access to this SparkSession by default.
  return SparkSession.builder.getOrCreate()

# Now add your unit tests.

# For example, here is a unit test that must be run on the
# cluster in the remote Databricks workspace.
# This example determines whether the specified cell in the
# specified table contains the specified value. For example,
# the third column in the first row should contain the word "Ideal":
#
# +----+-------+-------+-------+---------+-------+-------+-------+------+-------+------+
# |_c0 | carat | cut   | color | clarity | depth | table | price | x    | y     | z    |
# +----+-------+-------+-------+---------+-------+-------+-------+------+-------+------+
# | 1  | 0.23  | Ideal | E     | SI2     | 61.5  | 55    | 326   | 3.95 | 3.98  | 2.43 |
# +----+-------+-------+-------+---------+-------+-------+-------+------+-------+------+
# ...
#
def test_spark(spark):
  spark.sql('USE default')
  data = spark.sql('SELECT * FROM diamonds')
  assert data.collect()[0][2] == 'Ideal'

Step 2: Create the pytest runner

Add a Python file with the following code, which instructs pytest to run the tests from the previous step. This example assumes that the file is named pytest_databricks.py and is at the root of your Visual Studio Code project.

import pytest
import os
import sys

# Run all tests in the connected directory in the remote Databricks workspace.
# By default, pytest searches through all files with filenames ending with
# "_test.py" for tests. Within each of these files, pytest runs each function
# with a function name beginning with "test_".

# Get the path to the directory for this file in the workspace.
dir_root = os.path.dirname(os.path.realpath(__file__))
# Switch to the root directory.
os.chdir(dir_root)

# Skip writing .pyc files to the bytecode cache on the cluster.
sys.dont_write_bytecode = True

# Now run pytest from the root directory, using the
# arguments that are supplied by your custom run configuration in
# your Visual Studio Code project. In this case, the custom run
# configuration JSON must contain these unique "program" and
# "args" objects:
#
# ...
# {
#   ...
#   "program": "${workspaceFolder}/path/to/this/file/in/workspace",
#   "args": ["/path/to/_test.py-files"]
# }
# ...
#
retcode = pytest.main(sys.argv[1:])
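
The runner's comments describe how pytest finds tests by file name. As a rough illustration of that rule, the following sketch lists the files under a directory whose names match pytest's default file patterns (`test_*.py` and `*_test.py`). The helper name `find_test_files` is hypothetical and is not part of pytest or the Databricks extension. pytest's real collection also honors configuration such as `python_files` and `testpaths`, which this sketch ignores.

```python
from fnmatch import fnmatch
from pathlib import Path
from typing import List

# pytest's default file-name patterns for test modules.
DEFAULT_PATTERNS = ("test_*.py", "*_test.py")


def find_test_files(root: str) -> List[str]:
    """Return sorted paths (relative to root) that match the default patterns.

    Hypothetical helper for illustration only; pytest performs its own
    collection and does not call anything like this.
    """
    root_path = Path(root)
    matches = []
    for path in root_path.rglob("*.py"):
        if any(fnmatch(path.name, pattern) for pattern in DEFAULT_PATTERNS):
            matches.append(path.relative_to(root_path).as_posix())
    return sorted(matches)
```

For the example project, calling `find_test_files(".")` from the project root would be expected to include spark_test.py but not pytest_databricks.py, because the runner's file name matches neither pattern.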

Step 3: Create a custom run configuration

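To instruct pytest to run your tests, you must create a custom run configuration. Use the existing Databricks cluster-based run configuration as the starting point for your own custom run configuration, as follows: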

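1. On the main menu, click Run > Add configuration.
2. In the Command Palette, select Databricks.

   Visual Studio Code adds a .vscode/launch.json file to your project, if this file does not already exist.
3. Change the starter run configuration as follows, and then save the file:
   - Change this run configuration's name from Run on Databricks to a unique display name for this configuration (in this example, Unit Tests (on Databricks)).
   - Change program from ${file} to the path in the project that contains the test runner (in this example, ${workspaceFolder}/pytest_databricks.py).
   - Change args from [] to the path in the project that contains the files with your tests (in this example, ["."]).

Your launch.json file should look like this: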
```
{
  // Use IntelliSense to learn about possible attributes.
  // Hover to view descriptions of existing attributes.
  // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
  "version": "0.2.0",
  "configurations": [
    {
      "type": "databricks",
      "request": "launch",
      "name": "Unit Tests (on Databricks)",
      "program": "${workspaceFolder}/pytest_databricks.py",
      "args": ["."],
      "env": {}
    }
  ]
}
```
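
As an optional sanity check on the settings changed above, the following sketch loads a launch.json file's text and reports obvious problems in a named run configuration. The function names `load_jsonc`, `find_configuration`, and `check_configuration` are hypothetical and are not part of Visual Studio Code or the Databricks extension. The comment handling is a simplification that assumes `//` comments occupy whole lines, as they do in the file above.

```python
import json
from typing import List, Optional


def load_jsonc(text: str) -> dict:
    """Parse launch.json text after dropping whole-line // comments.

    Simplification: trailing comments and /* ... */ blocks are not handled.
    """
    kept = [line for line in text.splitlines()
            if not line.lstrip().startswith("//")]
    return json.loads("\n".join(kept))


def find_configuration(launch: dict, name: str) -> Optional[dict]:
    """Return the run configuration with the given display name, or None."""
    for config in launch.get("configurations", []):
        if config.get("name") == name:
            return config
    return None


def check_configuration(config: dict) -> List[str]:
    """Return a list of problems; an empty list means none were found."""
    problems = []
    if config.get("type") != "databricks":
        problems.append('"type" should be "databricks"')
    if config.get("program") == "${file}":
        problems.append('"program" still points at ${file}')
    if not config.get("args"):
        problems.append('"args" is empty; pass the path that contains the tests')
    return problems
```

With the file shown above, `check_configuration(find_configuration(load_jsonc(text), "Unit Tests (on Databricks)"))` should return an empty list.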

Step 4: Run the tests

First, make sure that pytest is already installed on the cluster. For example, with the cluster's settings page open in your Azure Databricks workspace, do the following:

1. On the Libraries tab, if pytest is listed, then pytest is already installed. If pytest is not listed, click Install new.
2. For Library Source, click PyPI.
3. For Package, enter pytest.
4. Click Install.
5. Wait until Status changes from Pending to Installed.

To run the tests, do the following from your Visual Studio Code project:

1. On the main menu, click View > Run.
2. In the Run and Debug list, click Unit Tests (on Databricks), if it is not already selected.
3. Click the green arrow (Start Debugging) icon.

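The pytest results appear in the Debug Console (View > Debug Console on the main menu). For example, the following results show that at least one test was found in the spark_test.py file, and the dot (.) means that a single test was found and passed. (A failing test would show an F.)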

<date>, <time> - Creating execution context on cluster <cluster-id> ...
<date>, <time> - Synchronizing code to /Workspace/path/to/directory ...
<date>, <time> - Running /pytest_databricks.py ...
============================= test session starts ==============================
platform linux -- Python <version>, pytest-<version>, pluggy-<version>
rootdir: /Workspace/path/to/directory
collected 1 item

spark_test.py .                                                          [100%]

============================== 1 passed in 3.25s ===============================
<date>, <time> - Done (took 10818ms)
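
To make the progress markers concrete, the following sketch counts the `.` (passed) and `F` (failed) markers in a single progress line such as `spark_test.py . [100%]`. The function name `summarize_progress_line` is hypothetical, and the sketch handles only the two markers described above; pytest can print other markers, which it ignores.

```python
import re


def summarize_progress_line(line: str) -> dict:
    """Count passed ('.') and failed ('F') markers in one progress line.

    Assumes the form '<file> <markers> [<percent>%]'. Lines that do not
    match this form yield zero counts.
    """
    # Skip the file name, then capture the marker run before the '[NN%]' part.
    match = re.match(r"^\S+\s+([^\s\[]+)", line)
    markers = match.group(1) if match else ""
    return {"passed": markers.count("."), "failed": markers.count("F")}
```

For the output above, the `spark_test.py` line contains a single `.`, so the sketch would report one passed test and no failures.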
