Run tests with pytest for the Databricks extension for Visual Studio Code

This article describes how to run tests by using pytest for the Databricks extension for Visual Studio Code. See What is the Databricks extension for Visual Studio Code?.

This information assumes that you have already installed and set up the Databricks extension for Visual Studio Code. See Install the Databricks extension for Visual Studio Code.

You can run pytest on local code that does not need a connection to a cluster in a remote Azure Databricks workspace. For example, you might use pytest to test your functions that accept and return PySpark DataFrames in local memory. To get started with pytest and run it locally, see Get Started in the pytest documentation.
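
For example, the following sketch tests a function that operates entirely on local DataFrames. The function with_doubled_amount and its test are hypothetical, and the test assumes that pyspark is installed in your local Python environment:

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col

def with_doubled_amount(df: DataFrame) -> DataFrame:
  # Hypothetical function under test: adds a "doubled" column that
  # contains twice the value of the "amount" column.
  return df.withColumn('doubled', col('amount') * 2)

def test_with_doubled_amount():
  # Create a local SparkSession; no remote cluster is needed.
  spark = SparkSession.builder.master('local[1]').getOrCreate()
  df = spark.createDataFrame([(1,), (2,)], ['amount'])
  result = with_doubled_amount(df).collect()
  assert [row['doubled'] for row in result] == [2, 4]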

To run pytest on code in a remote Azure Databricks workspace, do the following in your Visual Studio Code project:

Step 1: Create the tests

Add a Python file with the following code, which contains your tests to run. This example assumes that this file is named spark_test.py and is at the root of your Visual Studio Code project. The file contains a pytest fixture, which makes the cluster’s SparkSession (the entry point to Spark functionality on the cluster) available to the tests, and a single test that checks whether a specified cell in a table contains the specified value. You can add your own tests to this file as needed.

from pyspark.sql import SparkSession
import pytest

@pytest.fixture
def spark() -> SparkSession:
  # Create a SparkSession (the entry point to Spark functionality) on
  # the cluster in the remote Databricks workspace. Unit tests do not
  # have access to this SparkSession by default.
  return SparkSession.builder.getOrCreate()

# Now add your unit tests.

# For example, here is a unit test that must be run on the
# cluster in the remote Databricks workspace.
# This example determines whether the specified cell in the
# specified table contains the specified value. For example,
# the third column in the first row should contain the word "Ideal":
#
# +----+-------+-------+-------+---------+-------+-------+-------+------+-------+------+
# |_c0 | carat | cut   | color | clarity | depth | table | price | x    | y     | z    |
# +----+-------+-------+-------+---------+-------+-------+-------+------+-------+------+
# | 1  | 0.23  | Ideal | E     | SI2     | 61.5  | 55    | 326   | 3.95 | 3.98  | 2.43 |
# +----+-------+-------+-------+---------+-------+-------+-------+------+-------+------+
# ...
#
def test_spark(spark):
  spark.sql('USE default')
  data = spark.sql('SELECT * FROM diamonds')
  assert data.collect()[0][2] == 'Ideal'
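
By default, a pytest fixture runs once for each test that requests it. If you prefer to reuse one SparkSession object across every test in the run, a minor variation is to give the fixture session scope. Because SparkSession.builder.getOrCreate() returns any existing session, the practical effect is simply to avoid re-running the fixture for each test:

@pytest.fixture(scope='session')
def spark() -> SparkSession:
  # Create (or reuse) one SparkSession for the entire test session.
  return SparkSession.builder.getOrCreate()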

Step 2: Create the pytest runner

Add a Python file with the following code, which instructs pytest to run your tests from the previous step. This example assumes that the file is named pytest_databricks.py and is at the root of your Visual Studio Code project.

import pytest
import os
import sys

# Run all tests in the connected directory in the remote Databricks workspace.
# By default, pytest searches for tests in files with names that begin with
# "test_" or end with "_test.py". Within each of these files, pytest runs
# each function with a name beginning with "test_".

# Get the path to the directory for this file in the workspace.
dir_root = os.path.dirname(os.path.realpath(__file__))
# Switch to the root directory.
os.chdir(dir_root)

# Skip writing .pyc files to the bytecode cache on the cluster.
sys.dont_write_bytecode = True

# Now run pytest from the root directory, using the
# arguments that are supplied by your custom run configuration in
# your Visual Studio Code project. In this case, the custom run
# configuration JSON must contain these unique "program" and
# "args" objects:
#
# ...
# {
#   ...
#   "program": "${workspaceFolder}/path/to/this/file/in/workspace",
#   "args": ["/path/to/_test.py-files"]
# }
# ...
#
retcode = pytest.main(sys.argv[1:])
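
Note that pytest.main returns pytest's exit code but does not itself stop the script or signal failure. If you want a failing test run to also fail the runner script with a nonzero exit status, one optional addition is to pass the return code to sys.exit:

# Optional: propagate pytest's exit code so that a failing test run
# also ends this script with a nonzero exit status.
sys.exit(retcode)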

Step 3: Create a custom run configuration

To instruct pytest to run your tests, you must create a custom run configuration. Use the existing Databricks cluster-based run configuration to create your own custom run configuration, as follows:

  1. On the main menu, click Run > Add configuration.

  2. In the Command Palette, select Databricks.

    Visual Studio Code adds a .vscode/launch.json file to your project, if this file does not already exist.

  3. Change the starter run configuration as follows, and then save the file:

    • Change this run configuration’s name from Run on Databricks to some unique display name for this configuration, in this example Unit Tests (on Databricks).
    • Change program from ${file} to the path in the project that contains the test runner, in this example ${workspaceFolder}/pytest_databricks.py.
    • Change args from [] to the path in the project that contains the files with your tests, in this example ["."].

    Your launch.json file should look like this:

    {
      // Use IntelliSense to learn about possible attributes.
      // Hover to view descriptions of existing attributes.
      // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
      "version": "0.2.0",
      "configurations": [
        {
          "type": "databricks",
          "request": "launch",
          "name": "Unit Tests (on Databricks)",
          "program": "${workspaceFolder}/pytest_databricks.py",
          "args": ["."],
          "env": {}
        }
      ]
    }
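
    Because pytest_databricks.py passes sys.argv[1:] straight through to pytest.main, you can append standard pytest command-line options to args. For example, this variation (the -v and -k values are illustrative) enables verbose output and runs only tests whose names match a keyword expression:

    "args": [".", "-v", "-k", "test_spark"]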
    

Step 4: Run the tests

Make sure that pytest is installed on the cluster. For example, with the cluster’s settings page open in your Azure Databricks workspace, do the following:

  1. On the Libraries tab, if pytest is visible, then pytest is already installed. If pytest is not visible, click Install new.
  2. For Library Source, click PyPI.
  3. For Package, enter pytest.
  4. Click Install.
  5. Wait until Status changes from Pending to Installed.
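
If you prefer to confirm from code that pytest is available on the cluster, one quick check (run, for example, in a notebook attached to the cluster) is to import pytest and print its version:

import pytest

# Print the installed pytest version, for example "7.4.0".
print(pytest.__version__)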

To run the tests, do the following from your Visual Studio Code project:

  1. On the main menu, click View > Run.
  2. In the Run and Debug list, click Unit Tests (on Databricks), if it is not already selected.
  3. Click the green arrow (Start Debugging) icon.

The pytest results display in the Debug Console (View > Debug Console on the main menu). For example, these results show that one test was found in the spark_test.py file; the dot (.) after the filename indicates that the test passed. (A failing test would show an F instead.)

<date>, <time> - Creating execution context on cluster <cluster-id> ...
<date>, <time> - Synchronizing code to /Workspace/path/to/directory ...
<date>, <time> - Running /pytest_databricks.py ...
============================= test session starts ==============================
platform linux -- Python <version>, pytest-<version>, pluggy-<version>
rootdir: /Workspace/path/to/directory
collected 1 item

spark_test.py .                                                          [100%]

============================== 1 passed in 3.25s ===============================
<date>, <time> - Done (took 10818ms)