Run tests with pytest using the Databricks extension for Visual Studio Code

Article
12/27/2024

This article describes how to run tests with pytest using the Databricks extension for Visual Studio Code. See What is the Databricks extension for Visual Studio Code?

You can run pytest on local code that does not need a connection to a cluster in a remote Azure Databricks workspace. For example, you might use pytest to test functions that accept and return PySpark DataFrames in local memory. To get started with pytest and run it locally, see the pytest documentation.

To run pytest on code in a remote Azure Databricks workspace, do the following in your Visual Studio Code project:

Step 1: Create the tests

Add a Python file with the following code, which contains the tests that you want to run. This example assumes that the file is named spark_test.py and is at the root of your Visual Studio Code project. The file contains a pytest fixture, which makes the cluster's SparkSession (the entry point to Spark functionality on the cluster) available to the tests. It also contains a single test that checks whether the specified cell in the table contains the specified value. You can add your own tests to this file as needed.

from pyspark.sql import SparkSession
import pytest

@pytest.fixture
def spark() -> SparkSession:
  # Create a SparkSession (the entry point to Spark functionality) on
  # the cluster in the remote Databricks workspace. Unit tests do not
  # have access to this SparkSession by default.
  return SparkSession.builder.getOrCreate()

# Now add your unit tests.

# For example, here is a unit test that must be run on the
# cluster in the remote Databricks workspace.
# This example determines whether the specified cell in the
# specified table contains the specified value. For example,
# the third column in the first row should contain the word "Ideal":
#
# +----+-------+-------+-------+---------+-------+-------+-------+------+-------+------+
# |_c0 | carat | cut   | color | clarity | depth | table | price | x    | y     | z    |
# +----+-------+-------+-------+---------+-------+-------+-------+------+-------+------+
# | 1  | 0.23  | Ideal | E     | SI2     | 61.5  | 55    | 326   | 3.95 | 3.98  | 2.43 |
# +----+-------+-------+-------+---------+-------+-------+-------+------+-------+------+
# ...
#
def test_spark(spark):
  spark.sql('USE default')
  data = spark.sql('SELECT * FROM diamonds')
  assert data.collect()[0][2] == 'Ideal'
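
The test above calls collect(), which brings every row of the table back to the driver before the first one is inspected. As one possible alternative, the sketch below reads only the first row. The helper name first_row_value is hypothetical and is not part of the original example; it relies only on the DataFrame methods select() and first().

```python
def first_row_value(df, column_name):
    """Return the value of column_name in the first row of df, or None if df is empty.

    Hypothetical helper: it relies only on DataFrame.select() and
    DataFrame.first(), so at most one row is returned to the driver.
    """
    row = df.select(column_name).first()
    return None if row is None else row[0]


# Possible use inside the test from the example above:
#   data = spark.sql('SELECT * FROM diamonds')
#   assert first_row_value(data, 'cut') == 'Ideal'
```

Selecting the column by name rather than by position also keeps the assertion valid if the column order of the table changes.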

Step 2: Create the pytest runner

Add a Python file with the following code, which instructs pytest to run your tests from the previous step. This example assumes that the file is named pytest_databricks.py and is at the root of your Visual Studio Code project.

import pytest
import os
import sys

# Run all tests in the connected directory in the remote Databricks workspace.
# By default, pytest searches through all files with filenames ending with
# "_test.py" for tests. Within each of these files, pytest runs each function
# with a function name beginning with "test_".

# Get the path to the directory for this file in the workspace.
dir_root = os.path.dirname(os.path.realpath(__file__))
# Switch to the root directory.
os.chdir(dir_root)

# Skip writing .pyc files to the bytecode cache on the cluster.
sys.dont_write_bytecode = True

# Now run pytest from the root directory, using the
# arguments that are supplied by your custom run configuration in
# your Visual Studio Code project. In this case, the custom run
# configuration JSON must contain these unique "program" and
# "args" objects:
#
# ...
# {
#   ...
#   "program": "${workspaceFolder}/path/to/this/file/in/workspace",
#   "args": ["/path/to/_test.py-files"]
# }
# ...
#
retcode = pytest.main(sys.argv[1:])
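
The runner passes sys.argv[1:] straight to pytest.main(). If you want a fallback for the case where no args are configured, or want to keep pytest from writing its .pytest_cache directory into the synchronized workspace folder, the argument list could be prepared first. The following is a minimal sketch: the function name build_pytest_args is hypothetical, and disabling the cache plugin with -p no:cacheprovider is an optional choice, not something the original runner does.

```python
import sys


def build_pytest_args(argv):
    """Build the argument list for pytest.main() (hypothetical helper).

    argv is the list of arguments after the script name, for example
    sys.argv[1:], which comes from "args" in the run configuration.
    """
    # Fall back to the current directory when no test path was supplied.
    args = list(argv) if argv else ["."]
    # Optional: disable pytest's cache plugin so that no .pytest_cache
    # directory is written to the root directory.
    if "no:cacheprovider" not in args:
        args = ["-p", "no:cacheprovider"] + args
    return args


if __name__ == "__main__":
    print(build_pytest_args(sys.argv[1:]))
```

In the runner, the last line would then read retcode = pytest.main(build_pytest_args(sys.argv[1:])).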

Step 3: Create a custom run configuration

To instruct pytest to run your tests, you must create a custom run configuration. Use the existing Databricks cluster-based run configuration as the starting point for your own custom run configuration, as follows:

On the main menu, click Run > Add configuration.
In the Command Palette, select Databricks.

Visual Studio Code adds a .vscode/launch.json file to your project, if this file does not already exist.
Change the starter run configuration as follows, and then save the file:
- Change the name of this run configuration from Run on Databricks to a unique display name for this configuration, in this example Unit Tests (on Databricks).
- Change program from ${file} to the path in the project that contains the test runner, in this example ${workspaceFolder}/pytest_databricks.py.
- Change args from [] to the path in the project that contains the files with your tests, in this example ["."].
Your launch.json file should look like this:
```
{
  // Use IntelliSense to learn about possible attributes.
  // Hover to view descriptions of existing attributes.
  // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
  "version": "0.2.0",
  "configurations": [
    {
      "type": "databricks",
      "request": "launch",
      "name": "Unit Tests (on Databricks)",
      "program": "${workspaceFolder}/pytest_databricks.py",
      "args": ["."],
      "env": {}
    }
  ]
}
```
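
To make the three changed fields easier to see, the sketch below builds the same configuration entry as a Python dictionary and prints it as JSON. It is purely illustrative: the function name make_launch_configuration is hypothetical, and the keys and values are the ones shown in the launch.json file above.

```python
import json


def make_launch_configuration(name, runner_path, test_paths):
    """Return a launch.json configuration entry as a dict (hypothetical helper)."""
    return {
        "type": "databricks",
        "request": "launch",
        "name": name,              # unique display name
        "program": runner_path,    # path to the pytest runner script
        "args": list(test_paths),  # paths that contain the test files
        "env": {},
    }


config = make_launch_configuration(
    "Unit Tests (on Databricks)",
    "${workspaceFolder}/pytest_databricks.py",
    ["."],
)
print(json.dumps({"version": "0.2.0", "configurations": [config]}, indent=2))
```

The values in args reach the runner as sys.argv[1:], which is why ["."] makes pytest search the runner's own directory for tests.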

Step 4: Run the tests

First make sure that pytest is already installed on the cluster. For example, with the cluster's settings page open in your Azure Databricks workspace, do the following:

If pytest is not listed among the cluster's installed libraries, click Install new.
For Library Source, click PyPI.
For Package, enter pytest.
Click Install.
Wait until Status changes from Pending to Installed.
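
If you would like the runner to report clearly when pytest is missing, a check along the following lines could be placed before pytest is imported. This is a sketch only: the function name is_installed is hypothetical, and it uses importlib.util.find_spec from the Python standard library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found (hypothetical helper)."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("pytest"):
    print("pytest is not installed in this environment; install it from PyPI first.")
```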

To run the tests, do the following from your Visual Studio Code project:

On the main menu, click View > Run.
In the Run and Debug list, click Unit Tests (on Databricks), if it is not already selected.
Click the green arrow (Start Debugging) icon.

The pytest results are displayed in the Debug Console (View > Debug Console on the main menu). For example, the following results show that at least one test was found in the spark_test.py file, and the dot (.) means that a single test was found and passed. (A failed test would show an F.)

<date>, <time> - Creating execution context on cluster <cluster-id> ...
<date>, <time> - Synchronizing code to /Workspace/path/to/directory ...
<date>, <time> - Running /pytest_databricks.py ...
============================= test session starts ==============================
platform linux -- Python <version>, pytest-<version>, pluggy-<version>
rootdir: /Workspace/path/to/directory
collected 1 item

spark_test.py .                                                          [100%]

============================== 1 passed in 3.25s ===============================
<date>, <time> - Done (took 10818ms)
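
In the progress line, each character after the file name stands for one test: a dot for a pass and an F for a failure, as described above. The sketch below tallies these characters for a single progress line. The function name summarize_progress_line is hypothetical, and the regular expression assumes the default, non-verbose pytest output format shown in this example.

```python
import re
from collections import Counter

# One progress line looks like: "spark_test.py .    [100%]"
# Other outcome characters (E, s, x, X) are accepted but not counted here.
_PROGRESS_LINE = re.compile(r"^(\S+)\s+([.FEsxX]+)\s+\[\s*\d+%\]\s*$")


def summarize_progress_line(line):
    """Count passed and failed tests in one pytest progress line.

    Returns None if the line does not look like a progress line.
    """
    match = _PROGRESS_LINE.match(line)
    if match is None:
        return None
    counts = Counter(match.group(2))
    return {
        "file": match.group(1),
        "passed": counts["."],
        "failed": counts["F"],
    }


print(summarize_progress_line("spark_test.py .      [100%]"))
```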
