User-defined operator YAML reference

Important

This feature is in Public Preview.

This page describes the YAML configuration for user-defined operators in Lakeflow Designer. All operator types (uc-udf, uc-udtf, and python-run-function) use the user-defined-operator-v0.1.0 schema, which defines configuration fields using the JSON Schema format.

For information about how to build user-defined operators, see User-defined operators in Lakeflow Designer.

Root properties

Every operator YAML file starts with a set of root properties that identify the operator and define its behavior. The following example shows the general structure:

schema: user-defined-operator-v0.1.0
type: python-run-function
name: My Operator
id: my_operator
version: '1.0.0'
description: >
  What this operator does.
  Can be multiple lines.
config:
  type: object
  properties:
    my_field:
      type: string
      title: My Field
      description: Help text
ports:
  input:
    - name: data
      title: Input Data
  output:
    - name: out
      title: Output
run_function:
  type: inline
  code: |
    def run(config, inputs, spark):
        return {"out": inputs["data"]}
environment:
  environment_version: '4'
  dependencies:
    - 'pandas>=2.0'

Property | Type | Required | Description
schema | string | Yes | Schema identifier. Must be user-defined-operator-v0.1.0.
type | string | Yes | Type of operator: uc-udf, uc-udtf, or python-run-function.
name | string | Yes | Display name for the operator. Keep it short to fit the Lakeflow Designer UI. Minimum length of 1 character.
id | string | Yes | Unique identifier for the operator type. Minimum length of 1 character. Consider using namespaces (such as finance. or ml.) to categorize operators.
description | string | Yes | Detailed description of what the operator does. Shown to users in the UI. Use YAML multi-line syntax (>) for longer descriptions.
config | object | Yes | JSON Schema object that defines configuration fields. See Config.
ports | object | No | Input and output port definitions. See Ports.
version | string | Yes | Version string (for example, "1.0.0"). Use this to track your own operator releases.
run_function | object | No | Inline Python code for python-run-function operators. See run_function.
environment | object | No | Python environment configuration, including dependencies. See environment.
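
The required root properties listed above can be checked before an operator file is uploaded. The following is a minimal sketch, not the validator that Lakeflow Designer runs. It assumes the YAML has already been parsed into a Python dictionary, and the function name check_root_properties is hypothetical.

```python
# Required root properties and allowed values, as listed in the table above.
REQUIRED_ROOT_PROPERTIES = ("schema", "type", "name", "id", "description", "config", "version")
OPERATOR_TYPES = ("uc-udf", "uc-udtf", "python-run-function")
SCHEMA_ID = "user-defined-operator-v0.1.0"


def check_root_properties(operator):
    """Return a list of problems found in a parsed operator definition (hypothetical helper)."""
    problems = []
    for key in REQUIRED_ROOT_PROPERTIES:
        if key not in operator:
            problems.append(f"missing required property: {key}")
    if "schema" in operator and operator["schema"] != SCHEMA_ID:
        problems.append(f"schema must be {SCHEMA_ID}")
    if "type" in operator and operator["type"] not in OPERATOR_TYPES:
        problems.append(f"unknown operator type: {operator['type']}")
    for key in ("name", "id"):
        value = operator.get(key)
        if key in operator and (not isinstance(value, str) or len(value) < 1):
            problems.append(f"{key} must be a string of at least 1 character")
    return problems


# Illustrative definition, written as the dictionary a YAML parser would produce.
# The description is left out on purpose to trigger a finding.
example = {
    "schema": "user-defined-operator-v0.1.0",
    "type": "python-run-function",
    "name": "My Operator",
    "id": "my_operator",
    "version": "1.0.0",
    "config": {"type": "object", "properties": {}},
}

print(check_root_properties(example))
```

Running a check like this locally catches a missing or misspelled root property early, but schema validation in Lakeflow Designer remains the source of truth.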

Ports

Ports define how your operator connects to other operators in the pipeline. The ports object contains input and output arrays.

ports:
  input:
    - name: input_data
      title: Input Data
      mime: application/vnd.databricks.dataframe
      allowMultiple: true
      required: true
  output:
    - name: out
      title: Output

Property | Type | Required | Description
name | string | Yes | Unique identifier for the port. Used in connections and config references.
title | string | No | Human-readable label displayed in the UI.
mime | string | No | MIME type for the port data. For example, application/vnd.databricks.dataframe.
allowMultiple | boolean | No | If true, the port accepts multiple incoming connections.
required | boolean | No | If false, the port is optional. Default: true.

Only the documented port properties are accepted. Unknown keys (such as the legacy label field) are rejected by schema validation.
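
Because unknown keys are rejected, it can help to scan port definitions for stray keys before uploading. The sketch below assumes the ports object has been parsed into a Python dictionary. The helper name check_ports is hypothetical, and treating port names as unique per direction (input or output) is an assumption, not documented behavior.

```python
# Port properties documented in the table above.
ALLOWED_PORT_KEYS = {"name", "title", "mime", "allowMultiple", "required"}


def check_ports(ports):
    """Return a list of problems found in a parsed ports object (hypothetical helper)."""
    problems = []
    for direction in ("input", "output"):
        seen = set()
        for index, port in enumerate(ports.get(direction, [])):
            where = f"ports.{direction}[{index}]"
            unknown = sorted(set(port) - ALLOWED_PORT_KEYS)
            if unknown:
                problems.append(f"{where}: unknown keys {unknown}")
            name = port.get("name")
            if not name:
                problems.append(f"{where}: name is required")
            elif name in seen:
                # Assumption: names must be unique within one direction.
                problems.append(f"{where}: duplicate name {name!r}")
            else:
                seen.add(name)
    return problems


# The input port below uses the legacy label key instead of title.
ports = {
    "input": [{"name": "input_data", "label": "Input Data"}],
    "output": [{"name": "out", "title": "Output"}],
}

print(check_ports(ports))
```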

Port examples

UDF with input and output ports:

ports:
  input:
    - name: in
      title: Input Data
  output:
    - name: out
      title: Output

UDTF with input and output ports:

ports:
  input:
    - name: input_data
      title: Input Data
  output:
    - name: clustered_data
      title: Clustered Results

python-run-function with multiple inputs and an optional port:

ports:
  input:
    - name: main_data
      title: Main Data
    - name: reference_data
      title: Reference Table
      required: false
  output:
    - name: joined_output
      title: Joined Output

Config

The config field is a JSON Schema object. You define each configuration field as a property within the schema. This format gives you access to standard JSON Schema keywords such as enum, minimum, and maximum for validation, and examples for documenting sample values.

The config object must have type: object and a properties map. You can optionally include required (an array of required property names) and additionalProperties.

config:
  type: object
  properties:
    cluster_count:
      type: number
      title: Number of Clusters
      description: How many clusters to create
      default: 3
      minimum: 1
      maximum: 100
    algorithm:
      type: string
      title: Algorithm
      description: Clustering algorithm to use
      enum: ['kmeans', 'dbscan', 'hierarchical']
      default: kmeans
    feature_col:
      type: string
      title: Feature Column
      description: Column to use as input
      format: expression
      x-ui:
        widget: expression
        port: data
  required: [cluster_count, feature_col]
  additionalProperties: false

Config property fields

Each property in the config.properties object supports the following standard JSON Schema fields:

Field | Type | Description
type | string | Data type: string, number, integer, boolean, array, or object.
title | string | Human-readable label displayed in the UI.
description | string | Help text shown to users.
default | any | Default value for the field.
examples | array | Example values for the field.
enum | array | Fixed list of allowed values.
format | string | Semantic type hint. See Format values.
minimum | number | Minimum allowed value (for number and integer types).
maximum | number | Maximum allowed value (for number and integer types).
items | object | Schema for array elements (when type is array).
properties | object | Nested property definitions (when type is object).
required | array | List of required nested property names (when type is object).

Other standard JSON Schema fields such as minLength, maxLength, pattern, and const are also supported.
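
To illustrate how these keywords constrain user input, the following sketch checks a set of config values against required, additionalProperties, enum, minimum, and maximum. It covers only that small subset of JSON Schema and is not the validator Lakeflow Designer uses. The helper name check_values is hypothetical, and the schema mirrors the clustering example earlier in this section.

```python
def check_values(schema, values):
    """Check config values against a small subset of JSON Schema keywords (hypothetical helper)."""
    problems = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in values:
            problems.append(f"{name}: required")
    for name, value in values.items():
        prop = properties.get(name)
        if prop is None:
            if schema.get("additionalProperties") is False:
                problems.append(f"{name}: not a declared property")
            continue
        if "enum" in prop and value not in prop["enum"]:
            problems.append(f"{name}: must be one of {prop['enum']}")
        # bool is a subclass of int in Python, so exclude it from range checks.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            if "minimum" in prop and value < prop["minimum"]:
                problems.append(f"{name}: below minimum {prop['minimum']}")
            if "maximum" in prop and value > prop["maximum"]:
                problems.append(f"{name}: above maximum {prop['maximum']}")
    return problems


# The config schema from the example above, as a parsed dictionary.
schema = {
    "type": "object",
    "properties": {
        "cluster_count": {"type": "number", "default": 3, "minimum": 1, "maximum": 100},
        "algorithm": {"type": "string", "enum": ["kmeans", "dbscan", "hierarchical"]},
        "feature_col": {"type": "string", "format": "expression"},
    },
    "required": ["cluster_count", "feature_col"],
    "additionalProperties": False,
}

for problem in check_values(schema, {"cluster_count": 250, "algorithm": "birch"}):
    print(problem)
```

For full coverage of JSON Schema (including minLength, pattern, and const), a complete JSON Schema validation library is a better fit than a hand-written check like this one.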

Format values

The format field on a config property provides a semantic type hint that tells Lakeflow Designer how to interpret the value. These hints enable specialized UI behavior and validation.

Format | Description
expression | Column reference or SQL expression.
table_source | Table source reference.
file_source | File source reference.
column_expressions | Column expressions.
sort_expressions | Sort expressions.
aggregation_expressions | Aggregation expressions.
ai_function_expressions | AI function expressions.
is_preview | Automatic preview mode flag. Lakeflow Designer sets this to true during workflow preview. The config property name is arbitrary; only the format: is_preview tag matters. Use this to skip side effects like external API calls during preview.
string[] | String array.
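
As an illustration of the is_preview format, the sketch below shows a run() function that skips a side effect during preview. It assumes a boolean config property named preview_mode declared with format: is_preview. That property name, the send_notification helper, and the plain Python list standing in for a DataFrame are all hypothetical.

```python
# Records calls so the example runs without a real external service.
sent_notifications = []


def send_notification(message):
    """Hypothetical side effect that stands in for an external API call."""
    sent_notifications.append(message)


def run(config, inputs, spark):
    df = inputs["data"]
    # "preview_mode" is a hypothetical property tagged with format: is_preview,
    # so it is expected to be true only during workflow preview.
    if not config.get("preview_mode", False):
        send_notification("pipeline step finished")
    return {"out": df}


# Plain Python stand-ins for the DataFrame and the SparkSession.
rows = [{"id": 1}, {"id": 2}]
run({"preview_mode": True}, {"data": rows}, spark=None)   # preview: no notification
run({"preview_mode": False}, {"data": rows}, spark=None)  # real run: one notification

print(len(sent_notifications))
```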

UI widgets

Widgets customize how a config field renders in the Lakeflow Designer interface. Define widgets in the x-ui property on each config property. If you omit the widget, Lakeflow Designer uses a default widget based on the data type.

Widget | Data type | Description
input | string | Single-line text input.
textarea | string | Multi-line text area. Supports optional rows property.
checkbox | boolean | Standard checkbox.
toggle | boolean | Toggle switch.
number | number/integer | Numeric input with optional constraints.
slider | number/integer | Visual slider for numeric ranges. Supports optional step property.
select | string | Single-select dropdown. Requires optionsSource.
multi-select | array | Multi-select dropdown. Requires optionsSource.
expression | string | Column/expression selector. Requires port.
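
The pairing of widgets, data types, and required x-ui keys in the table above can be expressed as a small lookup. This is a sketch under the assumption that a widget should only be used with the data types listed for it. Whether Lakeflow Designer rejects a mismatch is not stated here, and the helper name check_widget is hypothetical.

```python
# Data types each widget is listed with in the table above.
WIDGET_TYPES = {
    "input": {"string"},
    "textarea": {"string"},
    "checkbox": {"boolean"},
    "toggle": {"boolean"},
    "number": {"number", "integer"},
    "slider": {"number", "integer"},
    "select": {"string"},
    "multi-select": {"array"},
    "expression": {"string"},
}
# x-ui keys that the table marks as required for a widget.
REQUIRED_UI_KEYS = {"select": "optionsSource", "multi-select": "optionsSource", "expression": "port"}


def check_widget(name, prop):
    """Return problems with the x-ui widget of one config property (hypothetical helper)."""
    ui = prop.get("x-ui", {})
    widget = ui.get("widget")
    if widget is None:
        return []  # a default widget is chosen from the data type
    if widget not in WIDGET_TYPES:
        return [f"{name}: unknown widget {widget!r}"]
    problems = []
    if prop.get("type") not in WIDGET_TYPES[widget]:
        problems.append(f"{name}: widget {widget!r} does not match type {prop.get('type')!r}")
    needed = REQUIRED_UI_KEYS.get(widget)
    if needed and needed not in ui:
        problems.append(f"{name}: widget {widget!r} requires {needed}")
    return problems


# A select widget on a boolean property with no optionsSource: two findings.
prop = {"type": "boolean", "title": "Aggregation Type", "x-ui": {"widget": "select"}}

for problem in check_widget("aggregation_type", prop):
    print(problem)
```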

input

Single-line text input field.

api_endpoint:
  type: string
  title: API Endpoint
  x-ui:
    widget: input

textarea

Multi-line text area for longer content. Supports an optional rows property to control the height.

message_body:
  type: string
  title: Message Body
  x-ui:
    widget: textarea
    rows: 4

checkbox

Standard checkbox for boolean values.

send_notification:
  type: boolean
  title: Send Notification
  default: false
  x-ui:
    widget: checkbox

toggle

Toggle switch for boolean values.

enable_logging:
  type: boolean
  title: Enable Logging
  default: true
  x-ui:
    widget: toggle

number

Numeric input field. Use minimum and maximum on the property itself to constrain the range.

num_clusters:
  type: number
  title: Number of Clusters
  default: 3
  minimum: 1
  maximum: 100
  x-ui:
    widget: number

slider

Visual slider for selecting numeric values within a range. Use minimum and maximum on the property to set the range, and step in x-ui to control the increment.

confidence_threshold:
  type: number
  title: Confidence Threshold
  default: 0.8
  minimum: 0
  maximum: 1
  x-ui:
    widget: slider
    step: 0.05

select

Single-select dropdown. Requires an optionsSource to define where the dropdown values come from. See Options sources.

aggregation_type:
  type: string
  title: Aggregation Type
  x-ui:
    widget: select
    optionsSource:
      type: static
      values: ['sum', 'avg', 'min', 'max', 'count']

multi-select

Multi-select dropdown for choosing multiple values. Use type: array with items: { type: string } on the property. Requires an optionsSource. See Options sources.

feature_columns:
  type: array
  title: Feature Columns
  items:
    type: string
  x-ui:
    widget: multi-select
    optionsSource:
      type: inputColumns
      port: input_data

expression

Column/expression selector that lets users pick a column from input data or write a custom SQL expression. Set format: expression on the property and specify the input port in x-ui. This widget is useful in the following cases:

  • When the user should select a column from the input data.
  • When the user might want to write a custom SQL expression.
  • For parameters that reference dynamic data in the pipeline.

amount:
  type: string
  title: Amount
  format: expression
  x-ui:
    widget: expression
    port: input_data

Options sources

For select and multi-select widgets, you must define where the dropdown options come from using optionsSource.

Static options

A fixed list of values defined in the YAML.

optionsSource:
  type: static
  values: ['option1', 'option2', 'option3']

Property | Type | Required | Description
type | string | Yes | Must be static.
values | array | Yes | Array of string values for the dropdown.

Input columns

Dynamically populates the dropdown with column names from an input port.

optionsSource:
  type: inputColumns
  port: input_data

Property | Type | Required | Description
type | string | Yes | Must be inputColumns.
port | string | Yes | Name of the input port to get column names from. Must match the name of one of your defined input ports.
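
A port name that does not match a defined input port is an easy mistake to make when renaming ports. The sketch below cross-checks every port reference in x-ui (both the expression widget's port and an inputColumns optionsSource) against ports.input. It assumes a parsed operator dictionary and inspects only top-level config properties. The helper name check_port_references is hypothetical.

```python
def check_port_references(operator):
    """Return x-ui port references that do not name a defined input port (hypothetical helper)."""
    input_names = {port["name"] for port in operator.get("ports", {}).get("input", [])}
    problems = []
    for name, prop in operator.get("config", {}).get("properties", {}).items():
        ui = prop.get("x-ui", {})
        referenced = []
        if "port" in ui:
            referenced.append(ui["port"])
        source = ui.get("optionsSource", {})
        if source.get("type") == "inputColumns":
            referenced.append(source.get("port"))
        for port in referenced:
            if port not in input_names:
                problems.append(f"{name}: port {port!r} is not a defined input port")
    return problems


# feature_columns points at customer_data, but the only input port is input_data.
operator = {
    "config": {
        "type": "object",
        "properties": {
            "amount": {
                "type": "string",
                "format": "expression",
                "x-ui": {"widget": "expression", "port": "input_data"},
            },
            "feature_columns": {
                "type": "array",
                "items": {"type": "string"},
                "x-ui": {
                    "widget": "multi-select",
                    "optionsSource": {"type": "inputColumns", "port": "customer_data"},
                },
            },
        },
    },
    "ports": {"input": [{"name": "input_data"}], "output": [{"name": "out"}]},
}

print(check_port_references(operator))
```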

run_function

The run_function property lets you embed Python code directly in the YAML configuration for python-run-function operators. This eliminates the need to register a separate Unity Catalog function.

run_function:
  type: inline
  code: |
    def run(config, inputs, spark):
        df = inputs["data"]
        threshold = config["threshold"]
        return {"out": df.filter(df["score"] > threshold)}

Property | Type | Required | Description
type | string | Yes | Must be inline.
code | string | Yes | Python source code. Must define a run() function.

The run() function receives three arguments:

  • config: A dictionary of configuration values set by the user in the UI.
  • inputs: A dictionary mapping input port names to DataFrames.
  • spark: The active SparkSession.

The function must return a dictionary mapping output port names to DataFrames. The keys must exactly match the name field of each output port defined in ports.output. For example, with an output port named out:

return {"out": result_df}

With multiple output ports:

return {"match": match_df, "rest": rest_df}
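
Because the returned keys must match the output port names exactly, a run() function can be exercised locally before it is embedded in YAML. The following sketch uses plain Python lists as stand-ins for DataFrames and passes None for the SparkSession, so it only tests the port wiring, not Spark behavior. The helper name check_run_outputs and the score and threshold fields are hypothetical.

```python
def check_run_outputs(ports, outputs):
    """Compare the keys returned by run() with the declared output ports (hypothetical helper)."""
    expected = {port["name"] for port in ports.get("output", [])}
    returned = set(outputs)
    problems = []
    for name in sorted(expected - returned):
        problems.append(f"missing output for port {name!r}")
    for name in sorted(returned - expected):
        problems.append(f"unexpected output key {name!r}")
    return problems


def run(config, inputs, spark):
    # Lists of dictionaries stand in for DataFrames in this local test.
    rows = inputs["data"]
    threshold = config["threshold"]
    match = [row for row in rows if row["score"] > threshold]
    rest = [row for row in rows if row["score"] <= threshold]
    return {"match": match, "rest": rest}


ports = {"input": [{"name": "data"}], "output": [{"name": "match"}, {"name": "rest"}]}
rows = [{"score": 0.9}, {"score": 0.2}]
outputs = run({"threshold": 0.5}, {"data": rows}, spark=None)

# An empty list means every output port received a value and no extra keys were returned.
print(check_run_outputs(ports, outputs))
```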

environment

The environment property specifies the Python environment for python-run-function operators. Use it to pin the environment version and declare pip dependencies.

environment:
  environment_version: '4'
  dependencies:
    - 'scikit-learn>=1.3'
    - 'pandas>=2.0'

Property | Type | Required | Description
environment_version | string | No | The environment version to use. For example, "4".
dependencies | array of strings | No | List of pip dependency specifiers. Each entry follows standard pip syntax (for example, "pandas>=2.0").

Complete examples

UC-based UDF

This example defines a Unity Catalog-based UDF operator that calculates compound interest.

schema: user-defined-operator-v0.1.0
type: uc-udf
name: Compound Interest
id: finance.compound_interest
version: '1.0.0'
description: >
  Calculates compound interest based on principal, rate, and time period.

config:
  type: object
  properties:
    principal:
      type: string
      title: Principal Amount
      format: expression
      x-ui:
        widget: expression
        port: input_data

    annual_rate:
      type: number
      title: Annual Interest Rate
      default: 5.0
      minimum: 0
      maximum: 100
      x-ui:
        widget: number

    years:
      type: number
      title: Number of Years
      default: 10
      minimum: 1
      maximum: 50
      x-ui:
        widget: slider
        step: 1

    compound_frequency:
      type: string
      title: Compounding Frequency
      default: 'monthly'
      x-ui:
        widget: select
        optionsSource:
          type: static
          values: ['daily', 'monthly', 'quarterly', 'annually']
  required: [principal, annual_rate]
  additionalProperties: false

ports:
  input:
    - name: input_data
      title: Input Data
  output:
    - name: out
      title: Output

Python run-function operator

This example defines a python-run-function operator that segments customers using K-Means clustering.

schema: user-defined-operator-v0.1.0
type: python-run-function
name: Customer Segmentation
id: ml.customer_segmentation
version: '1.2.0'
description: >
  Segments customers into groups based on selected features
  using K-Means clustering. Returns customer IDs with their
  assigned segment numbers.

config:
  type: object
  properties:
    num_segments:
      type: integer
      title: Number of Segments
      description: How many customer segments to create
      default: 3
      minimum: 2
      maximum: 20
      x-ui:
        widget: number
    customer_id_column:
      type: string
      title: Customer ID Column
      description: Column containing customer identifiers
      x-ui:
        widget: select
        optionsSource:
          type: inputColumns
          port: customer_data
    feature_columns:
      type: array
      title: Feature Columns
      description: Columns to use for segmentation
      items:
        type: string
      x-ui:
        widget: multi-select
        optionsSource:
          type: inputColumns
          port: customer_data
    normalize_features:
      type: boolean
      title: Normalize Features
      description: Whether to normalize feature values before clustering
      default: true
      x-ui:
        widget: toggle
  required: [num_segments, customer_id_column, feature_columns]
  additionalProperties: false

ports:
  input:
    - name: customer_data
      title: Customer Data
      mime: application/vnd.databricks.dataframe
  output:
    - name: segmented_customers
      title: Segmented Customers

run_function:
  type: inline
  code: |
    def run(config, inputs, spark):
        from pyspark.ml.feature import VectorAssembler, StandardScaler
        from pyspark.ml.clustering import KMeans

        df = inputs["customer_data"]
        id_col = config["customer_id_column"]
        features = config["feature_columns"]
        k = config["num_segments"]
        normalize = config.get("normalize_features", True)

        assembler = VectorAssembler(inputCols=features, outputCol="features_vec")
        assembled = assembler.transform(df)

        if normalize:
            scaler = StandardScaler(inputCol="features_vec", outputCol="scaled_features")
            model = scaler.fit(assembled)
            assembled = model.transform(assembled)
            feature_col = "scaled_features"
        else:
            feature_col = "features_vec"

        kmeans = KMeans(k=k, featuresCol=feature_col, predictionCol="segment")
        result = kmeans.fit(assembled).transform(assembled)

        return {"segmented_customers": result.select(id_col, "segment")}

environment:
  environment_version: '4'
  dependencies:
    - 'scikit-learn>=1.3'

Quick reference

Required root properties

  • schema: user-defined-operator-v0.1.0
  • name: Display name
  • id: Unique identifier
  • description: What the operator does
  • config: JSON Schema object
  • type: uc-udf, uc-udtf, or python-run-function
  • version: Author-defined version string

Optional root properties

  • ports: Input and output port definitions
  • run_function: Inline Python code (python-run-function only)
  • environment: Python environment and dependencies (python-run-function only)

Config property data types

string | boolean | number | integer | array | object

UI widgets

input | textarea | checkbox | toggle | number | slider | select | multi-select | expression

Options sources

static (fixed values) | inputColumns (from input port)

Format values

expression | table_source | file_source | column_expressions | sort_expressions | aggregation_expressions | ai_function_expressions | is_preview | string[]