Important
This feature is in Public Preview.
This page describes the YAML configuration for user-defined operators in Lakeflow Designer. All operator types (uc-udf, uc-udtf, and python-run-function) use the user-defined-operator-v0.1.0 schema, which defines configuration fields using the JSON Schema format.
For information about how to build user-defined operators, see User-defined operators in Lakeflow Designer.
Root properties
Every operator YAML file starts with a set of root properties that identify the operator and define its behavior. The following example shows the general structure:
schema: user-defined-operator-v0.1.0
type: python-run-function
name: My Operator
id: my_operator
version: '1.0.0'
description: >
  What this operator does.
  Can be multiple lines.
config:
  type: object
  properties:
    my_field:
      type: string
      title: My Field
      description: Help text
ports:
  input:
    - name: data
      title: Input Data
  output:
    - name: out
      title: Output
run_function:
  type: inline
  code: |
    def run(config, inputs, spark):
        return {"out": inputs["data"]}
environment:
  environment_version: '4'
  dependencies:
    - 'pandas>=2.0'
| Property | Type | Required | Description |
|---|---|---|---|
| schema | string | Yes | Schema identifier. Must be user-defined-operator-v0.1.0. |
| type | string | Yes | Type of operator: uc-udf, uc-udtf, or python-run-function. |
| name | string | Yes | Display name for the operator. Keep it short to fit the Lakeflow Designer UI. Minimum length of 1 character. |
| id | string | Yes | Unique identifier for the operator type. Minimum length of 1 character. Consider using namespaces (such as finance. or ml.) to categorize operators. |
| description | string | Yes | Detailed description of what the operator does. Shown to users in the UI. Use YAML multi-line syntax (>) for longer descriptions. |
| config | object | Yes | JSON Schema object that defines configuration fields. See Config. |
| ports | object | No | Input and output port definitions. See Ports. |
| version | string | Yes | Version string (for example, "1.0.0"). Use this to track your own operator releases. |
| run_function | object | No | Inline Python code for python-run-function operators. See run_function. |
| environment | object | No | Python environment configuration, including dependencies. See environment. |
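The required and optional root properties listed above can be checked mechanically before you load a definition into Lakeflow Designer. The following is a minimal sketch, not the validator that Lakeflow Designer itself uses: check_root is a hypothetical helper, and it assumes the YAML file has already been parsed into a Python dictionary.

```python
# Values copied from the root properties table; check_root itself is a
# hypothetical helper and is not part of Lakeflow Designer.
SCHEMA_ID = "user-defined-operator-v0.1.0"
OPERATOR_TYPES = ("uc-udf", "uc-udtf", "python-run-function")
REQUIRED_ROOT = ("schema", "type", "name", "id", "version", "description", "config")


def check_root(operator):
    """Return a list of problems found in an operator's root properties."""
    problems = [
        f"missing required property: {key}"
        for key in REQUIRED_ROOT
        if key not in operator
    ]
    if "schema" in operator and operator["schema"] != SCHEMA_ID:
        problems.append(f"schema must be {SCHEMA_ID}")
    if "type" in operator and operator["type"] not in OPERATOR_TYPES:
        problems.append(f"unknown operator type: {operator['type']}")
    for key in ("name", "id"):
        # name and id have a minimum length of 1 character.
        if key in operator and len(str(operator[key])) < 1:
            problems.append(f"{key} must not be empty")
    return problems


# An operator definition, already parsed from YAML, that omits version.
operator = {
    "schema": SCHEMA_ID,
    "type": "python-run-function",
    "name": "My Operator",
    "id": "my_operator",
    "description": "What this operator does.",
    "config": {"type": "object", "properties": {}},
}
print(check_root(operator))  # ['missing required property: version']
```

An empty list means that the root properties satisfy the rules in the table. The sketch does not inspect the contents of config, ports, run_function, or environment.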
Ports
Ports define how your operator connects to other operators in the pipeline. The ports object contains input and output arrays.
ports:
  input:
    - name: input_data
      title: Input Data
      mime: application/vnd.databricks.dataframe
      allowMultiple: true
      required: true
  output:
    - name: out
      title: Output
| Property | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Unique identifier for the port. Used in connections and config references. |
| title | string | No | Human-readable label displayed in the UI. |
| mime | string | No | MIME type for the port data. For example, application/vnd.databricks.dataframe. |
| allowMultiple | boolean | No | If true, the port accepts multiple incoming connections. |
| required | boolean | No | If false, the port is optional. Default: true. |
Only the documented port properties are accepted. Unknown keys (such as the legacy label field) are rejected by schema validation.
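To catch a leftover legacy key before schema validation rejects the file, you can compare each port against the documented property names. This is a rough sketch under stated assumptions: check_ports is a hypothetical helper, the ports object is already parsed into a dictionary, and the messages are illustrative rather than what Lakeflow Designer reports.

```python
# Port properties documented in the table above.
ALLOWED_PORT_KEYS = {"name", "title", "mime", "allowMultiple", "required"}


def check_ports(ports):
    """Return a list of problems for a parsed ports object (hypothetical helper)."""
    problems = []
    for direction in ("input", "output"):
        for port in ports.get(direction, []):
            label = port.get("name", "<unnamed>")
            if "name" not in port:
                problems.append(f"{direction} port is missing the required name")
            unknown = sorted(set(port) - ALLOWED_PORT_KEYS)
            if unknown:
                problems.append(f"{direction} port {label!r}: unknown keys {unknown}")
    return problems


ports = {
    "input": [{"name": "data", "label": "Input Data"}],  # legacy label key
    "output": [{"name": "out", "title": "Output"}],
}
print(check_ports(ports))  # ["input port 'data': unknown keys ['label']"]
```

Renaming label to title in the input port makes the check return an empty list.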
Port examples
UDF with input and output ports:
ports:
  input:
    - name: in
      title: Input Data
  output:
    - name: out
      title: Output
UDTF with input and output ports:
ports:
  input:
    - name: input_data
      title: Input Data
  output:
    - name: clustered_data
      title: Clustered Results
python-run-function with multiple inputs and an optional port:
ports:
  input:
    - name: main_data
      title: Main Data
    - name: reference_data
      title: Reference Table
      required: false
  output:
    - name: joined_output
      title: Joined Output
Config
The config field is a JSON Schema object. You define each configuration field as a property within the schema. This format gives you access to standard JSON Schema validation features like enum, minimum, maximum, and examples.
The config object must have type: object and a properties map. You can optionally include required (an array of required property names) and additionalProperties.
config:
  type: object
  properties:
    cluster_count:
      type: number
      title: Number of Clusters
      description: How many clusters to create
      default: 3
      minimum: 1
      maximum: 100
    algorithm:
      type: string
      title: Algorithm
      description: Clustering algorithm to use
      enum: ['kmeans', 'dbscan', 'hierarchical']
      default: kmeans
    feature_col:
      type: string
      title: Feature Column
      description: Column to use as input
      format: expression
      x-ui:
        widget: expression
        port: data
  required: [cluster_count, feature_col]
  additionalProperties: false
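To make the effect of required, additionalProperties, enum, minimum, and maximum concrete, the following sketch applies those keywords to a set of user-supplied values using their standard JSON Schema meaning. It is illustrative only: validate_values and is_number are hypothetical helpers, the source does not describe how Lakeflow Designer reports validation errors, and a full JSON Schema validator covers many more keywords.

```python
def is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_values(schema, values):
    """Check user-supplied values against a config schema (illustrative only)."""
    props = schema.get("properties", {})
    problems = [
        f"{name}: required"
        for name in schema.get("required", [])
        if name not in values
    ]
    for name, value in values.items():
        if name not in props:
            if schema.get("additionalProperties") is False:
                problems.append(f"{name}: not a declared property")
            continue
        spec = props[name]
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name}: must be one of {spec['enum']}")
        if is_number(value):
            if "minimum" in spec and value < spec["minimum"]:
                problems.append(f"{name}: below minimum {spec['minimum']}")
            if "maximum" in spec and value > spec["maximum"]:
                problems.append(f"{name}: above maximum {spec['maximum']}")
    return problems


# The config schema from the example above, reduced to the validation keywords.
schema = {
    "type": "object",
    "properties": {
        "cluster_count": {"type": "number", "default": 3, "minimum": 1, "maximum": 100},
        "algorithm": {"type": "string", "enum": ["kmeans", "dbscan", "hierarchical"]},
        "feature_col": {"type": "string", "format": "expression"},
    },
    "required": ["cluster_count", "feature_col"],
    "additionalProperties": False,
}
for problem in validate_values(schema, {"cluster_count": 500, "algorithm": "svm"}):
    print(problem)
```

With these values, the sketch reports that feature_col is missing, that cluster_count exceeds its maximum, and that algorithm is not one of the allowed values.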
Config property fields
Each property in the config.properties object supports the following standard JSON Schema fields:
| Field | Type | Description |
|---|---|---|
| type | string | Data type: string, number, integer, boolean, array, or object. |
| title | string | Human-readable label displayed in the UI. |
| description | string | Help text shown to users. |
| default | any | Default value for the field. |
| examples | array | Example values for the field. |
| enum | array | Fixed list of allowed values. |
| format | string | Semantic type hint. See Format values. |
| minimum | number | Minimum allowed value (for number and integer types). |
| maximum | number | Maximum allowed value (for number and integer types). |
| items | object | Schema for array elements (when type is array). |
| properties | object | Nested property definitions (when type is object). |
| required | array | List of required nested property names (when type is object). |
Other standard JSON Schema fields such as minLength, maxLength, pattern, and const are also supported.
Format values
The format field on a config property provides a semantic type hint that tells Lakeflow Designer how to interpret the value. These hints enable specialized UI behavior and validation.
| Format | Description |
|---|---|
| expression | Column reference or SQL expression. |
| table_source | Table source reference. |
| file_source | File source reference. |
| column_expressions | Column expressions. |
| sort_expressions | Sort expressions. |
| aggregation_expressions | Aggregation expressions. |
| ai_function_expressions | AI function expressions. |
| is_preview | Automatic preview mode flag. Lakeflow Designer sets this to true during workflow preview. The config property name is arbitrary; only the format: is_preview tag matters. Use this to skip side effects like external API calls during preview. |
| string[] | String array. |
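As an illustration of the is_preview format, the sketch below shows a run() function that skips a side effect during preview. Several names are assumptions: preview_mode is a hypothetical config property (assumed to be declared with type: boolean and format: is_preview), and notify_external_service is a hypothetical stand-in for a real external call.

```python
sent = []  # stands in for an external system in this sketch


def notify_external_service(df):
    """Hypothetical side effect, such as a call to an external API."""
    sent.append(df)


def run(config, inputs, spark):
    df = inputs["data"]
    # "preview_mode" is a hypothetical property name. This sketch assumes it is
    # declared in config.properties with format: is_preview, so Lakeflow
    # Designer sets it to true during workflow preview.
    if not config.get("preview_mode", False):
        notify_external_service(df)
    return {"out": df}
```

The output is the same in both modes, so downstream operators behave identically during preview. Only the side effect is skipped.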
UI widgets
Widgets customize how a config field renders in the Lakeflow Designer interface. Define widgets in the x-ui property on each config property. If you omit the widget, Lakeflow Designer uses a default widget based on the data type.
| Widget | Data type | Description |
|---|---|---|
| input | string | Single-line text input. |
| textarea | string | Multi-line text area. Supports optional rows property. |
| checkbox | boolean | Standard checkbox. |
| toggle | boolean | Toggle switch. |
| number | number/integer | Numeric input with optional constraints. |
| slider | number/integer | Visual slider for numeric ranges. Supports optional step property. |
| select | string | Single-select dropdown. Requires optionsSource. |
| multi-select | array | Multi-select dropdown. Requires optionsSource. |
| expression | string | Column/expression selector. Requires port. |
input
Single-line text input field.
api_endpoint:
  type: string
  title: API Endpoint
  x-ui:
    widget: input
textarea
Multi-line text area for longer content. Supports an optional rows property to control the height.
message_body:
  type: string
  title: Message Body
  x-ui:
    widget: textarea
    rows: 4
checkbox
Standard checkbox for boolean values.
send_notification:
  type: boolean
  title: Send Notification
  default: false
  x-ui:
    widget: checkbox
toggle
Toggle switch for boolean values.
enable_logging:
  type: boolean
  title: Enable Logging
  default: true
  x-ui:
    widget: toggle
number
Numeric input field. Use minimum and maximum on the property itself to constrain the range.
num_clusters:
  type: number
  title: Number of Clusters
  default: 3
  minimum: 1
  maximum: 100
  x-ui:
    widget: number
slider
Visual slider for selecting numeric values within a range. Use minimum and maximum on the property to set the range, and step in x-ui to control the increment.
confidence_threshold:
  type: number
  title: Confidence Threshold
  default: 0.8
  minimum: 0
  maximum: 1
  x-ui:
    widget: slider
    step: 0.05
select
Single-select dropdown. Requires an optionsSource to define where the dropdown values come from. See Options sources.
aggregation_type:
  type: string
  title: Aggregation Type
  x-ui:
    widget: select
    optionsSource:
      type: static
      values: ['sum', 'avg', 'min', 'max', 'count']
multi-select
Multi-select dropdown for choosing multiple values. Use type: array with items: { type: string } on the property. Requires an optionsSource. See Options sources.
feature_columns:
  type: array
  title: Feature Columns
  items:
    type: string
  x-ui:
    widget: multi-select
    optionsSource:
      type: inputColumns
      port: input_data
expression
Column/expression selector that lets users pick a column from the input data or write a custom SQL expression. Set format: expression on the property and specify the input port in x-ui. Use this widget when:
- The user should select a column from the input data.
- The user might want to write a custom SQL expression.
- A parameter references dynamic data in the pipeline.
amount:
  type: string
  title: Amount
  format: expression
  x-ui:
    widget: expression
    port: input_data
Options sources
For select and multi-select widgets, you must define where the dropdown options come from using optionsSource.
Static options
A fixed list of values defined in the YAML.
optionsSource:
  type: static
  values: ['option1', 'option2', 'option3']
| Property | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be static. |
| values | array | Yes | Array of string values for the dropdown. |
Input columns
Dynamically populates the dropdown with column names from an input port.
optionsSource:
  type: inputColumns
  port: input_data
| Property | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be inputColumns. |
| port | string | Yes | Name of the input port to get column names from. Must match the name of one of your defined input ports. |
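A mistyped port name is an easy mistake to make, so it can help to cross-check every port reference against ports.input. The sketch below is one possible check, not part of Lakeflow Designer: check_port_references is a hypothetical helper that works on an already parsed operator definition. Applying the same rule to the port of an expression widget is an assumption based on that widget also naming an input port.

```python
def check_port_references(operator):
    """List config fields whose x-ui port is not a defined input port."""
    ports = operator.get("ports", {})
    input_names = {port["name"] for port in ports.get("input", [])}
    problems = []
    for field, spec in operator.get("config", {}).get("properties", {}).items():
        ui = spec.get("x-ui", {})
        source = ui.get("optionsSource", {})
        referenced = []
        if source.get("type") == "inputColumns":
            referenced.append(source.get("port"))
        if ui.get("widget") == "expression":
            # Assumption: the expression widget's port follows the same rule.
            referenced.append(ui.get("port"))
        for port in referenced:
            if port not in input_names:
                problems.append(f"{field}: port {port!r} is not a defined input port")
    return problems


operator = {
    "ports": {"input": [{"name": "customer_data"}]},
    "config": {
        "type": "object",
        "properties": {
            "customer_id_column": {
                "type": "string",
                "x-ui": {
                    "widget": "select",
                    # Typo: the defined input port is customer_data.
                    "optionsSource": {"type": "inputColumns", "port": "customers"},
                },
            },
            "feature_columns": {
                "type": "array",
                "items": {"type": "string"},
                "x-ui": {
                    "widget": "multi-select",
                    "optionsSource": {"type": "inputColumns", "port": "customer_data"},
                },
            },
        },
    },
}
print(check_port_references(operator))
# ["customer_id_column: port 'customers' is not a defined input port"]
```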
run_function
The run_function property lets you embed Python code directly in the YAML configuration for python-run-function operators. This eliminates the need to register a separate Unity Catalog function.
run_function:
  type: inline
  code: |
    def run(config, inputs, spark):
        df = inputs["data"]
        threshold = config["threshold"]
        return {"out": df.filter(df["score"] > threshold)}
| Property | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be inline. |
| code | string | Yes | Python source code. Must define a run() function. |
The run() function receives three arguments:
- config: A dictionary of configuration values set by the user in the UI.
- inputs: A dictionary mapping input port names to DataFrames.
- spark: The active SparkSession.
The function must return a dictionary mapping output port names to DataFrames. The keys must exactly match the name field of each output port defined in ports.output. For example, with an output port named out:
return {"out": result_df}
With multiple output ports:
return {"match": match_df, "rest": rest_df}
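When you test a run() function locally, a small guard can confirm that the returned keys match the output port names. This is a sketch of such a guard: check_outputs is a hypothetical helper, plain strings stand in for DataFrames, and the source does not describe how Lakeflow Designer reports a mismatch.

```python
def check_outputs(result, ports):
    """Raise ValueError if run() output keys differ from ports.output names."""
    expected = {port["name"] for port in ports.get("output", [])}
    returned = set(result)
    if returned != expected:
        missing = sorted(expected - returned)
        extra = sorted(returned - expected)
        raise ValueError(f"missing outputs: {missing}; unexpected outputs: {extra}")


ports = {"output": [{"name": "match"}, {"name": "rest"}]}
check_outputs({"match": "match_df", "rest": "rest_df"}, ports)  # passes silently

try:
    check_outputs({"match": "match_df", "others": "rest_df"}, ports)
except ValueError as err:
    print(err)  # missing outputs: ['rest']; unexpected outputs: ['others']
```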
environment
The environment property specifies the Python environment for python-run-function operators. Use it to pin the environment version and declare pip dependencies.
environment:
  environment_version: '4'
  dependencies:
    - 'scikit-learn>=1.3'
    - 'pandas>=2.0'
| Property | Type | Required | Description |
|---|---|---|---|
| environment_version | string | No | The environment version to use. For example, "4". |
| dependencies | array of strings | No | List of pip dependency specifiers. Each entry follows standard pip syntax (for example, "pandas>=2.0"). |
Complete examples
UC-based UDF
This example defines a Unity Catalog-based UDF operator that calculates compound interest.
schema: user-defined-operator-v0.1.0
type: uc-udf
name: Compound Interest
id: finance.compound_interest
version: '1.0.0'
description: >
  Calculates compound interest based on principal, rate, and time period.
config:
  type: object
  properties:
    principal:
      type: string
      title: Principal Amount
      format: expression
      x-ui:
        widget: expression
        port: input_data
    annual_rate:
      type: number
      title: Annual Interest Rate
      default: 5.0
      minimum: 0
      maximum: 100
      x-ui:
        widget: number
    years:
      type: number
      title: Number of Years
      default: 10
      minimum: 1
      maximum: 50
      x-ui:
        widget: slider
        step: 1
    compound_frequency:
      type: string
      title: Compounding Frequency
      default: 'monthly'
      x-ui:
        widget: select
        optionsSource:
          type: static
          values: ['daily', 'monthly', 'quarterly', 'annually']
  required: [principal, annual_rate]
  additionalProperties: false
ports:
  input:
    - name: input_data
      title: Input Data
  output:
    - name: out
      title: Output
Python run-function operator
This example defines a python-run-function operator that segments customers using K-Means clustering.
schema: user-defined-operator-v0.1.0
type: python-run-function
name: Customer Segmentation
id: ml.customer_segmentation
version: '1.2.0'
description: >
  Segments customers into groups based on selected features
  using K-Means clustering. Returns customer IDs with their
  assigned segment numbers.
config:
  type: object
  properties:
    num_segments:
      type: integer
      title: Number of Segments
      description: How many customer segments to create
      default: 3
      minimum: 2
      maximum: 20
      x-ui:
        widget: number
    customer_id_column:
      type: string
      title: Customer ID Column
      description: Column containing customer identifiers
      x-ui:
        widget: select
        optionsSource:
          type: inputColumns
          port: customer_data
    feature_columns:
      type: array
      title: Feature Columns
      description: Columns to use for segmentation
      items:
        type: string
      x-ui:
        widget: multi-select
        optionsSource:
          type: inputColumns
          port: customer_data
    normalize_features:
      type: boolean
      title: Normalize Features
      description: Whether to normalize feature values before clustering
      default: true
      x-ui:
        widget: toggle
  required: [num_segments, customer_id_column, feature_columns]
  additionalProperties: false
ports:
  input:
    - name: customer_data
      title: Customer Data
      mime: application/vnd.databricks.dataframe
  output:
    - name: segmented_customers
      title: Segmented Customers
run_function:
  type: inline
  code: |
    def run(config, inputs, spark):
        from pyspark.ml.feature import VectorAssembler, StandardScaler
        from pyspark.ml.clustering import KMeans
        df = inputs["customer_data"]
        id_col = config["customer_id_column"]
        features = config["feature_columns"]
        k = config["num_segments"]
        normalize = config.get("normalize_features", True)
        assembler = VectorAssembler(inputCols=features, outputCol="features_vec")
        assembled = assembler.transform(df)
        if normalize:
            scaler = StandardScaler(inputCol="features_vec", outputCol="scaled_features")
            model = scaler.fit(assembled)
            assembled = model.transform(assembled)
            feature_col = "scaled_features"
        else:
            feature_col = "features_vec"
        kmeans = KMeans(k=k, featuresCol=feature_col, predictionCol="segment")
        result = kmeans.fit(assembled).transform(assembled)
        return {"segmented_customers": result.select(id_col, "segment")}
environment:
  environment_version: '4'
  dependencies:
    - 'scikit-learn>=1.3'
Quick reference
Required root properties
- schema: user-defined-operator-v0.1.0
- name: Display name
- id: Unique identifier
- description: What the operator does
- config: JSON Schema object
- type: uc-udf, uc-udtf, or python-run-function
- version: Author-defined version string
Optional root properties
- ports: Input and output port definitions
- run_function: Inline Python code (python-run-function only)
- environment: Python environment and dependencies (python-run-function only)
Config property data types
string | boolean | number | integer | array | object
UI widgets
input | textarea | checkbox | toggle | number | slider | select | multi-select | expression
Options sources
static (fixed values) | inputColumns (from input port)
Format values
expression | table_source | file_source | column_expressions | sort_expressions | aggregation_expressions | ai_function_expressions | is_preview | string[]