Creates a PyArrow-native user-defined table function (UDTF). This function provides a PyArrow-native interface for UDTFs, where the `eval` method receives PyArrow RecordBatches or Arrays and returns an iterator of PyArrow Tables or RecordBatches. This enables true vectorized computation without row-by-row processing overhead.
Syntax
```python
from pyspark.databricks.sql import functions as dbf

@dbf.arrow_udtf(returnType=<returnType>)
class MyUDTF:
    def eval(self, ...):
        ...
```
Parameters
| Parameter | Type | Description |
|---|---|---|
| `cls` | class, optional | The Python user-defined table function handler class. |
| `returnType` | `pyspark.sql.types.StructType` or `str`, optional | The return type of the user-defined table function. The value can be either a `StructType` object or a DDL-formatted struct type string. |
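For reference, the two accepted forms of `returnType` describe the same schema. A minimal sketch (variable names are illustrative) declaring a two-column schema both ways:

```python
from pyspark.sql.types import IntegerType, StructField, StructType

# DDL-formatted struct type string:
ddl_return_type = "x int, y int"

# Equivalent StructType object:
struct_return_type = StructType([
    StructField("x", IntegerType()),
    StructField("y", IntegerType()),
])
```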
Examples
UDTF with PyArrow RecordBatch input:
```python
import pyarrow as pa
from pyspark.databricks.sql.functions import arrow_udtf

@arrow_udtf(returnType="x int, y int")
class MyUDTF:
    def eval(self, batch: pa.RecordBatch):
        # Process the entire batch vectorized
        x_array = batch.column('x')
        y_array = batch.column('y')
        result_table = pa.table({
            'x': x_array,
            'y': y_array
        })
        yield result_table

df = spark.range(10).selectExpr("id as x", "id as y")
MyUDTF(df.asTable()).show()
```
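The example above passes the batch through unchanged. A more realistic `eval` applies vectorized kernels from `pyarrow.compute` to whole columns at once; the following sketch (the class name and the even-row filter are illustrative, not part of this API) keeps only rows whose `x` value is even:

```python
import pyarrow as pa
import pyarrow.compute as pc
from pyspark.databricks.sql.functions import arrow_udtf

@arrow_udtf(returnType="x int, y int")
class KeepEvenRows:
    def eval(self, batch: pa.RecordBatch):
        # Vectorized mask: x & 1 == 0 selects even values of x.
        mask = pc.equal(pc.bit_wise_and(batch.column('x'), 1), 0)
        # Filter the whole batch at once and yield the result as a Table.
        yield pa.Table.from_batches([batch]).filter(mask)
```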
UDTF with PyArrow Array inputs:
```python
from pyspark.sql.functions import lit

@arrow_udtf(returnType="x int, y int")
class MyUDTF2:
    def eval(self, x: pa.Array, y: pa.Array):
        # Process arrays vectorized
        result_table = pa.table({
            'x': x,
            'y': y
        })
        yield result_table

MyUDTF2(lit(1), lit(2)).show()
```
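Because `eval` receives each argument as a `pa.Array`, the columns can be fed directly into `pyarrow.compute` kernels. A minimal sketch (the class name and the addition are illustrative):

```python
import pyarrow as pa
import pyarrow.compute as pc
from pyspark.databricks.sql.functions import arrow_udtf
from pyspark.sql.functions import lit

@arrow_udtf(returnType="s int")
class AddArrays:
    def eval(self, x: pa.Array, y: pa.Array):
        # Element-wise, vectorized addition across the whole columns.
        yield pa.table({'s': pc.add(x, y)})

AddArrays(lit(1), lit(2)).show()
```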
Note

- The `eval` method must accept PyArrow RecordBatches or Arrays as input.
- The `eval` method must yield PyArrow Tables or RecordBatches as output.
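Both output forms can come from the same generator. The sketch below is illustrative and assumes the same two-column input as the first example; whether mixing output types within a single `eval` is supported is an assumption, not confirmed by this page:

```python
import pyarrow as pa
from pyspark.databricks.sql.functions import arrow_udtf

@arrow_udtf(returnType="x int, y int")
class SplitOutput:
    def eval(self, batch: pa.RecordBatch):
        # A single eval call may yield multiple chunks; each chunk
        # is assumed to be either a pa.RecordBatch or a pa.Table.
        half = batch.num_rows // 2
        yield batch.slice(0, half)                        # RecordBatch
        yield pa.Table.from_batches([batch.slice(half)])  # Table
```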