arrow_udtf

Creates a PyArrow-native user-defined table function (UDTF). The eval method of the handler class receives PyArrow RecordBatches or Arrays and returns an iterator of PyArrow Tables or RecordBatches, so computation is vectorized over whole batches instead of being performed row by row.

Syntax

from pyspark.sql import functions as dbf

@dbf.arrow_udtf(returnType=<returnType>)
class MyUDTF:
    def eval(self, ...):
        ...

Parameters

Parameter	Type	Description
`cls`	`class`, optional	The Python user-defined table function handler class.
`returnType`	`pyspark.sql.types.StructType` or `str`, optional	The return type of the user-defined table function. The value can be either a StructType object or a DDL-formatted struct type string.

Examples

UDTF with PyArrow RecordBatch input:

import pyarrow as pa
from pyspark.sql.functions import arrow_udtf

@arrow_udtf(returnType="x int, y int")
class MyUDTF:
    def eval(self, batch: pa.RecordBatch):
        # Process the entire batch vectorized
        x_array = batch.column('x')
        y_array = batch.column('y')
        result_table = pa.table({
            'x': x_array,
            'y': y_array
        })
        yield result_table

df = spark.range(10).selectExpr("id as x", "id as y")
MyUDTF(df.asTable()).show()

UDTF with PyArrow Array inputs:

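# Reuses pa and arrow_udtf from the previous example; lit is needed for the call below.
from pyspark.sql.functions import lit

@arrow_udtf(returnType="x int, y int")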
class MyUDTF2:
    def eval(self, x: pa.Array, y: pa.Array):
        # Process arrays vectorized
        result_table = pa.table({
            'x': x,
            'y': y
        })
        yield result_table

MyUDTF2(lit(1), lit(2)).show()

Note

The eval method must accept PyArrow RecordBatches or Arrays as input.
The eval method must yield PyArrow Tables or RecordBatches as output.

Last updated on 2026-04-27