Lưu ý
Cần có ủy quyền mới truy nhập được vào trang này. Bạn có thể thử đăng nhập hoặc thay đổi thư mục.
Cần có ủy quyền mới truy nhập được vào trang này. Bạn có thể thử thay đổi thư mục.
Creates a PyArrow-native user defined table function (UDTF). This function provides a PyArrow-native interface for UDTFs, where the eval method receives PyArrow RecordBatches or Arrays and returns an Iterator of PyArrow Tables or RecordBatches. This enables true vectorized computation without row-by-row processing overhead.
Syntax
from pyspark.sql import functions as dbf
@dbf.arrow_udtf(returnType=<returnType>)
class MyUDTF:
def eval(self, ...):
...
Parameters
| Parameter | Type | Description |
|---|---|---|
cls |
class, optional |
The Python user-defined table function handler class. |
returnType |
pyspark.sql.types.StructType or str, optional |
The return type of the user-defined table function. The value can be either a StructType object or a DDL-formatted struct type string. |
Examples
UDTF with PyArrow RecordBatch input:
import pyarrow as pa
from pyspark.sql.functions import arrow_udtf
@arrow_udtf(returnType="x int, y int")
class MyUDTF:
def eval(self, batch: pa.RecordBatch):
# Process the entire batch vectorized
x_array = batch.column('x')
y_array = batch.column('y')
result_table = pa.table({
'x': x_array,
'y': y_array
})
yield result_table
df = spark.range(10).selectExpr("id as x", "id as y")
MyUDTF(df.asTable()).show()
UDTF with PyArrow Array inputs:
@arrow_udtf(returnType="x int, y int")
class MyUDTF2:
def eval(self, x: pa.Array, y: pa.Array):
# Process arrays vectorized
result_table = pa.table({
'x': x,
'y': y
})
yield result_table
MyUDTF2(lit(1), lit(2)).show()
Note
- The eval method must accept PyArrow RecordBatches or Arrays as input
- The eval method must yield PyArrow Tables or RecordBatches as output