Cannot use pydantic objects in a UDF

Freia Vercruysse 31 Reputation points
2023-02-08T08:59:45.3666667+00:00

When trying to use a pydantic object inside a UDF, I get the following error message:

PicklingError: Can't pickle <cyfunction int_validator at 0x7f1aa626dc70>: it's not the same object as pydantic.validators.int_validator

I use the following code (PySpark):

from pydantic import BaseModel
from pyspark.sql import functions as F
from pyspark.sql import Row
from pyspark.sql.types import StringType


data = [
    Row(zip_code='58542', dma='MIN'),
    Row(zip_code='58701', dma='MIN'),
    Row(zip_code='57632', dma='MIN')
]
df = spark.createDataFrame(data)

class TestClass(BaseModel):
    name: int = 0


@F.udf(StringType())
def udf_test(dossier):
    # Referencing TestClass here pulls the pydantic class into the UDF's
    # pickled closure, which is what triggers the error on .show().
    test = TestClass()
    return "test"

df.withColumn("test", udf_test(df['zip_code'])).show()

I get the same error message when I try to pickle a pydantic object with cloudpickle, but not with the standard pickle module:

from pyspark import cloudpickle
import pydantic
import pickle

class Bar(pydantic.BaseModel):
    a: int

p1 = pickle.loads(pickle.dumps(Bar(a=1))) # This works well
print(f"p1: {p1}")

p2 = cloudpickle.loads(cloudpickle.dumps(Bar(a=1))) # This fails with the same PicklingError as above
print(f"p2: {p2}")

Is there any way to change the serializer that's used for a UDF?


1 answer

Freia Vercruysse 31 Reputation points
2023-02-24T09:54:45.48+00:00

    I received the solution from Microsoft Support:

    1. Create a Repo.
    2. Add the Bar class in a separate Python file, test.py:

       from pydantic import BaseModel


       class Bar(BaseModel):
           a: int = 0

    3. Create a new notebook in the Repo and import the class from the Python file defined in step 2:

       from pydantic import BaseModel
       from pyspark.sql import functions as F
       from pyspark.sql import Row
       from pyspark.sql.types import StringType
       from test import Bar
       
       
       data = [
           Row(zip_code='58542', dma='MIN'),
           Row(zip_code='58701', dma='MIN'),
           Row(zip_code='57632', dma='MIN')
       ]
       df = spark.createDataFrame(data)
       
       
       @F.udf(StringType())
       def udf_test(dossier):
           # Bar now lives in test.py, so cloudpickle serializes it by
           # reference instead of by value, and the error goes away.
           test = Bar()
           return "test"
       
       df.withColumn("test", udf_test(df['zip_code'])).show()
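
    Why this works (my understanding): cloudpickle serializes classes defined in a notebook's __main__ by value, which drags in pydantic's compiled cython validators and triggers the PicklingError, while classes imported from a module are serialized by reference, so the executors simply re-import test.py. A quick round-trip check (a minimal sketch, assuming test.py is importable on both the driver and the workers):

       from pyspark import cloudpickle
       from test import Bar


       # Bar is importable, so cloudpickle stores a reference to test.Bar
       # instead of serializing the class and its validators by value.
       b = cloudpickle.loads(cloudpickle.dumps(Bar(a=1)))
       print(f"b: {b}")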