Pyspark UDF fails while trying to predict outcome

ash 86 Reputation points
2024-03-26T16:27:32.35+00:00

I received the following error while trying to predict an outcome using an XGBoost model.

[Screenshot of the error message attached]

I am migrating Python code to PySpark. The ML model was trained in Python, so I am using a UDF to load the pickle file and make predictions for the entries across all partitions.
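A simplified sketch of that pattern is below; it is not my actual code, and the file path, table name, and feature columns are placeholders for illustration only.

```python
# Simplified sketch of the UDF approach; path, table, and feature column
# names are placeholders, not the real ones.
import pickle

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Load the pickled XGBoost model on the driver and broadcast it so each
# executor deserializes it once rather than once per row.
with open("/dbfs/models/xgb_model.pkl", "rb") as f:  # placeholder path
    model = pickle.load(f)
bc_model = spark.sparkContext.broadcast(model)

@pandas_udf(DoubleType())
def predict_udf(f1: pd.Series, f2: pd.Series, f3: pd.Series) -> pd.Series:
    # Reassemble the feature columns into a batch and score it with the model.
    features = pd.concat([f1, f2, f3], axis=1).to_numpy()
    return pd.Series(bc_model.value.predict(features).astype("float64"))

df = spark.table("feature_table")  # placeholder table
scored = df.withColumn("prediction", predict_udf("f1", "f2", "f3"))
```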

While investigating the root cause I came across this blog. Is this error related to the reason given in that blog?
Does anyone know how to overcome this problem?

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

Answer accepted by question author
  1. PRADEEPCHEEKATLA 91,581 Reputation points Moderator
    2024-03-28T05:13:56.29+00:00

    @ash - I'm glad that you were able to resolve your issue, and thank you for posting your solution so that others experiencing the same thing can easily reference it! Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others", I'll repost your solution in case you'd like to accept the answer.

    Ask: Pyspark UDF fails while trying to predict outcome

    Solution: It looks like the problem is with Databricks Runtime 15.0.

    When I switch to Databricks Runtime 14.1, the script runs successfully and gives me the required outcome.
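    As a side note for anyone hitting the same issue, you can confirm which runtime a notebook is actually running on with something like the sketch below, assuming the DATABRICKS_RUNTIME_VERSION environment variable that Databricks normally sets is available:

    ```python
    # Minimal check of the Databricks Runtime version from a notebook.
    # DATABRICKS_RUNTIME_VERSION is set by the Databricks runtime itself.
    import os

    print(os.environ.get("DATABRICKS_RUNTIME_VERSION", "not running on Databricks"))
    ```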

    If I missed anything please let me know and I'd be happy to add it to my answer, or feel free to comment below with any additional information.

    If you have any other questions, please let me know. Thank you again for your time and patience throughout this issue.


    Please don’t forget to Accept Answer and Yes for "was this answer helpful" wherever the information provided helps you, this can be beneficial to other community members.

    1 person found this answer helpful.

1 additional answer

  1. PRADEEPCHEEKATLA 91,581 Reputation points Moderator
    2024-03-27T07:53:44.6366667+00:00

    @ash - Thanks for the question and for using the MS Q&A platform.

    The error message you received indicates that the Spark environment directory is not found. This can happen if the cluster libraries are not properly configured or if there is an issue with the file system.

    Regarding the blog you mentioned, it is not directly related to your issue. The blog discusses a different error message, one about security rules preventing workers from resolving the Python executable path, so it may not be applicable to your case since you are using an XGBoost model.

    To resolve your issue, you can try the following steps:

    • Check if the cluster libraries are properly configured and if the required packages are installed (see the diagnostic sketch after this list).
    • Check if the file system is properly configured and if the Spark environment directory is accessible.
    • Try running the code on a different cluster to see if the issue persists.
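
    For the first check, a small diagnostic sketch along these lines can confirm that the library is visible on both the driver and the workers with a consistent version; it assumes only that xgboost should be importable everywhere and is not specific to your job:

    ```python
    # Diagnostic sketch: confirm xgboost is importable on the driver and on the
    # executors, and that the versions match. This is only a library-visibility
    # check, not part of the failing job.
    import xgboost
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    print("driver xgboost:", xgboost.__version__)

    def worker_xgb_version(_rows):
        import xgboost as xgb  # imported on the executor
        yield xgb.__version__

    worker_versions = set(
        spark.sparkContext.parallelize(range(4), 4)
        .mapPartitions(worker_xgb_version)
        .collect()
    )
    print("worker xgboost:", worker_versions)
    ```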

    If the issue still persists, you can provide more details about your code and the cluster configuration to help us better understand the issue and provide a more specific solution.

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for "was this answer helpful". And, if you have any further query, do let us know.

