Databricks Cluster error when changing access from single user to shared

Gracia Espelt, Pol 20 Reputation points
2023-08-03T11:59:59.52+00:00

Dear Azure Forum Community,

I am currently using Databricks to execute an algorithm that involves transforming Spark DataFrames to Pandas DataFrames. To optimize this transformation, I have enabled Arrow with the following configuration:

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
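For context, the failing pattern is simply the following (a simplified sketch with made-up data, not my actual algorithm):

# Build a small example DataFrame and convert it to Pandas;
# Arrow accelerates this transfer when the flag above is enabled
sdf = spark.range(0, 1000).withColumnRenamed("id", "value")
pdf = sdf.toPandas()
print(type(pdf))  # <class 'pandas.core.frame.DataFrame'>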

Initially, the algorithm ran smoothly on a single-user access cluster. However, requirements have changed, and this process now needs to move to production. To achieve this, I cloned the cluster and changed its access mode from single user to shared. Note that the performance settings of the shared cluster are identical to those of the single-user one.

The issue arises when executing the same notebook on the shared access cluster: it fails during the .toPandas() operation with the error shown below, suggesting that something may have changed in the runtime.
[Screenshot of the error message]

I have provided screenshots of the cluster for comparison:

  • Single User Cluster: [screenshot]
  • Shared Access Cluster (clone of the above): [screenshot]

I kindly request the community's insights to help me understand the possible cause of this discrepancy. Any information or suggestions would be greatly appreciated.

Thank you in advance for your assistance.

Best regards,

Pol


1 answer

  1. Bhargava-MSFT 31,246 Reputation points Microsoft Employee
    2023-08-03T21:23:47.7466667+00:00

    Hello Gracia Espelt, Pol,

    It seems you are encountering an error when executing a notebook on a shared access cluster that was running smoothly on a single-user access cluster. The error occurs during the .toPandas() operation, suggesting that something may have changed in the runtime.

    I suspect the shared access cluster may have different configuration settings than the single-user access cluster, even if the performance settings are identical.

    Can you verify that the Arrow configuration is still enabled on the shared access cluster?

    # Verify that Arrow is still enabled on the shared cluster
    spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")

    # Enable Arrow-based columnar data transfers if it is not
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
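    Shared access mode enforces stricter runtime restrictions than single-user mode, so if the notebook-level spark.conf.set call is rejected on the shared cluster, you could also try setting the flag in the cluster's Spark config (cluster UI > Advanced options > Spark) so that it is applied at startup:

    spark.sql.execution.arrow.pyspark.enabled true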
    

    Also, please check for any differences in environment variables or dependencies between the two clusters. You can run the following on both clusters and compare the output.

    For background on the Spark-to-Pandas conversion and the Arrow settings, see: https://learn.microsoft.com/en-us/azure/databricks/pandas/pyspark-pandas-conversion

    # Compare environment variables between the two clusters
    import os
    print(os.environ)

    # List installed Python packages on each cluster for comparison
    !pip freeze
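    If you save the pip freeze output from each cluster to a file, a quick diff will surface any dependency drift. A minimal sketch, assuming you have written one file per cluster (the paths below are just examples):

    # Diff two saved `pip freeze` outputs (example paths, one file per cluster)
    import difflib

    with open("/dbfs/tmp/single_user_freeze.txt") as f:
        single = f.read().splitlines()
    with open("/dbfs/tmp/shared_freeze.txt") as f:
        shared = f.read().splitlines()

    # Print only the package lines that differ between the two clusters
    for line in difflib.unified_diff(single, shared, lineterm=""):
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
            print(line)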

    A similar error was discussed in this Stack Overflow thread:

    https://stackoverflow.com/questions/57662738/convert-spark-to-pandas-dataframe-has-exception-arrow-is-not-supported-when-usi

    I hope this helps. Please let me know if you have any further questions.

