Databricks Cluster error when changing access from single user to shared

Gracia Espelt, Pol 20 Reputation points
2023-08-03T11:59:59.52+00:00

Dear Azure Forum Community,

I am currently using Databricks to execute an algorithm that involves transforming Spark DataFrames to Pandas DataFrames. To optimize this transformation, I have enabled Arrow with the following configuration:

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
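For context, the failing pattern is simply the following (a simplified sketch with made-up data, not my actual algorithm):

# Build a small example DataFrame and convert it to Pandas;
# Arrow accelerates this transfer when the flag above is enabled
sdf = spark.range(0, 1000).withColumnRenamed("id", "value")
pdf = sdf.toPandas()
print(type(pdf))  # <class 'pandas.core.frame.DataFrame'>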

Initially, the algorithm ran smoothly on a single-user access cluster. However, requirements have changed, and this process now needs to move to production. To achieve this, I cloned the cluster and changed its access mode from single user to shared. Note that the performance settings of the shared cluster are identical to those of the single-user one.

The issue arises when executing the same notebook on the shared access cluster: it fails during the .toPandas() operation with the error shown below, suggesting that something may have changed in the runtime.
[Screenshot of the error message]

I have provided screenshots of the cluster for comparison:

  • Single User Cluster: [screenshot]
  • Shared Access Cluster (clone of the above): [screenshot]

I kindly request the community's insights to help me understand the possible cause of this discrepancy. Any information or suggestions would be greatly appreciated.

Thank you in advance for your assistance.

Best regards,

Pol


1 answer

  1. Bhargava-MSFT 31,246 Reputation points Microsoft Employee
    2023-08-03T21:23:47.7466667+00:00

    Hello Gracia Espelt, Pol,

    It seems you are encountering an error when executing a notebook on a shared access cluster that was running smoothly on a single-user access cluster. The error occurs during the .toPandas() operation, suggesting that something may have changed in the runtime.

    I suspect the shared access cluster may have different configuration settings than the single-user access cluster, even if the performance settings are identical.

    Can you verify that the Arrow configuration is still enabled on the shared access cluster?

    # Verify that Arrow is still enabled on the shared cluster
    spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")

    # Enable Arrow-based columnar data transfers if it is not
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
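    Shared access mode enforces stricter runtime restrictions than single-user mode, so if the notebook-level spark.conf.set call is rejected on the shared cluster, you could also try setting the flag in the cluster's Spark config (cluster UI > Advanced options > Spark) so that it is applied at startup:

    spark.sql.execution.arrow.pyspark.enabled true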
    

    Also, please check for any differences in environment variables or dependencies between the two clusters. You can run the following on both clusters and compare the output.

    For background on the Spark-to-Pandas conversion and the Arrow settings, see: https://learn.microsoft.com/en-us/azure/databricks/pandas/pyspark-pandas-conversion

    # Compare environment variables between the two clusters
    import os
    print(os.environ)

    # List installed Python packages on each cluster for comparison
    !pip freeze
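    If you save the pip freeze output from each cluster to a file, a quick diff will surface any dependency drift. A minimal sketch, assuming you have written one file per cluster (the paths below are just examples):

    # Diff two saved `pip freeze` outputs (example paths, one file per cluster)
    import difflib

    with open("/dbfs/tmp/single_user_freeze.txt") as f:
        single = f.read().splitlines()
    with open("/dbfs/tmp/shared_freeze.txt") as f:
        shared = f.read().splitlines()

    # Print only the package lines that differ between the two clusters
    for line in difflib.unified_diff(single, shared, lineterm=""):
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
            print(line)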

    A similar error was discussed in this Stack Overflow thread:

    https://stackoverflow.com/questions/57662738/convert-spark-to-pandas-dataframe-has-exception-arrow-is-not-supported-when-usi

    I hope this helps. Please let me know if you have any further questions.

