Azure Synapse Python notebook cannot cache dataframes, error: Py4JJavaError: An error occurred while calling o4354.cache. Please ensure that the specified path is a directory which exists or can be created, and that files can be created in that directory.

Jeremiah Lee 0 Reputation points
2025-02-24T20:24:03.8+00:00

My organization has a Synapse workspace, configured storage accounts (StorageV2, minimum TLS 1.2), and a Spark pool with the settings below:
Autoscale: Enabled

Node size: Medium (8 vCores / 64 GB)

Session-level packages: Disabled

Intelligent cache size: 50%

Dynamically allocate executors: Enabled

Node size family: Memory Optimized

Number of nodes: 3 to 10 nodes

Automatic pausing: Enabled (30 minutes idle)

Apache Spark version: 3.4

Python version: 3.10

Scala version: 2.12.17

Java version: 11

.NET Core version: N/A

.NET for Apache Spark version: N/A

Delta Lake version: 2.4

In a Synapse PySpark notebook we access SQL datasets from the workspace to transform and extract data. When we use the .cache() operation on a dataframe, the notebook errors out with the following:

Py4JJavaError: An error occurred while calling o4354.cache.

: com.microsoft.spark.sqlanalytics.SQLAnalyticsConnectorException: com.microsoft.sqlserver.jdbc.SQLServerException: CREATE EXTERNAL TABLE AS SELECT statement failed as the path name 'abfss://[Container]@[storageResource].dfs.core.windows.net/synapse/workspaces/[SynapseWorkSpaceName]/sparkpools/[sparkPool]/sparkpoolinstances/[sparkPoolInstanceGuid]/livysessions/2025/02/24/12/tempdata/SQLAnalyticsConnectorStaging/application_[applicationNumber]_0007/[tempTableID].tbl' could not be used for export. Please ensure that the specified path is a directory which exists or can be created, and that files can be created in that directory.

--

In past iterations of the notebook, running dimComponent.cache() would automatically create the file structure listed in the error in the associated storage resource. Since upgrading from a Spark 3.2 pool to 3.4, the same command throws the error message above.
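
For context, the failing pattern is roughly the following sketch (table and variable names are placeholders standing in for the actual notebook code):

```python
# The SQL dataset is read through the Synapse dedicated SQL pool connector.
# Under the hood the connector exports the query result to the ADLS staging
# folder shown in the error (".../SQLAnalyticsConnectorStaging/...") before
# Spark reads it.
dimComponent = spark.read.synapsesql("<Database>.<Schema>.<DimComponentTable>")

# Caching the connector-backed dataframe, which now raises the
# SQLAnalyticsConnectorException instead of populating the staging path.
dimComponent.cache()
dimComponent.count()  # action that forces the export and materializes the cache
```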

We have tried adjusting the intelligent cache size (0%, 50%, 75%), creating a new pipeline with an activity to run the notebook in question, and updating the packages associated with the Spark pool.

We'd appreciate any advice on how to fix the issue and, better yet, understand why the .cache() method is no longer storing a temporary copy of the dataframe in the Azure storage container. It looks to me like a permissions issue; however, the Synapse workspace has the necessary roles assigned for read/write/execute on the storage account.
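
One quick way to confirm or rule out a permissions problem is to attempt a write under the same storage path directly from the notebook. This is a sketch only; the path is a placeholder assembled from the names in the error message:

```python
from notebookutils import mssparkutils

# Placeholder path: reuse the container/account/workspace segments from the error.
test_path = ("abfss://<Container>@<storageResource>.dfs.core.windows.net/"
             "synapse/workspaces/<SynapseWorkSpaceName>/permission_check/test.txt")

# If this fails with an authorization error, the identity the Spark session runs
# as cannot write to that location and the role/ACL assignments need another look.
mssparkutils.fs.put(test_path, "write test", True)  # create or overwrite the file
mssparkutils.fs.rm(test_path)                       # clean up the test file
```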

Thanks in advance

Azure Synapse Analytics

1 answer

  1. Vinodh247 30,916 Reputation points MVP
    2025-02-25T06:29:55.2233333+00:00

    Hi,

    Thanks for reaching out to Microsoft Q&A.

    As a first pass, I would verify the following; let me know if you have already checked all of these:

    1. Verify storage permissions: confirm the workspace managed identity (and the user running the notebook interactively) has Storage Blob Data Contributor on the storage account that holds the staging path from the error.
    2. Check the fs.azure.account.auth.type configuration for that storage account (see the first sketch after this list).
    3. Check Spark 3.4 behavior with cache(): the connector's read path stages data in ADLS before Spark can cache it, so a behavior change between the 3.2 and 3.4 runtimes can surface as this staging error.
    4. Ensure the staging path exists and is writable from the Spark pool.
    5. Check intelligent cache interactions.
    6. Try writing a temporary table or Parquet copy instead of relying on .cache() (see the second sketch after this list).
    7. Investigate the Spark driver and executor logs for the underlying CETAS failure.
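
    A couple of these checks can be run straight from the notebook. The snippets below are illustrative sketches only; the config keys, account names, and paths are placeholders to adapt to your environment, not values taken from your workspace.

    For item 2, inspect which ABFS authentication type the session resolves for the staging storage account (the key is normally scoped per storage account):

    ```python
    # Sketch for item 2: print the ABFS auth settings the Spark session is using.
    # Replace <storageResource> with the storage account name from the error message.
    for key in (
        "spark.hadoop.fs.azure.account.auth.type.<storageResource>.dfs.core.windows.net",
        "spark.hadoop.fs.azure.account.auth.type",
    ):
        print(key, "=", spark.conf.get(key, "<not set>"))
    ```

    For item 6, one workaround is to land an explicit Parquet copy in a storage path you control and continue from that copy, instead of relying on .cache() on the connector-backed dataframe (the path is a placeholder):

    ```python
    # Sketch for item 6: materialize the dataframe to a temporary Parquet location
    # and read it back for the downstream transformations.
    tmp_path = ("abfss://<Container>@<storageResource>.dfs.core.windows.net/"
                "tmp/dimComponent_snapshot")

    dimComponent.write.mode("overwrite").parquet(tmp_path)
    dimComponent = spark.read.parquet(tmp_path)
    ```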

    Please feel free to click the 'Upvote' (Thumbs-up) button and 'Accept as Answer'. This helps the community by allowing others with similar queries to easily find the solution.

