My organization has a Synapse workspace, configured storage accounts (StorageV2, minimum TLS 1.2), and a Spark pool with the following settings:
Autoscale: Enabled
Node size: Medium (8 vCores / 64 GB)
Session-level packages: Disabled
Intelligent cache size: 50%
Dynamically allocate executors: Enabled
Node size family: Memory Optimized
Number of nodes: 3 to 10 nodes
Automatic pausing: Enabled (30 minutes idle)
Apache Spark version: 3.4
Python version: 3.10
Scala version: 2.12.17
Java version: 11
.NET Core version: N/A
.NET for Apache Spark version: N/A
Delta Lake version: 2.4
In a Synapse PySpark notebook we access SQL datasets from the workspace to transform and extract data. When we call the .cache() operation on a DataFrame, the notebook fails with the following error:
Py4JJavaError: An error occurred while calling o4354.cache.
: com.microsoft.spark.sqlanalytics.SQLAnalyticsConnectorException: com.microsoft.sqlserver.jdbc.SQLServerException: CREATE EXTERNAL TABLE AS SELECT statement failed as the path name 'abfss://[Container]@[storageResource].dfs.core.windows.net/synapse/workspaces/[SynapseWorkSpaceName]/sparkpools/[sparkPool]/sparkpoolinstances/[sparkPoolInstanceGuid]/livysessions/2025/02/24/12/tempdata/SQLAnalyticsConnectorStaging/application_[applicationNumber]_0007/[tempTableID].tbl' could not be used for export. Please ensure that the specified path is a directory which exists or can be created, and that files can be created in that directory.
--
In past iterations of the notebook, running dimComponent.cache() would automatically create the file structure listed in the error on the associated storage resource. Since upgrading from a Spark 3.2 pool to a 3.4 pool, the same call throws the error above.
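For context, the notebook follows roughly this pattern (a minimal sketch with placeholder object names; we read the dataset through the dedicated SQL pool connector, which is where the SQLAnalyticsConnectorException in the stack trace originates):

# `spark` is the session pre-provisioned by the Synapse notebook runtime;
# synapsesql() is only available on Synapse Spark pools where the
# dedicated SQL pool connector is loaded.
# "SQLPool.dbo.DimComponent" is a placeholder, not our real object name.
dimComponent = spark.read.synapsesql("SQLPool.dbo.DimComponent")

# ...transformations elided...

# This is the call that now fails on the Spark 3.4 pool with the
# CREATE EXTERNAL TABLE AS SELECT staging-path error shown above.
dimComponent.cache()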
We have tried adjusting the intelligent cache size (0%, 50%, 75%), creating a new pipeline with an activity to run the notebook in question, and updating the packages associated with the Spark pool.
We'd appreciate any advice on how to fix this issue and, better yet, an explanation of why the .cache() method is no longer storing a temporary copy of the DataFrame in the Azure storage container. It looks like a permissions issue to me; however, the Synapse workspace has the necessary roles assigned for read, write, and execute on the storage account.
Thanks in advance