Our scheduled pipeline to run a notebook has sporadically failed with the following error. It sometimes runs fine. When manually executing the notebook, the error does not arise.
The python code which triggers the error comes after querying our sql pool with the following code:
data = (
spark.read
.option(Constants.DATABASE, "sqlpool")
.option(Constants.QUERY, f"select date, app, sum(spend) as spend from x.x where date = '{date}' group by date, app")
.synapsesql()
# .cache()
.toPandas()
)
Initially the error arose at the cache() method, then we disabled that and it arose at the toPandas() method. Then we tried keeping the spark frame and the error arose when calling data.select('app').distinct().collect().
The error:
Caused by: com.microsoft.spark.sqlanalytics.SQLAnalyticsConnectorException: com.microsoft.sqlserver.jdbc.SQLServerException: CREATE EXTERNAL TABLE AS SELECT statement failed as the path name 'abfss://xx@xxx.dfs.core.windows.net/synapse/workspaces/synapse-xx/sparkpools/SmallSparkPool/sparkpoolinstances/x/livysessions/2024/01/28/x/tempdata/SQLAnalyticsConnectorStaging/application_xxx_0019/xxx.tbl' could not be used for export. Please ensure that the specified path is a directory which exists or can be created, and that files can be created in that directory.
I then added the toPandas() method again and this option
.option(Constants.TEMP_FOLDER, 'abfss://xxx@xxx.dfs.core.windows.net/tmp')
and the next day we got this error:
Caused by: com.microsoft.spark.sqlanalytics.SQLAnalyticsConnectorException: com.microsoft.sqlserver.jdbc.SQLServerException: External file access failed due to internal error: 'Error occurred while accessing HDFS: Java exception raised on call to HdfsBridge_IsDirExist. Java exception message:
HdfsBridge::isDirExist - Unexpected error encountered checking whether directory exists or not: HdfsBridgeAbfsRestOperationException: Operation failed: "Server failed to authenticate the request. Please refer to the information in the www-authenticate header.", 401, HEAD, https://xxxx.dfs.core.windows.net/x/tmp/SQLAnalyticsConnectorStaging/application_1706775268740_0020/amSEpVUvjQ0c05af6a25a55465aac7e1a9b12ce2db9.tbl?upn=false&action=getStatus&timeout=90'
The synapse workspace has Storage Blob Data Owner and Contributor as well as approved private endpoints. As I said, the pipeline sometimes runs fine. It is running on a scheduled trigger in the mornings, there are other pipelines running at the same time so perhaps there is some interference?