Scheduled Synapse pipeline fails: PySpark cannot store to ADLS Gen2 with toPandas()

AdamMontgomery-2136 5 Reputation points
2024-02-01T13:39:24.3966667+00:00

Our scheduled pipeline that runs a notebook sporadically fails with the error below; other runs complete fine. When we execute the notebook manually, the error does not arise.

The error is triggered by the following Python code, which queries our SQL pool:

# Constants comes from the Synapse Dedicated SQL Pool connector
from com.microsoft.spark.sqlanalytics.Constants import Constants

data = (
    spark.read
    .option(Constants.DATABASE, "sqlpool")
    .option(Constants.QUERY, f"select date, app, sum(spend) as spend from x.x where date = '{date}' group by date, app")
    .synapsesql()
    # .cache()
    .toPandas()
)

Initially the error arose at the cache() call; after we removed that, it arose at toPandas(). We then tried keeping the Spark DataFrame instead, and the error arose when calling data.select('app').distinct().collect().

The error:

Caused by: com.microsoft.spark.sqlanalytics.SQLAnalyticsConnectorException: com.microsoft.sqlserver.jdbc.SQLServerException: CREATE EXTERNAL TABLE AS SELECT statement failed as the path name 'abfss://xx@xxx.dfs.core.windows.net/synapse/workspaces/synapse-xx/sparkpools/SmallSparkPool/sparkpoolinstances/x/livysessions/2024/01/28/x/tempdata/SQLAnalyticsConnectorStaging/application_xxx_0019/xxx.tbl' could not be used for export. Please ensure that the specified path is a directory which exists or can be created, and that files can be created in that directory.

I then re-added the toPandas() call together with this option:

        .option(Constants.TEMP_FOLDER, 'abfss://xxx@xxx.dfs.core.windows.net/tmp')
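
With that option in place, the read looked like this (same query and redacted names as above):

    data = (
        spark.read
        .option(Constants.DATABASE, "sqlpool")
        .option(Constants.QUERY, f"select date, app, sum(spend) as spend from x.x where date = '{date}' group by date, app")
        .option(Constants.TEMP_FOLDER, 'abfss://xxx@xxx.dfs.core.windows.net/tmp')
        .synapsesql()
        .toPandas()
    )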

The next day we got this error:

Caused by: com.microsoft.spark.sqlanalytics.SQLAnalyticsConnectorException: com.microsoft.sqlserver.jdbc.SQLServerException: External file access failed due to internal error: 'Error occurred while accessing HDFS: Java exception raised on call to HdfsBridge_IsDirExist. Java exception message:
HdfsBridge::isDirExist - Unexpected error encountered checking whether directory exists or not: HdfsBridgeAbfsRestOperationException: Operation failed: "Server failed to authenticate the request. Please refer to the information in the www-authenticate header.", 401, HEAD, https://xxxx.dfs.core.windows.net/x/tmp/SQLAnalyticsConnectorStaging/application_1706775268740_0020/amSEpVUvjQ0c05af6a25a55465aac7e1a9b12ce2db9.tbl?upn=false&action=getStatus&timeout=90'

The Synapse workspace has the Storage Blob Data Owner and Contributor roles, as well as approved private endpoints. As I said, the pipeline sometimes runs fine. It runs on a scheduled trigger in the mornings, and other pipelines run at the same time, so perhaps there is some interference?


1 answer

  1. Joao Silveira 5 Reputation points Microsoft Employee
    2024-03-28T14:54:30.1166667+00:00

    I am sharing a brief explanation of the cause of this issue and how you can address it:

    Reason for the Error:

    The error occurs when the token used to establish the JDBC connection and execute queries expires while the request is still being processed. This typically happens when the request is part of a pipeline submission and the identity used is a Managed System Identity (MSI) associated with the workspace. The Data Warehouse (DW) engine can refresh tokens for regular user identities, but not for non-user identities, due to limitations in the support provided by the Microsoft Identity platform. Further information can be found in the Microsoft Identity platform documentation.

    Suggested Mitigation Strategies:

    To resolve this issue, customers can consider two potential solutions:

    1. Implement a retry mechanism in the pipeline configuration that includes a delay between attempts. The hope is that a subsequent attempt will use an access token with a longer Time To Live (TTL), allowing the data-staging CETAS query to complete successfully (see the first sketch after this list).
    2. Use the Constants.DATA_SOURCE option with the read request. This removes the need to manage the storage account and credentials for staging data, as staging is handled outside of the connector's process (see the second sketch after this list).
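
    For option 1, the retry can be configured directly on the pipeline's notebook activity (the Retry and Retry interval settings). If you prefer to handle it inside the notebook instead, here is a minimal sketch, assuming spark, Constants, and the query string from the question are already in scope; max_attempts and delay_seconds are illustrative values:

        import time

        max_attempts = 3
        delay_seconds = 300  # wait between attempts so a later try gets a fresher token

        for attempt in range(1, max_attempts + 1):
            try:
                data = (
                    spark.read
                    .option(Constants.DATABASE, "sqlpool")
                    .option(Constants.QUERY, query)  # query string as in the question
                    .synapsesql()
                    .toPandas()
                )
                break  # success, stop retrying
            except Exception:  # ideally narrow this to the connector's exception
                if attempt == max_attempts:
                    raise
                time.sleep(delay_seconds)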
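
    For option 2, here is a minimal sketch, assuming an external data source has already been created in the dedicated SQL pool; the name my_staging_source is hypothetical:

        from com.microsoft.spark.sqlanalytics.Constants import Constants

        data = (
            spark.read
            .option(Constants.DATABASE, "sqlpool")
            # Hypothetical external data source created beforehand in the
            # dedicated SQL pool; the connector stages data at its location,
            # so no TEMP_FOLDER or storage credentials are managed here.
            .option(Constants.DATA_SOURCE, "my_staging_source")
            .option(Constants.QUERY, "select date, app, sum(spend) as spend from x.x group by date, app")
            .synapsesql()
            .toPandas()
        )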