Scheduled Synapse pipeline fails: PySpark cannot store to ADLS Gen2 with toPandas()

AdamMontgomery-2136 5 Reputation points
2024-02-01T13:39:24.3966667+00:00

Our scheduled pipeline that runs a notebook sporadically fails with the error below; other runs complete fine. When we execute the notebook manually, the error does not arise.

The error is triggered by the following Python code, which queries our SQL pool:

# Constants comes from the Synapse Dedicated SQL Pool connector
from com.microsoft.spark.sqlanalytics.Constants import Constants

data = (
    spark.read
    .option(Constants.DATABASE, "sqlpool")
    .option(Constants.QUERY, f"select date, app, sum(spend) as spend from x.x where date = '{date}' group by date, app")
    .synapsesql()
    # .cache()
    .toPandas()
)

Initially the error arose at the cache() call; after we removed that, it arose at toPandas(). We then tried keeping the Spark DataFrame instead, and the error arose when calling data.select('app').distinct().collect().

The error:

Caused by: com.microsoft.spark.sqlanalytics.SQLAnalyticsConnectorException: com.microsoft.sqlserver.jdbc.SQLServerException: CREATE EXTERNAL TABLE AS SELECT statement failed as the path name 'abfss://xx@xxx.dfs.core.windows.net/synapse/workspaces/synapse-xx/sparkpools/SmallSparkPool/sparkpoolinstances/x/livysessions/2024/01/28/x/tempdata/SQLAnalyticsConnectorStaging/application_xxx_0019/xxx.tbl' could not be used for export. Please ensure that the specified path is a directory which exists or can be created, and that files can be created in that directory.

I then re-added the toPandas() call together with this option:

        .option(Constants.TEMP_FOLDER, 'abfss://xxx@xxx.dfs.core.windows.net/tmp')
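
With that option in place, the read looked like this (same query and redacted names as above):

    data = (
        spark.read
        .option(Constants.DATABASE, "sqlpool")
        .option(Constants.QUERY, f"select date, app, sum(spend) as spend from x.x where date = '{date}' group by date, app")
        .option(Constants.TEMP_FOLDER, 'abfss://xxx@xxx.dfs.core.windows.net/tmp')
        .synapsesql()
        .toPandas()
    )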

The next day we got this error:

Caused by: com.microsoft.spark.sqlanalytics.SQLAnalyticsConnectorException: com.microsoft.sqlserver.jdbc.SQLServerException: External file access failed due to internal error: 'Error occurred while accessing HDFS: Java exception raised on call to HdfsBridge_IsDirExist. Java exception message:
HdfsBridge::isDirExist - Unexpected error encountered checking whether directory exists or not: HdfsBridgeAbfsRestOperationException: Operation failed: "Server failed to authenticate the request. Please refer to the information in the www-authenticate header.", 401, HEAD, https://xxxx.dfs.core.windows.net/x/tmp/SQLAnalyticsConnectorStaging/application_1706775268740_0020/amSEpVUvjQ0c05af6a25a55465aac7e1a9b12ce2db9.tbl?upn=false&action=getStatus&timeout=90'

The Synapse workspace has the Storage Blob Data Owner and Contributor roles, as well as approved private endpoints. As I said, the pipeline sometimes runs fine. It runs on a scheduled trigger in the mornings, and other pipelines run at the same time, so perhaps there is some interference?


1 answer

  1. Joao Silveira 5 Reputation points Microsoft Employee
    2024-03-28T14:54:30.1166667+00:00

    I am sharing a brief explanation of the cause of this issue and how you can address it:

    Reason for the Error:

    The error occurs when the token used to establish the JDBC connection and execute queries expires while the request is still being processed. This typically happens when the request is part of a pipeline submission and the identity used is a Managed System Identity (MSI) associated with the workspace. The Data Warehouse (DW) engine can refresh tokens for regular user identities, but not for non-user identities, due to limitations in the support provided by the Microsoft Identity platform. Further information can be found in the Microsoft Identity platform documentation.

    Suggested Mitigation Strategies:

    To resolve this issue, customers can consider two potential solutions:

    1. Implement a retry mechanism in the pipeline configuration that includes a delay between attempts. The hope is that a subsequent attempt will use an access token with a longer Time To Live (TTL), allowing the data-staging CETAS query to complete successfully (see the first sketch after this list).
    2. Use the Constants.DATA_SOURCE option with the read request. This removes the need to manage the storage account and credentials for staging data, as staging is handled outside of the connector's process (see the second sketch after this list).
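
    For option 1, the retry can be configured directly on the pipeline's notebook activity (the Retry and Retry interval settings). If you prefer to handle it inside the notebook instead, here is a minimal sketch, assuming spark, Constants, and the query string from the question are already in scope; max_attempts and delay_seconds are illustrative values:

        import time

        max_attempts = 3
        delay_seconds = 300  # wait between attempts so a later try gets a fresher token

        for attempt in range(1, max_attempts + 1):
            try:
                data = (
                    spark.read
                    .option(Constants.DATABASE, "sqlpool")
                    .option(Constants.QUERY, query)  # query string as in the question
                    .synapsesql()
                    .toPandas()
                )
                break  # success, stop retrying
            except Exception:  # ideally narrow this to the connector's exception
                if attempt == max_attempts:
                    raise
                time.sleep(delay_seconds)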
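
    For option 2, here is a minimal sketch, assuming an external data source has already been created in the dedicated SQL pool; the name my_staging_source is hypothetical:

        from com.microsoft.spark.sqlanalytics.Constants import Constants

        data = (
            spark.read
            .option(Constants.DATABASE, "sqlpool")
            # Hypothetical external data source created beforehand in the
            # dedicated SQL pool; the connector stages data at its location,
            # so no TEMP_FOLDER or storage credentials are managed here.
            .option(Constants.DATA_SOURCE, "my_staging_source")
            .option(Constants.QUERY, "select date, app, sum(spend) as spend from x.x group by date, app")
            .synapsesql()
            .toPandas()
        )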