Parallel Writing of Data to ADLS Delta Storage Causes "The specified path already exists.", 409, PUT Error

Orowole, Ayebogbon-XT 0 Reputation points
2023-11-06T19:54:40.9533333+00:00

I have a Spark application (Spark 3.1.2, Scala 2.12) that reads JSON records from a table. The records are distributed across the cluster's executors.

In the application, I use the foreach function to loop through the records in the table. Different transformations are applied to each record, and the resulting DataFrame is then written to ADLS storage.

During the foreach loop, two executors may write data to the same location in ADLS storage at the same time (parallel PUT operations). The application works fine in DEV and PRE-PROD. However, once in a while, I get the following error in the Log4j output. Sometimes the job succeeds but still writes the error to the Log4j output; sometimes the job fails.
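
A minimal sketch of the race described above, written in Python purely as a local simulation (it is not Spark or Delta code). It assumes each writer commits by creating the next numbered file under `_delta_log` only if that file does not exist yet, and that a writer who loses the race retries at the next version. The names `try_commit`, `commit_with_retry`, and `run_demo` are hypothetical, and real Delta writers also check for conflicting changes before retrying, which this sketch skips.

```python
import os
import tempfile
import threading


def try_commit(log_dir, version, payload):
    """Create the commit file for `version` only if it does not exist yet."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes creation fail if the file already exists; this stands
        # in for the storage layer answering 409 PathAlreadyExists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    return True


def commit_with_retry(log_dir, start_version, payload, max_attempts=10):
    """Retry at the next version whenever another writer got there first."""
    version = start_version
    for _ in range(max_attempts):
        if try_commit(log_dir, version, payload):
            return version
        # Lost the race: in this model, this is where a 409 would be logged.
        version += 1
    raise RuntimeError("gave up after repeated commit conflicts")


def run_demo(writers=2, start_version=7568):
    """Start several writers that all aim at the same starting version."""
    results = []
    lock = threading.Lock()
    with tempfile.TemporaryDirectory() as log_dir:

        def worker(name):
            committed = commit_with_retry(log_dir, start_version, name)
            with lock:
                results.append(committed)

        threads = [
            threading.Thread(target=worker, args=(f"writer-{i}",))
            for i in range(writers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return sorted(results)


if __name__ == "__main__":
    # Each writer ends up with its own version number.
    print(run_demo())
```

Under these assumptions every writer eventually commits, which would be consistent with the runs where the job succeeds even though a 409 appears in the log. It does not explain the runs where the job fails.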

Error

ERROR AbfsClient: HttpRequest: 409,err=PathAlreadyExists,appendpos=,cid=9f83f144-94b8-4108-8e41-4b753eab3575,rid=e46cbae2-101f-0014-6286-104eea000000,connMs=0,sendMs=0,recvMs=48,sent=0,recv=168,method=PUT,url=https://storageAccount.dfs.core.windows.net/container/parentFolder/childFolder/grandChildFolder/_delta_log/00000000000000007568.json?timeout=90

ERROR AzureBlobFileSystem:V3: FS_OP_RENAME SRC[abfss://container@storageAccount.dfs.core.windows.net/container/parentFolder/childFolder/grandChildFolder/_delta_log/__tmp_path_dir/.00000000000000007568.json.8b7fd480-6154-443d-8ae9-b1357cec4e7b.tmp] DST[abfss://container@storageAccount.dfs.core.windows.net/container/parentFolder/childFolder/grandChildFolder/_delta_log/00000000000000007568.json] Rename failed. AbfsRestOperationException: Operation failed: "The specified path already exists.", 409, PUT, https://container@storageAccount.dfs.core.windows.net/container/parentFolder/childFolder/grandChildFolder/_delta_log/00000000000000007568.json?timeout=90, PathAlreadyExists, "The specified path already exists. RequestId:e46cbae2-101f-0014-6286-104eea000000 Time:2023-11-06T07:57:24.7258168Z"
Operation failed: "The specified path already exists.", 409, PUT, https://container@storageAccount.dfs.core.windows.net/container/parentFolder/childFolder/grandChildFolder/_delta_log/00000000000000007568.json?timeout=90, PathAlreadyExists, "The specified path already exists. RequestId:e46cbae2-101f-0014-6286-104eea000000 Time:2023-11-06T07:57:24.7258168Z"
	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:261)
	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsClient.renamePath(AbfsClient.java:355)
	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.rename(AzureBlobFileSystemStore.java:766)
	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.renameWithInstrumentation(AzureBlobFileSystem.java:381
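
To make the AbfsClient log line above easier to read, the sketch below splits it into fields and pulls the Delta commit version out of the URL. The field layout (a status code followed by comma-separated `key=value` pairs) is inferred from that single line and is an assumption, not a documented format. `parse_abfs_log` is a hypothetical helper, and it would break on values that themselves contain commas.

```python
import re

# The AbfsClient line from the log output above, split only for readability.
SAMPLE = (
    "ERROR AbfsClient: HttpRequest: 409,err=PathAlreadyExists,appendpos=,"
    "cid=9f83f144-94b8-4108-8e41-4b753eab3575,"
    "rid=e46cbae2-101f-0014-6286-104eea000000,"
    "connMs=0,sendMs=0,recvMs=48,sent=0,recv=168,method=PUT,"
    "url=https://storageAccount.dfs.core.windows.net/container/parentFolder/"
    "childFolder/grandChildFolder/_delta_log/00000000000000007568.json?timeout=90"
)


def parse_abfs_log(line):
    """Split an AbfsClient HttpRequest log line into a dict of fields."""
    _, marker, rest = line.partition("HttpRequest: ")
    if not marker:
        raise ValueError("not an AbfsClient HttpRequest line")
    # Assumed layout: "<status>,key=value,key=value,..."
    status, *pairs = rest.strip().split(",")
    fields = {"status": int(status)}
    for pair in pairs:
        # Split on the first "=" only, so "url=...?timeout=90" stays intact.
        key, _, value = pair.partition("=")
        fields[key] = value
    # Delta commit files are named after the zero-padded table version.
    match = re.search(r"_delta_log/(\d+)\.json", fields.get("url", ""))
    fields["delta_version"] = int(match.group(1)) if match else None
    return fields


if __name__ == "__main__":
    parsed = parse_abfs_log(SAMPLE)
    print(parsed["status"], parsed["err"], parsed["method"], parsed["delta_version"])
```

For the line above, the failing request is a PUT against the commit file for table version 7568, the same file named as the rename destination in the FS_OP_RENAME entry.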

A similar question was asked here, but no solution was provided: https://learn.microsoft.com/en-ie/answers/questions/185752/streaming-upserts-constantly-report-mysterious-log

Azure Storage Accounts
Globally unique resources that provide access to data management services and serve as the parent namespace for the services.
Azure Blob Storage
An Azure service that stores unstructured data in the cloud as blobs.
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.