Parallel Writes to ADLS Delta Storage Cause "The specified path already exists.", 409, PUT Error
I have a Spark application (Spark 3.1.2, Scala 2.12) that reads JSON records from a table. The records are distributed across the cluster's executors.
In the application, I use the foreach function to loop through the records in the table. Each record goes through a different set of transformations, and the resulting DataFrame is then written to ADLS storage.
During the foreach loop, it is possible for two executors to write data to the same location in ADLS storage at the same time (parallel PUT operations). The application works fine in DEV and PRE-PROD. However, once in a while I get the following error in the Log4j output. Sometimes the job succeeds but still logs the error; sometimes the job fails.
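To make the suspected collision concrete, here is a minimal sketch that does not use Spark, Delta, or ABFS at all. It assumes that a commit file such as `00000000000000007568.json` is created with "fail if it already exists" semantics, and it stands in for that with `Files.createFile` on a local temporary directory. The names `CommitRaceSketch`, `tryCommit`, and `race` are hypothetical and exist only for this illustration.

```scala
import java.nio.file.{FileAlreadyExistsException, Files, Path}
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for two writers racing to create the same commit file.
// This is NOT Delta or ABFS code; a local temp directory plays the role of _delta_log.
object CommitRaceSketch {

  /** Tries to create the commit file for `version`; returns true if this writer won. */
  def tryCommit(logDir: Path, version: Long): Boolean = {
    val target = logDir.resolve(f"$version%020d.json")
    try {
      Files.createFile(target) // atomic: fails if the file already exists
      true
    } catch {
      // Local analogue of the 409 PathAlreadyExists response in the log
      case _: FileAlreadyExistsException => false
    }
  }

  /** Starts `writers` threads that all target the same version; returns (wins, conflicts). */
  def race(writers: Int, version: Long): (Int, Int) = {
    val logDir = Files.createTempDirectory("_delta_log_sketch")
    val pool = Executors.newFixedThreadPool(writers)
    val start = new CountDownLatch(1)
    val wins = new AtomicInteger(0)
    val conflicts = new AtomicInteger(0)
    (1 to writers).foreach { _ =>
      pool.submit(new Runnable {
        def run(): Unit = {
          start.await() // release all writers at the same moment
          if (tryCommit(logDir, version)) wins.incrementAndGet()
          else conflicts.incrementAndGet()
        }
      })
    }
    start.countDown()
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
    (wins.get, conflicts.get)
  }

  def main(args: Array[String]): Unit = {
    val (wins, conflicts) = race(writers = 2, version = 7568L)
    println(s"wins=$wins conflicts=$conflicts") // prints "wins=1 conflicts=1"
  }
}
```

With two writers, exactly one creates the file and the other gets "already exists". If the real write path behaves similarly, that would fit the pattern I see: the error is logged whenever two writes collide on the same version, whether or not the job ultimately succeeds.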
Error
ERROR AbfsClient: HttpRequest: 409,err=PathAlreadyExists,appendpos=,cid=9f83f144-94b8-4108-8e41-4b753eab3575,rid=e46cbae2-101f-0014-6286-104eea000000,connMs=0,sendMs=0,recvMs=48,sent=0,recv=168,method=PUT,url=https://storageAccount.dfs.core.windows.net/container/parentFolder/childFolder/grandChildFolder/_delta_log/00000000000000007568.json?timeout=90
ERROR AzureBlobFileSystem:V3: FS_OP_RENAME SRC[abfss://container@storageAccount.dfs.core.windows.net/container/parentFolder/childFolder/grandChildFolder/_delta_log/__tmp_path_dir/.00000000000000007568.json.8b7fd480-6154-443d-8ae9-b1357cec4e7b.tmp] DST[abfss://container@storageAccount.dfs.core.windows.net/container/parentFolder/childFolder/grandChildFolder/_delta_log/00000000000000007568.json] Rename failed. AbfsRestOperationException: Operation failed: "The specified path already exists.", 409, PUT, https://container@storageAccount.dfs.core.windows.net/container/parentFolder/childFolder/grandChildFolder/_delta_log/00000000000000007568.json?timeout=90, PathAlreadyExists, "The specified path already exists. RequestId:e46cbae2-101f-0014-6286-104eea000000 Time:2023-11-06T07:57:24.7258168Z"
Operation failed: "The specified path already exists.", 409, PUT, https://container@storageAccount.dfs.core.windows.net/container/parentFolder/childFolder/grandChildFolder/_delta_log/00000000000000007568.json?timeout=90, PathAlreadyExists, "The specified path already exists. RequestId:e46cbae2-101f-0014-6286-104eea000000 Time:2023-11-06T07:57:24.7258168Z"
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:261)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsClient.renamePath(AbfsClient.java:355)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.rename(AzureBlobFileSystemStore.java:766)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.renameWithInstrumentation(AzureBlobFileSystem.java:381
A similar question was asked here, but no solution was provided: https://learn.microsoft.com/en-ie/answers/questions/185752/streaming-upserts-constantly-report-mysterious-log