Spark unable to write files to Blob storage

Sachin Shah 101 Reputation points
2020-10-02T16:25:54.1+00:00

We use HDInsight 3.6 with Spark. Until now our code has worked as expected, but as of last night our job started failing with an "output directory already exists" error. Looking at the blob storage, the output directories appear to have been created as block blobs rather than as directories.

Are there any suggestions on how to overcome this error?
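
For reference, the usual guard against this error is to delete the output path before writing; a minimal sketch in Java (the JavaSparkContext sc, the JavaRDD<String> lines, and the literal path are placeholders, not our actual code). We would still like to understand why the behavior changed. The full stack trace:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Remove a stale or partial output directory before writing, so that
    // FileOutputFormat.checkOutputSpecs does not reject the job.
    String outputPath = "wasbs://payment-file-outbound@xxx.blob.core.windows.net/output/275DPN45922";
    Configuration hadoopConf = sc.hadoopConfiguration();
    FileSystem fs = FileSystem.get(URI.create(outputPath), hadoopConf);
    Path out = new Path(outputPath);
    if (fs.exists(out)) {
        fs.delete(out, true); // recursive delete
    }
    lines.saveAsTextFile(outputPath);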

User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory wasbs://payment-file-outbound@xxx.blob.core.windows.net/output/275DPN45922 already exists  
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)  
    at org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.assertConf(SparkHadoopWriter.scala:287)  
    at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:71)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)  
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)  
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)  
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)  
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)  
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)  
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)  
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1499)  
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)  
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)  
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)  
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1478)  
    at org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala:550)  
    at org.apache.spark.api.java.AbstractJavaRDDLike.saveAsTextFile(JavaRDDLike.scala:45)  
    at com.rm.integration.etl.generator.FixedWidthGenerator.genFlatFile(FixedWidthGenerator.java:114)  
    at com.rm.integration.etl.generator.PaymentOutboundGenerator.generate(PaymentOutboundGenerator.java:43)  
    at com.rm.integration.main.pipeline.PaymentPipeline.run(PaymentPipeline.java:115)  
    at com.rm.integration.main.PaymentOutboundApp.runApp(PaymentOutboundApp.java:35)  
    at com.rm.integration.app.DefaultSparkApplication.run(DefaultSparkApplication.java:40)  
    at com.rm.integration.main.Main.main(Main.java:16)  
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)  
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)  
    at java.lang.reflect.Method.invoke(Method.java:498)  
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721)  

It appears there was an HDInsight update on September 28 that may have only just reached our region. However, the release notes don't mention any known regressions or open issues.

EDIT: link to release notes: https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-release-notes#release-date-09282020


Accepted answer
    Sachin Shah 101 Reputation points
    2020-10-06T12:35:39.727+00:00

    Hi @PRADEEPCHEEKATLA-MSFT,

    It turned out to be our issue after all. A NullPointerException was being thrown in our code, but Spark swallowed it and reported its own FileAlreadyExistsException instead. Once we changed some settings, we were able to see the NPE and correct it.
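
    For anyone who hits the same masking behavior: I won't claim this is the exact setting we changed, but on YARN one plausible mechanism (consistent with the ApplicationMaster frames in the stack trace above) is that the failed first application attempt creates the output directory, and the automatic second attempt then dies on FileAlreadyExistsException, hiding the original error. Limiting the job to a single attempt, as sketched below, surfaces the first failure:

        import org.apache.spark.SparkConf;

        // Limit the job to one YARN application attempt so the first
        // exception (in our case an NPE) is reported, instead of the
        // retry's FileAlreadyExistsException. Equivalent to passing
        // --conf spark.yarn.maxAppAttempts=1 to spark-submit.
        SparkConf conf = new SparkConf()
                .setAppName("PaymentOutboundApp")
                .set("spark.yarn.maxAppAttempts", "1");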

    Bad data somehow made it through the front-end validation checks (which is what triggered the NPE), and it happened to coincide perfectly with the upgrade. Once we were able to reproduce the failure reliably, we found the root cause.
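
    As a belt-and-braces measure, a defensive filter before the write keeps a single bad record from failing the whole job; a hypothetical sketch (PaymentRecord, records, and the getters are illustrative names, not our real classes):

        import org.apache.spark.api.java.JavaRDD;

        // Hypothetical: drop records with missing mandatory fields up front,
        // instead of letting them throw an NPE inside the file generator.
        JavaRDD<PaymentRecord> clean = records.filter(r ->
                r != null
                && r.getAccountNumber() != null
                && r.getAmount() != null);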

    Sorry to have bothered you.

    1 person found this answer helpful.

0 additional answers
