Spark unable to write files to Blob storage

Sachin Shah 101 Reputation points
2020-10-02T16:25:54.1+00:00

We use HDInsight 3.6 with Spark. Until now our code has worked as expected, but as of last night our job started failing with an "output directory already exists" error. Looking at the blob storage, the output directories appear to have been created as block blobs rather than as directories.

Are there any suggestions on how to overcome this error?
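
For reference, the usual guard against this error is to delete the output path before writing; a minimal sketch in Java (the JavaSparkContext sc, the JavaRDD<String> lines, and the literal path are placeholders, not our actual code). We would still like to understand why the behavior changed. The full stack trace:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Remove a stale or partial output directory before writing, so that
    // FileOutputFormat.checkOutputSpecs does not reject the job.
    String outputPath = "wasbs://payment-file-outbound@xxx.blob.core.windows.net/output/275DPN45922";
    Configuration hadoopConf = sc.hadoopConfiguration();
    FileSystem fs = FileSystem.get(URI.create(outputPath), hadoopConf);
    Path out = new Path(outputPath);
    if (fs.exists(out)) {
        fs.delete(out, true); // recursive delete
    }
    lines.saveAsTextFile(outputPath);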

User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory wasbs://payment-file-outbound@xxx.blob.core.windows.net/output/275DPN45922 already exists  
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)  
    at org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.assertConf(SparkHadoopWriter.scala:287)  
    at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:71)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)  
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)  
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)  
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)  
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)  
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)  
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)  
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1499)  
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)  
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)  
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)  
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1478)  
    at org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala:550)  
    at org.apache.spark.api.java.AbstractJavaRDDLike.saveAsTextFile(JavaRDDLike.scala:45)  
    at com.rm.integration.etl.generator.FixedWidthGenerator.genFlatFile(FixedWidthGenerator.java:114)  
    at com.rm.integration.etl.generator.PaymentOutboundGenerator.generate(PaymentOutboundGenerator.java:43)  
    at com.rm.integration.main.pipeline.PaymentPipeline.run(PaymentPipeline.java:115)  
    at com.rm.integration.main.PaymentOutboundApp.runApp(PaymentOutboundApp.java:35)  
    at com.rm.integration.app.DefaultSparkApplication.run(DefaultSparkApplication.java:40)  
    at com.rm.integration.main.Main.main(Main.java:16)  
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)  
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)  
    at java.lang.reflect.Method.invoke(Method.java:498)  
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721)  

It appears there was an HDInsight update on September 28 that may have only just reached our region. However, the release notes don't mention any known regressions or open issues.

EDIT: link to release notes: https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-release-notes#release-date-09282020


Accepted answer
    Sachin Shah 101 Reputation points
    2020-10-06T12:35:39.727+00:00

    Hi @PRADEEPCHEEKATLA-MSFT,

    It turned out to be our issue after all. A NullPointerException was being thrown in our code, but Spark swallowed it and reported its own FileAlreadyExistsException instead. Once we changed some settings, we were able to see the NPE and correct it.
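
    For anyone who hits the same masking behavior: I won't claim this is the exact setting we changed, but on YARN one plausible mechanism (consistent with the ApplicationMaster frames in the stack trace above) is that the failed first application attempt creates the output directory, and the automatic second attempt then dies on FileAlreadyExistsException, hiding the original error. Limiting the job to a single attempt, as sketched below, surfaces the first failure:

        import org.apache.spark.SparkConf;

        // Limit the job to one YARN application attempt so the first
        // exception (in our case an NPE) is reported, instead of the
        // retry's FileAlreadyExistsException. Equivalent to passing
        // --conf spark.yarn.maxAppAttempts=1 to spark-submit.
        SparkConf conf = new SparkConf()
                .setAppName("PaymentOutboundApp")
                .set("spark.yarn.maxAppAttempts", "1");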

    Bad data somehow made it through the front-end validation checks (which is what triggered the NPE), and it happened to coincide perfectly with the upgrade. Once we were able to reproduce the failure reliably, we found the root cause.
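
    As a belt-and-braces measure, a defensive filter before the write keeps a single bad record from failing the whole job; a hypothetical sketch (PaymentRecord, records, and the getters are illustrative names, not our real classes):

        import org.apache.spark.api.java.JavaRDD;

        // Hypothetical: drop records with missing mandatory fields up front,
        // instead of letting them throw an NPE inside the file generator.
        JavaRDD<PaymentRecord> clean = records.filter(r ->
                r != null
                && r.getAccountNumber() != null
                && r.getAmount() != null);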

    Sorry to have bothered you.

    1 person found this answer helpful.

0 additional answers
