com.databricks.sql.io.FileReadException: Error while reading file wasbs:REDACTED_LOCAL_PART@****.blob.core.windows.net/

Mayuri Kadam 81 Reputation points Microsoft Employee
2021-01-29T20:32:13.813+00:00

Hi,
I am getting the following error when reading Avro files from Azure Blob Storage in an Azure Databricks job:

com.databricks.sql.io.FileReadException: Error while reading file wasbs:REDACTED_LOCAL_PART@****.blob.core.windows.net/cook/processYear=2021/processMonth=01/processDay=08/processHour=03/part-00003-tid-1903224826064875913-0ded1380-19a2-4ed2-9d4d-f19724b5bf5d-29101-1.c000.avro.
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logFileNameAndThrow(FileScanRDD.scala:286)
Caused by: java.io.IOException
    at com.microsoft.azure.storage.core.Utility.initIOException(Utility.java:737)
Caused by: com.microsoft.azure.storage.StorageException: Blob hash mismatch (integrity check failed), Expected value is x2rC4SZaPjA==, retrieved 6kwtbjN2v/w==.
    at com.microsoft.azure.storage.blob.CloudBlob$9.postProcessResponse(CloudBlob.java:1409)

Any idea how to resolve this? Thanks.
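
For readers hitting the same message: the exception is raised by the Azure Storage client's download integrity check. Conceptually, the client hashes the bytes it actually received (likely MD5, given the 16-byte Base64 digests in the message) and compares the result against the hash stored with the blob; the second value in the message is the digest of what was downloaded. A minimal sketch of that idea, illustrative only and not the SDK's actual code (`contentHash` and `verify` are made-up names):

    import java.io.IOException
    import java.security.MessageDigest
    import java.util.Base64

    // Base64-encoded MD5 of the downloaded bytes, compared against the
    // hash recorded in the blob's properties when it was written.
    def contentHash(bytes: Array[Byte]): String =
      Base64.getEncoder.encodeToString(MessageDigest.getInstance("MD5").digest(bytes))

    def verify(downloaded: Array[Byte], expectedHash: String): Unit =
      if (contentHash(downloaded) != expectedHash)
        throw new IOException("Blob hash mismatch (integrity check failed), " +
          s"Expected value is $expectedHash, retrieved ${contentHash(downloaded)}.")

If the bytes served by the storage account differ from the bytes that were hashed at write time, this check fails even though no application code is wrong.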

Tags: Azure Blob Storage, Azure Databricks

4 answers

  1. Mayuri Kadam 81 Reputation points Microsoft Employee
    2021-02-02T19:35:10.22+00:00

    Hi Pradeep, please find the stack trace below:

    com.databricks.sql.io.FileReadException: Error while reading file wasbs:REDACTED_LOCAL_PART@*******.blob.core.windows.net/cook/processYear=2021/processMonth=01/processDay=09/processHour=00/part-00003-tid-4640843606947508963-a580-40bd-ad0d-e7c92f1e5b1f-29229-1.c000.avro.
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logFileNameAndThrow(FileScanRDD.scala:286)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:264)
        at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:205)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:354)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:205)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage58.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:640)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage65.agg_doAggregateWithKeys_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage65.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:640)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
        at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
        at org.apache.spark.scheduler.Task.run(Task.scala:112)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
    Caused by: java.io.IOException
        at com.microsoft.azure.storage.core.Utility.initIOException(Utility.java:737)
        at com.microsoft.azure.storage.blob.BlobInputStream.dispatchRead(BlobInputStream.java:264)
        at com.microsoft.azure.storage.blob.BlobInputStream.readInternal(BlobInputStream.java:448)
        at com.microsoft.azure.storage.blob.BlobInputStream.read(BlobInputStream.java:420)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem$NativeAzureFsInputStream.read(NativeAzureFileSystem.java:876)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at com.databricks.spark.metrics.FSInputStreamWithMetrics$$anonfun$read$3.apply$mcI$sp(FileSystemWithMetrics.scala:206)
        at com.databricks.spark.metrics.FSInputStreamWithMetrics$$anonfun$read$3.apply(FileSystemWithMetrics.scala:206)
        at com.databricks.spark.metrics.FSInputStreamWithMetrics$$anonfun$read$3.apply(FileSystemWithMetrics.scala:206)
        at com.databricks.spark.metrics.ExtendedTaskIOMetrics$class.withTimeMetric(FileSystemWithMetrics.scala:151)
        at com.databricks.spark.metrics.ExtendedTaskIOMetrics$class.com$databricks$spark$metrics$ExtendedTaskIOMetrics$$withTimeAndBytesMetric(FileSystemWithMetrics.scala:171)
        at com.databricks.spark.metrics.ExtendedTaskIOMetrics$$anonfun$withTimeAndBytesReadMetric$1.apply$mcI$sp(FileSystemWithMetrics.scala:185)
        at com.databricks.spark.metrics.ExtendedTaskIOMetrics$$anonfun$withTimeAndBytesReadMetric$1.apply(FileSystemWithMetrics.scala:185)
        at com.databricks.spark.metrics.ExtendedTaskIOMetrics$$anonfun$withTimeAndBytesReadMetric$1.apply(FileSystemWithMetrics.scala:185)
        at com.databricks.spark.metrics.SamplerWithPeriod.sample(FileSystemWithMetrics.scala:78)
        at com.databricks.spark.metrics.ExtendedTaskIOMetrics$class.withTimeAndBytesReadMetric(FileSystemWithMetrics.scala:185)
        at com.databricks.spark.metrics.FSInputStreamWithMetrics.withTimeAndBytesReadMetric(FileSystemWithMetrics.scala:192)
        at com.databricks.spark.metrics.FSInputStreamWithMetrics.read(FileSystemWithMetrics.scala:205)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.avro.mapred.FsInput.read(FsInput.java:54)
        at org.apache.spark.sql.avro.AvroFileFormat$.openAvroReader(AvroFileFormat.scala:275)
        at org.apache.spark.sql.avro.AvroFileFormat$$anonfun$buildReader$1.apply(AvroFileFormat.scala:202)
        at org.apache.spark.sql.avro.AvroFileFormat$$anonfun$buildReader$1.apply(AvroFileFormat.scala:183)
        at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:147)
        at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:134)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:235)
        ... 23 more
    Caused by: com.microsoft.azure.storage.StorageException: Blob hash mismatch (integrity check failed), Expected value is xmypzfnpTdq8eFLxZ49DhQ==, retrieved CY7+V9/JEfVroD5omBB2Uw==.
        at com.microsoft.azure.storage.blob.CloudBlob$9.postProcessResponse(CloudBlob.java:1409)
        at com.microsoft.azure.storage.blob.CloudBlob$9.postProcessResponse(CloudBlob.java:1310)
        at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:149)
        at com.microsoft.azure.storage.blob.CloudBlob.downloadRangeInternal(CloudBlob.java:1493)
        at com.microsoft.azure.storage.blob.BlobInputStream.dispatchRead(BlobInputStream.java:255)
        ... 53 more
    

  2. Mayuri Kadam 81 Reputation points Microsoft Employee
    2021-02-08T17:28:33.213+00:00

    Hi @PRADEEPCHEEKATLA-MSFT, the following is the code we use to upload files to the Azure Blob container:

    spark.conf.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")  
    spark.conf.set("fs.azure.sas."+blobStorageAccContainerName+"."+blobStorageAccName+".blob.core.windows.net", blobStorageBlobSASToken)  
    containerPath = "wasbs://" + blobStorageAccContainerName + "@" + blobStorageAccName + ".blob.core.windows.net/"  
                var storageCheckpointDirectory = checkpointDirectory  
                if (storageCheckpointDirectory.isEmpty) {  
                  storageCheckpointDirectory = Paths.get(new java.io.File(".").getCanonicalPath).toString  
                }  
                storageCheckpointDirectory = storageCheckpointDirectory + blobStorageAccName + "/" + blobStorageAccContainerName + "/" + dirName  
                val queryName = "uploadDataToBlob:" + dirName  
                spark.sparkContext.setLocalProperty("spark.scheduler.pool", dirName)  
                var df = data.writeStream  
                              .option("checkpointLocation", storageCheckpointDirectory)  
                              .queryName(queryName)  
                              .format(format)  
                if (partitionCols.nonEmpty) df = df.partitionBy(partitionCols: _*)  
                df.option("path", blob.getcontainerPath + dirName)  
                  .start()  
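
    The snippet assumes `data` (a streaming DataFrame), `format`, `dirName`, `partitionCols`, and `checkpointDirectory` come from the surrounding job. For illustration only, a hypothetical invocation could look like this (all values made up; the partition columns mirror the `processYear=.../processHour=...` layout in the error path):

    // Hypothetical values, for illustration only; the real job supplies these.
    val data = spark.readStream.format("rate").load()  // stand-in streaming source
    val format = "avro"                                // matches the .avro files in the error
    val dirName = "cook/"
    val partitionCols = Seq("processYear", "processMonth", "processDay", "processHour")
    val checkpointDirectory = ""                       // empty, so the code falls back to the working directory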
    

  3. Mayuri Kadam 81 Reputation points Microsoft Employee
    2021-02-08T17:29:57.387+00:00

    Hi @PRADEEPCHEEKATLA-MSFT, the following is the code we use to read from the Azure Blob container:

    spark.conf.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")  
    spark.conf.set("fs.azure.sas."+blobStorageAccContainerName+"."+blobStorageAccName+".blob.core.windows.net", blobStorageBlobSASToken)  
    containerPath = "wasbs://" + blobStorageAccContainerName + "@" + blobStorageAccName + ".blob.core.windows.net/"  
         spark.read  
                  .format(format)  
                  .load(containerPath  + dirName)  
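
    One observation, not a confirmed fix from this thread: if the streaming upload above is still rewriting files under `containerPath + dirName` while this read runs, the reader can download bytes that no longer match the hash stored on the blob, which is exactly the mismatch in the stack trace. As a stopgap, Spark can be told to skip files that fail while being read, at the cost of silently dropping them:

    // Workaround sketch: spark.sql.files.ignoreCorruptFiles is a standard
    // Spark SQL setting; it hides the failing file rather than fixing the blob.
    spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

    spark.read
      .format(format)
      .load(containerPath + dirName)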
    

  4. Eugen Mirosch 1 Reputation point
    2021-04-13T12:00:41.21+00:00

    Any updates regarding this issue? We are experiencing exactly the same problem.
