How to fix ErrorCode=ParquetJavaInvocationException (could not read footer for file) while performing Copy action in Azure Data Factory

Stevan Thomas 5 Reputation points
2024-01-15T21:31:34.5966667+00:00

Hi

I am working on a project that combines 10 Parquet files into one file (with an additional column populated via $$FILENAME). However, I keep running into this error:

ErrorCode=ParquetJavaInvocationException,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=An error occurred when invoking java, message: java.io.IOException:Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=yellow_tripdata_2023-02.parquet; isDirectory=false; length=47748012; replication=-1; blocksize=2147483647; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
total entry:7
org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:269)
org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:210)
org.apache.parquet.hadoop.ParquetReader.<init>(ParquetReader.java:112)
org.apache.parquet.hadoop.ParquetReader.<init>(ParquetReader.java:45)
org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:202)
com.microsoft.datatransfer.bridge.parquet.ParquetBatchReaderBridge.open(ParquetBatchReaderBridge.java:62)
com.microsoft.datatransfer.bridge.parquet.ParquetFileBridge.createReader(ParquetFileBridge.java:22)
java.io.IOException:Could not read footer for file FileStatus{path=yellow_tripdata_2023-02.parquet; isDirectory=false; length=47748012; replication=-1; blocksize=2147483647; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
total entry:6
org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:261)
org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:255)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
java.io.IOException:can not read class org.apache.parquet.format.FileMetaData: Required field 'codec' was not present! Struct: ColumnMetaData(type:INT32, encodings:[PLAIN_DICTIONARY, PLAIN, RLE], path_in_schema:[VendorID], codec:null, num_values:2913955, total_uncompressed_size:698054, total_compressed_size:370363, data_page_offset:39, dictionary_page_offset:4, encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:2)])
total entry:15
org.apache.parquet.format.Util.read(Util.java:216)
org.apache.parquet.format.Util.readFileMetaData(Util.java:73)
org.apache.parquet.format.converter.ParquetMetadataConverter$2.visit(ParquetMetadataConverter.java:774)
org.apache.parquet.format.converter.ParquetMetadataConverter$2.visit(ParquetMetadataConverter.java:771)
org.apache.parquet.format.converter.ParquetMetadataConverter$NoFilter.accept(ParquetMetadataConverter.java:653)
org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:771)
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:502)
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:461)
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:437)
org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:259)
org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:255)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
shaded.parquet.org.apache.thrift.protocol.TProtocolException:Required field 'codec' was not present! Struct: ColumnMetaData(type:INT32, encodings:[PLAIN_DICTIONARY, PLAIN, RLE], path_in_schema:[VendorID], codec:null, num_values:2913955, total_uncompressed_size:698054, total_compressed_size:370363, data_page_offset:39, dictionary_page_offset:4, encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:2)])
total entry:20
org.apache.parquet.format.ColumnMetaData.validate(ColumnMetaData.java:1777)
org.apache.parquet.format.ColumnMetaData.read(ColumnMetaData.java:1565)
org.apache.parquet.format.ColumnChunk.read(ColumnChunk.java:480)
org.apache.parquet.format.RowGroup.read(RowGroup.java:591)
org.apache.parquet.format.FileMetaData.read(FileMetaData.java:838)
org.apache.parquet.format.Util.read(Util.java:213)
org.apache.parquet.format.Util.readFileMetaData(Util.java:73)
org.apache.parquet.format.converter.ParquetMetadataConverter$2.visit(ParquetMetadataConverter.java:774)
org.apache.parquet.format.converter.ParquetMetadataConverter$2.visit(ParquetMetadataConverter.java:771)
org.apache.parquet.format.converter.ParquetMetadataConverter$NoFilter.accept(ParquetMetadataConverter.java:653)
org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:771)
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:502)
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:461)
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:437)
org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:259)
org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:255)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
.,Source=Microsoft.DataTransfer.Richfile.ParquetTransferPlugin,''Type=Microsoft.DataTransfer.Richfile.JniExt.JavaBridgeException,Message=,Source=Microsoft.DataTransfer.Richfile.HiveOrcBridge,'

I have done the following checks (a footer-inspection sketch follows this list):

  • The pipeline works for single files, but when I select multiple files it throws the error.
  • I changed the destination file type to CSV and still got the same error.
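For reference, a minimal way to check the footers outside ADF (assuming local copies of the files and the pyarrow package; the glob pattern is illustrative, not the actual source path):

```python
# Sketch: verify each file's footer is readable and report its codec.
# Assumes local copies of the files and pyarrow; the glob pattern is
# illustrative only.
import glob
import pyarrow.parquet as pq

for path in sorted(glob.glob("yellow_tripdata_2023-*.parquet")):
    try:
        meta = pq.ParquetFile(path).metadata
        codec = meta.row_group(0).column(0).compression  # e.g. SNAPPY, ZSTD
        print(f"{path}: rows={meta.num_rows}, codec={codec}")
    except Exception as exc:  # unreadable/corrupt footers land here
        print(f"{path}: could not read footer -> {exc}")
```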

I am currently learning the ins and outs of Azure Data Factory (so apologies if this is a novice question).

Thank you


2 answers

  1. Stevan Thomas 5 Reputation points
    2024-01-18T03:48:47.7266667+00:00

    Hi Kranthi, thank you for reaching out.

    Response

    That was my suspicion as well: looking through the files in a Parquet viewer showed a data-type mismatch in the VendorID column (INT32 vs INT64). However, removing that file from the process still resulted in the same error.
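    For reference, a quick way to confirm such a mismatch (pyarrow assumed; the glob pattern is illustrative):

    ```python
    # Print VendorID's type in every file to spot INT32-vs-INT64 drift.
    import glob
    import pyarrow.parquet as pq

    for path in sorted(glob.glob("yellow_tripdata_2023-*.parquet")):
        schema = pq.read_schema(path)  # reads only the footer
        print(path, schema.field("VendorID").type)
    ```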

    My Debugging Results/Hypothesis

    I figured the error about the missing codec was the real issue. The files in question appear to be compressed with the ZSTD algorithm (which I didn't find in the supported codec list). Hence the pipeline fails whenever any modification is made to the dataset (in my case, adding a column and merging the files); copying them as-is worked fine.
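    If the ZSTD hypothesis is right, one possible pre-step is to rewrite the offending files with a codec the reader handles, e.g. Snappy (a sketch with pyarrow; file names are illustrative):

    ```python
    # Re-encode a ZSTD-compressed file as Snappy before running the pipeline.
    import pyarrow.parquet as pq

    table = pq.read_table("yellow_tripdata_2023-02.parquet")
    pq.write_table(table, "yellow_tripdata_2023-02_snappy.parquet",
                   compression="snappy")
    ```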

    Workaround

    I transferred the files to Azure Data Lake Storage Gen2 as-is and then made the required edits in a Synapse Notebook. That worked for me, so hopefully this helps some poor soul. If I am wrong on any of the points above, please do let me know; it would help my understanding of Azure and Parquet, both of which I'm a novice at. Thank you!
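    Roughly, the notebook step looked like the sketch below (PySpark; the storage paths are placeholders, not my actual ones):

    ```python
    # Merge the Parquet files and add the source file name as a column.
    # `spark` is the session Synapse provides; paths are placeholders.
    from pyspark.sql.functions import input_file_name

    src = "abfss://raw@<account>.dfs.core.windows.net/tripdata/*.parquet"
    dst = "abfss://curated@<account>.dfs.core.windows.net/tripdata_merged"

    df = spark.read.parquet(src).withColumn("FILENAME", input_file_name())
    df.coalesce(1).write.mode("overwrite").parquet(dst)  # one output file
    ```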

    1 person found this answer helpful.

  2. KranthiPakala-MSFT 46,602 Reputation points Microsoft Employee
    2024-01-16T20:08:31.1533333+00:00

    Hi @Stevan Thomas, welcome to the Microsoft Q&A forum, and thanks for reaching out here.

    From the error message "Could not read footer for file FileStatus{path=yellow_tripdata_2023-02.parquet; isDirectory=false; length=47748012; replication=-1; blocksize=2147483647; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}"

    My understanding is that there could be a specific issue with that particular Parquet file (yellow_tripdata_2023-02.parquet). Could you please remove that file and try to merge the rest of the files using the additional column option? If the rest of the files go through without any issue, that confirms the problem is specific to the file mentioned in the error, and you will need to validate its format and content.

    In addition, I would also recommend using the latest version of the Self-Hosted Integration Runtime, as this error is specific to the Java runtime.

    Kindly validate the points called out above and keep us posted on how it goes. Thank you.

