How to fix ErrorCode=ParquetJavaInvocationException (could not read footer for file) while performing Copy action in Azure Data Factory

Stevan Thomas 5 Reputation points
2024-01-15T21:31:34.5966667+00:00

Hi

I am working on a project that combines 10 Parquet files into one file (with an additional column populated via $$FILENAME). However, I keep running into this error:

ErrorCode=ParquetJavaInvocationException,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=An error occurred when invoking java, message: java.io.IOException:Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=yellow_tripdata_2023-02.parquet; isDirectory=false; length=47748012; replication=-1; blocksize=2147483647; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
total entry:7
org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:269)
org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:210)
org.apache.parquet.hadoop.ParquetReader.<init>(ParquetReader.java:112)
org.apache.parquet.hadoop.ParquetReader.<init>(ParquetReader.java:45)
org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:202)
com.microsoft.datatransfer.bridge.parquet.ParquetBatchReaderBridge.open(ParquetBatchReaderBridge.java:62)
com.microsoft.datatransfer.bridge.parquet.ParquetFileBridge.createReader(ParquetFileBridge.java:22)
java.io.IOException:Could not read footer for file FileStatus{path=yellow_tripdata_2023-02.parquet; isDirectory=false; length=47748012; replication=-1; blocksize=2147483647; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
total entry:6
org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:261)
org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:255)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
java.io.IOException:can not read class org.apache.parquet.format.FileMetaData: Required field 'codec' was not present! Struct: ColumnMetaData(type:INT32, encodings:[PLAIN_DICTIONARY, PLAIN, RLE], path_in_schema:[VendorID], codec:null, num_values:2913955, total_uncompressed_size:698054, total_compressed_size:370363, data_page_offset:39, dictionary_page_offset:4, encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:2)])
total entry:15
org.apache.parquet.format.Util.read(Util.java:216)
org.apache.parquet.format.Util.readFileMetaData(Util.java:73)
org.apache.parquet.format.converter.ParquetMetadataConverter$2.visit(ParquetMetadataConverter.java:774)
org.apache.parquet.format.converter.ParquetMetadataConverter$2.visit(ParquetMetadataConverter.java:771)
org.apache.parquet.format.converter.ParquetMetadataConverter$NoFilter.accept(ParquetMetadataConverter.java:653)
org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:771)
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:502)
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:461)
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:437)
org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:259)
org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:255)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
shaded.parquet.org.apache.thrift.protocol.TProtocolException:Required field 'codec' was not present! Struct: ColumnMetaData(type:INT32, encodings:[PLAIN_DICTIONARY, PLAIN, RLE], path_in_schema:[VendorID], codec:null, num_values:2913955, total_uncompressed_size:698054, total_compressed_size:370363, data_page_offset:39, dictionary_page_offset:4, encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:2)])
total entry:20
org.apache.parquet.format.ColumnMetaData.validate(ColumnMetaData.java:1777)
org.apache.parquet.format.ColumnMetaData.read(ColumnMetaData.java:1565)
org.apache.parquet.format.ColumnChunk.read(ColumnChunk.java:480)
org.apache.parquet.format.RowGroup.read(RowGroup.java:591)
org.apache.parquet.format.FileMetaData.read(FileMetaData.java:838)
org.apache.parquet.format.Util.read(Util.java:213)
org.apache.parquet.format.Util.readFileMetaData(Util.java:73)
org.apache.parquet.format.converter.ParquetMetadataConverter$2.visit(ParquetMetadataConverter.java:774)
org.apache.parquet.format.converter.ParquetMetadataConverter$2.visit(ParquetMetadataConverter.java:771)
org.apache.parquet.format.converter.ParquetMetadataConverter$NoFilter.accept(ParquetMetadataConverter.java:653)
org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:771)
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:502)
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:461)
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:437)
org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:259)
org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:255)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
.,Source=Microsoft.DataTransfer.Richfile.ParquetTransferPlugin,''Type=Microsoft.DataTransfer.Richfile.JniExt.JavaBridgeException,Message=,Source=Microsoft.DataTransfer.Richfile.HiveOrcBridge,'

I have done the following checks (a footer-inspection sketch follows this list):

  • The pipeline works for single files, but when I select multiple files it throws the error.
  • I changed the destination file type to CSV and still got the same error.
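For reference, a minimal way to check the footers outside ADF (assuming local copies of the files and the pyarrow package; the glob pattern is illustrative, not the actual source path):

```python
# Sketch: verify each file's footer is readable and report its codec.
# Assumes local copies of the files and pyarrow; the glob pattern is
# illustrative only.
import glob
import pyarrow.parquet as pq

for path in sorted(glob.glob("yellow_tripdata_2023-*.parquet")):
    try:
        meta = pq.ParquetFile(path).metadata
        codec = meta.row_group(0).column(0).compression  # e.g. SNAPPY, ZSTD
        print(f"{path}: rows={meta.num_rows}, codec={codec}")
    except Exception as exc:  # unreadable/corrupt footers land here
        print(f"{path}: could not read footer -> {exc}")
```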

I am currently learning the ins and outs of Azure Data Factory (so apologies if this is a novice question).

Thank you


2 answers

  1. Stevan Thomas 5 Reputation points
    2024-01-18T03:48:47.7266667+00:00

    Hi Kranthi, thank you for reaching out.

    Response

    That was my suspicion as well: looking through the files in a Parquet viewer showed a data-type mismatch in the VendorID column (INT32 vs INT64). However, removing that file from the process still resulted in the same error.
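    For reference, a quick way to confirm such a mismatch (pyarrow assumed; the glob pattern is illustrative):

    ```python
    # Print VendorID's type in every file to spot INT32-vs-INT64 drift.
    import glob
    import pyarrow.parquet as pq

    for path in sorted(glob.glob("yellow_tripdata_2023-*.parquet")):
        schema = pq.read_schema(path)  # reads only the footer
        print(path, schema.field("VendorID").type)
    ```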

    My Debugging Results/Hypothesis

    I figured the error about the missing codec was the real issue. The files in question appear to be compressed with the ZSTD algorithm (which I didn't find in the supported codec list). Hence the pipeline fails whenever any modification is made to the dataset (in my case, adding a column and merging the files); copying them as-is worked fine.
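    If the ZSTD hypothesis is right, one possible pre-step is to rewrite the offending files with a codec the reader handles, e.g. Snappy (a sketch with pyarrow; file names are illustrative):

    ```python
    # Re-encode a ZSTD-compressed file as Snappy before running the pipeline.
    import pyarrow.parquet as pq

    table = pq.read_table("yellow_tripdata_2023-02.parquet")
    pq.write_table(table, "yellow_tripdata_2023-02_snappy.parquet",
                   compression="snappy")
    ```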

    Workaround

    I transferred the files to Azure Data Lake Storage Gen2 as-is and then made the required edits in a Synapse Notebook. That worked for me, so hopefully this helps some poor soul. If I am wrong on any of the points above, please do let me know; it would help my understanding of Azure and Parquet, both of which I'm a novice at. Thank you!
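    Roughly, the notebook step looked like the sketch below (PySpark; the storage paths are placeholders, not my actual ones):

    ```python
    # Merge the Parquet files and add the source file name as a column.
    # `spark` is the session Synapse provides; paths are placeholders.
    from pyspark.sql.functions import input_file_name

    src = "abfss://raw@<account>.dfs.core.windows.net/tripdata/*.parquet"
    dst = "abfss://curated@<account>.dfs.core.windows.net/tripdata_merged"

    df = spark.read.parquet(src).withColumn("FILENAME", input_file_name())
    df.coalesce(1).write.mode("overwrite").parquet(dst)  # one output file
    ```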

    1 person found this answer helpful.

  2. KranthiPakala-MSFT 46,602 Reputation points Microsoft Employee
    2024-01-16T20:08:31.1533333+00:00

    Hi @Stevan Thomas, welcome to the Microsoft Q&A forum, and thanks for reaching out here.

    From the error message "Could not read footer for file FileStatus{path=yellow_tripdata_2023-02.parquet; isDirectory=false; length=47748012; replication=-1; blocksize=2147483647; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}"

    My understanding is that there could be a specific issue with that particular Parquet file (yellow_tripdata_2023-02.parquet). Could you please remove that file and try to merge the rest of the files using the additional column option? If the rest of the files go through without any issue, that confirms the problem is specific to the file mentioned in the error, and you will need to validate its format and content.

    In addition, I would also recommend using the latest version of the Self-Hosted Integration Runtime, as this error is specific to the Java runtime.

    Kindly validate the points called out above and keep us posted on how it goes. Thank you.

