Hi
I am working on a project that combines 10 parquet files into one file (with an additional column of $$FILENAME). However, I keep running into this error:
ErrorCode=ParquetJavaInvocationException,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=An error occurred when invoking java, message: java.io.IOException:Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=yellow_tripdata_2023-02.parquet; isDirectory=false; length=47748012; replication=-1; blocksize=2147483647; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
total entry:7
org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:269)
org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:210)
org.apache.parquet.hadoop.ParquetReader.<init>(ParquetReader.java:112)
org.apache.parquet.hadoop.ParquetReader.<init>(ParquetReader.java:45)
org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:202)
com.microsoft.datatransfer.bridge.parquet.ParquetBatchReaderBridge.open(ParquetBatchReaderBridge.java:62)
com.microsoft.datatransfer.bridge.parquet.ParquetFileBridge.createReader(ParquetFileBridge.java:22)
java.io.IOException:Could not read footer for file FileStatus{path=yellow_tripdata_2023-02.parquet; isDirectory=false; length=47748012; replication=-1; blocksize=2147483647; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
total entry:6
org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:261)
org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:255)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
java.io.IOException:can not read class org.apache.parquet.format.FileMetaData: Required field 'codec' was not present! Struct: ColumnMetaData(type:INT32, encodings:[PLAIN_DICTIONARY, PLAIN, RLE], path_in_schema:[VendorID], codec:null, num_values:2913955, total_uncompressed_size:698054, total_compressed_size:370363, data_page_offset:39, dictionary_page_offset:4, encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:2)])
total entry:15
org.apache.parquet.format.Util.read(Util.java:216)
org.apache.parquet.format.Util.readFileMetaData(Util.java:73)
org.apache.parquet.format.converter.ParquetMetadataConverter$2.visit(ParquetMetadataConverter.java:774)
org.apache.parquet.format.converter.ParquetMetadataConverter$2.visit(ParquetMetadataConverter.java:771)
org.apache.parquet.format.converter.ParquetMetadataConverter$NoFilter.accept(ParquetMetadataConverter.java:653)
org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:771)
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:502)
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:461)
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:437)
org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:259)
org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:255)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
shaded.parquet.org.apache.thrift.protocol.TProtocolException:Required field 'codec' was not present! Struct: ColumnMetaData(type:INT32, encodings:[PLAIN_DICTIONARY, PLAIN, RLE], path_in_schema:[VendorID], codec:null, num_values:2913955, total_uncompressed_size:698054, total_compressed_size:370363, data_page_offset:39, dictionary_page_offset:4, encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:2)])
total entry:20
org.apache.parquet.format.ColumnMetaData.validate(ColumnMetaData.java:1777)
org.apache.parquet.format.ColumnMetaData.read(ColumnMetaData.java:1565)
org.apache.parquet.format.ColumnChunk.read(ColumnChunk.java:480)
org.apache.parquet.format.RowGroup.read(RowGroup.java:591)
org.apache.parquet.format.FileMetaData.read(FileMetaData.java:838)
org.apache.parquet.format.Util.read(Util.java:213)
org.apache.parquet.format.Util.readFileMetaData(Util.java:73)
org.apache.parquet.format.converter.ParquetMetadataConverter$2.visit(ParquetMetadataConverter.java:774)
org.apache.parquet.format.converter.ParquetMetadataConverter$2.visit(ParquetMetadataConverter.java:771)
org.apache.parquet.format.converter.ParquetMetadataConverter$NoFilter.accept(ParquetMetadataConverter.java:653)
org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:771)
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:502)
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:461)
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:437)
org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:259)
org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:255)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
.,Source=Microsoft.DataTransfer.Richfile.ParquetTransferPlugin,''Type=Microsoft.DataTransfer.Richfile.JniExt.JavaBridgeException,Message=,Source=Microsoft.DataTransfer.Richfile.HiveOrcBridge,'
I have done the following checks:
- The pipeline works for single files, but when I select multiple files it throws the error.
- I changed the destination file type to CSV and still get the same error.

I am currently learning the ins and outs of Azure Data Factory (so apologies if this is a novice question).
Thank you