Azure Data Factory error 2200 writing to parquet file

David Cosentino 25 Reputation points
2023-03-13T14:53:54.4233333+00:00

Hi, I'm trying to write some source tables into Parquet files in our data lake, and this error comes up every now and then. Sometimes re-running the pipeline resolves the issue, but I would like to get to the bottom of it.

{
    "errorCode": "2200",
    "message": "ErrorCode=ParquetJavaInvocationException,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=An error occurred when invoking java, message: java.io.IOException:E\ntotal entry:15\r\ncom.microsoft.datatransfer.bridge.io.parquet.IoBridge.outputStreamWrite(Native Method)\r\ncom.microsoft.datatransfer.bridge.io.parquet.BridgeOutputFileStream.flush(BridgeOutputFileStream.java:45)\r\ncom.microsoft.datatransfer.bridge.io.parquet.BridgeOutputFileStream.write(BridgeOutputFileStream.java:28)\r\norg.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)\r\njava.io.DataOutputStream.write(Unknown Source)\r\njava.io.FilterOutputStream.write(Unknown Source)\r\norg.apache.parquet.bytes.ConcatenatingByteArrayCollector.writeAllTo(ConcatenatingByteArrayCollector.java:46)\r\norg.apache.parquet.hadoop.ParquetFileWriter.writeDataPages(ParquetFileWriter.java:446)\r\norg.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writeToFileWriter(ColumnChunkPageWriteStore.java:191)\r\norg.apache.parquet.hadoop.ColumnChunkPageWriteStore.flushToFileWriter(ColumnChunkPageWriteStore.java:251)\r\norg.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:170)\r\norg.apache.parquet.hadoop.InternalParquetRecordWriter.checkBlockSizeReached(InternalParquetRecordWriter.java:143)\r\norg.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:125)\r\norg.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:292)\r\ncom.microsoft.datatransfer.bridge.parquet.ParquetBatchWriter.addRows(ParquetBatchWriter.java:61)\r\n.,Source=Microsoft.DataTransfer.Richfile.ParquetTransferPlugin,''Type=Microsoft.DataTransfer.Richfile.JniExt.JavaBridgeException,Message=,Source=Microsoft.DataTransfer.Richfile.HiveOrcBridge,'",
    "failureType": "UserError",
    "target": "SourceToDataLake",
    "details": []
}

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. PRADEEPCHEEKATLA-MSFT 56,301 Reputation points Microsoft Employee
    2023-03-14T07:41:57.4933333+00:00

    Hello @David Cosentino,

    Thanks for the question and using MS Q&A platform.

    The error message you are seeing indicates that there was an issue with invoking Java while writing to the Parquet file. The message also includes a stack trace that shows the sequence of function calls leading up to the error. The stack trace suggests that the issue occurred while writing data pages to the Parquet file.

    Here are some possible reasons why you may be encountering this error:

    1. Insufficient resources: Writing to a Parquet file can be resource-intensive, particularly when dealing with large amounts of data. If the integration runtime executing the copy does not have enough resources (e.g. CPU, memory, disk I/O), it can result in the Java invocation error that you are seeing. You can try increasing the resources available to the integration runtime and see if that resolves the issue.
    2. Data format incompatibility: Parquet files have a specific format, and the format of the data being written to the Parquet file must be compatible with it. If the data being written is not compatible with the Parquet file format, it can result in errors. You can verify that the data being written to the Parquet file is in the correct format by reviewing the schema and data types.
    3. Issues with the data source: If the issue is related to the data source itself, you may want to check the logs and error messages from the source system. There may be issues with the data format or data quality that are causing the Java invocation error.

    To troubleshoot this issue, you can try the following steps:

    1. Review the logs: Review the logs for the data transfer operation to see if there are any additional error messages or warnings that may provide more information about the issue.
    2. Check the source data: Verify that the source data is in the correct format and that there are no data quality issues that may be causing the issue.
    3. Increase resources: If you suspect that the issue is related to resource constraints, try increasing the resources available to the system running the pipeline.
    4. Retry the operation: If the issue is intermittent, retry the operation to see if it resolves the issue.
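    For step 4, intermittent failures like this are usually handled with a bounded retry and backoff. The sketch below is a generic pattern, not ADF code: `copy_fn` is a hypothetical placeholder for whatever triggers the copy (for example, starting a pipeline run via the Azure SDK), and it is assumed to raise on failure:

```python
import time

def copy_with_retry(copy_fn, max_attempts=4, base_delay=2.0):
    """Retry a transient operation with exponential backoff.

    copy_fn is a hypothetical callable that performs the copy;
    it should raise on failure and return normally on success.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return copy_fn()
        except IOError as exc:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
```

    Note that the Copy activity itself also exposes Retry and Retry interval settings in its policy, which are usually the simpler place to configure this inside the pipeline.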

    For more details, refer to Troubleshoot the Parquet format connector in Azure Data Factory and Azure Synapse

    If none of these steps help to resolve the issue, please do let us know.

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for was this answer helpful.


0 additional answers
