How to fix HybridDeliveryException ("An error occurred when invoking java, message: java.io.IOException") while merging two Parquet files in ADF

abby_17 0 Reputation points
2025-04-01T09:11:06.8466667+00:00

We have written a pipeline that copies data from a database and merges it into an existing Parquet file.

While doing so, we get the following error from the activity that merges the Parquet files. This activity uses the Copy activity of ADF.

```
ErrorCode=ParquetJavaInvocationException,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=An error occurred when invoking java, message: java.io.IOException:S
total entry:15
com.microsoft.datatransfer.bridge.io.parquet.IoBridge.inputStreamRead(Native Method)
com.microsoft.datatransfer.bridge.io.parquet.BridgeInputFileStream.fillBuffer(BridgeInputFileStream.java:88)
com.microsoft.datatransfer.bridge.io.parquet.BridgeInputFileStream.read(BridgeInputFileStream.java:42)
java.io.DataInputStream.read(DataInputStream.java:149)
org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:102)
org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
org.apache.parquet.hadoop.ParquetFileReader$ConsecutivePartList.readAll(ParquetFileReader.java:1850)
org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:990)
org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:940)
org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:1082)
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:130)
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:230)
org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
com.microsoft.datatransfer.bridge.parquet.ParquetBatchReaderBridge.nextBuffer(ParquetBatchReaderBridge.java:168)
.,Source=Microsoft.DataTransfer.Richfile.ParquetTransferPlugin,''Type=Microsoft.DataTransfer.Richfile.JniExt.JavaBridgeException,Message=,Source=Microsoft.DataTransfer.Richfile.HiveOrcBridge,'
```

Here is the Copy activity JSON:

```json
{
"source": {
    "type": "ParquetSource",
    "storeSettings": {
        "type": "AzureBlobFSReadSettings",
        "recursive": false,
        "wildcardFolderPath": "policy-priced-peril-characteristic-commissions",
        "wildcardFileName": "policy_priced_peril_characteristic_commissions?*.parquet",
        "enablePartitionDiscovery": false
    },
    "formatSettings": {
        "type": "ParquetReadSettings"
    }
},
"sink": {
    "type": "ParquetSink",
    "storeSettings": {
        "type": "AzureBlobFSWriteSettings",
        "copyBehavior": "MergeFiles"
    },
    "formatSettings": {
        "type": "ParquetWriteSettings"
    }
},
"enableStaging": false,
"parallelCopies": 2,
"validateDataConsistency": true,
"logSettings": {
    "enableCopyActivityLog": true,
    "copyActivityLogSettings": {
        "logLevel": "Warning",
        "enableReliableLogging": false
    },
    "logLocationSettings": {
        "linkedServiceName": {
            "referenceName": "AzureDataLakeStorage2",
            "type": "LinkedServiceReference"
        },
        "path": "adf-logs"
    }
},
"dataIntegrationUnits": 4,
"translator": {
    "type": "TabularTranslator",
    "typeConversion": true,
    "typeConversionSettings": {
        "allowDataTruncation": true,
        "treatBooleanAsNumber": false
    }
}
}
```

We are using **AutoResolveIntegrationRuntime**.

Here is the output from this activity:

```json
{
	"dataRead": 275450924,
	"dataWritten": 0,
	"filesRead": 2,
	"filesWritten": 0,
	"sourcePeakConnections": 2,
	"sinkPeakConnections": 1,
	"rowsRead": 6853649,
	"rowsCopied": 6853649,
	"copyDuration": 184,
	"throughput": 2459.383,
	"logFilePath": "adf-logs/copyactivity-logs/MergeParquetFiles_copy1/d22650f7-9158-4720-8543-a735165a7caa/",
	"errors": [
		{
			"Code": 21000,
			"Message": "ErrorCode=ParquetJavaInvocationException,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=An error occurred when invoking java, message: java.io.IOException:S\ntotal entry:15\r\ncom.microsoft.datatransfer.bridge.io.parquet.IoBridge.inputStreamRead(Native Method)\r\ncom.microsoft.datatransfer.bridge.io.parquet.BridgeInputFileStream.fillBuffer(BridgeInputFileStream.java:88)\r\ncom.microsoft.datatransfer.bridge.io.parquet.BridgeInputFileStream.read(BridgeInputFileStream.java:42)\r\njava.io.DataInputStream.read(DataInputStream.java:149)\r\norg.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:102)\r\norg.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)\r\norg.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)\r\norg.apache.parquet.hadoop.ParquetFileReader$ConsecutivePartList.readAll(ParquetFileReader.java:1850)\r\norg.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:990)\r\norg.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:940)\r\norg.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:1082)\r\norg.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:130)\r\norg.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:230)\r\norg.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)\r\ncom.microsoft.datatransfer.bridge.parquet.ParquetBatchReaderBridge.nextBuffer(ParquetBatchReaderBridge.java:168)\r\n.,Source=Microsoft.DataTransfer.Richfile.ParquetTransferPlugin,''Type=Microsoft.DataTransfer.Richfile.JniExt.JavaBridgeException,Message=,Source=Microsoft.DataTransfer.Richfile.HiveOrcBridge,'",
			"EventType": 0,
			"Category": 5,
			"Data": {},
			"MsgId": null,
			"ExceptionType": null,
			"Source": null,
			"StackTrace": null,
			"InnerEventInfos": []
		}
	],
	"effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (West Europe)",
	"usedDataIntegrationUnits": 4,
	"billingReference": {
		"activityType": "DataMovement",
		"billableDuration": [
			{
				"meterType": "ManagedVNetIR",
				"duration": 0.26666666666666666,
				"unit": "DIUHours"
			}
		],
		"totalBillableDuration": [
			{
				"meterType": "AzureIR",
				"duration": 0.26666666666666666,
				"unit": "DIUHours"
			}
		]
	},
	"usedParallelCopies": 2,
	"executionDetails": [
		{
			"source": {
				"type": "AzureBlobFS",
				"region": "West Europe"
			},
			"sink": {
				"type": "AzureBlobFS",
				"region": "West Europe"
			},
			"status": "Failed",
			"start": "3/28/2025, 10:04:33 PM",
			"duration": 184,
			"usedDataIntegrationUnits": 4,
			"usedParallelCopies": 2,
			"profile": {
				"queue": {
					"status": "Completed",
					"duration": 70
				},
				"transfer": {
					"status": "Completed",
					"duration": 112,
					"details": {
						"listingSource": {
							"type": "AzureBlobFS",
							"workingDuration": 0
						},
						"readingFromSource": {
							"type": "AzureBlobFS",
							"workingDuration": 7
						},
						"writingToSink": {
							"type": "AzureBlobFS",
							"workingDuration": 0
						}
					}
				}
			},
			"detailedDurations": {
				"queuingDuration": 70,
				"transferDuration": 112
			}
		}
	],
	"dataConsistencyVerification": {
		"VerificationResult": "Verified"
	},
	"durationInQueue": {
		"integrationRuntimeQueue": 0
	}
}
```

Any help or feedback to understand and debug this error would be appreciated.


2 answers

  1. Venkat Reddy Navari 3,470 Reputation points Microsoft External Staff Moderator
    2025-04-01T11:57:53.9433333+00:00

    Hi @abby_17,

    The HybridDeliveryException in Azure Data Factory (ADF) while merging Parquet files usually happens due to file compatibility or configuration issues. Try these steps to fix it:

    1. Check the Parquet Files: Download the files from your storage (policy-priced-peril-characteristic-commissions). Use this Python script to check if they open correctly and compare their schemas:
         import pyarrow.parquet as pq  
         for file in ['file1.parquet', 'file2.parquet']:  
           table = pq.read_table(file)  
           print(f"Schema for {file}: {table.schema}")
      
      If one file doesn’t open or schemas don’t match, that could be the issue.
    2. Test with One File: Change "wildcardFileName" in your JSON to match just one file (e.g., policy_priced_peril_characteristic_commissions1.parquet) and run the pipeline. If it works, the second file or the merge operation might be causing the problem.
    3. Adjust Copy Activity Settings (see the JSON sketch after this list)
      • Reduce "parallelCopies" from 2 to 1 (for better stability).
      • Increase "dataIntegrationUnits" from 4 to 8 (for more processing power).
    4. Check the Logs for Errors
      • Look at the logs in adf-logs/copyactivity-logs/MergeParquetFiles_copy1/[execution-id]/. These logs may point to the exact issue.
    5. Try Merging the Files Manually
      • Run this Python script to merge the files:
             import pyarrow.parquet as pq  
             import pyarrow as pa  
             files = ['file1.parquet', 'file2.parquet']  
             tables = [pq.read_table(f) for f in files]  
             merged = pa.concat_tables(tables)  
             pq.write_table(merged, 'output.parquet')
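
    For steps 2 and 3, here is a minimal sketch of what the adjusted Copy activity properties might look like (the single file name is only an illustrative value from step 2; point it at one of your actual blobs and keep the remaining properties from your original JSON):

        {
            "source": {
                "type": "ParquetSource",
                "storeSettings": {
                    "type": "AzureBlobFSReadSettings",
                    "recursive": false,
                    "wildcardFolderPath": "policy-priced-peril-characteristic-commissions",
                    "wildcardFileName": "policy_priced_peril_characteristic_commissions1.parquet",
                    "enablePartitionDiscovery": false
                },
                "formatSettings": { "type": "ParquetReadSettings" }
            },
            "sink": {
                "type": "ParquetSink",
                "storeSettings": { "type": "AzureBlobFSWriteSettings", "copyBehavior": "MergeFiles" },
                "formatSettings": { "type": "ParquetWriteSettings" }
            },
            "parallelCopies": 1,
            "dataIntegrationUnits": 8
        }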
        

    For more details, check the Microsoft troubleshooting guide: Azure Data Factory Parquet Connector Troubleshooting

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.


  2. abby_17 0 Reputation points
    2025-04-22T07:56:58.9833333+00:00

    Hey,

    I discussed this topic with MS support, and the only resolution (without any known root cause) was to set retry to 3 on the merge/copy activity; a sketch of that setting is shown below.

    Hoping this helps others too.
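
    For reference, the retry is configured on the Copy activity's policy block in the pipeline JSON. A minimal sketch, assuming the activity name from the log path above and example values for the other policy settings:

        {
            "name": "MergeParquetFiles_copy1",
            "type": "Copy",
            "policy": {
                "timeout": "0.12:00:00",
                "retry": 3,
                "retryIntervalInSeconds": 30,
                "secureOutput": false,
                "secureInput": false
            }
        }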

