Can't process ORC files in Data Factory : ErrorCode=ParquetJavaInvocationException

Joris 6

Hi,

Our organisation uses ORC-formatted files for its central file storage. In our data factory I can only process very small ORC files (< 1 MB); for all other ORC files the pipeline fails.
We need to convert the files to a format we can work with in data flows, such as Parquet or CSV, but we currently cannot do this for most files.
The IR we use is Azure's AutoResolveIntegrationRuntime; we are not able to use a self-hosted IR.
This is the full error when running a pipeline with a Copy data activity:

{
"errorCode": "2200",
"message": "ErrorCode=ParquetJavaInvocationException,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=An error occurred when invoking java, message: java.nio.BufferOverflowException:Unable to retrieve Java exception..,Source=Microsoft.DataTransfer.Richfile.OrcTransferPlugin,''Type=Microsoft.DataTransfer.Richfile.JniExt.JavaBridgeException,Message=,Source=Microsoft.DataTransfer.Richfile.HiveOrcBridge,'",
"failureType": "UserError",
"target": "Copy data1",
"details": []
}

Can you help us out?

Answer 1

KranthiPakala-MSFT 46,642 Microsoft Employee Moderator

Hi @Joris-3620,

Thanks for your query and sorry for your experience.

This error occurs because the default JVM heap size is too small for the (de)serialization work the JVM has to do when copying ORC data. To mitigate it, the default JVM heap size needs to be increased.

However, I can see that this copy activity currently uses an Azure IR, and unfortunately the JVM heap size cannot be modified on an Azure IR. A self-hosted IR is needed instead. Once the self-hosted IR is created, add the following system environment variable on the machine that hosts it, then restart the IR:

Variable name: _JAVA_OPTIONS, variable value: -Xms256m -Xmx16g (Note: these are only sample values. You can choose the min/max heap size to suit your machine.)
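As a rough illustration of how that value is put together, here is a minimal Python sketch that builds the option string and checks that the initial heap (-Xms) does not exceed the maximum heap (-Xmx). The helper names heap_bytes and java_options are hypothetical, only the k/m/g suffixes are handled, and the script only prints the value; it does not set anything on the machine.

```python
import re

# Multipliers for the JVM size suffixes handled in this sketch (k, m, g).
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def heap_bytes(size: str) -> int:
    """Convert a JVM heap size such as '256m' or '16g' to bytes."""
    match = re.fullmatch(r"(\d+)([kKmMgG]?)", size)
    if match is None:
        raise ValueError(f"unrecognised heap size: {size!r}")
    number, unit = match.groups()
    # No suffix means the number is already in bytes.
    return int(number) * _UNITS.get(unit.lower(), 1)


def java_options(min_heap: str, max_heap: str) -> str:
    """Build a value for _JAVA_OPTIONS from an initial and a maximum heap size."""
    if heap_bytes(min_heap) > heap_bytes(max_heap):
        raise ValueError("initial heap (-Xms) must not exceed maximum heap (-Xmx)")
    return f"-Xms{min_heap} -Xmx{max_heap}"


# Sample values from the answer above; adjust them to the memory of the SHIR host.
print(java_options("256m", "16g"))
```

The printed string is what would go into the variable value; the maximum should stay below the physical memory available on the machine that hosts the self-hosted IR.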

You mentioned that you are not able to use a self-hosted IR. Could you please elaborate on why you can't use a SHIR for your copy activity, so that I can reach out to the internal team about an alternative that uses the Azure IR?

Please let me know.

Thank you
Please consider clicking "Accept Answer" and "Up-vote" on the post that helps you, as it can benefit other community members.

Joris 6 Reputation points

2020-07-08T15:44:58.893+00:00

Hi @KranthiPakala-MSFT,

Thank you for the quick response. I had already searched for the error and expected something like this.
We are a reporting department with a simple Azure setup (Data Factory and ADLS) that feeds our reports automatically from our organisation-wide database. We are not allowed to spin up a VM for the SHIR or install it on our personal notebooks. We have a support plan, so we submitted a ticket yesterday, and the Data Factory product team is currently looking into it. I hope they will find a way to process these files with an Azure IR.

KranthiPakala-MSFT 46,642 Reputation points Microsoft Employee Moderator

2020-07-08T21:31:08.813+00:00

Hi Joris-3620,

Thanks for your detailed response. I totally get it now. I'll track the support case internally.
Please feel free to share the workaround/resolution details once the support ticket is closed, as they would benefit other members of the community who read this thread.

Thanks
