Why does the ADF integration runtime have a dependency on Java for parquet compression purposes?

David Beavon 991

I've installed the self-hosted IR (SHIR) several times over the past six months and have been confused by its dependency on the JDK
(we use this one: https://adoptopenjdk.net/).

You can read about this dependency in the docs here: https://learn.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime
(see the notes related to prerequisites... )

I have also seen the errors from the SHIR when the JVM is not found (i.e. "Java Runtime Environment is not found"). And I see related questions showing that others are installing it as well: https://social.msdn.microsoft.com/Forums/en-US/4fb8e8dd-205a-480f-adf6-7054563a6313/copying-from-selfhosted-ir-sql-server-to-azure-datalake-gen-2-in-parquet-files?forum=AzureDataFactory

But from what I can tell through some simple monitoring, the self-hosted runtime never actually launches any JVM processes, even though we extract tons of data in the Parquet format and use ADF to push it into our data lake.

So what exactly is the purpose of installing the JDK prerequisite? And even if we did see Java being used, why would that be better than a native (or .NET-based) library for generating Parquet files? Any tips would be appreciated. We rarely install the JDK on our servers, and it doesn't seem like we should need it in this scenario either.

Answer 1

HarithaMaddi-MSFT 10,146

Hi @David Beavon ,

Welcome to Microsoft Q&A Platform.

Thanks for posting the question. I understand the concern about the additional step. I reached out to the product team for more insight into this prerequisite, and they mentioned that a JRE/OpenJDK is required when parsing or generating the Parquet or ORC format on the self-hosted IR. The Avro format doesn't need it, and this will be corrected in the documentation soon. ADF built those formats using Java libraries, hence the dependency you are seeing.

I would recommend providing feedback at the feedback forum. All the feedback you share is closely monitored by the Data Factory product team and considered for future releases.

Hope this helps! Please let us know if you need more information.

David Beavon 991 Reputation points

2020-08-19T14:18:27.467+00:00

It is odd that I haven't yet noticed Java processes launching from the self-hosted runtime. I will have to take yet another close look at where/when/how my parquet datasets are triggering the use of Java.

Thanks for confirming that the Parquet format does continue to require Java.
It's not what I'm observing, but I may have missed something. (Maybe I will have to uninstall the JDK after the fact, or sabotage/disable it, and see whether the self-hosted runtime is still able to generate my Parquet files.)

Answer 2

This is not an answer, just some more follow-up information. I did some more digging with procmon and I can see where the JARs are being accessed. They are accessed by diawp.exe, which uses Java JARs from both the JRE installation and the SHIR directories, i.e.:

C:\Program Files\Microsoft Integration Runtime\4.0\Gateway\Jars\
C:\Program Files\AdoptOpenJDK\

What is really confusing to me, however, is that diawp.exe does NOT appear to be a Java process.

By all appearances, it is a .NET CLR process. Can someone please explain this mystery? The only thing I can think of is that the ADF developers thought it might be really cool to use a bytecode translator to effectively run JVM bytecode as if it were .NET. Is it possible they would do all of that just for the sake of Parquet functionality? It seems like overkill, since there are both native and .NET libraries for working with Parquet files.

I'd love to understand this better. It truly seems odd that we are installing a JRE for the sake of the SHIR, especially given that we never run any Java processes.

Can someone please comment? Here is a link to a thread where others say it is theoretically possible to use JARs from .NET:
https://stackoverflow.com/questions/512124/use-a-jar-java-library-api-in-c
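
One possibility that could be tested, and which nothing in this thread confirms for the SHIR: a non-Java process can host a JVM in-process by loading jvm.dll (the JNI Invocation API allows this), in which case no separate java.exe would ever show up in process monitoring. The Python sketch below is only a diagnostic aid under that assumption. It asks the Windows `tasklist` command which running processes have jvm.dll loaded. The function names are made up for illustration, and the parser assumes the CSV output has three columns (image name, PID, modules).

```python
import csv
import io
import subprocess


def parse_tasklist_csv(output):
    """Return (image_name, pid) pairs parsed from
    `tasklist /m <dll> /fo csv /nh` output (assumed 3-column CSV)."""
    hosts = []
    for row in csv.reader(io.StringIO(output)):
        # Skip blank lines and messages such as "INFO: No tasks are running...",
        # which do not parse as a 3-column row with a numeric PID.
        if len(row) < 3 or not row[1].strip().isdigit():
            continue
        hosts.append((row[0], int(row[1])))
    return hosts


def processes_hosting(module="jvm.dll"):
    """Windows only: list running processes that have `module` loaded."""
    result = subprocess.run(
        ["tasklist", "/m", module, "/fo", "csv", "/nh"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_tasklist_csv(result.stdout)


# Usage on the SHIR machine, while a Parquet copy activity is running:
#     for name, pid in processes_hosting():
#         print(f"{name} (PID {pid}) has jvm.dll loaded")
```

If diawp.exe appears in that list during a Parquet copy, the JVM is running inside the .NET process rather than as a separate Java process. If it never appears, some other mechanism is involved.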
