Why does the ADF integration runtime have a dependency on Java for parquet compression purposes?

David Beavon 976 Reputation points
2020-08-07T15:46:37.32+00:00

I've installed the self-hosted IR (SHIR) several times over the past six months and have been confused by its dependency on the JDK (we use this one: https://adoptopenjdk.net/).

You can read about this dependency in the docs here: https://learn.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime
(see the notes related to prerequisites... )

I have also seen the errors from the SHIR when the JVM is not found (i.e., "Java Runtime Environment is not found"). Related questions show that others are installing the JDK as well: https://social.msdn.microsoft.com/Forums/en-US/4fb8e8dd-205a-480f-adf6-7054563a6313/copying-from-selfhosted-ir-sql-server-to-azure-datalake-gen-2-in-parquet-files?forum=AzureDataFactory

However, from what I can tell with some simple monitoring, the self-hosted runtime never actually launches any JVM processes, even though we extract large amounts of data in the parquet format and use ADF to push it into our data lake.
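For anyone who wants to repeat that kind of monitoring, here is a minimal sketch. It assumes a Windows host where `tasklist /FO CSV /NH` is available; the function names are hypothetical and this is not how the SHIR itself works. It only reports separately launched `java.exe` / `javaw.exe` processes, so a JVM loaded inside another process would not show up here.

```python
import csv
import io
import subprocess

# Image names of the standard Java launchers on Windows.
JAVA_IMAGE_NAMES = {"java.exe", "javaw.exe"}


def find_java_processes(tasklist_csv):
    """Return (image_name, pid) pairs for Java launchers found in
    the text produced by `tasklist /FO CSV /NH`."""
    matches = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        # Expected columns: Image Name, PID, Session Name, Session#, Mem Usage
        if len(row) >= 2 and row[0].lower() in JAVA_IMAGE_NAMES:
            matches.append((row[0], int(row[1])))
    return matches


def snapshot_java_processes():
    """Run tasklist once (Windows only) and return any Java launcher processes."""
    output = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return find_java_processes(output)
```

Calling `snapshot_java_processes()` repeatedly while a parquet copy activity runs would show whether a standalone Java process ever appears. An empty result says nothing about whether the JVM is being used in-process.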

So what exactly is the purpose of installing the JDK prerequisite? And even if we did see Java being used, why would that be better than a native or .NET-based library for generating parquet files? Any tips would be appreciated. We rarely install the JDK on our servers, and it doesn't seem like we should need it in this scenario either.

Azure Data Factory

2 answers

  1. HarithaMaddi-MSFT 10,136 Reputation points
    2020-08-10T11:51:50.587+00:00

    Hi @David Beavon ,

    Welcome to Microsoft Q&A Platform.

    Thanks for posting the question. I understand the concern about the additional installation step. I reached out to the Product team for more insight into this prerequisite, and they mentioned that a JRE/OpenJDK is required when parsing or generating the Parquet or ORC formats on a self-hosted IR. The Avro format doesn't need it, and this will be corrected in the documentation soon. ADF implemented those formats using Java libraries, hence the dependency you are seeing.

    I recommend providing feedback at the feedback forum. All the feedback you share is closely monitored by the Data Factory Product team and implemented in future releases.

    Hope this helps! Please let us know if you need more information.


  2. David Beavon 976 Reputation points
    2020-08-28T23:17:59.16+00:00

    This is not an answer, just some more follow-up information. I did some more digging with procmon and I can see where the JARs are being accessed. They are accessed by diawp.exe, which makes use of Java JARs in the JRE installation and in the SHIR directories, i.e.:

    • C:\Program Files\Microsoft Integration Runtime\4.0\Gateway\Jars\
    • C:\Program Files\AdoptOpenJDK\

    What is really confusing to me, however, is that diawp.exe does NOT appear to be a Java process.

    By all appearances, it is a .NET CLR process. Can someone please explain this mystery? The only thing I can think of is that the ADF developers thought it might be really cool to use a bytecode translator to effectively run JVM bytecode as if it were .NET. Is it possible they would do all of that just for the sake of parquet functionality? It seems like overkill, since there are both native and .NET libraries for working with parquet files.

    I'd love to understand this better. It truly seems odd that we are installing a JRE for the sake of the SHIR, especially given that we never run any Java processes.
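One possibility worth checking, besides bytecode translation: a non-Java process can host a JVM in-process by loading the JVM shared library (jvm.dll on Windows) through the JNI Invocation API. Whether diawp.exe actually does this is an assumption; nothing in this thread confirms it. The sketch below only locates candidate JVM libraries under a JRE directory, so their paths can be compared against the module loads that procmon reports. The function name is hypothetical.

```python
import os

# File names of the JVM shared library on Windows, Linux, and macOS.
JVM_LIBRARY_NAMES = {"jvm.dll", "libjvm.so", "libjvm.dylib"}


def find_jvm_libraries(root):
    """Walk `root` and return sorted paths of any JVM shared libraries found."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower() in JVM_LIBRARY_NAMES:
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

For example, `find_jvm_libraries(r"C:\Program Files\AdoptOpenJDK")` would list any jvm.dll files in that installation. If procmon shows diawp.exe loading one of them, that would point to an embedded JVM rather than a translator.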

    Can someone please comment? Here is a link to a discussion where others say it is theoretically possible to run JARs from .NET:
    https://stackoverflow.com/questions/512124/use-a-jar-java-library-api-in-c
