Connection between Delta Lake of DataFactory and HDInsight Spark

Vaibhav 65 Reputation points
2024-03-26T06:56:39.2833333+00:00

Hi Everyone,

Below is my design:

  1. I am using ADF to ingest the data from Oracle DB to Raw Layer.
  2. Then I am using an ADF Data Flow to create and load data from the Raw Layer into Delta Lake. PFA.
  3. Now I want to use an HDInsight Spark cluster for further transformations.

Below are my queries:

  1. How can I link the Delta Lake created through ADF and perform further transformations on it from an HDInsight Spark cluster?
  2. Is there any way to do a `select *` and view the data in the Delta Lake?

1 answer

  1. Vinodh247-1375 11,206 Reputation points
    2024-03-26T12:41:52.2633333+00:00

    Hi Vaibhav,

    Thanks for reaching out to Microsoft Q&A.

    How can I link the Delta Lake created through ADF and perform further transformations on it from an HDInsight Spark cluster?

    To link your Delta Lake created through ADF and perform further transformations from an HDInsight Spark cluster, follow these steps:

    • Mount your Delta Lake storage to your HDInsight Spark cluster.
    • One option is the 'azure-datalake-store' library in Python to access ADLS from your HDInsight cluster. This allows you to access the Delta Lake files directly from Spark jobs running on the cluster.
    • Once the Delta Lake storage is accessible, use Spark to read the Delta Lake files as DataFrames and perform further transformations as required. The best way is to write Spark jobs using PySpark to read the Delta Lake data, apply your transformations, and then write the results back to Delta Lake; see the sketch after this list.
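
    A minimal PySpark sketch, assuming the Delta files written by the ADF Data Flow sink sit in an ADLS Gen2 container reachable from the cluster and that the Delta Lake (delta-core) package is configured on the HDInsight Spark cluster. The storage account, container, paths, and column names below are placeholders, not values from your environment.

    ```python
    # Sketch only: placeholder paths/columns, assumes delta-core is on the cluster classpath
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("adf-delta-transform")
        # Enable Delta Lake support in the Spark session
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    # Placeholder path to the Delta table produced by the ADF Data Flow sink
    delta_path = "abfss://raw@<storageaccount>.dfs.core.windows.net/deltalake/customers"

    # Read the Delta table as a DataFrame
    df = spark.read.format("delta").load(delta_path)

    # Example transformation: filter rows and add a derived column
    # ("status" is a hypothetical column used for illustration)
    transformed = (
        df.filter(F.col("status") == "ACTIVE")
          .withColumn("load_date", F.current_date())
    )

    # Write the result back to Delta Lake (overwrite mode, for illustration)
    transformed.write.format("delta").mode("overwrite").save(
        "abfss://curated@<storageaccount>.dfs.core.windows.net/deltalake/customers_curated"
    )
    ```

    Writing the output back to a separate Delta path keeps the ADF-managed table untouched, which is usually safer when ADF pipelines continue to load the original location.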

    Is there any way to do a `select *` and view the data in the Delta Lake?

    Once the Delta Lake files are accessible, you can query them using Spark SQL and view the results, for example by reading the data from the Delta table as shown below.

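    A short sketch, reusing the placeholder path and SparkSession from the example above: register the Delta location as a temporary view and run a plain `SELECT *` over it.

    ```python
    # Register the Delta table location as a temporary view for SQL queries
    delta_path = "abfss://raw@<storageaccount>.dfs.core.windows.net/deltalake/customers"
    spark.read.format("delta").load(delta_path).createOrReplaceTempView("customers_delta")

    # "select *" over the Delta data; .show() prints the rows to the console/notebook
    spark.sql("SELECT * FROM customers_delta LIMIT 20").show(truncate=False)
    ```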

    Please 'Upvote' (Thumbs-up) and 'Accept as answer' if the reply was helpful. This will benefit other community members who face the same issue.