How to query 3rd party Azure DataLake Gen2 and only store the results

Question

I want to query and aggregate raw JSON files stored in a 3rd party's Azure Data Lake (Gen2) and store only those aggregates in my own data lake or relational database. I do not want to physically copy all of the raw JSON files, because of the data volume and velocity, the additional storage cost, and the unnecessary latency it would introduce. What is the best way to do this, and which toolset should I use?

A bit more detail:

The data is in the 3rd party's Azure Data Lake Gen2.
I have read-only access to that data lake, currently via a SAS token. That can change if SAS is not supported.
The data files in the lake are stored in the folder structure yyyy/mm/dd/hh. There are thousands of JSON files in each hh folder.
Files are added to the data lake every minute of every day, always in the most recent hh folder.
Files are only ever added and do not change once added, so once I have queried a folder other than the current one, it never changes and never needs to be re-queried.
I want to be able to query the files as soon as they are posted to the 3rd party's data lake.
Once I have queried the files, I have no further need for them and do not need to import or keep them.
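
To make the folder layout above concrete, here is a minimal Python sketch of how the yyyy/mm/dd/hh structure could be turned into a list of folder prefixes to query. It assumes the folder names are zero-padded and based on UTC, which the description does not state. The function names hour_prefix and pending_prefixes are hypothetical. Closed hours only need to be queried once. The current hour is returned separately because it is still receiving files.

```python
from datetime import datetime, timedelta, timezone


def hour_prefix(ts: datetime) -> str:
    """Format a timestamp as a yyyy/mm/dd/hh folder prefix (assumes zero padding)."""
    return ts.strftime("%Y/%m/%d/%H")


def pending_prefixes(last_closed: datetime, now: datetime):
    """Return (closed_prefixes, open_prefix).

    last_closed is the most recent hour folder that has already been fully
    processed. Every hour after it and before the current hour is closed and
    only needs one pass. The current hour is still open and must be re-polled.
    """
    current = now.replace(minute=0, second=0, microsecond=0)
    cursor = last_closed.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    closed = []
    while cursor < current:
        closed.append(hour_prefix(cursor))
        cursor += timedelta(hours=1)
    return closed, hour_prefix(current)


# Example timestamps, chosen only for illustration.
last_closed = datetime(2024, 1, 31, 22, tzinfo=timezone.utc)
now = datetime(2024, 2, 1, 1, 17, tzinfo=timezone.utc)
closed, still_open = pending_prefixes(last_closed, now)
```

Persisting last_closed between runs (for example in the target database) would be enough to resume after a restart without re-reading older folders.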

Answer

Hi @JasonW-5564 ,

Welcome to Microsoft Q&A Platform. Thanks for posting the query.

Data flows in Azure Data Factory are a suitable cloud ETL tool for this requirement. They provide many transformations that let you shape and aggregate the data between the source and the sink.

Event triggers in Azure Data Factory can run a pipeline as soon as files are added to ADLS. An incremental load approach that filters on the last modified date lets you control which files are loaded, for example only those uploaded within a given time window.
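
The incremental load idea can be illustrated outside Data Factory with a small Python sketch. This is not how Data Factory implements it, only the same watermark logic written out by hand. The FileEntry type, the aggregate_new_files function and the "type" field in the JSON are all hypothetical. In practice the entries would come from listing and reading the files in the lake with a storage client, which is stubbed here with in-memory data. Only the aggregates and the new watermark are kept, and the raw JSON is discarded after parsing.

```python
import json
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class FileEntry:
    """Stand-in for one listed file: path, last modified time, raw JSON text."""
    path: str
    last_modified: datetime
    body: str


def aggregate_new_files(entries, watermark, key="type"):
    """Count records per `key` for files modified after `watermark`.

    Returns (counts, new_watermark). Files at or before the watermark are
    skipped, so each file contributes to the aggregates only once.
    """
    counts = Counter()
    new_watermark = watermark
    for entry in entries:
        if entry.last_modified <= watermark:
            continue  # already processed in an earlier run
        record = json.loads(entry.body)
        counts[record.get(key, "unknown")] += 1
        new_watermark = max(new_watermark, entry.last_modified)
    return dict(counts), new_watermark


# Illustrative in-memory data; a real run would list the current hh folder.
utc = timezone.utc
entries = [
    FileEntry("2024/02/01/01/a.json", datetime(2024, 2, 1, 1, 0, tzinfo=utc), '{"type": "click"}'),
    FileEntry("2024/02/01/01/b.json", datetime(2024, 2, 1, 1, 1, tzinfo=utc), '{"type": "view"}'),
    FileEntry("2024/02/01/01/c.json", datetime(2024, 2, 1, 1, 2, tzinfo=utc), '{"type": "click"}'),
]
watermark = datetime(2024, 2, 1, 1, 0, tzinfo=utc)
counts, watermark = aggregate_new_files(entries, watermark)
```

One caveat of a strict "greater than" comparison is that a file arriving later with the same last modified time as the stored watermark would be skipped. Tracking processed paths for the open hour folder is one way around that.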

Please let us know if you have further queries and we will be glad to assist.

