Millions of JSON records per day are being stored in a Data Lake Gen2; which tool(s) should we use to dedupe and aggregate?

JasonW-5564 161 Reputation points
2020-12-11T17:19:26.467+00:00

We need to ingest data from a 3rd-party vendor that is stored in a Data Lake Gen2 in JSON format. We have been granted read-only access to the data lake. The data is stored in JSON files in a folder structure of Year/Month/Day/Hour/, and within each hour folder there are many .json files.

What we need to do is aggregate that data hourly, ensuring no duplicates/double counting of rows, and store the results in our traditional SQL Server db. Each row in the JSON files has a unique id column. Our aggregation is pretty simple: we aggregate on 3 columns in the data. I know how to do this in a traditional SQL Server table environment, but I am not sure which tool(s) to use in the Azure Data Lake world.

The data is continuously being added to the data lake, every hour of every day (say 100 million rows per day currently), and I want to calculate these aggregates each hour. My current setup uses a row id in the SQL table to keep track of the last row of raw data that was aggregated; the next run picks up starting at the next row id, calculates the new aggregates, and upserts them into the master aggregate table.
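
For reference, here is a simplified sketch of what that hourly job looks like today (every table, column, and variable name below is made up for illustration; the real schema differs):

    -- Hypothetical watermark table holds the last raw row id we aggregated.
    DECLARE @LastRowId BIGINT, @MaxRowId BIGINT;
    SELECT @LastRowId = LastAggregatedRowId FROM dbo.AggWatermark;
    SELECT @MaxRowId  = MAX(RowId) FROM dbo.RawData;

    -- Aggregate only the rows added since the last run, then upsert
    -- the results into the master aggregate table.
    MERGE dbo.MasterAggregate AS tgt
    USING (
        SELECT Dim1, Dim2, Dim3,
               COUNT(*)    AS RowCnt,
               SUM(Metric) AS MetricSum
        FROM dbo.RawData
        WHERE RowId > @LastRowId AND RowId <= @MaxRowId
        GROUP BY Dim1, Dim2, Dim3
    ) AS src
        ON tgt.Dim1 = src.Dim1 AND tgt.Dim2 = src.Dim2 AND tgt.Dim3 = src.Dim3
    WHEN MATCHED THEN
        UPDATE SET tgt.RowCnt    = tgt.RowCnt + src.RowCnt,
                   tgt.MetricSum = tgt.MetricSum + src.MetricSum
    WHEN NOT MATCHED THEN
        INSERT (Dim1, Dim2, Dim3, RowCnt, MetricSum)
        VALUES (src.Dim1, src.Dim2, src.Dim3, src.RowCnt, src.MetricSum);

    -- Advance the watermark to the highest row id this run covered.
    UPDATE dbo.AggWatermark SET LastAggregatedRowId = @MaxRowId;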

Also, we want to calculate these aggregates directly against the vendor's data lake so we can report in near real time. Then we want to copy the files from their data lake to ours (both Azure Gen 2) for retention and further, deeper analysis. Any suggestions on methodology and tools for this task as well?


3 answers

  1. Silvano P 101 Reputation points
    2020-12-11T19:32:32.213+00:00

    Hello Jason,

    I believe that serverless SQL pool in Azure Synapse Analytics can help you do this. It is the perfect technology/tool for the job!

    https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-json-files

    You can process the JSON, perform your aggregation, and then save the results to your data lake.
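
    For example, something along these lines (the storage account, container, and JSON property names are placeholders, and this assumes one JSON document per line; the doc linked above also covers files that hold a single multi-line document):

        -- Placeholder account/container/path; the wildcards cover every
        -- hour folder for one day. FIELDTERMINATOR/FIELDQUOTE = 0x0b makes
        -- each line arrive as a single NVARCHAR(MAX) document, which
        -- JSON_VALUE then shreds. COUNT(DISTINCT id) avoids double counting.
        SELECT
            JSON_VALUE(doc, '$.dim1') AS Dim1,
            JSON_VALUE(doc, '$.dim2') AS Dim2,
            JSON_VALUE(doc, '$.dim3') AS Dim3,
            COUNT(DISTINCT JSON_VALUE(doc, '$.id')) AS UniqueRows
        FROM OPENROWSET(
                BULK 'https://youraccount.dfs.core.windows.net/yourcontainer/2020/12/11/*/*.json',
                FORMAT = 'CSV',
                FIELDTERMINATOR = '0x0b',
                FIELDQUOTE = '0x0b'
             ) WITH (doc NVARCHAR(MAX)) AS rows
        GROUP BY
            JSON_VALUE(doc, '$.dim1'),
            JSON_VALUE(doc, '$.dim2'),
            JSON_VALUE(doc, '$.dim3');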

    Hope this helps


  2. MartinJaffer-MSFT 26,061 Reputation points
    2020-12-12T00:56:33.753+00:00

    Hello @JasonW-5564 and welcome to Microsoft Q&A.

    I would like to break down your ask into parts to make answering easier.

    • Process data coming in at a near streaming basis
    • Do further analysis
    • Access vendor's data lake in real time
    • Copy original data to your lake for archiving

    For the data processing, there are a number of possible technologies, including HDInsight, Databricks, Azure Stream Analytics, and Azure Synapse (Analytics).

    I agree with @Silvano P that Azure Synapse Analytics is a good candidate, but for a different reason. Synapse Analytics gives you many ways to work with your data: SQL (dedicated and on-demand serverless), Spark distributed computing, the ETL of Mapping Data Flows from Azure Data Factory, and built-in Azure Data Lake Gen 2 integration. There is also compatibility with Stream Analytics and Power BI integration. Synapse Analytics is like a meeting place for all these technologies.

    For the archival and access to the vendor's data lake, I strongly recommend you look at Azure Data Share. Azure Data Share has two modes. One effectively copies their data lake to yours on a regular basis. This would be good for the archiving. The other mode effectively gives you real-time read-only access to their data lake.


  3. MartinJaffer-MSFT 26,061 Reputation points
    2020-12-22T21:59:57.027+00:00

    Thank you for the positive feedback @JasonW-5564 .

    While Synapse Analytics Workspace is new, I think you would be the most comfortable with it. Also, I do not see Synapse being deprecated any time soon.

    • Data Factory is like a visual SSIS for the cloud. Fortunately, Synapse Analytics Workspace comes with a Data Factory inside it!
    • Synapse Analytics revamps Azure SQL Data Warehouse. That is T-SQL, so your skills (mostly) transfer over (see the sketch after this list).
    • The Spark jobs can be written in Python, another way to use Synapse.
    • If you want to continue to use your existing SSIS packages, there is an integration available.
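
    To illustrate how directly that skill carries over, a dedupe on your unique id column looks the same in Synapse as it would on a regular SQL Server (the table and column names here are hypothetical):

        -- Hypothetical staging table; keep exactly one row per unique id
        -- before aggregating, in case the same row was delivered twice.
        WITH Ranked AS (
            SELECT Id, Dim1, Dim2, Dim3, Metric,
                   ROW_NUMBER() OVER (PARTITION BY Id ORDER BY LoadTime DESC) AS rn
            FROM stg.RawRows
        )
        SELECT Dim1, Dim2, Dim3,
               COUNT(*)    AS RowCnt,
               SUM(Metric) AS MetricSum
        FROM Ranked
        WHERE rn = 1
        GROUP BY Dim1, Dim2, Dim3;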

    Databricks has an established presence, available in places beyond Azure.
    While Databricks does allow you to write jobs in SQL and in Python, it is not itself a database. Databricks can connect to a database, but the back-and-forth can add up. To get the full benefit of Databricks, you will need to learn about "dataframes".

    HDInsight is very heavyweight and is always on. While you can do SQL work in HDInsight, you are also responsible for managing the cluster.
    I am not sure how HDInsight compares in cost to a large dedicated SQL pool. In my personal opinion, HDInsight has a much steeper learning curve than Synapse.

    Overall, I say Azure Synapse Analytics is the way to go.

    If this answers your question, please mark as accepted answer.
    Happy Holidays of your choice!
