How to convert LOTS and LOTS of log data stored in JSON format in Azure Data Lake (Gen2) to parquet files?

JasonW-5564 161 Reputation points
2021-01-16T05:06:51.797+00:00

I have tens of thousands of devices in the field that store their logs to a centralized Azure Data Lake Gen2 blob storage account. The data is in JSON format. The data is actually very flat, without many nested arrays; it could almost as easily be CSV, but JSON is what I get and I cannot change that. What I want to do is convert these JSON files to Parquet format so that I can query them from Azure Synapse for some deep-dive data analytics. The data lands in a folder structure as follows: YYYY/MM/DD/HH/<1000's of .json files in the HH folders>. I am assuming Azure Data Factory is the way to go here, as most of our experience is with traditional SSIS processing.

I want to convert all my existing .json files now. Long term, I want it to run every hour and convert the "new" files in the latest HH folder, since the previous folders do not change or get new files added; I just need to run the process against the "next" HH folder. Any help/direction is greatly appreciated as we are new to this world and come from a more traditional SQL Server/SSIS environment.
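
To make the goal concrete, the one-time backfill amounts to something like the following (sketched in PySpark, since Synapse and Databricks are tagged below; the storage account, container, and folder names are placeholders, not my real ones, and I'm not tied to Spark):

```python
# One-off backfill: read every existing JSON log file under the
# YYYY/MM/DD/HH hierarchy and rewrite it as Parquet.
# Assumes a Spark session (Synapse Spark pool or Databricks) that is
# already configured with access to the ADLS Gen2 account.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder account/container/folder names.
source = "abfss://logs@mystorageaccount.dfs.core.windows.net/*/*/*/*/*.json"
target = "abfss://curated@mystorageaccount.dfs.core.windows.net/logs-parquet"

# The logs are flat JSON, so a straight read and write is enough.
df = spark.read.json(source)
df.write.mode("overwrite").parquet(target)
```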

Tags: Azure Synapse Analytics · Azure Databricks · Azure Data Factory

1 answer

  1. HarithaMaddi-MSFT 10,126 Reputation points
    2021-01-18T07:36:08.54+00:00

    Hi @JasonW-5564 ,

    Welcome to Microsoft Q&A Platform. Thanks for posting the query.

    The "Copy new files by LastModifiedDate" template in Data Factory seems to be an appropriate approach for this requirement. That document uses binary source and binary sink datasets because it demonstrates copying any type of file; for your requirement, since the source type is JSON and the destination type is Parquet, the corresponding datasets can be created and used in the same way. The article also uses a tumbling window trigger to pass the window parameters and run the pipeline on a schedule (in this case, every 1 hour). More details and options for tumbling window triggers can be found in this document.
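
    If it helps to see the hourly logic in code form, here is a minimal sketch of what each tumbling window amounts to, assuming the window start time (in ADF, @trigger().outputs.windowStartTime) is passed in as a parameter. The storage account, container, and function names are placeholders; the template above achieves the same result with a Copy activity rather than Spark.

    ```python
    # Hourly incremental conversion of one HH folder, driven by the
    # tumbling window trigger's window start time.
    from datetime import datetime
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def convert_hour(window_start: datetime) -> None:
        """Convert the JSON files in one YYYY/MM/DD/HH folder to Parquet."""
        hour_path = window_start.strftime("%Y/%m/%d/%H")
        source = f"abfss://logs@mystorageaccount.dfs.core.windows.net/{hour_path}/*.json"
        target = f"abfss://curated@mystorageaccount.dfs.core.windows.net/logs-parquet/{hour_path}"
        spark.read.json(source).write.mode("overwrite").parquet(target)

    # Example: the window covering 05:00-06:00 UTC on 2021-01-16.
    convert_hour(datetime(2021, 1, 16, 5))
    ```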

    Please let us know if you have further queries and we will be glad to assist.
