Querying Parquet file from ADL 2.0

rajendar erabathini 616 Reputation points
2021-03-01T09:37:43.37+00:00

Hi - I need some design recommendations for my requirement. I have a list of Parquet files stored in Data Lake. I need to execute some SQL statements against them, e.g. select max(amount), count(*), and return the results so that I can write them back to my on-premises SQL Server. Please note that I am looking for a solution that does not stage the data in an intermediate Dedicated SQL Pool using Copy or external tables.

  1. Using the OPENROWSET() function in the SQL serverless pool only helps to analyse the data, but I can't return the result set and store it on my on-premises server using data pipelines. Please correct me if my understanding is not right here.
  2. Data pipelines allow transferring data from ADLS to other stores, but they don't have the capability to query and return results so that I can sink them to on-premises SQL Server using a Copy activity. So it looks like this can't be achieved without first storing to a SQL pool either.
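On point 1, the serverless pool does expose a standard SQL endpoint, so a result set from OPENROWSET can in principle be fetched by any SQL client. Below is a minimal, hedged sketch in Python with pyodbc; the workspace endpoint, credentials, container path, and the `amount` column are all placeholders, not values from this thread.

```python
# Sketch: run an aggregate OPENROWSET query against a Synapse serverless
# SQL endpoint and fetch the one-row result. All names are placeholders.

def build_query(parquet_path: str) -> str:
    """Build a serverless-pool aggregate query over a Parquet path."""
    return (
        "SELECT MAX(amount) AS max_amount, COUNT(*) AS row_count "
        f"FROM OPENROWSET(BULK '{parquet_path}', FORMAT = 'PARQUET') AS rows"
    )

def fetch_aggregates(conn_str: str, parquet_path: str):
    """Execute the query over an ODBC connection (not run in this sketch)."""
    import pyodbc  # deferred import; requires the MS ODBC driver
    with pyodbc.connect(conn_str) as conn:
        return conn.execute(build_query(parquet_path)).fetchone()

# Show the generated T-SQL for a hypothetical lake path.
query = build_query(
    "https://<account>.dfs.core.windows.net/<container>/sales/*.parquet"
)
print(query)
```

The fetched row could then be inserted into the on-premises database directly, or staged for a pipeline to pick up, which is what the answer below ends up recommending.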

One more option I am considering: can I pass the Parquet file name to a Spark notebook, and does it have the capability to run the queries and return the results so that I can store the results to the on-premises server? I would use a Data pipeline Copy activity to automate the workflow. Is it possible to implement it the way I am thinking? Please confirm.

Please let me know other possible options to achieve my objective.

thanks

Tags: Azure Synapse Analytics, Azure Data Factory, Azure Data Lake Analytics

1 answer

  1. Samara Soucy - MSFT 5,051 Reputation points
    2021-03-04T21:28:22.39+00:00

    It doesn't have to be a SQL Pool, but you will need to write staging data somewhere. This includes using a Notebook/Spark Pool to do the transformations. The Data Flow activity in Pipelines is a codeless method of doing the transformations you need that runs on top of a Spark cluster, but it also requires outputting the data into a file or database.

    Writing back to the Data Lake is going to be your best option, especially if you don't want to create a SQL Pool. It will also give you a way to troubleshoot bugs, and if the sink back to on-prem fails due to network issues you won't have to start over completely, so there are benefits to doing it that way.
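    To make the staging step concrete, here is a hypothetical sketch of a notebook's final step: writing the one-row aggregate result to a staging file that a Copy activity can then sink to on-premises SQL Server. The staging path, folder layout, and column names are assumptions; in Synapse the path would be an `abfss://` lake URI rather than a local folder.

```python
# Sketch: stage a small aggregate result as CSV for a Copy activity
# to pick up. Paths and column names are placeholders.
import os
import pandas as pd

# One-row aggregate result produced by the earlier query step.
result = pd.DataFrame({"max_amount": [25.5], "row_count": [3]})

# Write a single CSV under a dated staging folder (local stand-in for
# an abfss:// path in the lake).
staging_path = "staging/2021-03-01/aggregates.csv"
os.makedirs(os.path.dirname(staging_path), exist_ok=True)
result.to_csv(staging_path, index=False)
print(open(staging_path).read())
```

    A dated folder per run also gives you the restartability mentioned above: if the on-prem sink fails, the Copy activity can simply re-read the staged file.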
