environment design

arkiboys 9,711 Reputation points
2022-02-15T14:46:48.693+00:00

As described below, we are trying to capture real-time data and store it in Azure, possibly in ADLS Gen2.


Requirements:
1- Capture real-time transaction data from external providers and store it in Azure (possibly ADLS Gen2)...
2- Thousands (not millions) of rows come in
3- Each feed has a lot of columns; some may have over 100
4- No negative impact on performance
5- Make the captured data in Azure available to all systems, i.e. Power BI, Python/.NET apps, etc.
6- The data arrives in different file formats and from various data sources...
7- The feeds have a lot of columns and perhaps thousands of rows...
8- Data arrives regularly, at various times...
9- We may need to scrape websites to pull data through


Questions:

Should we:
use a Synapse workspace?
store the captured data as .parquet?
use a serverless or dedicated SQL pool?
use Azure SQL Server?
perhaps use Event Hubs to capture real-time data?
use Databricks or Synapse notebooks?
etc.

Thank you

Azure Data Factory

An Azure service for ingesting, preparing, and transforming data at scale.

Answer accepted by question author

ShaikMaheer-MSFT 38,631 Reputation points Microsoft Employee Moderator
2022-02-16T13:57:21.283+00:00

Hello @arkiboys ,

Thanks for the question and for using the Microsoft Q&A platform.

If I understand your question correctly, you are looking for services to take data from external sources, load it into ADLS, and then serve it from there to downstream apps such as Power BI, Python, or .NET apps.

  • Azure Synapse pipelines and Azure Data Factory have connectors for a large number of sources, so I would recommend using them to load data into ADLS from your source systems. The ADF and Synapse documentation lists all available connectors. If your sources are sensors, IoT devices, or event streams, you can consider Event Hubs.
  • Storing the captured data in ADLS in .parquet format is a good choice, as Parquet is a columnar format that stores large data efficiently and suits big data analytics.
  • If you wish to query data directly from the ADLS files, a serverless SQL pool can do the job using external tables. If you are looking for provisioned SQL resources with their own storage and compute, consider a dedicated SQL pool.
  • As most of these resources come under the Synapse umbrella, using Synapse notebooks will give better manageability. I feel Azure Databricks is not needed, as Synapse notebooks can do that job within the same workspace.
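
To make the Parquet and serverless SQL pool points a bit more concrete, here is a minimal Python sketch. It assumes a date-partitioned folder layout under a `raw/` folder, which is only one possible convention and not something the question specifies. The names `landing_path` and `openrowset_query`, and the provider, feed, account, and container values, are hypothetical. The query uses `OPENROWSET`, which is an ad hoc alternative to the external tables mentioned above; it reads the Parquet files in place without defining a table first.

```python
from datetime import datetime, timezone


def landing_path(provider: str, feed: str, arrived: datetime) -> str:
    """Return a date-partitioned relative path for one incoming Parquet file.

    The raw/<provider>/<feed>/year=/month=/day= layout is an assumption,
    chosen so that queries can later filter or wildcard by date folder.
    `arrived` should be timezone-aware; it is normalised to UTC.
    """
    stamp = arrived.astimezone(timezone.utc)
    return (
        f"raw/{provider}/{feed}/"
        f"year={stamp:%Y}/month={stamp:%m}/day={stamp:%d}/"
        f"{feed}_{stamp:%Y%m%dT%H%M%SZ}.parquet"
    )


def openrowset_query(account: str, container: str, provider: str,
                     feed: str, top: int = 100) -> str:
    """Build a serverless SQL pool query over every Parquet file of a feed.

    Inputs are assumed to be trusted configuration values; this sketch
    does no escaping and should not be fed user-supplied strings.
    """
    url = (
        f"https://{account}.dfs.core.windows.net/{container}/"
        f"raw/{provider}/{feed}/year=*/month=*/day=*/*.parquet"
    )
    return (
        f"SELECT TOP {int(top)} *\n"
        f"FROM OPENROWSET(BULK '{url}', FORMAT = 'PARQUET') AS [result];"
    )


if __name__ == "__main__":
    # Hypothetical provider, feed, storage account, and container names.
    when = datetime(2022, 2, 15, 14, 46, 48, tzinfo=timezone.utc)
    print(landing_path("acme", "trades", when))
    print(openrowset_query("myaccount", "landing", "acme", "trades", top=10))
```

A pipeline copy activity or a notebook would write each incoming file to the path from `landing_path`, and Power BI or a Python/.NET app could then run the generated query against the serverless SQL endpoint.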

Since you need to move and transform data from external sources into Azure and then serve it to Power BI or other applications, the solution above should work well.

Please do let me know how it goes.

---------------------------------------------------------------------------------

Please consider hitting Accept Answer. Accepted answers help the community as well.
