Data Lake as a storage/database for Express Angular Application?

Jayesh Nathani 40 Reputation points
2024-06-12T18:28:04.58+00:00

Currently I'm using SQL Server for our structured data. A client uploads a file containing a minimum of a million records; the file is uploaded to Blob Storage, and those records are then inserted into different tables in SQL. What I want to know is: can I move this architecture entirely to a Data Lake? Does a Data Lake support transactions? I also have WebJobs and Function Apps in Azure, and the plan is to convert the Function Apps to PySpark jobs as well, so in short everything would move from Azure to Fabric. However, I haven't found any article that supports this, and I don't know what impact it will have on performance.

Since we are receiving millions of records from a single customer, the client is pushing us to move towards a Data Lake. I would like to hear your thoughts on this.

Thanks!

Azure SQL Database
Azure Data Lake Storage

Accepted answer
  1. Nehruji R 3,971 Reputation points Microsoft Vendor
    2024-06-13T08:01:50.1433333+00:00

    Hello Jayesh Nathani,

    Greetings! Welcome to Microsoft Q&A Platform.

    A data lake is a centralized repository that ingests and stores large volumes of data in its original form. The data can then be processed and used as a basis for a variety of analytic needs. Due to its open, scalable architecture, a data lake can accommodate all types of data from any source, from structured (database tables, Excel sheets) to semi-structured (XML files, webpages) to unstructured (images, audio files, tweets), all without sacrificing fidelity.

    You can indeed move your architecture from SQL Server to a data lake. Before doing so, consider the following factors:

    Data Structure: Data lakes hold data of all types, including raw and unprocessed data. Ensure that your data pipeline can handle this diversity.

    Processing: Data lakes allow you to process data using tools like PySpark, but you'll need to design your processing logic accordingly (see the sketch after this list).

    Querying: While SQL is commonly used with data lakes, it’s essential to understand that querying unstructured data might require different approaches.

    Security: Implement proper access controls to secure your data lake.
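    As an illustration of that processing step, below is a minimal PySpark sketch that reads a raw CSV landed in ADLS Gen2 and enforces a schema on it. The storage account, container, path, and column names are hypothetical placeholders, and it assumes a Spark environment with access to the lake (e.g., Synapse, Databricks, or Fabric).

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    spark = SparkSession.builder.appName("ingest-demo").getOrCreate()

    # Explicit schema: data lake files are schemaless, so structure is
    # applied at read time rather than enforced by the store itself.
    schema = StructType([
        StructField("customer_id", LongType(), nullable=False),
        StructField("name", StringType(), nullable=True),
        StructField("status", StringType(), nullable=True),
    ])

    raw = (
        spark.read
        .option("header", "true")
        .schema(schema)
        .csv("abfss://uploads@mydatalake.dfs.core.windows.net/incoming/records.csv")
    )
    raw.printSchema()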

    Azure Data Lake Storage is a scalable storage solution that can hold and protect data in one central place, eliminating silos at an optimal cost.

    Refer to https://azure.microsoft.com/en-us/resources/cloud-computing-dictionary/what-is-a-data-lake/ and https://www.mssqltips.com/sqlservertip/4970/data-lake-vs-data-warehouse/ for more background.

    Since data lakes do not inherently support transactions in the same way relational databases do, consider using Azure Databricks or Azure Synapse Analytics (formerly SQL Data Warehouse) for more transactional capabilities.
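    One common way to get ACID behavior on top of lake storage is the Delta Lake format used by Databricks, Synapse, and Fabric. Continuing the sketch above (reusing the `spark` session and `raw` DataFrame; the table path is again a placeholder):

    from delta.tables import DeltaTable

    target_path = "abfss://lake@mydatalake.dfs.core.windows.net/tables/customers"

    # Initial load: the write is atomic, so readers never see a half-written table.
    raw.write.format("delta").mode("overwrite").save(target_path)

    # Upsert: MERGE runs as a single transaction, comparable to SQL Server's MERGE.
    target = DeltaTable.forPath(spark, target_path)
    (
        target.alias("t")
        .merge(raw.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )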

    Performance-wise, keep the following in mind:

    • Data lakes handle large volumes of data efficiently, especially when the data is partitioned (see the sketch after this list).
    • Processing data in a data lake can take longer because of its raw format.
    • Evaluate the cost of both storage and processing.
    • Data Lake Analytics: use tools like PySpark for efficient data processing.
    • Data Lakehouse: explore the lakehouse concept, which combines data lake and data warehouse features.
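    Partition layout is one of the main performance levers in a lake: a filter on a partition column lets Spark skip whole directories instead of scanning every file. A small sketch, again with placeholder paths and continuing from the example above:

    # Write the table partitioned by a low-cardinality column.
    (
        raw.write.format("delta")
        .mode("overwrite")
        .partitionBy("status")
        .save("abfss://lake@mydatalake.dfs.core.windows.net/tables/customers_by_status")
    )

    # A filter on the partition column prunes untouched partitions entirely.
    active = (
        spark.read.format("delta")
        .load("abfss://lake@mydatalake.dfs.core.windows.net/tables/customers_by_status")
        .where("status = 'active'")
    )
    print(active.count())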

    Similar threads for reference: https://learn.microsoft.com/en-us/answers/questions/1108571/why-sql-server-%28and-when%29-instead-of-data-bricks-o, https://stackoverflow.com/questions/77599641/performance-tuning-tips-for-writing-a-pyspark-dataframe-to-an-azure-sql-database, https://learn.microsoft.com/en-us/azure/azure-functions/python-scale-performance-reference

    Once the data is available in a data warehouse system, you can either populate reports from it directly or create a tabular model on top of it using Azure Analysis Services and then feed that tabular model into Power BI reports. Azure Data Factory or Azure Synapse pipelines can be used to create the ETL pipelines. Azure Synapse Analytics also gives you the capability to query data on data lake storage directly using either a SQL runtime or a Spark runtime, and Power BI can connect directly to Azure Synapse Analytics to generate reports.
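    To illustrate querying the lake directly, the sketch below registers a Delta table as a temporary view and runs SQL against it from the Spark runtime (paths and names are again placeholders):

    spark.read.format("delta") \
        .load("abfss://lake@mydatalake.dfs.core.windows.net/tables/customers") \
        .createOrReplaceTempView("customers")

    summary = spark.sql("""
        SELECT status, COUNT(*) AS record_count
        FROM customers
        GROUP BY status
    """)
    summary.show()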

    Yes, it's possible to submit a Python job to an Azure Synapse Spark pool from an Azure Function. Refer to https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-job-definitions
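    As a rough sketch of what that submission could look like from Python, the snippet below posts a batch job to a Synapse Spark pool's Livy-compatible REST endpoint. The workspace name, pool name, and file path are placeholders, and the API version and payload fields should be verified against the current Synapse REST documentation before relying on them.

    import requests
    from azure.identity import DefaultAzureCredential

    workspace = "myworkspace"   # placeholder
    pool = "mysparkpool"        # placeholder

    # Acquire a token for the Synapse data-plane scope.
    token = DefaultAzureCredential().get_token(
        "https://dev.azuresynapse.net/.default"
    ).token

    resp = requests.post(
        f"https://{workspace}.dev.azuresynapse.net/livyApi/versions/"
        f"2019-11-01-preview/sparkPools/{pool}/batches",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "name": "ingest-job",
            "file": "abfss://jobs@mydatalake.dfs.core.windows.net/ingest.py",
            "driverMemory": "4g", "driverCores": 2,
            "executorMemory": "4g", "executorCores": 2, "numExecutors": 2,
        },
    )
    resp.raise_for_status()
    print("Batch job id:", resp.json().get("id"))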

    Similar threads for reference: https://stackoverflow.com/questions/77010742/how-to-convert-a-pyspark-notebook-to-a-job, https://stackoverflow.com/questions/72664052/calling-databricks-python-notebook-in-azure-function, https://gist.github.com/lordlinus/21c53f2fad9c63fe29c328599383f992

    For more information, see the official documentation for Azure Data Lake Storage, Azure Data Factory, Azure Synapse Analytics, Azure Analysis Services, and PySpark.

    Hope this answer helps! Please let us know if you have any further queries. I’m happy to assist you further.


    Please "Accept the answer" and "up-vote" wherever the information provided helps you, as this can be beneficial to other community members.

