Which is the best option for data warehouse / Power BI solution?
Hello,
I am building a data warehouse in Azure from several external data sources. Data can be imported via pipelines in Data Factory (or Azure Synapse). Data is to be stored in Azure. Reporting through Power BI is required.
There seems to be a multitude of ways to achieve this. I could store the data in an Azure SQL database, or use Blob Storage. I could use Azure Data Lake Storage. I could use Azure Analysis Services or Azure Synapse Analytics to integrate with Power BI.
For example, as I understand it, I could do something like this:
External DBs-->Azure Data Factory-->Azure Blob Storage/Azure Data Lake-->Azure Synapse Analytics-->Power BI Reports
But it might be better (or worse) to do this:
External DBs-->Azure Data Factory-->Azure SQL Database-->Azure Analysis Services-->Power BI Reports
I am looking for advice on why I would use one particular approach rather than the other.
Thanks.
System2000 (accepted answer, 2022-06-22)
I'm accepting my own comment as the answer to close this question.
As there were thousands of tables in the source databases, I decided it would be more cost-effective to store them in Azure Data Lake Storage Gen2 and create a logical data warehouse in an Azure Synapse serverless SQL pool.
I used Azure Data Factory to ingest the data rather than Azure Synapse pipelines, because the latter had some issues with on-premises integration runtimes. I implemented an incremental loading mechanism and stored the data (as CSV) in a folder structure like this:
filesystem/database_name/table_name/YYYY/MM/DD/data.csv
Where YYYY/MM/DD represents the ingestion date.

In the Synapse serverless SQL pool I created views using OPENROWSET to read from the data lake. However, I found that many of these views failed when relying on inferred data types, so the data types needed to be defined explicitly in each view. Columns with data types such as IMAGE, TEXT, etc. also caused the views to fail, so I needed to exclude those columns. I used | (pipe) and ~ as the field terminator and field quote, as the data contained lots of commas and " marks, which caused failures with the default settings.
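For illustration, a minimal sketch of one of these views might look like this (the storage account, container, table and column names are placeholders, not the real ones):

    CREATE VIEW dbo.vw_customer
    AS
    SELECT
        result.*,
        -- expose the YYYY/MM/DD ingestion-date folders matched by the wildcards
        result.filepath(1) AS ingest_year,
        result.filepath(2) AS ingest_month,
        result.filepath(3) AS ingest_day
    FROM OPENROWSET(
            BULK 'https://mystorageaccount.dfs.core.windows.net/filesystem/database_name/customer/*/*/*/data.csv',
            FORMAT = 'CSV',
            FIELDTERMINATOR = '|',  -- pipe instead of comma, because the data contains commas
            FIELDQUOTE = '~',       -- ~ instead of ", because the data contains " marks
            FIRSTROW = 2            -- assumes a header row was written with each file
        )
        -- explicit data types instead of schema inference, which proved unreliable
        WITH (
            customer_id   INT,
            customer_name NVARCHAR(200),
            created_date  DATETIME2
        ) AS result;

Filtering on ingest_year/ingest_month/ingest_day when querying the view also helps limit how much of the data lake a query has to scan.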
Once the serverless SQL pool was set up in Azure, it was relatively simple to connect to it from Power BI. However, the serverless SQL pool is not ideal for DirectQuery mode in Power BI - there were timeouts and performance issues - so Import mode needed to be used.
In a future phase we should define which tables are useful, then clean/curate this data and convert it to a dimensional model in Parquet format using CETAS, and consider the partitioning structure in the data lake to optimise query performance.
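As a rough sketch of what that CETAS step might look like (the storage account, data source, table and column names below are placeholders, and the external data source and file format would need to be created first, possibly with a database scoped credential):

    -- one-off setup for the curated zone (placeholder names)
    CREATE EXTERNAL DATA SOURCE curated_zone
    WITH ( LOCATION = 'https://mystorageaccount.dfs.core.windows.net/filesystem/curated' );

    CREATE EXTERNAL FILE FORMAT parquet_format
    WITH ( FORMAT_TYPE = PARQUET );

    -- materialise a cleaned dimension as Parquet files via CETAS
    CREATE EXTERNAL TABLE dbo.dim_customer
    WITH (
        LOCATION = 'dim_customer/',
        DATA_SOURCE = curated_zone,
        FILE_FORMAT = parquet_format
    )
    AS
    SELECT
        customer_id,
        UPPER(LTRIM(RTRIM(customer_name))) AS customer_name,  -- example of light cleaning
        created_date
    FROM dbo.vw_customer;  -- the CSV-backed view from the earlier step

As far as I know, a single CETAS statement writes all of its output into one folder, so the partitioning structure in the curated zone would have to come from how the CETAS statements and target folders are organised rather than from CETAS itself.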