Which is the best option for a data warehouse / Power BI solution?

System2000 236 Reputation points
2021-10-19T10:02:24.153+00:00

Hello,

I am building a data warehouse in Azure from several external data sources. Data can be imported via pipelines in Data Factory (or Azure Synapse). Data is to be stored in Azure. Reporting through Power BI is required.

There seem to be a multitude of methods to achieve this. I could store the data in an Azure SQL database, or use blob storage. I could use an Azure Data Lake. I could use Azure Analysis Services or Azure Synapse Analytics to integrate with Power BI.

For example, as I understand it, I could do something like this:
External DBs-->Azure Data Factory-->Azure Blob Storage/Azure Data Lake-->Azure Synapse Analytics-->Power BI Reports

But it might be better (or worse) to do this:
External DBs-->Azure Data Factory-->Azure SQL Database-->Azure Analysis Services-->Power BI Reports

I am looking for advice on why I would use one particular approach rather than the other.

Thanks.


Accepted answer
  1. System2000 236 Reputation points
    2021-10-28T20:31:34.033+00:00

    I'm accepting my own comment as answer to close this question.


5 additional answers

  1. System2000 236 Reputation points
    2022-06-22T09:22:46.797+00:00

    Hi @Bernardo Romero

    As there were thousands of tables in the source databases, I decided it would be more cost effective to store them in Azure Data Lake Gen 2 and create a logical data warehouse in Azure Synapse Serverless SQL Pool.

    I used Azure Data Factory to ingest the data instead of Azure Synapse pipelines, because the latter had some issues with on-premises integration runtimes. I implemented an incremental loading mechanism and stored the data (as CSV) in a folder structure like this:
    filesystem/database_name/table_name/YYYY/MM/DD/data.csv
    where YYYY/MM/DD represents the ingestion date.
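    As a rough sketch of how that layout can be queried from the serverless SQL pool, the date folders can be exposed as columns via the `filepath()` function on the wildcards. The storage account URL, container, and folder names below are placeholders, not the actual environment from this thread:

    ```sql
    -- Sketch only: 'mystorageacct', 'filesystem', 'database_name' and
    -- 'table_name' are hypothetical placeholders.
    SELECT TOP 10
        r.filepath(1) AS yyyy,   -- first wildcard: year folder
        r.filepath(2) AS mm,     -- second wildcard: month folder
        r.filepath(3) AS dd,     -- third wildcard: day folder
        r.*
    FROM OPENROWSET(
            BULK 'https://mystorageacct.dfs.core.windows.net/filesystem/database_name/table_name/*/*/*/data.csv',
            FORMAT = 'CSV',
            PARSER_VERSION = '2.0',
            HEADER_ROW = TRUE
        ) AS r;
    ```

    Exposing the ingestion date this way makes it straightforward to filter for the latest load of each table.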

    In the Synapse serverless SQL pool I created views using OPENROWSET to read from the data lake. However, many of these views failed when relying on inferred data types, so the data types needed to be defined explicitly in each view. Source data types such as IMAGE and TEXT also caused the views to fail, so I had to exclude those columns. I used | (pipe) as the field terminator and ~ as the field quote, because the data contained many commas and " characters, which caused parsing failures.
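    A view with explicit types, the pipe terminator, and the ~ field quote might look something like the sketch below. The view name, storage URL, and column list are illustrative assumptions only:

    ```sql
    -- Hypothetical view over a hypothetical 'customer' table in the lake.
    CREATE VIEW dbo.vw_customer AS
    SELECT *
    FROM OPENROWSET(
            BULK 'https://mystorageacct.dfs.core.windows.net/filesystem/database_name/customer/*/*/*/data.csv',
            FORMAT = 'CSV',
            PARSER_VERSION = '2.0',
            HEADER_ROW = TRUE,
            FIELDTERMINATOR = '|',
            FIELDQUOTE = '~'
        )
        WITH (
            customer_id   INT,
            customer_name VARCHAR(200),
            created_date  DATETIME2
            -- IMAGE/TEXT columns from the source are deliberately omitted,
            -- since they caused the views to fail
        ) AS rows;
    ```

    The explicit WITH clause avoids the type-inference failures mentioned above, at the cost of maintaining the column list per table.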

    Once the serverless SQL pool was set up in Azure, it was relatively simple to connect from Power BI. However, the serverless SQL pool is not ideal for DirectQuery mode in Power BI - there were timeouts and performance issues - so Import mode had to be used.

    In a future phase we should identify which tables are actually useful, clean and curate that data, convert it to a dimensional model in Parquet format using CETAS, and design the partitioning structure in the data lake to optimise query performance.
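    For reference, a CETAS statement that materialises a curated dimension to Parquet could look roughly like this. The external data source, file format, table, and source view names are all hypothetical:

    ```sql
    -- One-time setup: a Parquet file format and a target location
    -- ('curated' container name is a placeholder).
    CREATE EXTERNAL FILE FORMAT ParquetFormat
        WITH (FORMAT_TYPE = PARQUET);

    CREATE EXTERNAL DATA SOURCE CuratedZone
        WITH (LOCATION = 'https://mystorageacct.dfs.core.windows.net/curated');

    -- CETAS: writes the query result to Parquet files in the lake and
    -- registers an external table over them.
    CREATE EXTERNAL TABLE dbo.dim_customer
        WITH (
            LOCATION = 'dim_customer/',
            DATA_SOURCE = CuratedZone,
            FILE_FORMAT = ParquetFormat
        )
    AS
    SELECT customer_id, customer_name, created_date
    FROM dbo.vw_customer;  -- hypothetical source view over the raw CSVs
    ```

    Querying the resulting Parquet files is typically much cheaper and faster than re-scanning the raw CSVs on every report refresh.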

