This article presents a solution for an enterprise data warehouse in Azure that:
- Brings together all your data, no matter the scale or format.
- Provides a way for all your users to get insights from your data through analytical dashboards, operational reports, and advanced analytics.
Apache® and Apache Spark are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
Dataflow
- Azure Synapse Analytics pipelines bring together structured, unstructured, and semi-structured data, such as logs, files, and media. The pipelines store the data in Azure Data Lake Storage.
- Apache Spark pools in Azure Synapse Analytics clean and transform the Data Lake Storage data.
- Azure Synapse Analytics combines the processed data with existing structured data, creating one unified data hub.
- A dedicated SQL pool makes the data available for operational reports and analytical dashboards that derive insights. Azure Analysis Services serves the reports and dashboards to thousands of end users.
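The four dataflow steps above can be sketched in plain Python as a rough illustration. This sketch doesn't call any Azure API. The in-memory values stand in for Data Lake Storage and for the existing structured data, and every function name, field name, and sample record is hypothetical.

```python
import json

# Hypothetical raw inputs, standing in for files landed in the data lake.
RAW_LOG_LINES = [
    '{"device": "a1", "temp_c": "21.5"}',
    '{"device": "a2", "temp_c": "bad"}',  # malformed reading
    '{"device": "a1", "temp_c": "22.5"}',
]

# Hypothetical existing structured data (for example, a device lookup table).
DEVICES = {"a1": {"site": "Oslo"}, "a2": {"site": "Lima"}}


def ingest(lines):
    """Step 1: bring semi-structured records together (pipeline stand-in)."""
    return [json.loads(line) for line in lines]


def clean_and_transform(records):
    """Step 2: drop unparseable readings and cast types (Spark pool stand-in)."""
    cleaned = []
    for rec in records:
        try:
            cleaned.append({"device": rec["device"], "temp_c": float(rec["temp_c"])})
        except (KeyError, ValueError):
            continue  # skip records that can't be repaired
    return cleaned


def combine(cleaned, devices):
    """Step 3: join the processed data with the existing structured data."""
    return [{**rec, **devices[rec["device"]]} for rec in cleaned if rec["device"] in devices]


def serve_average_by_site(rows):
    """Step 4: compute a small aggregate that a report or dashboard could show."""
    totals = {}
    for row in rows:
        total, count = totals.get(row["site"], (0.0, 0))
        totals[row["site"]] = (total + row["temp_c"], count + 1)
    return {site: total / count for site, (total, count) in totals.items()}


def run_dataflow():
    """Chain the four stages in the same order as the dataflow list."""
    return serve_average_by_site(combine(clean_and_transform(ingest(RAW_LOG_LINES)), DEVICES))
```

With the sample data, `run_dataflow()` returns `{"Oslo": 22.0}`, because the malformed reading from device `a2` is dropped during cleaning.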
Components
- Azure Synapse Analytics is an analytics service for data warehouses and big data systems. This tool uses a massively parallel processing architecture and has deep integration with Azure services.
- Azure Synapse Analytics pipelines provide a way for you to create, schedule, and orchestrate workflows, such as extract, load, transform (ELT) and extract, transform, load (ETL) workflows.
- Azure Blob Storage provides massively scalable, cost-effective object storage for any type of unstructured data, such as images, videos, audio, and documents.
- Data Lake Storage is a storage repository that holds a large amount of data in its native, raw format. Data Lake Storage is built on top of Blob Storage. As a result, Data Lake Storage offers the scalability, tiered storage, high availability, and disaster recovery capabilities of Blob Storage.
- Azure Synapse Analytics Spark pools provide a parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications.
- Analysis Services is an enterprise-grade analytics engine that provides an easy way for users to perform ad hoc data analysis. You can use Analysis Services to govern, test, and deliver business solutions at scale.
- Power BI is a suite of business analytics tools that deliver insights throughout your organization. You can use Power BI to connect to hundreds of data sources, simplify data preparation, and drive ad hoc analysis. You can also produce beautiful reports and publish them for your organization to consume on the web and across mobile devices.
An enterprise data warehouse brings all your data together, no matter the source, format, or scale. A data warehouse also provides a way for you to run high-performance analytics on your data, so you can gain insights through analytical dashboards, operational reports, and advanced analytics.
This solution establishes a data warehouse that:
- Is a single source of truth for your data.
- Integrates relational data sources with other unstructured datasets.
- Uses semantic modeling and powerful visualization tools for simpler data analysis.
To integrate data into a unified platform, this solution uses Azure Synapse Analytics pipelines. These pipelines offer ELT and ETL capabilities: you can use them to build data-driven workflows that move data, and they work with various data formats and structures.
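The difference between ETL and ELT is where the transformation runs relative to the load. The following minimal sketch contrasts the two orderings. It isn't Synapse pipeline code: Python's built-in `sqlite3` in-memory database stands in for the target data store, and the table names and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical source rows: (customer, amount as text, possibly padded).
SOURCE_ROWS = [("alice", " 10.50 "), ("bob", "4.25"), ("alice", "2.00")]

TOTALS_QUERY = "SELECT customer, SUM(amount) FROM {table} GROUP BY customer ORDER BY customer"


def etl(conn):
    """Extract, transform in application code, then load the cleaned rows."""
    transformed = [(name, float(amount.strip())) for name, amount in SOURCE_ROWS]
    conn.execute("CREATE TABLE sales_etl (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", transformed)
    return conn.execute(TOTALS_QUERY.format(table="sales_etl")).fetchall()


def elt(conn):
    """Extract, load the raw rows as-is, then transform inside the data store."""
    conn.execute("CREATE TABLE sales_raw (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", SOURCE_ROWS)
    # The transformation runs as SQL in the target store, after loading.
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT customer, CAST(TRIM(amount) AS REAL) AS amount FROM sales_raw"
    )
    return conn.execute(TOTALS_QUERY.format(table="sales_elt")).fetchall()


def compare():
    """Run both orderings against one in-memory database."""
    conn = sqlite3.connect(":memory:")
    try:
        return etl(conn), elt(conn)
    finally:
        conn.close()
```

Both functions produce the same totals. The ELT variant also keeps the raw rows in `sales_raw`, so they can be transformed again later without returning to the source.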
The pipelines store the data in Data Lake Storage, which is built on Blob Storage. This storage service can handle large volumes of unstructured data.
Azure Synapse Analytics Spark pools form a key part of the solution. These pools clean and transform the data that's stored in Data Lake Storage. Their parallel processing framework supports in-memory processing for speed and efficiency. The pools also support autoscaling, so they can add or remove nodes as needed.
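The partition-then-merge pattern behind parallel cleaning can be illustrated with the Python standard library. This is only a sketch of the idea, not Spark code. The worker threads stand in for pool nodes, the function names are hypothetical, and Python threads don't provide the distributed execution that a Spark pool does.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, num_partitions):
    """Split items into roughly equal partitions that are held in memory."""
    return [items[i::num_partitions] for i in range(num_partitions)]


def clean_partition(part):
    """Clean one partition independently: trim, lowercase, drop empty values."""
    return [value.strip().lower() for value in part if value.strip()]


def clean_in_parallel(items, num_workers=3):
    """Clean each partition on its own worker, then merge the partial results."""
    parts = partition(items, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        cleaned_parts = list(pool.map(clean_partition, parts))
    # Merge the per-partition results into one sorted, de-duplicated list.
    return sorted({value for part in cleaned_parts for value in part})
```

Because each partition is cleaned without reference to the others, the work can be spread across more or fewer workers without changing the result.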
A dedicated SQL pool makes the processed data available for high-performance analytics. This pool stores data in relational tables with columnar storage. This format significantly reduces the cost of data storage and improves query performance, so you can run analytics at massive scale.
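To see why a columnar layout can help with both storage and queries, consider the following toy sketch. It assumes run-length encoding as one simple example of column compression. The source doesn't state which techniques a dedicated SQL pool applies, and the sample rows and function names are hypothetical.

```python
from itertools import groupby

# Hypothetical fact rows: (region, product, units).
ROWS = [
    ("east", "widget", 3),
    ("east", "widget", 5),
    ("east", "gadget", 2),
    ("west", "gadget", 7),
    ("west", "gadget", 1),
]


def to_columns(rows):
    """Pivot row-oriented tuples into one list per column."""
    region, product, units = zip(*rows)
    return {"region": list(region), "product": list(product), "units": list(units)}


def run_length_encode(values):
    """Compress runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def total_units(columns):
    """Compute an aggregate that reads only the one column it needs."""
    return sum(columns["units"])
```

With the sample rows, `run_length_encode(to_columns(ROWS)["region"])` returns `[("east", 3), ("west", 2)]`, so five stored values shrink to two pairs. `total_units` reads only the `units` column and never touches `region` or `product`.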
Potential use cases
You can use this solution in scenarios that involve large volumes of data, such as:
- IoT device integration
- Customer data platforms
- Natural language processing
- Machine learning algorithms
To view an estimate of the cost of this solution, see a pricing sample in the pricing calculator.
- Azure Synapse Analytics documentation
- Azure Synapse Analytics pipelines documentation
- Introduction to object storage in Azure
- Azure Synapse Analytics Spark pools
- Analysis Services documentation
- Power BI documentation