This example workload shows several ways that small businesses (SMBs) can modernize legacy data stores and explore big data tools and capabilities, without overextending current budgets and skillsets. These end-to-end Azure data warehousing solutions integrate easily with tools like Azure Machine Learning, Microsoft Power Platform, Microsoft Dynamics, and other Microsoft technologies.
Download a Visio file of this architecture.
Legacy SMB data warehouses might contain several types of data:
- Unstructured data, like documents and graphics
- Semi-structured data, such as logs, CSVs, JSON, and XML files
- Structured relational data, including databases that use stored procedures for extract-transform-load/extract-load-transform (ETL/ELT) activities
The following dataflow demonstrates the ingestion of your chosen data type:
Azure Synapse Analytics pipelines ingest the legacy data warehouses into Azure.
The pipelines orchestrate the flow of migrated or partially refactored legacy databases and SSIS packages into Azure SQL Database. This lift-and-shift approach is fastest to implement, and offers a smooth transition from an on-premises SQL solution to an eventual Azure platform-as-a-service (PaaS). You can modernize databases incrementally after the lift and shift.
The pipelines can also pass unstructured, semi-structured, and structured data into Azure Data Lake Storage for centralized storage and analysis with other sources. Use this approach when fusing data provides more business benefit than simply replatforming the data.
Microsoft Dynamics data sources can be used to build centralized BI dashboards on augmented datasets using Synapse Serverless analysis tools. You can bring the fused, processed data back into Dynamics and Power BI for further analysis.
Real-time data from streaming sources can also enter the system via Azure Event Hubs. For customers with real-time dashboard requirements, Azure Stream Analytics can analyze this data immediately.
The data can also enter the centralized Data Lake for further analysis, storage, and reporting.
Serverless analysis tools are available in the Azure Synapse Analytics workspace. These tools use serverless SQL pool or Apache Spark compute capabilities to process the data in Data Lake Storage Gen2. Serverless pools are available on demand, and don't require any provisioned resources.
Serverless pools are ideal for:
- Ad hoc data science explorations in T-SQL format.
- Early prototyping for data warehouse entities.
- Defining views that consumers can use, for example in Power BI, for scenarios that can tolerate performance lag.
Azure Synapse is tightly integrated with potential consumers of your fused datasets, like Azure Machine Learning. Other consumers can include Power Apps, Azure Logic Apps, Azure Functions apps, and Azure App Service web apps.
Azure Synapse Analytics is an analytics service that combines data integration, enterprise data warehousing, and big data analytics. In this solution:
- An Azure Synapse Workspace promotes collaboration between data engineers, data scientists, data analysts, and business intelligence (BI) professionals.
- Azure Synapse pipelines orchestrate and ingest data into SQL Database and Data Lake Storage Gen2.
- Azure Synapse serverless SQL pools analyze unstructured and semi-structured data in Data Lake Storage Gen2 on demand.
- Azure Synapse serverless Apache Spark pools do code-first explorations in Data Lake Storage Gen2 with Spark languages like Spark SQL, pySpark, and Scala.
Azure SQL Database is an intelligent, scalable, relational database service built for the cloud. In this solution, SQL Database holds the enterprise data warehouse and performs ETL/ELT activities that use stored procedures.
Azure Event Hubs is a real-time data streaming platform and event ingestion service. Event Hubs can ingest data from anywhere, and seamlessly integrates with Azure data services.
Azure Stream Analytics is a real-time, serverless analytics service for streaming data. Stream Analytics offers rapid, elastic scalability, enterprise-grade reliability and recovery, and built-in machine learning capabilities.
Azure Machine Learning is a toolset for data science model development and lifecycle management. Machine Learning is one example of the Azure and Microsoft services that can consume fused, processed data from Data Lake Storage Gen2.
Azure IoT Hub could replace or complement Event Hubs. The solution you choose depends on the source of your streaming data, and whether you need cloning and bidirectional communication with the reporting devices.
You can use Azure Data Factory for data integration instead of Azure Synapse pipelines. The choice depends on several factors:
- Azure Synapse pipelines keep the solution design simpler, and allow collaboration inside a single Azure Synapse workspace.
- Azure Synapse pipelines don't support SSIS packages rehosting, which is available in Azure Data Factory.
- Synapse Monitor Hub monitors Azure Synapse pipelines, while Azure Monitor can monitor Data Factory.
For more information and a feature comparison between Azure Synapse pipelines and Data Factory, see Data integration in Azure Synapse Analytics versus Azure Data Factory.
You can use Synapse Analytics dedicated SQL pools for storing enterprise data, instead of using SQL Database. Review the use cases and considerations in this article and related resources to make a decision.
Small and medium businesses (SMBs) face a choice when modernizing their on-premises data warehouses for the cloud. They can adopt big data tools for future extensibility, or keep traditional, SQL-based solutions for cost efficiency, ease of maintenance, and smooth transition.
However, a hybrid approach combines easy migration of the existing data estate with the opportunity to add big data tools and processes for some use cases. SQL-based data sources can keep running in the cloud and continue to modernize as appropriate.
This example workload shows several ways that SMBs can modernize legacy data stores and explore big data tools and capabilities, without overextending current budgets and skillsets. These end-to-end Azure data warehousing solutions integrate easily with Azure and Microsoft services and tools like Azure Machine Learning, Microsoft Power Platform, and Microsoft Dynamics.
Potential use cases
Several scenarios can benefit from this workload:
Migrating a traditional, on-premises relational data warehouse that's smaller than 1 TB and extensively uses SQL Server Integration Services (SSIS) packages to orchestrate stored procedures.
Using innovative techniques to interact with centralized Data Lake Storage Gen2 data. Techniques include serverless analysis, knowledge mining, data fusion between domains, and end-user data exploration.
Setting up eCommerce companies to adopt a data warehouse to optimize their operations.
This solution isn't recommended for:
Greenfield deployment of data warehouses that are estimated to be > 1 TB within one year.
Migrating on-premises data warehouses that are > 1 TB or projected to grow to that size within a year.
These considerations implement the pillars of the Azure Well-Architected Framework, which is a set of guiding tenets that can be used to improve the quality of a workload. For more information, see Microsoft Azure Well-Architected Framework.
The following considerations apply to this scenario.
SQL Database is a PaaS service that can meet your high availability (HA) and disaster recovery (DR) requirements. Be sure to pick the SKU that meets your requirements. For guidance, see High availability for Azure SQL Database.
SQL Database uses SQL Server Management Studio (SSMS) to develop and maintain legacy artifacts like stored procedures.
Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational efficiencies. For more information, see Overview of the cost optimization pillar.
See a pricing sample for a SMB data warehousing scenario in the Azure pricing calculator. Adjust the values to see how your requirements affect the costs.
SQL Database bases costs on the selected Compute and Service tiers, and the number of vCores and Database Transaction Units (DTUs). The example shows a single database with provisioned Compute and eight vCores, based on the assumption that you need to run stored procedures in SQL Database.
Data Lake Storage Gen2 pricing depends on the amount of data you store and how often you use the data. The sample pricing includes 1 TB of data stored, with further transactional assumptions. The 1 TB refers to the size of the data lake, not the original legacy database size.
Azure Synapse pipelines base costs on the number of data pipeline activities, integration runtime hours, data flow cluster size, and execution and operation charges. Pipeline costs increase with additional data sources and amounts of data processed. The example assumes one data source batched every hour for 15 minutes on an Azure-hosted integration runtime.
Azure Synapse Spark pool bases pricing on node size, number of instances, and uptime. The example assumes one small compute node with five hours a week to 40 hours a month utilization.
Azure Synapse serverless SQL pool bases pricing on TBs of data processed. The sample assumes 50 TBs processed a month. This figure refers to the size of the data lake, not the original legacy database size.
Event Hubs bills based on tier, throughput units provisioned, and ingress traffic received. The example assumes one throughput unit in Standard tier over one million events for a month.
Stream Analytics bases costs on the number of provisioned streaming units. The sample assumes one streaming unit used over the month.
This article is being updated and maintained by Microsoft. It was originally written by the following contributors.
- Galina Polyakova | Senior Cloud Solution Architect
- For training content and labs, see the Data Engineer Learning Paths.
- Tutorial: Get started with Azure Synapse Analytics
- Create a single database - Azure SQL Database
- Create a storage account for Azure Data Lake Storage Gen2
- Azure Event Hubs Quickstart - Create an event hub using the Azure portal
- Quickstart - Create a Stream Analytics job by using the Azure portal
- Quickstart: Get started with Azure Machine Learning