Edit

Choose an analytical data store in Azure

Big data architecture often needs an analytical data store that serves processed data in a structured format. You can query that data by using analytical tools. Analytical data stores that support querying of both hot-path and cold-path data are collectively referred to as the serving layer, or data serving storage.

The serving layer handles processed data from both the hot path and the cold path. In Lambda architecture, the serving layer is subdivided into two layers. The speed serving layer contains the incrementally processed data. The batch serving layer contains the batch-processed output.

While the overall serving layer requires strong support for random reads that have low latency, data storage for the speed layer should also support random writes because batch loading data into this store introduces undesired delays. Conversely, data storage for the batch layer needs to support batch writes, not random writes.

No single data-management solution fits every data storage task. Different solutions are optimal for specific tasks. Most real-world cloud apps and big data processes have various data storage requirements and often use a combination of storage solutions.

Modern analytical solutions, such as Microsoft Fabric, provide a comprehensive platform that integrates various data services and tools to meet diverse analytical needs. Fabric includes OneLake, which is a single, unified, logical data lake for your entire organization. OneLake is designed to store, manage, and secure all organizational data in one location. This flexibility allows your organization to address a wide range of data storage and processing requirements.

Choose an analytical data store

Microsoft offers several options for data serving storage, depending on your needs:

Different database models suit different types of tasks:

  • Key-value data stores hold a single serialized object for each key value. They can manage large volumes of data when retrieval is based on a specific key, without the need to query other item properties.

  • Document data stores are key-value data stores in which the values are documents. In this context, a document is a collection of named fields and values. The data store typically stores the data in a format such as XML, YAML, JSON, or binary JSON, but might use plain text. Document data stores can query non-key fields and define secondary indexes to improve querying efficiency. This capability makes a document database more suitable for applications that need to retrieve data based on criteria that's more complex than the value of the document key. For example, you could query fields such as product ID, customer ID, or customer name.

  • Column-family data stores are key-value data stores that keep each column separately on disk. A wide column store stores column families, not only single columns. For example, a census database might have a separate column family for each of an individual's attributes:

    • First, middle, and last name
    • Mailing address
    • Profile information, like date of birth or gender

    The data store can store each column family in a separate partition, while keeping all the data for one person related to the same key. An application can read a single column family without scanning all the data for an entity.

  • Graph data stores hold information as a collection of objects and relationships. A graph data store can efficiently perform queries that traverse the network of objects and the relationships between them. For example, the objects might be employees in a human resources database, and you might want to facilitate queries such as "find all employees who directly or indirectly work for Scott."

  • Telemetry and time-series databases are an append-only collection of objects. Telemetry databases efficiently index data in various column stores and in-memory structures. This capability makes them the optimal choice for storing and analyzing vast quantities of telemetry and time-series data.

Fabric supports various database models, including key-value, document, column store, graph, and telemetry databases. This flexibility ensures scalability for a wide range of analytical tasks. To choose the right Fabric data store for your analytical workloads, see Fabric decision guide: choose a data store.

Key selection criteria

To refine the selection process, consider the following criteria:

  • Do you need serving storage that can serve as a hot path for your data? If yes, choose options that are optimal for a speed serving layer.

  • Do you need massively parallel processing support, where queries are automatically distributed across several processes or nodes? If yes, select an option that supports query scale-out.

  • Do you prefer to use a relational data store? If you do, choose options that have a relational database model. However, some nonrelational stores support SQL syntax for querying, and you can use tools like SQL analytics endpoints to query nonrelational data stores such as OneLake.

  • Do you collect time-series data? Do you use append-only data? OneLake supports multiple analytical engines, including Analysis Services, T-SQL, and Apache Spark. Eventhouse is well suited for diverse time-series data processing and querying needs.

Capability matrix

The following tables summarize the key differences in capabilities among these managed services.

General capabilities

Capability Lakehouse Data Warehouse Eventhouse Fabric SQL Database Azure SQL Database Azure Cosmos DB Analysis Services
Primary database model Unified data lake, relational, user-managed delta lake format using Apache Parquet Unified data lake, relational, system-managed delta lake format using Apache Parquet Time-series append-oriented data store, graph, vector Relational (column store format when you use columnstore indexes) Relational (column store format when you use columnstore indexes) Document store, graph, key-value store, wide column store Tabular semantic models
SQL language support Yes1 Yes Yes2 Yes Yes Yes No
Optimized for speed serving layer Yes Yes Yes3 Yes4 Yes5 Yes No

[1] T-SQL via SQL analytics endpoint.

[2] Kusto Query Language (KQL) has partial T-SQL language support.

[3] Supports queued ingestion and streaming ingestion.

[4] Supports transactional precision with low-latency access and real-time updates.

[5] By using memory-optimized tables and hash or nonclustered indexes.

Scalability capabilities

Capability Lakehouse Data Warehouse Eventhouse Fabric SQL Database Azure SQL Database Azure Cosmos DB Analysis Services
Redundant regional servers for high availability Yes1,2 Yes1,2 Yes Yes Yes Yes Yes
Supports query scale-out Yes3 Yes4 Yes5 Yes No Yes Yes
Dynamic scalability (scale up) Yes3 Yes4 Yes5 Yes Yes Yes Yes
Supports in-memory caching of data Yes6 Yes6 Yes7 Yes Yes Yes No

[1] SQL endpoints route through global traffic managers, but the assigned Fabric capacity region always processes the data.

[2] Lakehouse and Warehouse store data in OneLake in the Delta Parquet format, which supports querying and replication across engines.

[3] Lakehouse supports Spark-based scale-out for unstructured and structured data.

[4] Warehouse uses T-SQL and supports multitable transactions, autonomous workload management, and distributed query processing (DQP). DQP acts like a cluster manager, dynamically allocating compute resources based on query complexity.

[5] Eventhouse supports KQL and SQL federation for real-time analytics across multiple sources and for scaling up compute resources if hot cache usage exceeds ~95%.

[6] Intelligent cache for Spark jobs, in-memory caching, result set caching for SQL analytics endpoints.

[7] Frequently accessed data is stored in a hot cache that includes in-memory and SSD storage.

Security capabilities

Capability Lakehouse Data Warehouse Eventhouse Fabric SQL Database Azure SQL Database Azure Cosmos DB Analysis Services
Authentication Microsoft Entra ID Microsoft Entra ID Microsoft Entra ID Microsoft Entra ID SQL or Microsoft Entra ID Database users or Microsoft Entra ID via access control (identity and access management) Microsoft Entra ID
Data encryption at rest Yes Yes Yes Yes Yes1 Yes Yes
Row-level security Yes Yes Yes Yes Yes No Yes
Supports firewalls Yes2 Yes2 Yes3 Yes Yes Yes Yes
Dynamic data masking Yes4 Yes4 No Yes Yes No No

[1] Requires that you use transparent data encryption to encrypt and decrypt your data at rest.

[2] Use Private Links and Microsoft Entra Conditional Access to restrict access to Fabric resources.

[3] Fabric Eventhouse and Real-Time Intelligence workloads can ingest data from secure sources like Kafka, Azure Event Hubs, and AMQP, with routing through secure endpoints.

[4] Apply it at the Fabric SQL Endpoint level.

Contributors

Microsoft maintains this article. The following contributors wrote this article.

Principal author:

To see nonpublic LinkedIn profiles, sign in to LinkedIn.

Next steps