Choose an analytical data store in Azure

In a big data architecture, there's often a need for an analytical data store that serves processed data in a structured format that can be queried by using analytical tools. Analytical data stores that support querying of both hot-path and cold-path data are collectively referred to as the serving layer, or data serving storage.

The serving layer handles processed data from both the hot path and the cold path. In the Lambda architecture, the serving layer is subdivided into two layers. The speed serving layer contains the incrementally processed data. The batch serving layer contains the batch-processed output. The serving layer requires strong support for random reads that have low latency. Data storage for the speed layer should also support random writes, because loading data into this store in batches would introduce undesired delays. In contrast, data storage for the batch layer needs to support batch writes, not random writes.

There's no single best data management choice for all data storage tasks. Different data management solutions are optimized for different tasks. Most real-world cloud apps and big data processes have various data storage requirements and often use a combination of data storage solutions.

Modern analytical solutions, such as Microsoft Fabric, provide a comprehensive platform that integrates various data services and tools to meet diverse analytical needs. Fabric includes OneLake, which is a single, unified, logical data lake for your entire organization. OneLake is designed to store, manage, and secure all organizational data in one location. This flexibility allows your organization to address a wide range of data storage and processing requirements.

Choose an analytical data store

There are several options for data serving storage in Azure, depending on your needs. The following database models are optimized for different types of tasks:

  • Key-value databases store a single serialized object for each key value. They're well-suited for managing large volumes of data when retrieval is based on a specific key, without the need to query other item properties.

  • Document databases are key-value databases in which the values are documents. In this context, a document is a collection of named fields and values. The database typically stores the data in a format such as XML, YAML, JSON, or binary JSON, but might use plain text. Document databases can query on non-key fields and define secondary indexes to improve querying efficiency. This capability makes a document database more suitable for applications that need to retrieve data based on criteria more complex than the value of the document key. For example, you could query on fields such as product ID, customer ID, or customer name.

  • Column store databases are key-value data stores that store each column separately on disk. A wide column store database is a type of column store database that stores column families, not only single columns. For example, a census database might have a separate column family for each of the following items:

    • A person's first, middle, and last name

    • That person's address

    • That person's profile information, like their date of birth or gender

    The database can store each column family in a separate partition, while keeping all the data for one person related to the same key. An application can read a single column family without scanning all the data for an entity.

  • Graph databases store information as a collection of objects and relationships. A graph database can efficiently perform queries that traverse the network of objects and the relationships between them. For example, the objects might be employees in a human resources database, and you might want to facilitate queries such as "find all employees who directly or indirectly work for Scott." A minimal sketch of such a traversal appears after this list.

  • Telemetry and time-series databases are append-only collections of objects. Telemetry databases efficiently index data in various column stores and in-memory structures, which makes them the optimal choice for storing and analyzing vast quantities of telemetry and time-series data.
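To make the graph example concrete, here's a minimal sketch of that traversal using the graph extensions to T-SQL available in SQL Server and Azure SQL Database; the Employee and WorksFor tables and the Name column are hypothetical. A dedicated graph store expresses the same idea in its native query language.

```sql
-- A minimal graph sketch in T-SQL. Employee, WorksFor, and Name are
-- hypothetical names. AS NODE and AS EDGE create graph tables.
CREATE TABLE Employee (EmployeeId INT PRIMARY KEY, Name NVARCHAR(100)) AS NODE;
CREATE TABLE WorksFor AS EDGE;

-- Employees who report directly to Scott. Walking indirect reporting
-- chains as well would use the SHORTEST_PATH function inside MATCH.
SELECT emp.Name
FROM Employee AS emp, WorksFor AS w, Employee AS mgr
WHERE MATCH(emp-(w)->mgr)
  AND mgr.Name = 'Scott';
```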

Fabric supports various database models, including key-value, document, column store, graph, and telemetry databases, so you can match the store to the workload across a wide range of analytical tasks. To choose the right Fabric data store for your analytical workloads, see Fabric decision guide: choose a data store.

Key selection criteria

To refine the selection process, consider the following criteria:

  • Do you need serving storage that can serve as a hot path for your data? If yes, narrow your options to those that are optimized for a speed serving layer.

  • Do you need massively parallel processing support, where queries are automatically distributed across several processes or nodes? If yes, select an option that supports query scale-out.

  • Do you prefer to use a relational data store? If so, narrow your options to those that have a relational database model. However, some nonrelational stores support SQL syntax for querying, and tools such as the SQL analytics endpoint can query nonrelational stores such as OneLake, as shown in the sketch after this list.

  • Do you collect time-series data? Do you use append-only data? If so, consider Fabric Eventhouse, which is optimized for storing and querying time-series and append-only data. Fabric OneLake also supports multiple analytical engines, including Analysis Services, T-SQL, and Apache Spark.
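For example, here's a minimal sketch of querying a Lakehouse table through its SQL analytics endpoint with ordinary T-SQL; the MyLakehouse database, the dbo.Sales table, and its columns are hypothetical. The endpoint exposes Lakehouse Delta tables as read-only relational tables, so data is written through Spark or pipelines rather than through these queries.

```sql
-- A minimal T-SQL sketch against a Lakehouse SQL analytics endpoint.
-- MyLakehouse, dbo.Sales, and the column names are hypothetical.
SELECT TOP 10
    ProductId,
    SUM(Quantity)   AS TotalUnits,
    SUM(SaleAmount) AS TotalRevenue
FROM MyLakehouse.dbo.Sales
GROUP BY ProductId
ORDER BY TotalRevenue DESC;
```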

Capability matrix

The following tables summarize the key differences in capabilities among these managed services.

General capabilities

| Capability | Fabric Lakehouse | Fabric Warehouse | Fabric Eventhouse | Fabric SQL Database | Azure SQL Database | Azure Cosmos DB | Analysis Services |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Primary database model | Unified data lake; relational; user-managed Delta Lake format using Apache Parquet | Unified data lake; relational; system-managed Delta Lake format using Apache Parquet | Time-series, append-oriented data store; graph; vector | Relational (column store format when you use columnstore indexes) | Relational (column store format when you use columnstore indexes) | Document store, graph, key-value store, wide column store | Tabular semantic models |
| SQL language support | Yes [1] | Yes | Yes [2] | Yes | Yes | Yes | No |
| Optimized for speed serving layer | Yes | Yes | Yes [3] | Yes [4] | Yes [5] | Yes | No |

[1] T-SQL via SQL Analytics Endpoint.

[2] KQL has partial T-SQL language support.

[3] Supports queued ingestion and streaming ingestion.

[4] Supports transactional precision with low-latency access and real-time updates.

[5] Using memory-optimized tables and hash or nonclustered indexes.
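To illustrate footnote 5, here's a minimal sketch of the kind of memory-optimized table that suits a speed serving layer in Azure SQL Database; the table name, columns, and bucket count are hypothetical, and the bucket count would normally be sized to roughly the expected number of rows.

```sql
-- A minimal In-Memory OLTP sketch for low-latency random reads and writes.
-- Table name, columns, and BUCKET_COUNT are hypothetical.
CREATE TABLE dbo.DeviceReadings
(
    ReadingId    BIGINT IDENTITY(1,1) NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1048576),
    DeviceId     INT       NOT NULL,
    ReadingValue FLOAT     NOT NULL,
    ReadingTime  DATETIME2 NOT NULL,

    -- A nonclustered index supports range scans over time.
    INDEX ix_ReadingTime NONCLUSTERED (ReadingTime)
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
```

The hash index gives near-constant-time point lookups by key, while the nonclustered index covers time-range queries.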

Scalability capabilities

| Capability | Fabric Lakehouse | Fabric Warehouse | Fabric Eventhouse | Fabric SQL Database | Azure SQL Database | Azure Cosmos DB | Analysis Services |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Redundant regional servers for high availability | Yes [1][2] | Yes [1][2] | Yes | Yes | Yes | Yes | Yes |
| Supports query scale-out | Yes [3] | Yes [4] | Yes [5] | Yes | No | Yes | Yes |
| Dynamic scalability (scale up) | Yes [3] | Yes [4] | Yes [5] | Yes | Yes | Yes | Yes |
| Supports in-memory caching of data | Yes [6] | Yes [6] | Yes [7] | Yes | Yes | Yes | No |

[1] SQL Endpoints are routed via global traffic managers, but data is always processed in the assigned Fabric capacity region.

[2] Lakehouse and Warehouse store data in OneLake using Delta Parquet format, which supports querying and replication across engines.

[3] Lakehouse supports Spark-based scale-out for unstructured and structured data.

[4] Warehouse uses T-SQL and supports multi-table transactions, autonomous workload management, and distributed query processing (DQP). DQP acts like a cluster manager, dynamically allocating compute resources based on query complexity.

[5] Eventhouse supports KQL and SQL federation, which enables real-time analytics across multiple sources, and it can scale up compute resources if hot cache usage exceeds ~95%.

[6] Intelligent cache for Spark jobs, in-memory caching, result set caching for SQL analytics endpoints.

[7] Frequently accessed data is stored in a hot cache, which includes in-memory and SSD storage.

Security capabilities

| Capability | Fabric Lakehouse | Fabric Warehouse | Fabric Eventhouse | Fabric SQL Database | Azure SQL Database | Azure Cosmos DB | Analysis Services |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Authentication | Microsoft Entra ID | Microsoft Entra ID | Microsoft Entra ID | Microsoft Entra ID | SQL or Microsoft Entra ID | Database users or Microsoft Entra ID via access control (identity and access management) | Microsoft Entra ID |
| Data encryption at rest | Yes | Yes | Yes | Yes | Yes [1] | Yes | Yes |
| Row-level security | Yes | Yes | Yes | Yes | Yes | No | Yes |
| Supports firewalls | Yes [2] | Yes [2] | Yes [3] | Yes | Yes | Yes | Yes |
| Dynamic data masking | Yes [4] | Yes [4] | No | Yes | Yes | No | No |

[1] Requires you to use transparent data encryption to encrypt and decrypt your data at rest.

[2] Private links and Microsoft Entra Conditional Access can be used to restrict access to Fabric resources.

[3] Fabric Eventhouse and Real-Time Intelligence workloads can ingest data from secure sources like Kafka, Azure Event Hubs, and AMQP, with routing through secure endpoints.

[4] Dynamic data masking can be applied at the Fabric SQL analytics endpoint level.
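As a concrete illustration of footnote 4 and the dynamic data masking row, here's a minimal T-SQL sketch; the dbo.Customers table, its Email column, and the ReportingUser principal are hypothetical.

```sql
-- A minimal dynamic data masking sketch. Table, column, and user names
-- are hypothetical. The mask applies at query time for principals that
-- lack the UNMASK permission; the stored data itself is unchanged.
ALTER TABLE dbo.Customers
    ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

-- Principals granted UNMASK continue to see the original values.
GRANT UNMASK TO ReportingUser;
```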
