Workload functionality and performance must be monitored in diverse ways and for diverse reasons. Azure Monitor Log Analytics workspaces are the primary log and metric sink for a large portion of the monitoring data. Workspaces support multiple features in Azure Monitor, including ad-hoc queries, visualizations, and alerts. For general monitoring principles, see Monitoring and diagnostics guidance. That guidance presents general monitoring principles, identifies the different types of data, describes the analysis that Azure Monitor supports, and identifies the data stored in the workspace that enables that analysis.
This article assumes that you understand system design principles. You also need a working knowledge of Log Analytics workspaces and features in Azure Monitor that populate operational workload data. For more information, see Log Analytics workspace overview.
How to use this guide
Each section has a design checklist that presents architectural areas of concern along with design strategies localized to the technology scope.
Also included are recommendations on the technology capabilities or deployment topologies that can help materialize those strategies. The recommendations don't represent an exhaustive list of all configurations available for Log Analytics workspaces and their related Azure Monitor resources. Instead, they list the key recommendations mapped to the design perspectives. Use the recommendations to build your proof of concept, design your workload monitoring environment, or optimize your existing workload monitoring solution.
This guide focuses on the interrelated decisions for the following Azure resources.
The purpose of the Reliability pillar is to provide continued functionality by building enough resilience and the ability to recover quickly from failures.
The Reliability design principles provide a high-level design strategy applied for individual components, system flows, and the system as a whole.
For Log Analytics workspaces, note that there's currently no standard feature for failover between workspaces in different regions, but there are strategies to use if you have particular requirements for availability or compliance.
Start your design strategy based on the design review checklist for Reliability and determine its relevance to your business requirements, keeping in mind the features of Log Analytics workspaces and their dependencies. Extend the strategy to include more approaches as needed.
Recommendation | Benefit |
---|---|
Don't include your Log Analytics workspaces in your workload's critical path. Your workspaces are important to a functioning observability system, but the functionality of your workload shouldn't depend on them. | Keeping your workspaces and associated functions out of your workload's critical path minimizes the risk that issues affecting your observability system also affect the runtime execution of your workload. |
To support high durability of workspace data, deploy Log Analytics workspaces into a region that supports data resilience. Data resilience is only possible through linking the workspace to a dedicated cluster in the same region. | A dedicated cluster lets you spread the associated workspaces across availability zones, which offer protection against datacenter outages. If you don't collect enough data now to justify a dedicated cluster, this preemptive regional choice supports future growth. |
Choose your workspace deployment based on proximity to your workload. Deploy your workspace in the same region as the instances of your workload, and use data collection endpoints (DCEs) in the same region as the Log Analytics workspace. If your workload is deployed in an active-active design, consider using multiple workspaces and DCEs spread across the regions in which your workload is deployed. | Having your workspace and DCEs in the same region as your workload mitigates the risk of impacts by outages in other regions. DCEs are used by the Azure Monitor agent and the Logs Ingestion API to send workload operational data to a Log Analytics workspace. You might need multiple DCEs even though your deployment only has a single workspace. For more information on how to configure DCEs for your particular environment, see How to set up data collection endpoints based on your deployment. Deploying workspaces in multiple regions adds complexity to your environment. Balance the criteria detailed in Design a Log Analytics workspace architecture with your availability requirements. |
If you require the workspace to be available in a region failure, or you don't collect enough data for a dedicated cluster, configure data collection to send critical data to multiple workspaces in different regions. This practice is also known as log multicasting. For example, configure DCRs for multiple workspaces for the Azure Monitor agent running on VMs, and configure multiple diagnostic settings to collect resource logs from Azure resources and send the logs to multiple workspaces. | In this way, workload operational data is available in the alternate workspace if there's a regional failure. But know that resources that rely on the data, such as alerts and workbooks, wouldn't automatically be replicated to the other regions. Consider storing Azure Resource Manager (ARM) templates for critical alerting resources with configuration for the alternate workspace, or deploying them in all regions but disabling them to prevent redundant alerts. Both options support quick enablement in a regional failure. Tradeoff: This configuration results in duplicate ingestion and retention charges, so only use it for critical data. |
If you require data to be protected in a datacenter or region failure, configure data export from the workspace to save data in an alternate location. This option is similar to the previous option of multicasting the data to different workspaces, but it costs less because the extra data is written to storage. Use Azure Storage redundancy options, including geo-redundant storage (GRS) and geo-zone-redundant storage (GZRS), to further replicate this data to other regions. Data export doesn't provide resiliency against incidents impacting the regional ingestion pipeline. | While the historic operational log data might not be readily queryable in the exported state, this option ensures that the data survives a prolonged regional outage and can be accessed and retained for an extended period. If you require the export of tables not supported by data export, you can use other methods of exporting data, including Logic Apps, to protect your data. For this strategy to work as a viable recovery plan, you must have processes in place to reconfigure diagnostic settings for your resources in Azure and on all agents that provide data. You must also plan to manually rehydrate your exported data into a new workspace. As with the previously described option, you also need to define processes for those resources that rely on the data, like alerts and workbooks. |
For mission-critical workloads requiring high availability, consider implementing a federated workspace model that uses multiple workspaces to provide high availability if there's a regional failure. | Mission-critical provides prescriptive best practice guidance for designing highly reliable applications on Azure. The design methodology includes a federated workspace model with multiple Log Analytics workspaces to deliver high availability if there are multiple failures, including the failure of an Azure region. This strategy eliminates egress costs across regions and remains operational with a region failure. But it requires more complexity that you must manage with configuration and processes described in Health modeling and observability of mission-critical workloads on Azure. |
Use infrastructure as code (IaC) to deploy and manage your workspaces and associated functions. | When you automate as much of your deployment and your mechanisms for resilience and recovery as practical, it ensures that these operations are reliable. You save critical time in your operations processes and minimize the risk of human error. Ensure that functions like saved log queries are also defined through your IaC to recover them to a new region if recovery is required. |
Design DCRs with a single responsibility principle to keep DCR rules simple. While one DCR could be loaded with all the input, rules, and destinations for the source systems, it's preferable to design narrowly focused rules that rely on fewer data sources. Use composition of rule assignments to arrive at the desired observability scope for the logical target. Also, minimize transformation in DCRs. | When you use narrowly focused DCRs, you minimize the risk of a rule misconfiguration having a broader effect. It limits the effect to only the scope for which the DCR was built. For more information, see Best practices for data collection rule creation and management in Azure Monitor. While transformation can be powerful and necessary in some situations, it can be challenging to test and troubleshoot the Kusto Query Language (KQL) work being done. When possible, minimize the risk of data loss by ingesting the data raw and handling transformations downstream at query time. |
When setting a daily cap or a retention policy, be sure you're maintaining your reliability requirements by ingesting and retaining the logs that you need. | A daily cap stops the collection of data for a workspace once a specified amount is reached, which helps you maintain control over your ingestion volume. But only use this feature after careful planning. Ensure that your daily cap isn't being hit with regularity. If that happens, your cap is set too restrictively. You need to reconfigure the daily cap so you don't miss critical signals coming from your workload. Likewise, be sure to carefully and thoughtfully approach the lowering of your data retention policy to ensure that you don't inadvertently lose critical data. |
Use Log Analytics workspace insights to track ingestion volume, ingested data versus your data cap, unresponsive log sources, and failed queries among other data. Create health status alerts to proactively notify you if a workspace becomes unavailable because of a datacenter or regional failure. | This strategy ensures that you're able to successfully monitor the health of your workspaces and proactively act if the health is at risk of degrading. Like any other component of your workload, it's critical that you're aware of health metrics and can identify trends to improve your reliability over time. |
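If you multicast critical data to workspaces in different regions, you can verify that both copies receive data by using a cross-workspace query. The following KQL is a sketch, not a definitive implementation; `law-primary` and `law-secondary` are placeholder workspace names, and the query assumes agents send Heartbeat records to both workspaces:

```kusto
// Compare recent agent heartbeats in the primary and the alternate workspace.
// "law-primary" and "law-secondary" are placeholder workspace names.
union
    workspace("law-primary").Heartbeat,
    workspace("law-secondary").Heartbeat
| where TimeGenerated > ago(15m)
// TenantId identifies the workspace the record landed in, so a computer
// missing from one workspace indicates a multicasting gap.
| summarize LastHeartbeat = max(TimeGenerated) by Computer, TenantId
| order by LastHeartbeat desc
```

A scheduled query like this can back an alert rule that fires when one leg of the multicast stops receiving data.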
Azure offers no policies related to the reliability of Log Analytics workspaces. You can create custom policies to build compliance guardrails around your workspace deployments, such as ensuring workspaces are associated with a dedicated cluster.
While not directly related to the reliability of Log Analytics workspaces, there are Azure policies for nearly every service available. The policies ensure that diagnostics settings are enabled for that service and validate that the service's log data is flowing into a Log Analytics workspace. All services in workload architecture should be sending their log data to a Log Analytics workspace for their own reliability needs, and the policies can help enforce it. Likewise, policies exist to ensure agent-based platforms, such as VMs and Kubernetes, have the agent installed.
Azure offers no Azure Advisor recommendations related to the reliability of Log Analytics workspaces.
The purpose of the Security pillar is to provide confidentiality, integrity, and availability guarantees to the workload.
The Security design principles provide a high-level design strategy for achieving these goals by applying approaches to the technical design around your monitoring and logging solution.
Start your design strategy based on the design review checklist for Security and identify vulnerabilities and controls to improve the security posture. Extend the strategy to include more approaches as needed.
Recommendation | Benefit |
---|---|
Use customer-managed keys if you require your own encryption key to protect data and saved queries in your workspaces. Azure Monitor ensures that all data and saved queries are encrypted at rest by using Microsoft-managed keys (MMK). If you require your own encryption key and collect enough data for a dedicated cluster, use customer-managed keys. If you use Microsoft Sentinel, make sure that you're familiar with the considerations at Set up Microsoft Sentinel customer-managed key. | This strategy lets you encrypt data by using your own key in Azure Key Vault, for control over the key lifecycle and the ability to revoke access to your data. |
Configure log query auditing to track which users are running queries. Configure the audit logs for each workspace to be sent to the local workspace, or consolidate them in a dedicated security workspace if you separate your operational and security data. Use Log Analytics workspace insights to periodically review this data. Consider creating log query alert rules to proactively notify you if unauthorized users attempt to run queries. | Log query auditing records the details for each query run in a workspace. Treat this audit data as security data and secure the LAQueryLogs table appropriately. This strategy bolsters your security posture by helping to ensure that unauthorized access is caught immediately if it ever happens. |
Help secure your workspace through private networking and segmentation measures. Use private link functionality to limit communications between log sources and your workspaces to private networking. | Private link also lets you control which virtual networks can access a given workspace, further bolstering your security through segmentation. |
Use Microsoft Entra ID instead of API keys for workspace API access where available. | API key-based access to the query APIs doesn't leave a per-client audit trail. Use sufficiently scoped Entra ID-based access so that you can properly audit programmatic access. |
Configure access for different types of data in the workspace required for different roles in your organization. Set the access control mode for the workspace to Use resource or workspace permissions. This access control lets resource owners use resource-context to access their data without being granted explicit access to the workspace. Use table-level RBAC for users who require access to a set of tables across multiple resources. | This setting simplifies your workspace configuration and helps to ensure users can't access operational data they shouldn't. Assign the appropriate built-in role to grant workspace permissions to administrators at the subscription, resource group, or workspace level, depending on their scope of responsibilities. Users with table permissions have access to all the data in the table regardless of their resource permissions. See Manage access to Log Analytics workspaces for details on the different options for granting access to data in the workspace. |
Export logs that require long-term retention or immutability. Use data export to send data to an Azure Storage account with immutability policies to help protect against data tampering. Not every type of log has the same relevance for compliance, auditing, or security, so determine the specific data types that should be exported. | You might collect audit data in your workspace that's subject to regulations requiring its long-term retention. Data in a Log Analytics workspace can't be altered, but it can be purged. Exporting a copy of the operational data for retention purposes lets you build a solution that meets your compliance requirements. |
Determine a strategy to filter or obfuscate sensitive data in your workspace. You might be collecting data that includes sensitive information. Filter records that shouldn't be collected by using the configuration for the particular data source. Use a transformation if only particular columns in the data should be removed or obfuscated. If you have standards that require the original data to be unmodified, you can use the 'h' literal in KQL queries to obfuscate query results displayed in workbooks. | Obfuscating or filtering out sensitive data in your workspace helps ensure you maintain confidentiality on sensitive information. In many cases, compliance requirements dictate the ways that you can handle sensitive information. This strategy helps you comply with the requirements proactively. |
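Several of the recommendations above rely on the LAQueryLogs table that log query auditing populates. The following KQL is a sketch that summarizes who is running queries in the workspace; it assumes log query auditing is already enabled for the workspace:

```kusto
// Summarize query activity per user over the last day.
// Requires log query auditing to be enabled on the workspace.
LAQueryLogs
| where TimeGenerated > ago(1d)
| summarize QueryCount = count(), LastQuery = max(TimeGenerated) by AADEmail
| order by QueryCount desc
```

A variant of this query that filters on unexpected accounts can serve as the condition for a log query alert rule that flags unauthorized access attempts.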
Azure offers policies related to the security of Log Analytics workspaces to help enforce your desired security posture.
Azure also offers numerous policies to help enforce private link configuration, such as Log Analytics workspaces should block log ingestion and querying from public networks or even configuring the solution through DINE policies such as Configure Azure Monitor Private Link Scope to use private DNS zones.
Azure offers no Azure Advisor recommendations related to the security of Log Analytics workspaces.
Cost Optimization focuses on detecting spend patterns, prioritizing investments in critical areas, and optimizing in others to meet the organization's budget while meeting business requirements.
The Cost Optimization design principles provide a high-level design strategy for achieving those business goals. They also help you make tradeoffs as necessary in the technical design related to your monitoring and logging solution.
For more information on how data charges are calculated for your Log Analytics workspaces, see Azure Monitor Logs cost calculations and options.
Start your design strategy based on the design review checklist for Cost Optimization for investments, and fine-tune the design so that the workload is aligned with its allocated budget. Your design should use the right Azure capabilities, monitor investments, and find opportunities to optimize over time.
Recommendation | Benefit |
---|---|
Configure the pricing tier for the amount of data that each Log Analytics workspace typically collects. | By default, Log Analytics workspaces use pay-as-you-go pricing with no minimum data volume. If you collect enough data, you can significantly decrease your cost by using a commitment tier, which lets you commit to a daily minimum of data collected in exchange for a lower rate. If you collect enough data across workspaces in a single region, you can link them to a dedicated cluster and combine their collected volume by using cluster pricing. For more information on commitment tiers and guidance on determining what's most appropriate for your level of usage, see Azure Monitor Logs cost calculations and options. To view estimated costs for your usage at different pricing tiers, see Usage and estimated costs. |
Configure data retention and archiving. | There's a charge for retaining data in a Log Analytics workspace beyond the default of 31 days. It's 90 days if Microsoft Sentinel is enabled on the workspace and 90 days for Application Insights data. Consider your particular requirements for having data readily available for log queries. You can significantly reduce your cost by configuring archived logs. Archived logs let you retain data for up to seven years and still access it occasionally. You access the data by using search jobs or restoring a set of data to the workspace. |
If you use Microsoft Sentinel to analyze security logs, consider employing a separate workspace to store those logs. | When you use a dedicated workspace for log data that your SIEM uses, it can help you control costs. The workspaces that Microsoft Sentinel uses are subject to Microsoft Sentinel pricing. Your security requirements dictate the types of logs that are required to be included in your SIEM solution. You might be able to exclude operational logs, which would be charged at the standard Log Analytics pricing if they're in a separate workspace. |
Configure tables used for debugging, troubleshooting, and auditing as Basic Logs. | Tables in a Log Analytics workspace configured for Basic Logs have a lower ingestion cost in exchange for limited features and a charge for log queries. If you query these tables infrequently and don't use them for alerting, this query cost can be more than offset by the reduced ingestion cost. |
Limit data collection from data sources for the workspace. | The primary factor for the cost of Azure Monitor is the amount of data that you collect in your Log Analytics workspace, so be sure that you collect no more data than you require to assess the health and performance of your services and applications. For each resource, select the right categories for the diagnostic settings you configure so that they provide the operational data you need to manage your workload, without collecting data that's never used. There might be a tradeoff between cost and your monitoring requirements. For example, you might be able to detect a performance issue more quickly with a high sample rate, but you might want a lower sample rate to save costs. Most environments have multiple data sources with different types of collection, so you need to balance your particular requirements with your cost targets for each. See Cost optimization in Azure Monitor for recommendations on configuring collection for different data sources. |
Regularly analyze workspace usage data to identify trends and anomalies. Use Log Analytics workspace insights to periodically review the amount of data collected in your workspace. Further analyze data collection by using methods in Analyze usage in Log Analytics workspace to determine if there are other configurations that can decrease your usage further. | By helping you understand the amount of data collected by different sources, this practice identifies anomalies and upward trends in data collection that could result in excess cost. This consideration is important when you add a new set of data sources to your workload, for example, if you add a new set of VMs, enable new Azure diagnostics settings on a service, or change log levels in your application. |
Create an alert when data collection is high. | To avoid unexpected bills, you should be proactively notified anytime you experience excessive usage. Notification lets you address any potential anomalies before the end of your billing period. |
Consider a daily cap as a preventative measure to ensure that you don't exceed a particular budget. | A daily cap disables data collection in a Log Analytics workspace for the rest of the day after your configured limit is reached. Don't use this practice as a method to reduce costs as described in When to use a daily cap, but instead to prevent runaway ingestion due to misconfiguration or abuse. If you set a daily cap, create an alert when the cap is reached. Be sure to also create an alert rule when some percentage is reached. For example, you can set an alert rule for when 90 percent capacity is reached. This alert gives you an opportunity to investigate and address the cause of the increased data before the cap shuts off critical data collection from your workload. |
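Most of the cost levers above start with knowing where your ingestion volume comes from. The Usage table records billable volume by data type; the following KQL is a sketch that summarizes the last 30 days (the Quantity column is reported in MB):

```kusto
// Billable ingestion by data type over the last 30 days.
// Quantity is in MB, so divide by 1024 for GB.
Usage
| where TimeGenerated > ago(30d)
| where IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1024. by DataType
| order by IngestedGB desc
```

Tables at the top of this list are the candidates to evaluate for Basic Logs, reduced retention, or narrower diagnostic settings.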
Azure offers no policies related to cost optimization of Log Analytics workspaces. You can create custom policies to build compliance guardrails around your workspace deployments, such as ensuring that your workspaces contain the right retention settings.
Azure Advisor makes recommendations to move specific tables in a workspace to the low-cost Basic Logs data plan for tables that receive relatively high ingestion volume. Understand the limitations of using Basic Logs before switching. For more information, see When should I use Basic Logs?. Azure Advisor might also recommend changing the pricing commitment tier for the whole workspace based on overall usage volume.
Operational Excellence primarily focuses on procedures for development practices, observability, and release management.
The Operational Excellence design principles provide a high-level design strategy for achieving those goals towards the operational requirements of the workload.
Start your design strategy based on the design review checklist for Operational Excellence for defining processes for observability, testing, and deployment related to Log Analytics workspaces.
Recommendation | Benefit |
---|---|
Design a workspace strategy to meet your business requirements. See Design a Log Analytics workspace architecture for guidance on designing a strategy for your Log Analytics workspaces, including how many to create and where to place them. If your workload must use a centralized platform team offering, ensure that you set up all necessary operational access. Also, construct alerts to ensure that workload observability needs are met. | A single workspace, or at least a minimal number of workspaces, maximizes your workload's operational efficiency. It limits the distribution of your operational and security data, increases visibility into potential issues, makes patterns easier to identify, and minimizes your maintenance requirements. You might have requirements for multiple workspaces, such as multiple tenants, or you might need workspaces in multiple regions to support your availability requirements. So, ensure that you have appropriate processes in place to manage this increased complexity. |
Use infrastructure as code (IaC) to deploy and manage your workspaces and associated functions. | Use IaC to define the details of your workspaces in ARM templates, Bicep, or Terraform. It lets you use your existing DevOps processes to deploy new workspaces and Azure Policy to enforce their configuration. Colocating all of your IaC code with your application code helps ensure that your safe deployment practices are maintained for all deployments. |
Use Log Analytics workspace insights to track the health and performance of your Log Analytics workspaces, and create meaningful and actionable alerts to be proactively notified of operational issues. Log Analytics workspace insights provides a unified view of the usage, performance, health, agents, queries, and change log for all your workspaces. Each workspace has an Operation table that logs important activities affecting the workspace. | Reviewing the information that Log Analytics workspace insights provides on a regular basis helps you track the health and operation of each of your workspaces. You can create easily understood visualizations, like dashboards or reports, that operations teams and stakeholders can use to track the health of your workspaces. Create alert rules based on the Operation table to be proactively notified when an operational issue occurs. You can use recommended alerts for the workspace to simplify how you create the most critical alert rules. |
Practice continuous improvement by frequently revisiting Azure diagnostic settings on your resources, data collection rules, and application log verbosity. Ensure that you're optimizing your log collection strategy through frequent reviews of your resource settings. From an operational standpoint, look to reduce the noise in your logs by focusing on those logs that provide useful information about a resource's health status. | By optimizing in this manner, you enable operators to investigate and troubleshoot issues when they arise, or perform other routine, improvised, or emergency tasks. When new diagnostic categories are made available for a resource type, review the types of logs that are emitted with this category to understand whether enabling them might help you optimize your collection strategy. For example, a new category might be a subset of a larger set of activities that are being captured. The new subset might let you reduce the volume of logs coming in by focusing on the activities that are important for your operations to track. |
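The Operation table mentioned above can also be queried directly. The following KQL is a sketch that surfaces recent workspace-level errors and warnings; the exact Level values and categories emitted depend on your workspace configuration:

```kusto
// Recent errors and warnings recorded by the workspace itself.
Operation
| where TimeGenerated > ago(7d)
| where Level in ("Error", "Warning")
| summarize Occurrences = count(), LastSeen = max(TimeGenerated)
    by OperationCategory, Detail
| order by LastSeen desc
```

A log query alert rule built on a query like this complements the recommended alerts for the workspace.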
Azure offers no policies nor Azure Advisor recommendations related to the operational excellence of Log Analytics workspaces.
Performance Efficiency is about maintaining user experience even when there's an increase in load by managing capacity. The strategy includes scaling resources, identifying and optimizing potential bottlenecks, and optimizing for peak performance.
The Performance Efficiency design principles provide a high-level design strategy for achieving those capacity goals against the expected usage.
Start your design strategy based on the design review checklist for Performance Efficiency for defining a baseline for your Log Analytics workspaces and associated functions.
Recommendation | Benefit |
---|---|
Configure log query auditing and use Log Analytics workspace insights to identify slow and inefficient queries. Log query auditing stores the compute time required to run each query and the time until results are returned. Log Analytics workspace insights uses this data to list potentially inefficient queries in your workspace. Consider rewriting these queries to improve their performance. Refer to Optimize log queries in Azure Monitor for guidance on optimizing your log queries. | Optimized queries return results faster and use fewer resources on the back end, which makes the processes that rely on those queries more efficient as well. |
Understand service limits for Log Analytics workspaces. In certain high-traffic implementations, you might run into service limits that affect your performance and your workspace or workload design. For example, the query API limits the number of records and the data volume returned by a query, and the Logs Ingestion API limits the size of each API call. For a complete list of Azure Monitor and Log Analytics workspace limits, including limits specific to the workspace itself, see Azure Monitor service limits. | Understanding the limits that might affect the performance of your workspace helps you design appropriately to mitigate them. You might decide to use multiple workspaces to avoid hitting limits associated with a single workspace. Weigh the design decisions to mitigate service limits against requirements and targets for other pillars. |
Create DCRs specific to data source types inside one or more defined observability scopes. Create separate DCRs for performance and events to optimize the back-end processing compute utilization. | Using separate DCRs for performance and events helps mitigate back-end resource exhaustion. DCRs that combine performance and events force every associated virtual machine to transfer, process, and run configurations that might not apply to the installed software. The result can be excessive compute resource consumption and errors in processing the configuration, which can cause the Azure Monitor Agent (AMA) to become unresponsive. |
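Log query auditing data can also drive the query optimization work described above. The following KQL is a sketch that ranks queries by average CPU time; it assumes log query auditing is enabled so that the LAQueryLogs table is populated:

```kusto
// Top 10 queries by average CPU time over the last week.
// Requires log query auditing to be enabled on the workspace.
LAQueryLogs
| where TimeGenerated > ago(7d)
| summarize AvgCpuMs = avg(StatsCPUTimeMs),
            AvgDurationMs = avg(ResponseDurationMs),
            Runs = count()
    by QueryText
| top 10 by AvgCpuMs desc
```

Queries that appear here repeatedly are the first candidates to rewrite by using the techniques in Optimize log queries in Azure Monitor.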
Azure offers no policies nor Azure Advisor recommendations related to the performance of Log Analytics workspaces.