Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
When you're designing workloads in Microsoft Fabric, reliability is an ultimate goal that you constantly drive towards. Your architecture decisions, workload patterns, and even data placement all influence how resilient your solution will be under pressure.
This article guides you through thinking about reliability in a practical, architect-focused way: understanding constraints, planning redundancy, scaling smartly, monitoring effectively, and preparing for disasters.
Start with constraints: know your boundaries
Before you can make your workloads resilient, you need to know the limits of your playground. Each Fabric capacity defines how much compute and memory is available. If you're running a heavy Spark job, refreshing a large semantic model, or handling dozens of concurrent Power BI queries, hitting those limits can cause delays or failures.
Subscription quotas can add additional limits to factor into your approach. Not all tenants are created equal; some regions may cap capacities differently. Plan ahead and request quota increases before you hit hard limits.
Your workspace design matters because workspaces act as isolation boundaries. Some behaviors, especially with preview features or private endpoints, can differ depending on the workspace, so test and design with those nuances in mind.
Fabric is SaaS, so low-level infrastructure tweaks aren't in your hands. You still have some options. You can move workloads across capacities, separate critical pipelines, or use multi-capacity patterns to work around platform quirks. Always map dependencies and constraints visually to help your team anticipate where failures might appear.
Some Fabric behaviors are by design and can be feel restrictive. Microsoft Fabric official documentation maintains up-to-date lists of well-known issues, preview constraints, and platform behaviors. Pay attention to workspace boundaries, feature differences when Private Endpoint is used, and how data moves across workspaces.
Refer to:
Understand where failures happen
Failures usually happen in predictable places: resource limits and external dependencies. If your Spark jobs or pipelines saturate their capacity, they fail. If upstream gateways or databases go down, your workloads may stall even if Fabric itself is healthy.
Infrastructure-level failures including regional outages are rare because of Fabric's automated mitigations, but they can still happen. Specialized workloads, like dedicated databases or Eventhouses, can behave differently under partial failures or stress. Have a good understanding about the failure points in your workload.
Tip
Always map out your dependencies and capacity constraints. Visualizing failure points helps target monitoring and mitigation efforts.
Define reliability goals for your workloads
Measure reliability against your goals. Start by defining service level objectives (SLOs) that make sense for your business. Fabric guarantees 99.9% uptime, but your pipelines, Spark jobs, and Power BI reports might need tighter operational targets. Focus on SLIs like pipeline success rates, job completion times, execution latency, and capacity utilization.
Remember external dependencies. An unreliable upstream API reduces your effective availability. Mitigate these risks with caching, queuing, or backup sources.
Be realistic with recovery targets. Localized failures (compute node issues) might resolve in minutes, regional outages take longer. Document all this in runbooks, dashboards, and architecture diagrams so your team knows what "reliable" really means.
Refer to:
Build self-healing through redundancy
When you're designing workloads in Microsoft Fabric, think of self healing as your first line of defense and redundancy as a mechanism that enables self-healing. Fabric already gives you a strong foundation: compute nodes automatically fail over, OneLake maintains multiple copies of your data, and services can restart themselves if something goes wrong. But your design choices determine how far that protection extends.
Redundancy exists at multiple layers, and understanding which layer affects your workloads helps you make smarter design decisions:
| Level | How it works | What you control |
|---|---|---|
| Instance-level | Individual node failover happens automatically. | Managed by Fabric; ensure workloads can resume without manual intervention. |
| Zone-level | Azure Availability Zones spread workloads across physical locations within a region. | Fabric uses Azure Availability Zones when they are supported by region and workload, without requiring any customer configuration. |
| Region-level | Geo-replication moves data to a secondary region for disaster recovery. | You decide which capacities enable the Disaster Recovery feature for OneLake data, which capacities or workspaces are active in each region, and how failover is orchestrated. |
Traffic distribution and failover for local failures happen automatically as requests are load-balanced across healthy nodes. In a regional outage, Microsoft manages failover to a secondary region for data access, while you execute the rest of your disaster recovery plan. By deploying multiple capacities or workspaces as independent deployment stamps, you can isolate failures, contain the blast radius, and ensure users or jobs can be routed to healthy replicas without manual intervention.
Tradeoff: Zone-level redundancy is nearly invisible, but cross-region strategies require intentional design.
Fabric heals itself from common failures, but disaster recovery requires shared responsibility - your architecture can enhance it. Dedicate capacities to critical workloads, modularize pipelines and datasets, design idempotent operations, implement retries with exponential backoff, and use circuit breakers where necessary.
Graceful degradation ensures partial failures don't stop everything. Pipelines can skip optional steps, reports can show read-only results with warnings. Health checks should tie to recovery actions like restarting pipelines, switching to backup sources, or scaling capacities automatically. Isolation is key: failures should remain local, preventing cascading effects across your environment.
Tradeoff: More redundancy reduces downtime and improves self-healing, but it comes at the cost of additional resources and operational complexity. Align your approach with the criticality of your workloads and your recovery expectations.
Tradeoff: There's a trade-off between immediate availability and resource efficiency. Maintaining redundant capacities or regions ensures minimal downtime but increases cost and operational overhead. Relying on Fabric's built-in resiliency in a single region reduces cost but can introduce delays while failover or recovery processes complete. Your redundancy strategy should reflect the criticality of workloads and your recovery expectations.
Scale vertically and horizontally with reliability
Scaling is about keeping your workloads running smoothly as demand grows. For predictable workloads, allocate capacities that meet peak demand. For spikes, rely on Fabric's elasticity features like bursting and smoothing. Scaling operations are generally fast. Still, communicating scaling events prevents surprises for dependent teams and avoids transient slowdowns.
Treat each capacity as a self-contained unit. If one saturates, others continue running. Distribute heavy workloads across multiple workspaces and capacities to reduce contention. Use autoscale billing for highly elastic workloads, such as Spark and data warehousing. Apply surge protection at capacity and workspace level. Enable the capacity overage billing feature as a safety net for unanticipated spikes. And implement graceful shutdowns and retry logic to keep long-running jobs resilient during scaling events.
Important
Don't wait until a capacity is maxed out. Maintain a buffer of extra resources and enable the capacity overage billing feature to absorb spikes smoothly. Track usage trends and anticipate growth, proactively adding capacity or distributing workloads across multiple units ensures scaling events are seamless and workloads remain reliable as demand grows.
Monitor metrics and observe proactively
Your monitoring strategy should focus on the places where failures are likely. Track service availability, pipeline success, job runtimes, and capacity usage. Don't ignore dependencies: on-prem gateways, APIs, and internal services can silently become points of failure.
Fabric provides rich telemetry, but pair it with your own health probes and dashboards. Integrate with Azure Monitor. Early alerts give your team time to act before users notice anything. Correlate logs and metrics across systems to spot the root cause, not just the symptom.
For more information about what to monitor, see What is workspace monitoring?
Apply self-preservation techniques
Fabric's built-in resilience is strong, but you can strengthen it further:
- Dedicated capacities: Keep critical workloads isolated from "noisy neighbors".
- Use autoscale billing for highly elastic workloads, such as Spark and data warehousing.
- Apply surge protection at capacity and workspace level.
- Enable the capacity overage billing feature as a safety net for unanticipated spikes.
- Modular workloads: Break large pipelines or datasets into smaller, independent pieces. If one fails, the rest keep running.
- Idempotent operations: Make sure retries don't cause duplicate writes or processing errors.
- Retry logic: Combine exponential backoff with alerting to recover gracefully from transient failures.
- Circuit breakers: Pause requests to a failing dependency to prevent cascading failures.
Implement graceful degradation so that even if some dependencies are unavailable. For example, pipelines can skip non-critical steps or use cached data, and reports can display read-only results with warnings instead of failing completely.
Health checks are the backbone of automated recovery. Schedule regular checks, such as sanity pipelines or API probes, and tie them to recovery actions like restarting pipelines, switching to backup data sources, or scaling capacities. Ensure these checks are themselves reliable and alert if they fail, so the system can self-correct before users notice problems
It's important to emphasize the idea of isolation. Deploy multiple workspaces or capacities, dedicate resources to separate workloads, and avoid creating single points of failure in shared dependencies. This containment strategy ensures localized failures do not ripple across the system, keeping most workloads unaffected and your environment reliably operational.
Plan for disaster recovery
Most Fabric items rely on continuous replication of data instead of traditional backups. OneLake keeps multiple copies and can geo-replicate to paired regions, but asynchronous replication may lead to minor data loss in catastrophic events.
Manual backups are still necessary for critical assets outside OneLake, such as pipelines or archived datasets. Geo-redundancy enables secondary environments ready to take over if the primary region fails. Understand your recovery point objectives (RPO) and recovery time objectives (RTO) and simulate failovers regularly. Practice restores in non-production environments so your team can respond confidently when it matters.
Important
Don't wait for a real incident to find out your plan doesn't work. Simulate outages in a non-production environment, execute failover, then practice failback. Measure recovery times, validate data integrity, and train your team so they can act quickly and confidently when it really counts.
Test reliability
Direct chaos testing inside Fabric isn't possible, but you can validate your environment and dependencies. Simulate adverse conditions: break connectivity to data sources, inject load to trigger throttling, or use Azure Chaos Studio for network latency or downtime in supporting resources.
Planned maintenance windows are also excellent opportunities to test failover and recovery. Measure recovery times, validate data integrity, and refine your procedures before real failures occur.
Next steps
Review the best practices, organized by pillars. Follow the guidance in Security.