Edit

Troubleshooting guide for Spark jobs in Microsoft Fabric

Use this guide to identify and resolve common issues when running Spark jobs in Microsoft Fabric. Each section includes error examples, root causes, and actionable steps to help you recover efficiently.

Note

This guide focuses on Spark execution errors, runtime failures, and job-specific issues. For capacity and throttling errors, permission and authorization errors, session timeout errors, or library installation errors, see Troubleshoot permissions and capacity errors, Fabric notebooks troubleshooting guide, and Manage Apache Spark libraries.

Common Spark job issues at a glance

These are the most common categories of Spark job issues in Fabric, along with their associated error codes. Use this table to quickly navigate to the relevant section for your error. If you tried the relevant steps and the issue persists, see When to contact support.

Section Description
Memory and executor failures Out-of-memory errors (exit code 137), executor crashes, container killed errors, and bad node failures. Includes Spark_Ambiguous_Executor_MaxExecutorFailures, Spark_System_Executor_ExitCode137BadNode, and memory tuning strategies.
INCONSISTENT_BEHAVIOR_CROSS_VERSION Issues after upgrading Spark runtime versions, including changes in behavior, performance, or results.
AnalysisException in Spark SQL query analysis errors, schema mismatches, column resolution failures. Includes ANALYSIS_EXCEPTION, Spark_Ambiguous_SQL_AnalysisException, and Delta Lake analysis exceptions.
Session startup and submit errors Spark session initialization timeouts, SparkSubmit failures, configuration personalization errors. Includes SparkContextInitializationTimedOut, SparkSubmitProcessTimedOut, PersonalizationFailed, and YARN application startup issues.
Storage and connectivity errors ABFS storage access failures, JDBC connection errors, and SQL Server exceptions. Includes Spark_Ambiguous_ABFS_StorageAccountDoesNotExist, Spark_System_ABFS_OperationFailed, Spark_Ambiguous_JDBC_ConnectionFailed, and Spark_Ambiguous_JDBC_SQLServerException.
File and path errors File not found errors, path doesn't exist errors, and incorrect path references. Includes Spark_User_FileInput_FileNotFound and Spark_User_SQL_PathDoesNotExist.
Authentication and token errors Token provider failures, unauthorized access (403), and authentication errors. Includes UNABLE_TO_GENERATE_SESSION_TOKEN_WITH_TOKEN_PROVIDER, Spark_Ambiguous_CustomTokenProvider_Unauthorized, Spark_User_ABFS_Unauthorized, and TOKEN_PROVIDER_USER_ERROR.
Delta Lake and streaming errors Delta Lake data transformation exceptions, streaming query failures, and checkpoint issues. Includes Spark_Ambiguous_DeltaLake_DataTransformationException and Spark_Ambiguous_DeltaLake_StreamingQueryException.
Application code errors User code exceptions including NullPointerException, IllegalStateException, and Python errors. Includes Spark_Ambiguous_UserApp_NullPointer, Spark_Ambiguous_UserApp_JobAborted, Spark_User_NonJvmUserApp_TypeError, Spark_User_UserApp_KeyError, and Spark_User_UserApp_AttributeError.
Library and environment errors Library installation failures, pip errors, conda environment issues, and package dependency conflicts. Includes Spark_User_Conda_PipFailed.
Platform and engine errors Native Execution Engine errors, metastore/Hive exceptions, and platform-level failures. Includes Spark_System_NativeExecutionEngine_InvalidState and Spark_System_MetaStore_HiveException.
NotebookUtils EmptyString Errors related to NotebookUtils returning empty strings when accessing notebook parameters or secrets. Includes Spark_Ambiguous_MsSparkUtils_EmptyString.

Access the Spark UI

The Spark UI is Apache Spark's built-in monitoring interface for viewing detailed execution metrics and logs. While you access it from within the Fabric portal, it opens as a separate browser-based interface that provides low-level diagnostic information about your Spark jobs. Throughout this guide, troubleshooting steps reference specific tabs in the Spark UI to help you identify root causes, such as checking exit codes in the Executors tab, detecting data skew in the Stages tab, or reviewing memory usage in the Storage tab. Access the Spark UI whenever you need to investigate a failed or slow-running Spark job.

To access the Spark UI for your application:

  1. From the left navigation in your Fabric workspace, select the ellipsis (...), then select Monitor to open the Monitor hub.

  2. In the Monitor hub, select the Filter button.

  3. Filter by Item type, and select the type of item you want to view (for example, Notebook).

  4. From the table of activities, select an Activity name to open the activity detail page.

  5. Select the Jobs tab.

  6. Select the Description of a job to open the Spark UI in a new tab.

Key tabs in the Spark UI:

  • Jobs — Shows active and completed Spark jobs.

  • Stages — Shows task-level duration and data size (useful for skew detection).

  • Storage — Shows cached DataFrames and memory usage.

  • Environment — Shows all active Spark configurations.

  • Executors — Shows executor status, memory, and exit codes.

How to access logs

While the Spark UI provides visual insights into job execution patterns and resource usage, you need to download text log files when troubleshooting specific error messages, examining stack traces, or reviewing application output (stdout/stderr). Use logs when you need to see the exact wording of an error, trace a failure through detailed driver or executor logs, or review what your code printed during execution.

To view or download Spark logs (driver logs, executor logs, stdout, stderr):

  • Monitor hub (Logs tab): In the Monitor hub, select Apache Spark applications, select your application, then select the Logs tab. Choose Driver, Livy, or Prelaunch logs from the left panel. Use keyword search or filter by Notebook or Lakehouse for high-concurrency sessions, then select Download log to save locally. Logs might not be available if the job was queued or if cluster creation failed. In that case, check capacity utilization in the Capacity Metrics app.

  • Extended Spark History Server: For completed applications, open the History Server from the application detail page. Use the Diagnosis tab for data skew, time skew, and executor usage analysis. The Executors tab provides per-executor log download. For long-running jobs (over one hour, or executor logs exceeding 16 MB), logs are automatically split into hourly segments for easier navigation.

  • Spark monitoring REST APIs: For programmatic or automated log retrieval, Fabric provides REST APIs for driver logs, executor logs, and application metadata. For more information, see Monitor Spark applications using Spark monitoring APIs.

  • VS Code: When using notebooks in VS Code, select View Recent Runs, select a run, then download logs including stdout, stderr, and Spark driver log.

For detailed instructions on accessing logs, viewing executor rolling logs for long-running jobs, and troubleshooting with logs, see Apache Spark application detail monitoring and Use extended Apache Spark history server to debug and diagnose Apache Spark applications.

Memory and executor failures

Spark MaxExecutorFailures

What does this error mean?

The error code Spark_Ambiguous_Executor_MaxExecutorFailures means your Spark application was terminated because too many executor processes crashed. Spark distributes work across executors; when one crashes, Spark retries it. But if executors keep failing past a threshold, Spark aborts the entire job.

Important

This error is always a symptom, not the root cause. The real question is: why are executors failing?

Typical messages you see:

ExecutorLostFailure (executor N exited caused by one of the running tasks)  
Reason: Container killed on request. Exit code is 137

Max number of executor failures (N) reached

Step 1: Find the exit code

In the Spark UI, select the Executors tab to review the exit codes of failed executors:

Exit Code Meaning Most Likely Cause
137 Killed by OS (SIGKILL) Out of memory: container exceeded its memory limit
143 Terminated (SIGTERM) Timeout, preemption, or node decommission
134 Aborted (SIGABRT) JVM crash or native memory corruption
1 General error User code exception, misconfiguration, or missing dependency
-100 Container preempted/lost The container was preempted or the node was lost

Step 2: Match your scenario

Scenario A — Exit code 137 (out of memory)

What you see: Driver logs show "Container killed on request. Exit code is 137".

Container killed by YARN for exceeding memory limits. 7.1 GB of 7 GB physical memory used.

Why it happens: The data processed by an executor exceeds its total memory (heap + overhead). Common triggers: data skew, large partitions, excessive caching, broadcast joins with large tables, PySpark UDFs, or insufficient disk space for shuffle spill operations.

What to do:

Important

Use %%configure, not spark.conf.set(), for resource configs: Settings for spark.executor.*, spark.driver.*, spark.network.*, and spark.yarn.* are read at session or executor launch and can't be changed mid-session with spark.conf.set(). Place these in a %%configure cell as the very first cell of your notebook (before any other code), or set them in your Fabric Environment. Only spark.sql.* settings (AQE, shuffle partitions, broadcast threshold, rebase modes) can be changed at runtime with spark.conf.set(). The %%configure cell must be the first cell and will restart the session when run.

  • Increase executor memory and overhead:

    spark.conf.set("spark.executor.memory", "<VALUE>")  # Small=4g, Medium=8g, Large=16g, XLarge=28g
    spark.conf.set("spark.executor.memoryOverhead", "<VALUE>") # Small=2g, Medium=4g, Large=6g, XLarge=8g
    
  • Repartition to create smaller, more uniform partitions:

    To choose a value for N, divide your estimated data size by 200 MB as a starting point (for example, 40 GB of data maps to repartition(200)). Aim for 128–256 MB per partition, and verify the actual task input sizes in the Stages tab of the Spark UI.

    df = df.repartition(N)  # Increase N to reduce per-partition size
    
  • Enable Adaptive Query Execution (AQE):

    Adaptive Query Execution is enabled by default in all Fabric runtimes. The useful levers are the sub-settings such as spark.sql.adaptive.skewJoin.enabled for handling skewed joins. If you have a skewed join, enabling AQE allows Spark to automatically detect and handle skew at runtime by splitting large partitions.

    spark.conf.set("spark.sql.adaptive.enabled", "true")  
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
    
  • Reduce caching: only cache DataFrames reused multiple times; call df.unpersist() when done.

  • Disable broadcast for large tables:

    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
    
Scenario B — Exit code 143 (SIGTERM: timeout, scale-down, or preemption)

What you see: Driver logs show "Executor heartbeat timed out after 120000 ms" or "ExecutorLostFailure".

Why it happens: Exit code 143 is SIGTERM, a graceful termination signal. With dynamic allocation (the Fabric default), this code is often normal, because Fabric scales down idle executors by sending SIGTERM. If all your executors exit with 143 and the job completes, no action is needed.

If executors exit with 143 during active work, the cause is usually one of the following:

  • Heartbeat timeout, when an executor is stuck in garbage collection (GC) or processing a large task.
  • Node preemption or decommission.
  • Platform-initiated scale-down.

Investigate further only if the job fails or executors exit with 143 mid-stage.

What to do:

  • Increase heartbeat and network timeouts:

    spark.conf.set("spark.executor.heartbeatInterval", "60s")  
    spark.conf.set("spark.network.timeout", "800s")
    
  • If caused by GC pressure, the real issue is memory. Increase executor memory and overhead, repartition data to create smaller partitions, enable AQE, and reduce caching (see Scenario A for detailed steps).

  • Check if tasks are processing large partitions (repartition to smaller sizes).

Scenario C — Data skew (few executors fail repeatedly)

What you see: Most tasks finish quickly, but a few take far longer and fail. The same executors keep failing.

How to confirm: In the Spark UI, select the Stages tab, select a failed stage, and review the Duration and Input Size columns. If a few tasks have 10×–100× more input than others, you have data skew.

What to do:

  • Enable AQE skew join handling. Adaptive Query Execution is enabled by default in all Fabric runtimes. The useful levers are the sub-settings such as spark.sql.adaptive.skewJoin.enabled for handling skewed joins. Enabling AQE allows Spark to automatically detect and handle skew at runtime by splitting large partitions.

    spark.conf.set("spark.sql.adaptive.enabled", "true")  
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
    
  • Use salting to break up large partitions:

    from pyspark.sql.functions import rand  
    df = df.withColumn("salt", (rand() * N).cast("int"))  
    # Join/group on (key, salt), then aggregate without salt
    
  • Filter or process the heavily skewed key separately.

Scenario D — Storage / connectivity failures

What you see: Driver logs show java.io.IOException: ABFS operation failed, connection refused, HTTP 403/401 errors, or throttling (HTTP 429/503).

What to do:

  • Verify your storage account is accessible and permissions are correct.

  • Check if authentication tokens are still valid. Long-running jobs might see token expiry.

  • If throttled (429/503), reduce parallelism or spread the load over time.

  • Check network security groups / firewall rules.

Scenario E — User code exceptions (exit code 1)

What you see: Executors fail with exit code 1. Driver logs show a stack trace from your application code.

What to do:

  • Read the full stack trace: it points to the exact line of code.

  • Ensure your UDFs handle null values correctly.

  • Verify all required libraries/JARs are available on every executor.

  • Test on a small dataset first to isolate the problem.

Scenario F — PySpark / Pandas UDF crashes

What you see: Executors fail during Python UDF execution. Exit code 137 or messages about "worker exiting".

Why it happens: PySpark runs a separate Python process alongside the JVM. Both share the same node memory.

What to do:

  • Replace Python UDFs with built-in Spark SQL functions wherever possible.

  • Reduce Pandas UDF batch size:

spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")
  • Increase memory overhead:
spark.conf.set("spark.executor.memoryOverhead", "<VALUE>")
Scenario G — Disk space exhaustion during shuffle

What you see: Executors fail with "No space left on device" or "IOException" during shuffle or sort operations.

Why it happens: When Spark can't fit data in memory, it spills to local disk. If the local disk fills up, the executor crashes.

What to do:

  • Reduce the amount of data shuffled: filter early, select only needed columns.

  • Increase the number of shuffle partitions to reduce per-partition size:

spark.conf.set("spark.sql.shuffle.partitions", "400")  # Default is 200
  • Scale up to nodes with more local disk space.

  • Check for data skew — a skewed partition spills disproportionately to one executor's disk.

Configuration quick reference

Memory and resources
Configuration Purpose
spark.executor.memory JVM heap memory per executor
spark.executor.memoryOverhead Off-heap memory for Python, native libs
spark.driver.memory JVM heap memory for the driver
spark.driver.memoryOverhead Off-heap memory for the driver
Failure tolerance
Configuration Purpose
spark.executor.maxNumFailures Max total executor failures before app is killed
spark.executor.failuresValidityInterval Time window for counting failures (default: unlimited)
spark.task.maxFailures Max retries per individual task (default: 4)

Important

Increasing failure tolerance does NOT fix the root cause. It only allows the job to survive more transient failures.

For long-running jobs, set spark.executor.failuresValidityInterval to a time window (for example, "1h"). This makes Spark count only failures within that window, so a job running for many hours won't be killed by occasional transient failures that occurred hours apart.

Network and timeouts
Configuration Purpose
spark.network.timeout General network timeout (default: 120s)
spark.executor.heartbeatInterval Heartbeat frequency (default: 10s)
spark.sql.adaptive.enabled Enables Adaptive Query Execution (dynamically optimizes shuffle partitions and join strategies at runtime). Adaptive Query Execution is enabled by default in all Fabric runtimes. The useful levers are the sub-settings such as spark.sql.adaptive.skewJoin.enabled for handling skewed joins. Enabling AQE allows Spark to automatically detect and handle skew at runtime by splitting large partitions.
spark.sql.shuffle.partitions Partitions after shuffle (default: 200)
Example: Applying via %%configure
%%configure  
{  
"conf": {  
"spark.executor.memory": "<VALUE>",  
"spark.executor.memoryOverhead": "<VALUE>",  
"spark.executor.maxNumFailures": "<VALUE>",  
"spark.network.timeout": "800s",  
"spark.executor.heartbeatInterval": "60s",  
"spark.sql.adaptive.enabled": "true", 
"spark.sql.adaptive.skewJoin.enabled": "true"  
}  
}

In Fabric, some configurations are managed by the platform based on your node size. For the complete list of Spark configuration properties, see Apache Spark Configuration.

Scaling options

Option When to Use
Scale Up (larger nodes) Each executor gets more memory/CPU (reduces OOM risk)
Scale Out (more nodes) Data is spread across more executors (reduces per-executor load)
Optimize first Adding resources to a skewed workload won't help: the oversized partition still lands on one executor

Quick-reference troubleshooting table

Observation Likely Cause First Action
All executors fail with exit code 137 OOM Increase executor memory/overhead; check for data skew
All executors fail with exit code 143 Heartbeat timeout Increase network timeout and heartbeat interval
Only a few executors fail repeatedly Data skew Enable AQE skew join; repartition data
Failures happen on the same node Faulty node Retry the job; if same node fails again, contact support
Failures correlate with I/O operations Storage connectivity Check storage access, firewall, token validity
Failures show user code stack traces Application bug Fix the code: null handling, missing libs
Failures during Python UDF execution Python process OOM Increase memoryOverhead; replace UDFs with SQL functions
Failures with "No space left on device" Disk space exhaustion Increase shuffle partitions; filter early; scale up node size

Exit code 137 / container killed on request

This section covers out-of-memory (OOM) errors in Microsoft Fabric Spark jobs indicated by exit code 137. YARN kills a container when it exceeds its assigned memory limit, producing exit code 137 (SIGKILL). This is the most common OOM signal in Spark.

What does this error mean?

Exit code 137 means YARN's container memory monitor terminated the executor (or driver) container because it exceeded its allocated memory limit. Your Spark application requires more memory than its container was assigned.

Note

The Linux OOM Killer can also produce exit code 137 (when the OS itself runs out of memory), but in Fabric the message "Container killed by YARN for exceeding memory limits" indicates YARN enforced the container limit, not the OS-level OOM Killer.

How container memory is calculated

Each executor runs inside a YARN container whose total memory is:

Container size = spark.executor.memory + spark.executor.memoryOverhead

If the combined memory usage of the JVM heap, off-heap buffers, Python processes, and native libraries exceeds this container size, YARN kills the container (exit code 137).

Fabric nodes come in sizes such as 32 GB, 64 GB, 128 GB, and 512 GB. In the Storage tab of the Spark UI, if Size in Memory approaches your node's total RAM, your application is at risk of OOM.

Important

In Fabric, spark.executor.memoryOverhead is set to a fixed 384 MB regardless of node size, unlike the open-source Spark default of max(384 MB, 0.1 × executor memory). For memory-intensive workloads such as PySpark UDFs, large shuffles, and native libraries, 384 MB is often insufficient. Set spark.executor.memoryOverhead explicitly to a higher value.

For detailed guidance on memory tuning, see Spark Tuning Guide: Memory Management.

Error messages to look for

java.lang.OutOfMemoryError: Java heap space

java.lang.OutOfMemoryError: GC overhead limit exceeded

Container killed on request. Exit code is 137  
Container exited with a non-zero exit code 137  
Killed by external signal

os::commit_memory failed; error='Cannot allocate memory' (errno=12)  
Native memory allocation (mmap) failed to map <N> bytes

Where to check

  • Spark UI, Executors tab: Check for failed executors and their exit codes

  • Spark UI, Storage tab: Check "Size in Memory" relative to your node size

  • Spark UI, Stages tab: Check for skewed tasks (one task processing far more data than others)

  • Driver logs (stderr): Search for OutOfMemoryError, exit code 137, or Cannot allocate memory

Common causes and fixes

1. Driver OOM from collect(), toPandas(), or display()

Symptom: The driver process runs out of memory. Often no Spark tasks are running at the time of the crash.

Cause: These operations pull the entire dataset from executors into driver memory.

What to do:

  • Add .limit(N) before collect() or toPandas() to restrict the rows returned.

  • Use .write to save results to storage instead of collecting to the driver.

  • Use display(df.limit(1000)) instead of display(df).

  • If you must use toPandas(), filter or aggregate the data first.

2. Executor OOM from data skew

Symptom: Most tasks complete quickly, but a few take long and fail with exit code 137.

Cause: Uneven data distribution causes a few executors to process more data than others.

What to do:

  • Identify skewed keys: inspect the Spark UI Stages tab for task duration variance.

  • Use salting to break up large partitions.

  • Enable AQE skew join handling. Adaptive Query Execution is enabled by default in all Fabric runtimes, so the key lever for skew is spark.sql.adaptive.skewJoin.enabled, which lets Spark detect and split large partitions at runtime.

spark.conf.set("spark.sql.adaptive.enabled", "true")  
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
3. Executor OOM from caching too much data

Symptom: Memory usage climbs over time as cached DataFrames accumulate.

Cause: Calling .cache() or .persist() on multiple large DataFrames without releasing them.

What to do:

  • Only cache DataFrames that are reused multiple times.

  • Unpersist when done: df.unpersist().

  • Use MEMORY_AND_DISK storage level instead of MEMORY_ONLY:

from pyspark import StorageLevel  
df.persist(StorageLevel.MEMORY_AND_DISK)
4. Executor OOM from too few partitions

Symptom: Tasks process large amounts of data per partition.

Cause: The DataFrame has too few partitions relative to the data size.

What to do:

  • Repartition to increase parallelism:
df = df.repartition(N)  # Choose N based on your data size
  • Aim for partitions around 128–256 MB each.

  • For writes, use coalesce() only to reduce partitions (never to 1 for large data).

5. Broadcast join OOM

Symptom: Driver or executor OOM during a join operation.

Cause: Spark broadcasts a table that is too large.

What to do:

  • Disable auto-broadcast for large tables:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
  • Or reduce the threshold:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")
6. PySpark UDF / Pandas UDF memory pressure

Symptom: Executor memory spikes during UDF execution. Exit code 137.

Cause: PySpark UDFs run in a separate Python process alongside the JVM executor. Both compete for the same node memory.

What to do:

  • Replace Python UDFs with built-in Spark SQL functions where possible.

  • For Pandas UDFs, reduce the batch size:

spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")
  • Increase memory overhead:
spark.conf.set("spark.executor.memoryOverhead", "<VALUE>")
7. Native Execution Engine off-heap memory pressure

Symptom: Executors fail with exit code 137 even though your workload previously ran without issues, or the OOM occurs on queries that don't seem memory-intensive.

Cause: The Fabric Native Execution Engine enables off-heap memory by default with dynamic sizing. In some cases, this reserves a large portion of off-heap memory even when the native engine isn't actively processing your query, putting pressure on JVM heap memory and causing OOM.

What to do:

  • Try disabling the Native Execution Engine to confirm it is the cause:
spark.conf.set("spark.fabric.nativeExecution.enabled", "false")
  • If the OOM goes away, the native engine's memory allocation was the trigger. Run with it disabled as a workaround while you contact support.

  • If the OOM persists after disabling, the issue is a genuine memory shortage. Apply the other fixes in this section.

8. Driver OOM from large query plans (AQE)

Symptom: The driver crashes with OutOfMemoryError during query planning, not during data processing. The error might include "Required array length ... is too large".

Cause: Adaptive Query Execution (AQE) is enabled by default in all Fabric runtimes. When your query is very complex (many joins, unions, or cached DataFrames), Spark regenerates the query plan text on every plan change. Extremely large plan strings can exceed memory limits.

What to do:

  • Limit the plan string length:
spark.conf.set("spark.sql.maxPlanStringLength", "10000")
  • If the issue persists, disable AQE for this specific job:
spark.conf.set("spark.sql.adaptive.enabled", "false")
  • Simplify the query: break it into smaller steps with intermediate writes to storage.

General tuning options

Option A: Scale up (increase node size)

Increase your Spark pool's node size (for example, from Small to Medium or Large).

Option B: Scale out (add more nodes)

Increase the number of executors/nodes to distribute data across more nodes.

Option C: Reduce concurrent tasks per executor

Each executor runs multiple tasks in parallel (one per core). Reducing the number of concurrent tasks gives each task more memory, which can prevent OOM for memory-heavy operations.

spark.conf.set("spark.executor.cores", "2")  # Default varies by node size

Fewer concurrent tasks means slower throughput but more memory per task. Use this when individual tasks are memory-intensive (large aggregations, complex UDFs).

Option D: Adjust Spark configuration
Configuration Purpose
spark.driver.memory Increase driver heap memory
spark.executor.memory Increase executor heap memory
spark.driver.memoryOverhead Extra off-heap memory for the driver (default: 384 MB)
spark.executor.memoryOverhead Extra off-heap memory for executors (default: 384 MB)
spark.executor.cores Cores per executor (fewer cores = more memory per task)
spark.sql.adaptive.enabled Enables AQE auto-tuning (enabled by default in Fabric)
spark.sql.adaptive.skewJoin.enabled Auto-handle skewed joins (the key lever, since AQE is already on)
spark.sql.autoBroadcastJoinThreshold Control when tables are broadcast
spark.sql.shuffle.partitions Number of partitions after shuffle (default: 200)
spark.sql.maxPlanStringLength Limit query plan string length (prevents driver OOM on complex plans)
Option E: Optimize your code
Pattern to Avoid Better Alternative
df.collect() on large data df.write.parquet(path)
df.toPandas() on large data df.limit(N).toPandas() or save to storage
df.repartition(1) on large data df.coalesce(N) with reasonable N
.cache() everything Only cache DataFrames reused >1 time
Python UDFs Built-in Spark SQL functions
for row in df.collect(): ... Use Spark transformations (such as map or filter)

Spark_System_Executor_ExitCode137BadNode

What does this error mean?

This error code means an executor was killed with exit code 137 (out of memory), and the Fabric platform has identified that the failure occurred on a node that has been flagged as faulty. Unlike a regular exit code 137, this classification indicates the platform detected infrastructure-level problems with the specific node where your executor was running.

Error messages to look for

ExecutorLostFailure Container from a bad node: container_XXXX_0001_01_000046  
on host: vm-XXXXXXXX. Exit status: 137.  
Diagnostics: Container killed on request. Exit code is 137  
Container exited with a non-zero exit code 137.  
Killed by external signal

How is this different from regular exit code 137?

Aspect Exit Code 137 (Regular) ExitCode137BadNode
Root cause Your application exceeded the memory limit A faulty node caused the executor to crash
Whose fault? Typically user code or configuration Typically platform infrastructure
Retry behavior Same failure might recur on any node Retry usually succeeds on a healthy node
Action needed Tune memory, fix skew, optimize code Retry the job; contact support if persistent

What to do

Step 1: Retry your job. The platform typically avoids scheduling work on nodes it has flagged as faulty. In most cases, the next run succeeds on a healthy node.

Step 2: If the error recurs on the same node across multiple retries, contact support with the Spark Application ID and the node information from the Spark UI Executors tab.

Step 3: If the error recurs on different nodes, the root cause might be your application rather than the infrastructure. Check if your workload has genuine OOM issues by reviewing the Exit Code 137 section above.

A single occurrence of this error is usually transient and doesn't require any code changes. The platform automatically manages faulty node detection and removal.

Container from a bad node / exit status: 50

What does this error mean?

This error indicates that a Spark executor container was terminated because it was running on a node that the platform detected as unhealthy or decommissioned. Exit status 50 is a Fabric-specific signal indicating that the container was proactively killed due to node-level issues, not because of your application code.

Error messages to look for

Container from a bad node. Exit status: 50

ExecutorLostFailure (executor N exited caused by one of the running tasks)  
Reason: Container from a bad node. Exit status: 50

Why it happens

The Fabric platform continuously monitors node health. When a node is detected as unhealthy (due to hardware issues, disk failures, network problems, or other infrastructure faults), the platform terminates containers on that node to prevent data corruption or silent failures.

Common triggers include:

  • Hardware issues on the underlying compute node (disk, memory, CPU)

  • Node being decommissioned during a maintenance operation

  • Network connectivity loss between the node and the cluster manager

  • Node failing platform health checks

What to do

Step 1: Retry the job. This error is typically transient. The platform routes subsequent work away from the faulty node, and the next run should succeed.

Step 2: Check how many executors were affected. If only one or two executors failed with exit status 50 and the rest completed normally, Spark's built-in retry mechanism might have already recovered the job automatically.

Step 3: If the job failed because these container losses pushed the total executor failures past the MaxExecutorFailures threshold, increase the failure tolerance to allow more retries:

# Start 10–20 for production; 30–50 for long jobs with spark.executor.failuresValidityInterval="1h"
spark.conf.set("spark.executor.maxNumFailures", "20")  

Important

Increasing the failure tolerance is appropriate here because the root cause is infrastructure, not your code. Unlike OOM errors, allowing more retries for bad-node failures is a valid mitigation.

Step 4: If the error persists across multiple retries or affects many executors in the same run, contact support. Provide the Spark Application ID and the timestamps of the failures.

How to distinguish from OOM (exit code 137)

Signal Exit Code 137 (OOM) Exit Status 50 (Bad Node)
Error message "Container killed on request. Exit code is 137" "Container from a bad node. Exit status: 50"
Root cause Application exceeded memory limit Node infrastructure failure
Pattern Often affects multiple executors or recurs on retry Usually affects 1–2 executors; resolves on retry
Fix Tune memory, fix skew, optimize code Retry the job; increase maxNumFailures if needed

INCONSISTENT_BEHAVIOR_CROSS_VERSION

This error indicates your Spark application is producing different results, failing, or behaving differently after a runtime version change. The same code and data that worked on the previous version now produces unexpected output, errors, or performance degradation.

Fabric runtime compatibility matrix

Component Runtime 1.1 Runtime 1.2 Runtime 1.3
Apache Spark 3.3 3.4 3.5
Java JDK 8 JDK 11 JDK 11
Scala 2.12 2.12 2.12
Python 3.10 3.10 4.4.1
R 4.2 4.2 4.3
Delta Lake 2.2 2.4 3.2

Common categories

Category Examples
Datetime / Timestamp incompatibility Different parsing, Proleptic Gregorian vs Julian calendar
Query result differences Different row counts, values, or column ordering
New errors on existing code ClassNotFoundException, deprecated API removal
Performance regression Same job takes significantly longer
Delta Lake compatibility InvalidProtocolVersionException
Library / dependency mismatch Python package version changes, Scala/Java upgrade

Category A — Datetime and timestamp incompatibility

Why it happens: Spark 3.0+ switched from hybrid Julian/Gregorian to Proleptic Gregorian calendar. Parquet INT96 and datetime formats written with the old behavior might now be misinterpreted. Legacy datetime settings might not propagate correctly in High Concurrency mode.

Step 1: Identify if this affects you. Does your data/workflow involve:

  • Historical dates (pre-1900 or pre-1582)?

  • Parquet files/tables created before a recent upgrade?

  • Failures only in upgraded or high-concurrency environments?

  • Error logs containing INCONSISTENT_BEHAVIOR_CROSS_VERSION or READ_ANCIENT_DATETIME?

Step 2: Set Spark configuration for datetime rebase modes:

spark.conf.set("spark.sql.parquet.int96RebaseModeInRead", "CORRECTED")  
spark.conf.set("spark.sql.parquet.int96RebaseModeInWrite", "CORRECTED")  
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")  
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "CORRECTED")

Important

Validate before production use. Before applying CORRECTED mode to a production pipeline, test on a sample dataset first. Setting CORRECTED on data originally written with LEGACY behavior can cause silent date value shifts for historical dates (pre-1582). Run SELECT MIN(date_col), MAX(date_col) FROM my_table on a sample and compare the results between settings before you commit to a full pipeline run. If the results differ on historical dates, use LEGACY for existing data and plan a migration to CORRECTED for new data.

  • Use "CORRECTED" for new and consistent behavior across environments (recommended).

  • Use "LEGACY" only if you have data written with pre-upgrade runtimes that now fails to read back.

Or via %%configure:

%%configure  
{  
"conf": {  
"spark.sql.parquet.int96RebaseModeInRead": "CORRECTED",  
"spark.sql.parquet.int96RebaseModeInWrite": "CORRECTED",  
"spark.sql.parquet.datetimeRebaseModeInRead": "CORRECTED",  
"spark.sql.parquet.datetimeRebaseModeInWrite": "CORRECTED"  
}  
}

Important

In High Concurrency mode, settings must be applied at the notebook/session level; environment or cluster-wide settings might not propagate.

Step 3: Validate:

  • Rerun failed jobs/notebooks.

  • Verify the setting took effect:

print(spark.conf.get("spark.sql.parquet.datetimeRebaseModeInRead"))

Category B — Scala, Java, or Python version changes

Fabric Runtime Spark Java Scala Python
Runtime 1.1 3.3 JDK 8 2.12.15 3.10
Runtime 1.2 3.4 JDK 11 2.12.17 3.10
Runtime 1.3 3.5 JDK 11 2.12.18 3.11

What to do:

  • Rebuild custom JARs against the new Scala/Spark version. Use provided scope for Spark in Maven/SBT.

  • For ClassNotFoundException with third-party JARs, verify the JAR has the correct Scala suffix (for example, _2.12).

  • For Python ModuleNotFoundError, install missing packages explicitly:

%pip install pandas==2.0.3

Category C — Delta Lake protocol incompatibility

Why it happens: Delta Lake uses protocol versions to track table features. Protocol upgrades are irreversible.

Scenario Result
Enabled Deletion Vectors on Runtime 1.2, read on Runtime 1.1 Fails: Runtime 1.1 doesn't support the protocol
Created table with TimestampNTZ on Runtime 1.2 Requires reader version 3: Runtime 1.1 can't read
Table written externally with writer version 6 Might not be supported by the Fabric Delta runtime

What to do:

  • Move forward, not backward: use a runtime that supports the protocol.

  • Avoid mixing runtimes on the same Delta tables.

  • Check protocol before enabling new features:

DESCRIBE DETAIL my_table

Category D — Spark SQL behavioral changes

Why it happens: Spark versions change default behaviors (ANSI mode, cast rules, null handling).

What to do:

  • If ANSI mode causes stricter behavior:
spark.conf.set("spark.sql.ansi.enabled", "false")
  • For date/time parsing changes:
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
  • For stricter INSERT type checking:
spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")

Using legacy settings is a short-term fix. Plan to update your code for the new behavior.

AnalysisException in Spark

An AnalysisException is thrown during Spark's query analysis phase, before any data is processed. Spark validates your SQL or DataFrame query and checks that all referenced tables, columns, functions, and types exist and are compatible. If something doesn't check out, Spark rejects the query immediately. This is almost always a user-side issue: a typo, a missing table, a schema mismatch, or an unsupported operation. Because it fails early, no compute resources are wasted.

Typical error patterns:

org.apache.spark.sql.AnalysisException: Table or view not found: my_table

org.apache.spark.sql.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION]  
A column or function parameter with name 'NotARealColumn' cannot be resolved.  
Did you mean one of the following? [Revenue, GrossRevenue, Rating, Branch, City]

org.apache.spark.sql.AnalysisException: Data type mismatch: ...

Step 1: Read the error message carefully

The AnalysisException message almost always contains:

  • What failed: the table, column, function, or operation

  • Why it failed: not found, type mismatch, ambiguous reference

  • What was available: the list of valid columns, tables, or types

Example of a column name typo:

AnalysisException: cannot resolve '`salery`' given input columns:  
[employee.name, employee.salary, employee.dept]

The error shows you typed "salery" when the column is actually called "salary".

Step 2: Match your error to a scenario

Scenario A — Table or view not found

Table or view not found: my_table
  • Typo in the table name: double-check spelling and case.

  • Wrong database/schema: use a fully-qualified name:

spark.sql("SELECT * FROM my_catalog.my_schema.my_table")
  • Temp view expired: if the session restarted, the view is gone. Re-create it:
df.createOrReplaceTempView("my_table")
  • Table not yet written—ensure the upstream notebook/cell has completed.

  • Lakehouse not attached—in Fabric, verify the lakehouse is attached to your notebook.

Scenario B — Column not found

[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter  
with name 'X' cannot be resolved.
  • Typo in the column name: compare with the suggestions in the error.

  • Column was renamed or dropped upstream: check the schema:

df.printSchema()
  • Column exists in a different DataFrame: after a join, reference the correct source:
df1.join(df2, df1.id == df2.id).select(df1.id, df2.name)

Scenario C — Ambiguous column reference

[AMBIGUOUS_REFERENCE] Reference 'Quantity' is ambiguous,  
could be: [a.Quantity, b.Quantity]

What to do: Qualify the column with the table alias:

# SQL  
spark.sql("""
  SELECT a.id, b.name  
  FROM table_a a JOIN table_b b ON a.id = b.id  
""")

# DataFrame API  
df1.alias("a").join(df2.alias("b"), col("a.id") == col("b.id")) \
  .select("a.id", "b.name")

For a complete list of available SQL functions, see Spark SQL Built-in Functions.

Scenario D — Data type mismatch

Data type mismatch: differing types in '(col_a = col_b)': int vs string

What to do: Explicitly cast to a common type:

from pyspark.sql.functions import col  
df = df1.join(df2, df1["id"].cast("string") == df2["id_str"])

Scenario E — Function not found

Undefined function: 'my_function'
  • Typo: check the Spark SQL function reference.

  • UDF not registered:

spark.udf.register("my_function", my_function)
  • Function removed in a version upgrade: check the migration guide.

Scenario F — Schema mismatch on write / INSERT

[_LEGACY_ERROR_TEMP_DELTA_0007] A schema mismatch detected  
when writing to the Delta table...

What to do:

  • Check what the target expects:
spark.sql("DESCRIBE my_table").show()
  • Check what you're writing:
df.printSchema()
  • Align columns and types:
df = df.select("col_a", "col_b", "col_c")  
df = df.withColumn("col_a", col("col_a").cast("int"))
  • For Delta schema evolution:
df.write.format("delta") \
  .option("mergeSchema", "true") \
  .mode("append") \
  .save("/path/to/table")

Scenario G — Delta Lake AnalysisException

Error Cause Fix
Cannot write to table that requires reader/writer version N Delta protocol incompatibility Use a runtime that supports the required protocol
A schema mismatch detected when writing to the Delta table New data has extra/missing columns Enable schema merging or fix the schema
Incompatible format detected Writing to a Delta path with non-Delta format Ensure the target path is a Delta table
Operation not allowed: can't change partition columns Trying to alter partitioning Create a new table with the desired partitioning

For more information about Delta Lake schema evolution, table features, and protocol versions, see Delta Lake Documentation.

Scenario H — Path / file not found

Path does not exist: abfss://container@account.dfs.core.windows.net/my/path
  • Typo in the path: double-check container name, storage account, and file path.

  • File was deleted or moved: verify the file exists in your lakehouse/storage explorer.

  • Wrong storage account or workspace.

  • Permissions issue: the error sometimes shows "path not found" when it's actually "access denied."

Important

In Fabric notebooks, reading from the nbresource folder with Spark isn't supported. Use Python file I/O (open()) instead of spark.read for notebook resource files. Use .save() instead of .saveAsTable() when writing to an explicit path.

Scenario I — Unsupported operation

Unsupported operation: ALTER TABLE ADD COLUMNS ... for non-Delta tables

What to do: Check if the feature requires Delta format. Convert if needed:

from delta.tables import DeltaTable  
DeltaTable.convertToDelta(spark, "parquet.`/path/to/table`")

Debugging techniques

  • Print the Schema:
df.printSchema()  
spark.sql("DESCRIBE EXTENDED my_table").show(truncate=False)
  • List Available Tables:
spark.sql("SHOW TABLES").show()
  • List Available Columns:
spark.sql("DESCRIBE my_table").show()
  • Test Queries Incrementally — build step by step:
spark.sql("SELECT * FROM my_table LIMIT 5").show()  
spark.sql("SELECT col_a, col_b FROM my_table LIMIT 5").show()
  • Check Spark Configuration:
for k, v in sorted(spark.sparkContext.getConf().getAll()):  
    print(f"{k} = {v}")

Quick-reference troubleshooting table

Error Message Contains Likely Cause First Action
Table or view not found Missing table or wrong database Check spelling; use fully-qualified name
can't resolve + column name Missing or misspelled column Run df.printSchema() or DESCRIBE table
Reference ... is ambiguous Duplicate column name after join Qualify with table alias: a.id
Data type mismatch Incompatible types in comparison Cast columns to a common type
Undefined function Missing or unregistered UDF Check spelling; register UDF if custom
Cannot write incompatible data Schema mismatch on write Compare source/target schemas; cast/select
Path doesn't exist Wrong file path or deleted file Verify path in storage explorer
Cannot safely cast Strict type checking on INSERT Cast column explicitly before writing
DeltaAnalysisException Delta-specific schema/protocol issue See Delta Lake section above
Unsupported operation Feature not available for table format Check if Delta format is required

Session startup and submit errors

SparkContextInitializationTimedOut

Error: Spark_Ambiguous_ApplicationMaster_SparkContextInitializationTimedOut

Why it happens: The Spark context (driver) failed to initialize within the timeout period. Causes include insufficient cluster resources, network issues during startup, or custom library installation taking too long.

What to do:

  • Check if your cluster has sufficient resources—if other jobs are consuming capacity, wait or use a dedicated pool.

  • Review custom library/environment configurations—large or numerous libraries slow down initialization.

  • Check for network connectivity issues (virtual network configuration, private endpoints).

  • Remove or reduce custom library dependencies to isolate the issue.

SparkSubmit errors

Error Meaning
SparkSubmitProcessTimedOut spark-submit took too long to start the application
SparkSubmitProcessFailedExitCode1 spark-submit exited with error (bad config, missing JAR)
SparkSubmitProcessFailedExitCode143 spark-submit was killed (resource limit or platform timeout)
PersonalizationFailed Custom environment/library setup failed
ConfigPersonalizationFailed Custom Spark configuration failed to apply

What to do:

  • Timed out: Check if large custom libraries are causing slow environment setup. Reduce library count/size.

  • Exit code 1: Check driver logs for the actual error—typically misconfiguration or missing dependency.

  • Exit code 143: Process was killed—could be resource exhaustion. Retry; if persistent, contact support.

  • Personalization failed: Review your custom environment definition. Try removing custom packages one by one.

  • Config personalization failed: Check that Spark configuration keys are valid. Some configs are read-only in managed environments.

YARN application — KilledByTrustedServiceUser

Error: Spark_System_YARNApplication_KilledByTrustedServiceUser

What it means: Your Spark session failed during startup — the YARN application was killed before your code began executing. Exit code is typically 13.

Scenario 1 — Invalid Spark configuration

Why it happens: An incorrect or unsupported Spark configuration was passed, causing the session to crash on startup.

Common examples:

  • spark.rpc.message.maxSize set with a unit suffix (for example "512m") instead of a plain integer

  • spark.rpc.message.maxSize set above the 2047 MB maximum

  • spark.network.timeout set to a value smaller than spark.executor.heartbeatInterval

What to do:

  • Review all custom Spark configurations in your notebook %%configure cell or environment settings.

  • Remove any recently added config keys and re-run.

  • Ensure numeric configs use the expected units (some expect milliseconds, some expect plain numbers).

%%configure  
{  
"conf": {  
"spark.rpc.message.maxSize": "256",  
"spark.network.timeout": "800s",  
"spark.executor.heartbeatInterval": "60s"  
}  
}

Scenario 2 — ClassNotFoundException

Why it happens: A required Java/Scala class couldn't be found during session initialization. This can happen if a custom JAR is missing, corrupted, or built for a different Spark/Scala version.

What to do:

  • Check your custom JARs—are they compiled for the correct Spark and Scala version (for example, Spark 3.4 / Scala 2.12)?

  • If you recently added a library to the environment, remove it and retry.

  • Search driver logs for ClassNotFoundException to identify the missing class.

  • If the missing class belongs to Spark/Fabric internals (org.apache.spark.*)—retry; if it persists, contact support.

Scenario 3 — UnknownHostException (transient)

Why it happens: A transient DNS resolution failure during session startup. The cluster resource manager was briefly unreachable.

What to do:

  • Retry the job. This error is typically transient and resolves on the next attempt.

  • If it recurs repeatedly on the same Spark pool, contact support.

Scenario 4 — Container allocation failure

Why it happens: The cluster couldn't allocate containers for your application—usually due to resource exhaustion on the underlying infrastructure.

What to do:

  • Retry the job after a few minutes.

  • If you're running many concurrent sessions on the same pool, try reducing concurrency or scaling the pool.

  • If the error persists across multiple retries, contact support—this might indicate an infrastructure capacity issue.

Storage and connectivity errors

ABFS StorageAccountDoesNotExist

Error: Spark_Ambiguous_ABFS_StorageAccountDoesNotExist

Why it happens: The specified Azure storage account doesn't exist or isn't accessible.

What to do:

  • Verify the storage account name is spelled correctly:
abfss://<container>@<storage_account>.dfs.core.windows.net/<path>
  • Confirm the storage account exists in the Azure portal (it might have been deleted or renamed).

  • Check that the storage account isn't behind a firewall that blocks your Spark cluster.

  • Verify you have the correct permissions (Storage Blob Data Reader/Contributor) on the account.

ABFS storage operation failed

Error: Spark_System_ABFS_OperationFailed

What it means: An Azure Blob File System (ABFS) storage operation failed. This typically points to a storage connectivity, permission, or networking issue rather than a Spark code error.

Why it happens: Your request was denied because it didn't comply with private link settings. This occurs when Spark tries to access storage through a private endpoint that isn't properly configured.

What to do:

  • Verify that your workspace's private link and managed virtual network settings are correctly configured.

  • Ensure the private endpoint DNS records are intact and resolving correctly.

  • If using managed virtual network, confirm Data Exfiltration Protection (DEP) is enabled consistently.

Scenario 2 — 403 authorization / SAS failure

Why it happens: The generated SAS token or authorization header is invalid or expired, causing a 403 Forbidden error.

Example error messages:

  • "Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly."

  • "AuthorizationPermissionMismatch" with HTTP 403

What to do:

  • If the storage account recently changed keys or access policies, ensure the Fabric workspace connection is updated.

  • Verify that the Lakehouse or warehouse shortcut has valid credentials.

  • If using a service principal, confirm it has the Storage Blob Data Contributor role on the target storage account.

  • Retry—token generation issues can be transient.

If the error includes "AccessDeniedException" on system staging paths (for example, _system/artifacts/), this is typically a platform-level issue. Retry first; if it persists, contact support.

Scenario 3 — Storage account connectivity

Why it happens: The Spark cluster can't reach the storage account due to firewall rules, virtual network restrictions, or the storage account being in a different region.

What to do:

  • Check that the storage account firewall allows access from "Trusted Microsoft services".

  • If using private endpoints, verify DNS resolution from within the workspace's virtual network.

  • Confirm the storage account exists and hasn't been deleted or renamed.

JDBC connection failed

Error: Spark_Ambiguous_JDBC_ConnectionFailed

Why it happens: The JDBC connection to the external database failed.

What to do:

  • Verify connection parameters: host, port, database name, username, password.

  • Test connectivity from outside Spark (for example, Python pyodbc) to isolate whether it's a Spark or network issue.

  • Check firewall rules — does the database allow connections from your Spark cluster's IP range?

  • Verify the database server is running and accepting connections.

  • Check JDBC driver version compatibility.

JDBC SQLServerException

Error: Spark_Ambiguous_JDBC_SQLServerException

Why it happens: A SQL Server-specific error occurred during a JDBC operation.

What to do:

  • Read the SQL Server error code in the stack trace.
Error Code Meaning Fix
18456 Login failed Check username/password
233 Connection closed Check firewall, server availability
1205 Deadlock victim Retry the operation; reduce parallelism
8115 Arithmetic overflow Check data types and values
  • Verify SQL Server permissions—your login needs appropriate rights.

  • For timeout errors, increase the query timeout:

df = spark.read.format("jdbc") \
    .option("queryTimeout", "300") \
    .option("url", url) \
    .load()

File and path errors

FileInput — FileNotFound

What does this error mean?

The error code Spark_User_FileInput_FileNotFound means your Spark job tried to read a file or directory that doesn't exist at the specified path. This is a user error — the path you provided is either incorrect, the file was deleted, or it hasn't been created yet.

Error messages to look for

org.apache.spark.sql.AnalysisException: Path does not exist: abfss://...

java.io.FileNotFoundException: No such file or directory

Input path does not exist: abfss://container@account.dfs.core.windows.net/...

Common causes and fixes

Incorrect path or typo
  • Double-check the container name, storage account, and file path for typos.

  • Verify the path exists using NotebookUtils:

notebookutils.fs.ls("abfss://container@account.dfs.core.windows.net/folder/")
File not yet created by upstream job
  • If your notebook depends on output from another pipeline or job, ensure the upstream job completed successfully before this job runs.

  • Add a dependency or checkpoint in your pipeline to wait for the file.

File was deleted or moved
  • Check if a retention policy, cleanup job, or another user deleted the file.

  • For Delta tables, check the transaction log to see if files were removed by VACUUM.

Partition path does not exist
  • When reading partitioned data, ensure the partition filter matches existing partitions:
df = spark.read.parquet("abfss://.../data/").where("date = '2024-01-15'")
  • List available partitions:
notebookutils.fs.ls("abfss://.../data/")
Case sensitivity
  • ABFS paths are case-sensitive. Ensure the casing matches exactly.

SQL — PathDoesNotExist

What does this error mean?

The error code Spark_User_SQL_PathDoesNotExist means a Spark SQL query referenced a path (table location, view, or external data source) that can't be found. This typically occurs when a table's underlying storage path has changed or been removed.

Common causes and fixes

Table's underlying storage was deleted or moved
  • The table metadata points to a path that no longer exists. Recreate the table or update its LOCATION.
-- Check the table location  
DESCRIBE EXTENDED schema_name.table_name

-- Recreate pointing to correct path  
CREATE TABLE schema_name.table_name USING DELTA LOCATION "abfss://..."
Workspace or Lakehouse was renamed
  • Renaming a workspace can break paths that were hardcoded. Use relative paths or notebookutils.fs to resolve paths dynamically.

Important

This is a common real-world trap specific to Fabric. If you recently renamed your workspace and tables stopped working, this is likely the cause.

Cross-workspace access without correct path
  • When accessing tables in another workspace, use the full ABFS path:
spark.read.format("delta").load("abfss://container@account.dfs.core.windows.net/Tables/tablename")
Shortcut or mount point broken
  • If using OneLake shortcuts, verify the shortcut target still exists and the connection is valid.

WASB — NoCredentials

Important

WASB (Windows Azure Storage Blob) is a legacy protocol. The primary fix is to migrate to ABFS (Azure Blob File System) paths using the abfss:// scheme, which is the modern and supported approach in Fabric. Use the steps below only if migration isn't immediately possible.

Error messages to look for

Spark_User_WASB_NoCredentials

No credentials found for account <storage_account>.blob.core.windows.net

WASB authorization failed

Resolution steps

1. Migrate to ABFS (recommended)

  • Convert your paths from wasb[s]:// to abfss://:
# Old (WASB): wasbs://container@account.blob.core.windows.net/path

# New (ABFS): abfss://container@account.dfs.core.windows.net/path

2. If migration is not possible, configure the storage account key

spark.conf.set("fs.azure.account.key.<account>.blob.core.windows.net", "<key>")

Important

Storing account keys in notebook code is a security risk. Use Fabric connections or Azure Key Vault instead.

Authentication and token errors

CustomTokenProvider Unauthorized

Error: Spark_Ambiguous_CustomTokenProvider_Unauthorized

Why it happens: The custom token provider encountered an authorization failure.

What to do:

  • Verify that authentication credentials are correct and not expired.

  • Check that the service principal / managed identity has the required role assignments.

  • Ensure OAuth tokens haven't expired (long-running jobs might outlast token lifetimes).

  • Review Microsoft Entra audit logs for specific authorization failures.

Unable to generate session token

Error: UNABLE_TO_GENERATE_SESSION_TOKEN_WITH_TOKEN_PROVIDER

What it means: Fabric couldn't generate the authentication token required to start your Spark session. The session fails before any user code executes.

What to do:

  • Verify that your Fabric workspace capacity is active and not paused.

  • Check that your Microsoft Entra tenant isn't experiencing authentication issues.

  • If you're using a service principal or managed identity, confirm it has the correct role assignments on the workspace.

  • Try opening a new browser session or clearing cached credentials.

  • If the error is intermittent, retry—token generation can have transient failures.

  • If persistent, check the Fabric admin portal for any capacity or tenant-level issues, then contact support.

ABFS unauthorized (403)

What does this error mean?

The error code Spark_User_ABFS_Unauthorized means your Spark job received a 403 Forbidden response when trying to access Azure Blob File System (ABFS) storage. Your identity or service principal doesn't have the required permissions.

Error messages to look for

Operation failed: "This request is not authorized to perform this operation using this permission."

StatusCode=403, ErrorCode=AuthorizationPermissionMismatch

StorageRequestFailedException: Status code: 403

Common causes and fixes

Missing Storage Blob Data role
  • Your Fabric identity needs at least Storage Blob Data Reader (for reads) or Storage Blob Data Contributor (for writes) on the storage account.

  • In the Azure portal, go to your storage account, select Access Control (IAM), and then select Add role assignment.

SAS token expired or insufficient permissions
  • If using a SAS token, check the expiry date and ensure it has the correct permissions (read, write, list).
Firewall or Private Endpoint blocking access
  • If the storage account has firewall rules, ensure the Fabric workspace IP ranges are allowed.

  • For Private Link, ensure the private endpoint is correctly configured and approved.

OneLake access not properly configured
  • For cross-tenant or cross-workspace access, verify sharing settings and permissions in Fabric admin.

Token provider user error

What does this error mean?

The error code TOKEN_PROVIDER_USER_ERROR means the token provider configured for your Spark session returned an error when trying to obtain an access token. This prevents your job from authenticating to downstream services.

Common causes and fixes

Service principal credentials expired
  • If using a service principal, check that the client secret hasn't expired.

  • Renew the secret in Microsoft Entra ID and update the configuration in Fabric.

Incorrect tenant, client ID, or client secret
  • Verify the values in your token provider configuration match the Microsoft Entra ID app registration.
  • Ensure the service principal has been granted the required API permissions and admin consent has been provided.
Linked service or connection misconfigured
  • If using a Fabric connection or linked service, recreate it and test the connection.

Delta Lake and streaming errors

DeltaLake DataTransformationException

Error: Spark_Ambiguous_DeltaLake_DataTransformationException

The full error code might include a user application exception class name rather than a Fabric-specific code.

Why it happens: A data transformation error occurred while processing data for a Delta Lake operation.

What to do:

  • Examine the full stack trace—it usually identifies the specific column or transformation that failed.

  • Check for data quality issues: null values in non-nullable columns, values exceeding column constraints.

  • Verify source data schema matches the target Delta table schema:

df.printSchema()  
spark.sql("DESCRIBE EXTENDED target_table").show(truncate=False)
  • Add data validation before the write operation:
df = df.filter(col("required_col").isNotNull())  
df = df.withColumn("col_a", col("col_a").cast("expected_type"))

Streaming query exception

Error: Spark_Ambiguous_DeltaLake_org.apache.spark.sql.streaming.StreamingQueryException

Why it happens: An exception occurred during Spark structured streaming operations.

What to do:

  • Read the full stack trace: it wraps the actual root cause (OOM, storage error, schema mismatch).

  • Verify source and sink availability.

  • Check streaming checkpoints are valid and accessible.

  • If the checkpoint is corrupted, you might need to restart the stream from scratch.

  • For OutOfMemoryError inside a streaming query, see the Memory Issues section.

Application code errors

UserApp NullPointerException

Error: Spark_Ambiguous_UserApp_NullPointer

Why it happens: A NullPointerException occurred in the user's application code.

What to do:

  • Read the full stack trace: identify whether the null pointer is in your code or in a Spark internal component.

  • Common causes:

  • Null values in DataFrame columns passed to UDFs:

@udf(returnType=StringType())  
def safe_upper(x):  
    return x.upper() if x is not None else None
  • Filter nulls before processing:
df = df.filter(col("my_col").isNotNull())
  • Avoid referencing non-serializable objects inside Spark transformations.

UserApp IllegalStateException

Error: Spark_Ambiguous_UserApp_IllegalStateException

Why it happens: An IllegalStateException occurred in the user's application code. An operation was called at an invalid time or in an invalid state.

What to do:

  • Read the stack trace to identify the exact location.

  • Don't call spark.stop() mid-notebook.

  • Avoid sharing mutable state across Spark tasks.

  • Spark iterators can only be traversed once—don't reuse them.

UserApp JobAborted

Error: Spark_Ambiguous_UserApp_JobAborted

Why it happens: A Spark job was aborted, typically because a stage failed after exhausting retries.

What to do:

  • Review the cause inside the "SparkException: Job aborted" message—it wraps the real error.

  • Common wrapped errors: TaskFailedException, FetchFailedException, FileNotFoundException.

  • In the Spark UI, select the Stages tab, select the failed stage, and review the task failure reason.

Non-JVM user app failures

Errors: Spark_Ambiguous_NonJvmUserApp_ExitWithStatus1, Spark_Ambiguous_NonJvmUserApp_FailedContainerLaunch

Why it happens: A Python, R, or other non-JVM application failed to start or exited with an error.

What to do:

  • ExitWithStatus1: Check driver logs (stderr) for the Python/R stack trace—SyntaxError, ModuleNotFoundError, and similar errors.

  • FailedContainerLaunch: Incompatible or corrupted custom library, or resource constraints.

  • Test your code locally or in a minimal notebook first.

  • Remove custom libraries one by one to isolate the issue.

UserApp ClassNotFound

What does this error mean?

The error code Spark_User_UserApp_ClassNotFound means your Spark job tried to load a Java/Scala class that doesn't exist in the classpath. This is typically caused by a missing library, incorrect import, or a version mismatch.

Common causes and fixes

Missing JAR dependency
  • Upload the required JAR to your Fabric environment or attach it to the session:
%%configure  
{"jars": ["abfss://container@account.dfs.core.windows.net/libs/my-lib.jar"]}
Incorrect class name or package path
  • Verify the fully-qualified class name matches the library version you're using.
Library version mismatch
  • The class might exist in a different version of the library. Check which version is installed:
# Check installed libraries  
spark.sparkContext.getConf().get("spark.jars")
Fat JAR not built correctly
  • If using a fat/uber JAR, ensure all transitive dependencies are included.

  • Check the JAR contents:

# From a terminal  
jar tf my-app.jar | grep ClassName

NonJvmUserApp TypeError

What does this error mean?

The error code Spark_User_NonJvmUserApp_TypeError means your PySpark code raised a Python TypeError exception. This occurs when an operation is applied to an object of an inappropriate type.

Common causes and fixes

  • Check for type mismatches in UDF return types—ensure your UDF return type annotation matches the actual return value.

  • Verify DataFrame column types before operations like joins, filters, or aggregations.

  • Use explicit type casting when needed:

from pyspark.sql.functions import col  
df = df.withColumn("amount", col("amount").cast("double"))
  • Check for None/null handling—PySpark UDFs receiving null values might cause TypeErrors if not handled.

UserApp KeyError

What does this error mean?

The error code Spark_User_UserApp_KeyError means your PySpark code raised a Python KeyError exception, typically when accessing a dictionary with a key that doesn't exist.

Common causes and fixes

  • Use .get() with a default value instead of direct dictionary access:
# Instead of: value = my_dict[key]  
value = my_dict.get(key, default_value)
  • Check for column name changes—if the upstream data schema changed, a previously valid key might no longer exist.

  • Add error handling in UDFs:

def safe_lookup(key):  
    try:  
        return lookup_dict[key]  
    except KeyError:  
        return None

UserApp AssertionError

What does this error mean?

The error code Spark_User_UserApp_AssertionError means your code raised a Python AssertionError. This happens when an assert statement fails, indicating a condition your code expected to be true was false.

Common causes and fixes

  • Review your assert statements—the condition being checked isn't met at runtime:
# This will raise AssertionError if df is empty  
assert df.count() > 0, "DataFrame is empty"
  • Add proper error handling instead of relying on assertions:
if df.count() == 0:  
    raise ValueError("No data to process")
  • Check for data quality issues—assertions often guard data integrity assumptions that might fail with new data.

UserApp AttributeError

What does this error mean?

The error code Spark_User_UserApp_AttributeError means your PySpark code tried to access an attribute or method that doesn't exist on an object.

Common causes and fixes

  • Check for API changes between Spark versions—a method that existed in Spark 3.3 might be renamed or removed in Spark 3.5. See Spark SQL Migration Guide for version-specific breaking changes.

  • Verify the object type. A common mistake is calling DataFrame methods on a Row, string, or None:

# Wrong: df.collect() returns a list, not a DataFrame  
result = df.collect()  
result.show() # AttributeError!  
  
# Correct:  
result = df.collect() # This is a list  
df.show() # Call show() on the DataFrame
  • Check for None values—calling methods on None objects causes AttributeError.

For the complete DataFrame API reference, see PySpark DataFrame API.

Library and environment errors

Conda PipFailed — library installation failure

What does this error mean?

The error code Spark_User_Conda_PipFailed means that a library installation (via pip or conda) failed during environment setup for your Spark session. Fabric creates a custom environment based on your configuration, and this error occurs when that setup fails.

Common causes and fixes

Package does not exist on PyPI/Conda

Verify the package name and version are correct:

  • Check on PyPI: https://pypi.org/project/<package-name>/
  • Or run locally: pip install <package-name>==<version>
Version conflict with pre-installed packages
  • Fabric environments come with pre-installed packages. Your requested version might conflict.

  • Check the Fabric runtime release notes for the list of pre-installed packages and their versions.

  • Try removing the version pin to let pip resolve a compatible version.

Package requires system-level dependencies
  • Some Python packages require C libraries or system packages that are not available in the Fabric environment.

  • Use pre-compiled wheels when possible, or choose a pure-Python alternative.

Network connectivity issue
  • If using a private endpoint or firewall, ensure the Fabric environment can reach PyPI or your private package feed.
Custom environment configuration error
  • Review your environment.yml or requirements.txt for syntax errors.

  • Test your environment locally before deploying to Fabric.

Platform and engine errors

Native Execution Engine — InvalidState

Error: Spark_System_NativeExecutionEngine_InvalidState

What it means: The Fabric Native Execution Engine encountered an internal error and couldn't process your query. This is a platform-level issue, not a code error.

What to do:

  • Retry the job—transient invalid state errors often resolve on the next run.

  • If the error is reproducible, try disabling the native execution engine to confirm it is the cause:

%%configure  
{  
"conf": {  
"spark.native.enabled": "false"  
}  
}
  • If disabling the native engine resolves the issue, your query hit an unsupported edge case. Run with it disabled as a workaround while you contact support.

  • Include the full error message and the query/code that triggered it in your support ticket.

Note

The Native Execution Engine accelerates many common operations but doesn't yet support all Spark SQL features. Complex UDFs, certain data types, or unusual query patterns might fall back to the standard JVM engine or fail.

MetaStore — HiveException

Error: Spark_System_MetaStore_HiveException

What it means: The Spark metastore (Hive-compatible catalog) encountered an error while processing a table or database operation.

Common causes

Cause Example Error Snippet
Table metadata corrupted or missing HiveException: Unable to fetch table ... Table not found
Concurrent DDL operations on the same table HiveException: ... lock acquisition timed out
Incompatible schema evolution HiveException: Unable to alter table ... column type mismatch
Catalog connectivity timeout HiveException: ... connection refused / read timed out

What to do:

  • Retry the job—catalog connectivity timeouts are often transient.

  • Avoid running concurrent DDL (ALTER, DROP, CREATE) on the same table from multiple notebooks.

  • If schema changes were recently applied, verify the table schema:

DESCRIBE EXTENDED my_database.my_table
  • If the table appears corrupted, try recreating it from the underlying data:
-- For Delta tables  
CREATE TABLE my_table USING DELTA LOCATION 'abfss://...'  
  
-- For Parquet tables  
CREATE TABLE my_table USING PARQUET LOCATION 'abfss://...'
  • If the error mentions "lock acquisition", wait a few minutes and retry—another session might be holding a metadata lock.

NotebookUtils EmptyString

Error: Spark_Ambiguous_MsSparkUtils_EmptyString

Why it happens: A notebookutils function received an empty string where a value was expected.

What to do:

  • Check that all parameters passed to notebookutils functions are non-empty:

    # Incorrect  
    notebookutils.fs.ls("")  
    
    # Correct  
    notebookutils.fs.ls("abfss://container@account.dfs.core.windows.net/path")
    
  • Verify variables are initialized and non-empty before use.

  • If using notebook parameters, ensure default values are provided.

When to contact support

If you've tried the relevant self-help steps and the issue persists, open a support ticket with:

  • Spark Application ID (for example, application_XXXXX_YYYY)

  • The exact error code and message from the Spark UI or driver logs

  • Full stack trace (copy from driver logs stderr)

  • Spark UI screenshots—Executors tab, Stages tab, Storage tab

  • Your Spark configuration—node size, node count, runtime version, any custom spark.conf.set() values

  • Approximate data size being processed

  • Whether the issue is reproducible, intermittent, or new (was it previously working?)

  • Any recent changes to data, code, environment, or runtime version

  • The SQL query or code that caused the error (if applicable)

For more information on monitoring and instrumenting Spark applications, see Spark Monitoring & Instrumentation.