Cannot create a spark session on Apache Spark Pool on Azure Synapse Analytics

Prashant 0 Reputation points
2024-11-25T10:25:09.9766667+00:00

We create Spark sessions on an Apache Spark pool in Azure Synapse Analytics. One day, however, no sessions could be created and all of our requests timed out (after 10 minutes, as specified in our application code).

This had been working fine for the past month. The failures continued for 2-3 hours and only stopped after we scaled the Spark pool up and down.

This was the message that came up when we checked the error details on the session:

Error details: This application failed due to the total number of errors: 1.

Error code 1

LIVY_SERVER_NOT_RESPONDING

Message: Failed to send request to Livy due to exception=[System.Threading.Tasks.TaskCanceledException].

Source: Dependency

Azure Synapse Analytics

1 answer

  1. Vinodh247 25,291 Reputation points MVP
    2024-11-25T16:57:30.8+00:00

    Hi,

    Thanks for reaching out to Microsoft Q&A.

    The issue you encountered with LIVY_SERVER_NOT_RESPONDING on your Azure Synapse Apache Spark Pool is likely due to a temporary infrastructure issue or resource saturation. Here’s a structured approach to diagnose, mitigate, and prevent such issues in the future:

    Root Cause Analysis

    **Infrastructure Issue:**

    - Spark pools are hosted on shared infrastructure; transient issues can arise from backend resource availability or service health.
    - The fact that scaling up/down temporarily resolved it suggests a backend refresh cleared the fault.

    **Resource Saturation:**

    - If multiple users or processes were using the Spark pool simultaneously, it could have exceeded available resources, causing the Livy server (responsible for managing Spark sessions) to stop responding.

    **Service Health Issue:**

    - Azure Synapse services may experience temporary disruptions. Check the **Azure Service Health** dashboard for any ongoing issues during the failure window.
      

    Steps to Mitigate and Resolve

    **Scale Pool Up/Down:**

    - As you already noted, scaling the pool up and down resolves the issue by redeploying its resources. However, this is a temporary fix.

    **Check Service Health:**

    - Visit Azure Service Health to confirm whether there were service-wide issues at the time.

    **Resource Monitoring:**

    - Use Azure Synapse monitoring tools to check your Spark pool's usage metrics, including CPU, memory, and executor usage. This can reveal whether resource saturation was the cause.

    **Session and Pool Logs:**

    - Review Livy and Spark pool logs for specific error details or patterns:
      - Navigate to **Monitoring > Apache Spark Applications** in Synapse Studio.
      - Check the logs for the failing sessions to identify whether they failed due to resource or backend connectivity issues.

    Best Practices to Prevent Recurrence

    **Configure Auto-Scale for Spark Pools:**

    - Set up auto-scale on your Spark pool to handle fluctuating workloads dynamically. This ensures the pool scales up during high demand and scales down when idle.

    **Increase Timeout in Application Code:**

    - If applicable, increase the timeout for Spark session creation beyond 10 minutes, as complex workloads or transient resource delays may require more time.
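    For example, the single 10-minute wait can be replaced with a longer overall deadline and retries with backoff. This is a minimal sketch: `create_session` is a placeholder for whatever call your application already makes to start a session (e.g. a Livy REST POST), not a Synapse API, and the delay values are assumptions to tune.

```python
import time

def backoff_delays(base_seconds=30, factor=2, max_delay=300, attempts=5):
    """Yield capped exponential backoff delays in seconds: 30, 60, 120, 240, 300."""
    delay = base_seconds
    for _ in range(attempts):
        yield min(delay, max_delay)
        delay *= factor

def create_session_with_retry(create_session, overall_timeout=1800):
    """Retry session creation until it succeeds or overall_timeout seconds elapse."""
    deadline = time.monotonic() + overall_timeout
    last_error = None
    for delay in backoff_delays():
        try:
            # create_session is your own session-creation callable (placeholder).
            return create_session()
        except Exception as exc:  # e.g. a timeout talking to the Livy endpoint
            last_error = exc
            if time.monotonic() + delay > deadline:
                break
            time.sleep(delay)
    raise TimeoutError(f"no Spark session within {overall_timeout}s") from last_error
```

    Retrying with backoff rides out a transient backend fault instead of failing on the first slow response, while the overall deadline still bounds how long your application waits.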
      
    **Optimize Spark Jobs:**

    - Analyze the Spark jobs running on the pool to optimize resource utilization. This can help prevent bottlenecks caused by inefficient code.

    **Isolate Workloads:**

    - If multiple teams or workloads share the same Spark pool, consider creating separate pools for high-priority or critical workloads.

    **Livy Server Health Monitoring:**

    - Regularly test the Livy server's responsiveness by programmatically sending simple requests to check that it is functional.
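    As a rough sketch, such a probe might list sessions on the Synapse Livy endpoint. The URL shape below follows the documented Synapse Livy REST API, but the api-version, workspace name, pool name, and token acquisition are assumptions to replace with your own.

```python
import urllib.error
import urllib.request

# URL shape per the Synapse Livy REST API; api-version may differ in your setup.
LIVY_SESSIONS_URL = (
    "https://{workspace}.dev.azuresynapse.net/livyApi/versions/"
    "2019-11-01-preview/sparkPools/{pool}/sessions"
)

def sessions_url(workspace: str, pool: str) -> str:
    """Build the Livy sessions endpoint URL for a workspace and Spark pool."""
    return LIVY_SESSIONS_URL.format(workspace=workspace, pool=pool)

def livy_is_responsive(url: str, token: str, timeout: float = 30) -> bool:
    """Return True if the Livy sessions endpoint answers HTTP 200 within timeout."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False
```

    Run this on a schedule (with a much shorter timeout than your application's) so a non-responsive Livy server is flagged before user sessions start timing out.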
               
    **Contact Azure Support:**

    - If such issues persist, raise a support ticket with Microsoft to investigate the root cause. Provide session IDs, timestamps, and error logs.

    Proactive Monitoring Setup

    **Azure Monitor:**

    - Set up alerts on the Synapse Spark pool's health metrics (e.g., node availability, CPU usage, memory usage).

    **Custom Logging:**

    - Integrate logs from failed sessions with tools such as Log Analytics to analyze trends and address potential issues proactively.
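    To illustrate the alert condition only (Azure Monitor evaluates this for you once configured; the threshold and window here are hypothetical), a rule like "average CPU above 80% over the last few samples" amounts to:

```python
from collections import deque

class ThresholdAlert:
    """Fires when the rolling average of a metric exceeds a threshold."""

    def __init__(self, threshold: float, window: int):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # keep only the last `window` samples

    def observe(self, value: float) -> bool:
        """Record one metric sample; return True if the alert should fire."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and sum(self.samples) / len(self.samples) > self.threshold
```

    Averaging over a window rather than alerting on a single sample avoids paging on one-off spikes while still catching sustained saturation.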

    If the problem persists, sharing the logs or specific Spark pool metrics during the failure window will help narrow down the issue further.

    Please feel free to click the 'Upvote' (Thumbs-up) button and 'Accept as Answer'. This helps the community by allowing others with similar queries to easily find the solution.

