Installing wheel packages on Spark pool - Synapse DEP-enabled workspaces

Abhiram Duvvuru 231 Reputation points Microsoft Employee
2024-07-05T23:13:00.32+00:00

Hi,

I have a Synapse workspace with DEP enabled, and since PyPI libraries cannot be installed directly in the Spark pool, I want to install azure-mgmt-kusto and azure-kusto-data on the pool. I tried the following methods but encountered the error below.

  1. Create a wheel file for the package: pip wheel --wheel-dir=./ azure-kusto-data
  2. Locate the wheel file on the disk
  3. Workspace Packages
     3.1 Upload the wheel file under Workspace packages (screenshot attached)
     3.2 Install the wheel package on the Spark pool (screenshot attached)
     3.3 Error:

ProxyLivyApiAsyncError

LibraryManagement - Spark Job for xxxxxxxxxxxxxx in workspace xxxxxxxxxxxxxx in subscription xxxxxxxxxxxxxx failed with status:

{"id":9,"appId":"application_xxxxxxxxxxxxxx_0001","appInfo":{"driverLogUrl":"https://web.azuresynapse.net/sparkui/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/api/v1/workspaces/xxxxxxxxxxxxxx/sparkpools/systemreservedpool-librarymanagement/sessions/9/applications/application_xxxxxxxxxxxxxx_0001/driverlog/stderr","sparkUiUrl":"https://web.azuresynapse.net/sparkui/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/workspaces/xxxxxxxxxxxxxx/sparkpools/systemreservedpool-librarymanagement/sessions/9/applications/application_xxxxxxxxxxxxxx_0001","isSessionTimedOut":null,"isStreamingQueryExists":"false","impulseErrorCode":"Spark_Ambiguous_NonJvmUserApp_ExitWithStatus1","impulseTsg":null,"impulseClassification":"Ambiguous"},"state":"dead","log":["ERROR: No matching distribution found for ijson~=3.1","","","CondaEnvException: Pip failed","","24/07/05 21:50:15 ERROR b"Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies. Conda may not use the correct pip to install your packages, and they may end up in the wrong place. Please add an explicit pip dependency. I'm adding one for you, but still nagging you.\nCollecting package metadata (repodata.json): ...working... done\nSolving environment: ...working... done\nPreparing transaction: ...working... done\nVerifying transaction: ...working... done\nExecuting transaction: ...working... done\nInstalling pip dependencies: ...working... Ran pip subprocess with arguments:\n['/home/trusted-service-user/cluster-env/clonedenv/bin/python', '-m', 'pip', 'install', '-U', '-r', '/usr/lib/library-manager/bin/lmjob/xxxxxxxxxxxxxx/condaenv.u3u2igfc.requirements.txt']\nPip subprocess output:\nProcessing ./wheels/azure_kusto_data-4.5.1-py2.py3-none-any.whl (from -r /usr/lib/library-manager/bin/lmjob/xxxxxxxxxxxxxx/condaenv.u3u2igfc.requirements.txt (line 1))\nRequirement already satisfied: python-dateutil>=2.8.0 in /home/trusted-service-user/cluster-env/clonedenv/lib/python3.10/site-packages (from azure-kusto-data==4.5.1->-r /usr/lib/library-manager/bin/lmjob/xxxxxxxxxxxxxx/condaenv.u3u2igfc.requirements.txt (line 1)) (2.9.0)\nRequirement already satisfied: requests>=2.13.0 in /home/trusted-service-user/cluster-env/clonedenv/lib/python3.10/site-packages (from azure-kusto-data==4.5.1->-r /usr/lib/library-manager/bin/lmjob/xxxxxxxxxxxxxx/condaenv.u3u2igfc.requirements.txt (line 1)) (2.31.0)\nRequirement already satisfied: azure-identity<2,>=1.5.0 in /home/trusted-service-user/cluster-env/clonedenv/lib/python3.10/site-packages (from azure-kusto-data==4.5.1->-r /usr/lib/library-manager/bin/lmjob/xxxxxxxxxxxxxx/condaenv.u3u2igfc.requirements.txt (line 1)) (1.15.0)\nRequirement already satisfied: msal<2,>=1.9.0 in /home/trusted-service-user/cluster-env/clonedenv/lib/python3.10/site-packages (from azure-kusto-data==4.5.1->-r /usr/lib/library-manager/bin/lmjob/xxxxxxxxxxxxxx/condaenv.u3u2igfc.requirements.txt (line 1)) (1.27.0)\nINFO: pip is looking at multiple versions of azure-kusto-data to determine which version is compatible with other requirements. This could take a while.\n\nfailed\n"","24/07/05 21:50:15 INFO Cleanup following folders and files from staging directory:","24/07/05 21:50:23 INFO Staging directory cleaned up successfully","24/07/05 21:50:23 INFO Waiting for parallel executions","24/07/05 21:50:23 INFO Closing down clientserver connection"],"registeredSources":null}

Note (from the Spark UI): there is no requirements.txt file on the Spark pool.

  4. Use Storage Account
     4.1 Uploaded the wheel file to the storage account
     4.2 Code:

         spark.conf.set(
             "spark.jars.packages",
             "wasbs://test@tests.blob.core.windows.net/azure_kusto_data-4.5.1-py2.py3-none-any.whl"
         )

     Error:

     AnalysisException                         Traceback (most recent call last)
     Cell In[5], line 1
     ----> 1 spark.conf.set(
           2     "spark.jars.packages",
           3     "wasbs://test@tests.blob.core.windows.net/azure_kusto_data-4.5.1-py2.py3-none-any.whl"
           4 )

     File /opt/spark/python/lib/pyspark.zip/pyspark/sql/conf.py:47, in RuntimeConfig.set(self, key, value)
          40 @since(2.0)
          41 def set(self, key: str, value: Union[str, int, bool]) -> None:
          42     """Sets the given Spark runtime configuration property.
          43
          44     .. versionchanged:: 3.4.0
          45         Supports Spark Connect.
          46     """
     ---> 47 self._jconf.set(key, value)

     File ~/cluster-env/env/lib/python3.10/site-packages/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
        1316 command = proto.CALL_COMMAND_NAME +\
        1317     self.command_header +\
        1318     args_command +\
        1319     proto.END_COMMAND_PART
        1321 answer = self.gateway_client.send_command(command)
     -> 1322 return_value = get_return_value(
        1323     answer, self.gateway_client, self.target_id, self.name)
        1325 for temp_arg in temp_args:
        1326     if hasattr(temp_arg, "_detach"):

     File /opt/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py:175, in capture_sql_exception.<locals>.deco(*a, **kw)
         171 converted = convert_exception(e.java_exception)
         172 if not isinstance(converted, UnknownException):
         173     # Hide where the exception came from that shows a non-Pythonic
         174     # JVM exception message.
     --> 175     raise converted from None
         176 else:
         177     raise

     AnalysisException: [CANNOT_MODIFY_CONFIG] Cannot modify the value of the Spark config: "spark.jars.packages".
     See also https://spark.apache.org/docs/latest/sql-migration-guide.html#ddl-statements.

Thanks,

Abhiram

Azure Synapse Analytics

2 answers

  1. Amira Bedhiafi 20,176 Reputation points
    2024-07-07T13:01:19.7866667+00:00

    You can follow two approaches:

    1st approach: Using Workspace Packages

    1. Create Wheel Files:
      • Run the following commands to create wheel files for the required packages:
             
             pip wheel --wheel-dir=./ azure-mgmt-kusto azure-kusto-data
             
        
    2. Upload Wheel Files:
      • Navigate to Synapse Studio.
      • Go to Manage > Workspace packages.
      • Click on Upload and select the wheel files created in the previous step.
    3. Assign the Packages to the Spark Pool:
      • Go to Manage > Apache Spark pools, open the pool's Packages settings, select the uploaded wheel files from the workspace packages list, and apply. A quick import check for once the packages are active is sketched below.
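
    Once the pool has picked up the packages, a minimal check in a notebook cell confirms the wheels resolved. This is only a sketch; it assumes nothing beyond the two packages named above being installed:

             # Hypothetical sanity check, run in a Synapse notebook cell on the pool
             from importlib.metadata import version

             import azure.kusto.data    # provided by azure-kusto-data
             import azure.mgmt.kusto    # provided by azure-mgmt-kusto

             print(version("azure-kusto-data"))   # should match the uploaded wheel, e.g. 4.5.1
             print(version("azure-mgmt-kusto"))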

    2nd approach: Using Storage Account

    1. Upload Wheel Files to Storage Account:
      • Upload the wheel files to a container in your Azure Storage Account.
      • Note the path to the uploaded files (for example wasbs://<container>@<storage_account>.blob.core.windows.net/<wheel_file>.whl).
    2. Use Spark Configuration to Install Packages:
      • Use the following Spark configuration settings in your notebook or job to reference the wheel files (a session-scoped alternative is sketched after this list):

             spark.conf.set("spark.jars", "wasbs://<container>@<storage_account>.blob.core.windows.net/azure_mgmt_kusto-<version>-py2.py3-none-any.whl,wasbs://<container>@<storage_account>.blob.core.windows.net/azure_kusto_data-<version>-py2.py3-none-any.whl")
             spark.conf.set("spark.pyspark.driver.python", "python3")
             spark.conf.set("spark.pyspark.python", "python3")

    3. Add Environment Variables (if required):
      • Set the environment variables for the Spark job to use the correct Python environment:

             spark.conf.set("spark.executorEnv.PYTHONPATH", "/path/to/wheel/files")
             spark.conf.set("spark.driverEnv.PYTHONPATH", "/path/to/wheel/files")

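    Note that spark.jars (like spark.jars.packages in the original error) typically cannot be changed with spark.conf.set once the session is running, so it may need to be supplied through the pool-level Apache Spark configuration or at session start instead. For a pure-Python wheel, one session-scoped alternative is SparkContext.addPyFile, which distributes the wheel from the linked storage and puts it on sys.path. This is only a sketch, not the documented Synapse method; the path is a placeholder, and on a DEP-enabled pool every transitive dependency wheel still has to be made available the same way:

             # Sketch: distribute a pure-Python wheel for this Spark session only.
             # The wasbs path is a placeholder for wherever the wheel was uploaded.
             wheel_path = "wasbs://<container>@<storage_account>.blob.core.windows.net/azure_kusto_data-<version>-py2.py3-none-any.whl"

             sc = spark.sparkContext
             sc.addPyFile(wheel_path)   # ships the wheel to driver and executors and adds it to sys.path

             import azure.kusto.data    # importable in this session once the wheel is on sys.path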

  2. PRADEEPCHEEKATLA-MSFT 85,746 Reputation points Microsoft Employee
    2024-07-11T07:56:36.3066667+00:00

    @Abhiram Duvvuru - Thanks for the question and using MS Q&A platform.

    Firstly, I see that you have tried to upload the wheel file under Workspace Packages and install it on the Spark pool, but encountered an error. The error message suggests that there is no matching distribution found for ijson~=3.1. This could be because the required version of ijson is not installed on the Spark pool. You can try adding ijson~=3.1 to the requirements.txt file and uploading it along with the wheel file under Workspace Packages.
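
    On a machine that already has these packages installed (for example, where the wheels were built), the standard library can list each package's declared dependencies, which shows which additional wheels (such as ijson) would need to be made available as well, since a DEP-enabled pool cannot pull them from PyPI. A minimal sketch:

             # Sketch: enumerate the direct dependencies declared by each package,
             # so a wheel for each one can be built and uploaded alongside it.
             from importlib.metadata import requires

             for dist in ("azure-kusto-data", "azure-mgmt-kusto"):
                 print(dist)
                 for req in requires(dist) or []:
                     print("   ", req)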

    Alternatively, you can try uploading the wheel file to the Azure Data Lake Storage account that is linked with the Synapse workspace and then referencing it in the Spark pool configuration. However, I see that you encountered an error while setting the spark.jars.packages configuration. The error message suggests that you cannot modify the value of the Spark config spark.jars.packages. This could be because the configuration is read-only.

    To install custom libraries using the Azure Data Lake Storage method, you must have the Storage Blob Data Contributor permission on the primary Gen2 storage account that is linked to the Azure Synapse Analytics workspace. Please make sure that you have the required permissions and try setting the configuration again.

    For more details, refer to Manage dependencies for DEP-enabled Azure Synapse Spark pools

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for "Was this answer helpful". And if you have any further queries, do let us know.
