Using specific packages in Synapse spark pool...

Matty-2070 11 Reputation points
2022-10-07T11:30:44.98+00:00

Hi Team,

Some of the default packages present in a Synapse spark pool 'out of the box' are not up to date enough for my needs, so I want to be able to download specific versions.

I have:

  1. Created a new spark pool
  2. Downloaded the specific *.whl file I want to use from pypi.org
  3. Uploaded it to the 'workspace packages' area of Synapse
  4. Attempted to install the package to the new spark pool via the 'select from workspace packages' section

But every attempt I have made to do this has resulted in a 'failed to apply settings' error, citing a message similar to:

 ...  
 INFO Running /usr/lib/miniforge3/bin/conda env update -p /home/trusted-service-user/cluster-env/clonedenv --file /usr/lib/library-manager/bin/lmjob/sparkpoolcustom/package_cleaned_environment.yml","Pip subprocess error:","ERROR: numpy-1.23.3-cp38-cp38-win32.whl is not a supported wheel on this platform.","","","CondaEnvException: Pip failed","","22/10/07 11:06:25 ERROR b\"Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies. Conda may not use the correct pip to install your packages, and they may end up in the wrong place. Please add an explicit pip dependency. I'm adding one for you, but still nagging you.\\nCollecting package metadata (repodata.json): ...working... done\\nSolving environment: ...working... done\\nPreparing transaction: ...working... done\\nVerifying transaction: ...working... done\\nExecuting transaction: ...working... done\\nInstalling pip dependencies: ...working... Ran pip subprocess with arguments:\\n['/home/trusted-service-user/cluster-env/clonedenv/bin/python', '-m', 'pip', 'install', '-U', '-r', '/usr/lib/library-manager/bin/lmjob/sparkpoolcustom/condaenv.t1h1b1b3.requirements.txt']\\nPip subprocess output:\\n\\nfailed\\n\"","22/10/07 11:06:25 INFO Cleanup following folders and files from staging directory:","22/10/07 11:06:29 INFO Staging directory cleaned up successfully"],"registeredSources":null}  

It seems as though the version of Python running on Synapse is 3.8.10, hence I downloaded the cp38 version of the package, but no joy. Can anyone shed any light on what might be wrong?
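For reference, here is a quick way to see which wheel tags a pool actually accepts. This is just a sketch to run in a notebook cell, assuming the packaging module that ships with pip is importable there:

    # List the wheel tags this interpreter/platform accepts, most specific first.
    # Assumption: 'packaging' (bundled with pip) is importable in the pool.
    from packaging.tags import sys_tags

    for tag in list(sys_tags())[:10]:
        print(tag)  # e.g. cp38-cp38-manylinux_2_17_x86_64 on a Linux pool

If none of the printed tags matches the tags in the wheel's filename (cp38-cp38-win32 in my case), pip will refuse to install it.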

Thanks,

Matty

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

3 answers

  1. Matty-2070 11 Reputation points
    2022-10-07T13:42:27.497+00:00

    In addition to the above, I have also tried adding a package via the 'Requirements files' option, using a requirements.txt file. This also failed, with the following message:

    ProxyLivyApiAsyncError  
    LibraryManagement - Spark Job for sparkpoolcustom in workspace **** in subscription **** failed with status:  
    {"id":9,"appId":"application_****","appInfo":{"driverLogUrl":"http://vm-****/node/containerlogs/container_****/trusted-service-user","sparkUiUrl":"http://vm-****/proxy/application_****/","isSessionTimedOut":null,"isStreamingQueryExists":"false","impulseErrorCode":null,"impulseTsg":null,"impulseClassification":null},"state":"dead","log":["Elapsed: -","","An HTTP error occurred when trying to retrieve this URL.","HTTP errors are often intermittent, and a simple retry will get you on your way.","'https://conda.anaconda.org/conda-forge/linux-64'","","","22/10/07 13:35:00 ERROR b\"Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies. Conda may not use the correct pip to install your packages, and they may end up in the wrong place. Please add an explicit pip dependency. I'm adding one for you, but still nagging you.\\nCollecting package metadata (repodata.json): ...working... failed\\n\"","22/10/07 13:35:00 INFO Cleanup following folders and files from staging directory:","22/10/07 13:35:04 INFO Staging directory cleaned up successfully"],"registeredSources":null}  
    

    This is incredibly frustrating, especially given that it takes about 10 minutes for the spark pool to respond each time you try!

    Cheers,

    Matty


  2. PRADEEPCHEEKATLA 91,656 Reputation points Moderator
    2022-10-10T09:08:01.073+00:00

    Hello @Matty-2070 ,

    Thanks for the question and using MS Q&A platform.

    Could you please share the content of the requirements.txt?

    As per the error message "ERROR: numpy-1.23.3-cp38-cp38-win32.whl is not a supported wheel on this platform" - the win32 in the filename means this is a Windows build of the wheel, but Synapse Spark pools run on Linux, so pip rejects it. You would need a Linux (manylinux) build of the wheel instead.
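    If you specifically need that exact version as a wheel, one option is to ask pip for a Linux build explicitly. A minimal sketch, assuming you want numpy 1.23.3 for CPython 3.8 on 64-bit Linux (adjust the package and tags for your pool):

        # Sketch: download a manylinux wheel from any OS by passing pip an
        # explicit target platform. Hypothetical target: numpy 1.23.3 for
        # CPython 3.8 on 64-bit Linux.
        import subprocess
        import sys

        subprocess.run(
            [
                sys.executable, "-m", "pip", "download", "numpy==1.23.3",
                "--only-binary=:all:",               # required when --platform is set
                "--platform", "manylinux2014_x86_64",
                "--python-version", "38",
                "--implementation", "cp",
                "-d", "wheels",                      # local download directory
            ],
            check=True,
        )

    The resulting manylinux wheel could then be uploaded to workspace packages in place of the win32 one.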

    From my repro, I'm able to successfully install the numpy package using a requirements.txt, as shown below:

    [Screenshot: the requirements.txt used in the repro]
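    (The screenshot's exact contents aren't preserved here; a minimal requirements.txt for this scenario just pins the package, one requirement per line, for example:)

        numpy==1.23.3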

    The above requirements.txt installed successfully on the Apache Spark pool:

    [Screenshot: the requirements.txt packages installed successfully on the Apache Spark pool]

    Check out the numpy package update from the previous version, as shown below:

    [Screenshot: numpy updated from the previously installed version]

    Hope this will help. Please let us know if any further queries.


  3. Matty-2070 11 Reputation points
    2022-10-11T08:33:25.623+00:00

    Hi,

    I tried installing to a completely new Spark pool this morning, but it failed again. Here's the error:

    Error details  
    ProxyLivyApiAsyncError  
    LibraryManagement - Spark Job for sparkpooltest in workspace **** in subscription **** failed with status:  
    {"id":18,"appId":"application_****","appInfo":{"driverLogUrl":"http://vm-****/node/containerlogs/container_****/trusted-service-user","sparkUiUrl":"http://vm-****/proxy/application_****/","isSessionTimedOut":null,"isStreamingQueryExists":"false","impulseErrorCode":null,"impulseTsg":null,"impulseClassification":null},"state":"dead","log":["Elapsed: -","","An HTTP error occurred when trying to retrieve this URL.","HTTP errors are often intermittent, and a simple retry will get you on your way.","'https://conda.anaconda.org/conda-forge/linux-64'","","","22/10/11 08:16:58 ERROR b\"Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies. Conda may not use the correct pip to install your packages, and they may end up in the wrong place. Please add an explicit pip dependency. I'm adding one for you, but still nagging you.\\nCollecting package metadata (repodata.json): ...working... failed\\n\"","22/10/11 08:16:58 INFO Cleanup following folders and files from staging directory:","22/10/11 08:17:01 INFO Staging directory cleaned up successfully"],"registeredSources":null}  
    

    I am now wondering whether the issue is linked to how Azure has been configured within our corporate environment, given that the same steps work fine from my personal Azure account, and the log shows an HTTP error reaching https://conda.anaconda.org/conda-forge/linux-64 (i.e. the pool may not be able to reach conda-forge at all). I wouldn't know what to check beyond a basic connectivity probe like the one sketched below, though. Any ideas?
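    A sketch of that probe, runnable from a notebook cell using only the Python standard library (the conda-forge URL is the one failing in the error log; pypi.org is included for contrast):

        # Probe outbound HTTPS from a Synapse notebook cell.
        from urllib.request import urlopen

        for url in (
            "https://conda.anaconda.org/conda-forge/linux-64/repodata.json",
            "https://pypi.org/simple/",
        ):
            try:
                with urlopen(url, timeout=10) as resp:
                    print(url, "->", resp.status)
            except OSError as exc:  # covers URLError, DNS failures, timeouts
                print(url, "-> FAILED:", exc)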

    Cheers,

    Matty

