Can't access files and python libraries on Azure Machine Learning user's workspace with Serverless spark pool or synapse spark pool

Question

Nithin Gowrav 0

I have a set of utility modules and some config files in my workspace that I could access when my notebook used a personal compute. When I use serverless Spark compute or a Synapse Spark pool as the compute, I can no longer access them. I followed the steps in this link, but they did not work: https://learn.microsoft.com/en-us/azure/machine-learning/interactive-data-wrangling-with-apache-spark-azure-ml?view=azureml-api-2#accessing-data-on-the-default-file-share

Nithin Gowrav 0 Reputation points

2024-01-04T14:50:48.51+00:00

Thanks for the reply @Konstantinos Passadis

I understand that we can add Python libraries at different levels, as you suggested, but my question is about accessing files in the Azure ML file share, where we might keep data files or custom Python scripts that we want to reference from notebooks connected to Synapse or serverless Spark pools.
VasaviLankipalle-MSFT 18,676 Reputation points Moderator

2024-01-04T22:24:28.9833333+00:00
Hello @Nithin Gowrav, thanks for using the Microsoft Q&A platform.

We received your feedback that the answer provided on the thread was not helpful.

As mentioned in the documentation, the default file share is mounted to both serverless Spark compute and attached Synapse Spark pools.

Have you tried the sample code snippet for accessing a file stored on the default file share? Are you receiving any errors? https://learn.microsoft.com/en-us/azure/machine-learning/interactive-data-wrangling-with-apache-spark-azure-ml?view=azureml-api-2#accessing-data-on-the-default-file-share

    import os
    import pyspark.pandas as pd
    from pyspark.ml.feature import Imputer

    abspath = os.path.abspath(".")
    file = "file://" + abspath + "/Users/<USER>/data/titanic.csv"
    print(file)
    df = pd.read_csv(file, index_col="PassengerId")
    imputer = Imputer(
        inputCols=["Age"],
        outputCol="Age").setStrategy("mean")  # Replace missing values in Age column with the mean value
    df.fillna(value={"Cabin" : "None"}, inplace=True)  # Fill Cabin column with value "None" if missing
    df.dropna(inplace=True)  # Drop the rows which still have any missing value
    output_path = "file://" + abspath + "/Users/<USER>/data/wrangled"
    df.to_csv(output_path, index_col="PassengerId")

May I know which datastore type you are working with, and whether you have checked that the appropriate roles are assigned?
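Before reading the file, it can help to confirm what the Spark session's driver actually sees under its working directory. The following is a minimal troubleshooting sketch and not part of the documentation sample: `describe_path` is a hypothetical helper, the `Users/<USER>/data/titanic.csv` path is the placeholder from the sample above, and only the Python standard library is used.

```python
import os


def describe_path(relative_path, base="."):
    """Check a path under `base`; if it is missing, find the deepest ancestor that exists.

    Hypothetical troubleshooting helper, not an Azure Machine Learning API.
    """
    target = os.path.normpath(os.path.join(os.path.abspath(base), relative_path))
    # Walk up from the target until we reach something that exists,
    # so the result shows where the path stops resolving.
    nearest = target
    while not os.path.exists(nearest):
        parent = os.path.dirname(nearest)
        if parent == nearest:  # reached the filesystem root
            break
        nearest = parent
    try:
        entries = sorted(os.listdir(nearest)) if os.path.isdir(nearest) else []
    except OSError:
        entries = []  # directory exists but cannot be listed (for example, permissions)
    return {
        "target": target,
        "exists": os.path.exists(target),
        "nearest_existing": nearest,
        "entries": entries,
    }


if __name__ == "__main__":
    # Placeholder path copied from the sample above; replace <USER> with your folder name.
    info = describe_path("Users/<USER>/data/titanic.csv")
    for key, value in info.items():
        print(f"{key}: {value}")
```

If `exists` is false, `nearest_existing` and `entries` show how far the path resolves, which may indicate whether the file share is visible from the session at all or whether only the folder name is wrong.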
VasaviLankipalle-MSFT 18,676 Reputation points Moderator

2024-01-08T23:48:49.5+00:00

@Nithin Gowrav , did you get a chance to check the response?
Nithin Gowrav 0 Reputation points

2024-01-08T23:56:11.9+00:00

Hi, thank you all for the responses. I have tried all the options and reviewed the permissions. Everything looks fine, but there is still no resolution. I have now engaged Microsoft SMEs on this and will keep this thread updated.

2 answers

Answer 1

Hello @Nithin Gowrav !

Welcome to Microsoft QnA!

To access your utility modules and configs in your workspace when using serverless Spark compute or a Synapse Spark pool as compute, you can manage pool-level libraries for Apache Spark. You can install libraries into a Spark pool or remove them from it, and installed libraries are available to all notebooks and jobs running on the pool. There are two primary ways to install a library on a Spark pool:

Install a workspace library that has been uploaded as a workspace package.

For updating Python libraries, provide a requirements.txt or Conda environment.yml environment specification to install packages from repositories like PyPI, Conda-Forge, and more.

You can read more about managing Spark pool level libraries for Apache Spark in Azure Synapse Analytics in this link.

https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-manage-pool-packages
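After a pool-level install, one way to confirm that a notebook session actually sees the packages is to query the installed distributions from a cell. This is a minimal sketch using the standard library's `importlib.metadata`; `installed_versions` is a hypothetical helper, and the package names in the usage lines are placeholders rather than packages mentioned in this thread.

```python
from importlib import metadata


def installed_versions(package_names):
    """Map each distribution name to its installed version, or None if it is not installed."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Placeholder names; list the packages from your own requirements.txt here.
    for name, version in installed_versions(["pandas", "my-internal-utils"]).items():
        print(f"{name}: {version if version is not None else 'not installed'}")
```

A `None` entry means the running session cannot find that distribution, which would suggest the pool-level install has not reached it.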

I hope this helps!

The answer or portions of it may have been assisted by AI. Source: Microsoft Copilot

Kindly mark the answer as Accepted and Upvote in case it helped!

Regards

Answer 2

Hello @Nithin Gowrav !

Can you kindly confirm that you have read the following?

https://learn.microsoft.com/en-us/azure/machine-learning/interactive-data-wrangling-with-apache-spark-azure-ml?view=azureml-api-2

If you are trying to access the default file share, you can do it with the code provided by @VasaviLankipalle-MSFT.
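The question also mentions custom Python scripts kept on the file share. If the folder that holds them is visible from the session's working directory, one possible approach is to put that folder on `sys.path` before importing. This is a sketch under that assumption, not something confirmed in this thread: `import_from_dir` is a hypothetical helper, and the change only affects the Python process where it runs (the driver), not Spark executors.

```python
import importlib
import os
import sys


def import_from_dir(module_name, directory):
    """Import `module_name` from `directory` by putting the directory on sys.path.

    Hypothetical helper; raises FileNotFoundError if the directory is not visible.
    """
    directory = os.path.abspath(directory)
    if not os.path.isdir(directory):
        raise FileNotFoundError(f"Directory not visible to this session: {directory}")
    if directory not in sys.path:
        sys.path.insert(0, directory)
    # Make sure the import system notices files added after the interpreter started.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


# Usage (module and folder names are placeholders):
# my_utils = import_from_dir("my_utils", "Users/<USER>/utils")
```

The explicit `FileNotFoundError` separates "the folder is not mounted or the path is wrong" from an ordinary import error inside the module.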

Otherwise, for Azure Storage:

The Azure Machine Learning datastores can access data using Azure storage account credentials

access key
SAS token
service principal

or provide credential-less data access. Depending on the datastore type and the underlying Azure storage account type, select an appropriate authentication mechanism to ensure data access. This table summarizes the authentication mechanisms to access data in the Azure Machine Learning datastores:

| Storage account type | Credential-less data access | Data access mechanism | Role assignments |
|---|---|---|---|
| Azure Blob | No | Access key or SAS token | No role assignments needed |
| Azure Blob | Yes | User identity passthrough* | User identity should have appropriate role assignments in the Azure Blob storage account |
| Azure Data Lake Storage (ADLS) Gen 2 | No | Service principal | Service principal should have appropriate role assignments in the Azure Data Lake Storage (ADLS) Gen 2 storage account |
| Azure Data Lake Storage (ADLS) Gen 2 | Yes | User identity passthrough | User identity should have appropriate role assignments in the Azure Data Lake Storage (ADLS) Gen 2 storage account |

Is there an error message you can share?

I hope this helps!

Kindly mark the answer as Accepted and Upvote in case it helped!

Regards
