What packages are needed to use an azureml:// URI with pandas?

Anonymous
2023-04-24T17:47:42.6233333+00:00

I am running a pipeline using the Azure ML Python SDK v2. For one of the pipeline steps, a .csv file in blob storage is passed as input using InputOutputModes.DIRECT. As I understand it, this means the pipeline step receives a URI filepath of the form azureml://[blah]. Within the pipeline step, I call pandas.read_csv() on that input, but I get the error "protocol not known: azureml". The same call works in a notebook using the Python 3.10 - SDK v2 kernel. So, my question is: what packages need to be in the pipeline step's environment so that pandas.read_csv() can be called with the azureml:// URI filepath? I've tried many different things; the most recent environment I tried is below, followed by a rough sketch of how the input is wired up. Any help is appreciated.

name: prs-env
channels:
  - conda-forge
dependencies:
  - python=3.7.6
  - pip
  - pip:
      - matplotlib~=3.5.0
      - psutil~=5.8.0
      - tqdm~=4.62.0
      - pandas~=1.3.0
      - scipy~=1.7.0
      - numpy~=1.21.0
      - ipykernel~=6.0
      - azureml-core==1.48.0
      - azureml-defaults==1.48.0
      - azureml-mlflow==1.48.0
      - azureml-telemetry==1.48.0
      - scikit-learn~=1.0.0
      - debugpy~=1.6.3
      - usaddress
      - fsspec
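
For context, the input is wired up roughly like this (a rough sketch; the datastore and path names are placeholders, not the real ones):

from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes

# the .csv in blob storage is passed to the step in DIRECT mode, so the step
# receives the azureml:// URI string itself rather than a mounted or downloaded copy
step_input = Input(
    type=AssetTypes.URI_FILE,
    path="azureml://datastores/<datastore_name>/paths/<folder>/<myfile.csv>",
    mode=InputOutputModes.DIRECT,
)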

Accepted answer
  1. Jimmy Briggs 101 Reputation points
    2023-04-24T19:21:54.19+00:00

    Use the azureml-fsspec package:

    pip install azureml-fsspec
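
    Since the step's environment is built from a conda YAML, that just means adding it (alongside fsspec, which you already have) to the pip section, for example:

    name: prs-env
    channels:
      - conda-forge
    dependencies:
      - python=3.7.6
      - pip
      - pip:
          - pandas~=1.3.0
          # ...the rest of your existing pip dependencies...
          - fsspec
          - azureml-fsspec

    The reason this is the missing piece: pandas.read_csv() hands a URL with an unfamiliar protocol off to fsspec, and azureml-fsspec is what registers the azureml:// protocol with fsspec. Without it, fsspec raises "protocol not known: azureml", which matches the error you're seeing; the Python 3.10 - SDK v2 notebook kernel presumably works because it already ships with this package.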
    

    Note: The accepted format for a datastore URI is: azureml://subscriptions/([^/]+)/resourcegroups/([^/]+)/workspaces/([^/]+)/datastores/([^/]+)/paths/([^/]+)

    This should technically work:

    import azureml.fsspec  # the pip package is azureml-fsspec, but it is imported as azureml.fsspec
    import pandas as pd
    
    # credentials and variables
    subscription = '<subscription_id>'
    resource_group = '<resource_group>'
    workspace = '<workspace>'
    datastore_name = '<datastore>'
    path_on_datastore = '<path>'
    file = '<myfile.csv>'
    
    # generate uri:
    uri = f'azureml://subscriptions/{subscription}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/{datastore_name}/paths/{path_on_datastore}/{file}'
    
    # read via pandas
    df = pd.read_csv(uri)
    

    See Azure Machine Learning - Access Data from Azure Cloud Storage During Interactive Development for details.

    Or you could try the AzureMachineLearningFileSystem class from the same package:

    import pandas
    from azureml.fsspec import AzureMachineLearningFileSystem
    
    # instantiate the filesystem for the datastore using its azureml:// URI
    fs = AzureMachineLearningFileSystem('azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>')
    
    fs.ls()  # list folders/files at the root of the datastore
    
    # use an open context so the file handle is closed afterwards
    with fs.open('./folder1/file1.csv') as f:
        # read the file into a DataFrame
        df = pandas.read_csv(f)
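
    For example, the same filesystem object can be used to read several files at once (a sketch; 'folder1' and the placeholder IDs are assumptions about your layout):

    import pandas
    from azureml.fsspec import AzureMachineLearningFileSystem

    fs = AzureMachineLearningFileSystem('azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>')

    # glob the matching .csv paths on the datastore, open each one, and
    # concatenate the resulting DataFrames
    frames = []
    for path in fs.glob('folder1/*.csv'):
        with fs.open(path) as f:
            frames.append(pandas.read_csv(f))

    df = pandas.concat(frames, ignore_index=True)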
    
