What packages are needed to use an Azure URI with pandas

Dance, Cody R. (ALT) 20 Reputation points
2023-04-24T17:47:42.6233333+00:00

I am running a pipeline using the Azure ML Python SDK v2. For one of the pipeline steps, a .csv file in blob storage is passed as input using InputOutputModes.DIRECT. As I understand it, this means the pipeline step receives a URI file path of the form azureml://[blah]. Within the pipeline step, I call pandas.read_csv() on the input but receive the error "protocol not known : azureml". The same function call works in a notebook using the Python 3.10 - SDK v2 kernel. So my question is: what packages need to be in the pipeline step's environment in order to call pandas.read_csv() with the URI file path? I've tried many different things; the most recent environment I tried is below. Any help is appreciated.

name: prs-env
channels:
  - conda-forge
dependencies:
  - python=3.7.6
  - pip
  - pip:
      - matplotlib~=3.5.0
      - psutil~=5.8.0
      - tqdm~=4.62.0
      - pandas~=1.3.0
      - scipy~=1.7.0
      - numpy~=1.21.0
      - ipykernel~=6.0
      - azureml-core==1.48.0
      - azureml-defaults==1.48.0
      - azureml-mlflow==1.48.0
      - azureml-telemetry==1.48.0
      - scikit-learn~=1.0.0
      - debugpy~=1.6.3
      - usaddress
      - fsspec

Accepted answer
  1. Jimmy Briggs 101 Reputation points
    2023-04-24T19:21:54.19+00:00

    Use the azureml-fsspec package:

    pip install azureml-fsspec
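After installing, a quick check inside the pipeline step's environment can confirm whether fsspec is able to resolve the azureml protocol at all. The sketch below is only a diagnostic: the helper name protocol_available is made up for illustration, and it assumes that an unregistered protocol makes fsspec.get_filesystem_class raise ValueError (the source of the "protocol not known" message) and that a registered protocol with a missing dependency raises ImportError.

```python
def protocol_available(protocol: str) -> bool:
    """Return True if fsspec can resolve `protocol` in this environment.

    Hypothetical helper for illustration; not part of azureml-fsspec.
    """
    try:
        import fsspec  # pandas relies on fsspec for non-local URLs
    except ImportError:
        # fsspec itself is missing from the environment
        return False
    try:
        # Raises ValueError for an unknown protocol, or ImportError when the
        # protocol is known but its backing package is not installed.
        fsspec.get_filesystem_class(protocol)
    except (ValueError, ImportError):
        return False
    return True


if __name__ == "__main__":
    print("azureml protocol available:", protocol_available("azureml"))
```

If this prints False in the pipeline step but True in the notebook kernel, the two environments differ in which fsspec implementations are installed.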
    

    Note: The accepted URI format for the datastore URI is: azureml://subscriptions/([^/]+)/resourcegroups/([^/]+)/workspaces/([^/]+)/datastores/([^/]+)/paths/([^/]+)

    This should technically work:

    # No explicit import of azureml-fsspec is needed (the hyphenated name is
    # not a valid Python module name anyway); installing the package is what
    # lets pandas resolve azureml:// URIs through fsspec.
    import pandas as pd
    
    # credentials and variables
    subscription = '<subscription_id>'
    resource_group = '<resource_group>'
    workspace = '<workspace>'
    datastore_name = '<datastore>'
    path_on_datastore = '<path>'
    file = '<myfile.csv>'
    
    # generate uri:
    uri = f'azureml://subscriptions/{subscription}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/{datastore_name}/paths/{path_on_datastore}/{file}'
    
    # read via pandas
    df = pd.read_csv(uri)
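
Since a malformed URI is an easy mistake to make here, it may help to build and validate the URI in one place before handing it to pandas. The following is a minimal sketch using only the standard library. The function name build_datastore_uri is hypothetical, and the pattern is adapted from the format in the note above, with the assumption that the part after paths/ may contain several segments (as it does when a folder and a file name are joined).

```python
import re

# Adapted from the datastore URI format in the note above; the final group is
# relaxed to allow nested folders (an assumption, not the official pattern).
_URI_PATTERN = re.compile(
    r"^azureml://subscriptions/([^/]+)/resourcegroups/([^/]+)"
    r"/workspaces/([^/]+)/datastores/([^/]+)/paths/(.+)$"
)


def build_datastore_uri(subscription, resource_group, workspace, datastore, path):
    """Assemble an azureml:// datastore URI and check its overall shape."""
    path = path.lstrip("/")  # avoid a double slash after "paths/"
    uri = (
        f"azureml://subscriptions/{subscription}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace}"
        f"/datastores/{datastore}"
        f"/paths/{path}"
    )
    if not _URI_PATTERN.match(uri):
        raise ValueError(f"Not a well-formed datastore URI: {uri!r}")
    return uri


if __name__ == "__main__":
    # Placeholder values; substitute real identifiers before use.
    print(build_datastore_uri("<subscription_id>", "<resource_group>",
                              "<workspace>", "<datastore>", "<path>/<myfile.csv>"))
```

An empty or slash-containing component fails the check and raises ValueError instead of producing a URI that only fails later inside read_csv.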
    

    See Azure Machine Learning - Access Data from Azure Cloud Storage During Interactive Development for details.

    Alternatively, you could use the AzureMachineLearningFileSystem class from the package:

    import pandas
    from azureml.fsspec import AzureMachineLearningFileSystem
    
    # instantiate file system using following URI
    fs = AzureMachineLearningFileSystem('azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/datastorename')
    
    fs.ls() # list folders/files in datastore 'datastorename'
    
    # use an open context
    with fs.open('./folder1/file1.csv') as f:
        # do some process
        df = pandas.read_csv(f)
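
In a pipeline step the input usually arrives as one full file URI rather than as a separate datastore URI and relative path. One way to feed that into the file-system approach is to split the URI at /paths/. This is a sketch under assumptions: split_datastore_uri is a hypothetical helper, and the commented-out calls at the end simply mirror the usage shown in the example above rather than anything verified here.

```python
from typing import Tuple


def split_datastore_uri(uri: str) -> Tuple[str, str]:
    """Split a full azureml:// file URI into (datastore URI, relative path).

    Hypothetical helper for illustration; not part of azureml-fsspec.
    """
    marker = "/paths/"
    root, sep, relative = uri.partition(marker)
    if not uri.startswith("azureml://") or not sep or not relative:
        raise ValueError(f"Expected an azureml:// URI containing {marker!r}: {uri!r}")
    return root, relative


if __name__ == "__main__":
    full_uri = (
        "azureml://subscriptions/<subid>/resourcegroups/<rgname>"
        "/workspaces/<workspace_name>/datastores/datastorename"
        "/paths/folder1/file1.csv"
    )
    root, relative = split_datastore_uri(full_uri)
    print(root)      # datastore-level URI for the file system constructor
    print(relative)  # path to open within that datastore
    # Assumed usage, mirroring the example above:
    # fs = AzureMachineLearningFileSystem(root)
    # with fs.open(relative) as f:
    #     df = pandas.read_csv(f)
```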
    
