Read a TabularDataset in AML SDK v2

Jorge Lopez 36

When I try to use a TabularDataset (v1) as a job input in the new SDK v2, it raises an error. I have followed the instructions in the documentation.

My TabularDataset is loaded from a Blob Storage container with thousands of .csv files, and it still works with the SDK v1.

Here is my code:

from azure.ai.ml import command, Input, Output
from azure.ai.ml.constants import AssetTypes, InputOutputModes

data_asset = ml_client.data.get(name="tabular_dataset", version=1)

train_job = command(
    code='./',
    inputs = {"data": Input(type = AssetTypes.MLTABLE, path = data_asset, mode = InputOutputModes.DIRECT)},
    outputs = {"output_folder": Output(type=AssetTypes.MLFLOW_MODEL)},
    command = 'python conv_sdk_v2.py --data ${{inputs.data}} --output_folder ${{outputs.output_folder}}',
    environment = f'{experiment_env.name}:{experiment_env.version}',
    compute = compute_name_training,
    experiment_name='sdk-v2-experiment-train',
)
train_job_output = ml_client.create_or_update(train_job)

And the exception raised:

Exception: 


1) One or more fields are invalid

Details: 

(x) Could not parse creation_context:
  created_at: '2023-01-23T08:39:16.388014+00:00'
  created_by: ****
  created_by_type: User
  last_modified_at: '2023-01-23T08:39:16.388014+00:00'
  last_modified_by: ****
  last_modified_by_type: User
id: /subscriptions/****/resourceGroups/****/providers/Microsoft.MachineLearningServices/workspaces/****/data/tabular_dataset/versions/1
name: tabular_dataset
path: azureml://subscriptions/****/resourcegroups/****/workspaces/****/datastores/datastore_name/paths/*.csv/
properties:
  v1_type: tabular
tags: {}
type: mltable
version: '1'
. If providing an ARM id, it should start with a '/'.

Resolutions: 
1) Double-check that all specified parameters are of the correct types and formats prescribed by the ArmResource schema.
If using the CLI, you can also check the full log in debug mode for more details by adding --debug to the end of your command

Additional Resources: The easiest way to author a yaml specification file is using IntelliSense and auto-completion Azure ML VS code extension provides: https://code.visualstudio.com/docs/datascience/azure-machine-learning. To set up VS Code, visit https://docs.microsoft.com/azure/machine-learning/how-to-setup-vs-code

Alternatively, I have tried to create a proper MLTable (SDK v2 native) from the Blob Storage container, but I haven't managed to (there is a lack of documentation and examples about it).
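For readers in the same position: an MLTable over a folder of CSV files is described by a YAML file named MLTable. The following is a minimal sketch, not an official recipe, that writes such a file using only the standard library. The function name `write_mltable_file`, the example datastore path in the usage note, and the chosen `read_delimited` options are assumptions for illustration and need to be adjusted to the actual data.

```python
from pathlib import Path

# Template for the MLTable definition file. The read_delimited options
# below are assumptions about the CSV layout, not values from this thread.
MLTABLE_TEMPLATE = """\
type: mltable
paths:
  - pattern: {pattern}
transformations:
  - read_delimited:
      delimiter: ','
      encoding: utf8
      header: all_files_same_headers
"""


def write_mltable_file(folder: str, pattern: str) -> Path:
    """Write a file named 'MLTable' into `folder` that globs `pattern`."""
    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "MLTable"
    target.write_text(MLTABLE_TEMPLATE.format(pattern=pattern), encoding="utf-8")
    return target
```

For example, `write_mltable_file("./my_mltable", "azureml://datastores/<data_store_name>/paths/<folder>/*.csv")` produces a folder that could then be passed to `mltable.load()` or registered as a data asset of type mltable. Whether this fits a given datastore layout was not tested in this thread.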

edit: typo

santoshkc 6,955 Microsoft Vendor

Hi @Jorge Lopez,

Thank you for reaching out to Microsoft Q&A forum!

As you mentioned, you are encountering errors when trying to load a v1 TabularDataset as an MLTable with the SDK v2, despite following the instructions in the documentation.

There are some changes between v1 and v2, and the code you shared does not tell us much about your Python script conv_sdk_v2.py. I tried the code below, taken from the documentation, and it works:

from azure.ai.ml import command, Input, MLClient, UserIdentityConfiguration, ManagedIdentityConfiguration
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.identity import DefaultAzureCredential

# Set your subscription, resource group and workspace name:
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# ==============================================================
# Set the URI path for the data. Supported paths include:
# local: ./<path>
# Blob: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# ADLS: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
# Data Asset: azureml:<my_data>:<version>
# We set the path to a file on a public blob container
# ==============================================================
path = "wasbs://data@azuremlexampledata.blob.core.windows.net/titanic.csv"

# ==============================================================
# What type of data does the path point to? Options include:
# data_type = AssetTypes.URI_FILE # a specific file
# data_type = AssetTypes.URI_FOLDER # a folder
# data_type = AssetTypes.MLTABLE # an mltable
# The path we set above is a specific file
# ==============================================================
data_type = AssetTypes.URI_FILE

# ==============================================================
# Set the mode. The popular modes include:
# mode = InputOutputModes.RO_MOUNT # Read-only mount on the compute target
# mode = InputOutputModes.DOWNLOAD # Download the data to the compute target
# ==============================================================
mode = InputOutputModes.RO_MOUNT

# ==============================================================
# You can set the identity you want to use in a job to access the data. Options include:
# identity = UserIdentityConfiguration() # Use the user's identity
# identity = ManagedIdentityConfiguration() # Use the compute target managed identity
# ==============================================================
# This example accesses public data, so we don't need an identity.
# You can also set identity to None if you use a credential-based datastore
identity = None

# Set the input for the job:
inputs = {
    "input_data": Input(type=data_type, path=path, mode=mode)
}

# This command job uses the head Linux command to print the first 10 lines of the file
job = command(
    command="head ${{inputs.input_data}}",
    inputs=inputs,
    environment="azureml://registries/azureml/environments/sklearn-1.1/versions/4",
    compute="cpu-cluster",
    identity=identity,
)

# Submit the command
ml_client.jobs.create_or_update(job)

Looking at your code, the compute name should be in double quotes: compute = "compute_name_training". Please try the code above and also debug your Python script so that the issue can be resolved.
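The comment block in the snippet above lists several kinds of path (local, Blob, ADLS, datastore, data asset). As a rough sketch of how those forms can be told apart by their prefix, here is a hypothetical helper; `classify_data_path` and its return labels are made up for illustration and are not part of azure-ai-ml.

```python
def classify_data_path(path: str) -> str:
    """Return a label for the kind of location `path` points to.

    The labels mirror the path forms listed in the snippet's comments;
    this is a hypothetical helper, not an SDK function.
    """
    if path.startswith("wasbs://"):
        return "blob"
    if path.startswith("abfss://"):
        return "adls"
    if path.startswith("azureml://"):
        # Long- and short-form datastore URIs both contain '/datastores/'.
        return "datastore" if "/datastores/" in path else "data_asset"
    if path.startswith("azureml:"):
        # Short asset reference of the form azureml:<name>:<version>
        return "data_asset"
    return "local"
```

A check like this can help confirm that the string handed to `Input(path=...)` has the form that was intended before the job is submitted.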

I hope you understand. Thank you!

Jorge Lopez 36 Reputation points

2023-11-02T13:32:40.65+00:00
Hi @santoshkc ,

I didn't include the .py script because it shouldn't be the source of the problem, but let's assume it is something like this:

import argparse
import mlflow
import mlflow.pytorch
import torch
import pandas as pd

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", help="input data")
    parser.add_argument("--output_folder", help="output model folder")
    args = parser.parse_args()

    mlflow.start_run()

    data_df = pd.read_csv(args.data)

    # main code to create and train the model
    # ...

    # register model
    mlflow.pytorch.save_model(
        pytorch_model=model,
        path=args.output_folder,
    )

    mlflow.end_run()

if __name__ == "__main__":
    main()

Regarding the compute name, you are right that it needs to be a string, but compute_name_training is a variable that I assign the proper string value to outside the sample code.

While the code you present does indeed work, I'm afraid it doesn't help with my specific problem, which is reading a TabularDataset (SDK v1) in an SDK v2 experiment (I still haven't managed to solve it).

Anyways, thank you for your response, I appreciate your effort to help.

santoshkc 6,955 Microsoft Vendor

Hi @Jorge Lopez,

Thank you for your response.

To read a v1 TabularDataset data entity in v2, I hope you are following the code below:

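from azure.ai.ml import command, Input, MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.identity import DefaultAzureCredential

# MLClient.from_config() needs a credential; it reads config.json for the workspace details
ml_client = MLClient.from_config(credential=DefaultAzureCredential())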

filedataset_asset = ml_client.data.get(name="<tabulardataset_name>", version="<version>")

my_job_inputs = {
    "input_data": Input(
            type=AssetTypes.MLTABLE,
            path=filedataset_asset,
            mode=InputOutputModes.DIRECT
    )
}

job = command(
    code="./src",  # Local path where the code is stored
    command="python train.py --inputs ${{inputs.input_data}}",
    inputs=my_job_inputs,
    environment="<environment_name>:<version>",
    compute="cpu-cluster",
)

# Submit the command
returned_job = ml_client.jobs.create_or_update(job)
# Get a URL for the status of the job
returned_job.services["Studio"].endpoint

If you are still facing the error, double-check that the parameters and their values and types are correct. If you are using the CLI, add --debug to the end of your command.
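One type worth double-checking is the `path` argument. In the question's snippet, `path` receives the whole object returned by `ml_client.data.get(...)`, and the error text shows that object dumped as YAML followed by "If providing an ARM id, it should start with a '/'". That suggests, as an assumption rather than a confirmed diagnosis, that a string such as the asset's `id` is expected there. The sketch below uses a hypothetical helper `as_input_path` and a stand-in `FakeAsset` class so that it runs without the SDK.

```python
from dataclasses import dataclass


@dataclass
class FakeAsset:
    """Stand-in for the object returned by ml_client.data.get(); illustration only."""
    id: str


def as_input_path(asset_or_path) -> str:
    """Return a string usable as a job input path (hypothetical helper).

    Strings pass through unchanged; objects exposing an `id` attribute
    are replaced by that id, which is expected to start with '/'.
    """
    if isinstance(asset_or_path, str):
        return asset_or_path
    asset_id = getattr(asset_or_path, "id", None)
    if not isinstance(asset_id, str) or not asset_id.startswith("/"):
        raise ValueError("expected a path string or an asset whose id starts with '/'")
    return asset_id
```

With the SDK, this would correspond to writing `Input(type=AssetTypes.MLTABLE, path=as_input_path(data_asset), ...)`. Whether that resolves the original error was not verified in this thread.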

For more info: Access data in a job - Azure Machine Learning | Microsoft Learn

I hope this helps! Thanks.

santoshkc 6,955 Reputation points Microsoft Vendor

2023-11-04T10:31:09.95+00:00

Hi @Jorge Lopez,

We haven’t heard from you since the last response and are checking back to see whether you have found a resolution. If you have, please share it with the community, as it can be helpful to others.

Thank you!

Jorge Lopez 36

I found the following workflow to load the TabularDataset using the mltable.load() function:

import mltable

data_asset = ml_client.data.get(name="dataset_name", version=1)
tbl = mltable.load(f'azureml:/{data_asset.id}')
df = tbl.to_pandas_dataframe()
df.head()

It works perfectly locally, but when I try to run it in an experiment, it fails and gives me the following error:

Traceback (most recent call last):
  File "/mnt/azureml/cr/j/98fb20f6ea2942a4aac87dd66e512b3c/exe/wd/conv_sdk_v2.py", line 79, in <module>
    main()
  File "/mnt/azureml/cr/j/98fb20f6ea2942a4aac87dd66e512b3c/exe/wd/conv_sdk_v2.py", line 38, in main
    tbl = mltable.load(f'azureml://subscriptions/***/resourceGroups/***/providers/Microsoft.MachineLearningServices/workspaces/***/data/dataset_name/versions/1')
  File "/azureml-envs/azureml_308ee13ef7636f225d73cb563e7a8f56/lib/python3.9/site-packages/azureml/dataprep/api/_loggerfactory.py", line 273, in wrapper
    return func(*args, **kwargs)
  File "/azureml-envs/azureml_308ee13ef7636f225d73cb563e7a8f56/lib/python3.9/site-packages/mltable/mltable.py", line 526, in load
    return _load(uri, storage_options, True)
  File "/azureml-envs/azureml_308ee13ef7636f225d73cb563e7a8f56/lib/python3.9/site-packages/azureml/dataprep/api/_loggerfactory.py", line 273, in wrapper
    return func(*args, **kwargs)
  File "/azureml-envs/azureml_308ee13ef7636f225d73cb563e7a8f56/lib/python3.9/site-packages/mltable/mltable.py", line 579, in _load
    _reclassify_rslex_error(ex)
  File "/azureml-envs/azureml_308ee13ef7636f225d73cb563e7a8f56/lib/python3.9/site-packages/azureml/dataprep/api/mltable/_validation_and_error_handler.py", line 88, in _reclassify_rslex_error
    raise err
  File "/azureml-envs/azureml_308ee13ef7636f225d73cb563e7a8f56/lib/python3.9/site-packages/mltable/mltable.py", line 563, in _load
    = _load_mltable_from_data_asset_uri(match, storage_options, enable_validate)
  File "/azureml-envs/azureml_308ee13ef7636f225d73cb563e7a8f56/lib/python3.9/site-packages/mltable/mltable.py", line 474, in _load_mltable_from_data_asset_uri
    mltable_string = data_asset.additional_properties['legacyDataflow']
KeyError: 'legacyDataflow'
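
The URI shown in the traceback is the asset's ARM id prefixed with `azureml:/`, as built by the f-string in the snippet above. A minimal sketch of that construction follows; `asset_uri_from_arm_id` is a hypothetical name, and the sketch only illustrates how the URI is formed, it does not address the `KeyError` itself.

```python
def asset_uri_from_arm_id(arm_id: str) -> str:
    """Build the 'azureml://subscriptions/...' URI passed to mltable.load().

    Mirrors f'azureml:/{data_asset.id}' from the post; hypothetical helper.
    """
    if not arm_id.startswith("/"):
        raise ValueError("an ARM id should start with '/'")
    return "azureml:/" + arm_id
```

Logging the result of this construction inside the job and comparing it with the value used locally is one way to confirm that both runs load the same asset version.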

santoshkc 6,955 Reputation points Microsoft Vendor

2023-11-10T06:57:41.24+00:00

Hi @Jorge Lopez,

Sorry for the trouble you are facing. Please raise a support case through the Azure portal.

I hope you understand. Thank you!
