In multi-step pipeline execution, how to maintain the data types of the columns when passing the dataset to the next step

Vinoth Kumar K 56 Reputation points
2022-02-25T16:08:24.397+00:00

I am building a pipeline with multiple steps.

  1. Step 1 - Read the data from a tabular dataset (with proper data types), apply transformations, and create an output dataset that is passed as input to step 2. However, when I opened this dataset from the pipeline run log, all the data types had become string instead of keeping the original data types of the input tabular dataset.
  2. Step 2 - Use the output dataset of step 1 as input and apply some more transformations. However, I have some logic based on data types that doesn't work, because the intermediate dataset does not keep the same data structure.

Is there any way to maintain the original data types/schema structure in the intermediate datasets?

Here are some snippets of my code:

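from azureml.data import OutputFileDatasetConfig

# Intermediate output of step 1: written as delimited files and uploaded
# to the datastore (def_blob_store is defined elsewhere).
feature_work = (
    OutputFileDatasetConfig(
        name="data_enhanced_add_global_variables",
        destination=(def_blob_store, "data/processed/output/1"),
    )
    .read_delimited_files()
    .as_upload(overwrite=True)
)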

from azureml.pipeline.steps import PythonScriptStep

# Step 1: reads the raw tabular dataset and writes feature_work.
feature_engineering_step_1 = PythonScriptStep(
    name="1_feature_engineering",
    # source_directory=experiment_folder,
    script_name="1_feature_engineering.py",
    arguments=[
        "--input-data", data_aggregate_DS.as_named_input("raw_data"),
        "--prepped-data", feature_work,
    ],
    # outputs=[prepped_data_folder],
    outputs=[feature_work],
    compute_target=compute_name,
    runconfig=pipeline_run_config,
    allow_reuse=True,
)

Step 2

feature_engineering_step_2 = PythonScriptStep(
    name="2_feature_engineering",
    # source_directory=experiment_folder,
    script_name="2_feature_engineering.py",
    arguments=[
        "--input-data", feature_work.as_input(name="raw_data"),
        "--prepped-data", feature_work1,
    ],
    outputs=[feature_work1],
    compute_target=compute_name,
    runconfig=pipeline_run_config,
    allow_reuse=True,
)

Azure Machine Learning

1 answer

  1. Ramr-msft 17,731 Reputation points
    2022-02-28T12:43:31.087+00:00

    @Vinoth Kumar K Thanks for the question. Could you please share the sample notebook that you are trying?
    Here are a notebook and a doc that can help:
    See OutputFileDatasetConfig, as in "Tutorial: ML pipelines for batch scoring - Azure Machine Learning | Microsoft Learn", as a means to pass data between pipeline steps.
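
A likely reason for the behavior described in the question is that the intermediate output is written as delimited text, and delimited files carry no type information, so every column comes back as a string unless the reader is told otherwise. The sketch below illustrates this with the Python standard library only, and shows one possible workaround: writing a small schema file next to the data and casting on read. The function names, the `.schema.json` suffix, and the `CASTERS` table are hypothetical and not part of the Azure ML SDK.

```python
import csv
import json
import os
import tempfile

# Hypothetical casters keyed by a type name stored in the sidecar schema.
CASTERS = {"int": int, "float": float, "str": str, "bool": lambda s: s == "True"}


def write_with_schema(rows, schema, csv_path):
    """Write rows to CSV and the column types to a JSON file next to it."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(schema))
        writer.writeheader()
        writer.writerows(rows)
    with open(csv_path + ".schema.json", "w") as f:
        json.dump(schema, f)


def read_raw(csv_path):
    """Read the CSV as-is: every value comes back as a string."""
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))


def read_with_schema(csv_path):
    """Read the CSV and cast each column using the sidecar schema."""
    with open(csv_path + ".schema.json") as f:
        schema = json.load(f)
    return [
        {col: CASTERS[schema[col]](val) for col, val in row.items()}
        for row in read_raw(csv_path)
    ]


def demo():
    """Round-trip one row and return (untyped rows, typed rows)."""
    rows = [{"id": 1, "price": 9.5, "in_stock": True, "name": "widget"}]
    schema = {"id": "int", "price": "float", "in_stock": "bool", "name": "str"}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "step1_output.csv")
        write_with_schema(rows, schema, path)
        return read_raw(path), read_with_schema(path)
```

In the pipeline itself, the equivalent idea would be to declare the types when the output is read back. As far as I know, `read_delimited_files` on `OutputFileDatasetConfig` accepts a `set_column_types` dictionary, and a typed format such as Parquet (via `read_parquet_files`) may avoid the problem altogether, but please verify both against the SDK documentation for your version.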