In multi-step pipeline execution, how do I maintain the column data types when passing a dataset to the next step?

Question

I am building a pipeline with multiple steps.

Step 1 - Read the data from a tabular dataset (with proper data types), apply transformations, and create an output dataset that is passed as input to step 2. However, when I open this dataset from the pipeline run log, every column's data type has become string instead of keeping the original data types of the input tabular dataset.
Step 2 - Use the output dataset of step 1 as input and apply some more transformations. However, I have some logic based on data types that doesn't work, because the intermediate dataset does not keep the same schema.

Is there any way to maintain the original data types/schema in the intermediate datasets?

Here are some snippets from my code:

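from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep

# Intermediate output of step 1, read back as delimited (CSV) files
feature_work = (
    OutputFileDatasetConfig(
        name="data_enhanced_add_global_variables",
        destination=(def_blob_store, "data/processed/output/1"),
    )
    .read_delimited_files()
    .as_upload(overwrite=True)
)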

feature_engineering_step_1 = PythonScriptStep(
    name="1_feature_engineering",
    # source_directory=experiment_folder,
    script_name="1_feature_engineering.py",
    arguments=[
        "--input-data", data_aggregate_DS.as_named_input("raw_data"),
        "--prepped-data", feature_work,
    ],
    # outputs=[prepped_data_folder],
    outputs=[feature_work],
    compute_target=compute_name,
    runconfig=pipeline_run_config,
    allow_reuse=True,
)

Step 2

feature_engineering_step_2 = PythonScriptStep(
    name="2_feature_engineering",
    # source_directory=experiment_folder,
    script_name="2_feature_engineering.py",
    arguments=[
        "--input-data", feature_work.as_input(name="raw_data"),
        "--prepped-data", feature_work1,
    ],
    outputs=[feature_work1],
    compute_target=compute_name,
    runconfig=pipeline_run_config,
    allow_reuse=True,
)

Answer

@Vinoth Kumar K Thanks for the question. Could you please share the sample notebook that you are working with?
Here are a notebook and a doc that may help:
OutputFileDatasetConfig, as used in "Tutorial: ML pipelines for batch scoring - Azure Machine Learning | Microsoft Learn", is a means to pass data between pipeline steps.
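
For background on why the types are lost: a delimited text file stores only characters, so anything written as CSV in step 1 comes back as strings unless the reader is told the types. The sketch below is not Azure ML code. It is a minimal, standard-library illustration of one possible workaround, assuming step 1 can write a small JSON schema next to its CSV output and step 2 can read it. The names `write_with_schema`, `read_with_schema`, and `CASTERS` are hypothetical helpers made up for this example.

```python
import csv
import io
import json

# Hypothetical mapping from a type name stored in the sidecar schema to a caster.
CASTERS = {
    "int": int,
    "float": float,
    "bool": lambda s: s == "True",
    "str": str,
}


def write_with_schema(rows, schema):
    """Step 1 side: serialise rows to CSV text plus a JSON schema of column types."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(schema))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue(), json.dumps(schema)


def read_with_schema(csv_text, schema_json):
    """Step 2 side: parse the CSV and cast each column back using the schema."""
    schema = json.loads(schema_json)
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {col: CASTERS[schema[col]](val) for col, val in row.items()}
        for row in reader
    ]


rows = [
    {"id": 1, "price": 9.5, "active": True, "name": "a"},
    {"id": 2, "price": 3.0, "active": False, "name": "b"},
]
schema = {"id": "int", "price": "float", "active": "bool", "name": "str"}

csv_text, schema_json = write_with_schema(rows, schema)

# Reading the CSV without the schema: every value is a string.
untyped = list(csv.DictReader(io.StringIO(csv_text)))

# Reading it with the schema: the original Python types are restored.
restored = read_with_schema(csv_text, schema_json)
```

In the pipeline itself, the same idea would mean either declaring the column types when the intermediate output is read or writing the intermediate data in a typed format. The SDK reference for `OutputFileDatasetConfig` documents a `set_column_types` parameter on `read_delimited_files` and a separate `read_parquet_files` method. Both are worth checking against the SDK version you have installed before relying on them.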
