Azure Machine Learning - Batch Scoring with ParallelRunConfig output_action='summary_only'

Daniel Tudorache 1

Hello,

I have deployed a batch inferencing service and I want to save the minibatch results in JSON format. My understanding after reading the ParallelRunConfig documentation is that with output_action="append_row" the run() function can return only list or pandas DataFrame objects.
I tried switching to output_action='summary_only', but then nothing is saved to the datastore anymore.
I could not find any examples on how to use output_action='summary_only' except the below explanation, which does not give details on how to store the output:

'append_row' – All values output by run() method invocations will be aggregated into one unique file named parallel_run_step.txt that is created in the output location.
'summary_only' – User script is expected to store the output by itself. An output row is still expected for each successful input item processed. The system uses this output only for error threshold calculation (ignoring the actual value of the row).

Do you know how I can save the results of each minibatch processed by the run() function as JSON in the datastore?

Thank you,
Daniel

YutongTie-MSFT 46,991 Reputation points

2022-02-04T00:50:58.18+00:00

Hello @Daniel Tudorache

Thanks for reaching out to us. The old package, azureml-contrib-pipeline-steps, has been deprecated and moved to azureml-pipeline-steps. Could you please make sure you are using the latest package?

There is also sample code showing how to set up the project here:
https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/machine-learning-pipelines/parallel-run

I see one thing mentioned in the repo:
run(mini_batch) - The method to be parallelized. Each invocation will have one minibatch.
mini_batch: Batch inference will invoke the run method and pass either a list or a Pandas DataFrame as an argument to the method. Each entry in mini_batch will be a file path if the input is a FileDataset, or a Pandas DataFrame if the input is a TabularDataset.
run method response: run() method should return a Pandas DataFrame or an array. For append_row output_action, these returned elements are appended into the common output file. For summary_only, the contents of the elements are ignored. For all output actions, each returned output element indicates one successful inference of input element in the input mini-batch. User should make sure that enough data is included in inference result to map input to inference. Inference output will be written in output file and not guaranteed to be in order, user should use some key in the output to map it to input.
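
To make the summary_only case above more concrete, here is a minimal sketch of an entry script that writes one JSON file per minibatch itself and still returns one element per processed input. This is an illustration under assumptions, not the documented approach: the `--output_dir` argument, the `score` helper, and the `minibatch_<id>.json` file naming are hypothetical, and the sketch assumes a FileDataset input so that `mini_batch` is a list of file paths.

```python
import argparse
import json
import os
import uuid

OUTPUT_DIR = None  # set once per worker process in init()


def init(argv=None):
    """Called once per worker before any run() call."""
    global OUTPUT_DIR
    parser = argparse.ArgumentParser()
    # Hypothetical argument: assumed to be supplied by the pipeline step
    # and to point at a folder that is backed by the datastore.
    parser.add_argument("--output_dir", default="outputs")
    # parse_known_args ignores any other arguments passed to the script.
    args, _ = parser.parse_known_args(argv)
    OUTPUT_DIR = args.output_dir
    os.makedirs(OUTPUT_DIR, exist_ok=True)


def score(item):
    """Hypothetical placeholder for the real model call."""
    return {"input": str(item), "prediction": len(str(item))}


def run(mini_batch):
    """Score one minibatch and store its results as a single JSON file."""
    results = [score(item) for item in mini_batch]

    # A random file name avoids collisions between parallel workers.
    out_path = os.path.join(OUTPUT_DIR, f"minibatch_{uuid.uuid4().hex}.json")
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(results, f)

    # With summary_only the returned values are ignored, but one element
    # per successfully processed input is still expected.
    return [r["input"] for r in results]
```

In a pipeline, the hypothetical `--output_dir` value would presumably be passed through the step's `arguments` and point at the step's output location, so that the files written by run() end up in the datastore. Please verify this against the sample notebooks linked above.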

Could you please let us know if "append_row" works on your side?

Regards,
Yutong

Daniel Tudorache 1 Reputation point

2022-02-04T08:18:29.64+00:00

Hi @YutongTie-MSFT ,

Yes, I am using the new package, azureml-pipeline-steps, and 'append_row' works fine. I am having trouble with the 'summary_only' option because I don't know how to save the output file to the datastore.

Thank you,
Daniel