How to get scored data from an Azure ML pipeline to a csv file

Christopher Wilkinson 0 Reputation points
2024-03-14T11:53:52.1566667+00:00

Hi,

I am trying to run an Azure ML batch inference pipeline through ADF. The pipeline endpoint seems to run OK, but when I check the timestamps on the files that the pipeline should have created, I notice that they have not been updated.

The experiment is a 2 class neural network algorithm and I need the scored probabilities, as well as the label prediction, output to a csv file on an ML blob storage account.

I created a batch inference pipeline that uses a data asset as the input. At the end, there is an Export Data component that puts the results in a csv file in the ML blob storage, which is then used in a Power BI report. The data asset input is a csv file that gets updated every 15 minutes.

I have published the batch inference pipeline to an endpoint. This is what I am executing in the ADF pipeline using the Machine Learning Execute Pipeline component. The pipeline runs in a couple of seconds and completes, but it has not written to the file named in the Export Data component of the experiment.

I realise that I may be doing this all wrong because I have had no prior training on Azure Machine Learning. What I am trying to do is get Azure ML to run a pipeline using a previously trained model, with a csv file that is updated every 15 minutes as input (it could be parquet instead; the data actually comes from an Azure Synapse dedicated SQL pool). Then I want the results to be output to another csv/parquet file that is used to power a Power BI report.

I would be grateful if anyone could point me to a tutorial that shows how to do this sort of thing. I know how to train a model and create a batch inference pipeline, which works perfectly the first time I submit it, but thereafter it executes in a couple of seconds and I get no data output to my csv.

Any help would be appreciated.

Thanks in advance,

Chris

Azure Machine Learning
An Azure machine learning service for building and deploying models.

2 answers

  1. YutongTie-MSFT 48,586 Reputation points
    2024-03-15T04:30:27.5933333+00:00

    @Christopher Wilkinson Thanks for reaching out to us. There could be a few reasons why your pipeline is not writing the output to the CSV file. Here are a few steps you can take to diagnose and solve the issue:

    1. Check the Pipeline Output: After running the pipeline, check the output logs. The logs might contain information about any errors or issues that occurred while writing the data to the CSV file.
    2. Check the Permissions: Ensure that the Azure ML workspace has the necessary permissions to write data to the blob storage. You may need to set up a role assignment in the Azure portal to grant these permissions.
    3. Check the File Path: The path specified in the Export Data step should be correct. A common mistake is to give the path of the blob storage but not the specific file name. Make sure the path includes the file name and extension (e.g., /path/to/file.csv).
    4. Check the Data Format: The data you are trying to write should match the format of the CSV file. If the data format is not compatible with the CSV format, the write operation might fail.
    5. Use Python SDK: You can consider using the Azure ML Python SDK to write the output data to a CSV file.
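As a rough illustration of steps 3 and 5 above, the following is a minimal sketch of writing scored rows to a CSV file from a Python script. It uses only the Python standard library and writes to a local path; uploading the file to blob storage is not shown. The function name `write_scored_rows` and the `id` column are hypothetical, and the column names `Scored Labels` and `Scored Probabilities` are assumptions about what the scoring step emits, so adjust them to match your data.

```python
import csv
from pathlib import Path

# Column names are assumptions; adjust them to match the scored dataset.
FIELDNAMES = ["id", "Scored Labels", "Scored Probabilities"]


def write_scored_rows(rows, output_path):
    """Write scored rows to a CSV file and return the number of rows written."""
    path = Path(output_path)
    # Guard against a folder-only path: the target must name a .csv file.
    if path.suffix.lower() != ".csv":
        raise ValueError(f"Expected a path ending in .csv, got: {output_path}")
    # Create the parent folder if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    # newline="" lets the csv module control line endings itself.
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        count = 0
        for row in rows:
            writer.writerow(row)
            count += 1
    return count
```

Calling `write_scored_rows(scored, "outputs/scored.csv")` with a list of dictionaries keyed by the column names above would overwrite the file on every run, which is usually what a report that reads the latest scores needs.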

    Please check the above and let me know what you think.

    Regards,

    Yutong


  2. Christopher Wilkinson 0 Reputation points
    2024-03-17T22:36:20.22+00:00

    Hi,

    Thanks for your help. The issue was that the output was not being regenerated on each run; cached data was being reused instead. The data asset name itself never changes, but the underlying csv is updated every 15 minutes. I discovered that if I ticked the "Regenerate Output" option for certain components in the designer, the pipeline then worked perfectly.
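To illustrate why the output can go stale in this situation, here is a toy model of output reuse. It is not Azure ML code and does not reflect how the service is implemented; `ToyRunner`, `reuse_key` and the `regenerate_output` argument are hypothetical names. It only assumes the behaviour described above: the reuse decision looks at the step definition and the input's name, not at the contents of the file behind it.

```python
import hashlib
import json


def reuse_key(step_name, settings, input_asset_name):
    """Build a key from the step definition and the input's name only."""
    payload = json.dumps(
        {"step": step_name, "settings": settings, "input": input_asset_name},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToyRunner:
    """Toy stand-in for a pipeline service that reuses earlier step output."""

    def __init__(self):
        self._cache = {}
        self.executions = 0

    def run_step(self, step_name, settings, input_asset_name, compute,
                 regenerate_output=False):
        key = reuse_key(step_name, settings, input_asset_name)
        if not regenerate_output and key in self._cache:
            # Nothing in the key changed, so the old output is returned.
            return self._cache[key]
        self.executions += 1
        result = compute()
        self._cache[key] = result
        return result


# The data behind the asset changes, but the asset name does not.
source = {"rows": [1, 2, 3]}
runner = ToyRunner()
settings = {"model": "v1"}

first = runner.run_step("score", settings, "input_asset",
                        lambda: sum(source["rows"]))
source["rows"].append(4)  # simulates the 15-minute refresh of the csv
stale = runner.run_step("score", settings, "input_asset",
                        lambda: sum(source["rows"]))
fresh = runner.run_step("score", settings, "input_asset",
                        lambda: sum(source["rows"]), regenerate_output=True)
```

In this sketch the second call finishes without doing any work and returns the first result, which matches the symptom of a run that completes in a couple of seconds without updating the file. Only the call that forces regeneration picks up the new data.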

    I did try to use some Python which worked fine when run in the ML workspace itself, but if I called an endpoint from Data Factory (in Synapse) I got authentication errors. The problem is that I need to import AzureCliAuthentication() into my Python script but it says that I need to install the Azure CLI. But I do not know how to do this. I know the command "pip install azure-cli" is what I need to enter in a terminal. However, I do not know where to run this command.

    Because I am running the Python script from ADF in Synapse, the script component needs to be run on a Compute Instance not a VM. I cannot find the terminal for the Compute Instance in order to install the Azure CLI - which I need to perform authentication. So please, how do I install python packages on a Compute Instance - to use in a script pipeline component?

    Also, I realise now that I will need to know where (and how) to install the Azure CLI when running the script on a Kubernetes cluster, which I will need to do in the production environment.

    Thanks again for your help.