This article shows how to collect data from an Azure Machine Learning model deployed on an Azure Kubernetes Service (AKS) cluster. The collected data is then stored in Azure Blob storage.
Once collection is enabled, the data you collect helps you:
Make better decisions about when to retrain or optimize your model.
Retrain your model with the collected data.
Limitations
The model data collection feature works only with the Ubuntu 18.04 image.
Important
As of 03/10/2023, the Ubuntu 18.04 image is deprecated. Support for Ubuntu 18.04 images will be dropped when the image reaches EOL on April 30, 2023.
The MDC feature is incompatible with any image other than Ubuntu 18.04, which is not available after the Ubuntu 18.04 image is deprecated.
In versions of the Azure Machine Learning SDK for Python earlier than version 0.1.0a16, the designation argument is named identifier. If you developed your code with an earlier version, you need to update it accordingly.
Prerequisites
If you don't have an Azure subscription, create a free account before you begin.
You need an Azure Machine Learning workspace, a local directory that contains your scripts, and the Azure Machine Learning SDK for Python installed. To learn how to install them, see How to configure a development environment.
You need a trained machine learning model to deploy to AKS. If you don't have a model, see the Train image classification model tutorial.
Use a Docker image based on Ubuntu 18.04, which ships with libssl 1.0.0, the essential dependency of modeldatacollector. You can refer to prebuilt images.
Enable data collection
You can enable data collection regardless of whether you deploy your model through Azure Machine Learning or other tools.
To enable data collection, you need to:
Open the scoring file.
Add the following code at the top of the file:
from azureml.monitoring import ModelDataCollector
Declare your data collection variables in your init function:
CorrelationId is an optional parameter. You don't need to use it if your model doesn't require it. Using CorrelationId helps you more easily map the collected data to other data, such as LoanNumber or CustomerId.
The Identifier parameter is later used for building the folder structure in your blob. You can use it to differentiate raw data from processed data.
Add the following lines of code to the run(input_df) function:
data = np.array(input_df)
result = model.predict(data)
inputs_dc.collect(data) # this call saves the input data to Azure Blob storage
prediction_dc.collect(result) # this call saves the prediction data to Azure Blob storage
Data collection isn't automatically enabled when you deploy a service in AKS. Update your configuration file, as in the following example:
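A sketch of an AKS deployment configuration with data collection enabled, assuming the azureml-core SDK; AksWebservice is stubbed here so the snippet runs standalone:

```python
# Sketch: turning on model data collection in the AKS deployment config.
# AksWebservice is stubbed so the snippet runs without azureml-core installed.
try:
    from azureml.core.webservice import AksWebservice
except ImportError:
    class AksWebservice:  # stand-in exposing the same factory method
        class _Config:
            def __init__(self, collect_model_data):
                self.collect_model_data = collect_model_data

        @staticmethod
        def deploy_configuration(collect_model_data=False, **kwargs):
            return AksWebservice._Config(collect_model_data)

# collect_model_data=True switches ModelDataCollector output on for the service
aks_config = AksWebservice.deploy_configuration(collect_model_data=True)
```

Pass this configuration object when you deploy the service to AKS.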
Add your storage account name and enter your storage key. You can find this information by selecting Settings > Access keys in your storage account.
Select the model data container and select Edit.
In the query editor, click under the Name column and add your storage account.
Enter your model path into the filter. If you want to look only at files from a specific year or month, expand the filter path accordingly. For example, to look only at March data, extend the filter path with the year and month folders.
Manage data ingestion and preparation, model training and deployment, and machine learning solution monitoring with Python, Azure Machine Learning, and MLflow.