How to handle growing/changing datasets?

Gilles Ballegeer 1 Reputation point
2021-10-20T08:20:31.02+00:00

Our datasets are updated frequently, so we have to retrain regularly on the new data. However, there seems to be no way to expand a dataset and still use dataset versioning. This is what I'm currently testing, but it has some problems:

I create a dataset from a datastore and add new images to the datastore. This expands the dataset as we want and also updates the labeling job so that the new data can be labeled, which is handy because we don't need separate labeling jobs for the same project. However, when we export that dataset for training (Export > Export as Azure ML Dataset), it creates a new dataset. Is it possible to export into a new version of an existing dataset instead? That way we could reuse the training code, and the correct version would be recorded automatically.

Kind regards

Azure Machine Learning

1 answer

  1. Ramr-msft 17,826 Reputation points
    2021-10-20T13:51:00.55+00:00

    @Gilles Ballegeer Thanks for the question. Here is the link to the documentation on data drift:

    https://learn.microsoft.com/en-us/azure/machine-learning/how-to-monitor-datasets?tabs=python

    An Azure ML dataset does not version (snapshot) the underlying data; rather, it points to the underlying source.
    In this context, creating a new version means changing the schema (adding a column, etc.) rather than changing the underlying data.

    Version and track Azure Machine Learning datasets: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-version-track-datasets
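Because a dataset only points at its source, one way to get reproducible versions is to write each new batch to its own folder and register that folder under the same dataset name. The following is a minimal sketch, assuming the Python SDK v1 (`azureml-core`) and a workspace `config.json` on disk. The helper names `versioned_folder` and `register_new_version` are hypothetical, and the sketch does not cover the labeling export itself.

```python
from datetime import date


def versioned_folder(base: str, day: date) -> str:
    """Return a per-batch folder path such as 'images/2021-10-20'."""
    return f"{base.strip('/')}/{day.isoformat()}"


def register_new_version(dataset_name: str, datastore_name: str, folder: str):
    """Register `folder` as the next version of `dataset_name` (SDK v1)."""
    # Imported lazily so the path helper above works without azureml-core.
    from azureml.core import Dataset, Datastore, Workspace

    # Assumption: a config.json describing the workspace is available locally.
    ws = Workspace.from_config()
    datastore = Datastore.get(ws, datastore_name)

    # The dataset references the folder; no data is copied or snapshotted.
    dataset = Dataset.File.from_files(path=(datastore, folder))

    # create_new_version=True adds a version under the existing name
    # instead of failing because the name is already registered.
    return dataset.register(
        workspace=ws,
        name=dataset_name,
        create_new_version=True,
    )
```

A call such as `register_new_version("training-images", "workspaceblobstore", versioned_folder("images", date.today()))` would then add a version (the dataset and datastore names here are placeholders). Training code can keep using one name and fetch either the latest or a pinned version with `Dataset.get_by_name(ws, name, version=...)`.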
