How Do I Create a ModelDirectory Type FileDataset

Lee Harper 31

I am trying to build a solution that automates part of the model deployment within the Azure ML designer. I am able to build a model with the designer, and then execute a Python script block to extract the trained_model_outputs folder from the model training block. I have precisely matched the folder structure that Azure ML designer assigns to the model's FileDataset.

When I register the trained_model_outputs as a FileDataset, it assigns it the type AnyDirectory. This is a problem, as when I try to build it into the inference pipeline, the designer rejects it, saying it must be a ModelDirectory, even though there shouldn't be any functional difference between the two.

I have seen that I can expose the ModelDirectory class as below. However, I cannot find API documentation for this class anywhere online, and I can't review its source code as it isn't in the standard SDK:

from azureml.studio.core.io.model_directory import ModelDirectory

Can you provide a code snippet or similar that I can use to leverage this class when creating the FileDataset so that the model dataset gains the ModelDirectory type attribute?

1 answer

Ramr-msft 17,616 Reputation points

2021-05-18T13:56:26.373+00:00

@Lee Harper Thanks for the question. Can you please add more details about the use case?

OutputFileDatasetConfig is a control plane concept for passing data between pipeline steps. PipelineData was intended to represent "transient" data passed from one step to the next, while OutputDatasetConfig was intended for capturing the final state of a dataset. PipelineData always outputs data in a folder structure like {run_id}{output_name}. OutputDatasetConfig lets you decouple the data from the run, so you can control where the data lands (although by default it produces a similar folder structure). OutputDatasetConfig even lets you register the output as a Dataset, where getting rid of such a folder structure makes sense. From the docs: "Represent how to copy the output of a run and be promoted as a FileDataset. The OutputFileDatasetConfig allows you to specify how you want a particular local path on the compute target to be uploaded to the specified destination".
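
For reference, a minimal sketch of the OutputFileDatasetConfig behaviour described above, assuming the azureml-core (SDK v1) package and a script-based pipeline rather than the designer. The helper names, the "models/..." destination folder, and the dataset name are hypothetical. A dataset registered this way is an ordinary FileDataset, so it would presumably still be typed as AnyDirectory in the designer.

```python
def destination_path(output_name: str, root: str = "models") -> str:
    """Datastore-relative folder for the output (hypothetical layout, no run id in the path)."""
    return f"{root}/{output_name}"


def build_registered_output(datastore, output_name: str, dataset_name: str):
    """Return an output config that lands data at a fixed path and registers it as a FileDataset."""
    # Imported lazily so the path helper above works without the SDK installed.
    from azureml.data import OutputFileDatasetConfig

    output = OutputFileDatasetConfig(
        name=output_name,
        destination=(datastore, destination_path(output_name)),
    )
    # register_on_complete returns a config that registers the output
    # as a dataset under dataset_name once the run has finished.
    return output.register_on_complete(name=dataset_name)
```

The returned object would be passed as an argument or output of a pipeline step, for example build_registered_output(ws.get_default_datastore(), "trained_model_outputs", "designer-trained-model"), where ws is an existing Workspace.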

Please follow the below link to use the upload API.
https://learn.microsoft.com/en-us/python/api/azureml-core/azureml.data.dataset_factory.filedatasetfactory?view=azure-ml-py#upload-directory-src-dir--target--pattern-none--overwrite-false--show-progress-true-
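
A rough sketch of how the upload API linked above could be used to upload a local folder and register it, assuming azureml-core (SDK v1). The function names, the "designer-models" folder, and the naming scheme are hypothetical. This only produces a plain FileDataset; nothing here sets a ModelDirectory type.

```python
from pathlib import PurePosixPath


def datastore_target_path(model_name: str, run_label: str) -> str:
    """Build the datastore-relative upload folder (hypothetical naming scheme)."""
    return str(PurePosixPath("designer-models") / model_name / run_label)


def upload_and_register(workspace, src_dir: str, model_name: str, run_label: str):
    """Upload src_dir to the default datastore and register it as a new FileDataset version."""
    # Imported lazily so the path helper above works without the SDK installed.
    from azureml.core import Dataset

    datastore = workspace.get_default_datastore()
    target = (datastore, datastore_target_path(model_name, run_label))

    # upload_directory copies the local folder to the datastore and returns a FileDataset.
    file_dataset = Dataset.File.upload_directory(
        src_dir=src_dir, target=target, overwrite=True
    )
    # create_new_version=True adds a new version under the same dataset name.
    return file_dataset.register(
        workspace=workspace, name=model_name, create_new_version=True
    )
```

Calling upload_and_register(ws, "./trained_model_outputs", "churn-model", "2021-05-18") with an existing Workspace ws would register the folder under the name "churn-model"; the latest version could later be fetched with Dataset.get_by_name(ws, "churn-model").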
Lee Harper 31 Reputation points

2021-05-18T14:14:36.123+00:00

Hi @Ramr-msft , this doesn't answer my question - I am already outputting the model data from the designer pipeline into a known location as a FileDataset. I specifically want to know how I can define this dataset to have the ModelDirectory type, rather than the AnyDirectory type, when executing from an "Execute Python Code" block within the designer, so that I can use this user-defined model dataset in a designer inference pipeline as the model input to the scoring block.

As things stand right now, the designer pipeline won't validate because the FileDataset has this incorrect sub-type. Since this directory is a direct clone of the standard trained_model_outputs output folder from a "Tune Model Hyperparameters" block, there is no reason it won't work besides having the wrong metadata associated with it on registration.

The appropriate AML package to use here looks like it's this, but I cannot find online documentation for it:

from azureml.studio.core.io.model_directory import ModelDirectory

Ramr-msft 17,616 Reputation points

2021-05-20T12:22:31.18+00:00

@Lee Harper Thanks for the details. We have forwarded to the product team to check.

Lee Harper 31 Reputation points

2021-05-21T14:03:06.467+00:00

@Ramr-msft that would be great - looking forward to connecting on that

Ramr-msft 17,616 Reputation points

2021-05-25T12:00:41.307+00:00

@Lee Harper Thanks for the details. It would be great if you could add more details about the use case, and also about why you want to register the model yourself and then use it in the inference pipeline.

Registering a dataset whose assigned port type is "Model Directory" is an internal method, and there are no plans to expose it to customers in the short term.

Lee Harper 31 Reputation points

2021-05-25T13:47:55.99+00:00

@Ramr-msft I will file a support ticket. To give some more context for the sake of this question that others might see - we are trying to bring the designer in line with enterprise change management principles, including (as much as is possible) code freezes, pipeline version control, automated CI/CD and enforced naming conventions for artifacts (models, transforms etc).

We have a training pipeline that is compliant, however the designer requires a high degree of manual manipulation to update the AKS in the deployment phase. If we can define fixed dataset objects which are updated in the training phase, then we can code freeze the inference pipeline, and updating the AKS webservice becomes a case of a couple of button pushes (since we can define the datasets to use the most recent dataset versions only), which could be automated through an RPA process.

Ramr-msft 17,616 Reputation points

2021-05-27T07:12:34.73+00:00

@Lee Harper Thanks for the update.

Dustin Reagan 1 Reputation point

2021-12-01T19:30:18.627+00:00

@Lee Harper Did you ever get a response from your support ticket? I have the exact same use-case in mind and have been scouring the documentation for how to create a DataSet with ModelDirectory type.

Did you ever come up with a work-around for this? I'm quite surprised that this isn't currently possible. In fact, I must be missing something, because I don't understand how users are expected to build automated training & inference pipelines using the designer without this sort of functionality.

Lee Harper 31 Reputation points

2021-12-02T01:16:08.323+00:00

@Dustin Reagan I managed to take this all the way to the Microsoft product development team, and the answer (paraphrased) was as follows. This ModelDirectory class is 100% abstracted from the users, and thus cannot be called from within a user-defined block.

The group told me that the designer is not meant to be a production-grade tool in its current implementation - more of a prototyping tool - and it is not compatible with CI/CD processes the way Python scripts are. Automated retraining in the designer is not a supported scenario as of today either, since we can't create an API for the training pipeline. I can't speak to the roadmap, but I did advocate that some method of enabling automation and MLOps in the designer would be highly desirable.

So I don't think you're missing anything. In the end we had to work with the team to convert the prototyped pipelines into Python equivalents.

Dustin Reagan 1 Reputation point

2021-12-02T14:40:20.407+00:00

@Lee Harper

Ah, I see, thanks for the response.

Aside from CI/CD, it seems that the API and tooling around the designer (in conjunction with other Azure infrastructure) do (almost) allow an automated training pipeline (except, of course, for this one niggling issue we're discussing in this thread). For instance, I was able to prototype a set of Azure Functions that monitor a storage container for new training data and automatically run Inference and designer-built Training pipelines.

Lee Harper 31 Reputation points

2021-12-08T16:56:05.317+00:00

Totally agree - if this one thing were solved, then CI would be totally possible with the designer.