Every month or so I have to retrain a PyTorch model on data obtained by processing tables that sit in Azure Data Lake Storage Gen1.
So far, I have the following building blocks:
1. A Databricks notebook that does the ETL job of transforming the ADLS Gen1 tables into train/validation files, which are written to Blob Storage
2. Python scripts that I execute locally to submit an experiment to an AzureML workspace and train the PyTorch model using a ScriptRunConfig plus a training script, as in https://learn.microsoft.com/en-us/azure/machine-learning/how-to-train-pytorch, mounting blob storage to get the training data (a rough sketch of that submission script follows this list)
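For reference, the local submission script looks roughly like this; the environment name, compute target name, and datastore path are simplified placeholders:

```python
from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig, Dataset

# Workspace config downloaded from the AzureML portal (config.json)
ws = Workspace.from_config()

# Curated PyTorch environment; the exact name/version here is a placeholder
env = Environment.get(ws, name="AzureML-PyTorch-1.9-CUDA11.1-GPU")

# Training data written to blob storage by the Databricks ETL notebook
datastore = ws.get_default_datastore()
dataset = Dataset.File.from_files(path=(datastore, "train_valid/**"))

src = ScriptRunConfig(
    source_directory="./src",
    script="train.py",                       # my training script
    arguments=["--data", dataset.as_mount()],
    compute_target="gpu-cluster",            # placeholder compute target name
    environment=env,
)

run = Experiment(ws, "pytorch-retrain").submit(src)
run.wait_for_completion(show_output=True)
```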
How can I schedule steps 1 and 2 to run in sequence as a pipeline? Azure Data Factory seems a possible way to go, but what should I use as activities in ADF?
I see a few alternatives:
1. Step 1 surely stays a Databricks notebook (an ADF Databricks Notebook activity).
2a. For step 2, a Databricks Python activity calling the azureml-sdk classes (?)
Alternatives to 2a could be:
2b. an Azure Batch custom activity calling the azureml-sdk classes, which seems overkill to me
2c. a Machine Learning Execute Pipeline activity that runs an AzureML pipeline (https://learn.microsoft.com/en-us/azure/machine-learning/concept-ml-pipelines), though I'm not sure how to set this up; see the sketch after this list
2d. a Databricks Python activity that trains the PyTorch model directly on Databricks with MLflow tracking (https://learn.microsoft.com/en-us/azure/databricks/applications/mlflow/tracking-ex-pytorch) instead of calling the azureml-sdk classes
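To make 2c concrete, my understanding is that I would first publish an AzureML pipeline wrapping the training script and then point ADF's Machine Learning Execute Pipeline activity at the published pipeline ID. A rough sketch of the publishing side (pipeline, step, environment, and compute names are placeholders):

```python
from azureml.core import Workspace, Environment
from azureml.core.runconfig import RunConfiguration
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

run_config = RunConfiguration()
run_config.environment = Environment.get(ws, name="AzureML-PyTorch-1.9-CUDA11.1-GPU")

# Wrap the existing training script in a pipeline step
train_step = PythonScriptStep(
    name="train-pytorch-model",
    source_directory="./src",
    script_name="train.py",
    compute_target="gpu-cluster",   # placeholder compute target name
    runconfig=run_config,
    allow_reuse=False,
)

pipeline = Pipeline(workspace=ws, steps=[train_step])
published = pipeline.publish(
    name="pytorch-retrain-pipeline",
    description="Monthly retraining of the PyTorch model",
)

# This ID is what the ADF Machine Learning Execute Pipeline activity would reference
print(published.id)
```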
Can someone point me to the current best practice for this?