Azure ML - Reinforcement Learning: `Experiment.submit` of Python SDK (azureml.contrib.train.rl) returns 530 with "unkown to cluster" message in response.

Rikki Goudarzi 1 Reputation point
2022-12-08T07:33:53.53+00:00

Context of where I am:

  • I'm using "azureml.contrib.train.rl" in Python SDK (current version).
  • I'm also using the PONG notebook as a starter.
  • The ML workspace and two computes have been created using ARM/DevOps.

The issue:

  • When the training code calls Experiment.submit, it appears to end up internally at reinforcement_learning_operations.py#94, where a POST call is made to https://australiaeast.experiments.azureml.net/reinforcementlearning/v1.0/subscriptions/xxxxxxxx/resourceGroups/rg-xxx-xxx-ae-demo/providers/Microsoft.MachineLearningServices/workspaces/mlw-xxx-xxx-ae-demo/experiments/rllib-pong-multi-node-2/startrun/rllib-pong-multi-node-2_1670483726_01fc29d0
  • The response comes back with status 530, and the response content is 'unkown to cluster'.
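
One quick sanity check is whether the URL targets the region, resource group, and workspace that were actually deployed through ARM/DevOps. The following is only a minimal sketch, assuming the URL shape shown above; `parse_startrun_url` is a hypothetical helper written for illustration and is not part of the SDK.

```python
from urllib.parse import urlparse


def parse_startrun_url(url):
    """Split a startrun URL (shape assumed from the post above) into its parts."""
    parsed = urlparse(url)
    # The region is the first label of the host name, e.g. "australiaeast".
    region = parsed.hostname.split(".")[0]
    segments = [s for s in parsed.path.split("/") if s]
    lowered = [s.lower() for s in segments]

    def value_after(key):
        # Return the path segment that follows `key` (case-insensitive match).
        return segments[lowered.index(key.lower()) + 1]

    return {
        "region": region,
        "subscription": value_after("subscriptions"),
        "resource_group": value_after("resourceGroups"),
        "workspace": value_after("workspaces"),
        "experiment": value_after("experiments"),
        "run_id": value_after("startrun"),
    }


url = (
    "https://australiaeast.experiments.azureml.net/reinforcementlearning/v1.0"
    "/subscriptions/xxxxxxxx/resourceGroups/rg-xxx-xxx-ae-demo"
    "/providers/Microsoft.MachineLearningServices/workspaces/mlw-xxx-xxx-ae-demo"
    "/experiments/rllib-pong-multi-node-2"
    "/startrun/rllib-pong-multi-node-2_1670483726_01fc29d0"
)
parts = parse_startrun_url(url)
for name, value in parts.items():
    print(f"{name}: {value}")
```

The printed values can then be compared by hand with the `location`, `resource_group`, and `name` attributes of the `Workspace` object used in the notebook. A mismatch would point at configuration rather than at the cluster, although a match does not rule out a service-side problem.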

Could you please review this issue and guide me through diagnosing it?

Thank you.

  • Rikki
Azure Machine Learning

1 answer

  1. romungi-MSFT 43,656 Reputation points Microsoft Employee
    2022-12-08T09:43:12.267+00:00

    @Rikki Goudarzi Are you using ray_on_aml(), which is now supported for reinforcement learning experiments with Azure Machine Learning?
    I think the job that is created should generate more logs, which would help in checking for any underlying issue with the compute cluster. Are you following this document from the Azure ML RL documentation?

    If an answer is helpful, please accept it or upvote it, which might help other community members reading this thread.
