
Azure ML Job Submission Failure: "Unknown compute target" despite existing and Succeeded computes (Workspace: Azureml-SDK-WS01)

JustinChow-4155 0 Reputation points
2025-07-14T01:05:27.1266667+00:00

Problem Description:

I am unable to submit any Azure Machine Learning jobs to my workspace. Every submission attempt fails with a ValidationException or MlException indicating: "Operation returned an invalid status 'Not found compute with name azureml:<compute-name>'". This blocks my main pipeline job (220_pipeline_job.py), which fails at submission with the same 'Unknown compute target' error.

This occurs despite ml_client.compute.list() showing the target compute resources as existing and in a 'Succeeded' state. No jobs are created in the Azure ML Jobs section.

Environment Details:

  • Azure Resource Group Name: azuremlRG01
  • Azure ML Workspace Name: Azureml-SDK-WS01
  • Region: centralus
  • Compute Instance Name (where JupyterLab is running): pipeline-cluster
  • Compute Cluster Names (existing): AML-SDK-D001, my-cluster-001
  • Azure ML SDK Version: azure-ai-ml==1.28.1
  • Python Version: 3.8 (within myenv_new_sdk Conda environment)
  • Authentication Method: DefaultAzureCredential (using config.json successfully)

Troubleshooting Steps Already Performed:

I have performed extensive troubleshooting to isolate this issue, ruling out common client-side problems:

Initial Pipeline Submission Errors (SDK API Compatibility):

  • Initially encountered TypeError: 'CommandJob' object is not callable (due to environment typo and then incorrect use of @command decorator).
  • Resolved by transitioning component definitions from @command decorator to YAML files.
  • Encountered and resolved various AttributeError and NameError issues related to load, load_component, and CommandComponent.from_yaml (due to SDK version 1.9.0's specific API surface).
  • Current component loading method: load_component(source="./component.yml"), which is now successful.

Persistent "Unknown Compute Target" Error:

  • After resolving component loading, all job submissions (both pipeline and single command jobs) began failing with "Operation returned an invalid status 'Not found compute with name azureml:<compute-name>'".
  • Verified Compute Target Existence (SDK): Executed ml_client.compute.list() which successfully lists all relevant compute targets:

--- Available Compute Targets in this workspace ---

  • Name: AML-SDK-D001, Type: amlcompute, State: Succeeded
  • Name: my-cluster-001, Type: amlcompute, State: Succeeded
  • Name: pipeline-cluster, Type: computeinstance, State: Succeeded
  • Verified Compute Target Existence (UI): Confirmed in Azure ML Studio UI that all listed compute clusters and instances exist and are in a 'Succeeded' state. Noted 'Unprovisioned nodes' status for clusters, which is understood as normal idle scaling.
  • Attempted Multiple Compute Targets: Tried submitting jobs to AML-SDK-D001, my-cluster-001, and pipeline-cluster. All failed with the exact same "Unknown compute target" error.
  • Verified Code Source Upload: The src directory (containing dummy_script.py) is successfully uploaded during the submission attempt, indicating the client-side packaging and initial communication are working.
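The compute listing step above can be reproduced with a short helper. This is a minimal sketch, assuming the objects yielded by `ml_client.compute.list()` expose `name`, `type`, and `provisioning_state`; `summarize_computes` is a hypothetical helper name, not part of the SDK, and the demo uses stand-in objects so it runs without an Azure connection.

```python
from types import SimpleNamespace


def summarize_computes(computes):
    """Return one 'Name / Type / State' line per compute target.

    `computes` is any iterable of objects exposing `name`, `type`, and
    `provisioning_state`, such as the result of `ml_client.compute.list()`.
    """
    return [
        f"Name: {c.name}, Type: {c.type}, State: {c.provisioning_state}"
        for c in computes
    ]


# Offline demonstration with a stand-in object (no workspace needed).
demo = [
    SimpleNamespace(
        name="AML-SDK-D001", type="amlcompute", provisioning_state="Succeeded"
    )
]
print("\n".join(summarize_computes(demo)))

# Against a live workspace (requires azure-ai-ml and a config.json):
#   from azure.ai.ml import MLClient
#   from azure.identity import DefaultAzureCredential
#   ml_client = MLClient.from_config(credential=DefaultAzureCredential())
#   print("\n".join(summarize_computes(ml_client.compute.list())))
```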

Environment and SDK Integrity:

  • Performed a complete Conda environment rebuild from scratch (conda env remove, conda env create) to eliminate any potential corruption or dependency conflicts.
  • Successfully upgraded azure-ai-ml to 1.28.1 in the new environment.

Azure RBAC Permissions (Crucial Confirmation):

  • Confirmed in Azure Portal IAM that the submitting identity (my user account) has the "Owner" role assigned at the Subscription level. This grants full control over all resources, definitively ruling out insufficient permissions as the cause.

Conclusion:

All client-side configurations, SDK versions, environment setups, and RBAC permissions have been verified and appear correct (the submitting identity has the 'Owner' role at the Subscription level), and the issue persists across multiple compute targets, including a newly created test cluster. This strongly suggests an underlying issue with the Azure Machine Learning backend service in this workspace or subscription. The service appears unable to recognize and link to its own registered compute targets during job submission validation, leading to the 'Unknown compute target' error.

Request:

What could be causing this persistent 'Unknown compute target' error during Azure Machine Learning job submission, despite all client-side troubleshooting steps being exhausted and confirmed compute targets appearing as 'Succeeded'? We are seeking guidance from the community and prioritized assistance from a support engineer to diagnose why the Azure ML backend service is not recognizing its own valid compute targets during job validation.

 

Relevant Code Snippets (from check_single_job.py that produced the error):

import os
from azure.ai.ml import MLClient, command, load_component
from azure.identity import DefaultAzureCredential
from azure.ai.ml.entities import Environment
from pathlib import Path

# Connect to workspace
try:
    credential = DefaultAzureCredential()
    ml_client = MLClient.from_config(credential=credential)
    print(f"Connected to workspace '{ml_client.workspace_name}' in resource group '{ml_client.resource_group_name}' via config.json.")
except Exception as e:
    print(f"Could not connect to workspace: {e}")
    print("Please ensure you have configured your workspace connection properly (e.g., config.json).")
    exit(1)

# Define the absolute path to your 'src' directory (assuming it's a sibling to this script)
project_root = Path(__file__).parent
source_code_path = project_root / "src"

# Define a simple environment
test_env = Environment(
    name="basic-test-env",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
    conda_file={
        "name": "basic_env",
        "channels": ["conda-forge"],
        "dependencies": ["python=3.8", "pip"]
    }
)

# Define a very simple command job
test_job = command(
    code=str(source_code_path),
    command="python dummy_script.py",
    environment=test_env,
    compute="azureml:AML-SDK-D001", # This was changed to my-cluster-001, pipeline-cluster, and a new test cluster, all failed.
    display_name="TestComputeJob",
    experiment_name="ComputeTest"
)

print(f"Submitting test job to {test_job.compute}...")
try:
    returned_job = ml_client.jobs.create_or_update(test_job)
    print(f"Test job submitted: {returned_job.name}")
    print(f"View in AzureML Studio: {returned_job.studio_url}")
    print("Waiting for test job to complete...")
    ml_client.jobs.stream(returned_job.name)
    print(f"Test job completed with status: {returned_job.status}")
except Exception as e:
    print(f"Failed to submit test job: {e}")
    if hasattr(e, 'errors'):
        print(f"Validation errors: {e.errors}")
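One detail worth ruling out, offered as an assumption rather than a confirmed diagnosis: the service error echoes the `azureml:` prefix as part of the compute name, which could mean the prefix is reaching the backend literally. The Python SDK tutorials typically pass the bare cluster name to `command(compute=...)`. The sketch below strips the prefix and checks that the bare name resolves through `ml_client.compute.get` before submission. `bare_compute_name` and `compute_exists` are hypothetical helper names, and the demo uses a stand-in client so it runs offline.

```python
from types import SimpleNamespace

AZUREML_PREFIX = "azureml:"


def bare_compute_name(compute):
    """Drop a leading 'azureml:' so only the plain compute name remains."""
    if compute.startswith(AZUREML_PREFIX):
        return compute[len(AZUREML_PREFIX):]
    return compute


def compute_exists(ml_client, compute):
    """Return True if the workspace resolves the bare compute name.

    Any exception raised by `ml_client.compute.get` (for example a
    not-found error) is treated as "does not resolve"; a real script
    may prefer to catch a narrower exception type.
    """
    try:
        ml_client.compute.get(bare_compute_name(compute))
        return True
    except Exception:
        return False


# Stand-in client for an offline check; only 'AML-SDK-D001' is known to it.
def _fake_get(name):
    if name != "AML-SDK-D001":
        raise LookupError(name)
    return SimpleNamespace(name=name)


fake_client = SimpleNamespace(compute=SimpleNamespace(get=_fake_get))
print(compute_exists(fake_client, "azureml:AML-SDK-D001"))
print(compute_exists(fake_client, "azureml:missing"))
```

In check_single_job.py this would mean passing `compute=bare_compute_name("azureml:AML-SDK-D001")` to `command(...)` and comparing the outcome with the prefixed form.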

Full Error Traceback (example, the exact one from the sample run):

Failed to submit test job: (UserError) Unknown compute target 'azureml:pipeline-cluster'.
Code: UserError
Message: Unknown compute target 'azureml:pipeline-cluster'.
Additional Information:
Type: ComponentName
Info: {
    "value": "managementfrontend"
}
Type: Correlation
Info: {
    "value": {
        "operation": "bd6e319f2b978e30f75e45dec7f51639", # This correlation ID will change per attempt
        "request": "97dd836de832c3b0" # This request ID will change per attempt
    }
}
Type: Environment
Info: {
    "value": "centralus"
}
Type: Location
Info: {
    "value": "centralus"
}
Type: Time
Info: {
    "value": "2025-07-13T00:44:29.3925051+00:00" # This timestamp will change per attempt
}
Type: InnerError
Info: {
    "value": {
        "code": "BadArgument",
        "innerError": {
            "code": "UnknownTargetType",
            "innerError": null

      

Below is the main pipeline script that failed to run, which led me to write check_single_job.py to test a simple command job against the compute targets. The two YAML files used for component loading are also included.

train_model_component.yml (main pipeline job - component YAML definition #1):

$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: train_model_component
display_name: 02 Train the Model Component
version: 2 # You can increment this version if you make changes later
type: command
description: Trains the machine learning model
inputs:
  processed_data:
    type: uri_folder
outputs:
  sourced_data:
    type: uri_folder
    mode: rw_mount
code: .
command: "python '220_Training_Pipeline.py' --datafolder ${{inputs.processed_data}}"
environment: azureml://registries/azureml/environments/sklearn-1.5/versions/28

data_preparation_component.yml (main pipeline job - component YAML definition #2):

$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: data_preparation_component
display_name: 01 Data Preparation Component
version: 2 # You can increment this version if you make changes later
type: command
description: Prepares raw data for model training
inputs:
  raw_data:
    type: uri_file
outputs:
  data_output:
    type: uri_folder
    mode: rw_mount
code: .
command: "python '220_Dataprep_Pipeline.py' --datafolder ${{outputs.data_output}} --raw_data ${{inputs.raw_data}}"
environment: azureml://registries/azureml/environments/sklearn-1.5/versions/28

220_pipeline_job.py (main pipeline job code that failed job submission and resulted in test using check_single_job.py):

import os
from azure.ai.ml import MLClient, Input, Output, load_component
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Data, Environment # Removed CommandComponent here
from azure.identity import DefaultAzureCredential

# Ensure ml_client is defined and connected
try:
    credential = DefaultAzureCredential()
    ml_client = MLClient.from_config(credential=credential)
    print(f"Connected to workspace '{ml_client.workspace_name}' in resource group '{ml_client.resource_group_name}' via config.json.")
except Exception as e:
    print(f"Could not connect to workspace: {e}")
    print("Please ensure you have configured your workspace connection properly (e.g., config.json).")
    exit(1)

# Load components using the top-level 'load_component' function
# Ensure data_preparation_component.yml and train_model_component.yml are in the same directory
data_prep_component = load_component(source="./data_preparation_component.yml")
train_component = load_component(source="./train_model_component.yml")
print("Components loaded from YAML successfully.")

# Define the input data asset for the entire pipeline
pipeline_input_data = Input(
    type=AssetTypes.URI_FILE,
    path="azureml://datastores/workspaceblobstore/paths/LocalUpload/2f2e86f7f9db38ca9c5d9b8573724bc7/defaults.csv"
)

# Define the pipeline
@pipeline(
    description="My first SDK v2 ML pipeline",
    display_name="PipelineExp01-v2",
)
def my_ml_pipeline(raw_data_input: Input):
    # Now, call the component functions as steps in the pipeline
    data_prep_job_instance = data_prep_component(
        raw_data=raw_data_input
    )
    # --- FIX: Set compute for data_prep_job_instance ---
    data_prep_job_instance.compute = "azureml:AML-SDK-D001" # Use the name of your compute cluster

    train_job_instance = train_component(
        processed_data=data_prep_job_instance.outputs.data_output
    )
    # --- FIX: Set compute for train_job_instance ---
    train_job_instance.compute = "azureml:AML-SDK-D001" # Use the name of your compute cluster

    return {
        "trained_model_output": train_job_instance.outputs.sourced_data
    }

# Create the pipeline instance
pipeline_job = my_ml_pipeline(raw_data_input=pipeline_input_data)

# Set the experiment name for the pipeline run
pipeline_job.experiment_name = 'PipelineExp01-v2'

# Submit the pipeline job
print(f"Submitting pipeline job: {pipeline_job.display_name}")
returned_pipeline_job = ml_client.jobs.create_or_update(pipeline_job)

# Wait for completion
print("Waiting for pipeline job to complete...")
ml_client.jobs.stream(returned_pipeline_job.name)
print(f"Pipeline job completed. View in AzureML Studio: {returned_pipeline_job.studio_url}")

Azure Machine Learning

1 answer

  1. Manas Mohanty 17,180 Reputation points Microsoft External Staff Moderator
    2025-07-16T02:42:35.26+00:00

    Hi JustinChow-4155,

    I would like to connect on a Teams call before I reach out to the product group.

    You are using azure-ai-ml==1.28.1, which appears to be the latest version.

    1. Could you roll back to 1.26.4 or 1.27.1 and let us know whether the error persists?
    2. Please enable a system-assigned identity on the compute cluster and grant it the storage blob reader and contributor roles on the storage account.
    3. Please assign yourself the "AzureML Data Scientist" role, since granular roles have to be added even when you hold the Owner role on the resources.
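Before and after the rollback in step 1, it helps to confirm which azure-ai-ml version the active environment actually resolves. A minimal sketch using only the standard library (available in Python 3.8); `installed_version` is a hypothetical helper name:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version found in the current environment, or None if absent.
print(f"azure-ai-ml: {installed_version('azure-ai-ml')}")
```

The rollback itself would be `pip install azure-ai-ml==1.27.1` (or `==1.26.4`) inside the Conda environment, followed by re-running the check.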

    Please share the details in a private message so that we can proceed.

    Relevant thread - https://learn.microsoft.com/en-us/answers/questions/2236796/why-compute-cluster-is-not-found-in-workspace

    Thank you.
