Problem Description:
I am unable to submit any Azure Machine Learning jobs to my workspace. Every submission attempt fails with a ValidationException or MlException stating: "Operation returned an invalid status 'Not found compute with name azureml:<compute-name>'". This blocks my main pipeline job (220_pipeline_job.py), which fails at submission with an 'Unknown compute target' error.
This occurs despite ml_client.compute.list() showing the target compute resources as existing and in a 'Succeeded' state. No jobs are created in the Azure ML Jobs section.
Environment Details:
- Azure Resource Group Name: azuremlRG01
- Azure ML Workspace Name: Azureml-SDK-WS01
- Region: centralus
- Compute Instance Name (where JupyterLab is running): pipeline-cluster
- Compute Cluster Names (existing): AML-SDK-D001, my-cluster-001
- Azure ML SDK Version: azure-ai-ml==1.28.1
- Python Version: 3.8 (within myenv_new_sdk Conda environment)
- Authentication Method: DefaultAzureCredential (using config.json successfully)
Troubleshooting Steps Already Performed:
I have performed extensive troubleshooting to isolate this issue, ruling out common client-side problems:
Initial Pipeline Submission Errors (SDK API Compatibility):
- Initially encountered TypeError: 'CommandJob' object is not callable (caused first by a typo in the environment and then by incorrect use of the @command decorator).
- Resolved by moving the component definitions from the @command decorator to YAML files.
- Encountered and resolved various AttributeError and NameError issues related to load, load_component, and CommandComponent.from_yaml (caused by the API surface of the previously installed SDK version 1.9.0).
- Current component loading method: load_component(source="./component.yml"), which is now successful.
Persistent "Unknown Compute Target" Error:
- After resolving component loading, all job submissions (both pipeline and single command jobs) began failing with "Operation returned an invalid status 'Not found compute with name azureml:<compute-name>'".
- Verified Compute Target Existence (SDK): Executed ml_client.compute.list() which successfully lists all relevant compute targets:
--- Available Compute Targets in this workspace ---
- Name: AML-SDK-D001, Type: amlcompute, State: Succeeded
- Name: my-cluster-001, Type: amlcompute, State: Succeeded
- Name: pipeline-cluster, Type: computeinstance, State: Succeeded
- Verified Compute Target Existence (UI): Confirmed in Azure ML Studio UI that all listed compute clusters and instances exist and are in a 'Succeeded' state. Noted 'Unprovisioned nodes' status for clusters, which is understood as normal idle scaling.
- Attempted Multiple Compute Targets: Tried submitting jobs to AML-SDK-D001, my-cluster-001, and pipeline-cluster. All failed with the exact same "Unknown compute target" error.
- Verified Code Source Upload: The src directory (containing dummy_script.py) is successfully uploaded during the submission attempt, indicating the client-side packaging and initial communication are working.
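The existence check described above can be reproduced with a short helper. This is a minimal sketch, assuming the `name`, `type`, and `provisioning_state` attributes that azure-ai-ml compute entities expose; `describe_computes` is a hypothetical helper name, and it accepts any iterable so that it can be exercised without a workspace connection.

```python
from types import SimpleNamespace


def describe_computes(computes):
    """Return one summary line per compute target (hypothetical helper)."""
    lines = []
    for c in computes:
        # getattr guards against entity types that lack the attribute
        state = getattr(c, "provisioning_state", None)
        lines.append(f"- Name: {c.name}, Type: {c.type}, State: {state}")
    return lines


# Offline stand-ins for the objects returned by ml_client.compute.list()
sample = [
    SimpleNamespace(name="AML-SDK-D001", type="amlcompute", provisioning_state="Succeeded"),
    SimpleNamespace(name="my-cluster-001", type="amlcompute", provisioning_state="Succeeded"),
    SimpleNamespace(name="pipeline-cluster", type="computeinstance", provisioning_state="Succeeded"),
]

print("--- Available Compute Targets in this workspace ---")
for line in describe_computes(sample):
    print(line)
```

Against a live workspace, `describe_computes(ml_client.compute.list())` would replace the `sample` list. Note that the names reported this way are bare names without any `azureml:` prefix.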
Environment and SDK Integrity:
- Performed a complete Conda environment rebuild from scratch (conda env remove, conda env create) to eliminate any potential corruption or dependency conflicts.
- Successfully upgraded azure-ai-ml to 1.28.1 in the new environment.
Azure RBAC Permissions (Crucial Confirmation):
- Confirmed in Azure Portal IAM that the submitting identity (my user account) has the "Owner" role assigned at the Subscription level. This grants full control over all resources, definitively ruling out insufficient permissions as the cause.
Conclusion:
All client-side configuration, SDK versions, environment setup, and RBAC permissions have been verified and appear correct, and the issue persists across multiple compute targets (including a newly created test cluster). This strongly suggests an underlying issue with the Azure Machine Learning backend service in this workspace or subscription: the service appears unable to recognize its own registered compute targets during job submission validation, which produces the 'Unknown compute target' error.
Request:
What could be causing this persistent 'Unknown compute target' error during Azure Machine Learning job submission, even though client-side troubleshooting has been exhausted and the compute targets show as 'Succeeded'? I am seeking guidance from the community and prioritized assistance from a support engineer to diagnose why the Azure ML backend service is not recognizing its own valid compute targets during job validation.
Relevant Code Snippets (from check_single_job.py that produced the error):
import os
from azure.ai.ml import MLClient, command, load_component
from azure.identity import DefaultAzureCredential
from azure.ai.ml.entities import Environment
from pathlib import Path
# Connect to workspace
try:
    credential = DefaultAzureCredential()
    ml_client = MLClient.from_config(credential=credential)
    print(f"Connected to workspace '{ml_client.workspace_name}' in resource group '{ml_client.resource_group_name}' via config.json.")
except Exception as e:
    print(f"Could not connect to workspace: {e}")
    print("Please ensure you have configured your workspace connection properly (e.g., config.json).")
    raise SystemExit(1)
# Define the absolute path to your 'src' directory (assuming it's a sibling to this script)
project_root = Path(__file__).parent
source_code_path = project_root / "src"
# Define a simple environment
test_env = Environment(
    name="basic-test-env",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
    conda_file={
        "name": "basic_env",
        "channels": ["conda-forge"],
        "dependencies": ["python=3.8", "pip"],
    },
)
# Define a very simple command job
test_job = command(
    code=str(source_code_path),
    command="python dummy_script.py",
    environment=test_env,
    # This was changed to my-cluster-001, pipeline-cluster, and a new test cluster; all failed.
    compute="azureml:AML-SDK-D001",
    display_name="TestComputeJob",
    experiment_name="ComputeTest",
)
print(f"Submitting test job to {test_job.compute}...")
try:
    returned_job = ml_client.jobs.create_or_update(test_job)
    print(f"Test job submitted: {returned_job.name}")
    print(f"View in AzureML Studio: {returned_job.studio_url}")
    print("Waiting for test job to complete...")
    ml_client.jobs.stream(returned_job.name)
    # Re-fetch the job: returned_job still holds the status from submission time
    final_job = ml_client.jobs.get(returned_job.name)
    print(f"Test job completed with status: {final_job.status}")
except Exception as e:
    print(f"Failed to submit test job: {e}")
    if hasattr(e, 'errors'):
        print(f"Validation errors: {e.errors}")
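One variable the script above does not isolate is the `azureml:` prefix in the `compute` argument. The service error echoes the prefix back verbatim ('azureml:pipeline-cluster'), whereas ml_client.compute.list() reports bare names. The following is a minimal sketch of a helper for retesting with the bare name; `bare_compute_name` is a hypothetical name, and whether the prefix is actually the cause is an assumption that has not been confirmed here.

```python
def bare_compute_name(compute: str) -> str:
    """Strip a leading 'azureml:' prefix, if present (hypothetical helper)."""
    prefix = "azureml:"
    if compute.startswith(prefix):
        return compute[len(prefix):]
    return compute


# Names taken from the post; the second one is already a bare name
for candidate in ["azureml:AML-SDK-D001", "my-cluster-001"]:
    print(f"{candidate!r} -> {bare_compute_name(candidate)!r}")
```

In check_single_job.py, the retest would pass `compute=bare_compute_name("azureml:AML-SDK-D001")` to `command(...)` and leave everything else unchanged, so that a different outcome can be attributed to the name format alone.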
Full Error Traceback (example, the exact one from the sample run):
Failed to submit test job: (UserError) Unknown compute target 'azureml:pipeline-cluster'.
Code: UserError
Message: Unknown compute target 'azureml:pipeline-cluster'.
Additional Information:
Type: ComponentName
Info: {
    "value": "managementfrontend"
}
Type: Correlation
Info: {
    "value": {
        "operation": "bd6e319f2b978e30f75e45dec7f51639", # This correlation ID will change per attempt
        "request": "97dd836de832c3b0" # This request ID will change per attempt
    }
}
Type: Environment
Info: {
    "value": "centralus"
}
Type: Location
Info: {
    "value": "centralus"
}
Type: Time
Info: {
    "value": "2025-07-13T00:44:29.3925051+00:00" # This timestamp will change per attempt
}
Type: InnerError
Info: {
    "value": {
        "code": "BadArgument",
        "innerError": {
            "code": "UnknownTargetType",
            "innerError": null
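The nested InnerError block in the traceback carries the most specific error codes. As a minimal sketch, assuming the payload has been parsed into a Python dict with the `code` and `innerError` keys visible above, a small helper can flatten the chain for logging or for a support ticket; `inner_error_codes` is a hypothetical name and not part of the SDK.

```python
def inner_error_codes(error):
    """Collect the 'code' values along a nested innerError chain, outermost first."""
    codes = []
    node = error
    while isinstance(node, dict):
        code = node.get("code")
        if code is not None:
            codes.append(code)
        # Descend one level; a None (JSON null) ends the loop
        node = node.get("innerError")
    return codes


# The two levels that are visible in the traceback above
inner = {
    "code": "BadArgument",
    "innerError": {"code": "UnknownTargetType", "innerError": None},
}
print(" -> ".join(inner_error_codes(inner)))  # prints: BadArgument -> UnknownTargetType
```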
Below is the main pipeline script whose failed submission led me to write check_single_job.py as a minimal command-job test against the same compute targets. The two YAML files used for component loading are included as well.
train_model_component.yml (main pipeline job - component YAML definition #1):
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: train_model_component
display_name: 02 Train the Model Component
version: 2 # You can increment this version if you make changes later
type: command
description: Trains the machine learning model
inputs:
  processed_data:
    type: uri_folder
outputs:
  sourced_data:
    type: uri_folder
    mode: rw_mount
code: .
command: "python '220_Training_Pipeline.py' --datafolder ${{inputs.processed_data}}"
environment: azureml://registries/azureml/environments/sklearn-1.5/versions/28
data_preparation_component.yml (main pipeline job - component YAML definition #2):
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: data_preparation_component
display_name: 01 Data Preparation Component
version: 2 # You can increment this version if you make changes later
type: command
description: Prepares raw data for model training
inputs:
  raw_data:
    type: uri_file
outputs:
  data_output:
    type: uri_folder
    mode: rw_mount
code: .
command: "python '220_Dataprep_Pipeline.py' --datafolder ${{outputs.data_output}} --raw_data ${{inputs.raw_data}}"
environment: azureml://registries/azureml/environments/sklearn-1.5/versions/28
220_pipeline_job.py (main pipeline job code that failed job submission and resulted in test using check_single_job.py):
import os
from azure.ai.ml import MLClient, Input, Output, load_component
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Data, Environment # Removed CommandComponent here
from azure.identity import DefaultAzureCredential
# Ensure ml_client is defined and connected
try:
    credential = DefaultAzureCredential()
    ml_client = MLClient.from_config(credential=credential)
    print(f"Connected to workspace '{ml_client.workspace_name}' in resource group '{ml_client.resource_group_name}' via config.json.")
except Exception as e:
    print(f"Could not connect to workspace: {e}")
    print("Please ensure you have configured your workspace connection properly (e.g., config.json).")
    raise SystemExit(1)
# Load components using the top-level 'load_component' function
# Ensure data_preparation_component.yml and train_model_component.yml are in the same directory
data_prep_component = load_component(source="./data_preparation_component.yml")
train_component = load_component(source="./train_model_component.yml")
print("Components loaded from YAML successfully.")
# Define the input data asset for the entire pipeline
pipeline_input_data = Input(
    type=AssetTypes.URI_FILE,
    path="azureml://datastores/workspaceblobstore/paths/LocalUpload/2f2e86f7f9db38ca9c5d9b8573724bc7/defaults.csv",
)
# Define the pipeline
@pipeline(
    description="My first SDK v2 ML pipeline",
    display_name="PipelineExp01-v2",
)
def my_ml_pipeline(raw_data_input: Input):
    # Now, call the component functions as steps in the pipeline
    data_prep_job_instance = data_prep_component(
        raw_data=raw_data_input
    )
    # --- FIX: Set compute for data_prep_job_instance ---
    data_prep_job_instance.compute = "azureml:AML-SDK-D001"  # Use the name of your compute cluster
    train_job_instance = train_component(
        processed_data=data_prep_job_instance.outputs.data_output
    )
    # --- FIX: Set compute for train_job_instance ---
    train_job_instance.compute = "azureml:AML-SDK-D001"  # Use the name of your compute cluster
    return {
        "trained_model_output": train_job_instance.outputs.sourced_data
    }
# Create the pipeline instance
pipeline_job = my_ml_pipeline(raw_data_input=pipeline_input_data)
# Set the experiment name for the pipeline run
pipeline_job.experiment_name = 'PipelineExp01-v2'
# Submit the pipeline job
print(f"Submitting pipeline job: {pipeline_job.display_name}")
returned_pipeline_job = ml_client.jobs.create_or_update(pipeline_job)
# Wait for completion
print("Waiting for pipeline job to complete...")
ml_client.jobs.stream(returned_pipeline_job.name)
print(f"Pipeline job completed. View in AzureML Studio: {returned_pipeline_job.studio_url}")
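The pipeline script above assigns the compute target step by step. The SDK v2 pipeline job also exposes a pipeline-level default through `settings.default_compute`, which gives a single place to change the target while testing. The following is a minimal sketch, not a confirmed fix; `set_default_compute` is a hypothetical helper, and the use of the bare cluster name (without the `azureml:` prefix) is an assumption to be tested.

```python
from types import SimpleNamespace


def set_default_compute(pipeline_job, compute_name):
    """Set the pipeline-level default compute (hypothetical helper)."""
    pipeline_job.settings.default_compute = compute_name
    return pipeline_job


# Offline stand-in exposing only the `settings.default_compute` attribute;
# with the real SDK, pass the object returned by my_ml_pipeline(...).
fake_job = SimpleNamespace(settings=SimpleNamespace(default_compute=None))
set_default_compute(fake_job, "AML-SDK-D001")
print(fake_job.settings.default_compute)
```

In 220_pipeline_job.py this would be called on `pipeline_job` before `ml_client.jobs.create_or_update(pipeline_job)`, with the per-step `compute` assignments removed so that the default applies.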