Intermittent "Etag conflict" Error When Submitting Jobs via Azure ML Batch Endpoint

Vu Le 20 Reputation points
2024-10-21T07:38:35.9633333+00:00

I am using an Azure Data Factory ForEach loop to submit jobs to Azure ML through a managed batch endpoint, and the calls intermittently fail with the error below. It looks like a rate limiting issue, but I would like confirmation.

Here is the response from the Web activity:

{
  "error": {
    "code": "TransientError",
    "severity": null,
    "message": "Etag conflict on <masking> with etag \"6000764f-0000-1100-0000-6714e3750000\".",
    "messageFormat": null,
    "messageParameters": null,
    "referenceCode": null,
    "detailsUri": null,
    "target": null,
    "details": [],
    "innerError": null,
    "debugInfo": null,
    "additionalInfo": null
  },
  "correlation": {
    "operation": "526a07dd2b1f2e930741fec52db26d91",
    "request": "d1396966dbb97bce"
  },
  "environment": "uksouth",
  "location": "uksouth",
  "time": "2024-10-20T11:03:17.6615092+00:00",
  "componentName": "managementfrontend",
  "statusCode": 409
}
Azure Machine Learning

2 answers

  1. romungi-MSFT 49,091 Reputation points Microsoft Employee Moderator
    2024-10-21T08:34:26.9+00:00

    @Vu Le It seems the 409 status code indicates a conflict: when an operation is already in progress, any new operation on the same online endpoint returns a 409 conflict error. For example, if a create or update operation on an online endpoint is in progress, triggering a delete operation throws an error. This error code is documented for managed online endpoints of Azure ML.

    In your case, wait until the previous operation completes, then check whether the error still occurs on subsequent calls. Thanks!
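
A minimal sketch of how a caller might recognise this response before deciding to wait and retry. It assumes the payload has the shape shown in the question; `is_retryable_conflict` and the substring check on the message are illustrative choices, not part of any Azure SDK.

```python
def is_retryable_conflict(payload: dict) -> bool:
    """Return True when the payload looks like the transient ETag conflict
    from the question (HTTP 409 with error code "TransientError")."""
    error = payload.get("error") or {}
    message = error.get("message") or ""
    return (
        payload.get("statusCode") == 409
        and error.get("code") == "TransientError"
        and "etag conflict" in message.lower()
    )


# Trimmed copy of the response shown in the question.
sample = {
    "error": {
        "code": "TransientError",
        "message": 'Etag conflict on <masking> with etag "6000764f-0000-1100-0000-6714e3750000".',
    },
    "statusCode": 409,
}

# A 409 with a different error code (hypothetical) should not be retried blindly.
other = {"error": {"code": "UserError", "message": None}, "statusCode": 409}

print(is_retryable_conflict(sample))  # True
print(is_retryable_conflict(other))   # False
```

Checking the error code as well as the status code keeps the retry narrow, so a 409 caused by something other than a transient conflict is still surfaced to the caller.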

    If this answers your question, please click Accept Answer and select Yes for "Was this answer helpful". If you have any further questions, let us know.


  2. Girard Manon 0 Reputation points
    2025-03-14T11:36:20.3033333+00:00

    We have the same problem. To mitigate this type of error, we wrap the invocation in a try/except block and retry it up to a configurable number of attempts.

        import os
        import sys
        import time

        from azure.ai.ml import Input
        from azure.core.exceptions import ResourceExistsError

        # Guard against TransientError-type errors during invocation.
        max_try = int(os.getenv("MAX_TRY_INVOKE", "3"))
    
        for nb_try in range(max_try):
            try:
                batch_endpoint_pipeline_job = ml_client.batch_endpoints.invoke(
                    endpoint_name=endpoint.name,
                    experiment_name=experiment_name,
                    inputs={
                        "var": Input(type="string", default=f"{args.var}"),
                    },
                )
                ml_client.jobs.get(batch_endpoint_pipeline_job.name)
                ml_client.jobs.stream(name=batch_endpoint_pipeline_job.name)
                pipeline_job = ml_client.jobs.get(batch_endpoint_pipeline_job.name)
                mlflow_run_id = pipeline_job.name
    
                # ---------------------------------------------------------------
                # Record the pipeline state using the MLflow API
                # ---------------------------------------------------------------
                if pipeline_job.status == "Completed":
                    MLOpsStateHelper.log_state_info_pipeline(
                        pipeline_friendly_name_for_state, mlflow_run_id, args.mlops_context
                    )
                    sys.exit(0)
    
                elif pipeline_job.status == "Canceled" or pipeline_job.status == "CancelRequested":
                    message = "The pipeline_job was canceled"
                    mlops_context_step.statut = "err"
                    properties = {"custom_dimensions": mlops_context_step.to_dict()}  # pylint: disable=E1101
                    logger.info(message, extra=properties)
                    sys.exit(1)
    
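            except ResourceExistsError as exc:
                # Transient 409 conflict: log the attempt, wait, then retry.
                message = f"Attempt {nb_try + 1} of {max_try} to invoke the endpoint failed. Caused by {exc}"
                logger.warning(message)
                time.sleep(2)
    
            except Exception as exc:
                # Any other error is not retried; log it before exiting.
                logger.error(f"Unexpected error while invoking the endpoint: {exc}")
                sys.exit(1)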
    
        message = f"Unable to invoke the endpoint after {max_try} attempts."
        mlops_context_step.statut = "err"
        properties = {"custom_dimensions": mlops_context_step.to_dict()}  # pylint: disable=E1101
        logger.info(message, extra=properties)
        sys.exit(1)
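
The snippet above waits a fixed two seconds between attempts. When several submissions run in parallel (as in a ForEach loop), retries with a fixed delay can collide again, so one possible variation is exponential backoff with random jitter. The following is a minimal sketch under stated assumptions, not the poster's implementation: `retry_with_backoff`, its parameters, and `ConflictError` are hypothetical names, and `ConflictError` only stands in for the SDK exception so the example runs without Azure dependencies.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class ConflictError(Exception):
    """Hypothetical stand-in for the exception raised on an HTTP 409."""


def retry_with_backoff(
    operation: Callable[[], T],
    retry_on: Tuple[Type[Exception], ...] = (ConflictError,),
    max_attempts: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on `retry_on` with exponential backoff.

    The wait before retry n is drawn uniformly from
    [0, min(max_delay, base_delay * 2**(n-1))] ("full jitter").
    The last failure is re-raised once `max_attempts` is reached.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")


# Demo: an operation that fails twice with a conflict, then succeeds.
calls = {"count": 0}


def flaky_invoke() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConflictError("Etag conflict")
    return "job-submitted"


delays = []  # record the delays instead of actually sleeping
result = retry_with_backoff(flaky_invoke, sleep=delays.append)
print(result, calls["count"], len(delays))  # job-submitted 3 2
```

In the setting of this thread, `operation` would presumably be a lambda wrapping `ml_client.batch_endpoints.invoke(...)` and `retry_on` would be `(ResourceExistsError,)`, as in the snippet above.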
    
    
