Azure AI Foundry Project Prompt Flow deployments fail for any new project or Hub
For the past week, Prompt Flow deployments under AI Foundry projects have been failing for any newly created AI Foundry hub or project. Existing older projects and online endpoints work fine, but in any newly created hub or project a Prompt Flow deployment never completes successfully. A deployment may run for 30-40 minutes, fail, retry, and then fail again after another 30 minutes to an hour.
The errors we get are the following:
This one appears after the second failure in the ARM logs:
2025-06-09T20:35:53Z Check envoy cert setting failed in MirSystemSetupTask. Please check the existence/validation of envoy cert.
This one appears after the first failure:
Conflict Status: 409 (Conflict) ErrorCode: Conflict Content: { "error": { "code": "Conflict", "message": "Conflict", "details": [ { "code": "InferencingClientCallFailed", "message": "\"Request could not be completed due to a conflict with the current state of the target resource, Please try again later. Already running method StartCreateDeploymentAsync with operation [a3832ec0-8111-4db0-869f-af67993a6511]. Can not perform StartUpdateDeploymentAsync.\"", "details": [], "additionalInfo": [] } ], "additionalInfo": [ { "type": "ComponentName", "info": { "value": "managementfrontend" } }, { "type": "Correlation", "info": { "value": { "operation": "901d37c8622f4a7b80139d937b9d338f", "request": "7ea73541cd9fa41b" } } }, { "type": "Environment", "info": { "value": "westeurope" } }, { "type": "Location", "info": { "value": "westeurope" } }, { "type": "Time", "info": { "value": "2025-06-11T11:03:32.7120707+00:00" } } ] } } Headers: Cache-Control: no-cache Pragma: no-cache x-ms-operation-identifier: REDACTED Request-Context: REDACTED x-ms-response-type: REDACTED Strict-Transport-Security: REDACTED X-Content-Type-Options: REDACTED azureml-served-by-cluster: REDACTED x-request-time: REDACTED x-ms-throttling-version: REDACTED x-ms-ratelimit-remaining-subscription-resource-requests: REDACTED x-ms-request-id: 8132bebc-2b6b-4840-8245-d342b7247551 x-ms-correlation-request-id: REDACTED x-ms-routing-request-id: REDACTED X-Cache: REDACTED X-MSEdge-Ref: REDACTED Date: Wed, 11 Jun 2025 11:03:32 GMT Content-Length: 1201 Content-Type: application/json; charset=utf-8 Expires: -1
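For reference, this is roughly how we pull the container logs for the failing deployment while it is stuck (a minimal sketch using the az ml CLI; all resource names are placeholders for our own):

az ml online-deployment get-logs \
  --name <deployment-name> \
  --endpoint-name <endpoint-name> \
  --resource-group <resource-group> \
  --workspace-name <project-name> \
  --lines 200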
For existing projects we can successfully create new Prompt Flows and deploy them to new or existing pre-created online endpoints. But even under the same hub, if we create a new project and try a Prompt Flow deployment to a new or existing pre-created online endpoint, we consistently get these errors, which are not descriptive at all.
Were there some underlying changes to the service, or is there some role assignment that should be made but has not been propagated? This has been tested in multiple subscriptions under multiple tenants in various regions (Sweden Central, West Europe and North Europe), and the behavior is exactly the same everywhere.
Azure Machine Learning
-
Manas Mohanty • 5,700 Reputation points • Microsoft External Staff • Moderator
2025-06-11T19:13:25.2366667+00:00 Hi 78509818
I think the two errors above are correlated.
The error below suggests a TLS handshake failure due to certificate validation:
2025-06-09T20:35:53Z Check envoy cert setting failed in MirSystemSetupTask. Please check the existence/validation of envoy cert.
and the error below suggests a previous deployment was stuck and not allowing a new deployment to be created:
"\"Request could not be completed due to a conflict with the current state of the target resource, Please try again later. Already running method StartCreateDeploymentAsync with operation [a3832ec0-8111-4db0-869f-af67993a6511]. Can not perform StartUpdateDeploymentAsync
I am testing in one of the regions below for any regression issue and shall update you as soon as possible:
(Sweden Central, West Europe and North Europe)
May I suggest you test in non-European regions and let us know if that serves as a workaround.
Thank you for bringing it up.
-
Yuri Tieto • 10 Reputation points
2025-06-11T19:19:48.2066667+00:00 Hi,
Thanks for the fast reply.
Regarding that TLS handshake: this is internal to the service and not something we control, so if it fails, it points to an issue with the deployment process itself.
The explanation for the second error is logical.
We use only European regions for compliance reasons and are not allowed to use any non-European ones.
-
Manas Mohanty • 5,700 Reputation points • Microsoft External Staff • Moderator
2025-06-12T01:26:38.7666667+00:00 Hi 78509818
I have not replicated the issue yet. It might be related to a regression in the new Prompt Flow studio version.
The ideal solution is to revert to a previous stable version of Prompt Flow, instead of the latest, in the advanced compute session settings.
https://github.com/microsoft/promptflow/releases
https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/prompt-flow-troubleshoot
Thank you.
-
Ivan Bok • 0 Reputation points • Microsoft Employee
2025-06-12T06:17:50.68+00:00 I want to use an older version, but am I supposed to build my own container? The GitHub promptflow releases don't give any pointers to the base images.
-
Yuri Tieto • 10 Reputation points
2025-06-12T06:31:23.0633333+00:00 Hi Manas Mohanty, the serverless compute works fine and I can run the Prompt Flow. The problems come when I deploy the Prompt Flow. And again, it works for old Foundry projects and doesn't work for new projects. Also, according to your link there have been no new versions of the base image since January 9th (release 1.17.1), while the issues I'm describing are very recent.
-
Yuri Tieto • 10 Reputation points
2025-06-12T07:41:46.8633333+00:00 I checked that the same base image promptflow-runtime:20250518.v1 is used for both successful and unsuccessful deployments, so I don't see a problem with the compute instance itself. The problem should be in the new project setup and permissions propagation, or in the deployment process itself.
-
Manas Mohanty • 5,700 Reputation points • Microsoft External Staff • Moderator
2025-06-12T11:01:37.4266667+00:00 Hi Ivan Bok
You can use older Prompt Flow images in the following format:
mcr.microsoft.com/azureml/promptflow/promptflow-runtime:<version>
All version details can be found here.
https://mcr.microsoft.com/v2/azureml/promptflow/promptflow-runtime/tags/list
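As a minimal sketch (assuming the az ml CLI; the tag and resource names are placeholders you would substitute from the list above), you can register an Azure ML environment that pins one of those image versions and use it for the deployment:

az ml environment create \
  --name promptflow-runtime-pinned \
  --image mcr.microsoft.com/azureml/promptflow/promptflow-runtime:<version> \
  --resource-group <resource-group> \
  --workspace-name <workspace-name>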
Thank you
-
Manas Mohanty • 5,700 Reputation points • Microsoft External Staff • Moderator
2025-06-12T18:39:28.8233333+00:00 Hello 78509818
Could you please provide the details requested in private chat for further assistance.
Thank you.
-
Manas Mohanty • 5,700 Reputation points • Microsoft External Staff • Moderator
2025-06-12T19:54:59.1533333+00:00 Hello 78509818
Thank you for clarifying that you are able to create deployments with serverless compute but not with VMs.
Could you please assign the Azure AI Foundry project's identity the Contributor role on your container registry (ACR) and let me know if that fixes the issue.
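A minimal sketch of that assignment with the Azure CLI (the principal ID and registry resource ID are placeholders for your own values):

az role assignment create \
  --assignee <project-managed-identity-principal-id> \
  --role "Contributor" \
  --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.ContainerRegistry/registries/<acr-name>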
There seem to be two product group tickets with a similar context, where customers are facing issues creating compute sessions with VMs.
I am trying to correlate those with this case.
I have requested more info so I can create a separate ticket for your case.
Thank you.
-
Yuri Tieto • 10 Reputation points
2025-06-12T20:37:16.1133333+00:00 Just to be clear about my earlier comment: we are not able to deploy from either serverless compute or VM compute. We are only able to run the Prompt Flow in the AI Foundry UI to test that it works, but deployment fails regardless of serverless or VM compute usage (we also tried with an image from 2024).
We tried the Contributor permission for the project's identity on ACR earlier, but that didn't help. I'll try again.
Regarding compute problems: we had them too on Tuesday. Many people couldn't create VM computes or run serverless compute; it failed without any meaningful error. Some people experienced the same problems again today, but much less.
-
Yuri Tieto • 10 Reputation points
2025-06-12T20:42:14.9733333+00:00 The status message visible in the Activity Log for the failed deployments is this:
"statusMessage": "{\"status\":\"Failed\",\"error\":{\"code\":\"ResourceOperationFailure\",\"message\":\"The resource operation completed with terminal provisioning state 'Failed'.\",\"details\":[{\"code\":\"EnvoyCertError\",\"message\":\"Check envoy cert setting failed in MirSystemSetupTask. Please check the existence/validation of envoy cert.\",\"details\":[],\"additionalInfo\":[]}]}}"
-
Yuri Tieto • 10 Reputation points
2025-06-13T06:48:47.2166667+00:00 Hi Manas,
I tested twice with Contributor permission for the Project's identity over ACR and it didn't help. Any other ideas?
-
Everton Oliveira • 10 Reputation points
2025-06-14T21:55:43.25+00:00 I am having the same problem, exactly as described by the reporter, and I also get the same error message in the ARM log on my second deployment.
I am deploying a Prompt Flow to an Azure AI Foundry online endpoint using the Azure Developer CLI. The same template worked in two other subscriptions a while ago; however, now I'm facing issues deploying the flow. I'm using a hub-based project with a managed virtual network connected to resources behind private endpoints, e.g. storage account, key vault, and container registry. The deployment of the flow to the online endpoint gets stuck in the 'Creating' state for a long time until it fails.
Suspected cause: the online endpoint cannot reach the required services for some reason, such as Azure Container Registry or Azure OpenAI. The managed identity for the hub/online endpoint has the AcrPull role, and the network configuration appears correct. I can also see that the managed private endpoints from the hub to the services are all active.
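For what it's worth, this is roughly how I check the hub's managed network outbound rules and their state (a sketch with the az ml CLI; names are placeholders):

az ml workspace outbound-rule list \
  --workspace-name <hub-name> \
  --resource-group <resource-group> \
  --output table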
-
Peter Hinterseer • 0 Reputation points
2025-06-16T08:48:21.8+00:00 We have been running into the exact same issue since last Friday, when we deployed our Prompt Flow online endpoint for the first time into our test environment. We are using the azapi Terraform provider to deploy the resources through infrastructure-as-code, and the same code runs just fine in our dev environment, where we have been deploying Prompt Flow for a while already.
The managed identity we use to deploy the endpoint has the exact same roles as in dev, where everything works fine.
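To double-check, we compared the role assignments of the two identities roughly like this (a sketch with the Azure CLI; the principal IDs are placeholders):

az role assignment list --assignee <dev-identity-principal-id> --all --output table
az role assignment list --assignee <test-identity-principal-id> --all --output table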
Like Everton Oliveira in the comment above, we are also using a managed virtual network with outbound rules to connect to private resources, but it seems to fail much earlier anyway.
The ARM error message mentioned above pops up immediately after starting the deployment, so it seems to fail very early in the process and then just takes up to 2 hours to reach a "Failed" state, probably through some kind of internal timeout.
-
Everton Oliveira • 101 Reputation points
2025-06-16T12:30:56.9333333+00:00 Hi, does anyone have any updates on this?
-
Yuri Tieto • 10 Reputation points
2025-06-16T12:44:43.29+00:00 No new updates yet. Microsoft is aware of the issue and, I assume, is working on identifying the problem right now.
-
Manas Mohanty • 5,700 Reputation points • Microsoft External Staff • Moderator
2025-06-17T01:24:03.9233333+00:00 Hi 78509818
Could you add the Azure AI Enterprise Network Connection Approver role to a few of your users and try again.
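A minimal sketch of that assignment with the Azure CLI (the user object ID and scope are placeholders; the scope should cover the dependent resources):

az role assignment create \
  --assignee <user-object-id> \
  --role "Azure AI Enterprise Network Connection Approver" \
  --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>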
Thank you.
-
Ivan Bok • 0 Reputation points • Microsoft Employee
2025-06-17T03:27:48.9066667+00:00 Hi all, for those using the managed network (restricted outbound), have you tried temporarily allowing all inbound on the Hub?
I know this is conceptually strange, since I would expect the backend compute instances to only require the correct outbound rules for the endpoint to deploy correctly. However, I experienced a similar issue recently and noticed that opening inbound access to public fixed things.
Of course, I'm assuming this is a development/testing environment where this is not a security concern; I just want to see whether it's the same issue. Regardless, opening inbound access to public is not tenable in the long run, so I also have a support ticket open at my end to figure out why this is needed.
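For anyone who wants to try the same experiment, this is roughly what I did (a sketch with the az ml CLI against the hub workspace; names are placeholders, and this should only be done in a non-production environment):

az ml workspace update \
  --name <hub-name> \
  --resource-group <resource-group> \
  --public-network-access Enabled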
-
Yuri Tieto • 10 Reputation points
2025-06-17T06:28:39.8166667+00:00 Hi Manas,
To which users should the role be added, and on which resource scope?
We have that role assigned to the hub's managed identity on resources outside the resource group where the hub lives, so that the hub can create managed private endpoints towards external private resources (like AI Foundry services). So if you are referring to this, we already have it in place and have tried it.
Also, in my reported setup the AI Foundry service is in the same resource group as the hub, so it should have all the rights needed to approve private endpoints. Plus, my AI Foundry service is publicly available.
-
Everton Oliveira • 101 Reputation points
2025-06-17T11:52:57.6566667+00:00 I see it stated in the docs that from 30 April 2025 the managed identity of the hub requires the Azure AI Enterprise Network Connection Approver role, as does the user initiating the deployment. The role must be assigned on the target dependent services. I have both the hub and myself assigned to it. I also tested opening the hub's network access, as well as ACR, storage and key vault. None of it helped.
-
Manas Mohanty • 5,700 Reputation points • Microsoft External Staff • Moderator
2025-06-18T17:42:38.9133333+00:00 Hi 78509818 and Everton Oliveira
As per the latest investigation, there has been a backend change to how DNS resolution is done for AI Foundry projects, which is causing the issue when deploying Prompt Flows as endpoints.
Shall keep you posted as we progress.
Thank you.
-
Manas Mohanty • 5,700 Reputation points • Microsoft External Staff • Moderator
2025-06-20T00:15:50.01+00:00 Hi 78509818
I went through a product group ticket with a similar error trace. They pointed out a networking misconfiguration in their case (the customer was doing a public deployment on a protected workspace).
Could you confirm you are selecting the virtual network while deploying the endpoint? A public deployment in a private workspace with public access disabled is not a supported scenario.
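As a quick check (a sketch with the az ml CLI; names are placeholders), you can inspect the endpoint's public_network_access setting, which should be disabled for an endpoint that must stay on the virtual network:

az ml online-endpoint show \
  --name <endpoint-name> \
  --resource-group <resource-group> \
  --workspace-name <project-name> \
  --query public_network_access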
(I have requested JIT access on your support ticket to verify the networking setup of your deployments.)
Thank you.
-
Manas Mohanty • 5,700 Reputation points • Microsoft External Staff • Moderator
2025-06-20T21:33:12.2333333+00:00 Hi 78509818
I verified with the support ticket owner that the problem with the deployment's public access is triggered by bad FQDN resolution.
The product group has acknowledged the same, based on all the trial data collected from Guillaume Fourrat and others, and is debugging further to fix the issue.
Appreciate your patience.
Thank you.
-
Everton Oliveira • 101 Reputation points
2025-06-21T11:33:37.8966667+00:00 Thanks for keeping us updated @Manas Mohanty .
-
Yuri Tieto • 10 Reputation points
2025-06-23T12:23:31.8266667+00:00 Hi Manas Mohanty,
Can you clarify what you mean by "Public deployment in private workspace with public access disabled is not supported scenario."?
Do you mean that if the AI Foundry hub setup is private, you can't have a public online endpoint and deploy a Prompt Flow to it?
-
Manas Mohanty • 5,700 Reputation points • Microsoft External Staff • Moderator
2025-06-23T16:53:09.1133333+00:00 Yes, Yuri Tieto
You are correct: we cannot have a public online endpoint on a private workspace.
Thank you
-
Yuri Tieto • 10 Reputation points
2025-06-23T17:35:54.14+00:00 Is there documentation describing this? We have tested this before: we could create public endpoints in private hubs/projects, deploy Prompt Flows to them, and verify that they work. Can you provide some reference documentation?
-
Everton Oliveira • 10 Reputation points
2025-06-23T17:46:51.7633333+00:00 I don't think this is correct @Manas Mohanty , you can actually have a public online endpoint with a private hub/project. @Yuri Tieto , this is my current setup: private hub/project + public IP restrictions. Reference documentation here: MS Docs
"You can use IP network rules to allow access to your secured hub from specific public internet IP address ranges by creating IP network rules"
Btw, my environments deployed with this setup still work; it's just new deployments that are failing, and it doesn't matter whether the setup is all private or mixed.
-
Manas Mohanty • 5,700 Reputation points • Microsoft External Staff • Moderator
2025-06-23T17:53:01.72+00:00 Hi Yuri Tieto
I am checking with our FTEs here for reference documentation. The quote "Public deployment in private workspace with public access disabled is not supported scenario" is from internal records in the product group.
Thank you.
-
Manas Mohanty • 5,700 Reputation points • Microsoft External Staff • Moderator
2025-06-23T17:54:45.22+00:00 Yes, Everton Oliveira and Yuri Tieto
Whitelisting a few public IPs is definitely possible. I agree with you on the below:
"You can use IP network rules to allow access to your secured hub from specific public internet IP address ranges by creating IP network rules"
Update on the PG ticket: the product group has suggested a hotfix to the product on 2025-06-23 21:30 GMT+5:30.
Shall keep you posted.
Thank you
-
Yuri Tieto • 10 Reputation points
2025-06-24T07:21:26.92+00:00 Hi Manas Mohanty,
At least according to this documentation, you can have both public and private managed online endpoints regardless of whether your hub is public or private.
-
Manas Mohanty • 5,700 Reputation points • Microsoft External Staff • Moderator
2025-06-26T19:35:40.5966667+00:00 Hi Yuri Tieto
The PG has not given a particular ETA on the hotfix/rollback, considering the effort involved in tracing the changes across multiple areas of the AI Foundry hub product.
But I shall sync with them again and update you once they provide one.
Thank you.
-
Manas Mohanty • 5,700 Reputation points • Microsoft External Staff • Moderator
2025-06-30T02:03:40.8833333+00:00 Hi Yuri Tieto
We have not heard any update from the product group on a possible ETA. I have pinged the respective service owner today to get traction on this case.
Thank you.