## Problem
All Microsoft-hosted pipelines in our Azure DevOps organization are completely blocked. No new pipeline run can acquire an agent; runs sit at "Acquiring an agent from the cloud" indefinitely. This has been ongoing for over three hours with no self-recovery.
We have 1 free Microsoft-hosted parallel job with 1,800 minutes/month, of which only 41 minutes are used. Billing and parallelism are not the issue.
## Root Cause
We traced this to orphaned job requests stuck in the agent pool queue. These requests belong to builds that were successfully canceled (`status: completed`, `result: canceled`), but their final stages, which used `condition: always()`, were never terminated by the orchestrator. The job requests for those stages remain in the queue with no result and no way to clear them.
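For context, the trigger is an ordinary trailing stage guarded with `always()`. A minimal sketch of the pattern (stage and job names here are illustrative, not our real pipeline):

```yaml
stages:
- stage: Build
  jobs:
  - job: BuildJob
    steps:
    - script: echo building

# A trailing stage like this is what got orphaned: always() keeps it
# scheduled even after the run is canceled, so a job request stays
# queued for a build that will never execute it.
- stage: Notify
  condition: always()
  jobs:
  - job: NotifyJob
    steps:
    - script: echo notifying
```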
### The deadlock cycle
- Several pipeline runs were canceled via the UI and the REST API
- The builds transitioned to `completed`/`canceled` at the build level
- However, final stages with `condition: always()` remained `state: inProgress` in the timeline; the orchestrator queued agent jobs for them even though the build was canceled
- The only agent in the pool is offline/`Deallocated` (normal for hosted pools, which provision agents on demand)
- The dispatcher assigns orphaned requests to this offline agent with a ~45-minute lease
- The lease expires, and the dispatcher moves on to the next orphaned request
- With multiple orphaned requests cycling, the single parallelism slot is blocked indefinitely
- No new agent VMs are ever provisioned
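To make the starvation mechanism concrete, here is a toy model of the dispatcher behavior we observed. This is a sketch under our assumptions about lease handling (oldest request wins the slot, with rotation between stuck requests), not Microsoft's actual scheduler:

```python
LEASE_MINUTES = 45  # approximate lease we observed before reassignment

def dispatch(queue, horizon_minutes):
    """Toy model of the single-slot hosted dispatcher as we observed it.

    The slot is leased to the oldest queued request (rotating between
    requests). Orphaned requests, whose parent build was canceled, never
    complete and never leave the queue, so each lease is wasted and
    newer, healthy requests starve.
    """
    started, t, last = [], 0, None
    while t < horizon_minutes and queue:
        # avoid re-leasing the request that just timed out, if possible
        candidates = [r for r in queue if r["id"] != last] or queue
        req = min(candidates, key=lambda r: r["queued_at"])
        last = req["id"]
        if req["orphaned"]:
            t += LEASE_MINUTES      # lease expires; request stays queued
        else:
            started.append(req["id"])
            queue.remove(req)       # healthy request runs and leaves
            t += 1
    return started

stuck = [
    {"id": 1, "queued_at": 0,   "orphaned": True},
    {"id": 2, "queued_at": 10,  "orphaned": True},
    {"id": 3, "queued_at": 200, "orphaned": False},  # a brand-new run
]
print(dispatch(stuck, 480))  # -> [] : the new run never starts in 8 hours
```

With no orphans in the queue, the same model starts new runs immediately; the deadlock requires only that the orphaned requests are older than all new work, which they always are.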
### Key observations
- The `resourceusage` API shows `usedCount: 0`, so billing sees no active jobs
- The job dispatcher, however, considers the slot occupied by the orphaned lease
- One request was assigned to the agent 71 minutes after its parent build had already been canceled
- `jobCancelTimeoutInMinutes: 5` on the pipeline definition is not honored for stages waiting for agent provisioning
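This is the check we used to tell orphaned requests apart when inspecting the pool's `jobrequests` payload, as a sketch; the field names (`owner`, `result`, `requestId`) mirror what we saw in the response but should be treated as assumptions:

```python
def find_orphans(job_requests, builds):
    """Flag queued job requests whose parent build already finished.

    job_requests: entries from the pool's jobrequests API response;
                  an unfinished request carries no "result" field.
    builds: maps build id -> (status, result) from the Builds API.
    """
    orphans = []
    for req in job_requests:
        owner_id = req.get("owner", {}).get("id")
        status, result = builds.get(owner_id, (None, None))
        if "result" not in req and status == "completed":
            # The build is done (here: canceled), yet the request
            # still sits in the queue waiting for an agent.
            orphans.append((req.get("requestId"), result))
    return orphans

requests = [
    {"requestId": 101, "owner": {"id": 7}},                       # stuck
    {"requestId": 102, "owner": {"id": 8}, "result": "succeeded"},
]
builds = {7: ("completed", "canceled"), 8: ("completed", "succeeded")}
print(find_orphans(requests, builds))  # -> [(101, 'canceled')]
```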
## What we tried (everything fails)

| Action | Result |
| --- | --- |
| Cancel builds via UI | Build shows canceled, but the stuck stage remains `inProgress` |
| Cancel builds via REST API (`PATCH status=cancelling`) | 200 OK, but stage jobs are not released |
| Force-complete builds via API (`PATCH status=completed`) | 200 OK, but job requests persist |
| Delete builds via UI | Blocked: "has active jobs" |
| Delete builds via API | 403 Forbidden |
| `DELETE` or `PATCH` job requests via API | 405 Method Not Allowed |
| Delete the agent pool via UI | Blocked: active jobs |
| Delete the offline agent via API | 403 Forbidden |
| Disable/re-enable the agent | No effect on the queue |
| Disable/re-enable the pool | No effect on the queue |
| `PATCH` timeline records to force-complete stages | 405 Not Supported |
| `POST` a `JobCompleted` event to the orchestration plan | Requires an agent-scoped token, which admins do not have |
| Switch pipelines to a different hosted pool name | All hosted pools share the same dispatcher and parallelism slot |
| Wait for lease expiry | The lease expires, but the dispatcher just cycles to the next orphaned request |
| Set parallelism to 0 and back to 1 | API returns 405 on pool modification |
There is no administrator-accessible way to clear orphaned job requests. The entire organization's Microsoft-hosted pipelines are dead with no self-service recovery path.
## Bugs identified
- Canceled builds should release all pending job requests immediately. When a build transitions to `completed`/`canceled`, any queued job requests should be terminated. Currently they persist indefinitely.
- The dispatcher should not assign jobs for canceled builds. We observed a job assigned 71 minutes after its parent build was canceled.
- No API exists to cancel orphaned job requests. `DELETE` and `PATCH` on `_apis/distributedtask/pools/{poolId}/jobrequests/{requestId}` both return 405. Organization admins have no way to clear stuck requests without Microsoft intervention.
- Stages with `condition: always()` create an unrecoverable deadlock when canceled. The stage waits for an agent, the agent won't provision because the build is canceled, the job request can't be released because the stage is "in progress", and `jobCancelTimeoutInMinutes` doesn't apply to stages waiting for provisioning.
- No circuit breaker exists. A single bad cancellation can permanently block all Microsoft-hosted pipelines for an entire organization, with no timeout or automatic recovery.
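The cleanup we expected from the first bug can be sketched as follows. This is what we assumed cancellation would do, not the orchestrator's actual code; the field names mirror the `jobrequests` payload and are assumptions:

```python
def release_requests_for_canceled_build(build_id, job_requests):
    """On build cancellation, mark every still-queued request owned by
    that build as canceled, so the dispatcher frees the parallelism slot
    instead of leasing it to a job that can never run."""
    released = []
    for req in job_requests:
        queued = "result" not in req                     # unfinished request
        owned = req.get("owner", {}).get("id") == build_id
        if queued and owned:
            req["result"] = "canceled"                   # terminate, don't leak
            released.append(req["requestId"])
    return released

queue = [
    {"requestId": 201, "owner": {"id": 7}},   # belongs to the canceled build
    {"requestId": 202, "owner": {"id": 9}},   # belongs to another build
]
print(release_requests_for_canceled_build(7, queue))  # -> [201]
```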
## Prevention
We have since changed our pipeline's notification stage from `condition: always()` to `condition: not(canceled())` to prevent this from recurring. However, the current deadlock remains unresolvable without Microsoft support.
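For reference, the adjusted stage looks roughly like this (stage and job names are illustrative):

```yaml
- stage: Notify
  # Before: condition: always()  (runs, and gets queued, even when canceled)
  condition: not(canceled())     # still runs on success or failure; skipped on cancel
  jobs:
  - job: NotifyJob
    steps:
    - script: echo notifying
```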
## Ask
- Immediate: Is there any way for organization admins to clear orphaned job requests from a Microsoft-hosted pool? We have exhausted every API endpoint and UI option.
- Long-term: Please add an API or UI option to force-cancel job requests, and fix the orchestrator to properly clean up jobs when builds are canceled.
## Environment
- Azure DevOps Services (cloud)
- Free tier, 1 Microsoft-hosted parallel job
- Multi-stage YAML pipelines