Intermittent 502 from Application Gateway to ACA backends despite Healthy backend pool, no ACA ingress/app logs


Socotec Admin Xian Zhang

Hello,

We experienced an intermittent outage where Azure Application Gateway returned 502 errors for traffic routed to Azure Container Apps backends.

Setup

Azure Application Gateway in East US 2
Backend targets are Azure Container Apps FQDNs
Public listener hostnames route to the ACA backends through Application Gateway
Backend pools were shown as Healthy during the incident

Impact

From approximately 06:00-09:00 UTC, access through two Application Gateway listener hostnames was unavailable or intermittently returning 502.

The affected backends were ACA FQDN targets similar to:

<container-app-1>.<generated-env-id>.eastus2.azurecontainerapps.io
<container-app-2>.<generated-env-id>.eastus2.azurecontainerapps.io

Observed behavior

Application Gateway returned 502 responses for live traffic.
Backend health continued to show Healthy.
The Azure Container Apps did not receive corresponding ingress/application logs for the failed requests.
This suggests the failed requests did not reach the container apps.
The issue was mitigated by clearing/removing the ACA backend FQDN/name from the backend pool and re-adding it. After that, traffic recovered.

Example sanitized Application Gateway access log pattern

ERRORINFO_NO_ERROR <aca-backend-fqdn>.eastus2.azurecontainerapps.io /example/api/path/ <status-or-port-value> <public-listener-hostname> AGWAccessLogs 0 0.004

ERRORINFO_NO_ERROR <aca-backend-fqdn>.eastus2.azurecontainerapps.io /example/api/path/ <status-or-port-value> <public-listener-hostname> AGWAccessLogs 0 1.332

Some entries also showed:

TLSv1.3 TLS_AES_256_GCM_SHA384

Questions

Can Application Gateway have stale backend DNS resolution or backend connection state for ACA FQDN targets while backend health remains Healthy?
Does removing and re-adding the backend FQDN force Application Gateway to refresh DNS/backend connection state?
Are there known cases where App Gateway health probes succeed but live traffic to Azure Container Apps returns 502 and never reaches ACA ingress/app logs?
What App Gateway backend HTTP setting should be used for Azure Container Apps FQDN backends?
- Should "pick host name from backend target" be enabled?
- Should SNI be enabled?
- Should the probe use the same host behavior as live traffic?
What additional App Gateway logs or ACA logs should be checked to distinguish App Gateway data-plane/DNS issues from ACA ingress issues?

Any guidance on how to debug or prevent this would be appreciated.


Answer 1

Hi @Socotec Admin Xian Zhang,

Welcome to Microsoft Q&A Platform.

This behavior can occur when Azure Application Gateway uses FQDN-based backends and retains cached DNS resolution or existing backend connections, even while health probes continue reporting the backend as healthy.

Can Application Gateway retain stale DNS or connection state for ACA FQDN backends while probes still show healthy?

Yes. Application Gateway caches DNS resolution for backend FQDNs based on the DNS TTL and also maintains pooled TCP/TLS connections to backend IPs.

If the Azure Container Apps (ACA) environment changes backend IPs during scaling events, infrastructure updates, ingress recycling, or platform maintenance, live traffic may continue attempting to use stale backend connections or cached IPs until Application Gateway refreshes them.

In some situations, lightweight health probes may still succeed while live client requests fail with intermittent 502 responses.
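To check whether the backend IPs actually changed around the incident window, the ACA FQDN can be resolved periodically and the results compared. The following is a minimal sketch, assuming it runs on a machine that uses the same DNS servers as the Application Gateway VNet. It only shows what that resolver returns, not what Application Gateway has cached internally. The function names are illustrative.

```python
import socket
import time


def resolve_ips(fqdn, port=443):
    """Return the set of IP addresses the local resolver gives for fqdn."""
    infos = socket.getaddrinfo(fqdn, port, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}


def diff_ips(previous, current):
    """Return (added, removed) IPs between two resolution snapshots."""
    return sorted(current - previous), sorted(previous - current)


def watch(fqdn, interval_seconds=30, rounds=10):
    """Poll DNS and print a line whenever the resolved IP set changes."""
    previous = resolve_ips(fqdn)
    stamp = time.strftime("%H:%M:%SZ", time.gmtime())
    print(f"{stamp} initial {sorted(previous)}")
    for _ in range(rounds):
        time.sleep(interval_seconds)
        current = resolve_ips(fqdn)
        added, removed = diff_ips(previous, current)
        if added or removed:
            stamp = time.strftime("%H:%M:%SZ", time.gmtime())
            print(f"{stamp} changed: added={added} removed={removed}")
        previous = current
```

Calling `watch("<container-app>.<generated-env-id>.eastus2.azurecontainerapps.io")` with your own FQDN prints a timestamped line each time the resolved set changes, which can then be lined up against the times the 502s started.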

Does removing and re-adding the backend force a refresh?

Yes. Removing and re-adding the backend target forces Application Gateway to refresh DNS resolution, rebuild backend connection pools, and establish new backend sessions.

This aligns with your observation that traffic recovered immediately after re-adding the backend.

Known cases where probes pass but live traffic never reaches ACA?

Common scenarios are:

Backend IP changed (scale-up/down, platform upgrade) during the cached DNS window.
TLS handshake errors because the SNI/Host header in live traffic didn’t match what the ACA certificate expects.
HTTP/2 connection reuse glitches on v2 gateways.

Recommended HTTP settings for Azure Container Apps FQDN backends

Pick host name from backend target: Yes. This ensures the Host header equals your <container-app>.<generated-env-id>.eastus2.azurecontainerapps.io domain.
SNI: Enable. Container Apps uses a TLS certificate that’s valid for the generated FQDN, so you need SNI so the correct cert is presented.
Probe host behavior: Use a custom health probe that also “Pick host name from backend target.” Point it at a lightweight endpoint (e.g. /health or /). This makes the probe path and Host header match your real traffic.

What logs/metrics to collect to differentiate DNS/AGW data-plane issues from ACA ingress issues?

On the Application Gateway side:
- Access logs (ensure you’re logging the backend status code, request time, host and port).
- Enable the “502 error origin” diagnostic in the portal (AppGw502StatusCodeAzurePortalInsight) to see if the 502 is truly coming from AGW vs. the backend.
- Metrics: FailedRequests (500–599), UnhealthyHostCount, ConnectionErrors.
On the Container Apps side:
- Ingress logs (Envoy): see if any request reached the mesh.
- App logs / Container stdout.
- Azure Monitor metrics for HTTP 4xx/5xx and any throttling.
DNS angle: if you use custom/private DNS, check your DNS server’s query logs or enable Azure DNS analytics to see if the FQDN is resolving to the expected IP at the time of failure.
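As a rough way to see where the 502s originate, exported access log records can be grouped by the backend they were routed to and the status the backend returned. This is a minimal sketch, assuming the records are already loaded as dictionaries and that the field names `httpStatus`, `serverRouted`, and `serverStatus` match your access log schema. Those names are assumptions to verify against the access log reference, and `summarize_502s` is an illustrative helper.

```python
from collections import Counter


def summarize_502s(records):
    """Count 502 entries by (backend routed to, backend status)."""
    counts = Counter()
    for rec in records:
        # Accept the status as either an int or a string.
        if str(rec.get("httpStatus")) != "502":
            continue
        backend = rec.get("serverRouted") or "<no backend recorded>"
        backend_status = rec.get("serverStatus") or "<no backend response>"
        counts[(backend, str(backend_status))] += 1
    return counts
```

Entries with no backend response point toward Application Gateway failing to connect or complete the request, while entries where the backend itself returned a 5xx point toward ACA.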

Reference links for more troubleshooting:

Troubleshoot bad gateway (502) errors in Application Gateway https://docs.microsoft.com/azure/application-gateway/application-gateway-troubleshooting-502

Backend health troubleshooting in Application Gateway https://docs.microsoft.com/azure/application-gateway/application-gateway-backend-health-troubleshooting

Application Gateway access log reference https://docs.microsoft.com/azure/application-gateway/monitor-application-gateway-reference#access-log-category

High-traffic scaling guide for Application Gateway https://docs.microsoft.com/azure/application-gateway/high-traffic-support

DNS/probe caching behavior notes https://learn.microsoft.com/azure/application-gateway/application-gateway-probe-overview (see DNS TTL section)

Please “up-vote” wherever the information provided helps you; this can be beneficial to other community members.

Answer 2

Yes, Application Gateway issues can occur even when backend health shows Healthy.

Backend health only reflects the result of the configured health probe. If:

The probe path/host/port differs from live traffic, or
The probe is allowed by NSG/UDR/DNS while normal traffic is blocked or misrouted,

then probes can succeed while user requests fail with 502. The documentation explicitly notes that if all instances in the BackendAddressPool are unhealthy, 502s occur, but it also calls out separate causes such as NSG/UDR/DNS problems and misconfigured custom probes that affect real traffic differently from probes.

Removing and re-adding the backend FQDN can effectively reset backend state.

While the documentation does not describe DNS cache behavior directly, removing and re-adding a backend pool member forces Application Gateway to recreate the backend configuration. This can indirectly:

Re-evaluate DNS for FQDN backends.
Rebuild backend HTTP settings/probe associations.

The observed recovery after removing/re-adding the ACA FQDN is consistent with clearing a stale or incorrect backend configuration (for example, routing or DNS resolution issues) even though the health probe still reported Healthy.

Yes, there are known patterns where probes succeed but live traffic returns 502.

The 502 troubleshooting guidance lists several causes where:

Health probes succeed or are configured differently, but
Live traffic fails and returns 502.

Examples from the documentation:

NSG/UDR/custom DNS issues: if NSG/UDR/DNS selectively affects user traffic paths or FQDN resolution, probe traffic may still succeed while user requests fail, resulting in 502 errors. (See the section “Network Security Group, User Defined Route, or Custom DNS issue”.)
Invalid or improper configuration of custom health probes: if the probe host/path/port differ from live traffic, the probe can show Healthy while the actual application path or host header used by clients fails and returns 502. (See the section “Invalid or improper configuration of custom health probes”.)

In such cases, the backend pool can appear Healthy while user traffic never reaches the application (no ingress/app logs) and 502s are returned.

Backend HTTP settings for ACA FQDN backends.

From the available guidance, the key points for backend HTTP settings and probes are:

Health probe behavior:
- Default probe uses <protocol>://127.0.0.1:<port> and only HTTP 200–399 are considered healthy.
- If a different protocol, host name, path, or status code is needed, a custom probe must be configured and associated with the HTTP settings.
- For single-site configurations, the default host for probes is 127.0.0.1 unless overridden. For multi-site, the Host property should match the site host name.
Recommended alignment for ACA:
- Use HTTPS to the ACA FQDN.
- Enable SNI when the backend is using TLS with SNI (typical for multi-tenant FQDN backends).
- Ensure the Host header used by Application Gateway matches the ACA FQDN expected by the ingress.
- Configure the custom probe so that:
  - Protocol, host, and path match what live traffic uses (for example, same FQDN and a health endpoint on the same app), and
  - The probe URL https://<host>:<port><path> returns HTTP 200.
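To confirm that a request with the same host behavior as live traffic succeeds, a single probe-like request can be sent with both SNI and the Host header set to the ACA FQDN. This is a minimal sketch using only the Python standard library; `probe` and `is_healthy_status` are illustrative names, and it should be run from a network location that can reach the ACA endpoint. It approximates the alignment described above and is not how Application Gateway implements its probes.

```python
import http.client
import ssl


def is_healthy_status(status):
    """Default match rule described above: HTTP 200-399 counts as healthy."""
    return 200 <= status <= 399


def probe(fqdn, path="/", port=443, timeout=10):
    """Send one HTTPS GET with SNI and Host both set to fqdn; return the status."""
    # The default context verifies the certificate chain and host name.
    context = ssl.create_default_context()
    conn = http.client.HTTPSConnection(fqdn, port, timeout=timeout, context=context)
    try:
        conn.request("GET", path, headers={"Host": fqdn})
        return conn.getresponse().status
    finally:
        conn.close()
```

For example, `is_healthy_status(probe("<container-app>.<generated-env-id>.eastus2.azurecontainerapps.io", "/health"))` with your own FQDN and health path shows whether the endpoint would satisfy the default status range. A certificate or host name mismatch surfaces as an `ssl.SSLError` instead of a status code.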

Specifically regarding the options:

“Pick host name from backend target”: For FQDN backends, this should be used so that the Host header sent to ACA matches the backend FQDN, which is typically required by ACA ingress.
SNI: For HTTPS to ACA FQDNs, SNI should be enabled so the TLS handshake matches the host name.
Probe host behavior: The probe should use the same host behavior as live traffic (same FQDN in Host header) to avoid the situation where probes succeed against a different host/path than the real application.

Additional logs to distinguish Application Gateway vs ACA issues.

From the Application Gateway side, the documentation highlights:

Backend health view: shows per-backend status and probe details. The Details column provides the reason for unhealthy status and is the first place to check when 502s occur.
502 troubleshooting guidance: focuses on NSG/UDR/DNS, request time-out, empty/unhealthy backend pools, and custom probe misconfiguration. These areas should be validated:
- NSG/UDR/DNS:
  - Check NSGs on the Application Gateway subnet and backend subnet to ensure traffic to ACA is not blocked.
  - Check UDRs to ensure traffic is not routed away from ACA endpoints.
  - Check custom DNS on the VNet to ensure the ACA FQDN resolves correctly.
- Request time-out:
  - Confirm BackendHttpSetting RequestTimeout is appropriate; if ACA responses are slow, increase it.
- BackendAddressPool:
  - Ensure the pool is not empty and that all instances are healthy and have the application deployed.

To separate App Gateway data-plane/DNS issues from ACA ingress issues:

If Application Gateway access logs show 502 with zero bytes sent/received and ACA ingress/app logs show no corresponding requests, focus on:
- NSG/UDR/DNS configuration as described in the 502 troubleshooting article.
- Custom probe configuration vs live traffic (host/path/port alignment).
If ACA ingress logs show the requests but respond with 4xx/5xx, the issue is in the ACA application or ingress configuration rather than Application Gateway.
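The comparison above can be sketched as a timestamp correlation: take the times of the 502 entries from the Application Gateway access logs and look for ACA ingress entries close to each one. This is a minimal sketch, assuming both sets of timestamps have already been parsed into `datetime` objects in the same time zone. The function name and the two-second window are illustrative choices.

```python
from datetime import datetime, timedelta


def unmatched_502s(agw_502_times, aca_request_times, window_seconds=2):
    """Return AGW 502 timestamps with no ACA ingress entry within the window."""
    window = timedelta(seconds=window_seconds)
    return [
        t for t in agw_502_times
        if not any(abs(t - a) <= window for a in aca_request_times)
    ]
```

If most 502 timestamps come back unmatched, the requests likely never reached ACA, which points to the network, DNS, or probe alignment checks. If they match ACA entries, the ACA application or ingress configuration is the better place to look.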

In addition, checking effective NSG and route tables on backend NICs (for VM-based backends) is recommended in the documentation; for ACA, the equivalent is verifying that there are no network policies or DNS configurations in the VNet that would block or misresolve the ACA FQDN from the Application Gateway subnet.
