Same problem with the 4-second timeout. We had the OriginTimeout at 240, changed it to 230, and the problem went away. I wonder if one of the Front Door servers didn't have our settings.
Front Door responds with OriginTimeout after ~4 seconds despite larger timeout setting
Hi!
We have a 90-second timeout configured as the origin timeout setting. However, from time to time Front Door fails just after 4 seconds with an OriginTimeout error, something like the log excerpt below. It looks like these requests don't even reach the origin. What could be causing the timeout on the FD side?
"timeToFirstByte": "3.997",
"timeTaken": "3.997",
"httpStatusCode": "504",
"httpStatusDetails": "504",
"pop": "CHG",
"cacheStatus": "CONFIG_NOCACHE",
"errorInfo": "OriginTimeout",
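For reference, entries like the excerpt above can be flagged automatically when scanning exported access logs. A rough Python sketch (assuming a JSON-lines export with the field names shown above, and this thread's 90-second configured timeout; the function name and `slack` margin are placeholders):

```python
import json

def premature_timeouts(log_lines, configured_timeout=90.0, slack=1.0):
    """Return entries that report OriginTimeout well before the
    configured origin timeout could have elapsed."""
    flagged = []
    for line in log_lines:
        entry = json.loads(line)
        if (entry.get("httpStatusCode") == "504"
                and entry.get("errorInfo") == "OriginTimeout"
                and float(entry.get("timeTaken", "0")) < configured_timeout - slack):
            flagged.append(entry)
    return flagged

sample = '{"timeTaken": "3.997", "httpStatusCode": "504", "errorInfo": "OriginTimeout", "pop": "CHG"}'
print(len(premature_timeouts([sample])))  # → 1
```

A 504 that took the full 90 seconds would not be flagged; only timeouts reported well before the configured window are suspicious.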
Azure Front Door
-
RevelinoB 3,505 Reputation points
2023-07-17T04:09:15.09+00:00 Hi Sergey,
If Azure Front Door is timing out and giving you a 504 error, it means that it's not able to reach the origin server within the configured timeout period. This could happen due to a few reasons:
- Network connectivity issues: The timeout could be caused by network connectivity problems between Azure Front Door and the origin server. This could be due to network congestion, firewall settings, or routing issues. You can check the network connectivity between Front Door and the origin server to ensure there are no connectivity problems.
- High load on the origin server: If the origin server is under high load or experiencing performance issues, it may not be able to respond to requests within the configured timeout period. This can result in Front Door timing out and returning a 504 error. Monitoring the performance and resource utilization of the origin server can help identify if this is the cause.
- Origin server configuration: The origin server may be configured with a timeout period that is shorter than the Front Door timeout setting. In this case, even if Front Door is configured with a 90-second timeout, the origin server may respond with a timeout error before that time. Review the configuration of the origin server and ensure it allows sufficient time for processing requests.
- Firewall or security settings: If the origin server has a firewall or security settings that block or restrict access from Front Door, it could cause timeouts. Ensure that the necessary firewall rules or security settings are in place to allow traffic from Front Door.
- Front Door configuration: Review the Front Door configuration, specifically the timeout settings, to ensure they are correctly set. Check if any custom rules, policies, or routing configurations are affecting the request flow and causing the timeout.
It's recommended to investigate the logs and monitor the network and server performance to identify the underlying cause of the timeouts.
I hope this helps?
-
GitaraniSharma-MSFT 49,581 Reputation points • Microsoft Employee
2023-07-17T10:34:46.1466667+00:00 Hello @Sergey Stoma ,
Welcome to Microsoft Q&A Platform. Thank you for reaching out & hope you are doing well.
I understand that you've configured a 90-second timeout as your origin timeout setting. However, from time to time Azure Front Door fails just after 4 seconds with an HTTP 504 - OriginTimeout error.
I've seen this issue before, but it is mostly a backend issue. It requires a deeper investigation, so if you have a support plan, I request you to file a support ticket; otherwise, please let us know and we will try to help you get one-time free technical support.
In case you need help with a one-time free technical support, I would request you to send an email with subject line "ATTN gishar | Front Door responds with OriginTimeout after ~4 seconds despite larger timeout setting" to AzCommunity[at]Microsoft[dot]com with the following details, I will follow-up with you.
- Reference this Q&A thread
- Your Azure Subscription ID
Note: Do not share any PII data as a public comment.
We will post a summarized answer once the issue is resolved.
Regards,
Gita
-
GitaraniSharma-MSFT 49,581 Reputation points • Microsoft Employee
2023-07-26T10:35:50.4666667+00:00 Hello @Sergey Stoma , Could you please provide an update on this post? Please send us an email as requested above, in case you need help with a one-time free technical support to troubleshoot this issue further.
-
Dylan James 1 Reputation point
2024-03-21T15:59:31.6833333+00:00 Hello @Sergey Stoma you are describing almost exactly the same scenario I am seeing. Did you resolve this somehow? If so could you share the fix?
Thanks!
-
Sergey Stoma 5 Reputation points
2024-03-21T16:43:37.4233333+00:00 Unfortunately, we haven't found a fix either. It "kinda" went away after a while, or at least it is now below the noise floor, so it is not as much of a concern for now.
A support ticket back then didn't yield much either, essentially only a suggestion that it could have been a bad POP.
-
Dylan James 1 Reputation point
2024-03-21T16:59:46.9966667+00:00 Thanks a ton for the reply. That is unfortunate to hear. I'm going to try going the support ticket route and see if anything comes from it.
-
John Patounas 5 Reputation points
2024-03-22T11:24:40.3033333+00:00 We are facing exactly the same issue.
Randomly, page elements time out within about 4 seconds while the FD timeout is set to the maximum of 240 seconds. In the FD audit logs it shows up as either 503 or 504 status codes.
We tried removing compression and adding a rule to ignore Accept-Encoding, but the issue has persisted for more than a month now.
We opened a ticket and no solution has been found yet. PLEASE, ANYONE, WE NEED ASSISTANCE ON THIS!!!
-
Sergey Stoma 5 Reputation points
2024-03-22T18:07:55.8+00:00 Pulled down the log from today and there was a single instance of timeout with the same symptoms - FD reports timeout, 504, and of course the request never made it to the origin. So this is still happening, though less frequently. At least it failed faster than 4 seconds :D
"timeToFirstByte": "2.832",
"timeTaken": "2.832",
"httpStatusCode": "504",
"httpStatusDetails": "504",
"pop": "MNZ",
"cacheStatus": "N/A",
"errorInfo": "OriginTimeout",
-
John Patounas 5 Reputation points
2024-03-22T21:15:39.1533333+00:00 It seems the issue affects items that are in the cache.
I tested the same endpoint on 2 domains, one with caching on FD and the other without. On the one where I had caching enabled I got the 504 timeouts in the audit log, while on the other domain I did not.
I also noticed that every time a 504 happens in the log there are 2 entries at the same time: one with sni_s set to my hostname and another with sni_s == "originshield|parentcache|https|tier2"
Still have ticket with Microsoft and hope to have a solution on this. Will keep you posted.
-
John Patounas 5 Reputation points
2024-03-23T15:31:16.51+00:00 New update. Since last night I have tried different settings on FD. The issue happens only when caching is enabled. I disabled caching, purged the cache on the domain, and it stopped happening. I will continue with the Microsoft ticket since caching is definitely needed. Once I have any new updates I will comment.
-
John Patounas 5 Reputation points
2024-03-27T16:26:25.9766667+00:00 New Update.
My setup tested:
Azure Front Door --> Fortigate (policies with virtual IPs) --> VM on Azure (IIS)
Setup 1) FD caching enabled --- 504 ERRORS
Setup 2) FD caching disabled --- NO 504 ERRORS
New setup tested, bypassing the Fortigate firewall:
Azure Front Door --> VM on Azure (IIS)
Setup 1) FD caching enabled --- NO 504 ERRORS
Setup 2) FD caching disabled --- NO 504 ERRORS
-
Sergey Stoma 5 Reputation points
2024-03-27T17:08:15.52+00:00 Interesting! In our case, we didn't have a firewall, it was FD -> LB pool -> VM
-
J K 1 Reputation point
2024-06-25T01:28:34.3666667+00:00 We have had the same 4-second timeout problem with Front Door. Caching is disabled. No firewall between FD and the origins.
The problem occurs across multiple origin types: App Service, Linux VMSS + public LB, Windows VMSS + private link LB, Cloud Services (Extended Support).
We have a pending ticket with support since June 10.
-
Joshua Waddell 10 Reputation points • Microsoft Employee
2024-06-26T00:29:44.9133333+00:00 @GitaraniSharma-MSFT I'm working with the customer who made the last response to this thread. Is this something we are seeing with more customers broadly? Thanks in advance for your time.
-
GitaraniSharma-MSFT 49,581 Reputation points • Microsoft Employee
2024-06-28T12:32:34.3833333+00:00 Hello @Joshua Waddell , yes, we are seeing this issue with more customers.
It looks like an upgrade in a few of the backend environments caused this issue, and a support request is required to fix it from the backend.
Regards,
Gita
-
Sergey Stoma 5 Reputation points
2024-06-29T01:07:03.0066667+00:00 @GitaraniSharma-MSFT If we reach out to the support about this issue, is there an existing KB #/ticket #/service bulletin to refer to, so we don't have to reexplain the entire thing again?
-
Asjana , Yusef 0 Reputation points
2024-07-02T22:54:52.43+00:00 @GitaraniSharma-MSFT We are having the same issue too where the request is timing out at 4 seconds. Is there a solution to this?
-
GitaraniSharma-MSFT 49,581 Reputation points • Microsoft Employee
2024-07-03T12:50:36.31+00:00 Hello @Sergey Stoma , there is an internal incident (not shareable publicly).
@Asjana , Yusef , for the recent issue/bug which started in the last week of April, the Product Group team shared the RCA below:
Incident Summary
Starting April 24th at 8:50 UTC, several customers started reporting a dip in their backend health status. The problem continued till May 10th as we investigated the root cause.
Upon investigation it was observed that health probes from AFD to backends were facing intermittent connection failures, which started showing up as a dip in the backend health metric. Even though there were health probe failures, there was no impact on the regular traffic of customers.
Root Cause
Starting from the last week of April, a scheduled OS upgrade was being rolled out to several AFD environments. Due to a bug in the newer OS version, a system configuration parameter was not set to an appropriate high value.
This parameter is usually set to an appropriate high value in anticipation of the higher load the health probe service (in available machines) will face when a subset of AFD machines is temporarily taken offline during OS upgrades.
But because of the new default lower value set by the bug, the health probe service started hitting the limits when machines were being taken offline for the upgrade. This caused intermittent connection failures in the health probes among a subset of customer origins.
Mitigation
To mitigate the problem, we overrode the lower default value introduced by the bug and set the appropriate high limit needed for the health probe service. We will roll out a permanent fix for the bug by the end of June.
Next steps and repair items
We deeply apologize for this incident and for any inconvenience it has caused. In our continuous efforts to improve platform reliability, we will be performing the repairs below:
- Identify system-level settings that had lower default values in the new OS version and configure more appropriate values for them.
- Add metrics to track connections usage in health probe service and monitoring based on it.
- Roll out the fix for the bug in the latest OS version. [rollout completion end of June]
- Add monitoring to detect changes in configured limits in subsequent OS updates [end of June]
The above fix was supposed to be rolled out by end of June, but I will check with the Product Group team for any changes in the ETA.
Regards,
Gita
-
Asjana , Yusef 0 Reputation points
2024-07-03T16:22:22.09+00:00 Hi @GitaraniSharma-MSFT, thanks for the update. In our case the health probes are disabled for this backend pool and there is only one backend set up for it.
-
J K 1 Reputation point
2024-07-04T04:08:36.4966667+00:00 Hi @GitaraniSharma-MSFT we are still experiencing this 4 second timeout issue across multiple origins and front door instances. This problem occurs on origin groups where Health Probes are disabled.
On July 3rd over 24 hours, 0.09% of requests to front door failed with a 504 response at 4 seconds.
On July 2nd over 24 hours, 0.1% of requests to front door failed with a 504 response at 4 seconds.
We obtain these percentages by querying the Front Door logs and comparing the total # of requests to the # of requests where
('properties.httpStatusCode'=="504" AND tonumber('properties.timeToFirstByte') < 4.1)
(this is Splunk query syntax). Is it possible for the Front Door Product Group to analyze this behavior across customers to determine how widespread the problem is, and to understand whether their changes cause issues?
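For anyone without Splunk, the same percentage can be computed from a JSON-lines export of the logs. A rough Python equivalent of the query above (the field names are assumed to match the `properties.*` fields referenced in it):

```python
import json

def premature_504_rate(log_lines, threshold=4.1):
    """Fraction of requests that returned 504 with timeToFirstByte
    below `threshold` seconds (mirrors the Splunk query above)."""
    total = failed = 0
    for line in log_lines:
        props = json.loads(line).get("properties", {})
        total += 1
        if (props.get("httpStatusCode") == "504"
                and float(props.get("timeToFirstByte", "0")) < threshold):
            failed += 1
    return failed / total if total else 0.0
```

On a day like July 2nd above, this would come out to roughly 0.001, i.e. 0.1% of requests.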
Also from the RCA:
Even though there were health probe failures, there was no impact on the regular traffic of customers.
This is not correct. Our customers are seeing 504 gateway timeout issues in their web browsers and our Pingdom checks are going down frequently.
-
GitaraniSharma-MSFT 49,581 Reputation points • Microsoft Employee
2024-07-04T13:21:49.23+00:00 Hello @Asjana , Yusef and @J K ,
The issue that began in the last week of April (for which I provided an RCA above) has been confirmed fixed by the Product Group team. All Azure Front Door edge servers and environments have been updated and the fix rollout is complete.
As per the Azure Front Door health probe document,
If you have a single origin in your origin group, you can choose to disable health probes to reduce the load on your application. NOTE: If there is only a single origin in your origin group, the single origin will get very few health probes. This may lead to a dip in origin health metrics, but your traffic will not be impacted.
However, if you are experiencing an actual impact on your traffic, further investigation is required, including the collection of backend logs. Therefore, I recommend opening a support request.
Regards,
Gita
-
Yusef Asjana 0 Reputation points
2024-07-08T18:19:48.2633333+00:00 Hi @GitaraniSharma-MSFT, thanks for the update. We have a support ticket opened with MS, and just like @J K mentioned, we also have our health probes disabled for this origin but continue to see the issue. This doesn't seem like an isolated issue given that other customers are seeing this behavior. Could you please escalate this issue to the Product Group team? Thanks!
-
J K 1 Reputation point
2024-07-09T04:31:11.76+00:00 Hi @GitaraniSharma-MSFT appreciate the update.
Similar to @Yusef Asjana the issue is still ongoing for my company.
We have a support ticket open since June 10. Front Door Product Group said that this fix would resolve the issue but the issue is still ongoing.
Unfortunately we aren't getting any traction in our support request to investigate the root cause of the issue. I don't understand how a proxy can return a 504 timeout error after just 4 seconds. Any clues as to the root cause would be helpful, in case there's something we can try on our end.
We do not have health probes enabled on origins.
This issue is occurring for multiple origins, examples:
- Apache on Azure VMSS (Linux)
- Varnish container on Azure Kubernetes Service
- IIS on Azure VMSS (Windows)
This issue is occurring across multiple POPs, example of the 4 second timeout requests logged in the past 3 hours:
Top 10 values (count, %):
DFW 4 (18.182%), AKL 3 (13.636%), YVR 3 (13.636%), AM 2 (9.091%), FRA 2 (9.091%), SIN 2 (9.091%), ATL 1 (4.545%), BN 1 (4.545%), CH 1 (4.545%), DM 1 (4.545%)
-
GitaraniSharma-MSFT 49,581 Reputation points • Microsoft Employee
2024-07-09T13:09:33.1133333+00:00 Hello @Yusef Asjana and @J K ,
I've reached out to the Azure Front Door Product Group team for any latest updates on this issue. Will keep you posted.
In the meantime, could you please let me know which SKU of Azure Front Door you are using? Is it Classic or Standard/Premium?
Also, @J K , could you please share the support request number, if possible?
Regards,
Gita
-
Yusef Asjana 0 Reputation points
2024-07-09T16:10:11.4566667+00:00 Hi @GitaraniSharma-MSFT , we are using the Classic Front Door.
-
J K 1 Reputation point
2024-07-09T23:23:38.0533333+00:00 We are using Azure Front Door Premium.
Our support request number is TrackingID#2406080030000792
-
GitaraniSharma-MSFT 49,581 Reputation points • Microsoft Employee
2024-07-12T12:35:11.64+00:00 Hello @Yusef Asjana and @J K ,
Thank you for the details.
I've reached out to the Azure Front Door Product Group team regarding this issue and awaiting their updates. Will keep you posted.
Meanwhile, could you please check the KeepAliveTimeout configuration on your origins and the Origin response timeout configuration in your Azure Front Door (the ability to configure the Origin response timeout is only available in Azure Front Door Standard/Premium)?
Also, check if the client is sending byte-range requests with Accept-Encoding headers.
Regards,
Gita
-
Søren Jensen 1 Reputation point
2024-09-25T11:42:18.1933333+00:00 We have seen the same issue since this morning: timeTaken of ~4 seconds and timeToFirstByte of ~4 seconds.
-
Jack Bond 0 Reputation points
2024-10-13T20:04:21.07+00:00 Hello @GitaraniSharma-MSFT
We are also experiencing this issue for roughly 1% of customer traffic.
The exact same issues mentioned in this thread, with OriginTimeout error codes and roughly 4-second timeouts.
We've changed the origin timeout within Front Door to 230 seconds, which hasn't brought any improvement. We've also applied the suggested fix of removing the Accept-Encoding header, and nothing has changed.
Have there been any further developments since the last update on what is causing this, and are there any potential fixes?
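To quantify a failure rate like the ~1% mentioned above from the client side, a small sampling loop is enough. A rough sketch (the attempt count and the 4-second threshold are placeholders to adjust; pass your own URL):

```python
import time
import urllib.error
import urllib.request

def sample_endpoint(url, attempts=50, slow_threshold=4.0):
    """Hit `url` repeatedly; count request errors and responses whose
    first byte took at least `slow_threshold` seconds."""
    errors = slow = 0
    for _ in range(attempts):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                resp.read(1)  # force at least the first byte
        except urllib.error.URLError:
            errors += 1
        if time.monotonic() - start >= slow_threshold:
            slow += 1
    return errors, slow, attempts
```

Running it against both the Front Door hostname and the origin directly can help localize the problem: if only the Front Door numbers show the ~4-second failures, that points away from the origin.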
-
Jack Bond 0 Reputation points
2024-10-13T20:14:03.0033333+00:00 I can confirm we are also seeing this issue.
Seems to be affecting roughly 1% of traffic for our customers, which is extremely problematic. @GitaraniSharma-MSFT, is there any update on this?
1 answer
-
Shawn Myers 0 Reputation points
2024-09-09T16:40:23.3+00:00