Managing Redis Connection Loss on Linux App Service

Question

Florian Pierrel 20

We are experiencing undetected connection losses with Azure Cache for Redis once or twice a month on our .NET 6 application hosted with Docker on a Linux App Service. This leads to Redis Timeout Exceptions, and the number of clients connected to the Redis server drops sharply. The problem occurs at the same time on about half the instances. All applications hosted on the same instance experience the same problem at the same time.

The problem resolves itself after 15 minutes. We believe it's due to the TCP parameter net.ipv4.tcp_retries2, as mentioned in the documentation.

Is this the case? Is this frequency normal? Could the problem be Azure Redis?

How can we mitigate or solve this issue? We've tried using ForceReconnect and updating the StackExchange.Redis library to the latest version, but it's not working. Unfortunately, we can't modify the TCP parameter net.ipv4.tcp_retries2 on the Linux App Service plan.

Accepted answer

Answer 1

GeethaThatipatri-MSFT 29,542 Microsoft Employee Moderator

Hi @Florian Pierrel, the undetected connection losses you describe sound consistent with our monthly patching. When we patch the cache, all clients are disconnected but can then immediately reconnect. The linked article explains the reasoning behind that.
As you've found, the default Linux TCP settings are too generous, which causes reconnects to fail for about 15 minutes.
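
For a rough sense of where the ~15 minutes comes from, here is a minimal arithmetic sketch. It assumes the commonly cited Linux values (a 200 ms minimum retransmission timeout that doubles on each retry and is capped at 120 s, with tcp_retries2 = 15). The function name and parameters are hypothetical, and the real timeout depends on the measured round-trip time, so treat the result as an estimate.

```python
def retransmission_window(retries=15, rto_min=0.2, rto_max=120.0):
    """Estimate how long (in seconds) the kernel keeps retransmitting
    an unacknowledged segment before it gives up on the connection.

    Assumed values: rto_min and rto_max are the usual Linux bounds,
    retries corresponds to net.ipv4.tcp_retries2.
    """
    total = 0.0
    rto = rto_min
    # One wait after the original send, plus one wait per retransmission.
    for _ in range(retries + 1):
        total += min(rto, rto_max)
        rto *= 2  # exponential backoff
    return total


if __name__ == "__main__":
    seconds = retransmission_window()
    print(f"tcp_retries2=15 -> about {seconds:.1f} s ({seconds / 60:.1f} min)")
    print(f"tcp_retries2=5  -> about {retransmission_window(retries=5):.1f} s")
```

Under these assumptions the default works out to a little over 15 minutes, which matches the recovery time reported in the question, while a lower retry count would bring detection down to seconds.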

Since it doesn't appear you can set the OS-level TCP parameter net.ipv4.tcp_retries2 in App Service, here is a snippet from the comment linked above that offers a few client-side options you could configure in StackExchange.Redis, such as adjusting keepalive settings and using the ForceReconnect pattern.

As you found, there are TCP settings you can change on the client machine to force it to timeout the connection sooner and allow for reconnect. In addition to tcp_retries2, you can try tuning the keepalive settings as discussed here: lettuce-io/lettuce-core#1428 (comment). It should be safe to reduce these timeouts to more realistic durations machine-wide unless you have systems that actually depend on the unusually long retransmits.
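
To illustrate what these keepalive settings look like at the socket level, here is a language-neutral sketch written in Python. The function name and the numeric values are assumptions chosen for illustration, not recommendations from the thread, and whether a given Redis client exposes these options is not covered here. The Linux-only options are guarded so the sketch also runs on other platforms.

```python
import socket


def enable_fast_failure_detection(sock, idle=30, interval=10, probes=3,
                                  user_timeout_ms=30_000):
    """Apply per-socket keepalive settings (hypothetical helper).

    idle: seconds of silence before the first keepalive probe
    interval: seconds between probes
    probes: unanswered probes before the connection is dropped
    user_timeout_ms: cap on how long sent data may stay unacknowledged
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The options below are platform-specific, so only set those that exist.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    if hasattr(socket, "TCP_USER_TIMEOUT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT,
                        user_timeout_ms)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_fast_failure_detection(s)
    print("keepalive enabled:",
          s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

With the assumed values, an idle dead connection would be noticed after roughly idle + interval × probes = 60 seconds instead of waiting on the retransmission limit.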

An additional approach is using the ForceReconnect pattern recommended in the Azure best practices. If you're seeing issues like this, it's perfectly appropriate to trigger reconnect on RedisTimeoutExceptions in addition to RedisConnectionExceptions. Just don't be too aggressive with it because an overloaded server can also result in persistent RedisTimeoutExceptions. Recreating connections in that situation can cause additional server load and a cascade failure.
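
The "don't be too aggressive" advice can be sketched as a small throttle that decides when a reconnect is allowed. This is a language-neutral illustration in Python, not the code from the Azure sample or from StackExchange.Redis. The class name, method name, and both thresholds (60 s between reconnects, 30 s of sustained errors) are assumptions for illustration.

```python
import time


class ReconnectThrottle:
    """Decide whether an error should trigger a forced reconnect.

    Hypothetical helper: reconnect only when errors have persisted for
    error_window seconds, and never more often than min_interval seconds.
    """

    def __init__(self, min_interval=60.0, error_window=30.0,
                 clock=time.monotonic):
        self.min_interval = min_interval
        self.error_window = error_window
        self.clock = clock
        self.last_reconnect = None
        self.first_error = None
        self.previous_error = None

    def on_error(self):
        """Call on each timeout or connection error; True means reconnect."""
        now = self.clock()
        # Too soon after the previous reconnect: give it time to settle.
        if (self.last_reconnect is not None
                and now - self.last_reconnect < self.min_interval):
            return False
        # No streak yet, or the previous streak went quiet: start a new one.
        if (self.first_error is None
                or now - self.previous_error > self.error_window):
            self.first_error = now
            self.previous_error = now
            return False
        self.previous_error = now
        # Errors are recent but have not yet persisted long enough.
        if now - self.first_error < self.error_window:
            return False
        self.last_reconnect = now
        self.first_error = None
        self.previous_error = None
        return True


if __name__ == "__main__":
    fake_now = [0.0]
    throttle = ReconnectThrottle(clock=lambda: fake_now[0])
    for t in (0, 10, 35, 40):
        fake_now[0] = t
        print(t, throttle.on_error())
```

In a real client, the caller would invoke on_error() from its exception handler for both timeout and connection exceptions, and recreate the connection only when it returns True. An isolated timeout, or a burst right after a reconnect, is ignored, which limits the extra load placed on an already struggling server.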

I hope this information helps.

Regards

Geetha

Florian Pierrel 20 Reputation points

2023-11-13T18:02:49.9233333+00:00

Hi Geetha,

Thanks for this clarification, it reinforces the idea of taking RedisTimeoutException exceptions into account in the ForceReconnect pattern.

Concerning the monthly patching, does it take into account the settings defined on the "Schedule updates" page, or are other elements also patched, which may also lead to connection losses? The mention "You can only schedule maintenance for Redis. This does not cover any maintenance done by Azure for updating the underlying platform." on the page is not very clear.

Regards

Florian
GeethaThatipatri-MSFT 29,542 Reputation points Microsoft Employee Moderator

2023-11-13T20:30:22.6066667+00:00

Hi @Florian Pierrel Yes, the patching events will almost always fall within the windows you have configured in the ‘scheduled updates’ page. There are times when the underlying team we rely on will do updates as well and those may fall outside of those windows, but they aren’t as common.

Regards

Geetha
GeethaThatipatri-MSFT 29,542 Reputation points Microsoft Employee Moderator

2023-11-14T20:30:19.89+00:00

@Florian Pierrel If this answers your question, please consider accepting the solution by selecting Accept as Solution, as it helps the community find answers to similar questions.

Regards

Geetha
Florian Pierrel 20 Reputation points

2023-11-15T14:59:39.6533333+00:00

@GeethaThatipatri-MSFT Thanks for clarifying the use of ForceReconnect and Azure maintenance operations.

Florian

Answer 2

SAMITSARKAR_MSFT 791 Microsoft Employee

Hello Florian,

Welcome to Microsoft Q&A Platform. Thank you for reaching out & hope you are doing well.

I understand that your application, hosted with Docker on a Linux App Service, is getting Redis timeouts.

Can you please share more details about the Redis timeout exception, including the complete stack trace, to help identify the issue?

You can also refer to the article https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-troubleshoot-timeouts for more insights.

Thanks

Florian Pierrel 20

Hello Samit,

Thank you for your reply. Here is an example stack trace:

Timeout awaiting response (outbound=0KiB, inbound=15KiB, 5984ms elapsed, timeout is 5000ms), command=HMGET, next: EXPIRE ecentred119300e-be34-276a-2255-d347328600d2, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 0, last-in: 0, cur-in: 0, sync-ops: 223371, async-ops: 753868, serverEndpoint: xxxx:6380, conn-sec: 8596.01, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: 3a734420707e(SE.Redis-v2.6.122.38350), IOCP: (Busy=0,Free=1000,Min=250,Max=1000), WORKER: (Busy=3,Free=32764,Min=250,Max=32767), POOL: (Threads=19,QueuedItems=0,CompletedItems=9254999,Timers=256), v: 2.6.122.38350 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts) 

StackExchange.Redis.RedisTimeoutException:
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Extensions.Caching.StackExchangeRedis.RedisCache+<GetAndRefreshAsync>d__39.MoveNext (Microsoft.Extensions.Caching.StackExchangeRedis, Version=8.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Extensions.Caching.StackExchangeRedis.RedisCache+<GetAsync>d__27.MoveNext (Microsoft.Extensions.Caching.StackExchangeRedis, Version=8.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.AspNetCore.Session.DistributedSession+<CommitAsync>d__31.MoveNext (Microsoft.AspNetCore.Session, Version=6.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)

Here's the evolution of the number of clients connected to Redis. We can see about half of the instances losing their connection before we manually restarted them:

User's image

We've already looked at this article. Over this period there were no CPU peaks, high network utilization or large cache values.

SAMITSARKAR_MSFT 791 Reputation points Microsoft Employee

2023-11-11T08:49:19.0033333+00:00

Hello Florian,

Thanks for the update.

With the timeout set to 5 s, the call could not be completed within 5 seconds (5984 ms elapsed), which led to the exception. This exception mainly comes from the client library trying to connect to Azure Cache for Redis. If something is wrong on the Redis side, or there is congestion on the application side or a network issue, the client library won't be able to connect to or perform operations on Redis and will throw an exception with the Redis keyword in it, which is not the case here.

According to https://github.com/StackExchange/StackExchange.Redis/issues/1848, this appears to be a known issue, and the ForceReconnect pattern might mitigate it.

Hope this helps.

Thanks.
Florian Pierrel 20 Reputation points

2023-11-12T14:34:39.0066667+00:00

Hello Samit,

This confirms our intuition. We've already tried using the ForceReconnect pattern by activating the switch Microsoft.AspNetCore.Caching.StackExchangeRedis.UseForceReconnect without improving the situation.

I understand that this is a long-standing problem on Linux servers and that the cause is external to our application. Do you have any information about the "normal" frequency of this problem?

This happens once or twice a month, which is not compatible with a production environment, and we're surprised that a problem that has been known for so long still hasn't been corrected. Our applications are very classic, so this must concern a large number of customers.

Thanks
