Hello.
I've tried to raise a support request, and it won't allow me.
We are using Durable Functions in a Function App with Hybrid Connections. The Function App is Windows-based.
Some context: we have seen a problem with Durable Functions where memory usage keeps climbing until our application grinds to a halt. After some research, we found that other people have also reported a memory leak with Durable Functions.
To work around this, we built a Logic App that restarts our Function App every day at 5 AM.
We have noticed recently that, after the automatic restart happens, we get thousands of errors in Application Insights saying that the Function App can't connect to the API endpoints it reaches via Hybrid Connections.
We then restart the Function App manually through the UI (by pressing the "Restart" button), and then all the errors stop and everything starts to work again.
No code changes. We just press the "Restart" button.
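To help narrow down when connectivity breaks after the scheduled restart, a small probe along these lines could be run from inside the Function App. This is only a minimal sketch, assuming the Hybrid Connection endpoint looks like an ordinary TCP host and port to the app; the host name `onprem-api.internal`, the port, and the function names are hypothetical and not taken from our actual setup.

```python
import socket
import time
from typing import Callable, List


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def probe_after_restart(
    host: str,
    port: int,
    attempts: int = 5,
    delay_seconds: float = 10.0,
    check: Callable[[str, int], bool] = can_connect,
    sleep: Callable[[float], None] = time.sleep,
) -> List[bool]:
    """Probe the endpoint several times and return one result per attempt."""
    results = []
    for attempt in range(attempts):
        results.append(check(host, port))
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return results


if __name__ == "__main__":
    # Hypothetical endpoint name and port; replace with the real
    # Hybrid Connection host and port configured on the Function App.
    outcome = probe_after_restart("onprem-api.internal", 443, attempts=3)
    print("probe results:", outcome)
```

If every attempt fails after the 5 AM restart but succeeds after a manual restart, that would support the idea that the problem sits between the Function App and the Relay rather than in our code.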
Image one:
1 - Function App Audit Log.png
This image shows the audited automatic restart of the Function App at 5 AM. Above it, you can see us restarting it manually at 09:15 AM on the 14th of Jan 26 (after we came into the office in the morning and saw that everything was erroring and had ground to a halt).
Image two:
2 - Application Insights for Function App.png
Image two shows the errors starting (in Application Insights) at the same time the Function App was automatically restarted. The errors then stopped when we restarted it manually at 09:15 AM.
Image three:
3 - Azure Relay (Hybrid Connection) Graph and Metrics.png
Image three shows the correlation between restarting the Function App and the Hybrid Connections "Sender" and "Throughput" metrics dropping right down. However, you'll see that the Hybrid Connections themselves do not go "Offline".
This can be seen in the last (green) metric, "Active Listeners".
So, in summary, my theory is that this is an internal network routing problem within Azure, between the Function App and the Azure Relay.
This is not a code problem.
This has been happening for the last couple of months, and the event on the 14th was the biggest one we've had so far. (So, time to raise a ticket.)
We've had customers complain and threaten to leave because they're relying on our system for Evacuation and Roll Call data, so this stuff needs to work!
Please can someone from Microsoft escalate this to the network team to investigate and explain what is going on?
Regards,
Callum Woodward.
Ops Manager
Thinking Software.