MS Graph API: High Frequency of Unique Error Responses During User and Chat Operations
Hello. Over the past few months I have used the MS Graph API extensively, including fetching users, managing change notifications for various resources, fetching chats, events, and online meetings, adding and removing applications in chats, and more.
The following list describes the actions I perform on a daily basis:
fetch users
subscribe/unsubscribe/renew/get subscriptions of user chat messages
subscribe/unsubscribe/renew/get subscriptions of user chats
subscribe/unsubscribe/renew/get subscriptions of user online meetings
subscribe/unsubscribe/renew/get subscriptions of user calendar
install/remove Teams bot application from chats
get chats/events/online meetings by id
I use the most recent versions of the ms-graph-sdk and ms-graph-sdk-beta Python packages to perform these actions, mostly the non-beta version.
Over the past months I have experienced a set of odd errors that I have no way to handle other than retrying the same request and hoping the error does not recur.
These errors occur in my local environment (and my colleagues') and also on Kubernetes, which suggests the cause is not environment related.
Overall my system works, but only with the help of retries, and that is becoming very difficult to manage.
The status codes range from 500 to 504, indicating server-side issues beyond my application's control.
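For illustration, a retry wrapper of the kind described here could look like the sketch below. It is not the exact code in use: `TransientGraphError`, `backoff_delay`, and `call_with_retries` are hypothetical names, the exception class is a stand-in for whatever error the SDK raises, and leaving 501 out of the retryable set is an assumption.

```python
import random
import time

# Status codes from the reported 500-504 range that are treated as retryable.
# 501 (Not Implemented) is left out on the assumption that retrying cannot help.
TRANSIENT_STATUS_CODES = {500, 502, 503, 504}


class TransientGraphError(Exception):
    """Hypothetical stand-in for an SDK error that carries an HTTP status code."""

    def __init__(self, status_code, message=""):
        super().__init__(message)
        self.status_code = status_code


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Full-jitter exponential backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def call_with_retries(operation, max_attempts=5, sleep=time.sleep):
    """Call operation(); retry on transient status codes and re-raise anything else."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientGraphError as exc:
            last_attempt = attempt == max_attempts - 1
            if exc.status_code not in TRANSIENT_STATUS_CODES or last_attempt:
                raise
            # Wait before the next attempt; the delay grows with each failure.
            sleep(backoff_delay(attempt))
```

The `sleep` parameter is injectable so the wrapper can be exercised in tests without real waiting. Capping the delay and the number of attempts keeps a long outage from stalling the caller indefinitely.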
The following list describes the errors I experience (I have omitted some data that may be sensitive):
error=MainError(additional_data={}, code='ExtensionError', details=None, inner_error=InnerError(additional_data={}, client_request_id='', date=, odata_type=None, request_id=''), message='Operation: Update; Exception: [A task was canceled.]', target=None))
error=MainError(additional_data={}, code='ExtensionError', details=None, inner_error=InnerError(additional_data={}, client_request_id='', date=, odata_type=None, request_id=''), message='Operation: Update; Exception: [Status Code: InternalServerError; Reason: Failed to execute backend request.]', target=None))
error=MainError(additional_data={}, code='ExtensionError', details=None, inner_error=InnerError(additional_data={}, client_request_id='', date=datetime.datetime(2025, 6, 4, 15, 7, 9), odata_type=None, request_id=''), message='Operation: Update; Exception: [Status Code: BadGateway; Reason: ]', target=None))
error=MainError(additional_data={}, code='UnknownError', details=None, inner_error=InnerError(additional_data={}, client_request_id='', date=, odata_type=None, request_id=''), message='Bad Gateway', target=None)
error=MainError(additional_data={}, code='UnknownError', details=None, inner_error=InnerError(additional_data={}, client_request_id='', date=datetime.datetime(2025, 6, 4, 14, 50, 46), odata_type=None, request_id=''), message='Service Unavailable', target=None)
error=MainError(additional_data={}, code='BadGateway', details=None, inner_error=InnerError(additional_data={}, client_request_id='', date=, odata_type=None, request_id=''), message='Failed to execute backend request.', target=None)
error=MainError(additional_data={}, code='ExtensionError', details=None, inner_error=InnerError(additional_data={}, client_request_id='', date=, odata_type=None, request_id=''), message='Operation: Read; Exception: [Status Code: BadGateway; Reason: ]', target=None)
HTTP Request: GET https://graph.microsoft.com/v1.0//chats/{chat_id} "HTTP/2 504 Gateway Timeout"
error: MainError(additional_data={}, code='UnknownError', details=None, inner_error=InnerError(additional_data={}, client_request_id='', date=, odata_type=None, request_id=''), message='You do not have permission to view this directory or page using the credentials that you supplied.', target=None)
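To quantify how often each variant occurs, the logged lines could be grouped by their `code` and `message` fields. The following is a minimal sketch, assuming the log lines have the textual shape shown above; `ERROR_PATTERN` and `summarize_errors` are hypothetical names and not part of any SDK.

```python
import re
from collections import Counter

# Matches the code='...' and message='...' fields of the MainError text shown above.
# Assumes neither field contains a single quote.
ERROR_PATTERN = re.compile(r"code='(?P<code>[^']*)'.*?message='(?P<message>[^']*)'")


def summarize_errors(log_lines):
    """Count logged errors by (code, message); lines without a match are skipped."""
    counts = Counter()
    for line in log_lines:
        match = ERROR_PATTERN.search(line)
        if match:
            counts[(match.group("code"), match.group("message"))] += 1
    return counts
```

Calling `summarize_errors(lines).most_common()` on a day's worth of log lines would list the variants from most to least frequent, which could make it easier to see whether one operation dominates.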
The errors above occur sporadically, at random times, with no correlation to the actions I perform. Some appear more often when renewing subscriptions, some when adding the bot to a chat, and some when fetching users or even chats. My application permissions have been untouched for months and have been consented to, and my system is tested daily. I have reviewed my configuration and have not identified any recent changes that could be causing these issues.
It is also important to mention that since yesterday I have been receiving "Service Unavailable" 502 errors at a high frequency, which hinders my system to the point where it may become unusable.
The number of errors has grown over time in an unpredictable manner and is slowly preventing my system from functioning correctly. I am therefore seeking support in finding the cause of these errors, handling them correctly, and becoming less dependent on retry mechanisms.
Thank you in advance.
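One thing that may be worth checking as part of handling these errors is whether the failing responses carry a `Retry-After` header, since a server-provided wait time is usually a better guide than a fixed retry interval. Whether the responses above include that header is an assumption, not something confirmed here. The sketch below shows how such a value could be parsed with the standard library; `retry_after_seconds` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(headers, now=None, default=None):
    """Return the wait in seconds given by a Retry-After header, or default if absent or invalid."""
    value = None
    for name, header_value in headers.items():
        # Header names are case-insensitive.
        if name.lower() == "retry-after":
            value = header_value.strip()
            break
    if value is None:
        return default
    # Form 1: a non-negative integer number of seconds.
    if value.isdigit():
        return float(value)
    # Form 2: an HTTP date; the wait is the time remaining until that date.
    try:
        target = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if target.tzinfo is None:
        target = target.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (target - now).total_seconds())
```

A retry loop could use the returned value when it is present and fall back to its own backoff delay when the function returns the default.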