
Thanks for the tips. Yes, I've done a full crawl. What I noticed today is that the Security Token Service is not working on the server that fails, so I'm guessing this might be a bigger problem than I thought. The Security Token Service application is logging 500 errors in the event log: System.Net.WebException: The remote server returned an error: (500) Internal Server Error.
I'm considering creating a new topology that skips that server, but obviously I want to fix that server at some point too. Most of the ideas I've seen so far involve rebuilding UPS, which is more invasive than I'd like. I did try resetting the application pool, and I verified that the files in the endpoint folder haven't changed in several years.
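
In case it helps anyone following along, here is a rough sketch of how the Security Token Service endpoint could be probed from each server to compare a working one against the failing one. This is only an illustration, not something taken from my farm: the URL assumes the usual default SharePoint Web Services port (32843) and path, and `probe` and `classify_status` are made-up helper names.

```python
import urllib.error
import urllib.request

# ASSUMPTION: typical default STS endpoint on a SharePoint server.
# Adjust the host, port, and path to match your own farm.
STS_URL = "http://localhost:32843/SecurityTokenServiceApplication/securitytoken.svc"


def classify_status(code):
    """Turn an HTTP status code (or None) into a short diagnosis string."""
    if code is None:
        return "unreachable"
    if 200 <= code < 300:
        return "ok"
    if code in (401, 403):
        return "auth required (service is responding)"
    if code >= 500:
        return "server error"
    return "unexpected"


def probe(url, timeout=10):
    """Return the HTTP status code for url, or None if no response came back."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 500.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


if __name__ == "__main__":
    code = probe(STS_URL)
    print(code, classify_status(code))
```

Running this locally on each server in the farm would at least show whether the 500 comes from the endpoint itself or only appears when another service calls it.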
Here's the screenshot of the previous topology: