Handling emergencies - When worst case == current case

There is a much-parodied line from Rudyard Kipling's poem "If". The parody runs "If you can keep your head when all about you are losing theirs, it's just possible you haven't grasped the situation". This may be true, but the first thing to do in any computer crisis is to step back and take an objective look at the situation. That may be scary, but it is better to know.

Here, I present a few points that may be of value if ever the balloon goes up.

1. Locate your contingency plan. If you do not have a contingency plan then you have identified the first thing that you need to do when the current crisis is over. Write it on a whiteboard. You really don’t want to forget it.

2. If you do have a plan for emergencies, give it a very quick review. Does it seem feasible? Most companies have never tested their emergency plan, and most of those that have tested it found that it didn't work. You should trust an untested plan about as much as you trust untested code. However, if it is all that you have then you will have to go with it. If you have limited faith in the plan, my advice is to fork your efforts: give the plan to a competent administrator and have them try to carry it out, and put your best people on working out what has actually gone wrong.

3. OK, so, has this system ever worked? If not, then you shouldn't be facing an emergency, because this is a system that is still under development. However, it is not an uncommon scenario. The system has to go live on such-and-such a date. It has been fine in test with 100 users. The powers-that-be decide that the technical staff are being overcautious and put the brand new app live. It falls over, repeatedly and messily. If you are in this situation then you have just entered the field of extreme debugging. Welcome to my world. You are probably going to need all the help that you can get, so engaging expert assistance is not a bad idea at this point. It can be expensive, but sometimes it costs more to do it wrong. I can't recommend any particular companies, but there are people who offer a SWAT-team debugging service.

4. If the system has worked but no longer does, then something has changed. Again, this is a common scenario. Maybe it will be something that you know about and maybe it won't. If it isn't, then you want to find out about it pretty quickly.

Let us assume that we are in case 4 here. Everything used to work but now it doesn't. This is not good. When you are looking for what has changed, it is important to remember that multiple things may have changed at around the same time. It is human nature to stop at the first thing that has changed and call that the problem. I would advise looking for a little longer. This is especially true with servers that are managed in a data centre, where changes tend to be applied in batches. If you assume that the first answer is the only answer then you risk running quickly into a dead end.

What sort of changes are likely to have happened? Oh, you may have noticed that I tend to think in lists. I am sorry but it is the way that I am. Anyway, let’s have a list of common changes:

a. The volume of work has increased. This is a very common one. In this scenario, server applications tend to degrade relatively gracefully; relatively, in the sense that they tend to get slow, restart and fail some requests rather than ceasing to function completely. You have hit a scalability issue and the first thing to do is work out what you are bottlenecking on. Response times, and sometimes resource consumption, often grow non-linearly with load, and this can come as an unpleasant surprise if your scalability testing has been incomplete (there is a small sketch of this effect after the list). I have discussed scalability at some length in previous blogs, so you might want to review those among other resources.

b. Something in the infrastructure has broken. This is more common than I would have expected. It can be something as simple as a network router going down or something more subtle such as a bunch of accounts being migrated from one domain to another. You are probably not going to fix this in your application, but a quick dependency check (see the sketch after this list) will often tell you whether the problem is even yours.

c. Some components in the system have changed. Maybe some new tool has been installed on the system. Maybe a security hotfix has been applied. If you think that this has happened, then my recommendation is to back up the current state of the system to a nice fresh backup tape and then go back to a known state. You will need the backup for investigating what went wrong later, and a quick survey of what has recently changed on the box (see the sketch after this list) is worth doing before you roll anything back. As I have mentioned before, I am a big fan of having a test server that is identical to the real server. If you follow this procedure then you can just swap the systems over and have things run while you investigate the failure at relative leisure. If you do swap the systems over and it works, then I recommend breaking the fingers of anyone who tries to update the server until you have a root cause analysis. That is perhaps a little harsh, I know.

d. Your luck ran out. Strange to report, sometimes systems work because of good fortune. I know of one system that worked well for years despite corrupting memory. As it happened, a particular sequence of operations could happen in any order but in practice tended to happen in the same order every time. That meant that the memory was only corrupted after it was no longer being actively used, which in turn meant that the app didn't crash. One day, it happened the other way around and the app crashed. There wasn't a new bug; it was just bad luck that it had never shown up in testing. There is a toy sketch of this sort of ordering-dependent bug after the list.

e. External factors have changed. Maybe you have started to get requests from Firefox rather than IE. Maybe someone is launching a denial of service attack against you. Maybe a society for people named Smith has signed up for your service and your name searches are suddenly much less efficient (the last sketch after this list shows why that one hurts). There isn't much that you can do about these except deal with them on a case-by-case basis.
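
To make point (a) a little more concrete, here is a minimal sketch of why response times blow up non-linearly as you approach saturation. It assumes a single-server queue (the textbook M/M/1 model), which is a gross simplification of any real system, and the 50 ms service time is invented, but the shape of the curve is the point.

```python
# A rough illustration of non-linear response time growth under load.
# Assumes an M/M/1 queue, which is a simplification of any real system.

SERVICE_TIME = 0.05  # hypothetical: 50 ms to handle one request in isolation

def mean_response_time(arrival_rate, service_time=SERVICE_TIME):
    """Mean response time for an M/M/1 queue: R = S / (1 - utilisation)."""
    utilisation = arrival_rate * service_time
    if utilisation >= 1.0:
        return float("inf")  # past saturation the queue grows without bound
    return service_time / (1.0 - utilisation)

if __name__ == "__main__":
    # Going from 8 to 16 req/s roughly triples the response time but it is
    # still fast; going from 18 to 19.5 req/s quadruples it to two seconds.
    for rate in (8, 12, 16, 18, 19, 19.5):
        print(f"{rate:5.1f} req/s -> {mean_response_time(rate) * 1000:8.1f} ms")
```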
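
For point (b), a rough dependency check along these lines can quickly tell you whether your application can even see the things it depends on. The host names and ports below are entirely hypothetical; substitute your own database, directory server, cache and so on.

```python
# Quick-and-dirty check: can we still resolve and reach our dependencies?
import socket

# Hypothetical dependencies -- replace with whatever your system talks to.
DEPENDENCIES = [
    ("db.internal.example.com", 5432),
    ("auth.internal.example.com", 389),
    ("cache.internal.example.com", 6379),
]

def check(host, port, timeout=3.0):
    try:
        addr = socket.gethostbyname(host)  # does name resolution still work?
    except socket.gaierror as exc:
        return f"DNS FAILED ({exc})"
    try:
        # Can we actually open a TCP connection to the resolved address?
        with socket.create_connection((addr, port), timeout=timeout):
            return f"OK ({addr})"
    except OSError as exc:
        return f"CONNECT FAILED to {addr}:{port} ({exc})"

if __name__ == "__main__":
    for host, port in DEPENDENCIES:
        print(f"{host}:{port}  {check(host, port)}")
```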
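
For point (c), before you roll anything back it is worth getting a quick list of what has actually changed on the box recently. Here is one crude way to do that; the directories are only examples, and you would point it at wherever your binaries and configuration actually live.

```python
# Walk a few directories and list anything modified in the last N days.
import os
import time

WATCH_DIRS = ["/etc", "/opt/myapp"]  # hypothetical locations -- adjust to taste
DAYS = 7

def recent_changes(roots, days):
    cutoff = time.time() - days * 86400
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    mtime = os.path.getmtime(path)
                except OSError:
                    continue  # file vanished or is unreadable; skip it
                if mtime >= cutoff:
                    yield mtime, path

if __name__ == "__main__":
    for mtime, path in sorted(recent_changes(WATCH_DIRS, DAYS), reverse=True):
        print(time.strftime("%Y-%m-%d %H:%M", time.localtime(mtime)), path)
```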
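
Point (d) is hard to show directly in a few safe lines, but here is a toy sketch of the general shape of that bug: a buffer is handed back to a shared pool while a stale reference to it is still live. If the stale write happens before the buffer is reused, nobody notices; if the pool has already handed it out again, live data gets scribbled on. Which of the two you get is purely a matter of ordering, which is to say luck.

```python
# Toy model of a bug that only ever worked by luck: a handler returns its
# buffer to a shared pool but keeps writing to it afterwards.

class BufferPool:
    def __init__(self):
        self._free = [bytearray(16)]

    def acquire(self):
        return self._free.pop() if self._free else bytearray(16)

    def release(self, buf):
        self._free.append(buf)

def sloppy_handler(pool):
    buf = pool.acquire()
    buf[:5] = b"hello"
    pool.release(buf)   # bug: released too early...
    return buf          # ...while we still hold a reference to it

pool = BufferPool()
stale = sloppy_handler(pool)

# Lucky ordering: the stale write lands while the buffer is still idle.
stale[:5] = b"junk!"     # harmless, because nobody else is using it yet

# Unlucky ordering: someone else has already reacquired the buffer.
victim = pool.acquire()  # same underlying bytearray as `stale`
victim[:7] = b"payment"
stale[:5] = b"junk!"     # now live data has been corrupted
print(bytes(victim))     # starts with b'junk!nt', not b'payment'
```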
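
And for point (e), a small illustration of how a shift in the data distribution changes the cost of the same query. The table, surnames and row counts are invented; the point is that the index is still used, but the work done per query tracks the number of matching rows.

```python
# The same indexed query gets much more expensive when one value dominates.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, surname TEXT)")
conn.execute("CREATE INDEX idx_surname ON users (surname)")

# Before: surnames are spread evenly, so a surname lookup is highly selective.
conn.executemany("INSERT INTO users (surname) VALUES (?)",
                 [(f"Name{i % 10000}",) for i in range(100000)])

# After: the Smith society signs up and one surname dominates the table.
conn.executemany("INSERT INTO users (surname) VALUES (?)",
                 [("Smith",) for _ in range(50000)])

for name in ("Name42", "Smith"):
    (count,) = conn.execute("SELECT COUNT(*) FROM users WHERE surname = ?",
                            (name,)).fetchone()
    # The index finds the rows either way, but everything downstream of the
    # lookup now has to fetch and process 5,000 times as many of them.
    print(f"{name}: {count} matching rows")
```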

So, those are some ideas that I have on the topic, but if you have different ones then I would like to hear your thoughts. Feel free to post anything that you think would be helpful.

Signing off

Mark