Operations: I Did Not Say You Could Do That!
Bill Barnes and Duke McMillin
November 2007
Revised February 2008
Summary: When developing systems, architects must keep operations in mind. (4 printed pages)
Contents
Introduction
The Patriot Missile Incident
Conclusion
Lessons Learned
Critical-Thinking Questions
Sources
Introduction
If history repeats itself, and the unexpected always happens, how incapable must Man be of learning from experience?
–George Bernard Shaw (1856–1950), Irish dramatist and socialist
You might find this a shock, but the operators of the system that you just developed probably did not get a copy of the use cases that you used to create your design. At worst, your use cases were constructed by Bob in Marketing without any actual input from a customer. At best, they were derived from extensive customer input, but still bear little resemblance to how someone will actually operate your system. Is it the software architect's fault if failures are caused by operators doing things that you were told they would never do?
From working with my father to watching Lt. Commander Montgomery Scott in Star Trek V: The Final Frontier, I can recall hearing, "How many times do I have to tell you? Use the right tool for the right job!" As I tried to use a hammer to help remove a part from a car—spraying shards of metal everywhere, and making a huge mess of things—my father would come to the rescue. Impatience gave way to taking the time to spray WD-40 on whatever was stuck and waiting a minute before attempting to remove the part again. It might have taken a little more time, but it took less effort to get the part off; it was also safer, and it likely saved money. I wondered if other types of work—oh, for example, software development—could have issues with using the right tool for the right job...
The Patriot Missile Incident
During the 1991 Gulf War, an MIM-104 "Patriot" missile battery at Dhahran air base failed to intercept an incoming Scud missile. The Scud struck a U.S. Army barracks, killing 28 soldiers and injuring roughly 100 more. After some analysis, the cause was traced to a known bug.
It could be argued that this was actually not a bug at all, because the "bug" manifested only when the system was being used in a way for which it was never intended. To use another auto-repair example: The operators needed a hammer, but all that they had was a screwdriver. Well, a screwdriver can be a passable hammer, and a mobile antiaircraft missile system can be a passable stationary antiballistic missile system. However, the use cases that were written for the "screwdriver"—a mobile antiaircraft missile system—were the basis for every architectural decision that was made during development. So, what use case was not anticipated?
What was it like for those software architects, the people who wrote the software that controlled such a complex system? I imagine that things were looking good, at first. Based on their use cases, their design, and their user docs, they had a great product—or so they thought. Maybe they should have known better. Maybe they should have asked the customer and the operations support group more about how they were planning to use the system. Then again, how could anyone plan for all of the corner cases that "could" come up? When a corner case becomes the standard mode of operation, however, only one outcome is likely: disaster!
In the case of the missile system, the Patriot maintained a "time since last boot" counter in tenths of a second and converted it to seconds by multiplying by 1/10, using a register with only 24 bits of precision. Time, which is critical to navigation and system accuracy, was computed from this value. The Patriot system used a 100-millisecond time base, and 1/10 of a second cannot be represented exactly in binary, so every tick carried a tiny error. With 24-bit precision, after about 8 hours of operation, enough error accumulates—about 0.0275 seconds, enough to yield a 55-meter error—to degrade navigational accuracy. After 100 hours of operation, the time error grows to roughly a third of a second—the equivalent of 687 meters of targeting inaccuracy!
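To see just how small the per-tick error is and how quickly it still adds up, here is a minimal sketch in C. It is a hypothetical illustration, not the Patriot's actual code: it assumes 1/10 stored truncated after 23 binary digits (the approximation commonly cited in analyses of the incident) and simply multiplies the resulting per-tick error by hours of uptime.

```c
#include <stdio.h>

/* Hypothetical illustration of the clock drift described above -- not the
 * Patriot's actual source code. We assume 1/10 was stored truncated after
 * 23 binary digits, the approximation commonly cited in analyses of the
 * incident, and multiply the per-tick error by hours of uptime. */
int main(void) {
    const double true_tick   = 0.1;                       /* real tick length: 100 ms */
    const double stored_tick =                            /* 1/10 truncated in binary */
        (double)(long)(0.1 * (1L << 23)) / (1L << 23);
    const double per_tick_error = true_tick - stored_tick;   /* ~9.5e-8 s per tick */

    const long hours[] = {8, 100};
    for (int i = 0; i < 2; i++) {
        long ticks = hours[i] * 3600L * 10L;              /* 10 ticks per second */
        printf("%3ld hours of uptime: clock drift of about %.4f s\n",
               hours[i], ticks * per_tick_error);
    }
    return 0;
}
```

Run as written, this prints a drift of about 0.0275 seconds after 8 hours and about 0.34 seconds after 100 hours—matching the figures above. Multiply that drift by the speed of an incoming missile, and it becomes hundreds of meters of targeting error.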
It turns out that the original use case for this system was to be mobile and to defend against aircraft, which move much more slowly than ballistic missiles. Because the system was intended to be mobile, it was expected that the computer would be rebooted periodically. In this way, any clock-drift error would not propagate over extended periods and would not cause significant errors in range calculation. Because the Patriot system was not intended to run for extended times, it was probably never tested under those conditions—which explains why the problem was not discovered until the war was in progress. The fact that the system was designed as an antiaircraft system probably also made such a flaw easier to overlook, because slower-moving airplanes are easier to track and, therefore, less dependent upon a highly accurate clock value.
The system worked well when it was used as designed; but the customer used the system in a way that was not foreseen by the software architects, and the result was a loss of life. The Patriot missile failure has become a case study in how complex systems can fail in ways that nobody expected, because of a series of seemingly unrelated events. It also shows, however, that operators of any complex system can be very "creative" and are likely to do things that they just ought not to do. This is an extreme example of what happens in any enterprise, every day: Someone in operations is just trying to get something done by using your system in a way that you probably did not even know was possible.
Conclusion
So, what are you to do? You cannot design a system that works under every conceivable use case. However, you can build a system that has very well-defined limits of operation and fails in known (and easily understandable) ways when it operates outside of those limits. One way to get there is to include actual system operators in your design meetings at a very early stage. They will, of course, help with the use cases that are supposed to be supported, but they can also provide some interesting insight into how the system might be used outside of those use cases. You could throw up your hands and just say, "Don't do that"; or you could accept the reality of the operational environment, and try to make your system robust enough to survive the unexpected—where "survive" can mean failing in a known way. Your screwdriver is eventually going to be used as a hammer.
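As one concrete (and entirely hypothetical) illustration of "failing in a known way," a component could refuse to keep running once it passes the uptime window that its design was validated for, rather than silently degrading. The limit, constant, and function names below are assumptions for the sketch, not drawn from any real system; it stays in C for consistency with the earlier example.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch of failing in a known way: if the system runs past the
 * uptime window its design was validated for, stop loudly with a clear
 * message instead of silently degrading. The limit and names are assumed. */
#define MAX_VALIDATED_UPTIME_SECONDS (8L * 3600L)    /* assumed validated window */

void check_operating_limits(long uptime_seconds) {
    if (uptime_seconds > MAX_VALIDATED_UPTIME_SECONDS) {
        fprintf(stderr,
                "FATAL: uptime %ld s exceeds validated limit of %ld s; "
                "restart required before continuing operation\n",
                uptime_seconds, MAX_VALIDATED_UPTIME_SECONDS);
        exit(EXIT_FAILURE);                           /* fail loudly and predictably */
    }
}

int main(void) {
    check_operating_limits(7L * 3600L);     /* within the validated window: OK */
    check_operating_limits(100L * 3600L);   /* far outside it: fails in a known way */
    return 0;
}
```

Whether the right response is to halt, to raise an alarm, or to degrade gracefully is a design decision to make with the operators in the room; the point is that the limit is explicit and the failure mode is documented.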
Lessons Learned
· Ask many people who have a variety of roles in the company to review your use cases, to get a variety of perspectives and inputs.
· Review your plan with the operations support team, before you start writing the production code.
· Understand and document how the system might behave under use-case scenarios that are known not to be supported.
Critical-Thinking Questions
· How could this product be used in ways that I never intended?
· Under what conditions will the system fail? Consider all conditions, not just those that show up in an official use case.
· Someone from the Patriot manufacturer must have known that the customer had decided to use this system in a nonstandard way. How do you foster a relationship with operations that would allow this situation to be communicated back to the software architects?
Sources
· Hughes, David. "Tracking Software Error Likely Reason Patriot Battery Failed to Engage Scud." Aviation Week and Space Technology, June 10, 1991.
· Ganssle, Jack G. "Embedded Systems Programming: Disaster!" Embedded.com Web site. May 1998. (Accessed January 9, 2007.)
· Marshall, Eliot. "Fatal Error: How Patriot Overlooked a Scud." Science, March 13, 1992.
· Toich, Shelley. "The Patriot Missile Failure in Dhahran: Is Software to Blame?" shelley.toich.net/projects Web site. February 9, 1998. (Accessed January 9, 2007.)
About the authors
Bill Barnes has been involved in WAN and enterprise network engineering and customer support since 1995. He has worked for companies such as NorthWestNet, Verio, Internap Network Services, Lexis Nexis, and Boeing.
Duke McMillin has been working in IP networking since 1995 in a variety of customer-facing network-engineering and support roles, including the management of engineering development and capacity-planning organizations.
This article was published in Skyscrapr, an online resource provided by Microsoft. To learn more about architecture and the architectural perspective, please visit skyscrapr.net.