I Scream, You Scream, We're All Testing!

Scream testing is an interesting concept, applied not so much to finding defects as to cleaning up your processes.  In the engineering world, we have a habit of creating a lot of things, yet we have a very difficult time retiring them, whether those things are projects, hardware, automated test cases, etc.  I guess to some degree engineers can be hoarders.  Now, you could just shut things down or delete them when you are done, but somehow that never happens, and honestly, if someone else is using “it”, shutting it down would make their work difficult.  And if you are responsible for broad items like servers in a data center, automated test cases that test components of the OS, or a website full of information, the last thing you want to do is just turn those off.  If people are utilizing them and expecting to see results from them, then turning them off will make more problems for you than leaving them on and continuing to maintain them.  And that is why most people in corporations keep things around for a very long time.  Who knows what you will break if you turn “it” off?  It’s much safer to just leave it on.  But this becomes costly and inefficient.  So what do you do to clean up the artifacts left running just because everyone is afraid to turn them off or delete them?

We incorporated a new method in my lab that I think is great.  We call it a scream test.  We have many servers in our lab, and for a lot of them it is questionable whether they are truly being used.  They have changed owners, gone through reorgs, been upgraded (or not), and eventually ended up in a state where we don’t believe they are being used, but we just aren’t sure.  So we put these servers into a scream test environment where they still have access to our corpnet, but users are now limited from doing just about anything on the machines except logging in.  If someone does log onto a machine to try something (like running tests, installing other software, etc.), a dialog pops up telling them that this machine is in a scream test and that they need to contact my lab managers if they want access back for this server; otherwise it will be retired in so many days.  We usually put a set of servers in a scream test for 2-4 weeks.  Some people on the team will scream profusely, and we are happy about that.  And they become happy about it too when they find out that we didn’t uninstall anything or lose any of their data.  We just limited access, and once they scream to the lab managers, we give all permissions back AND record them as the owner of the server (the piece of info we were missing that made the utilization of this server questionable).  Some would ask, why not just monitor CPU/RAM usage?  We could, but many times dev or test servers will sit idle for a while because we are not actively deploying new apps, only maintaining existing ones.  Those don’t need daily activity on our servers, yet the machines are still needed.
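
The bookkeeping behind a server scream test like the one described above could be sketched roughly as follows in Python.  This is only an illustration, not the lab's actual tooling: the class name, the server names, the dates, and the 21-day window (picked from within the 2-4 week range) are all assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class ServerUnderTest:
    """One server placed into a scream test (hypothetical record type)."""
    name: str
    started: date
    window_days: int = 21          # assumption: a value inside the 2-4 week range
    owner: Optional[str] = None    # unknown until someone screams

    def scream(self, user: str) -> None:
        # Someone asked for access back: record them as the owner.
        self.owner = user

    def status(self, today: date) -> str:
        if self.owner is not None:
            return "restore"       # give permissions back and keep the server
        if today >= self.started + timedelta(days=self.window_days):
            return "retire"        # window elapsed and nobody screamed
        return "waiting"


# Hypothetical servers and dates, for illustration only.
servers = [
    ServerUnderTest("lab-01", date(2024, 1, 1)),
    ServerUnderTest("lab-02", date(2024, 1, 1)),
    ServerUnderTest("lab-03", date(2024, 1, 15)),
]
servers[0].scream("alice")

today = date(2024, 1, 25)
report = {s.name: s.status(today) for s in servers}
print(report)
```

The useful side effect is the same one noted above: a scream fills in the missing owner field, so even the servers that survive leave the test better documented than they went in.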

You can apply this same Scream Test approach to anything you want to retire.  For example, I was recently speaking with someone running tests in Windows.  I remember back when I had to run automated tests in Windows and some teams, like the Graphics team, would generate over 1 million test results a day.  If just a small percentage of those failed, testers would spend a ton of time investigating the failures.  But the reason there were so many automated tests is that they were developed over the course of many years (since the beginning of Windows NT), so how do you know if those tests are still worthwhile and needed?  How about applying the Scream Test method?  Turn some of them off, maybe for some runs early in the product cycle, and see if bugs are being missed and if testers are starting to scream.  Should you turn off the more recent automated tests or the more legacy ones?  Well, at first I’d say the legacy ones, but what indication are you going to use to trigger that scream?  If the trigger is bugs found, yet the test is so old that nobody would actually run through that scenario in the month or more of testing while the automated test is disabled, then your scream test will never trigger a scream.  (Ok, maybe I should be working at Monsters, Inc.)  It may be ok that nobody screamed about your disabled tests, but you should expect some of the items that you put into a scream test to actually cause a scream.  Otherwise, you don’t really know if your scream test is working.
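
One way to act on that last point is to treat a batch of disabled tests as inconclusive when it produces no screams at all.  The following Python sketch shows the idea; the function name and the test names are hypothetical, and what counts as a "scream" (a missed bug, a complaint from a tester) is left to whoever runs the experiment.

```python
def summarize_batch(disabled_tests, screams):
    """Split a batch of disabled tests by whether anyone screamed about them.

    disabled_tests: names of the tests that were turned off.
    screams: names of the tests someone complained about while they were off.
    """
    disabled = set(disabled_tests)
    screamed = sorted(disabled & set(screams))
    silent = sorted(disabled - set(screams))
    return {
        "keep": screamed,
        "retire_candidates": silent,
        # With zero screams we cannot tell "nobody needs these tests"
        # apart from "the scream trigger never fires".
        "conclusive": len(screamed) > 0,
    }


# Hypothetical test names, for illustration only.
batch = ["gfx_legacy_blit", "gfx_legacy_palette", "gfx_window_resize"]
result = summarize_batch(batch, screams=["gfx_window_resize"])
print(result)

quiet = summarize_batch(batch, screams=[])
print(quiet["conclusive"])
```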

Can you apply this approach to other things?  Sure.  What about a tool or website that you think needs to be retired?  Redirect users to another tool or website and ask them to click a button if they aren’t able to do the same things they could with the old tool.  Every time someone clicks to give feedback that they aren’t happy, you count it as a scream.  How many screams do you need to turn the service back on?  Or the more interesting question: how long do you wait for any feedback or scream before determining that nobody is using the service and it can be turned off?
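
Those two open questions map onto two knobs, which a small Python sketch can make explicit.  The function name, the threshold of 3 screams, and the 30-day quiet period below are placeholders chosen for illustration; the right values are exactly what the questions above leave open.

```python
from datetime import datetime, timedelta


def decide(screams, redirect_start, now, scream_threshold, quiet_period):
    """Decide what to do with a redirected tool or website.

    screams: datetimes at which someone clicked the "I'm not happy" button.
    scream_threshold: how many screams justify turning the service back on.
    quiet_period: how long to wait with no screams before retiring it.
    """
    if len(screams) >= scream_threshold:
        return "restore"
    if not screams and now - redirect_start >= quiet_period:
        return "retire"
    # Either still inside the quiet period, or a few people screamed
    # but not enough to meet the threshold yet.
    return "keep waiting"


# Hypothetical dates and settings, for illustration only.
start = datetime(2024, 3, 1)
now = datetime(2024, 4, 1)
settings = dict(scream_threshold=3, quiet_period=timedelta(days=30))

no_screams = decide([], start, now, **settings)
one_scream = decide([datetime(2024, 3, 10)], start, now, **settings)
many_screams = decide([datetime(2024, 3, d) for d in (5, 6, 7)], start, now, **settings)
print(no_screams, one_scream, many_screams)
```

Note that a single scream below the threshold still blocks retirement in this sketch, on the reasoning that one unhappy user is evidence that somebody is using the service.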