Testing in Production
One of the most useful resources for a test engineer dealing with web services is the production environment: the live environment that exposes the product to end users. Some of the challenges that the production environment presents are the following:
- How do we know that software that works on a developer box or in a test lab will also work in production?
- What information can we gather from production that will help us release a higher quality product?
- How do we detect and react to issues found after a software upgrade?
In this blog post, we will look at some of the strategies that can be used to improve quality by incorporating the production environment and production data into our testing.
Smoke testing in production
Some bugs appear more readily in production due to discrepancies between the test and live environments. For example, the network configuration in a test environment might differ slightly from the live site, causing calls between datacenters to fail unexpectedly. One way to identify issues like this would be to perform a full test pass on the production environment for every change we want to make, but running the full suite before every upgrade of the live environment would be prohibitively time-consuming. Smoke tests are a good compromise: they give us confidence that the core features of the product are working without incurring too high a cost.
A smoke test is a type of test that performs a broad and shallow validation of the product. The term comes from the electronics field, where after plugging in a new board, if smoke comes out, we cannot really do any more testing. During daily testing, we can use smoke tests as a first validation that the product is functional and ready for further testing. Smoke tests also provide a quick way to determine if the site is working properly after deploying an update. When we release an update to our production environment we generally perform the following steps to validate that everything went as planned:
- Prior to updating the site, we run some tests against the current version. Our goal is to make sure that the system is healthy and our tests are valid before starting the upgrade.
- We then update a subset of the production site. Preferably, this portion will not be available to end users until we complete the smoke test.
- Next, we run the tests against the updated portion of the site. It is important to have clarity on which version the tests are running against. We should have a clean pass of the smoke tests before we proceed. If we encounter problems, we can compare the results pre- and post- upgrade to help focus the troubleshooting investigation.
- We then continue the rollout to the rest of the production environment.
- Finally, we run the tests again to validate that the entire site is working as expected. A sketch of this flow appears below.
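These steps lend themselves to a simple orchestration script. The sketch below is purely illustrative and assumes hypothetical helpers (run_smoke_tests, update_slice, enable_traffic, update_remaining, report_diff) standing in for whatever deployment and test tooling your site uses:

```python
import sys

# Hypothetical helpers; these names are stand-ins, not a real deployment API.
from deploy_tools import (run_smoke_tests, update_slice, enable_traffic,
                          update_remaining, report_diff)


def validated_rollout(site, canary_slice):
    """Roll out an update, gating each stage on a clean smoke test pass."""
    # 1. Baseline: make sure the current version (and our tests) are healthy.
    baseline = run_smoke_tests(site.endpoint)
    if not baseline.passed:
        sys.exit("Aborting: smoke tests fail against the current version.")

    # 2. Update a slice of production that is not yet serving end users.
    update_slice(site, canary_slice)

    # 3. Run the smoke tests against the updated slice only.
    canary = run_smoke_tests(canary_slice.endpoint)
    if not canary.passed:
        # Comparing pre- and post-upgrade results focuses the investigation.
        report_diff(baseline, canary)
        sys.exit("Aborting rollout: the updated slice failed smoke tests.")

    # 4. Expose the slice and continue the rollout to the rest of the site.
    enable_traffic(canary_slice)
    update_remaining(site)

    # 5. Final validation: the entire site should be working as expected.
    return run_smoke_tests(site.endpoint).passed
```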
The tests used for smoke testing should have the following qualities:
- Smoke tests need to be very reliable. A test that fails when nothing is actually broken either raises unnecessary alarms or erodes trust in the smoke test suite.
- Smoke tests need to be very fast. The main point of smoke testing is to identify problems quickly, and long-running tests can either delay updates to the site or allow users to reach buggy code before the tests catch it.
- Smoke tests need to clean up after themselves. Tests can create data that isn't intended to be processed in production, and we need to avoid having that test data mixed in with real customer data. The sketch after this list illustrates these qualities.
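As a concrete but entirely hypothetical illustration, the pytest sketch below targets a placeholder API (the endpoint, routes, and response fields are made up): it checks one core path, enforces a short time budget, and uses a fixture so any data it creates is always deleted afterwards.

```python
import time
import uuid

import pytest
import requests

BASE_URL = "https://example.com/api"  # placeholder endpoint, not a real service


@pytest.fixture
def temp_document():
    """Create a uniquely named test document and always delete it afterwards,
    so no test data is left mixed in with real customer data."""
    name = f"smoke-test-{uuid.uuid4()}"
    resp = requests.post(f"{BASE_URL}/documents", json={"name": name}, timeout=5)
    resp.raise_for_status()
    doc_id = resp.json()["id"]
    yield doc_id
    # Teardown runs whether the test passed or failed.
    requests.delete(f"{BASE_URL}/documents/{doc_id}", timeout=5)


def test_core_document_roundtrip(temp_document):
    """Broad-and-shallow check: the core read path works and responds quickly."""
    start = time.monotonic()
    resp = requests.get(f"{BASE_URL}/documents/{temp_document}", timeout=5)
    assert resp.status_code == 200
    # Keep smoke tests fast; a slow pass delays the rollout.
    assert time.monotonic() - start < 2.0
```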
Windows Live uses an automated smoke test tool that can validate the service within a few minutes. The same utility is used on developer boxes, in test environments, and in production, and it is consistently updated as new features are added to the system.
Reacting to issues through data collection and monitoring
Even though we may have done thorough functional validation, shipping a new feature to production always carries the risk that things may not work as intended. Logging and real-time monitoring are the tools that help us on this front. Before shipping a new feature to production, try to answer the following questions; doing so will give you a sense of how ready you are to handle issues:
- How will you know that users are having issues with the feature? Will you mostly rely on user feedback, or will you be able to detect and measure failures yourself? Will the people running the site be able to tell something is wrong?
- If a user raises an issue, what are the resources that you will have available to investigate? Will you require the user to collect detailed logs?
- In the event of an issue, are your test libraries prepared for quickly building up a test scenario based on a user's feedback? The ability to craft tests based on logs and the user’s repro steps generally indicates how long it will take for someone to reproduce the issue and validate a fix, which has a direct impact on the time to resolution.
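The first of these questions is easier to answer if the service measures failures itself instead of waiting for user feedback. The sketch below is a minimal, hypothetical monitor; the window size, threshold, minimum sample count, and alert hook are all placeholder choices:

```python
import time
from collections import deque


class FailureRateMonitor:
    """Track a feature's failure rate over a sliding window and alert when it
    crosses a threshold, rather than relying on users to report problems."""

    def __init__(self, window_seconds=300, threshold=0.05, alert=print):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.alert = alert                 # e.g. page the on-call engineer
        self.events = deque()              # (timestamp, succeeded) pairs

    def record(self, succeeded, now=None):
        now = time.time() if now is None else now
        self.events.append((now, succeeded))
        # Drop events that have fallen out of the sliding window.
        while self.events and self.events[0][0] < now - self.window_seconds:
            self.events.popleft()
        self._check()

    def _check(self):
        total = len(self.events)
        failures = sum(1 for _, ok in self.events if not ok)
        # Require a minimum sample size so a single failure doesn't page anyone.
        if total >= 20 and failures / total > self.threshold:
            self.alert(f"Failure rate {failures}/{total} exceeds "
                       f"{self.threshold:.0%} over the last {self.window_seconds}s")
```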
Some of the strategies that Windows Live uses to react to issues more quickly are the following:
- We allow user-initiated log collection, both on the client and on the server side. Taking the product group out of the critical path for collecting data saves the team a significant amount of time and effort.
- We support using those user logs to craft tests that reproduce an issue. Our tools take the logs, remove any actual references to the user's data contents, and replay the traffic; a small sketch of this idea follows below.
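The sketch below illustrates the replay idea in spirit only. It assumes a made-up, dict-shaped log entry (user_id, method, path, body), replaces identity and content with placeholders while preserving the shape of each request, and replays the scrubbed traffic through a caller-supplied send function:

```python
import re


def scrub(entry):
    """Strip real identity and content from a log entry, keeping only the
    shape of the request so it can be replayed safely against a test site."""
    scrubbed = dict(entry)
    scrubbed["user_id"] = "test-user"                             # drop real identity
    scrubbed["body"] = re.sub(r"\S", "x", entry.get("body", ""))  # keep only length/shape
    return scrubbed


def replay(entries, send):
    """Replay scrubbed traffic, in its original order, through `send`."""
    for entry in entries:
        safe = scrub(entry)
        send(safe["method"], safe["path"], safe["body"])
```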
Using production data as input for tests
The involvement of Test in production shouldn't be limited to releasing a new feature or investigating an issue. Production contains a wealth of data that helps us better define what to test. The highest-priority tests are those that map to the core customer calling patterns, and for existing scenarios, production data is the best source. Some of the interesting questions that production data analysis can answer are the following:
- What are the different kinds of users in the environment? What are the characteristics that identify them?
- What are the most common calling patterns? Which ones most frequently cause errors?
- Do the site’s traffic patterns indicate changes in user behavior?
Gathering and analyzing data to answer the above and other questions is often non-trivial, but the resulting data is invaluable, particularly when deciding which areas should have a bigger focus when testing.
Within Windows Live, we have used this approach to understand both user scenarios and calling patterns. We measure some of the characteristics of the data (like how many folders a SkyDrive has, or how many comments photos typically have) to identify both common scenarios and outliers. This data lets us focus efforts like performance testing and stress on the most common scenarios, while ensuring that we have coverage on the edge cases.
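As a small, hypothetical example of turning measurements like these into test input, the sketch below reads an anonymized metrics file (one JSON record per user with a folder_count field, a made-up format) and picks a common case, a heavy user, and the largest outlier as parameters for performance and stress scenarios:

```python
import json
from statistics import quantiles


def scenario_parameters(metrics_path):
    """From anonymized per-user metrics, pick the values our performance and
    stress tests should exercise: the typical case and the outliers."""
    with open(metrics_path) as f:
        folder_counts = [json.loads(line)["folder_count"] for line in f]

    deciles = quantiles(folder_counts, n=10)  # 9 cut points: 10th..90th percentile
    return {
        "common_case": deciles[4],         # the median: what most users look like
        "heavy_user": deciles[-1],         # 90th percentile
        "edge_case": max(folder_counts),   # largest value seen in production
    }
```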
When using production data in testing, the approach to privacy is extremely important and needs to be figured out before starting the work. Our tools only interact with abstractions of user data, with all actual user content and identity removed. We care about what the data looks like, not specifically what the data is.
In conclusion, a test engineer's effectiveness can be enhanced by using production as a source of information, whether by making sure that the core scenarios work as expected through smoke testing, by building a quick mechanism for reacting to issues, or by harvesting data to feed into test tools and plans.