The Insapience of Anti-Automationism

Last weekend, I attended the 12th annual Workshop on Teaching Software Testing (WTST 2013). This year, we focused on high-volume automated testing (and how to teach it).

High-volume automated testing (HiVAT) refers to a family of testing techniques that enable the tester to create arbitrarily many tests, run them, and evaluate their results.
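The shape of the technique is easy to sketch. Here is a minimal, hypothetical illustration in Python (the names function_under_test and partial_oracle are placeholders, not any particular tool discussed at the workshop): generate a very large number of inputs, run them through the software under test, and let a weak, partial oracle flag anything suspicious for a human to review later.

```python
import random

def function_under_test(x):
    """Placeholder for the real software under test (hypothetical)."""
    return x * 2

def partial_oracle(x, result):
    """A weak, partial oracle: it cannot prove a result correct, but it can
    flag results that are obviously suspicious (a toy heuristic here)."""
    return result is not None and result >= x

def high_volume_run(n_tests=1_000_000, seed=42):
    """Run arbitrarily many generated tests; return whatever gets flagged."""
    random.seed(seed)
    suspicious = []
    for _ in range(n_tests):
        x = random.randint(0, 10**9)            # generated test input
        try:
            result = function_under_test(x)
            if not partial_oracle(x, result):   # oracle flags the result
                suspicious.append((x, result))
        except Exception as exc:                # a crash is always suspicious
            suspicious.append((x, repr(exc)))
    return suspicious                           # a human reviews these

if __name__ == "__main__":
    flagged = high_volume_run(n_tests=100_000)
    print(f"{len(flagged)} suspicious results queued for human review")
```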

We had some great presentations and associated discussions. For example:

  • Harry Robinson showed us some of the ways that the Bing team evaluated its search recommendations. There are no definitive oracles for relevance and value but there are many useful ideas for exposing probably-not-relevant and probably-not-valuable suggestions. These are partial oracles. The Bing team used these to run an enormous number of searches and gain good (if never perfect) ideas for polishing Bing’s suggestion heuristics.
  • Doug Hoffman and I gave presentations that emphasized the use of oracles in high volume automated testing. Given an oracle, even a weak oracle, you can run scrillions of tests against it, alerting a human when the test-running software flagged a test result as suspicious. Both of us had uncharitable things to say about run-till-crash oracles and other oracles that focus on the state of the environment (e.g. memory leaks) rather than the state of the software under test. These are the types of oracles most often used in fuzzing, so we dissed fuzzing rather enthusiastically.
  • Mark Fioravanti and Jared Demott gave presentations on the use of high-volume test automation which included several examples of valuable information that could be (and was) exposed by fuzzing with simplistic oracles. Fioravanti illustrated context after context, question after question, that were well-addressed with hard-to-set-up-but-easy-to-evaluate high-volume tests. Doug and I ate our humble pie and backed away from some of our previous overgeneralizations.
  • Tao Xie showed us a set of fascinating ideas for implementing high-volume testing at the unit test level. I have a lot more understanding and interest in automated test data generation (and a long reading list).
  • Rob Sabourin and Vadym Tereshchenko talked about testing for multi-threading problems using FitNesse.
  • Casey Doran talked about how to evaluate open source software to determine whether it would be a good testbed for evaluating a high-volume test automation tool in development, and Carol Oliver led a discussion of quality criteria for reference implementations (teachable demonstrations) of high-volume test automation tools.
  • Finally, Thomas Vaniotis led a discussion of the adoption of high-volume techniques in the financial services industries.

Personally, I found this very instructive. I learned new ideas, gained new insights, stretched the bounds of what I see as high-volume test automation and learned about new contexts in which the overall family can be applied (and how those contexts differ from the ones I’m most familiar with).

A few times during the meeting, we were surprised to hear echoes of an anti-automation theme that has been made popular by a few testing consultants. For example:

  • One person commented that their clients probably wouldn’t be very interested in this because they were more interested in “sapient” approaches to testing. According to the new doctrine, only manual testing can be sapient. Automated tests are done by machines and machines are not sapient. Therefore, automated testing cannot be sapient. Therefore, if a tester is enthusiastic about sapient testing (whatever that is), automated testing probably won’t be of much interest to them.
  • Another person was sometimes awkward and apologetic describing their tests. The problem was that their tests checked the program’s behavior against expected results. According to the new doctrine, “checking” is the opposite of “testing” and therefore, automated tests that check against expectations are not only not sapient, they are not tests. The campaign to make the word “checking” politically incorrect among testers might make good marketing for a few people, but it interferes with worthwhile communication in our field.

I don’t much care whether someone decides to politicize common words as part of their marketing or, more generally, to hate on automated testing. Over time, I think this is probably self-correcting. But some people want to make the additional assertion that the use of (allegedly) non-sapient (allegedly) non-tests conflicts with the essence of context-driven testing. It’s at that point that, as one of the founders of the context-driven school, I say “hooey!”.

  • All software tests are manual tests. Consider automated regression testing, allegedly the least sapient and the most scripted of the lot. We reuse a regression test several times, perhaps running it on every build. Yes, the computer executes the tests and does a simple evaluation of the results, but a human probably designed that test, a human probably wrote the test code that the computer executes, a human probably provided the test input data by coding it directly into the test or by specifying parameters in an input file, a human probably provided the expected results that the program uses to evaluate the test result, and if it appears that there might be a problem, it will be a human who inspects the results, does the troubleshooting and either writes a bug report (if the program is broken) or rewrites the test. All that work by humans is manual. As far as I can tell, it requires the humans to use their brains (i.e. be sapient).
  • All software tests are automated. When you run a manual test, you might type in the inputs and look at the outputs, but everything that happens from the acceptance of those inputs to the display of the results of processing them is done by a computer under the control of a program. That makes every one of those tests automated.

The distinction between manual and automated is a false dichotomy. We are talking about a matter of degree (how much automation, how much manual), not a distinction of principle.

I think there is a distinction to be made between tests that testers use to learn things that they want to know versus tests that some people create or run even though they have no plan or expectation for getting information value from them. We could call this a distinction of sapience. But to tie this to the use of technology is misguided.

Thinking within the world of context-driven testing, there are plenty of contexts that cry out for HiVAT support. If we teach testers to be suspicious of test automation, to treat it as second-class (or worse), to think that only manual tests have the secret sauce of sapience, we will be preparing those testers to fail in any contexts that are best served with intense automation. That is not what we should teach as context-driven testing.

And then there is that strange dichotomization of testing and checking. As I understand the notion of checking, when I run a test I can check whether the program’s behavior conforms to behavior I expect in response to the test. Thus I can say, “Let’s check whether this adds these numbers correctly,” “Let’s check the display,” and even “Let’s check for memory leaks.” Personally, I think these look like things I might want to find out while testing. I think testing is a search for quality-related information about a product or service under test, and if I execute the program with the goal of learning some quality-related information, that execution seems to me to be a test. Checking is something that I often do when I am doing what I think is good testing.

Suppose further that I have a question about the quality of the software that is well-enough formed, and my programming skills are strong enough, that I can write a program to do the checking for me and tell me the result. That is, suppose that on my instruction, my program runs the software under test in order to quickly provide an answer to my question. To me, this still looks like a test, even though it is automated checking.
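As an illustration only (not anyone’s actual tooling), here is what such automated checking might look like in Python for one well-formed question, “does this routine add these currency amounts correctly?” The routine add_amounts and the expected values are hypothetical stand-ins:

```python
from decimal import Decimal

def add_amounts(a, b):
    """Hypothetical routine under test."""
    return Decimal(a) + Decimal(b)

# Expectations I supplied as the test designer.
EXPECTED = {
    ("0.10", "0.20"): Decimal("0.30"),
    ("1.005", "0.005"): Decimal("1.01"),
    ("-5.00", "5.00"): Decimal("0.00"),
}

def run_checks():
    """Run the software under test and compare its behavior to expected results."""
    failures = []
    for (a, b), expected in EXPECTED.items():
        actual = add_amounts(a, b)
        if actual != expected:        # unambiguous comparison: this is checking
            failures.append((a, b, expected, actual))
    return failures

if __name__ == "__main__":
    failures = run_checks()
    for a, b, expected, actual in failures:
        print(f"add_amounts({a}, {b}) = {actual}, expected {expected}")
    print("all checks passed" if not failures else f"{len(failures)} checks failed")
```

The program answers my question quickly and reports the result; the question, the expected values, and the interpretation of any failure are all mine.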

But not everyone agrees. Instead, some people assert that checking is antithetical to testing (checking versus testing). They say that testing is a sapient activity done by humans but that checking is not testing.

If comparison of the program’s behavior to an anticipated result makes something a not-test, then what if we check whether the program’s behavior today is consistent with what it did last month? Is that a not-test? What if we check whether a program’s calculations are consistent with those done by a respected competitor? Or consistent with claims made in a specification? These are just more examples of cases in which testers might (often) have a reasonably clear set of expectations for the results of a test. When they have those expectations, they can assess the test result in terms of those expectations. Clearly, this is checking. But in the land of Sapient Testing, these particular expectations are enumerated as Consistency Heuristics that sapient testers rely on (e.g. http://www.developsense.com/blog/2012/07/few-hiccupps/). Apparently, at least some sapient testing is checking.
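To make the point concrete, here is a sketch (hypothetical names throughout) of one of those consistency heuristics in automated form: compare the program’s behavior today against results recorded from an earlier build. A difference is not automatically a bug; it is flagged for a human to interpret.

```python
import json

def program_under_test(x):
    """Hypothetical function whose outputs were recorded last month."""
    return x * x - x

def save_baseline(inputs, path="baseline.json"):
    """Record the current behavior as the consistency baseline."""
    with open(path, "w") as f:
        json.dump({str(x): program_under_test(x) for x in inputs}, f)

def check_against_baseline(path="baseline.json"):
    """Flag every input whose result differs from the recorded baseline."""
    with open(path) as f:
        baseline = json.load(f)
    changed = []
    for key, old_result in baseline.items():
        new_result = program_under_test(int(key))
        if new_result != old_result:    # consistent-with-history comparison
            changed.append((key, old_result, new_result))
    return changed

if __name__ == "__main__":
    save_baseline(range(1000))   # imagine this ran against last month's build
    print(f"{len(check_against_baseline())} results differ from the baseline")
```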

Let me try to clarify the distinction:

  • A test is checking to the extent that it is designed to produce results that can be compared, without ambiguity, to results anticipated by the test designer.
  • A test is not-checking to the extent that it can produce a result that is not anticipated by the test designer but is noticeable and informative to the person who interprets the test results.

From this, let me suggest a conclusion:

Most tests are checking. Most tests are also not-checking. Checking versus testing is another false dichotomy.

The value of high-volume automation is that it gives testers a set of tools for hunting for bugs that are too hard to hunt without tools. It lets us hunt for bugs that are very hard to find (think of calculation errors that occur on fewer than 0.00000001% of a function’s inputs, like Hoffman’s MASPAR bug). It lets us hunt for bugs that live in hard-to-reach places (think of race conditions). It lets us hunt for bugs that cause intermittent failures that we (incorrectly) don’t think are possible or have no idea how to hunt with traditional methods (including manual exploration).
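As a toy analogue of that kind of hunt (illustrative only, not Hoffman’s actual MASPAR test), imagine an integer-square-root routine that is wrong for a tiny fraction of its inputs. A cheap partial oracle, squaring the answer, lets a tool sweep millions of inputs and catch the handful of failures no manual approach would stumble on. The routine and its seeded bug below are hypothetical:

```python
import math

def isqrt_under_test(n):
    """Hypothetical routine under test, deliberately wrong for rare inputs."""
    r = math.isqrt(n)
    return r + 1 if n > 0 and n % 7_919_717 == 0 else r    # seeded rare bug

def hunt(limit=10_000_000):
    """Sweep a huge input range; the partial oracle squares each answer."""
    failures = []
    for n in range(limit):
        r = isqrt_under_test(n)
        if not (r * r <= n < (r + 1) * (r + 1)):            # partial oracle
            failures.append(n)
    return failures

if __name__ == "__main__":
    bad = hunt()
    print(f"{len(bad)} wrong answers in 10,000,000 inputs: {bad[:5]}")
```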

We do almost all of the high-volume test work with tools. In most cases, we compare test results with expected values. And yet, this is purposeful work being done by humans to answer quality-related questions. It is automated. It is checking. It is testing.

Tests are tests whether they are automated or not, whether they are checking or not. Tests can be excellent tests, and fully appropriate for their contexts, whether they are automated or not, whether they are checking or not. And tests can be worthless, or completely unsuited to their context, whether they are automated checks or intentionally-designed manual tests.

The basic premise of context-driven testing is that a test approach that works well in one context may fail utterly in some other context. There are no universal best practices (not even manual exploratory testing organized into testing sessions).

Years ago, one of the common pieces of advice we gave to testers who were interested in context-driven testing was to look for three counter-examples. If you were thinking about a really good practice, look for three situations in which an alternative would work much better. If you were thinking about a practice that you didn’t like (especially one that other people took more seriously than you considered reasonable), look for three situations in which that practice would be appropriate and valuable. I think we learned this from Brian Marick but I remember it taking hold with most (or all) of the folks who were early adopters of the context-driven approach. I came to see it as a core skill of context-driven testers. Drawing sharp lines, condemning common and often-valuable testing activities as non-testing or non-sapient, is completely out of step with this. It denies the importance of “context” in “context-driven.”