Metrics, Ethics, & Context-Driven Testing (Part 2)

My last post responded to Michael Bolton’s “Why Pass vs. Fail Rates Are Unethical,” in which Michael argued that calculating the ratio of passing tests to failing tests is irresponsible, unethical, unprofessional, unscientific, and inhumane. I think this is an example of a growing problem in the rhetoric of context-driven testing: it pays too little attention to the value of tailoring what we do to the project’s context. Instead, too often, I see a moralistic insistence on adopting preferred practices or rejecting practices that we don’t like.

I think it’s easy to convert any disagreement about policy or practice into a disagreement about ethics. I think this is characteristic of movements that are maturing into orthodox rigidity. Unfortunately, I think that’s fundamentally incompatible with a contextualist approach. My post advocated for dialing the rhetoric back, for a stronger distinction between disagreeing with someone and morally condemning them.

Michael responded with a restatement that I think is even more extreme.

I think the best way to answer this is with a series of posts (and perhaps some discussion) rather than one excessively long screed.

(Added 3/22/12) Michael and I plan to discuss this soon. My next post will be informed by that discussion.

The core messages of this first post are fairly simple:

Executives are Entitled and Empowered to Choose their Metrics

Several years ago, I had a long talk about metrics with Hung Quoc Nguyen. Hung runs LogiGear, a successful test lab. He was describing to me some of the metrics that his clients expected. I didn’t like some of these metrics and I asked why he was willing to provide them. Hung explained that he’d discussed this with several executives. They understood that the metrics were imperfect. But they felt that they needed ways to summarize what the organization knew about projects. They felt they needed ways to compare progress, costs, priorities, and risks. They felt they needed ways to organize the information so that they could compare several projects or groups at the same time. And they felt they needed to compare what was happening now to what had happened in the past. Hung then made three points:

  1. These are perfectly legitimate management goals.
  2. Quantification (metrics) is probably necessary to achieve these goals.
  3. The fact that there is no collection of metrics that will do this perfectly (or even terribly well) doesn’t eliminate the need. Without a better alternative, managers will do the best they can with what they’ve got.

Hung concluded that his clients were within their rights to ask for this type of information and that he should provide it to them.

If I remember correctly, Hung also gently chided me for being a bit of a perfectionist. It’s easy to refuse to provide something that isn’t perfect. But that’s not helpful when the perfect isn’t available. He also suggested that when it comes to testers or consultants offering a “better alternative”, every executive has both the right and the responsibility to decide which alternative is the better one for her or his situation.

By this point, I had joined Florida Tech and wasn’t consulting to clients who needed metrics, so I had the luxury of letting this discussion settle in my mind for a while before acting on it.

Finance Metrics Illustrate the Executives’ Context

A few years later, I started studying quantitative finance. I am particularly interested in the relationship between model evaluation in quantitative finance and exploratory testing. I also have a strong personal interest–I apply what I learn to managing my family’s investments.

The biggest surprise for me was how poor a set of core business metrics investors have to work with. I’m thinking of the numbers in balance sheets, statements of cash flow, and income statements, and the added details in most quarterly and annual reports. These paint an incomplete, often inaccurate picture of the company. The numbers are so subject to manipulation, and present such an incomplete view, that it can be hard to tell whether a company was actually profitable last year or how much its assets are actually worth.

Investors often supplement these numbers with qualitative information about the company (information that may or may not present a more trustworthy picture than the numbers). However, despite the flaws of the metrics, most investors pay careful attention to financial reports.

I suppose I should have expected these problems. My only formal studies of financial metrics (courses on accounting for lawyers and commercial law) encouraged a strong sense of skepticism. And of course, I’ve seen plenty of problems with engineering metrics.

But it was still a surprise that people actually rely on these numbers. People invest enormous amounts of money on the basis of these metrics.

It would be easy to rant against using these numbers. They are imperfect. They can be misleading. Sometimes severely, infuriatingly, expensively misleading. So we could gather together and have a nice chant that using these numbers would be irresponsible, unethical, unprofessional, unscientific and inhumane.

But in the absence of better data, when I make financial decisions (literally, every day), these numbers guide my decisions. It’s not that I like them. It’s that I don’t have better alternatives to them.

If someone insisted that I ignore the financial statistics, that using them would be irresponsible, unethical, unprofessional, unscientific, and inhumane, I would be more likely to lose respect for that person than to stop using the data.

Teaching Metrics

I teach software metrics at Florida Tech. These days, I start the course with chapters from Tockey’s Return on Software: Maximizing the Return on Your Software Investment. We study financial statistics and estimate the future cost of a hypothetical project. The students see a fair bit of uncertainty. (They experience a fair bit of uncertainty–it can be a difficult experience.) I do this to help my students gain a broader view of their context.
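To give a flavor of that exercise, here is a small sketch of my own (a hypothetical illustration, not an example taken from Tockey’s book): discount a project’s projected costs back to present value under a few different interest-rate assumptions and watch how far the answer moves.

```python
# A hypothetical illustration of the kind of estimate the students work
# through (my own sketch, not an example from Tockey's book): the present
# value of a project's future costs under different discount-rate
# assumptions.

def present_value(cash_flows, discount_rate):
    """Discount a list of end-of-year cash flows back to today."""
    return sum(cf / (1 + discount_rate) ** year
               for year, cf in enumerate(cash_flows, start=1))

# Projected maintenance costs for the next five years (in dollars).
projected_costs = [120_000, 125_000, 130_000, 140_000, 150_000]

for rate in (0.05, 0.10, 0.15):
    print(f"discount rate {rate:.0%}: "
          f"present value of costs = ${present_value(projected_costs, rate):,.0f}")

# The answer moves by tens of thousands of dollars depending on a single
# assumption; that is the kind of uncertainty the students get to experience.
```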

When an executive asks them for software engineering metrics, they are being asked to provide imperfect metrics to managers who are swimming in a sea of imperfect metrics.

It is important (I think very important) to pay attention to the validity of our metrics. It is important to improve them, to find ways to mitigate the risks of using them, and to advise our clients about the characteristics and risks of the data/statistics we supply to them. I think it’s important to use metrics in ways that don’t abuse people. There are ethical issues here, but I think the blanket condemnation of metrics like pass/fail ratios does not begin to address the ethical issues.

The Principles

In the context-driven principles, we wrote (more precisely, I think, I wrote) “Metrics that are not valid are dangerous.” I still mostly (*) agree with these words, but I think it is too easy to extend the statement into a position that is dogmatic and counterproductive. If I were writing the Principles today, I would reword this statement in a way that acknowledges the difficulty of the problem and the importance of the context.

(*) The phrase “metrics that are not valid” is inaccurately absolute. It is not proper to describe a metric as simply valid or not valid (see Trochim, or Shadish, Cook & Campbell, for example). Rather, we should talk about metrics as more valid or less valid (shades of gray). The wording “not valid” was a simplification at the time and, in retrospect, should be seen as an oversimplification.

Contexts differ: Recognizing the difference between wrong and Wrong

Contexts differ.

  • Testers provide information to our clients (stakeholders) about the product, about how we tested it, and about what we found.
  • Our clients get to decide what information they want. We don’t get to decide that for them.
  • Testers provide services to software projects. We don’t run the projects. We don’t control those projects’ contexts.

In context-driven testing, we respect the fact that contexts differ.

What Does “contexts differ” Really Mean?

I think it means that in different contexts, the people who are our clients:

  • are going to want different types of information
  • are going to want us to prioritize our work differently
  • are going to want us to test differently, to mitigate different risks, and to report our results in different ways.

Contexts don’t just differ for the testers. They differ for the project managers too. The project managers have to report to other people who want whatever information they want.

We don’t manage the project managers. We don’t decide what information they have to give to the people they report to.

Sometimes, Our Clients Want Metrics

Sometimes, a client will ask how many test cases the testers have run:

  • I don’t think this is a very useful number. It can be misleading. And if I organize my testing with this number in mind, I might do worse testing.
  • So if a client asks me for this number, I might have a discussion with her or him about why s/he thinks s/he needs this statistic and whether s/he could be happy with something else.
  • But if my client says, “No, really, I need that number,” I say OK and give the number.

Sometimes a client will ask about defect removal efficiency:

  • I think this is a poor excuse for a metric. I have a nice rant about it when I teach my graduate course in software metrics. Bad metric. BAD!
  • If a client asks for it, I am likely to ask, Are you sure? If they’re willing to listen, I explain my concerns.

But defect removal efficiency (DRE) is a fairly popular metric. It’s in lots of textbooks. People talk about it at conferences. So no matter what I say about it, my client might still want that number. Maybe my client’s boss wants it. Maybe my client’s customer wants it. Maybe my client’s regulator wants it. This is my client’s management context. I don’t think I’m entitled to know all the details of my client’s working situation, so maybe my client will explain why s/he needs this number and maybe s/he won’t.

So if the client says, “No, really, I need the DRE,” I accept that statement as a description of my client’s situation, and I say OK and give the number.
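For readers who haven’t worked with DRE, here is a minimal sketch of how the metric is usually computed (this follows the common textbook definition; the defect counts are hypothetical):

```python
# Defect removal efficiency (DRE), as it is usually defined: the share of
# all known defects that were caught before release. The counts below are
# hypothetical.

def defect_removal_efficiency(found_before_release: int,
                              found_after_release: int) -> float:
    """Return DRE as a percentage."""
    total = found_before_release + found_after_release
    if total == 0:
        raise ValueError("No defects recorded; DRE is undefined.")
    return 100.0 * found_before_release / total

# Example: 180 defects found during development and testing,
# 20 more reported from the field after release.
print(defect_removal_efficiency(180, 20))  # 90.0

# Part of my concern: the denominator depends on how long and how hard
# anyone keeps looking for field defects, so the number can drift for
# reasons that have nothing to do with the quality of the testing.
```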

One more example: the ratio of passing to failing tests. Michael Bolton presents several reasons for disliking this metric and I generally agree with them. In particular, I don’t know what the ratio measures (it has no obvious construct validity). And if the goal is to make the number big, there are lots of ways to achieve this that yield weak testing (see Austin on measurement dysfunction for a discussion of this type of problem).
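To make the measurement-dysfunction point concrete, here is a tiny hypothetical sketch: pad the suite with trivial checks that can hardly fail, and the ratio looks much better while the product and the testing stay exactly the same.

```python
# Hypothetical illustration of how the pass-to-fail ratio can be inflated
# without improving the testing at all.

def pass_to_fail_ratio(passed: int, failed: int) -> float:
    """Ratio of passing tests to failing tests."""
    return passed / failed if failed else float("inf")

# A suite of 200 meaningful tests finds 50 failures.
print(pass_to_fail_ratio(150, 50))        # 3.0

# Add 800 trivial always-pass checks; nothing about the product changed.
print(pass_to_fail_ratio(150 + 800, 50))  # 19.0
```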

But Michael takes it a step further and says that using this metric, or providing it to a client, is unethical.

UNETHICAL?

Really?  UNETHICAL?!?

If you give this metric to someone (after they ask, and you say it’s not very good, and they say, really-I-want-it):

  • Are you lying?
  • Are you taking something that doesn’t belong to you?
  • Are you oppressing someone? Intimidating them?
  • Are you hurting someone?
  • Are you cheating anyone?
  • Are you pretending to skills or knowledge that you don’t have?
  • Are you helping someone else lie, cheat, steal, intimidate, or cause harm?

I used to associate shrill accusations of unethical conduct with conservatives who were losing control of the hearts and minds of the software development community and didn’t like it; or who were pushing a phony image of community consensus as part of their campaigns to win big contracts, especially big government contracts; or who were using the accusation of unethical as a way of shutting down discussion of whether an idea (unethical!) was any good or not.

Maybe you’ve met some of these people. They said things like:

  • It is unethical to write code if you don’t have formal, written requirements
  • It is unethical to test a program if you don’t have written specifications
  • It is unethical to do exploratory testing
  • It is unethical to manage software projects without formal measurement programs
  • It is unethical to count lines of code instead of using function points
  • It is unethical to count function points instead of lines of code
  • It is unethical to not adopt best practices
  • It is unethical to write or design code if you don’t have the right degree or certificate
  • It should be unethical to write code if you don’t have a license

It seemed to me that some of the people (but not all of them) who said these things were trying to prop up a losing point of view with fear, uncertainty, and doubt: they were using demagoguery as their marketing technique. That I saw as unethical.

Much of my contribution to the social infrastructure of software testing was a conscious rebellion against a closed old boys’ network that defended itself with dogma and attacked non-conformers as unethical.

wrong versus Wrong

So what’s with this “Using a crummy metric is unethical”?

Over the past couple of years, I’ve seen a resurgence of ethics-rhetoric. A new set of people have a new set of bad things to condemn:

  • Now it seems to be unethical to have a certification in software testing that someone doesn’t like
  • Now it seems to be unethical to follow a heavyweight (heavily documented, scripted) style of testing
  • Now it seems to be unethical to give a client some data that the client asks for, like a ratio of passing tests to failing ones.

I don’t think these are usually good ideas. In fact, most of the time, I think they’re wrong.

But UNETHICAL?!?

I’m not a moral relativist. I think there is evil in the world and I sometimes protest loudly against it. But I think it is essential to differentiate between:

  • someone is wrong (mistaken)
  • someone is wrong (attempting something that won’t work), and
  • someone is Wrong (unethical).

Let me illustrate the difference. Michael Bolton is a friend of mine. I have a lot of respect for him as a person, including as an ethical being. His blog post is a convenient example of something I think is a broader problem, but please read my comments on his article as an assertion that I think Michael is wrong (not Wrong).

To the extent that we lose track of the difference between wrong and Wrong, I think we damage our ability to treat people who disagree with us with respect. I think we damage our ability to communicate about our professional differences. I think we damage our ability to learn, because the people we most agree with probably have fewer new things to teach us than the people who see the world a little differently.

The difference between wrong and Wrong is especially important for testers who want to think of ourselves (or market ourselves) as context-driven.

Because we understand that what is wrong in some contexts is right in some others.

Contexts differ.

Context-driven testing is not a religion

So, did I really say that context-driven testing is dead? No, that was some other guy (Scott Barber) who’s using the buzz to launch a different idea. It’s effective marketing, and Scott has interesting ideas. But that’s his assertion, not mine.

What I wrote a few days ago was this:

If there ever was one context-driven school, there is not one now.

A “school” provides an organizing social structure for a body of attitudes and knowledge. Schools are often led by one or a few highly visible people.

Over the past few years, several people have gained visibility in the testing community who express ideas and values that sound context-driven to me. Some call themselves context-driven, some don’t. My impression is that some are being told they are not welcome. Others are uncomfortable with a perceived orthodoxy. They like the approach but not the school. They like the ideas, but not the politics.

The context-driven school appeared for years to operate with unified leadership. This appearance was a strength. But it was never quite true: Brian and Bret left early (but they left quietly). I’ve repeatedly raised concerns about the context-driven rhetoric, but relatively quietly. James and I haven’t collaborated successfully for years–this is old news–but for most of that time, our public disagreements were pretty quiet.

I think it is time to go beyond the past illusion of unity, to welcome a new generation of leadership. Not just a new generation of followers. A new generation of leaders. And to embrace their diversity.

There is not one school. There might be none. There might be several. I’m not sure what our real status is today. There will be an evolution and I look forward to seeing the result.

For now, I continue to be enthusiastic about the approach. I still endorse the principles. But what I understand to be the meanings and implications of the principles might not be exactly the same as what you understand. I think that’s OK.

In terms of the politics of The One School, my perception is of an exclusionary tone that has become more emphatic over time. I think this can make good marketing–entertaining presentations, lots of excitement. But does it serve its community? What is the impact on the people who are actually doing the testing: looking for work; looking for advancement in their own careers; striving to increase their skills and professionalism?

For many people, the impact is minimal–they follow their own way.

But for people who align themselves with the school, I think there are risks.

I wasn’t able to travel to CAST last year (health problem), so I watched sessions on video. Watching remotely let me look at things with a different perspective. One of the striking themes in what I saw was a mistrust of test automation. Hey, I agree that regression test automation is a poor basis for an effective, comprehensive testing strategy, but the mistrust went beyond that. Manual (session-based, of course) exploratory testing had become a Best Practice.

In the field of software development, I think that people who don’t know much about how to develop software are on a path to lower pay and less job security. Testing-process consultants can be very successful without staying current in these areas of knowledge and skill. But the people they consult to? Not so much.

It was not the details that concerned me. It was the tone. I felt as though I was watching the closing of minds.

I have been concerned about this ever since people in our community (not just our critics–us!) started drawing an analogy between context-driven testing and religion.

As James put it in 2008, “I have my own testing religion (the Context-Driven School).” I objected to it back then, and since then. This is deeply inconsistent with what I signed up for when we declared a school.

An analogy to religion often carries baggage: Divine sources of knowledge; Knowledge of The Truth; Public disagreement with The Truth is Heresy; An attitude that alternative views are irrelevant; An attitude that alternative views are morally wrong.

Here’s an illustration from James’ most recent post:

“One of the things that concerns Cem is the polarization of the craft. He doesn’t like it, anymore. I suppose he wants more listening to people who have different views about whether there are best practices or not. To me, that’s unwise. It empties the concept of much of its power. And frankly, it makes a mockery of what we have stood for. To me, that would be like a Newtonian physicist in the 1690’s wistfully wishing to “share ideas” with the Aristotelians. There’s no point. The Aristotelians were on a completely different path.”

This illustrates exactly what troubles me. In my view, there are legitimate differences in the testing community. I think that each of the major factions in the testing community has some very smart people, of high integrity, who are worth paying attention to. I’ve learned a lot from people who would never associate themselves with context-driven testing.

Let me illustrate that with some notes on my last week (Feb 27 to March 2):

  • My students and I reviewed Raza Abbas Syed’s M.Sc. thesis in Computer Science: Investigating Intermittent Software Failures. The supervisor of this work was Dr. Laurie Williams. If she identified herself with any school of software testing, it would probably be Agile, not Context-Driven. But, not surprisingly, the work presented some useful data and suggested interesting ideas. I learned things from it. Should I really stop paying attention to Laurie Williams?
  • Yesterday, Dr. Keith Gallagher gave a guest lecture in my programmer-testing course on program slicing (see Gallagher & Lyle, 1991, and Gallagher & Binkley, 1996). This is a cluster of testing/maintenance techniques that haven’t achieved widespread adoption. The tools needed to support that adoption don’t exist yet. Creating them will be very difficult. This is classic Analytical School stuff. But his lecture made me want to learn more about it because it presents a glimpse of an interesting future.
  • This evening, I’m reading Kasurinen, Taipale & Smolander’s paper, Software Test Automation in Practice: Empirical Observations. One of my students and I will work through it tomorrow. I’m not sure how these folks would classify themselves (or if they would). Probably if they had to self-classify, it would be Analytical School. Comparing myself to a modern Newtonian physicist and them to outdated Aristotelians strikes me as one part arrogant and five parts wrong.

I think it’s a Bad Idea to alienate, ignore, or marginalize people who do hard work on interesting problems.

James says later in his post,

“We must have the stomach to keep moving along with our program regardless of the huddled masses who Don’t Get It.”

I respect the right of any individual to seek his or her own level of ignorance.

But I see it as a disservice to the craft when thought-leaders encourage narrow-mindedness in the people who look to them for guidance.

When I was an undergraduate, I studied mainly math and philosophy. Of the philosophy, I studied mainly Indian philosophy, about 5 semesters’ worth. My step-grandmother was a Buddhist. Friends of mine had consistent views. I was motivated to take the ideas seriously.

One of the profound ideas in those courses was a rejection of the law of the excluded middle. According to that law, if A is a proposition, then A must be true or Not-A must be true (and classical logic adds, separately, that they cannot both be true). Some of the Indian texts rejected that. They demanded that the reader consider {A and Not-A} and {neither A nor Not-A}. In terms of the logic of mathematics, this makes no sense (and it is not a view I associate with Indian logicians). But in terms of human affairs, I think the rejection of the law of the excluded middle is a powerful cognitive tool.
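As a rough formalization of the contrast (my own gloss, not anything drawn from those texts; I understand the Buddhist version of the four alternatives to be called the catuṣkoṭi):

```latex
% Classical law of the excluded middle: for any proposition A,
\[ A \lor \neg A \]
% The four positions the reader is asked to weigh:
\[ A, \qquad \neg A, \qquad A \land \neg A, \qquad \neg(A \lor \neg A) \]
```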

I have thought that for about 40 years. I brought that with me in my part of the crafting of the context-driven principles. Something can be the right thing for your context and its opposite can be the right thing for my context.

I think we need to look more sympathetically at more contexts and more solutions. To ask more about what is right with alternative ideas and what we can learn from them. And to develop batteries of skills to work with them. For that, I think we need to get past the politics of The One School of context-driven testing.