The Danger of Measuring Effectiveness, Efficiency, and Satisfaction

We hear a lot about a product being “usable.” And we hear a lot about “usability testing.” But what exactly is “usability”? And what do we really need to assess in testing?

The international standard ISO 9126 defines usability this way:

A set of attributes that bear on the effort needed for use, and on the individual assessment of such use, by a stated or implied set of users.

By this definition, usability is both how hard something is to use and how hard the people using it think it is to use. And, in this definition, it is not each element itself but rather the “set of attributes” of the product that affect these issues.

ISO 9241 defines it a bit differently as:

The extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use.

This definition also ties usability to the effort of performing tasks in a defined context. It is a bit more explicit in that it names the specific goals users are trying to achieve, and it introduces the concepts of “effectiveness, efficiency and satisfaction.” However, the ISO 9241 definition seems to have lost, or at least buried, the concept that it is the set of product attributes that determines usability.

Note that ISO 9241 does not define usability as “The effectiveness, efficiency, and satisfaction with which a specific set of users can perform specific tasks in a specific environment,” which is how it is sometimes stated. Probably because of this reinterpretation, many practitioners try to measure effectiveness, efficiency, and satisfaction believing they are measuring usability. However, this is backwards. The ISO 9241 definition first states that the user has to perform the task, then it states that their performance should be with effectiveness, efficiency, and satisfaction. Some people have even tried (with little success) to create a formula to combine measures of effectiveness, efficiency, and satisfaction into a single über measure of usability. Others simply state that these three dimensions have a “complex relationship.”

So what’s wrong with measuring effectiveness, efficiency, and satisfaction? Let’s look at each in turn.

[Image: golden stick figures, one saying “Danger” through a megaphone, two others pulling out a measuring tape]

Effectiveness is actually the clearest of the three elements, since it is often redundant with the base definition. Effectiveness is the ability to perform a task. The way to assess if a user can perform a task is to ask a simple question: Did they, or did they not, perform the task? If the task is simple to define — like locate a specific piece of information, keep a complex system in stasis, or kill all the Zombies before time runs out — the task is binary. You do it or you don’t do it. More importantly, everyone watching would agree on the outcome of the test. In testing we call this “interrater reliability”; all raters watching the task would agree if the task was performed correctly or not.
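
To make interrater reliability concrete, here is a minimal Python sketch of how agreement between two observers on binary task completion could be quantified. The function names and the session data are hypothetical, invented purely for illustration. Cohen's kappa is one common agreement statistic, offered here as an assumption about how one might check agreement, not as a method this article prescribes.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of sessions on which both raters recorded the same outcome."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and the same length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    # Chance agreement: probability both raters pick the same label at random,
    # given how often each rater used each label.
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every session
    return (observed - expected) / (1 - expected)


# Hypothetical observations: True means "the participant completed the task".
rater_a = [True, True, False, True, False, True, True, False]
rater_b = [True, True, False, True, True, True, True, False]

print(percent_agreement(rater_a, rater_b))  # 0.875
print(round(cohens_kappa(rater_a, rater_b), 3))  # 0.714
```

When a task is truly binary, both numbers should sit at or near 1.0. A noticeably lower kappa would suggest that "completed the task" was not as clearly defined as the test plan assumed.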

But what if the task is not so easy to assess as completed? What if the product is designed to aid in decision making? Can you really assess if the user made the right decision? (If it was so easy to assess that everyone agreed, you probably didn’t need a product to help them make the decision.) Or what if the system is designed to help someone learn something new? How do we test if a person successfully learned from the system? And, even if we tried, are we testing the product’s ability to present learning material or the specific user’s ability to learn that material? (Yes, it’s possible to tell these apart, but it’s far from easy.)

Efficiency is the most misunderstood element. How do we measure efficiency? Is it the time required to perform the task? Is it the number of keystrokes, steps, or screens used? All of these have been used, and are often discussed in the literature, as measures of efficiency. But are they really measuring usability?

Time is certainly the easiest and most commonly used measure of efficiency, but what is the right amount of time for a task? Is there even such a thing? Yes, if there is a missile on the way and we need to intercept it, there may be an externally imposed time constraint on task performance, but for most systems (thankfully) this is not the case. If there is no external constraint, what’s the basis for determining that something took too long? Or has too many steps? Or has too many keystrokes? Or uses too many screens?

Users don’t count keystrokes or steps or screens while they work. If we’re engaged in a task, making progress, we’re fine and mostly oblivious to these so-called measures of efficiency. Time, steps, keystrokes, and screens only matter when the user becomes aware of them; and then only if they give up. In other words, these so-called measures of efficiency, unless externally dictated, are only important when they show up in effectiveness measurements. Contrary to common usage, time is actually one of the worst measures for evaluating usability. If we could measure time internally, we wouldn’t wear watches.

There is an exception to this. If you had two products that show equal performance, and if the user tried both products, and if the user was aware they could do the task equally well with both products, and if the differences in time, keystrokes, steps, or screens were enough to be noticed while working, then users might say they prefer one product over the other. Otherwise, users wouldn’t even notice, and they’d have no basis for saying the system is inefficient in any of these dimensions.

To reconcile the apparent difference between the two ISO definitions, it’s important to see efficiency in terms of the measure that really matters: cognitive effort. There are ways to measure cognitive effort; but again, what is the right amount for a given task? The same rule applies to cognitive effort as to the other so-called measures of efficiency. Cognitive effort matters if:

  • it affects performance,
  • the user is aware that the effort required is high, or
  • the user has an alternative product they can use with equal efficiency but with lower cognitive effort in comparison.

In one publicized example project, we took a system with a high cognitive workload (an IVR system for the Internal Revenue Service) and increased the number of system menus, but made those menus shorter. This increased the externally measured task performance time. However, all users directly comparing the two systems reported that the new system was faster than the old one, despite the increase in task performance time. Users simply can’t tell time and don’t count keystrokes, steps, or screens while they are engaged in work.

Satisfaction is specified in both ISO definitions. However, in ISO 9126 the phrase “and on the individual assessment of such use” clearly separates it from the effort required (efficiency) to perform the task (effectiveness). This is something that is not clear in ISO 9241. As the authors of ISO 9241 seem to have understood, satisfaction is very hard to equate to performance since people are not aware of their actual performance. As the anthropologist Margaret Mead said: “What people say, what people do, and what they say they do are entirely different things.”

In research we conducted on voting systems for the National Institute of Standards and Technology, we tested five different voting systems, finding statistically significant differences in performance on each one (none of them produced 100% perfect performance for all users). In testing one paper ballot system against a direct-recording electronic system in five rounds of tests, we found both statistically significant and statistically reliable differences in all tests. Yet, with almost 2,000 people participating, and scores on all systems below 100% for all users, we had over a 99.9% satisfaction rate. Almost universally, people were satisfied with the system they used and believed they had performed the task without error, despite their actual performance (which varied widely).

Despite its popularity, ISO 9241 has caused many practitioners to pursue indirect, misleading, or even incorrect measures, believing they are measuring a product’s usability. It is possible, though it may be difficult, to measure effectiveness, efficiency (provided efficiency is defined correctly), and satisfaction. Efficiency is hard to combine with effectiveness, since most measures of efficiency are both different from each other and incorrect. Even the relationship of cognitive workload and performance is not well understood, except at the extreme levels (very high or very low cognitive workload). Satisfaction is often a completely independent measure from effectiveness and efficiency. It can be based on many things, but the user’s performance is the least of them (unless that performance is blatantly obvious to the user while using the system).

Though we certainly want to know if a user can perform a task (a theme central to both definitions), the focus should be more on the attributes of the product that determine if this is possible (the clear theme of the ISO 9126 definition). Redirecting the focus to the product, rather than to the specific user’s performance in testing, is critical to understanding a product’s usability, particularly given the non-random, non-representative, and extremely small samples used in almost all usability tests.

There are seven attributes of a product that need to be evaluated in usability testing. These attributes directly affect the user’s ability to perform tasks, as well as affect efficiency (measured as cognitive effort) and even user satisfaction. We’ll discuss these attributes in next month’s newsletter.