Qualitative (Formative) versus Quantitative (Summative) Testing

For some time now I had been planning to write about the concepts of qualitative and quantitative testing, which are called formative and summative testing, respectively, in usability circles, when I came across a very interesting document.

On February 3rd of this year, the FDA published a document entitled Applying Human Factors and Usability Engineering to Medical Devices. This is an update of a July 2000 document with a similar title.


This is probably one of the best documents I’ve seen in some time on testing methods. The authors warn of several common issues in conducting usability evaluations, such as relying on user preference ratings that focus on “user acceptance” or “ease of use.” They explain that reporting something like “users are extremely satisfied with our new design” tells you nothing about the actual effectiveness of the device. They also warn against a “checkbox” approach to testing, in which a design is evaluated against “arbitrary pass/fail criteria (i.e. <10% failure rate on critical tasks)”. Such arbitrary acceptance criteria lack validity (both construct validity and external validity, or generalizability) and leave open the real questions: what happened to the ten percent or so who had issues? How severe are those issues? How likely are they to occur in the field? As the authors state: “the manufacturer must identify the causes of use errors and determine acceptability based on the potential for and severity of harm or ineffective treatment.” This leads me back to summative versus formative, and qualitative versus quantitative, testing.

The usability community often refers to conducting “formative testing.” Though the term originated in the education field to describe testing that helps an instructor assess the quality of the training (as opposed to assessing the knowledge of the student), the usability community uses it for testing that assesses an interface’s strengths and weaknesses to help “form the design.” In every other testing community, this has always been called qualitative research, which has the goal of gaining insight into a problem or into possible solutions. It can be done with small samples, can be disruptive in nature, and can still yield valuable results. Its best and preferred use is one or more times during the design process to check out ideas, but it relies on the observer’s ability to distinguish an actual finding from a testing anomaly or outlier.

By contrast, the usability community has often referred to an alternate form of testing as “summative testing.” Again, the term comes from education, where it refers to assessing the knowledge of a student. The User Experience Professionals Association’s Usability Body of Knowledge describes summative testing as “used to obtain measures to establish a usability benchmark or to compare results with usability requirements.” The term used for this type of testing in other communities is quantitative testing, which is designed to allow generalization of results from a sample to an entire population of interest. It can also be used to test differences between populations, or to compare a population’s results to a criterion. This type of testing can yield valid statistical data that predicts future performance. To do so, however, quantitative testing requires significant controls on the test procedures, including ensuring that the data are not corrupted by the data collection process itself (internal validity) and that the sampling is sufficient for the results to generalize to the larger population (external validity).

External validity requires a representative sample of users (which often means a very large sample), which in turn requires time and resources well beyond the scope of most design and development projects. In what is probably an attempt to avoid this issue, the FDA document uses human factors validation testing as a synonym for summative testing, but is quick to point out that “summative usability testing can be defined differently and some definitions omit essential components of human factors validation testing as described in this guidance document.” When quantitative research is conducted with small populations that are not properly sampled through a stratified sampling method, the results cannot be generalized to the entire population. For this reason alone, the data obtained from any test of this type (call it quantitative, summative, or human factors validation) have to be considered qualitative in nature. The FDA also points out that human factors validation testing, with its recommended fifteen participants, is qualitative testing. The results of this testing may produce a large number of findings, but you cannot say that the number of findings is representative of all issues, or that a different group of users wouldn’t produce a different number or set of issues.
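To put some rough numbers on that last point, here is a quick sketch (my own illustration, not something taken from the FDA document) of how little a fifteen-person test can tell you about prevalence. Suppose one participant out of fifteen fails a critical task, and we compute a Wilson score confidence interval for the true failure rate in the wider population; the counts are hypothetical.

```python
import math

def wilson_interval(failures, n, z=1.96):
    """Wilson score confidence interval (95% by default) for a proportion."""
    p_hat = failures / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)

# Hypothetical result: 1 of 15 participants fails a critical task.
low, high = wilson_interval(failures=1, n=15)
print(f"Observed failure rate: {1/15:.1%}")                  # 6.7%
print(f"95% confidence interval: {low:.1%} to {high:.1%}")   # roughly 1% to 30%
```

With fifteen participants, the plausible range for the true failure rate runs from about one percent to about thirty percent, which is exactly why the results of such a test cannot be treated as a measure of prevalence.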

This all seems to fly in the face of published research promising that a small number of participants is enough to find a large percentage of all your usability issues. I would love to list all of the points I usually use to try to dispel this myth, but the FDA seems to have done it for me. They state:

“Published estimates of the number of test participants required to identify all problems that exist in a user interface are based on a set of assumptions regarding: a fixed (and known) probability of encountering a problem, a uniform likelihood for each participant to encounter each problem, and the independence of the problems (that is, encountering one problem will not increase or decrease the likelihood of finding other problems). However, none of these assumptions reflects the real world. Most importantly, individual likelihoods of encountering a problem with a user interface vary considerably, depending on the user’s personal capabilities, knowledge and experience levels, nature of interaction with the device, frequency of task performance, attributes of the use environment and use conditions, and the nature of the problem.”
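For context, the published estimates the FDA is critiquing generally rest on the simple problem-discovery model P(found) = 1 - (1 - p)^n, where p is the probability that any single participant encounters a given problem and n is the number of participants (this is my summary of the usual model, not a formula quoted in the FDA document). A few lines of arithmetic show how sensitive the model is to the assumed value of p; the p values below are purely illustrative.

```python
# Simple discovery model behind the "five users find most problems" claim:
# P(found) = 1 - (1 - p)^n, where p is the per-participant probability of
# encountering a given problem. The p values here are illustrative only.
def proportion_found(p, n):
    return 1 - (1 - p) ** n

for p in (0.31, 0.10, 0.05):
    print(f"p = {p:.2f}: 5 users -> {proportion_found(p, 5):.1%}, "
          f"15 users -> {proportion_found(p, 15):.1%}")

# p = 0.31: 5 users -> 84.4%, 15 users -> 99.6%
# p = 0.10: 5 users -> 41.0%, 15 users -> 79.4%
# p = 0.05: 5 users -> 22.6%, 15 users -> 53.7%
```

If every problem really were equally likely to be seen by every participant at a p around 0.3, five users would indeed surface most problems; for rarer or more variable problems the same five users find far less, which is precisely the FDA’s objection.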

To resolve the difference between these two perspectives (why small user groups can generate findings even though the results are not quantitative), it is important to look at where the findings come from. In qualitative testing, the findings are whatever the observer identifies. If you watch just one person walk down the hall and trip, you may be able to determine whether this is a real issue or whether this person is just nervous, a klutz, or something else. If you notice the carpet edge is buckled, you know it’s an issue because you recognize the cause, and you can report it as something to be fixed. One person: one finding. You don’t need to watch hundreds of people trip or calculate the percentage of people likely to trip in a day. This gets harder when we’re talking about what goes on in the user’s mind, but it’s essentially the same process, as long as you know what to look for.

In fact, qualitative testing allows the observer to report an issue that no one in the small group of users actually experienced, as long as the observer can explain the issue. However, when you need to know how pervasive an issue is, you need quantitative research. In quantitative research, the data come from a statistical analysis of overall user behavior, and that cannot be determined from a small group. To use the words of comedian Lewis Black, we’re all like snowflakes. Each of us is unique. If you want to describe the general nature of snowflakes, you’re going to have to look at a lot of them. Looking at five, ten, or fifteen snowflakes will never be good enough to determine prevalence.