Curriculum Innovation, Testing and Evaluation: Proceedings of the 1st Annual JALT Pan-SIG Conference.
May 11-12, 2002. Kyoto, Japan: Kyoto Institute of Technology.

Annotated cloze:
A viable alternative to verbal reports in data-elicitation?

by Bob Gibson (Keio University)

It is widely accepted (cf. Alderson, 2000) that deeper knowledge of the mental operations test-takers employ in choosing or constructing their responses can be of value in understanding why, rather than merely how often, items are answered correctly or not. One way of getting at this information is through 'verbal reports' (VR), in which takers of a more-or-less realistic task introspect in real time, and/or retrospect, about their mental processes during the task. VR is undeniably a useful resource in looking at 'black box' processes not easily accessed through observation or experiment but, as this paper will point out, it also has some conceptual and practical drawbacks.
The use of VR data dates from at least the 19th century. Though it was later dismissed by behaviorists as lacking scientific rigor, it has been used widely in problem-solving studies and investigations of expert knowledge. In recent decades VR has been rehabilitated as a mode of enquiry, and the intellectual underpinning of that revival may largely be found in Ericsson and Simon's 1984 monograph Protocol analysis: Verbal reports as data. Even today this is regarded as a definitive text in the field. How carefully the book is read may be another matter: Boren and Ramey (2000) observed that the procedures and structures outlined in that text are by no means followed in all studies purporting to employ VR.

Types of VR data

Andrew Cohen has written extensively about the gathering of VR data, and his 1998 book offers a plausible and widely-cited taxonomy of VR types (illustrated here from my own data). These are the labels:

Self Revelation: "think-aloud, stream of consciousness disclosure of thought processes"
[... so... uh... this must be has... have?... no has]
Self Observation: introspective or just retrospective (within 20 seconds) inspection of specific, not generalized, language behavior
[okay...I could get this one because...uh...earlier it mentioned eels... plural right? so it must be are equipped with cells...electric cells]
Self Report: "learners' descriptions of what they do, characterized by general statements"
[well... I look back a little or forward...and if I still don't get it I leave it and come back later...well usually (laughs)]

Although verbal reports usually comprise some combination of these sub-types (Cohen, op. cit.), the ideal persists that VR data should be both immediate and unrefined (i.e. corresponding to Cohen's self revelation). This is thought to be the most authentic and hence valid type of VR data, in that the longer the chronological interval or cognitive distance between a processing event and its reporting, the greater the risk that informants will reconstruct rather than recall their processing actions. This need not imply conscious dissembling on the part of informants, but the risk is real. Some of my informants, for example, downplayed or even denied certain behaviors – even though they were clearly identifiable on tape.

Some practical problems of VR

Some difficulties in using VR data are taken up by Boren and Ramey (2000) and Cohen (2000). In my view, a VR study can only be properly evaluated and compared to other VR studies if these key attributes are present:
  1. a sufficiently detailed description of the stimulus task(s) used. If the task was, for example, a 'cloze passage', what was the deletion ratio? Were items targeted for deletion at a fixed interval (every nth word) or on some other basis (e.g. adverbials only)?
  2. details of any training in, or model of, verbal reporting which informants were given. My own explorations of VR indicate that such training or modeling may influence the range of processing behavior informants use.
  3. details of how the language of reporting was chosen. Were informants allowed to think aloud in their own language or in an L2? The use of an L2, after all, may significantly limit what even fairly fluent informants can report.
  4. a clear picture of the degree of researcher-informant interaction during the task. Some protocols labeled 'think aloud' contain so many researcher questions and requests for clarification that they resemble interviews rather than self-revelations or self-observations.
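To make attribute 1 concrete, the following minimal Python sketch generates a fixed-ratio cloze passage by deleting every nth word. It is an illustration only, not the procedure used to prepare the passages in this study: the function name make_cloze, the parameters n and start, the blank format and the sample sentence are all assumptions introduced here.

```python
import re


def make_cloze(text, n=7, start=2, blank="____"):
    """Delete every n-th word after leaving the first `start` words intact.

    Returns the gapped text and the answer key in blank order.
    (Hypothetical helper; names and defaults are illustrative.)
    """
    words = text.split()
    key = []
    for i in range(start + n - 1, len(words), n):
        # Separate leading/trailing punctuation so it survives the deletion.
        lead, core, trail = re.match(r"^(\W*)(.*?)(\W*)$", words[i]).groups()
        if not core:
            continue  # token was punctuation only; nothing to delete
        key.append(core)
        words[i] = f"{lead}({len(key)}) {blank}{trail}"
    return " ".join(words), key


# Invented sample sentence, used purely for demonstration.
passage = "Electric eels are equipped with cells that store power like tiny batteries."
gapped, key = make_cloze(passage, n=4, start=2)
print(gapped)
print(key)
```

Reporting the values of n and start (or, for a non-random procedure, the rule used to pick targets) is the kind of detail that lets one VR study be compared with another.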

Although by no means all VR studies conform to Ericsson and Simon's guidelines, uncritically adhering to these guidelines may create problems where, arguably, none need exist (Boren and Ramey, op. cit.). To take one example, Cavalcanti (in Faerch and Kasper, 1987) claims that the tendency of untrained VR informants to read aloud long spans of text (and then, in effect, to retrospect about these) can be eliminated through informant training. This behavior, she implies, does not produce the immediate, unstructured data idealized by the Ericsson and Simon model. Though I have used VR procedures with over 70 informants in the last ten years, I cannot recall even one informant who did not at least occasionally display the behavior which Cavalcanti proposes to train away. I would argue that until we know more about the full range of authentic processing and reporting styles informants may resort to, we need to be very careful about how we try to constrain these.
On a practical level, informants' self-revelatory 'inner speech' externalized as recorded and/or observed verbal data may not be readily comprehensible even to a researcher fluent in the language of reporting. Triangulatory data is typically needed, often in the form of post-task retrospection and/or task-concurrent interruptive questioning. This may call for a sequence of one-on-one data-gathering sessions, with recorded protocols requiring transcription and analysis.

The very high investment of researcher time which VR requires may severely limit sample sizes and, though not inevitably a problem in itself, this may restrict the use of statistical procedures. A typically overlooked problem, and one perhaps more serious in Japanese contexts, is that the longer the research period is drawn out, the greater the risk that the security of the materials and procedures used may be compromised. Japanese informants often try to glean in advance as much information as possible about what awaits them, especially if the research focuses on test-taking. 'Snowball' sampling, in which initial informants recruit subsequent informants, may be especially prone to this risk.

'Improved' VR procedures

Clearly, there is a strong motivation to find ways of making VR data fuller and more comprehensible. Kirsten Haastrup (in Faerch and Kasper, 1987) is one of a number of researchers who have made use of pair (also sometimes termed 'dyad') reporting. In this condition, informants work together, either as peers solving a task co-operatively, or in a kind of tutor-pupil relationship in which one 'teaches' the other how a task can be completed. In my own data-gathering, a majority of the informants who had experienced both solo and pair verbal reporting, regardless of gender or first-language background, preferred the latter. Pair verbal reporting was typically rated as both less stressful and more satisfying in terms of how comprehensively the informants thought they had been able to report their processing. The volume of data produced under pair reporting conditions is typically higher and, given that it is aimed at another person, it is inevitably more consistently understandable to the researcher/analyst.
That said, pair reporting may not be an ideal way to improve the quality of verbal reports for test-like tasks. Not only is pair reporting less authentic in terms of the typically solitary task of test-taking, but the structuring of the data necessary for communication to a partner might skew the information in some way. There may thus be an unavoidable trade-off between making VR more practically workable and adhering to the model proposed by Ericsson and Simon.
A perhaps more serious drawback of pair reporting is that an informant may leave unmentioned any information which she supposes that her partner already possesses, or any inference which she supposes her partner will already have drawn. Given that pair verbal reporting is to a large extent a kind of conversation, it would not be surprising if the Gricean conversational maxim of quantity ('Don't tell your interlocutor what s/he already knows.') should apply. The question of how far this suppression of 'shared information' affects the overall picture of informants' task-processing calls for further study.

An alternative to verbal-report data?

In the face of these difficulties, I attempted to construct a more time-efficient data-elicitation procedure in which informants themselves take on the task of identifying their task-processing behaviors in terms of an a priori set of mental operations. I have applied the procedure most extensively to the investigation of cloze passage processing behavior, and so have labeled it 'Annotated Cloze', or AC. This is how it works.
Potential AC informants are first introduced to the set of codes – a set of labels which identify individual cloze task processing operations previously isolated from the protocols of VR informants. There are just under forty individual codes, and informants' orientation to these is discussed in more detail below.
In addition to the codes, there are a number of markup conventions designed to allow informants to efficiently record particular aspects of their processing. One example is the markup convention of underlining those parts of a cloze passage which were translated during processing. This markup convention not only provides a backup or cross-check to the code 'TR(anslation)' which informants are asked to use to indicate their use of L1 translation in filling a cloze blank, but also allows them to show more precisely which parts of a text were translated, from a single word to a supra-sentential chunk. AC informants are also encouraged to supplement their codes and markups with short L1 or target-language written comments in spaces provided in each manuscript whenever they think this may clarify or expand on their record of their task-processing.

Frequently observed applications of this option include the offering of 'second choice' filler words, and elucidation of prior knowledge ("I learned in school that the [original] marathon runner died at the end of his race.", etc.) which aided filling of the blank.
Number-scales may be added to the boxes in order to record such features of processing as the perceived difficulty of filling a cloze blank, the informant's confidence in her choice of filler, etc. Perhaps because they are difficult to overlook, number scales attract a high rate of response, even for those (perhaps easier) cloze blanks such as articles and prepositions, which are often filled 'silently' in verbal report protocols.
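One way to picture the record that a single cloze blank yields under AC is sketched below in Python. Only the code 'TR' comes from the procedure described above; the class BlankAnnotation, its field names, the placeholder code 'CODE_A', and the assumption that underlined spans can be tied to individual blanks are hypothetical and introduced purely for illustration. The function shows how the underlining markup could serve as a cross-check on the 'TR' code.

```python
from dataclasses import dataclass, field


@dataclass
class BlankAnnotation:
    """Hypothetical record of one informant's annotation of one cloze blank."""
    blank_no: int
    filler: str
    codes: list = field(default_factory=list)       # behavior codes, e.g. "TR"
    underlined: list = field(default_factory=list)  # spans marked as translated
    comment: str = ""                                # optional written comment
    scales: dict = field(default_factory=dict)       # e.g. {"difficulty": 3}


def translation_mismatches(annotations):
    """Return blank numbers where the 'TR' code and the underlining disagree."""
    mismatched = []
    for a in annotations:
        has_code = "TR" in a.codes
        has_markup = bool(a.underlined)
        if has_code != has_markup:
            mismatched.append(a.blank_no)
    return mismatched


# Invented sample records; "CODE_A" is a placeholder, not one of the study's codes.
sample = [
    BlankAnnotation(1, "are", codes=["TR"], underlined=["equipped with cells"]),
    BlankAnnotation(2, "has", codes=["TR"]),
    BlankAnnotation(3, "the", codes=["CODE_A"], scales={"difficulty": 2}),
]
print(translation_mismatches(sample))
```

A blank flagged in this way is not necessarily an error on the informant's part; it simply marks a point where the two records could usefully be queried.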

Orientating AC informants

A two-stage procedure is used to orient AC informants. Both stages are described in detail.

Stage 1: Informants explore the behavior codes

Various means of introducing informants to the set of behavior codes and markup conventions were trialled. The current procedure involves the primary presentation of codes on a set of individual cards such as those shown in Fig. 1.
Figure 1: Some sample behaviour codes used by the author in this study.

This presentation of codes on individual cards appears to be more efficient and less intimidating than the previous method of distributing a list of almost forty codes. It also seems to minimize informants' perceptions of the set of codes as a monolithic whole, and to encourage them to, as it were, edit the set of codes to match their own processing styles. It must be borne in mind that while the full set of AC codes reflects all the task-processing behaviors regularly isolated in the protocols of VR informants, an individual informant is unlikely to use more than an unpredictable subset of these behaviors in her own task processing. Rather than provide a list to be memorized, the set of card-presented codes becomes a reference resource. This inevitably means that informants are unlikely to be working with exactly the same set of codes, but the next stage provides a context in which individual informants' applications of codes may be shared and revised.

Stage 2: Informants apply the behaviour codes

One or more practice sessions follow, in which informants (who typically work in groups of two or three) identify and code their own conscious processing behaviors while working through the stimulus task(s) presented on AC manuscripts. These are the cloze passage worksheets in which informants fill blanks and encode, mark up and (optionally) comment on their processing. A good deal of mutual cross-checking and discussion is to be expected as informants make sense of their interaction with the task in terms of the codes available. An informant may, for example, discard certain codes as irrelevant to her processing style, but later re-incorporate these into her personal set.
The mechanics of recording codes, markups and comments need to be as straightforward as possible. After experimenting with various manuscript formats, I currently prefer to present a text in the middle of an A3 page, with numbered boxes corresponding to odd-numbered cloze blanks down one side, and even-numbered blanks down the other. These boxes are designed to be large enough for informants to enter not only the appropriate processing behavior codes, but also short written comments. This format appears to work reasonably well, and is easily formatted and printed. A sample completed version is available at http// [dead link].
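The box arrangement just described can be expressed in a few lines of Python. This is only a sketch of the layout logic: the function name box_columns is an assumption, and which side of the page carries the odd-numbered boxes is not specified here, so the two lists are simply labeled odd and even.

```python
def box_columns(num_blanks):
    """Split blank numbers 1..num_blanks into the two margin columns.

    Odd-numbered boxes go down one side of the page and even-numbered
    boxes down the other. (Hypothetical helper for illustration.)
    """
    odd = list(range(1, num_blanks + 1, 2))
    even = list(range(2, num_blanks + 1, 2))
    return odd, even


odd_side, even_side = box_columns(7)
print("one margin:  ", odd_side)
print("other margin:", even_side)
```

Alternating the boxes in this way keeps each box roughly level with its blank while leaving enough vertical room on both margins for codes and comments.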

Benefits of AC over VR

How does AC compare with VR as a data-gathering tool? While the gathering of initial verbal report data appears to be indispensable in constructing a set of AC codes, there are good reasons for considering a procedure like AC in subsequent data-collection. These concern the relative workloads of VR and AC, the quantity and quality of insights generated, and informants' affective response to the task.
The first advantage of AC is that it saves time, reducing researcher workload by 70-80%. This in turn can give researchers a chance to deal with larger sample sizes, and result in more concentrated (and perhaps more secure) data-collection. The grinding process of transcribing audio recordings is obviated, as is the task of deciding how to interpret and classify the sometimes barely intelligible utterances solo-condition verbal reports may contain.
A second argument for the use of AC is that its elicitation format can produce a quantitatively better picture of the processing of the (in my experience, surprisingly many) informants who seem unable, for reasons which are far from clear, to report adequately under VR conditions. My own data indicate that male informants often 'report' in less detail than females, and that Japanese L1 informants verbalize less, overall, than those of German first-language background. These disparities were much less conspicuous under AC conditions, which allowed the gathering of data from a number of informants who were unproductive in 'think-aloud' sessions but who had little apparent difficulty in carrying out the AC task. The benefits of AC were not confined to 'low verbalizers', however: although differences did exist across informants and across cloze blanks, AC appeared on the whole to track informants' processing operations at least as effectively as solo-condition VR.
AC may also offer benefits in terms of informants' affective response to the task. The majority of informants who sampled both solo-condition VR and AC procedures reported that they found AC to be less stressful. They also claimed, overall, to be at least as satisfied with AC as with VR. Not surprisingly, VR informants who verbalized little subsequently incorporated more details into their AC data-gathering sessions. They also reported that AC had allowed them to provide a significantly better picture of their own processing. Although none of my informants has experienced both solo-condition and pair-condition VR as well as AC, those with experience of pair-condition VR and AC rated the latter as only slightly less satisfactory in terms of stressfulness or the 'picture quality' of their task processing. From the data-collector's point of view, this slight preference for pair reporting on the part of informants may not offset the significant saving in researcher workload which AC provides.

Some points of concern regarding AC

While VR also has its shortcomings in relation to cloze processing (such as the fact that the filling of a blank may leave no verbal record whatsoever), the potential shortcomings of the AC procedure cannot be ignored. One of these is the difficulty, under normal AC conditions, of retrospectively checking or analyzing informants' processing behaviors except via informants' unsupported recollections. VR, on the other hand, potentially provides corroborating audio data.

In short, my argument is that an AC-like procedure, in which informants categorize their own behaviors in terms of an a priori set of codings, is not inevitably less valid than a researcher-centered procedure such as VR. It may, moreover, provide a productive alternative for informants whose VR output is unacceptably low, and in terms of researcher workload and hence of potential sample size, AC offers clear benefits. Future research may reveal useful modifications to the AC procedure, and expand the range of contexts in which it can be applied, albeit under a different label.


Anyone wishing to explore the application of verbal report data can hardly do better than look at Faerch and Kasper (1987) and the many thought-provoking papers it contains.


Alderson, J.C. (2000). Assessing reading. Cambridge: Cambridge University Press.

Boren, M. & Ramey, J. (2000). Thinking aloud: Reconciling theory and practice. IEEE Transactions on Professional Communication.

Cohen, A. (1998). Strategies in learning and using a second language. London: Longman.

Cohen, A. (2000). Exploring strategies in test-taking: Fine-tuning verbal reports from respondents. In G. Ekbatani & H. Pierson (Eds.), Learner-directed assessment in ESL. Mahwah, NJ: Lawrence Erlbaum Associates.

Ekbatani, G. & Pierson, H. (Eds.). (2000). Learner-directed assessment in ESL. Mahwah, NJ: Lawrence Erlbaum Associates.

Ericsson, K. & Simon, H. (1984/1993). Protocol analysis: Verbal reports as data. Cambridge, MA: MIT Press.

Faerch, C. & Kasper, G. (Eds.). (1987). Introspection in second language research. Philadelphia: Multilingual Matters.
