It is widely accepted (cf. Alderson, 2000) that deeper knowledge of the mental
operations test-takers employ in choosing or constructing their responses can be of
value in understanding why, rather than merely how often, items are answered correctly
or not. One way of getting at this information is through 'verbal reports' (VR), in which
takers of a more-or-less realistic task introspect in real time, and/or retrospect about their
mental processes during a task. VR is undeniably a useful resource in looking at 'black
box' processes not easily accessed through observation or experiment but, as this paper
will point out, it also has some conceptual and practical drawbacks.
The use of VR data dates from at least the 19th century. Though it was later
dismissed by behaviorists as lacking scientific rigor, it has been used widely in problem
solving studies and investigations of expert knowledge. In recent decades VR has been
rehabilitated as a mode of enquiry, and the intellectual underpinning of that revival may
largely be found in Ericsson and Simon's 1984 monograph Protocol analysis: Verbal
reports as data. Even today this is regarded as a definitive text in the field. How
carefully the book is read may be another matter: Boren and Ramey (2000) observed
that the procedures and structures outlined in that text are by no means followed in all
studies purporting to employ VR.
Types of VR data
Andrew Cohen has written extensively about the gathering of VR data, and Cohen (1998)
offers a plausible and widely-cited taxonomy of VR types (which are illustrated here
from my own data). These are the labels:
Self-revelation: "think-aloud, stream of consciousness disclosure of thought processes"
[... so... uh... this must be has... have?... no has]
Self-observation: introspective or just retrospective (within 20 seconds) inspection of specific, not generalized, language behavior
[okay...I could get this one because...uh...earlier it mentioned eels... plural right? so it must be are equipped with cells...electric cells]
Self-report: "learners' descriptions of what they do, characterized by general statements"
[well... I look back a little or forward...and if I still don't get it I leave it and come back later...well usually (laughs)]
Although verbal reports usually comprise some combination of these sub-types
(Cohen, op. cit.), the ideal persists that VR data should be both immediate and unrefined
(i.e. corresponding to Cohen's self-revelation). This is thought to be the most authentic
and hence valid type of VR data, in that the longer the chronological interval or cognitive
distance between a processing event and its reporting, the greater the risk that informants
will reconstruct rather than recall their processing actions. This need not imply conscious
dissembling on the part of informants, but the risk is real. Some of my informants, for
example, downplayed or even denied certain behaviors – even though they were
clearly identifiable on tape.
Some practical problems of VR
Some difficulties in using VR data are taken up by Boren and Ramey (2000) and Cohen
(2000). In my view, a VR study can only be properly evaluated and compared to other
VR studies if these key attributes are present:
- a sufficiently detailed description of the stimulus task(s) used. If the task was, for example, a 'cloze passage', what was the deletion ratio? Were items targeted for deletion randomly (every nth word) or on some other basis (e.g. adverbials only)?
- details of any training in, or model of, verbal reporting which informants were
given. My own explorations of VR indicate that such training or modeling may
influence the range of processing behavior informants use.
- details of how the language-of-reporting was chosen. Were informants allowed
to think aloud in their own language or in an L2? The use of an L2, after all, may
significantly limit what even fairly fluent informants can report.
- a clear picture of the degree of researcher-informant interaction during the task.
Some protocols labeled 'think aloud' contain so many researcher questions and
requests for clarification that they resemble interviews rather than self-revelations or self-observations.
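The first of these attributes can be made concrete. The following is a minimal Python sketch of fixed-ratio deletion, in which every nth word is blanked. The function name make_cloze, the blank format and the sample sentence are assumptions made purely for illustration; they are not drawn from any of the studies cited here.

```python
import string


def make_cloze(text, n=7):
    """Blank every n-th word of `text` (fixed-ratio deletion).

    Returns the gapped passage and an answer key mapping blank
    numbers to the deleted words. Trailing punctuation is kept in
    the passage so that sentence boundaries remain visible.
    """
    answers = {}
    out = []
    for position, word in enumerate(text.split(), start=1):
        core = word.rstrip(string.punctuation)
        if position % n == 0 and core:
            blank_no = len(answers) + 1
            answers[blank_no] = core
            # Keep whatever punctuation followed the deleted word.
            out.append(f"({blank_no})____" + word[len(core):])
        else:
            out.append(word)
    return " ".join(out), answers


# Hypothetical passage, invented for this illustration.
passage = ("Electric eels are equipped with cells that can "
           "deliver a strong shock to prey.")
cloze, key = make_cloze(passage, n=4)
print(cloze)
print(key)
```

Reporting the value of n alongside the passage would allow another researcher to regenerate the same blanks. A targeted procedure (e.g. adverbials only) would need its selection rule reported instead.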
Although by no means all VR studies conform to Ericsson and Simon's guidelines,
uncritically adhering to these guidelines may create problems where, arguably, none
need exist (Boren and Ramey, op. cit.).
To take one example, Cavalcanti (in Faerch and Kasper, 1987) claims that the
tendency of untrained VR informants to read aloud long spans of text (and then, in effect,
to retrospect about these) can be eliminated through informant training. This behavior, she
implies, does not produce the immediate, unstructured data idealized by the Ericsson and
Simon model. Though I have used VR procedures with over 70 informants in the last ten
years, I cannot recall even one of my informants who did not at least occasionally display
this behavior which Cavalcanti proposes to train away. I would argue that until we know
more about the full range of authentic processing and reporting styles informants may
resort to, we need to be very careful about how we try to constrain these.
On a practical level, informants' self-revelatory 'inner speech' externalized as recorded
and/or observed verbal data may not be readily comprehensible even to a researcher
fluent in the language of reporting. Triangulatory data is typically needed, often in the form
of post-task retrospection and/or task-concurrent interruptive questioning. This may call
for a sequence of one-on-one data-gathering sessions, with recorded protocols requiring
transcription and analysis.
The very high investment of researcher time which VR requires may severely limit
sample sizes and, though not inevitably a problem in itself, this may restrict the use of
statistical procedures. A typically overlooked problem, and one perhaps more serious in
Japanese contexts, is that the longer the research period is drawn out, the greater the risk
that the security of the materials and procedures used may be compromised. Japanese
informants often try to glean in advance as much information as possible about what awaits
them, especially if the research focuses on test-taking. 'Snowball' sampling, in which initial
informants recruit subsequent informants, may be especially prone to this risk.
'Improved' VR procedures
Clearly, there is a strong motivation to find ways of making VR data fuller and more
comprehensible. Kirsten Haastrup (in Faerch and Kasper, 1987) is one of a number of
researchers who have made use of pair (also sometimes termed 'dyad') reporting. In this
condition, informants work together, either as peers solving a task co-operatively, or in a kind
of tutor-pupil relationship in which one 'teaches' the other how a task can be completed.
In my own data-gathering a majority of the informants who had experienced both solo and
pair verbal reporting – regardless of gender or first language background – preferred
the latter. Pair verbal reporting was typically rated as both less stressful and more satisfying
in terms of how comprehensively the informants thought they had been able to report
their processing. The volume of data produced under pair reporting conditions is typically
higher and, given that it is aimed at another person, it is inevitably more consistently
understandable to the researcher/analyst.
That said, pair reporting may not be an ideal way to improve the quality of verbal
reports for test-like tasks. Not only is pair reporting less authentic in terms of the typically
solitary task of test-taking, but the structuring of the data necessary for communication
to a partner might skew the information in some way. There may thus be an unavoidable
trade-off between making VR more practically workable, and adhering to the model
proposed by Ericsson and Simon.
A perhaps more serious drawback of pair reporting is that an informant may leave
unmentioned information which she supposes that her partner already possesses, or
inferences which she supposes her partner will already have drawn. Given that pair verbal reporting is to a large extent
a kind of conversation, it would not be surprising if the Gricean conversational maxim
of quantity ('Don't tell your interlocutor what s/he already knows.') should apply. The
question of how far this suppression of 'shared information' affects the overall informant
task-processing calls for further study.
An alternative to verbal-report data?
In the face of these difficulties, I attempted to construct a more time-efficient data-
elicitation procedure in which informants themselves take on the task of identifying their
task-processing behaviors in terms of an a priori set of mental operations. I have applied
the procedure most extensively to the investigation of cloze passage processing behavior,
and so have labeled it 'Annotated Cloze', or AC. This is how it works.
Potential AC informants are first introduced to the set of codes – a set of labels which
identify individual cloze task processing operations previously isolated from the protocols
of VR informants. There are just under forty individual codes, and informants' orientation
to these is discussed in more detail below.
In addition to the codes, there are a number of markup conventions designed to allow
informants to efficiently record particular aspects of their processing. One example is the
markup convention of underlining those parts of a cloze passage which were translated
during processing. This markup convention not only provides a backup or cross-check
to the code 'TR(anslation)' which informants are asked to use to indicate their use of L1
translation in filling a cloze blank, but also allows them to show more precisely which
parts of a text were translated, whether a single word or a supra-sentential chunk.
AC informants are also encouraged to supplement their codes and markups with
short L1 or target-language written comments in spaces provided in each manuscript
whenever they think this may clarify or expand on their record of their task-processing.
Frequently observed applications of this option include the offering of 'second choice'
filler words, and elucidation of prior knowledge ("I learned in school that the [original]
marathon runner died at the end of his race.", etc.) which aided filling of the blank.
Number scales may be added to the boxes in order to record such features of
processing as the perceived difficulty of filling a cloze blank, the informant's confidence
in her choice of filler, etc. Perhaps because they are difficult to overlook, number scales
attract a high rate of response, even for those (perhaps easier) cloze blanks such as
articles and prepositions, which are often filled 'silently' in verbal report protocols.
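The elements of an AC manuscript described above (codes, underlining markup, written comments and number scales) might be captured for analysis along the following lines. This is only a sketch under stated assumptions: apart from 'TR', which is the one code named in this paper, the code labels LOOK_BACK and PRIOR_KNOWLEDGE, the field names, the 1-5 scales and the sample records are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BlankRecord:
    """One informant's annotations for a single cloze blank."""
    blank_no: int
    filler: str
    codes: List[str] = field(default_factory=list)             # behavior codes
    translated_spans: List[str] = field(default_factory=list)  # underlined text
    comment: str = ""                                          # optional note
    difficulty: Optional[int] = None                           # assumed 1-5 scale
    confidence: Optional[int] = None                           # assumed 1-5 scale


def code_frequencies(records):
    """Count how often each behavior code appears across the records."""
    return Counter(code for record in records for code in record.codes)


def translation_mismatches(records):
    """Blank numbers where the TR code and the underlining markup disagree."""
    return [r.blank_no for r in records
            if ("TR" in r.codes) != bool(r.translated_spans)]


# Hypothetical records, invented for this illustration.
records = [
    BlankRecord(1, "are", codes=["TR", "LOOK_BACK"],
                translated_spans=["eels"], difficulty=2, confidence=4),
    BlankRecord(2, "the", difficulty=1, confidence=5),
    BlankRecord(3, "died", codes=["TR", "PRIOR_KNOWLEDGE"],
                comment="learned in school", difficulty=3, confidence=3),
]
frequencies = code_frequencies(records)
mismatches = translation_mismatches(records)
print(frequencies)
print(mismatches)
```

The second function illustrates the cross-check mentioned above: a blank coded 'TR' with nothing underlined (or the reverse) is flagged for follow-up rather than silently accepted.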
Orientating AC informants
A two-stage procedure is used to orient AC informants. Both stages are described in detail.
Stage 1: Informants explore the behavior codes
Various means were trialled of introducing informants to the set of behavior codes
and markup conventions, and the current procedure involves the primary presentation
of codes on a set of individual cards such as those shown in Figure 1:
Figure 1: Some sample behavior codes used by the author in this study.
This presentation of codes on individual cards appears to be more efficient and
less intimidating than the previous method of distributing a list of almost forty codes.
It also seems to minimize informants' perceptions of the set of codes as a monolithic
whole, and to encourage them to, as it were, edit the set of codes to match their own
processing styles. It must be borne in mind that while the full set of AC codes reflects
all the task-processing behaviors regularly isolated in the protocols of VR informants,
an individual informant is unlikely to use more than an unpredictable subset of these
behaviors in her own task processing. Rather than provide a list to be memorized, the
set of card-presented codes becomes a reference resource. This inevitably means that
informants are unlikely to be working with exactly the same set of codes, but the next
stage provides a context in which individual informants' applications of codes may be shared and revised.
Stage 2: Informants apply the behavior codes
One or more practice sessions follow, in which informants (who typically work in
groups of two or three) identify and code their own conscious processing behaviors while
working through the stimulus task(s) presented on AC manuscripts. These are the cloze
passage worksheets in which informants fill blanks and encode, mark up and (optionally)
comment on their processing. A good deal of mutual cross-checking and discussion is to
be expected as informants make sense of their interaction with the task in terms of the
codes available. An informant may, for example, discard certain codes as irrelevant to
her processing style, but later re-incorporate these into her personal set.
The mechanics of recording codes, markups and comments need to be as
straightforward as possible. After experimenting with various manuscript formats, I
currently prefer to present a text in the middle of an A3 page, with numbered boxes
corresponding to odd numbered cloze blanks down one side, and even numbers down
the other. These boxes are designed to be large enough for informants to enter not only
the appropriate processing behavior codes, but also short written comments. This
format appears to work reasonably well, and is easily formatted and printed. A sample
completed version is available at http//.jalt.org/test/gib_1.htm
Benefits of AC over VR
How does AC compare with VR as a data-gathering tool? While the gathering of initial
verbal report data appears to be indispensable in constructing a set of AC codes, there
are good reasons for considering a procedure like AC in subsequent data-collection.
These concern the relative workloads of VR and AC, the quantity and quality of insights
generated, and informants' affective response to the task.
The first advantage of AC is that it saves time, reducing the workload by 70-80%.
This in turn can give researchers a chance to deal with larger sample sizes, and result
in more concentrated (and perhaps more secure) data-collection. The grinding process
of transcribing audio recordings is obviated, as is the task of deciding how to interpret
and classify the sometimes barely intelligible utterances solo-condition verbal reports may contain.
A second argument for the use of AC is that its elicitation format can produce a
quantitatively better picture of the processing of the (in my experience, surprisingly many)
informants who seem unable, for reasons which are far from clear, to adequately report
under VR conditions. My own data indicate that male informants often 'report' in less
detail than females, and that Japanese L1 informants verbalize less, overall, than those
of German first language background. These disparities were much less conspicuous
under AC conditions, and allowed the gathering of data from a number of informants
who were unproductive in 'think-aloud' sessions but who had little apparent difficulty
in carrying out the AC task. The benefits of AC were not confined to 'low verbalizers',
however, for although differences did exist across informants and across cloze blanks, AC
appeared on the whole to track informants' processing operations at least as effectively as solo-condition VR.
AC may also offer benefits in terms of informants' affective response to the task. The
majority of informants who sampled both solo-condition VR and AC procedures reported
that they found AC to be less stressful. They also claimed, overall, to be at least as
satisfied with AC as VR. Not surprisingly, VR informants who verbalized little subsequently
incorporated more details into their AC data-gathering sessions. They also reported that
AC had allowed them to provide a significantly better picture of their own processing.
Although none of my informants has experienced both solo-condition and pair-condition
VR as well as AC, those with experience of pair-condition VR and AC rated the latter
as only slightly less satisfactory in terms of stressfulness or the 'picture quality' of their
task-processing. From the data-collector's point of view, this slight preference for pair
reporting on the part of informants may not offset the significant saving in researcher
workload which AC provides.
Some points of concern regarding AC
While VR also has its shortcomings in relation to cloze processing (such as the
fact that the filling of a blank may leave no verbal record whatsoever) the potential
shortcomings of the AC procedure cannot be ignored. One of these is the difficulty, under
normal AC conditions, of retrospectively checking or analyzing informants' processing
behaviors except via informants' unsupported recollections. VR, on the other hand,
potentially provides corroborating audio data.
In short, my argument is that an AC-like procedure, in which informants categorize
their own behaviors in terms of an a priori set of codings, is not inevitably less valid than
a researcher-centered procedure such as VR. It may, moreover, provide a productive
alternative for informants whose VR output is unacceptably low, and in terms of researcher
workload and hence of potential sample size, AC offers clear benefits. Future research
may reveal useful modifications to the AC procedure, and expand the range of contexts
in which it can be applied, albeit under a different label.
Anyone wishing to explore the application of verbal report data can hardly do
better than look at Faerch and Kasper (1987) and the many thought-provoking papers it contains.
References
Alderson, J.C. (2000). Assessing reading. Cambridge: Cambridge University Press.
Boren, M. & Ramey, J. (2000). Thinking aloud: Reconciling theory and practice. IEEE Transactions on Professional Communication, 43(3), 261-278.
Cohen, A. (1998). Strategies in learning and using a second language. London: Longman.
Cohen, A. (2000). Exploring strategies in test-taking: Fine-tuning verbal reports from respondents. In G. Ekbatani & H. Pierson (Eds.), Learner-directed assessment in ESL. Mahwah NJ: Lawrence Erlbaum Associates.
Ekbatani, G. & Pierson, H. (Eds.). (2000). Learner-directed assessment in ESL. Mahwah NJ: Lawrence Erlbaum Associates.
Ericsson, K. & Simon, H. (1984/1993). Protocol analysis: Verbal reports as data. Cambridge, MA: MIT Press.
Faerch, C. & Kasper, G. (Eds.). (1987). Introspection in second language research. Philadelphia: Multilingual Matters.