Since the introduction of the cloze procedure as a measure of
readability by Wilson Taylor (1953), it has been employed as one way of measuring the
reading ability of native speakers (Bormuth, 1967; Crawford, 1970).
Other researchers later investigated the effectiveness of cloze testing
as a measure of ESL/EFL proficiency (e.g. Darnell, 1968; Brown, 1983,
1988, 1993; Irvine, Atai, and Oller, 1974; Oller, 1972, 1983).
The results have varied widely across studies, and a number of
defects have been identified in cloze procedures. In the light of
these criticisms, Klein-Braley and Raatz (1984) proposed a modification known as
C-testing. The procedure, developed to answer the psychometric problems
of cloze testing, has been put forward as an empirically and theoretically
valid measure of language proficiency (Raatz and Klein-Braley, 1981;
Klein-Braley, 1985; Klein-Braley and Raatz, 1984, 1985; Raatz,
1985). Other researchers later proposed it as a substitute
for cloze tests (McBeath, 1990; Cohen, Segall, and Weiss, 1984).
Originally, the C-testing procedure involved making a test from four
or five thematically distinct segments of connected discourse in which
the second half of every second word (usually 100 words in all) was
deleted. Examinees received credit for exact word restorations. The
use of several different short texts minimized the effect of text topic
familiarity or difficulty. Nevertheless, researchers did not explore
what kind of text produces higher reliability or validity until
Mochizuki (1994) experimented with four kinds of texts for classroom
C-tests: narratives, explanations, arguments, and descriptions.
His study, whose findings are challenged later in this paper, suggested
that long passages, especially narratives, were the most appropriate for
making the C-test effective in terms of reliability and concurrent
validity.
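The classic deletion rule just described can be sketched in a few
lines. The function below is a minimal illustration only, assuming
plain whitespace tokenization and the convention of retaining the
(longer) first half of odd-length words; actual C-test construction
conventions vary (e.g. in the treatment of one-letter words,
punctuation, and where deletion begins), and the function name is ours.

```python
import re

def make_ctest(text, start=1):
    """Sketch of the classic C-test rule: delete the second half of
    every second word, replacing deleted letters with underscores.
    Naive whitespace tokenization; punctuation on mutilated words is
    dropped. Illustrative only, not Klein-Braley and Raatz's exact
    implementation. `start` is the index of the first word eligible
    for mutilation (conventions differ on where deletion begins)."""
    words = text.split()
    out = []
    for i, w in enumerate(words):
        core = re.sub(r"\W", "", w)  # letters only, for counting
        if i >= start and (i - start) % 2 == 0 and len(core) > 1:
            keep = (len(core) + 1) // 2  # retain first half (round up)
            out.append(core[:keep] + "_" * (len(core) - keep))
        else:
            out.append(w)
    return " ".join(out)
```

For example, `make_ctest("The keeper opened the lock gates slowly")`
mutilates every second word, yielding
`"The kee___ opened th_ lock gat__ slowly"`.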
Klein-Braley and Raatz mainly utilized teacher judgments or school
grades as criteria for validating C-tests, while other researchers
have supplied evidence based on other kinds of criteria. For
example, Nigishi (1987) reports correlation coefficients of .80 and .76
between C-tests and the reading subtest of ELBA and total ELBA,
respectively, while the studies of Ikeguchi (1994) indicate that C-test
responses correlate most highly with the grammar results of TOEFL
exams. Still other studies in support of this procedure include
the validation of C-tests among ESL/EFL learners. For instance, Feldman
and Stemmer (1987) validated C-tests through verbal reports, while
Dörnyei and Katona (1992) studied C-tests against different
language tests, including oral interviews. They found further support
for C-testing, reporting that the procedure gives a random and
representative sample of an original text. That supports an earlier
assertion that the every-other-word deletion in the C-test produces a
large number of 'random samples of the word classes of the text
involved' (Klein-Braley, 1984, 1985). Other recent SLA researchers have
suggested that C-tests may also be useful for L2 vocabulary research.
For instance, Singleton and Little (1991) found that the responses of
L2 learners to C-tests provide a source of evidence about second
language lexical development.
On the other hand, criticisms have been leveled against the C-test
procedure. Some common criticisms meriting further investigation are:
- Do C-tests accurately reflect students' ability to process discourse for general proficiency (Cleary, 1988)?
- Do C-tests encourage only microlevel processing rather than macrolevel processing (Cohen, Segal, and Weiss, 1984)?
- Do C-tests have adequate face validity (Weir, 1988)?
- Since C-tests have reduced redundancy, are they valid tests of language competence (Carroll, 1987)?
Although C-tests may tap a measure of grammatical competence
(Klein-Braley, 1985), there is not enough validity research regarding
the specific traits they measure (Chapelle and Abraham, 1990).
Moreover, according to Jafarpur (1995), 'assumptions of random sampling
of the basic elements of a text are doubtful'.
The use of C-tests since their introduction (Klein-Braley and Raatz,
1984) as a means of constructing norm-referenced measures for
proficiency and placement testing, and of solving the problems of
the cloze procedure, has been stretched to indefinite limits, such as
a 'measure of language creativity' (Carroll, 1987), and has yielded
results contrary to researchers' expectations and outside the purpose
for which the test was originally intended. Furthermore, the empirical
evidence in support of C-tests is scanty (Weir, 1988) and warrants
further investigation in the context of second language learning.
Purpose of the study
The objectives of this study are to investigate whether C-tests,
constructed by two different procedures, can discriminate levels of
language proficiency among ESL learners in Japan, and to determine
whether a C-test using several passages (C-test 1) is superior to a
C-test constructed from only one long passage of a narrative type
(C-test 2) in terms of reliability and correlation with an external
criterion.
Two groups of first-year university students in Japan were chosen for
the investigation: one group consisted of 60 undergraduate students
enrolled in a general English course; the second group was made up of
30 students in a class of English for returnees. Students in the
first group were picked randomly from an intact class, while those in
the latter group all belonged to one English class for returnees. To
qualify for that class, students must have stayed in an
English-speaking country for at least one year and have passed the
qualifying exam administered by the university. In terms of
proficiency, most students in that class had advanced listening and
oral production skills but post-intermediate writing and reading
skills (Tschirner, 1996).
Two kinds of C-tests were used in this study: one was constructed
using four short passages from different texts, while the other was
constructed using only one long narrative text. The use of several
short segments from different texts has been shown by the studies
cited above to yield satisfactory reliability, above .80; according to
Klein-Braley and Raatz (1984), it is also empirically valid.
For this study, the four short passages were chosen from different
texts with similar readability and interest levels using the Fry
(1985) and Flesch (as described in Klare, 1984) indices. The
readability estimates of the texts from which the segments were chosen
ranged from grade 6 to 8 on the Fry index and from 6.7 to 9.6 on the
Flesch index. Although the two indices use quite different scales,
the figures are noteworthy only in that they indicate variation in the
readability levels of the passages used (Brown, 1993). C-test 1 was
constructed using 25 items from each passage, for a total of 100
items. The first and last sentences of each passage were left intact
to provide contextual clues.
The second type of C-test was adopted for this study based on
Mochizuki's (1994) research, in which four types of discourse
(description, exposition, narration, and argumentation) were used to
construct C-tests for classroom use. Among the four types, the
narrative was found to be the most reliable (.92). The narrative
text "The Lock Keeper", consisting of 120 items, was found to be the
most reliable and to have the highest concurrent validity (Mochizuki,
1995).
This study attempts to investigate which of these two types of C-test
construction would yield higher reliability and concurrent validity.
The external criterion used was the STEP-Eiken exam, which consists of
66 written test questions on vocabulary, grammar, and reading
comprehension. Previous investigations have established the STEP-Eiken
as having high reliability as well as high coefficients as an external
validating criterion with Japanese university students (Kimura, 1995).
In a previous study using STEP-Eiken and CELT results to investigate
the external validity of C-tests constructed from different types of
discourse, the STEP-Eiken was found to have higher reliability (.778)
than the CELT (.638) and the other C-tests examined.
Each student in the two groups took the two versions of the C-test
and the STEP-Eiken. To control for a potential order effect, the order
of administration was counterbalanced: half the subjects in each of
the non-returnee and returnee groups took the two C-tests first and
the STEP-Eiken during the English class the following week; the other
half of each group took the STEP-Eiken first, followed by the two
C-tests.
The students' responses on both C-test 1 and C-test 2 were scored
for exact replacements. Descriptive statistics for the C-test scores
were obtained, and reliability coefficients were computed by the KR-20
method. The use of KR-20 has been questioned in the past: Farhady
(1983) and Bachman (1990), for instance, claim that internal
consistency reliability coefficients are inappropriate for cloze and
C-tests because of the interdependence of items. On the other hand,
Woods (1984), Henning (1987) and Jafarpur (1995) claimed that the
KR-20 method yields the same results as Cronbach's alpha, and Brown
(1983) provided evidence that the differences between reliability
coefficients from KR-20 and Cronbach's alpha are negligible.
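For dichotomously (0/1) scored items such as exact-replacement C-test
responses, KR-20 is straightforward to compute; indeed, for 0/1 items
KR-20 is exactly the special case of Cronbach's alpha, which is why
the two coincide. The sketch below is a generic illustration, not the
scoring procedure actually used in the study.

```python
def kr20(score_matrix):
    """KR-20 reliability for dichotomously scored (0/1) items.
    score_matrix: one list of item scores per subject."""
    n_subj = len(score_matrix)
    n_items = len(score_matrix[0])
    # Sum of item variances p*q, where p = proportion answering correctly
    pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in score_matrix) / n_subj
        pq += p * (1 - p)
    # Population variance of total scores
    totals = [sum(row) for row in score_matrix]
    mean = sum(totals) / n_subj
    var = sum((t - mean) ** 2 for t in totals) / n_subj
    return (n_items / (n_items - 1)) * (1 - pq / var)
```

For example, a toy matrix of four subjects on three items,
`[[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]`, gives a KR-20 of .75.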
To address the first research question, that of determining the
discriminative power of the C-tests, the subjects' scores were
compared across groups: the mean scores for each test within each
group were obtained and subjected to analysis of variance and t-tests.
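A two-group comparison of mean scores of this kind reduces to the
familiar pooled-variance t statistic. The sketch below returns the
statistic and its degrees of freedom (the p-value would then be read
from a t table); it is an illustration only, not the analysis software
used in the study.

```python
import math

def independent_t(a, b):
    """Independent-samples t statistic with pooled variance, as in a
    classic two-group comparison of mean test scores."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    # Sample (n-1) variances of each group
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    # Pooled variance and the t statistic
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    t = (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2  # statistic, degrees of freedom
```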
For the second research objective, determining the reliability of the
C-tests and their correlation with the STEP-Eiken, Pearson
product-moment correlation coefficients were computed.
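The Pearson product-moment coefficient is simply the covariance of the
paired scores divided by the product of their standard deviations; a
minimal illustrative sketch:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between paired scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Perfectly linearly related score lists give r = 1 (or -1 when the
relationship is inverse).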
Results and discussion
Table 1. Basic descriptive statistics for non-returnees' scores on the C-test 1, C-test 2 and the STEP-Eiken tests.
Test type N No. of Items Mean Reliability *
C-test 1 60 100 61 .67 .73
C-test 2 60 120 98 .70 .83
STEP-Eiken 60 160 109 .75 .85
* Raw-score reliabilities (KR-20) appear on the right; reliabilities
that would be observed if all the tests contained 100 items appear on
the left.
Table 2. Basic descriptive statistics for returnees' scores on the C-test 1, C-test 2 and STEP-Eiken tests.
Test type N No. of items Mean Reliability *
C-test 1 30 100 74 .65 .76
C-test 2 30 120 109 .71 .89
STEP-Eiken 30 160 124 .87 .91
* Raw-score reliabilities (KR-20) appear on the right; reliabilities
that would be observed if all the tests contained 100 items appear on
the left.
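The projection of reliabilities to a common 100-item length in the
table notes is conventionally done with the Spearman-Brown prophecy
formula; the sketch below assumes that formula, which the tables do
not name explicitly.

```python
def spearman_brown(r, n_old, n_new):
    """Project the reliability r of an n_old-item test to the
    reliability expected of an n_new-item test (Spearman-Brown
    prophecy formula)."""
    k = n_new / n_old  # factor by which the test is lengthened
    return (k * r) / (1 + (k - 1) * r)
```

For example, doubling a 50-item test with reliability .70 projects to
.70 * 2 / 1.7, about .82, while leaving the length unchanged returns
the reliability unchanged.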