JALT Testing & Evaluation SIG Newsletter
Vol. 2 No. 1 Oct. 1998. (p. 2 - 10) [ISSN 1881-5537]

Do different C-tests discriminate proficiency levels of EL2 learners?

Cecilia B. Ikeguchi (Tsukuba Women's University)


Since Wilson Taylor (1953) introduced the cloze procedure as a measure of readability, it has been employed as one way of measuring the reading ability of native speakers (Bormuth, 1967; Crawford, 1970). Other researchers later investigated the effectiveness of cloze testing as a measure of ESL/EFL proficiency (Darnell, 1968; Brown, 1983, 1988, 1993; Irvine, Atai, and Oller, 1974; Oller, 1972, 1983), to name a few. The results have varied widely across studies, and a number of defects have been found in cloze procedures. In the light of these criticisms, Klein-Braley and Raatz (1984) proposed a modification known as C-testing. The procedure, developed to address the psychometric problems of cloze testing, has been put forward as an empirically and theoretically valid measure of language proficiency (Raatz and Klein-Braley, 1981; Klein-Braley, 1985; Klein-Braley and Raatz, 1984, 1985; Raatz, 1985). It was later proposed by other researchers as a substitute for cloze tests (McBeath, 1990; Cohen, Segal, and Weiss, 1984).
"[Mochizuki (1994) contends that] long passages, especially narratives, ... [are] the most appropriate for making the C-test effective in terms of reliability and concurrent validity."

[ p. 2 ]

Originally, the C-testing procedure involved making a test from four or five thematically distinct segments of connected discourse in which the second half of every second word (usually 100 words in all) was deleted. Examinees received credit for exact word restorations. The use of several different short texts minimized the effect of text topic familiarity or difficulty. Nevertheless, researchers did not explore what kind of text produces higher reliability or validity until Mochizuki (1994) experimented with four kinds of texts for classroom C-tests: narratives, explanations, arguments, and descriptions. His study, whose findings this paper later calls into question, suggested that long passages, especially narratives, were the most appropriate for making the C-test effective in terms of reliability and concurrent validity.
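As an illustration, the classic deletion rule can be sketched in a few lines of code. This is only a rough sketch: actual C-test construction leaves the first (and often last) sentence intact, and conventions for one-letter and odd-length words vary across studies; the function below simply keeps the longer first half of every second word.

```python
def make_c_test(text, n_items=25):
    """Sketch of the classic C-test deletion rule: starting from the
    second word, delete the second half of every second word, up to
    n_items deletions.  For odd-length words the longer first half is
    kept here; published studies differ on this detail."""
    words = text.split()
    items = 0
    out = []
    for i, w in enumerate(words):
        if i % 2 == 1 and items < n_items and len(w) > 1:
            keep = len(w) - len(w) // 2   # keep first half, rounded up
            out.append(w[:keep] + "_" * (len(w) - keep))
            items += 1
        else:
            out.append(w)
    return " ".join(out)
```

Applied to a short phrase, every second word loses its tail: `make_c_test("one twofold three fourfold")` yields `"one twof___ three four____"`.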
Klein-Braley and Raatz basically utilized teacher judgments or school grades as a criterion for validating C-tests, while other researchers have supplied evidence grounded on other kinds of criteria. For example, Nigishi (1987) reports correlation coefficients of .80 and .76 between C-tests and the reading subtest of ELBA and total ELBA, respectively, while the studies of Ikeguchi (1994) indicate that C-test responses correlate most highly with the grammar results of TOEFL exams. Still other studies in support of this test procedure include the validation of C-tests among ESL/EFL learners. For instance, Feldman and Stemmer (1987) validated C-tests through verbal reports, while Dörnyei and Katona (1992) studied C-tests against different language tests, including oral interviews. They found further support for C-testing, reporting that the procedure yields a random and representative sample of an original text. That supports an earlier assertion that the every-other-word deletion in the C-test produces a large number of 'random samples of the word classes of the text involved' (Klein-Braley, 1984, 1985). Other recent SLA researchers have suggested that C-tests may also be useful for L2 vocabulary research. For instance, Singleton and Little (1991) found the responses of L2 learners to C-tests to be a source of evidence about second language lexical development.
On the other hand, criticisms have been leveled against the C-test procedure. Some common criticisms meriting further investigation are:
  1. Do C-tests accurately reflect students' ability to process discourse for general proficiency (Cleary, 1988)?
  2. Do C-tests encourage only microlevel processing rather than macrolevel processing (Cohen, Segal, and Weiss, 1984)?
  3. Do C-tests have adequate face validity (Weir, 1988)? and
  4. Since C-tests have reduced redundancy, are they valid tests of language competence (Carroll, 1987)?

[ p. 3 ]

Although C-tests may tap into a measure of grammatical competence (Klein-Braley, 1985), there is not enough validity research regarding the specific traits they measure (Chapelle and Abraham, 1990). Moreover, according to Jafarpur (1995), 'assumptions of random sampling of the basic elements of a text are doubtful'.
Since their introduction (Klein-Braley and Raatz, 1984) as a means of constructing norm-referenced measures for proficiency and placement testing, and of solving the problems of cloze procedures, C-tests have been extended well beyond their original purpose, for example as a 'measure of language creativity' (Carroll, 1987), and have yielded results contrary to the researchers' expectations. Furthermore, the empirical evidence in support of C-tests is scanty (Weir, 1988) and warrants further investigation in the context of second language instruction.


Purpose of the study

The objectives of this study are to investigate whether C-tests constructed by two different procedures can discriminate levels of language proficiency among ESL learners in Japan, and to determine whether a C-test using several short passages (C-test 1) is superior to a C-test constructed from a single long narrative passage (C-test 2) in terms of reliability and correlation with an external criterion.


Subjects

Two groups of first-year university students in Japan were chosen for the investigation: one group consisted of 60 undergraduate students enrolled in a general English course; the second group was made up of 30 students in a class of English for returnees. Students in the first group were picked randomly from an intact class, while those in the latter group all belonged to one English class for returnees. To qualify for that class, students must have stayed in an English-speaking country for at least one year and have passed the qualifying exam administered by the university. In terms of proficiency level, most of the students in that class had advanced listening and oral production skills, but post-intermediate writing and reading skills (Tschirner, 1996).

[ p. 4 ]


Instruments

Two kinds of C-tests were used in this study: one was constructed using four short passages from different texts, while the other was constructed using only one long narrative text. The use of several short segments of different texts has been shown by the research cited above to have a satisfactory reliability above .80; according to Klein-Braley and Raatz (1984), it is also empirically valid. For this study, the four short passages were chosen from different texts of similar readability and interest levels, using the Fry (1985) and Flesch (as described in Klare, 1984) indices. The readability estimates of the texts from which segments were chosen ranged from grade 6 to 8 on the Fry index and from 6.7 to 9.6 on the Flesch index. Although the two indices use different scales, these figures are noteworthy only in that they indicate variation in the readability levels of the passages used (Brown, 1993). C-test 1 was constructed with 25 items from each of the four passages, making a total of 100 items. The first and last sentences of each passage were left intact to provide contextual clues.
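The paper cites Flesch readability estimates without giving the formula. Since the reported values (6.7 to 9.6) are on a grade-level scale, the sketch below assumes the Flesch-Kincaid grade-level variant of the Flesch family; this is an assumption, as Klare (1984) describes several Flesch formulas. Word, sentence, and syllable counts are taken as given, since syllable counting is the genuinely hard part.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid grade level, one common Flesch-family readability
    index (assumed here; the paper does not specify which variant it used).
    Higher values indicate harder text."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
```

For example, a 100-word passage with 10 sentences and 130 syllables comes out at roughly grade 3.7, well below the 6.7 - 9.6 range of the passages used here.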
The second type of C-test was adopted for use in this study based on Mochizuki's (1994) research, which used four types of discourse (description, exposition, narration, and argumentation) to construct C-tests for classroom use. Among these four types of text, the narrative type was found to be the most reliable (.92). The narrative text "The Lock Keeper", consisting of 120 items, was the one found to have the highest reliability and the highest concurrent validity (Mochizuki, 1995).
This study attempts to investigate which of these two types of C-test construction yields higher reliability and concurrent validity. The external criterion used was the STEP-Eiken exam, which consists of 66 written test questions on vocabulary, grammar, and reading comprehension. Previous investigations with Japanese university students have established that the STEP-Eiken is highly reliable and yields high coefficients as an external validating criterion (Kimura, 1995). In a previous study using STEP-Eiken and CELT results to investigate the external validity of C-tests constructed from different types of discourse, the STEP-Eiken was found to have a higher reliability (.778) than the CELT (.638) and the C-tests themselves (Mochizuki, 1994).


Procedure

Each student in the two groups took the two versions of the C-test and the STEP-Eiken. To control for a potential order effect, the order of administration was counterbalanced: half the subjects in each group took the two C-tests first and the STEP-Eiken during the English class the following week, while the other half took the STEP-Eiken first and then the two C-tests.

[ p. 5 ]


Analyses

The students' responses to both C-test 1 and C-test 2 were scored for exact replacements. Descriptive statistics for the C-test scores were obtained, and reliability coefficients were computed by the KR-20 method. The use of KR-20 has been questioned in the past: Farhady (1983) and Bachman (1990), for instance, claim that internal-consistency reliability coefficients are inappropriate for cloze and C-tests because of the interdependence of items. On the other hand, Woods (1984), Henning (1987), and Jafarpur (1995) claimed that the KR-20 method yields the same results as Cronbach's alpha. Moreover, Brown (1983) provided evidence that the differences between KR-20 and Cronbach's alpha reliability coefficients are negligible.
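The KR-20 coefficient itself is a simple computation over dichotomously scored items; a minimal plain-Python sketch (a statistics package would normally be used):

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for dichotomous (0/1) item scores.
    `responses` is a list of examinees, each a list of 0/1 item scores.
    Uses population variance of the total scores, per the formula."""
    n_items = len(responses[0])
    n_people = len(responses)
    totals = [sum(person) for person in responses]
    mean_t = sum(totals) / n_people
    var_t = sum((t - mean_t) ** 2 for t in totals) / n_people
    # Sum of p*q (proportion correct times proportion incorrect) per item.
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(person[i] for person in responses) / n_people
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / var_t)
```

A perfectly Guttman-ordered response pattern such as `[[1,1,1],[1,1,0],[1,0,0],[0,0,0]]` gives a KR-20 of .75.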
To address the first research question, that of determining the discriminative power of the C-tests, the subjects' mean scores on each test were compared between the two groups using t-tests and analysis of variance.
For the second research objective, determining the reliability of the C-tests and their correlation with the STEP-Eiken, Pearson product-moment correlation coefficients were computed.
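The Pearson product-moment coefficient used for this second analysis can likewise be sketched in plain Python:

```python
def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length
    score lists; returns a value in [-1, 1]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```

Two perfectly proportional score lists correlate at r = 1.0; in the study, each C-test's scores would be correlated against STEP-Eiken scores this way.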

Results and discussion
Table 1. Basic descriptive statistics for non-returnees' scores on the C-test 1, C-test 2 and the STEP-Eiken tests.
Test type               N    No. of Items     Mean     Reliability *

C-test 1                60      100            61      .67     .73
C-test 2                60      120            98      .70     .83
STEP-Eiken              60      160            109     .75     .85

    * Raw score reliabilities (KR-20) appear on the right and reliabilities that would 
      be observed if all the tests contained 100 items appear on the left.

Table 2. Basic descriptive statistics for returnees' scores on the C-test 1, C-test 2 and STEP-Eiken tests.

Test type               N       No. of items    Mean     Reliability *
C-test 1                30      100              74     .65     .76
C-test 2                30      120             109     .71     .89
STEP-Eiken              30      160             124     .87     .91
    * Raw score reliabilities (KR-20) appear on the right and reliabilities that 
      would be observed if all the tests contained 100 items appear on the left.
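The table footnotes report reliabilities adjusted to a common length of 100 items. The paper does not name the adjustment, but the standard tool for projecting reliability to a different test length is the Spearman-Brown prophecy formula; the sketch below assumes that is what was used.

```python
def spearman_brown(r, k_old, k_new):
    """Spearman-Brown prophecy formula: predicted reliability if a test
    of k_old items were lengthened (or shortened) to k_new items.
    Assumed to be the length adjustment behind the tables above."""
    n = k_new / k_old
    return n * r / (1 + (n - 1) * r)
```

For instance, doubling a 50-item test with reliability .50 to 100 items predicts a reliability of about .67, while shortening a 160-item test lowers its predicted reliability.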

- continued -


[ p. 6 ]