Developing classroom specific rating scales: Clarifying teacher assessment of oral communicative competence
James Venema (Nagoya Institute of Technology)
This paper documents the development of a classroom specific rating scale for grading student performance on oral tests conducted in groups of three or four students. The paper outlines the rationale behind the development of such scales, including three parameters of assessment: English skills, communication skills, and content. The process of developing analytic rating scales and incorporating judgements of actual elicited samples of speech is an opportunity for teachers to make explicit their assumptions regarding oral communicative competence, and a step towards reducing the ambiguity inherent in oral assessment.
Keywords: group oral testing, analytic rating scales, oral communicative competence
For teachers conducting oral testing in their classes, the development of rating scales corresponding to each grading level presents
a more transparent alternative to impression marking. However, the validity and reliability difficulties in implementing rating
scales are well documented (Fulcher, 1987; Matthews, 1990; Upshur and Turner, 1995). The essential problem with implementing
rating scales seems to be that "ratings involve subjective judgements in the scoring process..." (Bachman and Palmer, 1996, p. 221). Even where raters arrive at similar ratings using identical scales, it is not at all clear that they are doing so for similar reasons (see Douglas and Selinker, 1992 and 1993). The development of classroom specific analytical scales, including specific descriptions of performance, represents an opportunity to make explicit teacher assumptions regarding communicative competence and the goals of the course.
According to Bachman and Palmer (1996), an advantage of analytical rating scales, which include descriptions of different levels of performance, is that they "tend to reflect what raters actually do when rating samples of language use" (p. 211). The emphasis on detailed description of performance suggests that developing such scales in the absence of performances to reflect on may be less revealing. In fact, the scales may shed little light on the actual judgements made by the teacher in assigning grades for a given performance. In order to reduce the subjectivity of rater decisions, Bachman and Palmer call for the careful training of raters, including repeated clarification and comparison of ratings given for specific performances. Douglas (1994) calls for more "think aloud studies of rating processes" and studies "focusing on the basis of raters' judgements, both in terms of what features of discourse language users attend to in making judgements of communicative ability, and in terms of the rating scales the raters may have been trained to use" (p. 136). Upshur and Turner (1999) and others suggest that "empirical procedures be used in the development of rating scales and that rating scales be task-specific" (p. 107). For the teacher developing and implementing classroom specific rating scales, the challenge lies not in achieving reliability across multiple raters. Instead, the main problem is making explicit assumptions regarding oral communicative competence, and demonstrating the relevance of those theoretical constructs in assessing actual samples of elicited student performance on the given oral test. The goal remains the same: developing scales that reflect the real, rather than the assumed, basis of rater evaluations. Recognizing the difficulty of this process, Underhill (1987) suggests that "the only solution is to adapt and improve the scales by trial and error" (p. 99).
The development of relevant rating scales would appear to be a dynamic and reflective process of clarifying the grading rationale in rating scales, applying the scales to assess student performance, and adapting them where necessary to account for the elicited performances. This paper seeks to describe such a process in the development of rating scales for two classes at a Japanese technical university.
The Classes and the Test
The two classes consisted, respectively, of 27 and 10 first-year students, the majority of whom were false beginners. The goal of the course was to develop students' ability to participate effectively in conversations by concentrating on a narrow selection of topics. The oral tests were conducted in groups of three students and simply consisted of the random selection of two topics (by drawing a card) from a total of four topics covered in detail in the course. Students were fully aware of the possible topics and were even allowed to bring notes of key vocabulary (although they were warned that any student obviously reading their way through the "conversation" would risk having their notes taken away). Students were given about four minutes to converse on each topic, and the results were recorded on video. The video was used as a means for repeated teacher assessment as well as for self-assessment, in which the students viewed the recorded exams before writing a self-evaluation of their own performance with reference to the rating scales developed by the teacher. One of the main advantages of a video recording was that, by allowing for observation of non-verbal aspects of communication, it provided greater richness in viewing and assessing student performance than a tape recording alone would have. As Murphey and Woo (1998) note, video recordings present the opportunity for "repeated viewing and noticing of linguistic and non linguistic features in the acts of communication" (p. 24).
The grading of student oral tests the previous semester relied on student-negotiated rating scales (see McClean, 1995, for a description of a student-negotiated grading scheme for oral tests). The students were allowed to develop the grading rationale, so I limited my role to facilitator and organizer with occasional input and suggestions. While useful from an educational and motivational viewpoint, the scales were a compromise of diverging interpretations and assumptions regarding performance. In fact, teacher and student grading differed significantly, with correlation coefficients of .45 and .41 for the two classes. (Correlation coefficients were computed with a calculator at this website:
http://ebook.stat.ucla.edu/calculators/correlation.phtml [Expired Link]). Intra-rater coefficients between live real-time ratings and later video ratings were high: .83 and .88, respectively. It is likely, though, that even for myself the rating scales played only a limited role in grading, as subjective interpretations of performance, not clearly expressed in the scales, remained consistent over repeated assessment. Since evaluating oral performance is a highly complex and multi-faceted endeavor, the question remained as to what formed the basis of my evaluations of student performance. In the current semester my focus was on developing rating scales that formed the real basis of my evaluations of student performance.
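Since the online calculator is no longer available, the same Pearson correlation coefficients can be computed directly. The sketch below is a minimal stand-alone illustration; the rating values in the example are invented for demonstration and are not the actual class data.

```python
# Minimal sketch: Pearson correlation between two sets of ratings,
# an offline substitute for the expired online calculator.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of ratings."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Illustrative data only: teacher grades vs. student self-assessed
# grades for seven students on a 1-5 scale.
teacher = [5, 4, 4, 3, 5, 2, 3]
student = [4, 4, 5, 3, 4, 3, 2]
print(round(pearson(teacher, student), 2))  # 0.64
```

The same function applied to a teacher's live and video ratings of the same performances would yield the intra-rater coefficients reported above.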
Some starting assumptions
The term communicative competence (Canale, 1988; Bachman, 1990) has been used to describe the multi-faceted skills required for the effective use of language. Bachman and Palmer (1996) note that effective language use requires both "organizational knowledge" (what is said) and "pragmatic knowledge" (how it is said). A speaker must not only demonstrate lexical and structural language knowledge, but also implement that knowledge effectively in real-time conversation. The implication would seem to be that, in the development of descriptive rating scales, static descriptions of language, lexical or structural, should be tempered by a recognition of the linear demands of conversation. One cannot expect demonstration of lexis or structure outside the topic boundaries of the conversation. Similarly, students should be given limited credit for the demonstration of language knowledge that does not contribute effectively to the unfolding discourse.
In my own oral exams, students are given considerable leeway in the direction their conversation takes, so the complexity of the ideas students try to express can vary widely between students and testing groups, even though the overall topic is predetermined. Students are encouraged and given due credit for the relative complexity of the discourse demonstrated. Indeed, Douglas and Selinker (1993) suggested that, in one study, differences in rater scores of grammar (where no evidence of real differences was found) were, in fact, a reflection of differing assessments of rhetorical complexity. In my own tests I encouraged students to attempt challenging discourse even where it would push at the boundaries of their language competence.
Towards the construction of rating scales
We then have the basis for three parameters of assessment: (1) content, (2) English skills, and (3) communication skills. The first reflects an assessment of the relative complexity of the discourse and topic chosen, while the latter two roughly parallel Bachman's organizational knowledge and pragmatic knowledge distinction, respectively. The final scales themselves were repeatedly adapted and revised through first (live) and later (video) markings of twelve groups of students. With three students interacting, real-time attention demands were simply too high to allow for the complete development and systemization of observations in a single live viewing.
The scales are not a theoretically complete description of communicative competence but rather a summary of the minimum criteria by which I felt the majority of performances could be fairly distinguished into three broad categories (A, B and C). A summary of the scales follows below:
C
- Demonstration of some comprehension of a topic and ongoing discourse, as well as some minimal contribution to that topic and discourse.
- Demonstration of interest in a topic through basic communication skills, expressions of interest, and eye contact.
- Demonstration of the ability to correctly formulate and interpret some simple expressions of meaning, such as asking and answering yes/no questions.

B
- Demonstration of some topic preparation - the student seemed prepared with something meaningful to contribute.
- Demonstration of active listening - the student responded verbally or otherwise to utterances and was able to communicate most intended meanings.
- Demonstration of adequate lexical and structural knowledge to correctly formulate and interpret many basic meanings.

A
- Demonstration of some detail and depth in discussion of a chosen topic, at a level of relative complexity for a false beginner.
- Demonstration of the succinct communication of intended meanings, effective elicitation of meaning, and skill in overcoming communication breakdowns.
- Demonstration of broader lexical and structural knowledge, allowing for the confident, succinct, and accurate expression and interpretation of most basic meanings as well as some more complex meanings.
The lowest level, C, was deliberately set at a level attainable simply by participating actively in the test. Two further grading options (AB and BC) were included to allow for the selection of appropriate grades when a given performance did not correspond to one of the three broad categories. F scores, representing failure, were defined simply as failure to attain the minimum C criteria in two or more parameters. In fact, no students were given failing marks.
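The failure rule just described (below the minimum C criteria on two or more parameters) lends itself to a simple decision sketch. The function below is a hypothetical illustration only, not the actual grading procedure, which relied on holistic judgement against the descriptors; the numeric scale and the averaging step are my own assumptions for the sake of the example.

```python
# Hypothetical sketch of the grading scheme described above.
# Each of the three parameters is rated on a five-step scale
# (C=1, BC=2, B=3, AB=4, A=5); 0 marks a performance below the C minimum.
# An F results only when two or more parameters fall below the C minimum;
# otherwise the overall grade here is the rounded average of the three
# ratings (the averaging step is an illustrative assumption, not a rule
# stated in the paper).

GRADES = {1: "C", 2: "BC", 3: "B", 4: "AB", 5: "A"}

def overall_grade(content, english, communication):
    ratings = [content, english, communication]
    if sum(1 for r in ratings if r == 0) >= 2:
        return "F"
    # A single below-C parameter is treated as C for averaging purposes.
    adjusted = [max(r, 1) for r in ratings]
    return GRADES[round(sum(adjusted) / 3)]

print(overall_grade(3, 4, 3))  # prints B
print(overall_grade(0, 0, 5))  # prints F
```

The point of the sketch is simply that the AB and BC options fall out naturally once the three parameters are rated on a common scale; any aggregation beyond the explicit F rule remains a matter of teacher judgement.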
Assessment of oral performance remains an interactive endeavor involving the discourse elicited as well as the assessor and rating scales in question.
For that reason, it may be more enlightening to view some transcripts of student performance along with my grading rationale.
To view some actual transcripts, please visit http://jalt.org/test/ven_t.htm.
The process of developing the scales, involving the repeated viewing of student performances, precluded any meaningful estimates of intra-rater reliability coefficients. However, the journey of assessment, reassessment, and careful elucidation of my own rationale in response to elicited student speech was an enlightening one. The very complexity of the targeted domain, oral communicative competence, as well as the lack of a consensus on the matter, makes it important for teachers engaging in oral assessment to make explicit their own grading rationale. An obvious advantage of analytic scales is the opportunity to become more explicit about performance criteria. In addition, analytical scales offer teachers chances to incorporate overall course priorities in their tests. This should, in turn, offer the opportunity for a beneficial backwash effect. However, analytical scales must do more than outline a teacher's theoretical construct of communicative competence. They must also form the actual and consistent basis for rater evaluations. If one is to claim with confidence that rating scales do indeed express the rationale behind one's actual grading choices, then explicit evaluations of student performance need to be an integral part of the development of the scales. The reflective development of rating scales offers one an opportunity to find out what one actually attends to while assessing oral communicative competence, rather than what one thinks one attends to. This is an important step in clarifying the inherently subjective evaluation process.
References
Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press.
Bachman, L. F. & Palmer, A. S. (1996). Language Testing in Practice. Oxford: Oxford University Press.
Brazil, D. (1995). A Grammar of Speech. Oxford: Oxford University Press.
Canale, M. (1988). The measurement of communicative competence. Annual Review of Applied Linguistics 8, 67-84.
Douglas, D. & Selinker, L. (1992). Analysing oral proficiency test performance in general and specific purpose contexts. System 20, 317-28.
Douglas, D. & Selinker, L. (1993). Performance on a general versus field-specific test of speaking proficiency by international teaching assistants. In Chapelle, C. and Douglas, D. (Eds.) A New Decade of Language Testing Research. Alexandria, VA: TESOL Publications.
Douglas, D. (1994). Quantity and quality in speaking test performance. Language Testing 11 (1) 125-44.
Enochs, K. & Yoshitake-Strain, S. (1999). Evaluating six measures of EFL learners' pragmatic competence. JALT Journal, 21 (1) 29-50.
Fulcher, G. (1987). Tests of oral performance: the need for data-based criteria. ELT Journal, 41 (4) 287-91.
Hughes, A. (1989). Testing for Language Teachers. Cambridge: Cambridge University Press.
Jungheim, N. O. (1995). Assessing the unsaid: the development of tests of nonverbal ability. In Brown, J.D. and Yamashita, S.O. (Eds.) Language Testing in Japan. Tokyo: The Japan Association for Language Teaching. 149-165.
Matthews, M. (1990). The measurement of productive skills: doubts concerning the assessment criteria of certain public examinations. ELT Journal, 44 (2) 117-21.
McClean, J. (1995). Negotiating a spoken-English scheme with Japanese university students. In Brown, J.D. and Yamashita, S.O. (Eds.) Language Testing in Japan. Tokyo: The Japan Association for Language Teaching. 136-48.
McDonough, S. H. (1995). Strategy and Skill in Learning a Foreign Language. London: Edward Arnold.
Nakamura, Y. (1995). Making speaking test valid: practical considerations in a classroom setting. In Brown, J.D. and Yamashita, S.O. (Eds.) Language Testing in Japan. Tokyo: The Japan Association for Language Teaching. 136-48.
Seedhouse, P. (1996). Classroom interaction: possibilities and impossibilities. ELT Journal, 50 (1) 16-24.
Underhill, N. (1987). Testing Spoken Language: A Handbook of Oral Testing Techniques. Cambridge: Cambridge University Press.
Upshur, J.A. & Turner, C.E. (1995). Constructing rating scales for second language tests. ELT Journal, 49 (1) 3-12.
Upshur, J.A. & Turner, C.E. (1999). Systemic effects in the rating of second-language speaking ability: test method and learner discourse. Language Testing, 16 (1) 82-111.