Shiken: JALT Testing & Evaluation SIG Newsletter
Vol. 1 No. 1 Oct. 1996 (p. 12 - 19) [ISSN 1881-5537]

Student evaluation of teachers:
Professional practice or punitive policy?

T. L. Simmons

Teachers are evaluated by various methods. At present, the manner in which teachers are evaluated in Japan is changing, in part because of social and legal pressures (Shiozawa, 1995). Labour Standards Laws in this country, for example, contain directives to refrain from dismissal unless there is reasonable need to do so (Sugeno, 1992, pp. 395-412). Although these laws are not necessarily being applied strictly in the workplace and in the courts, we are seeing a greater sensitivity to human rights that will undermine many of the overtly discriminatory practices now common. In place of invidious discrimination enacted with impunity, we will see greater sophistication in the way decisions are made to obstruct the professional activities of some teachers and to dismiss others for responding to the ethical considerations of their professional commitments. One fairly common sort of evaluation, which may be used with the best intentions but often facilitates the most common abuses, is the use of student opinion in decisions that affect teachers.

In spite of the evidence

In 1993, H. T. Tagomori, at the University of San Francisco, completed his doctoral dissertation in education on instruments used for student evaluation of faculty. He established that universities and colleges appraise a professor's teaching effectiveness with instruments that they design, borrow, or adapt from other universities and colleges. The reliability of the instruments used is generally unknown. A comprehensive content analysis of faculty evaluation instruments had not previously been conducted. As a result, faculty members in higher education may be evaluated with flawed evaluation instruments, conceivably leading to unfair assessment of their teaching performance.
Tagomori carried out a content analysis of the 4,028 evaluation items contained in 200 evaluation instruments. His analysis revealed that 54.6% of the items were ambiguous, unclear and/or subjective. Another 24.5% of the items did not correlate with classroom teaching performance. Altogether, 79.1% of the items were either flawed or did not relate to teaching performance. The content analysis also revealed that 58% of the 200 evaluation instruments contained responses to evaluation items that were ambiguous, positively skewed or negatively skewed. Based on a frequency-count recording and frequency distribution of the data, the study concluded that evaluation instruments in their present form are unreliable. It would seem clear that current SETEs (Student Evaluations of Teacher(s) (Effectiveness)) used to evaluate teaching performance at universities and colleges must be systematically revised. I would add that the purpose for their use must also be constructively addressed.
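To give a sense of scale, the percentages reported above can be turned back into approximate item counts. The short sketch below is only an illustration: the source reports percentages, not raw counts, so the rounded counts it prints are estimates, and the variable and function names are invented for this example.

```python
# Figures reported in the text for Tagomori's (1993) content analysis.
TOTAL_ITEMS = 4028          # evaluation items analysed
TOTAL_INSTRUMENTS = 200     # evaluation instruments analysed

PCT_AMBIGUOUS = 54.6        # ambiguous, unclear and/or subjective items
PCT_UNRELATED = 24.5        # items not correlated with classroom teaching
PCT_SKEWED_INSTRUMENTS = 58.0


def approx_count(percent: float, total: int) -> int:
    """Convert a reported percentage into an approximate whole count.

    The result is an estimate: the article gives percentages only.
    """
    return round(percent / 100 * total)


# The two item categories are reported as adding up to one combined share.
pct_flawed = round(PCT_AMBIGUOUS + PCT_UNRELATED, 1)

flawed_items = approx_count(pct_flawed, TOTAL_ITEMS)
remaining_items = TOTAL_ITEMS - flawed_items
skewed_instruments = approx_count(PCT_SKEWED_INSTRUMENTS, TOTAL_INSTRUMENTS)

print(f"{pct_flawed}% flawed or unrelated: about {flawed_items} of {TOTAL_ITEMS} items")
print(f"Items left over: about {remaining_items}")
print(f"Instruments with problematic responses: about {skewed_instruments} of {TOTAL_INSTRUMENTS}")
```

On these figures, roughly four items in five fell into one of the two problem categories, which is the basis for the conclusion that the instruments are unreliable in their present form.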

[ p. 12 ]

What exactly is being evaluated?

O'Connell and Dickinson (1993) stated that while it is well known that factors other than the instructor's teaching influence student ratings of instruction, not all of these sources of influence have yet been identified. In addition, the correspondence between the amount of student learning and student ratings has not been clearly established. O'Connell and Dickinson also noted that although SETEs are one of the most common processes used in decisions on promotion, tenure and other benefits, their reliability is so low that they should not be used for judging individual performance (Stedman, 1983).

Who is being evaluated and who is evaluating?

Student evaluations of teacher effectiveness (SETEs) are, at best, nothing more than evaluations of the students' perceptions of the teachers' effectiveness. It should be intuitively apparent that the opinions expressed are subject to a great many variables that may have little or nothing to do with the teachers' ability to teach. The problems of SETEs are many, and SETEs do not categorically represent a professional endeavour within education. SETEs must be justified in design, administration, analysis, interpretation and application of the interpretation. There is no published research to show that this is a task institutions in Japan are willing or prepared to undertake.

Variables that affect students' opinions

Class type

Class type is usually defined by such characteristics as lecture, combined lecture/discussion, and laboratory. In a study involving 5,483 evaluations of 242 different classes, Wigington, Tollefson and Rodriguez (1989) showed that instructors of discussion classes had higher overall ratings than instructors of other class types, and that within a class type smaller classes were rated higher than larger ones. There were no differences among class types in comparisons of class rank (lower, upper or graduate). But they did show that in some class types the more experienced instructors got the lower ratings, while in others the least experienced got the lowest: instructor rank apparently interacts with class type. Their analysis of the interaction of class type and instructor gender showed that women got better ratings in lecture, discussion and laboratory classes but received lower scores in lecture/discussion classes.

[ p. 13 ]

Instructor rank & reputation

Instructor rank can be differentiated as graduate teaching assistant (GTA) and the various levels of professor. The GTA's position as part-time teacher of what are usually lower division classes may be filled in Japan by adjunct lecturers, although adjunct faculty usually have graduate degrees or doctorates. Research by Wigington, Tollefson and Rodriguez (1989) into the interaction of class size and instructor rank showed that ratings for some ranks were highest in small classes, lower in mid-sized classes and somewhat higher again in large classes, a 'U' shaped profile. In other cases, class size showed an inverse relationship with ratings for the instructor's rank: the bigger the class, the lower the ratings. There were some interesting variations in their study of the interaction of instructor rank and instructor gender. At both ends of the ranking system, graduate assistants and full professors, women got higher ratings than men of the same rank, but in the middle ranks of assistant professor and associate professor there was no difference between men and women. In their enquiries into the interaction of class level and instructor rank, Wigington, Tollefson and Rodriguez (1989) found significant variation; ratings were somewhat directly proportional to class level for assistant professors and professors but 'U' shaped for graduate assistants (higher at both ends of the graph than in the middle) and 'bell' shaped for associate professors. Comparatively, GTAs had higher scores than professors in lower level classes and lower scores in upper division classes. Full professors did not evince the highest scores at any level.
The reputation of the instructor among the students may also be a factor. Wigington, Tollefson and Rodriguez (1989) found that students often take a class on the basis of the instructor's reputation. When this happened, the ratings were higher for lecture and laboratory class types. The interaction of reputation and rank showed the highest variation for professors. They received the highest scores from students who took classes because of the instructor's reputation, but they were also the lowest among the ranks when the students did not take the class because of the instructor's reputation. GTAs showed the least variation (Wigington, Tollefson & Rodriguez, 1989).

Students' conceptions of performance

In judging a performance, the raters (students in this case) may be biased by their beliefs as to what constitutes a job well done. If that is not enough of a quandary, psychometrically oriented studies have largely ignored the fact that performance judgements are made with incomplete information: the students must recall or infer performance behaviour (Kishor, 1995). We are asked to believe that students have a realistic idea of how a foreign language is taught well and that they can judge it from memory. They are, in short, being asked questions that they have not necessarily considered and asked to answer them accurately after the fact in a matter of perhaps one hour.

[ p. 14 ]


Gender

Many major reviews of student evaluations conclude that gender does not have a significant effect (Seldin, 1993; Marsh & Dunkin, 1992). However, Basow (1995) provides sufficient data from the literature to show that many of these studies examine only the main effects. The effect of gender varies with the gender of the student as well as the teacher, the gender typing of disciplines, status of the professor (e.g. tenured vs. non-tenured), teaching styles, student year, student grade point average, student grade expectations, number of years the teacher has taught, the hour the class is taught, student perceptions of the teacher's speech, thought stimulation, non-repetition, and overall rating. The point that Basow makes is that for individual teachers, the results of the interaction of numerous variables, negligible alone, may be significant if the influences occur simultaneously. "Anyone using student evaluations should have a sophisticated understanding of how gender variables may operate in such ratings." (Basow, 1995).

Class size

The effect of class size is uncertain (Wigington, Tollefson & Rodriguez, 1989). Feldman (1978, 1984) found no significant effect, whereas Smith and Glass (1980) and Whitten and Umble (1980) found that instructors of smaller classes have higher ratings.
Using a demarcation of small as 25 or fewer, mid-sized as 26-49, and large as any class over 50, Wigington, Tollefson and Rodriguez (1989) found that the interaction of instructor gender and class size produces better ratings for women in small classes and for men in large classes. The interaction of class level (lower, upper or graduate division) and size showed higher ratings for instructors of large upper division classes in comparison with mid-sized classes.
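The size bands used in that study can be written out as a simple classification rule. The sketch below is illustrative only and the function name is invented here. Note one assumption: the demarcation as reported leaves a class of exactly 50 unassigned (mid-sized ends at 49 and large begins "over 50"), and this sketch treats 50 as large.

```python
def class_size_band(enrolment: int) -> str:
    """Return the size band for a class, following the reported demarcation.

    small:     25 students or fewer
    mid-sized: 26 to 49 students
    large:     over 50 students (a class of exactly 50 is not covered by
               the reported bands; treating it as large is an assumption)
    """
    if enrolment < 1:
        raise ValueError("enrolment must be a positive number of students")
    if enrolment <= 25:
        return "small"
    if enrolment <= 49:
        return "mid-sized"
    return "large"


# Example: band a handful of hypothetical class sizes.
for n in (12, 25, 26, 49, 50, 120):
    print(n, class_size_band(n))
```

Stating the boundaries this explicitly matters when comparing studies, since findings on class size depend on where each study draws the line between small and large.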

Class level

Classes can be ranked by division: lower, upper, and graduate. The research on the effects of class level on ratings is ambiguous (Wigington, Tollefson & Rodriguez, 1989). Romeo and Weber (1985) and others found that there was no difference in ratings between instructors of higher and lower level classes. Cranton and Smith (1986) and others found that instructors of upper level classes received higher ratings.
Wigington, Tollefson and Rodriguez (1989) found that while instructors of lower level classes received lower ratings than those of higher level classes, there was no significant difference between men and women in lower levels. However, in upper division or graduate courses men got higher ratings than women.

- continued -

[ p. 15 ]