Student evaluation of teachers:
Professional practice or punitive policy?
T. L. Simmons
Teachers are evaluated by various methods. At present, the manner in which
teachers are evaluated in Japan is changing in some respects due to social
and legal pressures (Shiozawa, 1995). Labour Standards
Laws in this country, for example, contain directives to refrain
from dismissal unless there is reasonable need to do so (Sugeno, 1992, pp. 395-412).
Although these laws are not necessarily being applied strictly in the work place
and in the courts, we are seeing a greater sensitivity to human rights that will
undermine many overtly discriminatory practices we see commonly now. In the place
of invidious discrimination enacted with impunity we will see greater
sophistication in the manner in which decisions are made to obstruct professional
activities by some and to dismiss others for responding to the ethical
considerations of their professional commitments. A fairly common sort of
evaluation that may be used with the best intentions but often facilitates
the most common abuses is the use of student opinion in the decisions that affect
In spite of the evidence
In 1993, H. T. Tagomori, at the University of San Francisco,
completed his doctorate in education on instruments used for student evaluation
of faculty. He established that the assessments used by universities and colleges
to appraise a professor's teaching effectiveness are conducted through
instruments they design, borrow, or adapt from other universities and colleges.
The reliability of the instruments used is generally unknown. A comprehensive
content analysis of faculty evaluation instruments has not been conducted. As a
result, faculty members in higher education may be evaluated with flawed
evaluation instruments, conceivably leading to unfair assessment of their
teaching performance.
Tagomori did a content analysis of 4,028 evaluation items contained in the
200 evaluation instruments analyzed. His analysis revealed 54.6% of the items
were ambiguous, unclear and/or subjective. Another 24.5% of the items did not
correlate with classroom teaching performance. Altogether, 79.1% of
the items were either flawed or did not relate to teaching performance. The
content analysis also revealed 58% of the 200 evaluation instruments contained
responses to evaluation items that were ambiguous, positively skewed or
negatively skewed. Based on a frequency-count recording and frequency distribution
of the data, the conclusion for this study is that evaluation instruments used in
their present form are unreliable. It would seem clear that current SETEs
(student evaluations of teacher effectiveness) used for
evaluating teaching performance at universities and colleges must be
systematically revised. And I would add that the purpose for their use must also
be constructively addressed.
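
The percentages quoted from Tagomori can be checked with simple arithmetic. The sketch below is only an illustration: the percentages and totals are those given above, while the item and instrument counts it derives are rounded estimates implied by those percentages, not figures reported in the study, and the names used are hypothetical.

```python
# Figures as quoted above from Tagomori (1993).
TOTAL_ITEMS = 4028              # evaluation items analysed
TOTAL_INSTRUMENTS = 200         # evaluation instruments analysed
PCT_AMBIGUOUS = 54.6            # items ambiguous, unclear and/or subjective
PCT_UNRELATED = 24.5            # items not correlated with teaching performance
PCT_SKEWED_INSTRUMENTS = 58.0   # instruments with ambiguous or skewed responses


def approx_count(total: int, percent: float) -> int:
    """Return the rounded count implied by a percentage (an estimate only)."""
    return round(total * percent / 100)


# Combined share of items that were flawed or unrelated to teaching.
pct_flawed = round(PCT_AMBIGUOUS + PCT_UNRELATED, 1)

# Approximate counts implied by the quoted percentages (assumption: simple rounding).
items_ambiguous = approx_count(TOTAL_ITEMS, PCT_AMBIGUOUS)
items_unrelated = approx_count(TOTAL_ITEMS, PCT_UNRELATED)
items_flawed = items_ambiguous + items_unrelated
items_remaining = TOTAL_ITEMS - items_flawed
instruments_skewed = approx_count(TOTAL_INSTRUMENTS, PCT_SKEWED_INSTRUMENTS)

print(f"Flawed or unrelated items: {pct_flawed}% (about {items_flawed} of {TOTAL_ITEMS})")
print(f"Items left over: about {items_remaining}")
print(f"Instruments with ambiguous or skewed responses: about {instruments_skewed}")
```

On these figures, roughly one item in five (about 842 of 4,028) was neither flawed nor unrelated to teaching performance.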
What exactly is being evaluated?
O'Connell and Dickinson (1993) stated that while it is well
known that factors other than the instructor's teaching
influence student ratings of instruction, not all of the sources have yet been
identified. In addition, the correspondence between the amount of student learning
and student ratings has not been clearly established. O'Connell and Dickinson
also noted that although SETEs are one of the most common processes used in
decisions on promotion, tenure and other benefits, their reliability is so low
that they should not be used for judging individual performance (Stedman, 1983).
Who is being evaluated and who is evaluating?
Student evaluations of teacher effectiveness (SETEs) are, at best, nothing
more than evaluations of the students' perceptions of the teachers'
effectiveness. It should be intuitively apparent to most that opinions
expressed are subject to a great many variables that may have little or nothing
to do with evaluating the teachers' ability to teach. The problems of SETEs are
many, and SETEs do not categorically represent a professional endeavour within
education. SETEs must be justified in design, administration, analysis,
interpretation and application of the interpretation. There is no published
research to show that this is a task institutions in Japan are willing or
prepared to do.
Variables that affect students' opinions
Class type is usually defined by such characteristics as lecture, combined
lecture discussion, and laboratory. Wigington, Tollefson and Rodriguez
in a study involving 5,483 evaluations of 242 different classes (1989) showed
that instructors with discussion classes had higher overall ratings than other
class types and smaller classes showed a higher rating than larger within a
class type. There were no differences among class types in comparisons of class
rank (lower, upper or graduate). But they did show that in some class types the
more experienced instructors received the lower ratings, while in others the least
experienced received the lowest – instructor rank is apparently interactive with
class type.
The Wigington, Tollefson and Rodriguez study on the interaction of class type and
instructor gender (1989) showed that women got better ratings in lecture,
discussion and laboratory classes but received lower scores in lecture/discussion
classes.
Instructor rank & reputation
Instructor rank can be differentiated as graduate teaching assistant (GTA)
and various levels of professor. The GTA's position of part-time teacher in what
are usually lower division classes may be filled by adjunct lecturers in Japan
although the adjunct faculty usually have graduate degrees or doctorates.
Research by Wigington, Tollefson and Rodriguez (1989) into the
interaction of class size and instructor rank showed that ratings for some ranks
were highest in small classes, lower in mid-sized classes and somewhat
higher again in large classes – a 'U' shaped profile. In other cases, class
size showed an inverse relationship with the ratings for the instructor's rank –
bigger classes, lower ratings. There were some interesting variations in Wigington,
Tollefson and Rodriguez's study of the interaction of instructor rank and
instructor gender (1989). At both ends of the ranking system, graduate assistants
and full professors, women received higher ratings than men of the same rank, but
the middle ranks of assistant professor and associate professor showed that there
was no difference for men and women. In their enquiries into the interaction of
class level and instructor rank, Wigington, Tollefson and Rodriguez (1989) found
significant variation; it was somewhat directly proportional for assistant
professors and professors but 'U' shaped for graduate assistants (higher at both
ends of the graph than in the middle) and 'bell' shaped for associate professors.
Comparatively, GTAs had higher scores in lower level classes than did
professors and lower scores for upper division classes. Full-professors did not
evince the highest scores at any level.
The reputation of the instructor among the students may also be a factor.
Wigington, Tollefson and Rodriguez (1989) found that students often take a class
on the basis of the instructor's reputation. When this happened, the ratings were
higher for lecture and laboratory class types. The interaction for reputation
and rank showed the highest variation for professors. They received the highest
scores from students who took classes for the instructor's reputation, but their
scores were also the lowest among the ranks when the students did not take the class
because of the instructor's reputation. GTAs showed the least variation
(Wigington, Tollefson & Rodriguez, 1989).
Students' conceptions of performance
In judging a performance, the raters (students in this case) may be biased
by their belief as to what constitutes a job well done. If that is not enough
of a quandary, psychometrically oriented studies have largely ignored the fact
that performance judgements are made with incomplete information – the students
must recall or infer performance behaviour (Kishor, 1995). We
are asked to believe that the students have a realistic idea of how teaching a
foreign language is done well and that they can judge it from memory. They are,
in short, being asked questions that they have not necessarily considered and
asked to answer them accurately after the fact in a matter of perhaps one hour.
Many major reviews of student evaluations conclude that gender does not
have a significant effect (Seldin, 1993; Marsh & Dunkin, 1992).
However, Basow (1995) provides sufficient data from the
literature to show that many of these studies examine only the main effects.
The effect of gender varies with the gender of the student as well as the teacher,
the gender typing of disciplines, status of the professor (e.g. tenured vs.
non-tenured), teaching styles, student year, student grade point average, student
grade expectations, number of years the teacher has taught, the hour the class is
taught, student perceptions of the teacher's speech, thought stimulation,
non-repetition, and overall rating. The point that Basow makes is that for
individual teachers, the results of the interaction of numerous variables,
negligible alone, may be significant if the influences occur simultaneously.
"Anyone using student evaluations should have a sophisticated understanding of
how gender variables may operate in such ratings." (Basow, 1995).
The effect of class size is uncertain (Wigington, Tollefson & Rodriguez, 1989).
Feldman (1978, 1984) found no significant effect, whereas Smith and Glass (1980)
and Whitten and Umble (1980) found that instructors of smaller
classes have higher ratings.
Using a demarcation of small as 25 or fewer, mid-sized as 26-49, and
large as any class over 50, Wigington, Tollefson and Rodriguez (1989)
found that the interaction of instructor gender and class size produces
better ratings for women in small classes and men in large classes. The
interaction of class level (lower, upper or graduate division) and size
showed higher ratings for instructors in upper division, large classes
in comparison with mid-sized classes.
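
The size bands described above can be written out as a small classification function. This is a minimal sketch and not the procedure used in the study: the function name is hypothetical, and because the bands as quoted (25 or fewer, 26-49, over 50) do not say where a class of exactly 50 belongs, the sketch returns None for that case rather than guessing.

```python
from typing import Optional


def class_size_category(enrolment: int) -> Optional[str]:
    """Map an enrolment figure to the size bands quoted in the text.

    Returns None for an enrolment of exactly 50, which the quoted bands
    leave unassigned (an assumption of this sketch, not of the study).
    """
    if enrolment < 1:
        raise ValueError("enrolment must be a positive number of students")
    if enrolment <= 25:
        return "small"
    if enrolment <= 49:
        return "mid-sized"
    if enrolment > 50:
        return "large"
    return None  # exactly 50: not covered by the bands as quoted


for n in (12, 25, 26, 49, 50, 51, 120):
    print(n, class_size_category(n))
```

Anyone applying such bands to their own institution's data would need to decide, and state, where the boundary cases fall.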
Classes can be ranked by divisions, lower, upper, and graduate. The
research on the effects of class level on ratings is ambiguous (Wigington,
Tollefson & Rodriguez, 1989). Romeo and Weber (1985) and others found
that there was no difference in ratings between instructors of higher and
lower level classes. Cranton and Smith (1986) and others found
that instructors of upper level classes received higher ratings.
Wigington, Tollefson and Rodriguez (1989) found that while instructors
of lower level classes received lower ratings than those of higher level
classes, there was no significant difference between men and women in
lower levels. However, in upper division or graduate courses men got
higher ratings than women.