This paper has three purposes. Firstly, it discusses these three assessment contexts:
(1) the 1999 Tokyo Senior High School entrance examination English test,
(2) a proposal to include speaking tests in entrance examinations, and
(3) some information about the assessment of speaking skills in junior high schools in relation to Bachman and Palmer's notion of 'usefulness' (1996).
Secondly, it identifies issues in school-based assessment by junior high school English
teachers through a questionnaire. It also reports the results of a Rasch analysis
using empirical data derived from test trials undertaken by junior high school students.
Finally, this paper argues for the need to build up a 'task bank' for future versions of the senior high school exam. It also
emphasizes the importance of introducing speaking tests in entrance examinations for senior high schools.
The 'usefulness' of the current English entrance examination
Nearly 80% of the 1999 version of the Tokyo Senior High School English Test focused on reading skills and
grammar knowledge, and it was in a multiple-choice format. To its credit, the test scores
were reliable. However, since the entrance examination did not include any assessment
of speaking skills, it lacked authenticity.
Figure 1. The proportion of skills tested in the 1999 version of the English section of the Tokyo senior high school entrance examination.
The 'indirect speaking tests' in Section 2 of this test were low on interactiveness because students
were only required to select the English sentence which suited a given scenario most
appropriately. In terms of test impact, this test does not encourage students or teachers
to focus on oral/aural skills. In terms of practicality, however, the current paper-and-pencil English
examination scores highly. Hence the 1999 version of the test has two strong points
(reliability and practicality) and four weak points (construct validity, impact, authenticity
and interactiveness), as depicted in Figure 2.
Figure 2. The author's assessment of the usefulness of the 1999 English test.
Figure 3 represents the author's evaluation of a hypothetical test with a speaking component. Though it may have
less formal reliability than the current English test, it would be superior in terms of authenticity.
McNamara (1996) has noted that speaking tests tend to have
many inherent variables which reduce reliability, such as rater behaviour and interlocutor
variation. In terms of authenticity, however, the inclusion of speaking
tests could be a genuine boon since the test would reflect the curricular content. As
speaking tests could engage students in completing tasks interactively, such
tests would be more interactive as well. Introducing speaking tests in the
entrance examination would have a great impact on teachers and students, as several
studies (e.g. Shohamy, Donitsa-Schmidt, and Ferman, 1996; Cheng, 1997) suggest. As
speaking tests require greater resources to administer, however, a test with
a speaking component may be low in terms of practicality.
Figure 3. The author's assessment of the usefulness of the proposed English test.
As studies by Brindley (1999) indicate, the reliability of school-based
assessment tends to be low. The construct validity could potentially be high, as Hamp-Lyons (1996) claims.
She argues that portfolio assessment tends to have more task validity than
traditional tests. Authenticity and interactiveness could be potentially high because
school-based assessment can provide ample opportunity to speak.
However, these judgments need to be made with caution because results may vary significantly
depending on teachers and teaching styles.
Practicality seems to be the main reason that tests do not currently have a speaking component.
Data collection methods
(1) How do junior high school teachers assess their students' speaking skills?
(2) What impact would the introduction of a speaking test in the entrance exams have on teaching?
(3) To what extent are tasks (speech, role-play, description and interview) different in terms of difficulty?
(4) To what extent do speaking test items and tasks correlate in terms of Rasch measurement?
(5) How well did the test population's performances fit in terms of Rasch measurement?
Data collection 1: A questionnaire survey
A questionnaire survey was designed to address research questions 1 and 2.
Approximately 600 questionnaires were distributed to junior high school English
teachers in Tokyo. The questionnaire was completed by 199 junior high school teachers,
representing a response rate of approximately 33%.
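The response rate can be checked from the two counts given above. The short Python sketch below is illustrative only: the function name is hypothetical, and because the number of questionnaires distributed is reported as approximate, the resulting figure is approximate too.

```python
def response_rate(returned: int, distributed: int) -> float:
    """Return the response rate as a percentage."""
    if distributed <= 0:
        raise ValueError("distributed must be positive")
    return 100.0 * returned / distributed


# Counts from the text: 199 completed out of approximately 600 distributed.
rate = response_rate(199, 600)
print(f"{rate:.1f}%")  # prints 33.2%
```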
Data collection 2: Test trials
Four of the five most popular tasks (speech, role-play, description, and oral interview),
that is, all except the information gap task, were used for a trial test. All tasks had
a 5-minute completion time, including explanations of the test procedures.
Test-takers were Japanese junior high school students ranging from age 14
(second-year students) to age 15 (third-year students). A total of 219 students at twelve schools
participated in this test trial. Each student undertook two of the four tasks,
yielding a total of 438 student performances.
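The structure of this design can be sketched in a few lines of Python. The sketch enumerates the task pairings that are possible when each student takes two of the four tasks and confirms the total number of performances. The variable names are illustrative, and the text does not state how students were distributed across pairings, so no allocation is assumed here.

```python
from itertools import combinations

# The four trial tasks named in the text.
TASKS = ["speech", "role-play", "description", "interview"]

# Every unordered pair of distinct tasks a student could have been given.
task_pairs = list(combinations(TASKS, 2))

n_students = 219          # from the text
tasks_per_student = 2     # from the text
n_performances = n_students * tasks_per_student

print(len(task_pairs))    # prints 6
print(n_performances)     # prints 438
```

With only two tasks per student, comparisons between tasks rest on the students who happen to share a pairing, which is one reason task difficulty matters for the fit results reported later.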
The 13 interlocutors (12 Japanese English teachers at the participants' schools and the
researcher) administered different tasks to the students.
Raters and scoring criteria
Five independent Japanese senior high school English teachers, each with more than 10
years' teaching experience, rated students' performances from tape recordings. Scoring
criteria consisted of 5 items (fluency, vocabulary, grammar, intelligibility and overall task
fulfillment). The items were rated on a 0-to-5-point scale according to different levels of
performance described for each item.
Research Question 1 ascertained what percentage of English teachers assessed
students' speaking ability using 'direct speaking tests'. Those who conducted 'direct
speaking tests' amounted to 57.3% (n=114), while 42.7% (n=85) chose not to
administer speaking tests. However, further analysis shows that
combinations of other assessment methods, such as class observation and pencil-and-paper tests,
were frequently used. Results revealed that the majority of English teachers
assessed students' speaking skills based on classroom observation in combination with other methods.
Research Question 2 investigated what impact the introduction of speaking
tests would have on Japanese English teachers. More than 75% of the teachers reported that
speaking tests would have an impact on them, while
20% stated that there would be little or no impact on their teaching. Responses to
this question suggested that the introduction of speaking tests in entrance examinations
would have a positive impact on teachers and their teaching activities, in that the majority
of teachers would change their teaching styles towards improving students' communicative skills.
Now let us look at the test scores from a Rasch perspective.
Difficulty of items and tasks
Research Question 3 investigated the difficulty of the four tasks and of the items within each task.
The difference between the most difficult and the easiest tasks was approximately 1.5 logits.
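To give a sense of what a gap of about 1.5 logits means, the following Python sketch uses the simplest (dichotomous) form of the Rasch model. This is an assumption made for illustration: the trial ratings were on a 0 to 5 scale, so the analysis reported here presumably used a polytomous extension whose details are not given. The function name and the hypothetical student are likewise illustrative.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Dichotomous Rasch model: probability of success given ability and
    difficulty, both expressed in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


gap = 1.5  # approximate difficulty gap reported in the text, in logits

# A difference of `gap` logits multiplies the odds of success by exp(gap).
odds_ratio = math.exp(gap)

# Hypothetical student whose ability equals the difficulty of the easiest
# task (placed at 0 logits); the hardest task then sits at 1.5 logits.
p_easy = rasch_probability(0.0, 0.0)
p_hard = rasch_probability(0.0, gap)

print(f"{odds_ratio:.2f}")  # prints 4.48
print(f"{p_easy:.2f}")      # prints 0.50
print(f"{p_hard:.2f}")      # prints 0.18
```

Under these simplifying assumptions, the same student would have even odds of success on the easiest task but odds roughly 4.5 times lower on the hardest one.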
Fit indexes across four tasks
Research Question 4 examines the quality of items, and the extent to which data
patterns predicted by the Rasch model differ from those of the actual data. Unexpected
items that the Rasch model identifies are called either 'misfit' or 'overfit' items. The
acceptable range of the infit mean square (IMS) here is from 0.70 to 1.30. Only one item, Item 15, was identified
as 'misfit', showing a value larger than the acceptable range of IMS in the sixth and seventh
columns. This shows that the actual data patterns from Item 15 vary unacceptably in
comparison with data patterns estimated by Rasch measurement. Thus the items on the
four tasks appeared to produce relatively similar response patterns, suggesting that the
items across tasks assessed a similar construct.
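The classification rule described above can be sketched in Python as follows. The 0.70 to 1.30 range comes from the text; everything else is an assumption for illustration. In particular, the infit formula shown is the standard information-weighted mean square for dichotomous (0/1) responses, whereas the ratings in this study were on a 0 to 5 scale, and the function names and example values are hypothetical rather than taken from the study's data.

```python
LOWER, UPPER = 0.70, 1.30  # acceptable infit mean square range given in the text


def infit_mean_square(responses, expected):
    """Information-weighted mean square for one dichotomous item.

    responses: observed 0/1 scores, one per person.
    expected:  model probabilities of scoring 1, one per person.
    """
    squared_residuals = sum((x - p) ** 2 for x, p in zip(responses, expected))
    information = sum(p * (1 - p) for p in expected)
    return squared_residuals / information


def classify_fit(ims: float) -> str:
    """Label an infit mean square value against the acceptable range."""
    if ims > UPPER:
        return "misfit"    # more variation than the model predicts
    if ims < LOWER:
        return "overfit"   # less variation than the model predicts
    return "acceptable"


# Hypothetical values, not taken from the study's results.
for value in (1.45, 1.00, 0.62):
    print(value, classify_fit(value))
```

Values at exactly 0.70 or 1.30 are treated as acceptable here; the text does not say how boundary cases were handled.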
Person fit indexes
The last question focuses on students' scores across the four tasks. This is
particularly important, since this question leads to issues of accountability for students.
Of the students, 5.4% were identified as misfitting.
This exceeds the acceptable
percentage of misfit students. It is important to investigate why this happened.
The task combinations which produced misfit students most frequently were
speech and interview, followed by description and interview. Other
task combinations produced fewer misfit students than these two. One
possible explanation is that differences in task difficulty within a combination might
have increased the number of misfit students.
Results of the questionnaire survey revealed that teachers' assessment methods
varied, suggesting that it would be difficult to compare students' speaking ability across
schools. The introduction of speaking tests might have a positive impact on approximately
80% of teachers, and most teachers maintained that they would change to a more
communicative style of teaching. Thus, it can be argued that the inclusion of the speaking
tests would have the potential to assist in bridging a gap between skills taught in classes
and skills tested in entrance examinations, and between goals of the guidelines and assessment policy.
Results from test trials undertaken by junior high school students showed that all
items except one fitted Rasch measurement, indicating that items on each task were
effective in assessing the target construct. However, results also showed that the four
tasks frequently used by English teachers were different in terms of difficulty. This means
that students who undertake tasks of varying difficulty might not be assessed
appropriately. Given that variables, including rater behaviour and interlocutors, are inherent
in performance tests, the difficulty of tasks needs to be relatively equal in order to reduce
such variability. The concept of a 'task bank', presented by Brindley (2001), could have important
implications for the introduction of formal speaking tests in entrance examinations.
The implications of this study are that speaking tasks used in the classroom need to
be trialled, and also investigated with Rasch measurement, given that school-based
assessment represents half of the selection procedure for students who wish to enter senior
high schools. In junior high school contexts, a bank of role-play tasks, such as a shopping
situation, inviting friends to a party, or giving directions to a stranger, could be developed.
In order not only to administer speaking tests in a high-stakes context, but also to enable
teacher-implemented assessment to be comparable across schools, it would be necessary
to investigate tasks with Rasch techniques, based on empirical data, and to build up a
'task bank' of relatively consistent quality.
Akiyama, T. (2001). The application of G-theory and IRT in the analysis of data from
speaking tests administered in a classroom context. Melbourne Papers in Language Testing, 10 (1), 1-22.
Alderson, J. C. & Wall, D. (1993). Does washback exist? Applied Linguistics, 14, 115-129.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford University Press.
Bachman, L. F. & Palmer, A. S. (1996). Language Testing in Practice. Oxford University Press.
Brindley, G. (2001). Outcome-based assessment in practice: some examples and emerging insights. Language Testing, 18 (4), 393-407.
Cheng, L. (1997). How does washback influence teaching? Implications for Hong Kong. Language and Education, 11 (1), 38-54.
McNamara, T. F. (1996). Measuring second language performance. London and New York: Addison-Wesley Longman.
Messick, S. (1996). Validity and washback in language testing. Language Testing, 13 (3), 239-256.
Shohamy, E., Donitsa-Schmidt, S., & Ferman, I. (1996). Test impact revisited:
washback effect over time. Language Testing, 13 (3), 298-317.