Shiken: JALT Testing & Evaluation SIG Newsletter
Vol. 10 No. 2. Dec. 2006. (p. 17 - 20) [ISSN 1881-5537]

Conference Report:
Tenth Annual Japan Language Testing Association Conference

The theme of this conference was "Language Testing in an Asian Framework". It was attended by approximately 130 scholars from six nations. There were 29 papers presented in concurrent sessions and a plenary speech about measurement scales by Dr. Bernard Spolsky, as well as a symposium on the conference theme by professors Steven Ross of Kwansei Gakuin University, Lianzhen He of Zhejiang University, Oryang Kwon of Seoul National University, and Yuji Nakamura of Keio University.
In an era of increasing interchange of information among Asian scholars, educators, and policy makers involved with English language teaching, learning, and assessment, the idea of formulating a unified Asian assessment framework is starting to gain ground. Many of the presentations at this conference highlighted the possibilities and limitations of developing a common assessment practice in Asia. The underlying, if not primary, issue in the presentations at this conference was that of fairness. One aspect of fairness is freedom from bias or error in assessment design and use. For the sake of brevity, this report highlights only a sample of what was offered, with an apology to those whose presentations are not summarized.

Plenary Address

In his plenary speech, Dr. Spolsky offered two deliberations. First, he asked whether the attempt to build a single scale for language proficiency would be possible in an Asian context. Secondly, he suggested participants reflect on whether a scale based on current Western models would force conformity rather than permit intelligent localized diversity.
The first question raised issues of accuracy, fairness, and utility. Dr. Spolsky presented a history of formal assessment whose roots extend to the Imperial Chinese civil service examinations. The Jesuits then brought the idea to the West. By the 19th century testing had spread through much of Europe. Spolsky outlined developments in 20th century America, then culminated his overview by mentioning the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR).
Throughout the 20th century tests have involved variations on the theme of control and a focus on the problems of validating the use of scales adopted to inform that control. Dr. Spolsky's allusion to 'unavoidable uncertainty' characterizes a dilemma in testing language proficiency unidimensionally, since there is a fair amount of evidence that linguistic abilities are in fact multidimensional. Warning against the use of any single measure 'framework', Spolsky recommended the use of profiles informed by multiple scales, which may include such devices as 'Can-Do' statements. He pointed out that the Council of Europe's recent CEFR project was a resource worthy of consideration. Given the various uses and purposes for language assessment in Asia, Spolsky suggested that a judicious ad hoc weighing of several scales may be appropriate.


The gist of Dr. Spolsky's second deliberation seemed to be a rhetorical invitation to consider not only whether our means of assessment are accurate, but also whether they are socially and ethically justifiable.

How do the TOEIC® and GEPT® fit in?

High-stakes testing for summative purposes remains a major area of controversy throughout Asia. Use of tests such as the TOEIC accounts for no small percentage of presentations at testing conferences here. Four studies that dealt with the validation of such tests from a variety of different approaches will be briefly mentioned.
A presentation by Steven Ross and TOEIC representative Hiromu Yamada addressed the need for greater care in conducting and reviewing institutional validation studies that use data-driven evidence to support claims about the TOEIC. They pointed out that the data or analysis used to support criticism of the TOEIC may be misleading if certain assumptions about random sampling, normal distribution, sample to population variance, and internal consistency (or reliability) of data samples are not tested beforehand. Dr. Ross warned that correlations between the TOEIC and other measures in institutional studies might have been deflated and need to be corrected for attenuation and truncation based on standard deviation and reliability estimates.
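The report does not say which correction formulas Dr. Ross had in mind. As a rough illustration only, the sketch below uses two standard textbook formulas: Spearman's correction for attenuation and Thorndike's Case 2 correction for direct range restriction (truncation). The function names and all numbers are hypothetical and are not taken from the presentation.

```python
import math


def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Spearman's correction: estimate the correlation between true scores
    from an observed correlation and the reliabilities of both measures."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must be in the interval (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


def correct_for_range_restriction(r_xy, sd_restricted, sd_unrestricted):
    """Thorndike Case 2: estimate the correlation in the full population
    when the sample was directly selected (truncated) on one variable."""
    if sd_restricted <= 0 or sd_unrestricted <= 0:
        raise ValueError("standard deviations must be positive")
    k = sd_unrestricted / sd_restricted
    return (r_xy * k) / math.sqrt(1 - r_xy ** 2 + (r_xy ** 2) * (k ** 2))


if __name__ == "__main__":
    # Hypothetical values: an observed correlation of 0.42 between a test
    # and a classroom measure with reliabilities 0.90 and 0.70.
    print(round(correct_for_attenuation(0.42, 0.90, 0.70), 3))
    # Hypothetical values: the institutional sample's SD is half that of
    # the wider test-taking population.
    print(round(correct_for_range_restriction(0.30, 50.0, 100.0), 3))
```

Both corrections raise the estimated correlation, which is why an uncorrected institutional study can understate the relationship between a test and other measures.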
In studies involving the use of gain scores, it is important to consider possible pitfalls and to build in a way of monitoring them at the very earliest stages of design. Aside from the need for use of parallel and equivalent forms across administrations, the total hours, quality, and continuity of instruction need to be coded into the database so that pooling learners who have undergone different instructional time frames does not muddy the results. Dr. Ross also pointed out that regression to the mean should be modeled and that learning time should co-vary with gain, but that 50 quality contact hours may be the minimum required to detect any measurable TOEIC score gain.
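The report does not describe how regression to the mean was to be modeled. One simple way to see the problem is the classical expectation that, with no true change, a retest score is pulled toward the group mean in proportion to the test's reliability. The sketch below assumes equal pre- and post-test variances and a stable group mean; the function names and scores are hypothetical.

```python
def expected_retest_score(pre, group_mean, reliability):
    """Score expected on retest if no learning occurred, given the
    test-retest reliability (assumes equal variances at both times)."""
    return group_mean + reliability * (pre - group_mean)


def adjusted_gain(pre, post, group_mean, reliability):
    """Observed post-test score minus the score expected from
    regression to the mean alone."""
    return post - expected_retest_score(pre, group_mean, reliability)


if __name__ == "__main__":
    # Hypothetical learner: pre-test 300, post-test 340, group mean 450,
    # test-retest reliability 0.90.
    print(340 - 300)                             # raw gain
    print(adjusted_gain(300, 340, 450, 0.90))    # gain beyond regression
```

In this made-up case, part of the raw gain of a low scorer is what regression alone would predict, so the gain attributable to instruction is smaller than it first appears.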
The TOEIC is arguably a domain-referenced test that was not intended to be as sensitive to instruction as, say, achievement or other criterion-referenced measures. However, with the kind of care proposed by the presenters, its usefulness for educational decision-making seems evident. As a third approach to the validation of institutional TOEIC use, Dr. Ross introduced a triangulation model for correlation among proficiency gains as measured by tests such as the TOEIC, the use of syllabus-based Can-Do statements for student (pre- and post-instruction) self-assessment, and for student post-instruction assessment by teachers. The degree of co-variance between post-instructional gains in proficiency, student Can-Do self-assessments, and the teacher Can-Do assessments indicated higher degrees of consistency between reading & listening proficiency and student confidence than with the teacher's assessments. Furthermore, the use of Can-Do statements opened up areas for diagnostic applications.


Employing qualitative data from surveys to inform development and use of standardized tests was the primary approach taken in two other presentations: one by Mark Chapman (Hokkaido University), and another by Chi-Min-Shih (Chin-yi Institute of Technology, Taiwan). Although both studies used small respondent samples, the results were interesting enough to warrant further research.
In his on-going research into the use of the TOEIC, Chapman presented the results of his survey of a number of EFL teachers and compared the responses to those of students at a Japanese company. Although only 11 teachers and 15 students were in his sample, Chapman's findings showed that use of the TOEIC generated much stronger, though not more frequent, negative opinions among teachers than among students. The student respondents did not report any misgivings about the use of the TOEIC despite its possible impact on their promotions. Both groups felt that the test motivated language learning, particularly in reading.
The study by Chi-Min-Shih was of a standardized proficiency test that is being used by several institutions in Taiwan for screening purposes: the GEPT. Using qualitative data from 29 respondents, she obtained some findings similar to those of Chapman. The majority of those who were polled accepted the consequences of standardized high-stakes testing and welcomed the use of the GEPT, even going so far as to call for more test centers and wider recognition. However, some respondents also expressed a need for improvement in the way the listening and speaking sections of this test are administered. In particular, they wanted more authentic tasks in the speaking section of the GEPT. These and other recommendations indicated that test takers can view the test critically, and can offer valuable insight for test developers and administrators.
At this conference Kiyomi Yoshizawa of Kansai University and Soo im Lee of Ryukoku University also presented a post-hoc item analysis approach to construct validation. Investigating the difficulty of reading comprehension questions, they explored the item characteristics in 40 reading questions from two official TOEIC practice tests. These questions were administered to 381 university students majoring in a variety of subjects. Although it was not stated explicitly, the purpose of the study was not to validate the TOEIC test, but rather to inform in-house test development by using the TOEIC as a model. In a testing culture where the option for pre-testing is minimal if not non-existent, such studies can supply practitioners with valuable insight into improving the fairness of tests that depend heavily on face validity. The researchers coded a variety of test-task characteristics related to the passages and question stems to investigate correlation with IRT-determined levels of item difficulty. The results indicated that the length of a passage and extent of its familiar vocabulary affected perceived item difficulty. By contrast, vocabulary overlap between passage and question stems did not.


Other presentations

There were many other presentations not covered here that were relevant to the theme of a common framework for language testing in Asia. For example, there were several studies presented on the use of 'Can-Do' statements. A couple of presentations concerned the use of on-line programs for assessing, teaching, or studying vocabulary, with implications for individualized language teaching and assessment or autonomous vocabulary learning. In another study, a post-hoc analysis of a writing component in an in-house test concluded that practicality limitations could invalidate test-score interpretation. This report ends with a brief summary of the symposium held on the last day of the conference.


Near the end of the symposium, a panel of scholars from Japan, Korea, and China gave overviews of the social, economic, and cultural setting for EFL education and language testing in their countries. As the messages passed from panelist to panelist, no clear parameters for an overall Asian framework emerged other than the continued pursuit of validation studies to reform the content and use of widely-used tests developed within each country. The policies of reform in each country were moderated somewhat by market forces related to the use of external instruments such as the TOEFL, TOEIC, and IELTS. Continued dialogue among language testers across borders about differences and commonalities, challenges and solutions in testing could eventually lead to a common framework. It seemed to this reporter that the symposium echoed many of the insights expressed in Bernard Spolsky's plenary.

- reviewed by Jeff Hubbell


Council of Europe. (1996). The Common European Framework of Reference for Languages. Retrieved on November 20, 2006 from

