How can teachers assess the effectiveness of a language test in a specific context?
How can the performance of a language test be improved to more effectively achieve
specific program goals? Teachers rely on testing instruments as a matter of routine
classroom practice. However, there are inherent difficulties in the process of measuring
learning progress, which directly affect attempts to assess the effectiveness of language
testing instruments (McNamara, 1996, p. 2). Research is consequently required into
developing effective means for improving language test performance.
The concept of an evaluation cycle has previously been applied to developing language
learning tasks (Breen, 1989), and is usefully extended to apply to the development of
language tests. The language test evaluation cycle discussed in this paper is concerned
with developing testing instruments to better meet the specific needs of local learning
contexts. Essential components of Breen's model which can be directly adapted include
the investigation of test administrations in the classroom, and the measurement of learner
performance against specified test criteria (1989, p. 193). Test specifications play an
important role in this process since they "force explicitness about the design decisions
in the test and . . . allow new revisions to be written in the future" (McNamara, 2000, p.
31). The evaluation process hence first involves reviewing the relationship between test
specifications and the specific objectives of a language program.
Test specifications and program objectives
Test specifications, including formal statements of performance criteria, represent
decisions made during the test design process concerning how to effectively operationalize
theoretical constructs (Bachman & Palmer, 1996, p. 87). It is consequently important to
recognise the significant role that the specifications also play in the evaluation process,
as discussed by McNamara:
even though . . . tests in general performance assessment typically do not
make explicit reference to a theory of the underlying knowledge and ability
displayed in performance, a theoretical position is implicit in the criteria by
which raters are to make judgements. (1996, p. 19)
The specifications should not, however, be regarded as being finally determined in
the test construction stage. Rather, they should be reviewed (with other aspects of the
test) in terms of evaluative feedback on the usefulness of the testing instrument within
a specific context of use (Bachman & Palmer, 1996, p. 87). Significant improvements
can subsequently be made to tests "in the light of their performance and of research
and feedback" (Alderson, Clapham, & Wall, 1995, p. 218). A beneficial evaluation cycle
would consequently involve stages of analysing test performance in the learning context,
devising appropriate revisions, and evaluating the revised test (e.g., see Cohen, 1994,
pp. 101-112). Few teachers are, however, able to undertake complex test evaluation
procedures during the course of their regular teaching programs.
In this paper, a communicative language test is evaluated in order to explore the
types of issues that may be encountered in the evaluation process. Test specifications
are reviewed against design principles and communicative language teaching goals.
Professional judgements are made concerning the value and purpose of various aspects
of the test, with a view to developing an improved testing instrument. The test is revised
in order to address problem areas in the test performance. It is hoped that teachers
can apply similar evaluation cycles to specific learning contexts in order to improve the
performance of testing instruments.
The curriculum framework
The test used in this study was developed in an English language teaching institution
in Australia and administered to adult migrant students. The institution is an authorized
provider of the Adult Migrant Education Program (AMEP), a comprehensive national
English language teaching program developed and administered by the Australian
government to provide English instruction to immigrants. Associated with the language
teaching program is a series of nationally recognised certificates (Certificate in Spoken
and Written English, or CSWE). Each level of the CSWE requires achievement of sets of
discrete competencies. As part of the national curriculum, detailed specifications (Adult
Migrant Education Service, 1995) are provided for each competency. The specifications are
divided into a number of content areas: Elements (essential linguistic features, knowledge
relevant to the content, context requirements), Performance Criteria (statements about
the learner's performance in the language interaction), Range Statements (conditions or
parameters to be associated with the assessment task), Evidence Guides (suggestions
for tasks which could be used to assess the competency), Benchmark Performances of
learners' assessments (accompanied by specific grading information at various levels),
and the Moderation process (assessors participate in moderation sessions for the purpose
of developing expertise in assessment determinations).
Assessment in the CSWE system is criterion-referenced, whereby "individual
performances are evaluated against a verbal description of a satisfactory performance
at a given level" (McNamara, 2000, p. 64). In contrast to a normative system in which
numerical scores are allocated to test results, students' work is instead measured against
the performance criteria provided for each competency. Students are assessed in terms
of whether or not their work demonstrates the performance criteria at an appropriate
standard. If their work successfully demonstrates all the performance criteria, they
achieve the competency and progress to studying another competency on the certificate.
Alternatively, they continue working to achieve the same competency in future classes.
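The criterion-referenced decision rule described above can be sketched in a few lines of code. This is a minimal illustration, not part of the CSWE framework: the criterion names are invented for the example, and a real assessment would of course involve trained rater judgement rather than pre-recorded booleans.

```python
def achieves_competency(ratings: dict[str, bool]) -> bool:
    """A student achieves the competency only if every performance
    criterion is judged to have been demonstrated."""
    return all(ratings.values())

# Illustrative ratings for one student (criterion names are hypothetical)
student = {
    "follows layout conventions": True,
    "stages text appropriately": True,
    "uses appropriate vocabulary": False,  # one criterion not demonstrated
}
print(achieves_competency(student))  # False: the competency is not awarded
```

The all-or-nothing character of the rule is what makes the simple pass/fail ratings discussed later in this paper so consequential: a single marginal judgement on one criterion determines the overall result.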
The CSWE framework incorporates a number of design features aimed at establishing
validity and reliability in tests. Performance ratings are standardized based on samples
of performance benchmarks (provided as part of the curriculum framework) for each
competency, combined with mandatory teacher training in the moderation process (also
as part of the curriculum framework). The CSWE assessment procedure hence involves
comparison of student performances against performance benchmarks which illustrate
appropriate standards for the performance criteria.
The test evaluation process
Students on the AMEP are generally motivated to achieve CSWE certificates
for purposes of future employment and further study. A primary teaching objective is
consequently to assist students to gain their certificates by achieving the required sets
of competencies, and this objective also provides a meaningful purpose to the current
evaluation process. The performance of a test can be beneficially considered in terms
of results achieved in the classroom. Problem areas can be identified and modifications
subsequently developed which aim to improve future test results. The evaluation process
consequently requires deliberation in a number of important areas. Why didn't more
students perform better on certain performance criteria? To what extent was the test
appropriate for the learners and the language program? Were various aspects of the
test valuable towards achieving the program goals? Did the pre-teaching stage achieve
its designated purposes? How representative were the learners of a typical class in the
same course? The test evaluation process also typically considers a range of general
areas relating to task performance: level of difficulty, task clarity, timing, layout, degree
of authenticity, amount of information provided, and familiarity with the task format (Weir, 1993).
Description of the testing instrument
Teaching institutions selected to provide the AMEP generally develop testing
instruments in order to meet both the national curriculum framework and specific
institutional needs. The test selected for this study is an example of a communicative
writing test administered at the upper-intermediate level in a contemporary teaching
program. It was developed in-house at the Adelaide Institute of Technical and Further
Education for assessing Competency 14 of the CSWE III. Competency 14 requires
students to write a short formal letter of about 100 words within a one-hour time period.
Students can use dictionaries and may draft and self-correct the letter as long as a
completed version is submitted at the end of the time period. The testing instrument
describes a situation in which the student had recently purchased a computer which
appeared to have a serious operational fault. Specific information is provided for the type
of computer, the company where the computer was purchased, and the technical fault.
Students are required to write a formal letter to the company explaining the situation
and requesting appropriate action and assistance. A copy of the sample test is provided
in Appendix 1.
Identifying problem areas in test performance
A class of eleven adult migrant students, enrolled on the AMEP and working
towards the level three certificate, was selected for the study. The students
were first instructed concerning the performance criteria and completed some practice
tasks, according to standard teaching practice. The test was then administered, and
the students were assessed against the performance criteria and the benchmark
performances. Each student's work was assessed in terms of whether each criterion was
appropriately demonstrated. Students who demonstrated all the criteria were awarded
the competency. After completing the assessment process, a table was compiled which
listed each student's results (success or fail) against the performance criteria. Totals and
percentages were subsequently calculated which identified the proportion of the class
achieving each criterion. A summary table was produced (see Table 1) in order to provide
a quantitative basis for evaluating the test's performance.
Table 1: Summary of test performance criteria vs. class results (N = 11).

Performance Criteria (Adult Migrant Education Service, 1995):
- follows conventions of layout for formal letter
- stages text appropriately - beginning, middle, and end
- writes paragraphs which clearly express objective information about situations / events
- provides information / supporting evidence to substantiate the claim
- makes a request for specific follow-up action
- uses appropriate conjunctive links e.g., causal, additive, temporal, conditional, as required
- uses appropriate vocabulary to reflect the topic
- uses appropriate politeness / level of formality
- uses grammatical structures appropriately
The summary served to identify which performance criteria caused students some
difficulty. Low percentages of students achieving a criterion were considered indicative of
potential problem areas in the test performance. Three criteria were demonstrated by the
entire class ("follows conventions of layout for formal letter", "stages text appropriately
- beginning, middle, and end", and "makes a request for specific follow-up action"). Another
criterion was demonstrated by most students ("uses appropriate politeness / level of
formality"). Three criteria were demonstrated by many students ("writes paragraphs
which clearly express objective information about situations / events", "uses appropriate
conjunctive links e.g., causal, additive, temporal, conditional, as required", and "uses
grammatical structures appropriately"). Finally, two criteria were demonstrated by just a
few students ("uses appropriate vocabulary to reflect the topic", and "provides information
/ supporting evidence to substantiate the claim").
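The tabulation described above, tallying each student's pass/fail results and computing the proportion of the class achieving each criterion, can be sketched as follows. This is an illustrative sketch only: the criterion names and the three-student class are invented, and the actual study used eleven students and the nine CSWE criteria.

```python
from collections import Counter

def criterion_percentages(results: list[dict[str, bool]]) -> dict[str, float]:
    """Percentage of the class demonstrating each performance criterion."""
    n = len(results)
    tally = Counter()
    for student in results:
        for criterion, demonstrated in student.items():
            if demonstrated:
                tally[criterion] += 1
    return {criterion: 100 * tally[criterion] / n for criterion in results[0]}

# Hypothetical class of three students rated on two criteria
class_results = [
    {"layout": True, "vocabulary": False},
    {"layout": True, "vocabulary": True},
    {"layout": True, "vocabulary": False},
]
pcts = criterion_percentages(class_results)
print(pcts["layout"])  # 100.0 (whole class demonstrated this criterion)
```

Criteria with low percentages, like "vocabulary" in this toy example, are the candidates flagged as potential problem areas in the test's performance.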
Investigating the problem areas
The students' tests were next reviewed in terms of the two performance criteria
which caused major difficulty. Instead of providing supporting evidence to substantiate
the repair claim, it was found that many students simply copied verbatim the description
of the computer fault from the test instructions. They also appeared to lack appropriate
vocabulary resources to be able to describe the situation in some detail. The subject
area of the test was viewed as being a prime cause in these two problem areas, since
most students lacked the technical expertise to discuss the computer problem sufficiently
to substantiate the repair claim.
A difficulty with the measurement process was also identified at this stage of the
evaluation. While the curriculum framework provides for standardized assessments
through a regulated training program, the requirement for simple ratings (success / fail)
of the competency statements was found to be problematic. Accurate determinations of
subtle differences between student performances appeared to require a very high level
of expertise, and it was not clear whether this had been adequately established by the
teacher training process. Furthermore, the significance attached to rating according to
just two categories did not appear to fairly represent the varied range of performance
standards demonstrated during the test. Marginally different performances could result in
significantly different assessment results. However, this problem area is associated with
criterion-referenced assessment in general, rather than with the current study.
Summary evaluation results
The test was practical and efficient in its initial administration. The test specifications,
including the performance criteria, were relevant in determining a useful communicative
writing task for the current group of learners. The test was, however, considered to be
limited in a number of areas that were targeted for subsequent improvement. A major
problem was evident in the presumption of subject knowledge and technical vocabulary
associated with using a computer, and students were mostly unable to achieve two
performance criteria on this account. The task was also made unclear by using an
ambiguous technical term ("backfiles") to describe the computer fault. Furthermore,
informal feedback received directly after the test administration indicated that some
learners disliked using computers (or were at least partially technophobic), and reacted
negatively to the task on this basis alone. Finally, the wording of the task was insufficiently
clear about the requirement to discuss the situation in some detail in order to substantiate
the repair claim. The validity and reliability of the testing instrument were negatively
affected by these problem areas.
Revising the test
The next step in the evaluation cycle is to make revisions in order to address the
problem areas. Firstly, the learning domain should be extended to include relevant
computer terminology prior to the test administration. A general level of familiarity could,
for example, be presumed if the test were sequenced after computer sessions (e.g., word
processing classes, CD-ROM classes) introduced into the curriculum.
A revised testing instrument was also developed (see Appendix 2). Since the students
were recent immigrants to Australia, and usage of personal information is also a content
area in the curriculum, it would be beneficial for students to use their own names and
addresses, and today's date, rather than using an anonymous third person's details. The
test becomes a more realistic writing task (and a more authentic communicative activity)
when personalised in this manner. Also, while students should discuss their feelings in
order to substantiate the repair claim, leading statements ("you are disappointed ...")
should not be used in the instructions. Rather, students should be required to describe
their own emotional response to the situation. The description of the computer fault
("backfiles") should also be revised, since this is unclear from a technical viewpoint. And
the task description should be reworded to explicitly state the requirement for discussing
the situation in some detail in order to substantiate the repair claim. Since one criterion
requires students to provide supporting evidence, this should be made clear in the task description.
Finally, the testing procedure would be improved by collecting feedback from students,
as discussed by Bachman and Palmer: "low-stakes tests can be improved by planning
to use them over an extended period of time and collecting feedback on usefulness
during each operational administration" (1996, p. 246). This point was particularly
evident from the significance of comments made by students directly after the initial test
administration. A survey should be developed for this purpose in order to complement
the current testing procedure.
Conclusion
Communicative language tests can be evaluated in terms of their performance within
specific learning contexts. The evaluation process involves analysing test results in light of
both test specifications and program objectives. The test should subsequently be revised
in order to address any problem areas. The effectiveness of the modifications should then
also be evaluated as part of a continuing test evaluation cycle. In this paper, a sample
language test has been evaluated and a range of modifications developed with a view to
improving the test's performance within a communicative teaching context in Australia.
A number of problem areas were identified during the evaluation process. In each case,
consideration of the value and purpose of various aspects of the test specifications and
program objectives was beneficial to devising the modifications. It is recommended that
language teachers should implement similar test evaluation cycles in order to improve
the performance of communicative testing instruments.
References
Adult Migrant Education Service. (1995). Certificate in spoken and written English III. Sydney: Author.
Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.
Breen, M. (1989). The evaluation cycle for language learning tasks. In R. K. Johnson (Ed.), The second language curriculum (pp. 187-206). Cambridge: Cambridge University Press.
Cohen, A. D. (1994). Assessing language ability in the classroom. Boston: Heinle & Heinle.
McNamara, T. (1996). Measuring second language performance. Harlow: Addison Wesley Longman.
McNamara, T. (2000). Language testing. Oxford: Oxford University Press.
Weir, C. J. (1993). Understanding and developing language tests. London: Prentice Hall.