This is an account of the design and development of a test for use in a single institution, Universiti Kebangsaan Malaysia. The main purpose of the test
was to provide a suitable way to assess the language proficiency of undergraduates
to establish whether they had sufficient English to undertake study in an academic
programme in which some subjects were taught in English. This paper discusses test
development activities, including a description of the purpose of the test and a definition of
its construct. It also outlines the quantitative and qualitative validation procedures used in
developing the test.
Keywords: EAP test development, test validation, English for Academic Purposes, language proficiency testing.
The development of this test was prompted by three reasons. The first
was the perceived English language inadequacy among Universiti Kebangsaan
Malaysia (UKM) law students. Some students were having difficulty in following their
subject areas, despite having passed the English section of the Sijil Pelajaran Malaysia
(SPM) examination, which is equivalent to an O-level paper in the UK.
The second reason for developing this test was that the existing SPM English Language Test seemed inadequate for
assessing the language ability of prospective university students who decided to study
locally, especially in English-intensive faculties such as the Faculty of Law. This is
because the SPM test is an achievement test based on a particular syllabus, one that
neither coincides with the actual needs of the students nor prepares
them for the language demands of study at the university level.
". . . assessment in Malaysia . . . [focuses] on reliability rather than validity."
Thirdly, the positive developments in language testing (especially with the advent of
communicative language testing) do not seem to have had a profound effect on testing
in Malaysia. Language testing in the local setting is still somewhat traditional in nature.
The Malaysian Examinations Syndicate, the centralized examination body in charge of
designing and developing standardized tests nationwide, seems to be rooted in a
traditional testing paradigm, and the tests it develops do not appear to measure certain
features of genuine language use. It seems to have more faith in outdated,
all-embracing models such as the Psychometric-Structuralist model. According to Test
Development in ASEAN Countries (1986), assessment in Malaysia is still
based on Classical Test Theory. Raatz (1985) indicates that the focus of that theory has
always been on reliability rather than validity.
Purpose of the study
The purpose of the study was to design, construct and validate an English language
test for incoming law students at the Faculty of Law of the Universiti Kebangsaan Malaysia.
The main aim of the test was to provide an accurate assessment of the language
ability of the students, to determine whether they had attained adequate ability to
undertake studies in the faculty. The specific aims of the study were twofold: (1)
to investigate the validity of the test in terms of its content validity,
construct validity, concurrent validity and predictive validity, and (2)
to investigate the reliability of the test in terms of its inter-rater reliability.
This test was developed according to procedures recommended by Carroll (1980), Carroll and Hall (1985),
and Weir (1990). These test development stages were adopted because they are currently accepted as
the "best practice" in test development and validation. However, certain specific steps unique to this project were also adopted
in developing this test. Table 1 presents the three main stages used in developing
the test for this study.
Table 1: Test development stages/activities adopted in the study.
Each stage is highlighted briefly below.
Stage 1: Design
At this stage, the main thing that had to be done was to obtain information on the
purpose of the test and the prospective test takers. In this study, the test takers were the
incoming law students at the Faculty of Law, Universiti Kebangsaan Malaysia (UKM).
The test purpose was to provide a measure of the students' proficiency in coping
with studies in law conducted in English. The target language use (TLU) domain was the first year academic setting of the Faculty of Law, UKM.
In view of this, the first step of test development is the identification of language
needs. There are a number of ways in which these communicative needs could be
identified, but for the purpose of this investigation a needs analysis was used. Bachman
and Palmer (1996) have recommended the use of needs analysis in identifying the
language tasks in the relevant domain. Needs analysis is, in fact, the essence
of the development of an EAP test. Robinson (1991, p. 7) maintains, "Needs analysis is
generally regarded as criterial to ESP".
The construct for the proposed test was based on a framework of language use
contexts of an academic setting at the Faculty of Law, Universiti Kebangsaan Malaysia.
This approach to test construct has been adopted by individuals such as Candlin, Burton
and Coleman (1980) in the design of tests to assess the proficiency of overseas-trained
dentists in Britain. It has also been used by Low and Lee (1985) in an attempt to investigate the relationship
between academic performance and second language problems as well as Hughes (1988b) in
assessing language proficiency for academic purposes in a university in Turkey. Testing
organizations such as the Associated Examining Board (AEB) have also used such an approach
to test development in tests such as the TEEP.
One fundamental reason why this model of language ability was adopted
in the present study is that the proposed test was basically an English for Academic
Purposes (EAP) test. An EAP test essentially deals with a tightly-defined situation or
setting (Douglas 2000). Our test was a needs-related test meant for a specific group,
law students. Since the academic setting in the study was very specific, it made more
sense to use a well-defined framework. If the proposed test were meant for general
students from various faculties and disciplines, it would have been much better to use
a general theoretical framework such as Bachman's (1990) or Bachman and Palmer's (1996).
Another reason for adopting this particular construct was based on the notion that
each linguistic situation is unique. Porter (1983, p. 192) argues, "There can't be a single
test of communicative proficiency for all comers. We must test according to an analysis of
the particular needs of a particular group". He expresses the view that the definition of
language ability should be fluid because there are different purposes for using language
tests. Porter (1991, p. 33) adds, "different needs of different learners may call for different
types of language ability", and furthermore, the notion of language proficiency is different in each case.
The third reason why this particular framework of language ability was adopted is
related to the question of whether the present theoretical models can be generalized
beyond the specific testing situation. The models have become rather general and abstract
and are increasingly elaborate and complex as our understanding of language ability
deepens. McNamara states:
. . . attempts to apply a complex framework
for modeling communicative language ability directly in test design have not always
proved easy, mainly because of the complexity of the framework. This has sometimes
resulted in a rather tokenistic acknowledgement of the framework and then disregard
for it at the stage of practical test design. (p. 20)
A fourth reason why a particular theoretical model was not entirely adopted in the
study is that there is still no consensus with regard to the definition of the constructs
to be measured. There is still no overall proficiency model that is universally accepted.
According to Lantolf and Frawley (1988, p. 186), "A review of the recent literature on
proficiency and communicative competence demonstrates quite clearly that there is
nothing even approaching a reasonable and unified theory of proficiency".
In short, it is clear that there is a strong and valid argument for the construct of the
proposed test to be context based. In the study, the identification of the contexts or the
tasks in the target language use domain was done through the use of needs analysis.
Methods of data collection
Generally, needs analysis involves systematic gathering of information about the
language needs of the learners. The needs analysis adopted for the study was based
on Brown's (1995) interpretation that the systematic collection and analysis of relevant
information is necessary to satisfy the language learning needs of the students within
the context of the particular institution.
Figure 1. Data-gathering methods for the proposed test.
The main purpose of the needs analysis was to gather and identify the linguistic
demands of the target language use situation of the first year law students in UKM.
The results of the analysis were then collated and used to draw up the proposed
test specifications. The data accumulated were grouped into modalities or skills. This
was done because of the vast information gathered and also in order to attain a more
manageable test design. Describing test specification is the last activity of the first stage
of test development. In the study, the tasks for the specifications were chosen based on
their importance as decided largely by the law students and subject informants via the
needs analysis. These specifications later served as the test 'blueprint'.
Test specifications for the proposed test
Based on the needs analysis (via questionnaire, interview, observation and document
analysis), it was decided that, in general, reading was needed mainly for reading textbooks and
other written sources in law. Writing was most important for taking notes in lectures and
tutorials and for writing project papers. Speaking, as shown in the needs analysis, was
mainly for face-to-face interviews, asking and answering questions, and for presentation
purposes. Listening, on the other hand, was primarily for listening to lectures and tutorials.
In short, the broad aims of the test were:
Table 2: Test objectives.
- To assess candidates' ability to listen to and understand law lectures and tutorials
- To assess candidates' ability to read and understand law textbooks and other written sources
- To assess candidates' written English for academic writing tasks in Law
- To assess candidates' ability to speak English in order to take part in academic tasks in lectures and tutorials
The second stage of test development is the construction stage. The researcher
constructed the instrument based on the test specifications (see Appendix 1 for test
specifications). Clark (1975, p. 11) has always argued in favor of exact specification of
tasks in terms of language and content and confidently speaks of replicating reality in a
test's setting and operation. He adds, "A major requirement of direct proficiency tests is
that they must provide a very close facsimile or 'work sample' of the real-life language
situations in question, with respect to both the setting and operation of the tests and the
linguistic areas and content which they embody".
As a direct, performance-based test, the proposed test contained tasks that reflected as closely
as possible the kinds of tasks needed in the target language use situation. For practical
purposes, the test was designed based on the modality approach comprising listening,
reading, writing and speaking subtests. However, it must be stressed that the tasks were
not specific to the skills. They were tested integratively. The subtests and the rationale
are described in detail below.
Moderation of the test by subject specialists
After the first draft of the test was generated, the researcher consulted four subject
specialists. These were the law lecturers whose opinions were sought to ascertain the
target language use situations. The specialists had between 5 and 8 years of experience
teaching first year law courses. All of them have at least a master's degree in law;
three of them graduated from universities in the United Kingdom and one is a graduate
of the International Islamic University, Malaysia. The test was then given to the
specialists for moderation purposes, since a validation check by specialists is an important
requirement for any test development. One of the main tasks of the subject specialists
in this study was to ascertain whether the test content represented the kinds of tasks
that the first year students had to undertake. They also had to consider the clarity of
the instructions and whether the time given to the students to complete the test
was sufficient. Weir (1990) has recommended that a test undergo a validation check by
inviting professionals in the field (namely language and subject specialists) to comment
on the suitability of texts, format and items.
On whether the characteristics of the test tasks reflected those of the target language
use situations, the specialists held the view that the test focused on the important aspects
of the target settings. With regard to the clarity of the instructions for the test, all the experts
agreed that they were very clear. Nevertheless, the same vote of confidence could not
be given to the time allowed for the students to complete the test, particularly the
writing section: two of the specialists maintained that more time was needed to
complete the test.
The initial feedback from the specialists provided invaluable information to the
researcher pertaining to several aspects of validity [a priori validation of the test]. The
responses from the content specialists guided the researcher to review the test. This
resulted in some amendments to the test, especially with regard to the reading passages
and the time allowed for test tasks to be carried out.
First pilot test
After the moderation of the test by the specialists and some modifications made to
the instrument, a pilot test was conducted on 17 law students. Immediately after this, a
simple questionnaire was given to the test takers to assess the difficulty of the questions,
the adequacy of the time allotted, the appropriateness of the passages and the clarity
of the instructions.
In the first piloting, 41.2% of the students believed that the listening subtest was 'Very
difficult', the same percentage thought that the subtest was 'Moderately difficult', and
17.6% maintained it was 'Somewhat difficult'. Generally, the majority of the students felt
that the time allotment for the test was quite reasonable. The students indicated that the
instructions for the other subtests were reasonably clear. The clearest set of instructions
appeared to be that for the reading subtest: a great majority of the pilot examinees
(82.3%) felt that the instructions for the reading subtest were 'Highly clear', while 76.5% of the
respondents maintained that the instructions for the writing and the speaking subtests
were 'Highly clear'.
The feedback gathered from the students resulted in some changes made to the test.
Based on their feedback, the instrument was once again revised and improved upon.
The next section discusses the activities in the third stage of the test development
and highlights the validation procedures of the proposed test.
Stage three of the test development began with a second piloting, which was a
full-scale application of the test. The subjects for the second piloting were 85 first year
students from the Faculty of Law, UKM. They represented 90% of the first year students.
The test was also given to four subject and two language specialists to be evaluated for
validation purposes. (The subject specialists were law lecturers from the Faculty of Law,
UKM, and the language specialists consulted were English language teachers with
at least ten years of experience teaching English to first year law students in
UKM; they had also served as coordinators of the English for Law courses for the Faculty of
Law.) A proper a posteriori validation of the test, establishing the test measurement
characteristics, was also done at this stage.
Establishment of the test measurement characteristics
Based on the students' performance in the test and the students' and specialists'
responses to the questionnaires, the measurement characteristics of the proposed test
were then established to ascertain validity and reliability.
To ascertain content validity, Weir (1988, p. 26) recommends:
- Close scrutiny of the test items by experienced professionals; and
- The relating of the specification for the test to the final form of the test.

The content validation of this study was carried out through the following means:
- A systematic and empirical analysis of the target language use setting;
- Qualitative judgment by experts on the content of the test; and
- Moderation of the test by the subject informants.

Each of these is discussed briefly below.
Systematic analysis of the target language use situation
The language requirements of the first year law students in UKM were systematically and
empirically identified and analyzed by the researcher. Establishing the test content, or the
linguistic demands, in the study involved not only the use of questionnaires and interviews
but also document analysis and observation. At the a priori stage of test development,
questionnaires were given to those familiar with the target settings. Those asked to
provide the input included the subject informants and the students themselves. All the
law lecturers of the first year law program and 54 second year students were involved in the study.
Expert judgment on test tasks and test specifications by subject and language specialists
In addition to the systematic gathering of information pertaining to the students'
language tasks in the target setting, the second method adopted in ascertaining the content
validity of the test was the involvement of subject specialists. Hudson and
Lynch (1984b, p. 182) have suggested that the judgments in deciding whether a
test covers a representative sample of the target language use situations are usually
obtained from experts in the field: "That is, the test is examined to discern whether or not
it includes all the sub-skills and elements of the domain and whether or not it is measuring
those sub-skills properly."
In view of this, the researcher consulted subject and language specialists to give
their opinions and to evaluate the content of the test vis-á-vis the test specifications.
These specialists had to be satisfied that, on inspection, the test really measured what
it claimed to measure. Their judgments of test content were based on a comparison of
the test tasks with the test specifications. In evaluating the content of the test, the
experts were given the complete test specifications document, which clearly and precisely
stated the purpose of the test, the language skills and the areas to be tested.
Table 3: Experts' judgments on the extent to which the test tasks covered the areas
being measured as stipulated in the test specifications.
||Subtest ||To a great extent ||To some extent ||To a limited extent ||Not at all
||Listening ||67% (n = 4) ||33% (n = 2) ||0% ||0%
||Reading ||83% (n = 5) ||17% (n = 1) ||0% ||0%
||Writing ||83% (n = 5) ||17% (n = 1) ||0% ||0%
||Speaking ||83% (n = 5) ||17% (n = 1) ||0% ||0%
Based on the table above, it can be deduced that a great majority of the specialists
agreed that the test tasks covered the areas being measured as stipulated in the test
specifications. A total of 83% of them agreed 'To a great extent' that the test tasks in the
reading, writing and speaking subtests covered the areas being measured as specified
in the test specifications, while the remaining 17% chose 'To some extent'. None of the
respondents chose either 'To a limited extent' or 'Not at all'. In short, the responses from
the specialists were highly positive: the majority regarded the test content as highly
reflective of the specifications, which augurs well for the content validity of the test.
Moderation of the test
In addition to the systematic needs analysis and the evaluation of the test by subject and
language specialists to ascertain content validity, the subject specialists were also asked
to moderate the test at an early stage of test development. They were given the test to
scrutinize so as to make sure that its content consisted of tasks that represented the
right elements of the TLU situation. Their feedback was important in improving the test.
In the moderation process, the specialists' reactions to the test were very positive. They
generally believed that the test tasks covered the areas stipulated in the test specifications,
that they reflected the target language use domain, and that the characteristics of the
test tasks reflected those of the target language use situation.
According to Davidson et al. (1985), construct validation usually involves:
- a clear statement of theory,
- an a priori prediction of how the test(s) should behave given the theory; and
- following administration of the test(s), a check of the fit of the test to the theory.
If all three of these facets work well, the test can be said to have construct validity.
It can be inferred from Davidson et al. (1985) that there are basically two views with
regard to establishing construct validity. The first view is concerned with external empirical
data, in which construct validity is viewed from a purely statistical aspect; it is seen as a
matter of a posteriori statistical validation. The other view gives more attention to
non-statistical aspects of construct validity (qualitative analysis). It focuses
more on a priori validation of the test, seeing the importance of construct validation at
the a priori stage of test development.
". . . external empirical data . . . [is] necessary but not sufficient to establish construct validity."
The view adopted by the study was that the concern for external empirical data was
seen as necessary but not sufficient to establish construct validity. There was an equally
important need for construct validation at the a priori stage. As such, both a priori
and a posteriori construct validation should exist.
According to Weir (1990, p. 23), "To establish the construct validity of a test statistically,
it is necessary to show that it correlates highly with the indices of behavior that one might
theoretically expect it to correlate with and also that it does not correlate significantly with
variables that one would not expect it to correlate with". One of the quantitative methods
used in the study to ascertain construct validity was to correlate the different subtest scores
with each other using the Pearson Product Moment Correlation formula. The main aim
was to find out whether the different subtests really tested different skills: if the subtests did
not correlate highly, this suggested that they were testing different skills, since one reason for
having different test components is that they each measure something different. The correlations
were expected to be fairly low, possibly in the order of +.3 to +.5. In the study by Fok
(1981), the correlations between the students' self-assessments and the test were found
to be at 0.3. A high correlation between subtests (+.9) would show that the subtests were
testing the same facet. The second statistical procedure used in the study to
ascertain construct validity was to correlate each subtest with the overall test score. The
third quantitative method employed to measure construct validity was to correlate
each subtest with the total test score minus the subtest itself.
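The two correlational checks described above can be sketched in a few lines of Python. The subtest scores below are invented purely for illustration and are not the study's data; only the Pearson Product Moment formula and the 'total minus self' computation follow the procedures named in the text.

```python
"""Sketch of the construct-validity correlation procedures.
The score lists are hypothetical, not the study's actual data."""
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical subtest scores for five candidates.
scores = {
    "listening": [12, 15, 9, 14, 11],
    "reading":   [18, 11, 14, 10, 16],
    "writing":   [13, 12, 15, 11, 14],
    "speaking":  [10, 16, 12, 15, 9],
}

# 1. Inter-subtest correlations: low values (roughly +.3 to +.5
#    or below) suggest the subtests tap different skills.
names = list(scores)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"{a} vs {b}: {pearson(scores[a], scores[b]):+.2f}")

# 2. Corrected subtest-total correlation: each subtest against
#    the total with that subtest removed ("total minus self").
totals = [sum(col) for col in zip(*scores.values())]
for name, col in scores.items():
    rest = [t - s for t, s in zip(totals, col)]
    print(f"{name} vs total-minus-self: {pearson(col, rest):+.2f}")
```

Computing the subtest-total correlation against the total minus the subtest itself avoids inflating the coefficient, since otherwise each subtest would partly be correlated with its own contribution to the total.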
Correlating each subtest with other subtests
Table 4: Inter-correlation coefficients among the subtests.
** Correlation significant at 0.01 level.
The data in Table 4 indicate that the correlation between the reading and writing
subtests was very low (0.10). A correlation of 0.02 between the speaking subtest
and the reading subtest also suggested that these two tests operated differently. It is apparent
that the correlations between the subtests were very low. These low correlations clearly
indicate that the subtests were measuring different kinds of demands. The low correlations
seen in Table 4 augur well for the construct validity of the proposed test.