Curriculum Innovation, Testing and Evaluation: Proceedings of the 1st Annual JALT Pan-SIG Conference.
May 11-12, 2002. Kyoto, Japan: Kyoto Institute of Technology.

## Developing an EAP test for undergraduates at a national university in Malaysia: Meeting the challenges (Part 2 of 2)

by Mohd. Sallehhudin Abd Aziz (Universiti Kebangsaan Malaysia)

Correlating each sub-test with total scores

In addition to correlating the sub-tests with each other, the sub-tests were also correlated with the test total scores. The findings are illustrated in the table below.

Table 5: Correlation coefficients between sub-tests and overall test scores.

| Sub-test | Overall test scores |
|-----------|---------------------|
| Listening | 0.79** |
| Reading | 0.53** |
| Writing | 0.54** |
| Speaking | 0.73** |
** Correlation significant at 0.01 level.

Table 5 illustrates the correlations between the sub-tests and the overall test. All of the correlations were positive and fairly high, which is expected because the total score is a general measure of language ability rather than an individual component score. The highest correlation (0.79) was between the listening sub-test and the overall scores; the speaking sub-test also correlated highly with the overall test (0.73). Although the correlations for the reading and writing sub-tests were lower than those for listening and speaking, they were still fairly high. In a comparable study by Alderson, Wall and Clapham (1986), the correlations between the sub-tests and the total scores ranged from 0.66 to 0.72. It can be deduced from these data that there is a strong correlation between the sub-tests and the overall test scores, which is a good indication of the construct validity of the proposed test.
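The sub-test/total correlation procedure reported above can be sketched with the standard Pearson product-moment formula. The scores below are hypothetical illustrations, not the study's data:

```python
# Sketch: correlating each sub-test with the total score.
# Hypothetical scores for five candidates (not the study's data).
import numpy as np

scores = {
    "listening": np.array([12, 15, 9, 18, 14]),
    "reading":   np.array([10, 13, 8, 16, 11]),
    "writing":   np.array([11, 14, 7, 17, 12]),
    "speaking":  np.array([13, 16, 10, 19, 15]),
}
total = sum(scores.values())  # total score = sum of the four sub-tests

for name, sub in scores.items():
    # Pearson product-moment correlation between sub-test and total
    r = np.corrcoef(sub, total)[0, 1]
    print(f"{name}: r = {r:.2f}")
```

Because each sub-test contributes to the total, these correlations tend to run high, which is why the total-minus-self procedure in the next section is also needed.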

Correlating each sub-test with total scores minus self

Another procedure for ascertaining construct validity is to correlate each sub-test with the total score minus self, that is, the total of the other sub-tests with the sub-test's own contribution removed (Alderson et al., 1995). This was also done in this study, and Table 6 shows the correlations. These were expected to be lower than those in Table 5 because each sub-test's own score no longer contributes to the total. The highest correlation was with the listening sub-test (0.49), followed by the speaking sub-test (0.43). This suggests that these two sub-tests have a reasonably strong effect on the overall total scores.

Table 6: Correlations between sub-tests and overall test scores (minus self).

| Sub-test | Total minus self |
|-----------|------------------|
| Listening | 0.49** |
| Reading | 0.28* |
| Writing | 0.23* |
| Speaking | 0.43** |
* Correlation significant at the 0.05 level
** Correlation significant at 0.01 level.
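The total-minus-self procedure can be sketched as follows; again the scores are hypothetical, not the study's data:

```python
# Sketch of the "total minus self" procedure (Alderson et al., 1995):
# each sub-test is correlated with the total of the OTHER sub-tests,
# i.e. the overall score with its own contribution removed.
# Hypothetical scores for five candidates (not the study's data).
import numpy as np

scores = {
    "listening": np.array([12, 15, 9, 18, 14]),
    "reading":   np.array([10, 13, 8, 16, 11]),
    "writing":   np.array([11, 14, 7, 17, 12]),
    "speaking":  np.array([13, 16, 10, 19, 15]),
}
total = sum(scores.values())

for name, sub in scores.items():
    rest = total - sub  # total with this sub-test's own score removed
    r = np.corrcoef(sub, rest)[0, 1]
    print(f"{name} vs total-minus-self: r = {r:.2f}")
```

Removing the sub-test's own contribution eliminates the part-whole overlap, which is why these coefficients come out lower than those in Table 5.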

[ p. 142 ]

In short, the data gathered from the statistical analyses has provided ample evidence of the construct validity of this test. Now let us examine this test from a qualitative perspective.

Qualitative analysis

In this study, construct validation was also carried out by a priori, non-statistical methods. According to Bachman and Palmer (1996, p. 96), in determining the construct validity of a test whose purpose is to make inferences about students' future performance, "it is essential to engage the expertise of a subject matter specialist in the design and the development of the language test". With this view in mind, six specialists were asked to judge the construct of the test. The specialists were explicitly informed of the construct of the proposed test and, after inspecting the test, were asked to make judgments about its construct validity. This constitutes a priori validation of the test.
This non-statistical, judgmental approach has been used by researchers such as Popham (1978, 1981), Hudson and Lynch (1984), Davidson et al. (1985), Wong (1992) and Alderson et al. (1995). Although this procedure for investigating construct validity is subjective in nature, it is now widely accepted. Ebel (1979, p. 304) argues, "one should never apologise for having to exercise judgment in validating a test. Data never substitute for good judgment".
This pragmatic approach to construct validation was adopted in the study. The main reason qualitative judgment was used alongside the quantitative data is that the test is basically a performance test, which is better suited to this kind of analysis. The non-statistical aspect of validation was ensured by constructing the test specifications so that they closely adhered to the candidates' future language use.
How did the experts rate this test? A great majority (83%) believed that the language ability construct of the test was 'Very clearly defined'; only 17% maintained that it was 'Moderately defined'. An average of 67% believed that all the sub-tests of listening, reading, writing and speaking clearly reflected the target language use setting, while 33% felt that the test task was 'Moderately reflective' of the target language use situation. None of the respondents chose the 'Somewhat reflective' or 'Not reflective' categories. Though only six experts were involved in this evaluation, the majority did feel the test task clearly reflected the target language use situation. The evidence gathered so far supports the construct validity of this test.

[ p. 143 ]

Concurrent validity

This type of validity involves collecting other external measures at the same time as the experimental test, or possibly at a later time. The first method used in the study to measure concurrent validity was to correlate the scores of the incoming students on the proposed test with their scores on other tests, namely the SPM English 1322 and the SPM English 1119 (both widely used tests in Malaysia), which the subjects had taken prior to entering the university.
The second way of determining criterion-related validity was to gather the opinions of those in regular contact with law students; in this study, the most appropriate persons were the law and language instructors. They were asked to rate the students' language ability so that the test scores could be compared with the teachers' measures of the students' ability. In short, to determine the concurrent validity of the instrument, the students' scores on the proposed test were correlated with other measures taken concurrently.
The third method, used to complement the first set of data gathered for concurrent validity, was self-assessment. The students were asked to rate their own language ability and assess their general language proficiency, and their test scores were later matched with these self-assessments. This method of determining concurrent validity was also used by Wall et al. (1991), who showed a reasonably close relationship between the self-assessments and the NCE scores. Criper and Davies (1988, p. 52) found a 0.39 correlation between students' self-assessments and test scores, while Wall et al. (1994) obtained correlations of 0.30, 0.41 and 0.51 for writing, reading and listening respectively.
Specifically the external criteria used for establishing concurrent validity in this study were:
1. The respondents' SPM English 1322 test scores;
2. The respondents' SPM English 1119 test results;
3. Language teachers' judgments of students' general language proficiency;
4. Subject specialists' judgments of students' general language proficiency; and
5. Students' judgments of their own general language proficiency.
Correlations with the SPM 1322 English language results

The correlations between this test (the overall test scores and the sub-test scores) and the SPM English 1322 test are given in Table 7. The figures show that the correlation between the test and the SPM English test was quite high (0.62). This is not surprising because both tests have the same purpose, assessing students' language ability, and both assess all the language skills, although the skills are not explicitly demarcated in the SPM English 1322 test, where they are tested integratively.

Table 7: Correlations between overall test and sub-test scores with SPM English language test results.

| | SPM English 1322 |
|---|---|
| Overall test | 0.62** |
| Listening sub-test | 0.39** |
| Reading sub-test | 0.23 |
| Writing sub-test | 0.31** |
| Speaking sub-test | 0.66** |
** Correlation significant at 0.01 level.

[ p. 144 ]

In validating the Bogazici English Language Test, Hughes (1988) correlated the total scores of that test with those of the Michigan Test of English Language Proficiency, obtaining correlation coefficients from 0.70 to 0.84. These correlations were quite high considering that the purposes of the two tests were similar but the methods of testing were quite different. Even though the overall test correlated relatively highly with the SPM English 1322 test, this was not the case with the sub-tests. The low correlations, however, might be explained by the different aspects of language being tested. The correlation between the writing sub-test and the SPM English 1322 test was just 0.31, a figure which does not differ much from Wong's (1992) 0.27 correlation between a writing test and the SPM English 1322 test. To recap, the overall test and the SPM English 1322 were fairly highly correlated (0.62). This statistical evidence clearly provides some support for the concurrent validity of this test.

Correlations with SPM English 1119 test

Table 8: Correlations between overall and sub-test scores and SPM English 1119 test results.

| | SPM English 1119 |
|---|---|
| Overall test | 0.42** |
| Listening test | 0.08 |
| Reading test | 0.13 |
| Writing test | 0.23 |
| Speaking test | 0.60** |
** Correlation significant at 0.01 level. (n = 27)

As presented in the table above, the correlation between the test scores and the SPM English 1119 test scores is significant at the 0.01 level. Nonetheless, it is a comparatively low correlation of 0.42, somewhat below the correlation coefficient for the SPM English 1322. A likely reason for this difference is that slightly different aspects of performance were measured: the SPM English 1119 test is a more literature-based reading and writing test designed for students planning to study in the United Kingdom, whereas the SPM English 1322 test is a more general test for a wider group of candidates, based on the national Malaysian KBSM syllabus.

Correlations with language instructors' judgments

The language instructors assessed the candidates' general proficiency and their ability in listening, reading, writing, and speaking on a five-point scale. The scale was designed to elicit opinions on the students' overall language ability as well as their individual listening, reading, writing and speaking skills. Table 9 presents the correlations between the students' test scores and the language instructors' assessments.

[ p. 146 ]

Table 9: Correlations between language instructors' judgments and students' total and sub-test test scores.

| | Language instructors' assessments of students' general proficiency |
|---|---|
| Overall score | 0.75** |
| Listening sub-test | 0.75** |
| Reading sub-test | 0.43** |
| Writing sub-test | 0.26 |
| Speaking sub-test | 0.49** |
** Correlation significant at 0.01 level

The results indicate that there was a close relationship (0.75) between the language instructors' global assessments of the candidates' language ability and the students' total scores on the test. The correlation between the test and the instructors' judgments was higher than the correlations between the test and the SPM English language results. This is not unusual considering that the language instructors were closer to the students: they knew their students well and saw them frequently, so they could judge their students' English language proficiency with more accuracy.
The correlations between the instructors' judgments and the reading, writing and speaking sub-tests were rather low, ranging from 0.26 to 0.49 (Table 9). Criper and Davies (1988, p. 63), in the Edinburgh ELTS Validation Project, also obtained low figures (0.35-0.43) between language tutors' judgments of individual skills and the overall test. Unexpectedly, the language instructors' judgments corresponded most closely to the listening sub-test, with a correlation of 0.75. This is surprising because, unlike reading, writing and speaking, listening is a skill the instructors could not assess directly, so they might have been expected to have difficulty judging the students' ability in that skill area.
In summary, the correlation between the overall test scores and the language teachers' judgments of the students' general proficiency is strong; in fact, it is higher than the correlations with the SPM English 1119 and 1322 tests. This speaks well for the concurrent validity of the test.

Correlations with subject specialists' judgments

*NUMBER* specialists were asked, in the second half of the semester, for their judgments of the students' overall language proficiency and their individual skills. These judgments were important because they could indicate whether the test was measuring the students' language ability to cope with the academic requirements of the course. Table 10 shows the correlations between the subject specialists' judgments and the students' scores on the test.

[ p. 145 ]

Table 10: Correlations between subject specialists' judgments and students' total and sub-test test scores.

| | Subject specialists' assessments of students' general proficiency |
|---|---|
| Overall test | 0.61** |
| Listening sub-test | 0.52** |
| Reading sub-test | 0.21 |
| Writing sub-test | 0.19 |
| Speaking sub-test | 0.60** |
** Correlation significant at 0.01 level

As expected, the figures show a reasonably high correlation (0.61) between the test as a whole and the subject specialists' global assessments of the students' general proficiency. Even though the correlations were somewhat high, they were consistently lower than those for the language instructors' assessments. Weir (1988, p. 68), in validating the TEEP, reported similar findings: "it does seem likely that the English language teachers cooperating in the concurrent validity study were . . . better able to assess a student's language proficiency than subject tutors".

Correlations with students' self-assessments

Table 11 shows the correlations between the students' overall and sub-test scores and their assessments of their own general English proficiency.

Table 11: Correlations between students' self-assessments and their total and sub-test scores on the test.

| | Student self-assessments of general proficiency |
|---|---|
| Overall test | 0.68** |
| Listening | 0.64** |
| Reading | 0.37** |
| Writing | 0.23 |
| Speaking | 0.48** |
** Correlation significant at 0.01 level

It is apparent that the correlation between the overall test scores and the students' assessments of their own general proficiency (0.68) was higher than the correlations for the sub-tests; this was not unexpected. Apart from listening, the sub-test correlations were unexceptional, ranging from 0.23 to 0.48; the greatest congruence was with the listening sub-test (0.64). One factor influencing the magnitude of the correlation coefficients may have been the students' difficulty in evaluating their own skills accurately.

[ p. 146 ]

Based on the quantitative and the qualitative data gathered, there was strong evidence of the concurrent validity of this test.

Predictive validity

This involves gathering subsequent external measures that can indicate how well the test predicts the intended performance. According to Hughes (1989, p. 25), predictive validity "concerns the degree to which a test can predict candidates' future performance". It is normally an a posteriori investigation. Traditionally, test developers have asked language students, teachers, and subject specialists to make global assessments of the students' abilities (Davies, 1965; Moller, 1982; Wong, 1992).
The external measures used in the study were:
1. End-of-semester VB English for Law scores
• via the Pearson Product Moment formula
• via linear regression
2. Language instructors' judgments of the students' academic performance
3. Content specialists' judgments of the students' academic performance
4. Students' own judgments of their academic ability

Correlations with VB English for law scores

Table 12 summarizes the correlations between the overall test and sub-test scores and the English for Law results. The overall test had the highest correlation (0.88) of all the external criteria discussed so far.

Table 12: Correlations between the overall test score and sub-test scores and English for Law test scores.

| | English for Law test scores |
|---|---|
| Overall test scores | 0.88** |
| Listening sub-test | 0.73** |
| Reading sub-test | 0.39** |
| Writing sub-test | 0.53** |
| Speaking sub-test | 0.62** |
** Correlation significant at 0.01 level

The strong correlations between the two tests indicate that they are measuring the same construct: both attempt to measure students' language ability.

[ p. 148 ]

Table 13: Regression analyses between the total and sub-test scores and the English for Law Test results.

| | R | R² | Adjusted R² | SEE |
|---|---|---|---|---|
| Overall test | .885 | .784 | .781 | 8.2011 |
| Listening sub-test | .733 | .537 | .532 | 11.9917 |
| Reading sub-test | .397 | .157 | .147 | 16.186 |
| Writing sub-test | .868 | .753 | .747 | 8.8084 |
| Speaking sub-test | .894 | .800 | .793 | 7.9835 |
NOTE: The dependent variable is the English for Law Test scores (n = 85)

The data from the regression analysis suggest that the listening, writing and speaking sub-tests are good predictors of the students' ability.
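The regression step reported in Table 13 can be sketched as a simple least-squares fit of the criterion scores on a sub-test, with R² (the proportion of variance explained) computed from the residuals. The numbers below are hypothetical, not the study's data:

```python
# Sketch: simple linear regression of English for Law scores on one
# sub-test, reporting R^2. Hypothetical scores (not the study's data).
import numpy as np

subtest = np.array([12.0, 15.0, 9.0, 18.0, 14.0, 11.0])
law     = np.array([55.0, 68.0, 40.0, 80.0, 63.0, 50.0])

# Least-squares fit: law ~ slope * subtest + intercept
slope, intercept = np.polyfit(subtest, law, 1)
predicted = slope * subtest + intercept

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((law - predicted) ** 2)
ss_tot = np.sum((law - np.mean(law)) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.3f}")
```

For simple regression with one predictor, this R² equals the square of the Pearson correlation between the two score sets, which is why Table 13's R column matches the square roots of the R² column.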

Correlations with language instructors' assessments of students' academic performance

Table 14: Correlations between overall scores and sub-test scores and language instructors' assessments of students' academic performance.

| | Language instructors' assessments of students' academic potential |
|---|---|
| Overall test scores | 0.80** |
| Listening sub-test | 0.76** |
| Reading sub-test | 0.47** |
| Writing sub-test | 0.39** |
| Speaking sub-test | 0.48** |
** Correlation significant at 0.01 level

As indicated above, the correlation between the overall test scores and the language instructors' assessments of students' academic potential was 0.80. The strong correlation between these two measures provides some evidence of the predictive validity of the test. As expected, the sub-tests generally did not correlate closely with the language instructors' estimates.

Correlations with subject specialists' assessments of students' academic performance

As shown in Table 15, the correlation between the test and the subject specialists' estimates is also significant at the 0.01 level. Nonetheless, it must be pointed out that the correlation was comparatively moderate at 0.44, a stark contrast with the correlation obtained from the language instructors' estimates. One possible explanation for the low correlation is lenient, and therefore inaccurate, assessments by the subject specialists, who perhaps did not want to rank the students very low or fail them.

[ p. 149 ]

Table 15: Correlations between overall and sub-test scores and subject specialists' assessments of students' academic performance.

| | Subject specialists' assessments of students' academic performance |
|---|---|
| Overall test scores | 0.44** |
| Listening sub-test | 0.39** |
| Reading sub-test | 0.29** |
| Writing sub-test | 0.06 |
| Speaking sub-test | 0.37 |
** Correlation significant at 0.01 level

Correlations with students' self-assessments of academic potential

As presented in Table 16, the correlation between the overall test scores and the students' own assessments is higher than that for the subject specialists' assessments (0.51 compared to 0.44). One interesting finding is that the test scores were more closely related to the language instructors' estimates of academic potential than to those of the subject specialists or the students themselves.

Table 16: Correlations between the overall scores and sub-test scores and students' own assessments of their academic potential.

| | Student self-assessments of academic potential |
|---|---|
| Overall test scores | 0.51** |
| Listening sub-test | 0.43** |
| Reading sub-test | 0.22* |
| Writing sub-test | 0.24* |
| Speaking sub-test | 0.39* |
* Correlation significant at 0.05 level
** Correlation significant at 0.01 level

One might have expected the correlation to be higher for the subject specialists' estimates, since they were aware of the standards and expectations of the courses they taught at the university and could therefore estimate with some confidence whether a student would pass or fail. As mentioned earlier, one reason for the moderate correlation was the seemingly lenient assessments given by the specialists.
As expected, the correlations between the sub-tests and the students' estimates of their academic potential showed no surprises: they were all generally low, the lowest being with the reading sub-test (0.22).
All in all, the figures establish that the test correlates reasonably well with the estimates given by the language instructors, the subject specialists and the students. There is a strong correlation (0.80) between the overall test scores and the language instructors' assessments of the students' academic potential, and the correlation with the subject specialists' estimates is also significant, albeit lower. The data gathered indicate strong evidence of the predictive validity of the test.

[ p. 150 ]

Reliability

Reliability is another important characteristic of a good test. Weir (1990, p. 32) suggests that there are basically three procedures that can be used to estimate reliability: i) test-retest reliability, ii) inter-rater reliability and iii) internal consistency measures. This study focused only on inter-rater reliability, which was established by correlating the scores that the candidates obtained from the different markers, using the Pearson product moment correlation formula.

Overall inter-rater reliability

Table 17 shows very strong inter-rater reliability coefficients. All the raters were almost equally consistent with each other.

Table 17: Inter-rater reliability coefficients of total test scores.

| | Rater A | Rater B | Rater C | Rater D |
|---|---|---|---|---|
| Rater A | – | 0.936** | 0.966** | 0.953** |
| Rater B | 0.936** | – | 0.934** | 0.879** |
| Rater C | 0.966** | 0.934** | – | 0.953** |
| Rater D | 0.953** | 0.879** | 0.953** | – |
** Correlation significant at 0.01 level

In short, the inter-rater correlations for this test were very high. These high correlation coefficients are good indicators of the reliability of the test.
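The inter-rater check above, pairwise Pearson correlations between every two raters' total scores, can be sketched as follows. The ratings are hypothetical, not the study's data:

```python
# Sketch: inter-rater reliability as pairwise Pearson correlations
# between raters' total scores. Hypothetical ratings for six candidates
# (not the study's data).
import numpy as np
from itertools import combinations

ratings = {
    "A": np.array([70, 55, 82, 64, 48, 75]),
    "B": np.array([68, 57, 80, 66, 50, 73]),
    "C": np.array([72, 54, 84, 63, 47, 76]),
    "D": np.array([69, 58, 79, 65, 52, 72]),
}

# One coefficient per unordered pair of raters, as in Table 17
for r1, r2 in combinations(ratings, 2):
    r = np.corrcoef(ratings[r1], ratings[r2])[0, 1]
    print(f"Rater {r1} vs Rater {r2}: r = {r:.3f}")
```

A full matrix like Table 17 is just these six coefficients mirrored across the diagonal, with the diagonal itself left blank (a rater correlates perfectly with itself).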

Conclusion

This study is probably one of the first attempts at designing, constructing and validating an English language test for incoming law students. Even though it has some limitations, the study has nonetheless provided solid empirical evidence about the performance of the test. The satisfactory reliability and validity evidence gathered indicates that this is a reasonably good English language test for academic purposes and that it can stand on its own. It could be used as a model test, or as an instrument for deciding whether incoming law undergraduates have the necessary language skills to pursue studies in law at UKM.

References

Alderson, J. C., Clapham, C. & Wall, D. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.

Alderson, J. C., Wall, D. & Clapham, C. M. (1986). An evaluation of the National Certificate in English. Lancaster, England: Lancaster University Centre for Research in Language Education.

Anastasi, A. (1982). Psychological testing. London: Macmillan.

Bachman, L. F. & Palmer, A. S. (1981a). Basic concerns in test validation. In J.A.S. Read (Ed.), Directions in language testing, (pp.41-57). Anthology Series 9. Singapore: SEAMEO Regional Language Center.

[ p. 151 ]

Bachman, L.F. & Palmer, A.S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford: Oxford University Press.

Brown, A. (1991). The role of test-taker feedback in the validation of a language proficiency test. Unpublished M.A. thesis. University of Melbourne.

Castillo, E. S. (1990). Validation of the RELC test of proficiency in English for academic purposes. RELC Journal, 21, 2 (PAGES???).

Criper, C. & Davies, A. (1988). The ELTS Validation Project Report. ELTS Research Report, 1 (I) Cambridge: UCLES/ The British Council.

Davidson, J., Hudson, T., & Lynch, B. (1985). Language testing: Operationalization in classroom measurement and L2 research. In M. Celce-Murcia (ed.). Beyond basics: Issues and research in TESOL (pp. 137-152). Massachusetts, USA: Newbury House Publishers, Inc.

Davies, A. (1965). Proficiency in English as a second language. Unpublished. Ph.D. Thesis. University of Birmingham.

Douglas, D. (2000). Assessing language for specific purposes. Cambridge: Cambridge University Press.

Ebel, R. L. & Frisbie, D. A. (1991). Essentials of educational measurement. Englewood Cliffs, NJ: Prentice Hall.

Fok, A. C. Y. (1981). Reliability of student self-assessment. Paper presented at the BAAL Seminar on Language Testing, Reading, England.

Hudson, T. & Lynch, B. (1984). A criterion-referenced measurement approach to ESL achievement testing. Language Testing, 1 (2), 171-201.

Hughes, A. (1988a). Achievement and proficiency: the missing link. In A. Hughes (Ed.), Testing English for university study, (pp. 36-42). Oxford: Modern English Publications and the British Council.

Hughes, A. (1988b). Introducing a needs-based test of English language proficiency into an English-medium university in Turkey. In A. Hughes (Ed.), Testing English for university study (pp. 134-153). Oxford: Modern English Publications and the British Council.

Hughes, A. (1989). Testing for language teachers. Cambridge: Cambridge University Press.

McNamara, T. (2000). Language testing. Oxford: Oxford University Press.

Moller, A. D. (1982). A study in the validation of proficiency tests of English as a foreign language. Unpublished Ph.D. thesis, University of Edinburgh.

National Development Education Center of Thailand. (1986). Test development in ASEAN countries: A synthesis report of the test development sub-project. Bangkok: Thai Ministry of University Affairs.

Oskarsson, M. (1978). Approaches to self-assessment in foreign language learning. Oxford: Pergamon Press.

Popham, W.J. (1981). Modern educational measurement. Englewood Cliffs, N.J.: Prentice-Hall.

Popham, W.J. (1978). Criterion-referenced measurement. Englewood Cliffs, N.J.: Prentice-Hall.

Raatz, U. (1985). Better theory for better tests? Language Testing, 2 (1), 61-75.

Weir, C.J. (1988a). The specification, realization and validation of an English language proficiency test. In A. Hughes (Ed.), Testing English for university study (pp. 46-110). Oxford: Modern English Publications and the British Council.

Weir, C. J. (1988b). Communicative language testing. Exeter linguistic studies, 11. Exeter: University of Exeter.

Weir, C. J. (1990). Communicative language testing. New York: Prentice Hall.

Wong, H. (1992). A test of writing for academic purposes for TESL undergraduates. Unpublished Ph.D. thesis. Pusat Bahasa, Universiti Kebangsaan Malaysia.


[ p. 152 ]