The Faculty of Letters at Keio University aims primarily to improve students' reading ability in order to further enhance their learning. For that purpose,
a placement test is needed to place students accurately at their appropriate proficiency levels, to optimize their learning experiences, and to
provide multi-faceted communicative English instruction. The purpose of this placement test is to measure the incoming 2006 students' English reading
ability and English proficiency in order to provide streamed instruction.
The goals of this project are threefold:
- to offer four levels of EFL classes to students according to their English reading ability as ascertained by the method below,
- to offer classes for those who, according to the method below, need remedial instruction, and
- to offer classes for those who have already reached the required level and desire further study.
Reading ability is thought to consist of grammar knowledge, vocabulary knowledge, the ability to comprehend long passages with full context (in other
words, text material that has no deleted words or blanks intended for other questions), and the ability to understand passages without sufficient
context or information. In other words, a test of reading ability should be composed of a grammar test, a vocabulary test, a reading comprehension test,
and a cloze test.
Although commercial tests such as the TOEFL-ITP, TOEIC-IP, G-TELP, Step-EIKEN and CASEC exist, the faculty members agreed that the
content, level and purpose of those tests were not appropriate for placing students in the literature department. Furthermore, the admissions test
could not be used for any purpose other than entrance examination selection. Admission tests are basically used for screening purposes only. Because
a variety of admission tests are conducted these days (admissions office tests, interview tests, the center test, and the high-school
recommendation test), the so-called admission tests do not seem to function well for placement purposes. In addition, people are so concerned about
the privacy and security of admission test scores that it seems extremely difficult to use the admission test results for streaming.
Tests can be valid or not depending on whether they agree with the purpose of the test users. The purpose of the aforementioned tests does not seem to
fit the purpose of the Faculty of Letters of Keio University. For example, we are not solely intent on measuring students who will study overseas, or
assessing the skills of students who will start business communication after graduation. The purpose of this project is to encourage students to develop
their English reading ability, which is indispensable for study in their major areas. Almost all the students in the Faculty of Letters are required to
read materials in English whether their major is English or not. For these reasons, we have decided to develop our own placement test.
Commenting on this point, Westrick suggests:
More studies on the use of commercially-produced tests and in-house tests for placement purposes at other Japanese colleges and universities are needed.
Creating an effective placement test involves developing test items related to a true curriculum with clear goals and objectives, piloting the tests
items, analyzing the data, and revising the tests to ensure that the scores are reliable and sound placement decisions can be made. This requires hard
work, but it must be done if fair and defensible placement decisions are to be made. (2005, p.90)
Furthermore, a number of other scholars take a similar stance on placement tests. Brown (1996, p. 12) says that a placement test must be specifically
related to a given program. Hughes (2003, p. 16) claims that placement tests should be developed by the users themselves so that they specifically meet
their needs. Fulcher argues:
The goal of placement testing is to reduce to an absolute minimum the number of students who may face
problems or even fail their academic degrees because of poor language ability or study skills. (1997, p. 113)
Purpose of this study
The purpose of the present study is to examine the pilot version of a placement test and decide whether the real version of the test should have the same format.
McNamara (2000, p. 83) states, "There are three basic critical dimensions of tests – validity, reliability, and feasibility, whose demands need to be
balanced." McNamara (2000, pp. 50-51) also mentions three aspects that can threaten test validity: (1) test content, (2) test method and (3) test construct.
Taking these facets of a test into consideration, this study seeks to examine whether the pilot version of this particular placement test has enough
validity, reliability and practicality to merit further implementation. This overall question gives rise to the following hypotheses:
Hypothesis 1: The test does not have enough validity.
From a Rasch perspective, validity denotes the degree to which observed research results fit a given model. Construct validity in the Rasch model is
investigated through five steps:
(1) chi-square examination,
(2) FitResidual examination,
(3) location examination,
(4) item characteristic curve examination, and
(5) targeting information.
Among these, item analysis using the item characteristic curve (ICC) is the main focus of the present research because it can contribute
greatly to improving the revised test. The ICC shows how well the observed item curve fits the model. In other words, it provides
information about construct validity. It also indicates whether the item discriminates well among the students. Along with the ICC,
distracter information will be discussed as well.
Also, the content validity of this test will be discussed in a non-statistical way. A test is said to have content validity if the questions reflect
the course content or syllabus. A test is said to have face validity if the test stakeholders think that the test is measuring what it should.
In the discussion of content validity, the test construct and the test method are additionally discussed. The test construct will be discussed in
terms of the construct of the difficulty order of the subsections. The test method discussion will focus on how the test was planned, administered
and scored. The face validity will be investigated through examinee questionnaire results.
Hypothesis 2: The test does not have acceptable reliability.
Reliability is investigated by the person separation index, which is equivalent to Cronbach's alpha. A widely accepted benchmark for the person
separation index is 0.7 or more.
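As a rough illustration of the reliability benchmark, the sketch below computes Cronbach's alpha for a small matrix of dichotomous (0/1) item scores and compares it with 0.7. It does not reproduce the person separation index as RUMM 2020 calculates it; the function name `cronbach_alpha` and the toy response matrix are assumptions made for illustration and are not data from this study.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a persons-by-items matrix of item scores.

    responses: list of per-person lists, one 0/1 score per item.
    """
    n_items = len(responses[0])
    # Variance of each item across persons.
    item_variances = [
        pvariance([person[i] for person in responses]) for i in range(n_items)
    ]
    # Variance of the total (raw) scores across persons.
    total_variance = pvariance([sum(person) for person in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical toy data: 5 examinees x 4 items (NOT from the study).
toy_responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(toy_responses)
meets_benchmark = alpha >= 0.7  # benchmark of 0.7 or more, as stated above
print(round(alpha, 3), meets_benchmark)
```

The same variance estimator is used for the items and for the total scores, so the ratio inside the formula is unaffected by the choice between population and sample variance.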
This pilot placement test was developed to examine these hypotheses in relation to the main research question.
The participants were 809 first-year university students in the Faculty of Letters of Keio University.
A 50-item multiple-choice placement test with four components was used in this study.
The test material contained 15 grammar MC questions, 10 vocabulary MC questions, three long reading passages with 5 MC questions each,
and 10 cloze MC questions. Applicants had 60 minutes to complete this test, which was scored by optical readers.
The reading section consisted of one beginning level, one intermediate level, and one advanced level passage about 400-500 words in
length. Difficulty was rated impressionistically by teachers in terms of content, topic, and vocabulary level.
Nakamura (1998, p. 260) proposed four points to consider in assessing reading ability: (1) the nature of reading, (2) the theoretical or linguistic
underpinnings of reading, (3) the test format of reading, and (4) classroom teachers' ideas based on their teaching experiences. The construct of "reading
ability" for this test was established mainly from these plus the specific aspects of the Faculty of Letters as follows:
The grammar items were chosen by taking into consideration almost all of the grammar points that were supposed to have been mastered at the high
school level. The relevant textbooks are authorized by the Ministry of Education and are available at bookstores. Since we did not pretest items in
order to determine their difficulty empirically, we relied on theory to create items and sections at different ability levels. For example, the
vocabulary items were based on word frequency counts, using English-Japanese dictionaries available at bookstores as a benchmark, and the grammar items
were based on developmental sequences and on an analysis of the written structures in textbooks. The textbooks authorized by the Ministry of Education available
- the teachers' teaching experience with the reading sections of other existing tests,
- linguistic theories (Alderson, 2000; Grabe, 2000; Hughes, 2003),
- the needs of the Mita campus, where students are required to read the major books and references for their study areas (in other words, the required reading ability at the Mita campus), and
- the textbooks that are actually used in students' study areas.
The reading passages were selected from the three disciplines (humanities, social sciences and natural sciences), and appropriate vocabulary levels
were taken into consideration. The text passages were analyzed using L1 Flesch Reading Ease (Readability Formula) together with the judgments of experienced teachers.
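For readers unfamiliar with the readability measure mentioned above, the following sketch applies the standard Flesch Reading Ease formula to raw counts. The word, sentence and syllable counts are hypothetical and are not taken from the test passages; how the counts were obtained in the study is not stated, so the function simply accepts them as inputs.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Standard Flesch Reading Ease score; higher scores mean easier text."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical counts for a 450-word passage (NOT from the study).
score = flesch_reading_ease(
    total_words=450, total_sentences=25, total_syllables=720
)
print(round(score, 3))
```

A score computed this way would only be one input to the difficulty rating, since the passages were also judged by experienced teachers.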
The test data were analyzed using the RUMM 2020 statistical program (Andrich, Sheridan & Luo, 2004). The Chi-square values were examined to determine whether there was a large gap between neighboring
values. The benchmark for the acceptable range of FitResidual values was between -3 and +3. The location order was examined to obtain the construct
of the item difficulty order. The item characteristic curves were examined to check the discriminating power of each item. The benchmark for the person
separation index of the test reliability was set at 0.7 or over.
Results and Discussion
In the explanation below, four abbreviations are used: G stands for grammar, V for vocabulary, R for reading and C for cloze.
Chi-square is used to check whether the responses fit the model. In the Chi-square column, smaller values indicate better fit, while in the probability
column, larger values indicate better fit. We also examine whether there is a large gap between neighboring items in the Chi-square order.
Table 1. Chi-square order of the 2006 pilot placement test items
This table shows that three items (C47, R29, and G11) need to be examined because a gap of 8 Chi-square points separated them from the neighboring
items in the Chi-square column.
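One possible reading of this gap criterion can be sketched as follows: rank the items by Chi-square, compute the difference between each pair of neighbors, and set aside every item ranked above the last large gap. The five values are the largest ones in the ChiSq column of Table 3; the threshold constant and the ranking logic are assumptions made for illustration, not a documented RUMM 2020 procedure.

```python
# Five largest Chi-square values, copied from the ChiSq column of Table 3.
chi_square = {
    "G11": 61.082,
    "R29": 45.955,
    "C47": 45.466,
    "G13": 37.366,
    "V24": 35.324,
}

GAP_THRESHOLD = 8.0  # assumed cut-off, based on the "gap of 8" wording

# Rank items from the largest Chi-square to the smallest.
ranked = sorted(chi_square.items(), key=lambda pair: pair[1], reverse=True)

# Difference between each item and the next one down the ranking.
gaps = [
    (ranked[i][0], ranked[i + 1][0], ranked[i][1] - ranked[i + 1][1])
    for i in range(len(ranked) - 1)
]

# Position of the last large gap; every item ranked above it is flagged.
last_large = max(i for i, (_, _, gap) in enumerate(gaps) if gap >= GAP_THRESHOLD)
flagged = [item for item, _ in ranked[: last_large + 1]]
print(flagged)
```

Under these assumptions the large gaps fall between G11 and R29 and between C47 and G13, so the flagged set is the same three items named in the text.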
FitResidual is used to check whether an item has appropriate discriminating power. The acceptable range is usually from -3 to +3. A negative residual
indicates overdiscrimination (overfitting), while a positive residual indicates underdiscrimination (underfitting).
Table 2. FitResidual order of the 2006 pilot placement test items
According to the benchmark of the acceptable range (-3 to +3), among the three items pointed out in the Chi-square investigation, R29 is regarded as
overfitting (overdiscriminating) and G11 as underfitting (underdiscriminating). Based on this Chi-square and FitResidual information,
three items (R29, G11 and C47) appear to be problematic. They need to be investigated further in terms of location order.
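The FitResidual screening described above amounts to a simple range check, sketched here for the three items under discussion plus one well-fitting item. The residual values are copied from the Residual column of Table 3; the function name `classify` and the label strings are illustrative assumptions.

```python
LOWER, UPPER = -3.0, 3.0  # acceptable FitResidual range used in this study

# FitResidual values for four items, copied from Table 3.
fit_residual = {"R29": -3.077, "G11": 5.942, "C47": 2.698, "G02": -0.572}


def classify(residual):
    """Label an item by where its FitResidual falls relative to the range."""
    if residual < LOWER:
        return "overdiscriminating"
    if residual > UPPER:
        return "underdiscriminating"
    return "acceptable"


labels = {item: classify(value) for item, value in fit_residual.items()}
for item, label in labels.items():
    print(item, label)
```

C47 stays inside the range on this check, which is consistent with its having been singled out by its Chi-square value rather than by its FitResidual.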
The FitResidual indicates discriminating power (acceptable range -3 to +3): a negative value means overdiscriminating, while a positive value means underdiscriminating.
The location order, on the other hand, indicates the difficulty of the items. The usual range is -3 to +3.
Table 3. Location order of the pilot placement test items
Seq Item Type Location SE Residual DF(Residual) ChiSq DF(ChiSq) Prob
28 R28 MC -3.234 0.253 -0.868 781.00 10.653 9 0.300268
10 G10 MC -2.098 0.153 -1.534 781.00 17.766 9 0.037983
14 G14 MC -1.953 0.145 -1.503 781.00 15.122 9 0.087630
29 R29 MC -1.702 0.131 -3.077 781.97 45.955 9 0.000001
27 R27 MC -1.606 0.127 -1.153 781.97 8.169 9 0.517196
13 G13 MC -1.311 0.114 -0.772 780.02 37.366 9 0.000023
34 R34 MC -1.306 0.114 -1.958 781.00 19.300 9 0.022759
39 R39 MC -1.207 0.110 -1.976 778.06 23.730 9 0.004751
2 G02 MC -1.191 0.110 -0.572 781.97 4.759 9 0.854781
26 R26 MC -1.169 0.109 -0.631 781.97 13.323 9 0.148513
46 C46 MC -1.016 0.104 -2.359 775.12 23.194 9 0.005777
7 G07 MC -0.997 0.103 -2.294 781.00 27.909 9 0.000988
40 R40 MC -0.820 0.098 -0.874 778.06 8.037 9 0.530397
1 G01 MC -0.680 0.094 -1.467 781.97 15.424 9 0.079919
45 C45 MC -0.526 0.091 -0.768 777.08 14.598 9 0.102588
8 G08 MC -0.448 0.089 0.661 781.00 6.283 9 0.711335
24 V24 MC -0.393 0.088 2.409 781.97 35.324 9 0.000052
3 G03 MC -0.227 0.085 -3.647 780.02 32.560 9 0.000159
37 R37 MC -0.103 0.083 -1.914 779.04 18.754 9 0.027367
6 G06 MC -0.064 0.082 -2.258 781.00 18.563 9 0.029180
36 R36 MC -0.034 0.082 0.430 774.14 12.346 9 0.194485
15 G15 MC -0.024 0.081 0.012 780.02 14.703 9 0.099426
12 G12 MC 0.057 0.080 1.072 780.02 13.304 9 0.149347
50 C50 MC 0.103 0.081 1.497 764.36 29.045 9 0.000637
48 C48 MC 0.172 0.080 0.518 771.21 9.040 9 0.433593
38 R38 MC 0.209 0.079 0.215 779.04 7.787 9 0.555765
31 R31 MC 0.218 0.079 1.230 780.02 8.772 9 0.458594
23 V23 MC 0.284 0.078 0.658 781.00 6.175 9 0.722263
5 G05 MC 0.341 0.077 0.211 779.04 11.815 9 0.223935
33 R33 MC 0.347 0.077 -0.186 781.97 9.051 9 0.432568
18 V18 MC 0.405 0.077 1.989 778.06 9.579 9 0.385588
41 C41 MC 0.484 0.076 0.962 780.02 13.116 9 0.157419
21 V21 MC 0.527 0.076 -1.753 781.00 17.097 9 0.047220
32 R32 MC 0.533 0.076 -0.613 781.97 10.796 9 0.289949
9 G09 MC 0.652 0.075 -0.882 780.02 15.579 9 0.076204
4 G04 MC 0.672 0.075 3.148 781.00 18.594 9 0.028875
19 V19 MC 0.723 0.075 4.998 780.02 14.843 9 0.095339
25 V25 MC 0.728 0.075 1.496 781.97 4.276 9 0.892325
22 V22 MC 0.748 0.075 0.978 776.10 12.365 9 0.193523
43 C43 MC 0.802 0.075 0.976 780.02 7.183 9 0.618101
42 C42 MC 0.902 0.075 1.848 778.06 13.526 9 0.140218
16 V16 MC 0.975 0.075 4.733 780.02 21.973 9 0.008967
17 V17 MC 0.978 0.075 2.545 780.02 15.440 9 0.079538
44 C44 MC 1.116 0.076 3.518 771.21 19.936 9 0.018314
11 G11 MC 1.255 0.076 5.942 780.02 61.082 9 0.000000
30 R30 MC 1.353 0.076 0.042 781.97 6.548 9 0.684044
20 V20 MC 1.483 0.077 -0.294 781.00 8.577 9 0.477171
35 R35 MC 1.499 0.077 -0.482 780.02 14.418 9 0.108223
49 C49 MC 1.975 0.084 2.342 764.36 30.129 9 0.000417
47 C47 MC 2.570 0.095 2.698 774.14 45.466 9 0.000001
R29 is the fourth easiest item in the order, while G11 is the sixth most difficult one.
C47, which was pointed out as problematic in terms of its Chi-square order, is the most difficult one. The location order also
shows that Reading and Grammar items tend to fall on the easier side of the continuum, while Vocabulary and Cloze
items are relatively difficult. In other words, the location order reveals the construct of the item difficulty order.
Item characteristic curve (ICC)
ICC is used to show in detail the degree of agreement between the observed proportions and the theoretical curve.
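To make the comparison between observed proportions and the theoretical curve concrete, the sketch below assumes the dichotomous Rasch model, in which the probability of a correct response depends only on the difference between person ability and item difficulty (both in logits). The class-interval values are hypothetical and are not taken from the figures in this study; RUMM 2020's own computations are not reproduced here.

```python
import math


def rasch_probability(ability, difficulty):
    """Expected probability of a correct answer under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def icc_residuals(difficulty, class_intervals):
    """Observed minus expected proportion correct for each class interval.

    class_intervals: list of (mean_ability, observed_proportion_correct).
    """
    return [
        (ability, observed - rasch_probability(ability, difficulty))
        for ability, observed in class_intervals
    ]


# Hypothetical item and class intervals (NOT from the study).
difficulty = 1.0
intervals = [(-1.0, 0.30), (1.0, 0.50), (3.0, 0.70)]

residuals = icc_residuals(difficulty, intervals)
for ability, diff in residuals:
    print(ability, round(diff, 3))
```

In this made-up case the lowest group scores above the curve and the highest group below it, which is the flatter-than-expected pattern associated with an underdiscriminating item.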
Figure 1. An examination of item G11 and its item characteristic curve (ICC) in the 2006 pilot placement test
This ICC of G11 shows us that the less able students performed better than anticipated. It also indicates that the more able students performed
more poorly than anticipated. This further shows that this item did not discriminate well between lower level students and intermediate level
students. A likely reason is that this item was a little too difficult (1.255 logits above the mean). This item would probably function better
in differentiating the more able students at the top end. From the lower end to the mid group, there is no discrimination, or even negative discrimination.
From the mid group to the top, it has some discriminating power, but still not as much as anticipated. In short, this test did not yield three separate groups
as clearly as wished.
Figure 2. An examination of item R29 and its item characteristic curve (ICC) in the 2006 pilot placement test
This ICC of R29 shows that the item is problematic because it is overdiscriminating. It distinguishes between the lower and the intermediate
level students, but it does not discriminate among the top level students. The top level students seem to have some advantage with, or bias toward, the
topic. The less able students do not fit the model: the lower end is overdiscriminating, and the lower group performs more poorly than anticipated.
Figure 3. An examination of item C47 and its item characteristic curve (ICC) in the pilot placement test
This ICC of C47 indicates that the item has no discriminating power. All the groups answer the item correctly at a rate below the guessing level. This is probably
the reason it was pointed out as problematic in terms of its Chi-square order.
So far, only three items have been pointed out as problematic. These three items amount to just
three out of 50, or 6% of the whole. However, it would be premature to conclude that this proportion is negligible.
Distracter curve information
Now let us examine Figure 4, which shows the distracter information curves. These curves indicate how attractive each distracter
(answer option other than the key) is to the test takers at different ability levels.
Figure 4. A description of a key answer and the three distracters of item G11 in the 2006 pilot placement test
Figure 4 looks strange because, up to an ability level of 1.2, the key and the distracters functioned in a confusing way. Above the 1.2 ability level, the key answer
functioned properly. Between 0.7 and 1.0, the students preferred Option 2 to the key answer. In this case, the lower and upper ability students
got the item correct.
Let us take a look at how the distracters are behaving in the following item.
Figure 5. A description of a key answer and the three distracters of item R29 in the 2006 pilot placement test
For R29, the key answer functioned well: the distracters were chosen less often than the key answer. The distracters of this item appear to function reasonably well.
Let us look at how the distracters are functioning in the following item.
Figure 6. A description of a key answer and the three distracters of item C47 in the 2006 pilot placement test
Item C47 is strange and difficult. Neither the three distracters nor the key answer functions at any ability level. Even the key is chosen at a rate below the guessing level.
It seems that there are two correct answers, with Option 3 the most popular. All the students appear to misunderstand the concept. The key answer is not discriminating.
Information targeting shows the relative position between persons (person ability) and items (item difficulty). The following graph shows us that we need
some more difficult items to match the more able students in the future version.
Figure 7. Relative positions between persons (person ability) and items (item difficulty) in the 2006 placement test through an information targeting graph
This figure suggests that as a whole the test was very good at measuring students' English proficiency. For future improvement, more difficult items
are needed to match the more able students at the top of this continuum.
Examination of reliability
The reliability was investigated by the person separation index, which is akin to Cronbach's alpha. The benchmark for the
acceptable boundary is over 0.7. The reliability of this placement test was 0.78 in terms of the person separation
index. This suggests the items in this test were internally consistent.
Summary of the results and discussion
This study explored whether the pilot version of this placement test had enough validity and reliability to proceed to the real
test. Three null hypotheses pertaining to the primary research question were rejected.
The ability level of the top group was higher than the difficulty level of the items in most cases. On the other hand, the
ability of the bottom group, who need remedial instruction, was below the difficulty level of the reading section and the grammar
section in most cases.
Hypothesis 1, "The test does not have enough validity," was not supported. The discovery of three problematic items was a minor defect from the
viewpoint of the whole test. In other words, 94% of the test items fit the model, which technically supports the construct validity of the test.
However, validity examinations should be conducted in detailed ways which include concurrent validity and/or factor analysis methods.
Hypothesis 2, "The test does not have acceptable reliability," was not supported. The reliability, investigated by the person separation index,
exceeded the acceptable boundary of 0.7. Accordingly, the alternate hypothesis "The test is reliable" was accepted.
The 2007 version of the test should explore the issue of face validity and practicality more systematically. Since the 2006
test was investigated mainly in terms of reliability and construct validity, further research is needed to fully corroborate
Conclusions and implications
The Research Question for this study was partially supported with the examination of the three presuppositions.
Also, the information obtained from the person-item relative position helped us divide the students into appropriate groups.
However, there were not enough difficult questions at the high end of the spectrum to create three levels. That is not a
problem of test design, but rather item content. In future research this problem should be solved.
Considering McNamara's (2000, p. 83) statement "The right balance of three basic critical dimensions of tests – validity,
reliability and practicality – will depend on the test context and test purpose," the present placement test should be
regarded as acceptable judging from the statistical analyses and the test context as well as the test purpose. For
future improvement, the predictive validity should be investigated as well as the concurrent validity, the factor analysis
and the multi-trait multi-method (MTMM) analysis for the test validation. Future studies should also explore the issue of
face validity and practicality more systematically.
The present author is grateful to Dr. David Andrich and Dr. Irene Styles for their invaluable comments and professional help.
References and Bibliography
Alderson, J.C. (2000). Assessing reading. New York: Cambridge University Press.
Andrich, D., Sheridan, B. & Luo, G. (2004). RUMM 2020: Rasch Unidimensional Measurement Models [Computer software].
Perth, Western Australia: RUMM Laboratory.
Bachman, L.F. (1999). Fundamental considerations in language testing. Oxford: Oxford University Press.
Brown, J. D. (2005). Testing in language programs: A comprehensive guide to English language assessment. New Edition. New York: McGraw-Hill.
Fulcher, G. (1997). An English language placement test: issues in reliability and validity. Language Testing 14, 2, 113-138.
Grabe, W. (2000). Reading research and its implications for reading assessment.
In A. Kunnan (Ed.), Fairness and validation in language assessment (pp. 226-62). Cambridge: Cambridge University Press.
Hughes, A. (2003). Testing for Language Teachers. Cambridge: Cambridge University Press.
Linacre, M. (2004). WINSTEPS Rasch Measurement computer program (Version 3.51). Chicago: Winsteps.com.
McNamara, T. (2000). Language Testing. Oxford: Oxford University Press.
Nakamura, Y. (1998). Components of Reading Ability. Educational Studies, 40. 259-281. International Christian University.
Westrick, P. (2005). Score Reliability and Placement Testing. JALT Journal 27, 1, 71-92.