Lifelong Learning: Proceedings of the 4th Annual JALT Pan-SIG Conference.
May 14-15, 2005. Tokyo, Japan: Tokyo Keizai University.

How entrance examination scores can inform
more than just admission decisions


by Christopher Weaver (Tokyo University of Agriculture and Technology)



Abstract

The scores that test takers achieve on entrance examinations are primarily used for admission decisions. However, if these scores are viewed from a Rasch measurement perspective, they can provide examination developers, curriculum designers, and teachers with numerous insights concerning test takers' current level of knowledge as well as the actual performance of the examination. This paper provides a detailed account of how Winsteps, a Rasch-based software package, can provide this rich source of information. It begins with a brief explanation of Rasch measurement theory and the use of a user-friendly scale to report test takers' results on an entrance examination. The paper then examines how the cut-score for admissions provides an important reference point for analyzing examination performance. The Rasch model also defines what the cut-score actually means in terms of test takers' knowledge of a given domain. This information is very valuable for curriculum designers and can provide teachers with a diagnostic tool to assess students' diverse strengths and weaknesses. The paper closes by stressing the importance of developing quality entrance examination items in order to ensure that test takers' scores can inform more than admission decisions.

Keywords: Rasch theory, entrance exams, admissions decisions, test impact


[ p. 114 ]


"This paper thus proposes an integrative approach to entrance examinations that can help produce more informative test scores."

Entrance examinations play an influential role in students' lives in Japan. In many cases, these tests determine which academic institutions students attend. This gate-keeping function, however, is not the only role that entrance examinations can play. The scores that test takers receive on an entrance examination can also provide administrators, entrance examination developers, curriculum designers, and teachers with additional information which could be used to enhance instruction. This wealth of information depends upon two factors. The first factor concerns the entrance examination's construct validity. Items on the examination should assess the content knowledge that test takers are supposed to have mastered. In addition, the items should provide administrators and curriculum designers with an idea about the probable success of test takers if they are accepted to the institution. Predictive validity is thus an equally important concern for entrance examination designers (Brown, 1996). The second factor involves the use of the Rasch measurement model to determine test takers' level of knowledge and the items' level of difficulty on the entrance examination. This approach not only provides a cut-score for admission decisions, but it also defines test takers' level of knowledge in relation to the items on the entrance examination. Taken together, concerns about an entrance examination's content and predictive validity can be assessed with a Rasch analysis by closely examining the probabilistic relationship between test takers' abilities and the items on the entrance examination. This paper thus proposes an integrative approach to entrance examinations that can help produce more informative test scores.

The Rasch model

Using the test takers' responses on an entrance examination, a Rasch analysis calculates an estimate of person ability for each test taker and an estimate of item difficulty for each question on the examination. Person ability estimates show the test takers' level of content knowledge as defined by the items on the entrance examination, whereas the item difficulty estimates indicate how difficult the different items on the entrance examination were for test takers to answer successfully. These two estimates are then placed on a common scale measured in logits. This unit of measurement is the natural log (base e) of the odds of a test taker successfully answering the different questions on the examination (Smith, 2000). For example, if the odds of answering a question correctly are 2:1, then the natural log of 2 equals .69 logits; the natural log of 1:1 odds is 0 logits; and the natural log of 1:2 odds is -.69 logits.
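As a rough illustration of this odds-to-logit conversion, the following Python sketch simply reproduces the figures mentioned above; the function name is an illustrative choice, not part of any Rasch software.

```python
import math

def odds_to_logits(successes: float, failures: float) -> float:
    """Convert the odds of success (successes : failures) into logits."""
    return math.log(successes / failures)

# The odds mentioned in the text: 2:1, 1:1, and 1:2
print(round(odds_to_logits(2, 1), 2))  #  0.69 logits
print(round(odds_to_logits(1, 1), 2))  #  0.0 logits
print(round(odds_to_logits(1, 2), 2))  # -0.69 logits
```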
Since the person ability and item difficulty estimates are on a common scale, it is possible to determine the probability of a test taker correctly answering any item on the entrance examination. To better understand how this is possible, it is helpful to refer to a person-item map, which graphically represents person ability estimates and item difficulty estimates. For the sake of illustration, let's imagine a test taker (T1) took an entrance examination that had six items (I1-I6). Figure 1 is the person-item map that shows the probabilistic relationship between this test taker's level of ability and the items' level of difficulty.

[ p. 115 ]

Figure 1. An example of a person-item map for a fictional entrance examination.


If the test taker's person ability estimate is at the same location on the logit scale as the item difficulty estimate, then he or she has a .5 probability of answering that item correctly. Thus, test taker T1 has a fifty percent chance of correctly answering item I3. If the test taker's person ability estimate exceeds an item's difficulty estimate as in the case of items I2 and I1, then the probability of his or her success on these items also increases. The reverse is true when the item difficulty estimates exceed the test taker's person ability estimate. She or he is less likely to correctly answer items I4, I5 and I6.
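The probabilities just described follow from the dichotomous Rasch model, in which the probability of a correct response depends only on the difference between person ability and item difficulty. The short Python sketch below illustrates this; the logit values assigned to T1 and items I1 to I6 are illustrative assumptions that merely echo the ordering in Figure 1.

```python
import math

def p_correct(ability: float, difficulty: float) -> float:
    """Dichotomous Rasch model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical logit estimates echoing Figure 1: T1 sits at the same point
# as I3, above I1 and I2, and below I4, I5 and I6.
t1_ability = 0.0
item_difficulties = {"I1": -1.5, "I2": -0.7, "I3": 0.0,
                     "I4": 0.7, "I5": 1.4, "I6": 2.1}

for item, difficulty in item_difficulties.items():
    print(f"{item}: P(correct) = {p_correct(t1_ability, difficulty):.2f}")
```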

User-friendly rescaling

Since the logit scale is not widely familiar, it can be transformed into a user-friendly scale such as CHIPs (see Wright and Stone, 1979 for an extensive review of different user-friendly scales). The CHIP scale has a number of benefits. First, it eliminates the negatives and decimals that accompany logit scores (Smith, 2000). Second, the range of possible scores on the CHIP scale is from 0 to 100. As a result, one does not need a special knowledge of statistics to interpret scores on a CHIP scale. Third, the midpoint of the scale is set at 50 CHIPs, which represents the average difficulty of all the items on the entrance examination. Test takers' scores can thus be interpreted in relation to the average difficulty of the items on the entrance examination. Fourth, the CHIP scale makes it relatively simple to determine the probabilities of test takers correctly answering the different items on the entrance examination. All one needs to do is subtract an item's level of difficulty from a test taker's entrance examination score. The difference between the two defines the test taker's probability of success. The CHIP scale has the additional benefit that these probabilities of success come in easy-to-remember multiples of 5, as shown in Table 1.

Table 1. The relationship between the difference in CHIPs (test taker's score minus item difficulty) and the probability of correctly answering a question on an entrance examination.

Difference in CHIPs     Probability of a correct answer
        -10                          .10
         -5                          .25
          0                          .50
         +5                          .75
        +10                          .90

[ p. 116 ]


According to Table 1, a test taker has only a 10 percent chance of correctly answering an item on the entrance examination when that item's level of difficulty exceeds his or her entrance examination score by 10 CHIPs (resulting in a difference of -10 CHIPs). With a difference of -5 CHIPs, the chances of a test taker correctly answering the item become 25 percent. When there is no difference between the test taker's entrance examination score and the item's level of difficulty, the probability increases to 50 percent. As a test taker's entrance examination score begins to exceed the item's level of difficulty, the probability of him or her correctly answering the item continues to rise. A difference of 5 CHIPs results in a 75 percent probability, whereas a 10 CHIP difference raises the chances to 90 percent. It must be remembered, however, that these percentages represent the probability of a test taker correctly answering the different items on the entrance examination. Sometimes a test taker who has a low ability estimate might defy a predicted probability by correctly answering an item on the entrance examination that greatly exceeds his or her ability. One possible explanation for this success might be that the item contains content the test taker knows very well. For example, a test taker who has an interest in computers may have an advantage over other test takers if one of the reading passages on the entrance examination deals with current developments in computer science. Thus, the test taker's improbable success can be attributed to his or her content knowledge of computers rather than his or her current knowledge of English.
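The pattern in Table 1 can be reproduced with a few lines of Python, assuming the usual CHIP convention of roughly 4.55 CHIPs per logit (so that a 5-CHIP gap corresponds to ln 3 logits); this is a sketch of the arithmetic, not part of the Winsteps output.

```python
import math

CHIPS_PER_LOGIT = 5 / math.log(3)  # about 4.55; a 5-CHIP gap equals ln(3) logits

def p_correct_from_chips(person_chips: float, item_chips: float) -> float:
    """Probability of a correct answer given the difference in CHIPs."""
    diff_in_logits = (person_chips - item_chips) / CHIPS_PER_LOGIT
    return 1.0 / (1.0 + math.exp(-diff_in_logits))

# Differences of -10, -5, 0, 5 and 10 CHIPs reproduce Table 1
for diff in (-10, -5, 0, 5, 10):
    print(f"{diff:+3d} CHIPs -> probability {p_correct_from_chips(50 + diff, 50):.2f}")
```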

An example Rasch analysis of the English section of a university entrance examination

The remainder of the paper analyzes test takers' scores on an entrance examination from the perspective of entrance examination developers, curriculum designers, and teachers. Although these perspectives have overlapping interests, a Rasch analysis provides each of them with specific information suited to their particular needs. The entrance examination discussed in this paper is not an actual test which has been used to determine university admissions, but rather an idealized example which aims to illustrate how careful test construction combined with a Rasch analysis can inform curriculum development and implementation. The examination is based upon Celce-Murcia, Dörnyei, and Thurrell's (1995) conception of communicative competence to assess test takers' sociocultural competence, linguistic competence, and actional competence. How these broad competencies are operationalized in an entrance examination involves careful item design that should reflect the requirements of the high school English curriculum and the expectations of the university's English program. An entrance examination designed along these lines thus has the potential of producing more meaningful examination scores with the assistance of a Rasch analysis.
In the case of this illustrative example, the test takers' entrance examination scores are generated using simulated data which has the additional benefit of meeting the underlying assumptions of the Rasch model (see Bond & Fox, 2001 for an accessible discussion of these assumptions). The analyses discussed in this paper center around the graphical output produced by Winsteps (Linacre, 2006). This output is extremely useful because it enables entrance examination developers, curriculum designers, and teachers to examine test takers' overall level of English competence in relationship to the difficulty of the different items on the entrance examination.
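For readers who wish to experiment with this approach, the following Python sketch shows one simple way that dichotomous responses consistent with the Rasch model might be simulated; the sample sizes and normal distributions are illustrative assumptions rather than the values used to generate the figures in this paper.

```python
import math
import random

random.seed(1)  # make the simulation repeatable

def p_correct(ability: float, difficulty: float) -> float:
    """Dichotomous Rasch model probability of success."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Illustrative assumptions: 500 test takers and 30 items, both drawn
# from normal distributions on the logit scale
abilities = [random.gauss(0.0, 1.0) for _ in range(500)]
difficulties = [random.gauss(0.0, 1.0) for _ in range(30)]

# Each response is a Bernoulli draw with the model-implied probability
responses = [[int(random.random() < p_correct(b, d)) for d in difficulties]
             for b in abilities]

# A matrix such as this can then be written to a file and analyzed in Winsteps
print(len(responses), "test takers x", len(responses[0]), "items")
```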

Entrance examination scores from a test developer's perspective

From a test developer's perspective, the cut-score is one of the most important points of consideration. For the sake of illustration, let's say that the cut-score for the entrance examination is set at one standard deviation above the average test taker's level of English ability (indicated with a dotted line in Figure 2). The cut-score provides entrance examination developers with an important reference point when analyzing the performance of the examination. Ideally, there should be a cluster of items situated around the cut-score because the standard error associated with each item drops when there are more items surrounding it (Beltyukova, Stone, & Fox, 2004). According to Figure 2, there are three items (Soci4, Act6 and Act7) situated around the hypothesized cut-score.
Figure 2. Person-item map of the hypothetical university entrance examination in Figure 1 measured in CHIPs.

[ p. 117 ]


Entrance examination developers looking to improve the performance of future examinations should examine the properties of the items around the cut-score so that they can write more items at this level of difficulty. This type of analysis may reveal that certain cognitive demands (e.g. Weaver & Romanko, 2005) or item types (e.g. Weaver & Sato, 2005) are more challenging for test takers. Likewise, test developers might also want to remove or revise items such as Act11, Ling12, and Soci8 that are located far away from the cut-score.
In summary, the constant challenge for developers of entrance examinations is to design items that have a level of difficulty situated around the cut-score. Although the exact location of the cut-score may vary from year to year, a Rasch analysis helps test developers identify the level of item difficulty that provides the most reliable information for admission decisions.
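A small Python sketch can illustrate the kind of screening described above: items whose difficulty falls within a chosen band around the cut-score are separated from those located far away. The cut-score, band width, and item difficulties below are illustrative assumptions, not values taken from the examination in Figure 2.

```python
# Illustrative item difficulties in CHIPs (hypothetical values)
item_difficulties = {"Soci4": 59, "Act6": 60, "Act7": 61,
                     "Act11": 43, "Ling12": 41, "Soci8": 75}

cut_score = 60   # hypothetical cut-score in CHIPs
band = 2.5       # how close an item should sit to the cut-score

near_cut = {k: v for k, v in item_difficulties.items()
            if abs(v - cut_score) <= band}
far_from_cut = {k: v for k, v in item_difficulties.items()
                if abs(v - cut_score) > band}

print("Items targeting the cut-score:", sorted(near_cut))
print("Candidates for removal or revision:", sorted(far_from_cut))
```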

Entrance examination scores from the curriculum designers' perspective

Curriculum designers share many of the same concerns as test writers. Both have an interest in the items situated around the cut-score. However, curriculum designers differ slightly in that they are interested in determining the language competencies of the students who will be admitted to the university. Once again it should be noted that the insights gained through this type of analysis ultimately depend upon the quality of the items in the entrance examination.
In the case of our illustrative example, Figure 2 shows that items assessing test takers' sociocultural knowledge (i.e. items Soci1 to Soci4), actional competencies (i.e. items Act1 to Act6), as well as one aspect of linguistic knowledge (i.e. item Ling1), are the most difficult items on the examination. As such, the university's English program should provide students with an opportunity to develop these particular areas of English competency.
Curriculum designers seeking a more refined account of test takers' level of ability across the different English competencies should inspect the entrance examination's expected score map (Figure 3). This figure shows the probability of test takers progressing through the different steps in the rating scales used to assess test takers' English abilities. For example, the most difficult question on the entrance examination (Act1) is listed on the upper right-hand corner of Figure 3. This item has a dichotomous rating scale with test takers either receiving 0 for an incorrect response or 1 for a correct response. The location of 0 and 1 on the x-axis of Figure 3 reveals how difficult this item was for test takers. Test takers situated at the cut-score (indicated by the dotted line) would probably receive a score of 0 on this item. To have a 50 percent chance of receiving a score of 1, test takers need an overall entrance examination score of 69 CHIPs. This threshold between the scores of 0 and 1 is indicated with a colon in Figure 3.

[ p. 118 ]

Figure 3. Expected score map for the different items in the entrance examination in Figure 1.


Looking at the next most difficult item on the entrance examination (Soci1), test takers situated at the cut-score would probably receive 1 out of a possible 3 points. Unlike the most difficult item, this question employs a rating scale to give test takers credit for partial knowledge. Once again the colons between the different steps in the rating scale signify the points where test takers have a 50 percent chance of moving from one step to the next. For the next item, Ling1, the cut-score is located to the right of the threshold between 1 and 2. As a result, test takers at the cut-score have a greater chance of receiving 2 out of 3 possible points on this item. Curriculum designers would continue this type of analysis, examining test takers' most probable scores across the different items on the entrance examination, in order to gain a more refined account of test takers' English abilities.
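The expected scores read off Figure 3 can also be computed directly from an item's step thresholds (the points marked with colons). The Python sketch below does this for a hypothetical 0-3 item under the Rasch rating scale model; the threshold values and the 60-CHIP cut-score are illustrative assumptions.

```python
import math

CHIPS_PER_LOGIT = 5 / math.log(3)  # same CHIP convention as before

def category_probs(person_chips, thresholds_chips):
    """Rasch rating scale model: category probabilities from step thresholds."""
    steps = [(person_chips - t) / CHIPS_PER_LOGIT for t in thresholds_chips]
    numerators = [1.0]  # unnormalized term for a score of 0
    running_sum = 0.0
    for step in steps:
        running_sum += step
        numerators.append(math.exp(running_sum))
    total = sum(numerators)
    return [n / total for n in numerators]

def expected_score(person_chips, thresholds_chips):
    """Expected item score for a test taker at the given CHIP measure."""
    probs = category_probs(person_chips, thresholds_chips)
    return sum(score * p for score, p in enumerate(probs))

# Hypothetical 0-3 item with step thresholds at 55, 62, and 70 CHIPs,
# evaluated for a test taker at a hypothetical cut-score of 60 CHIPs
print(round(expected_score(60, [55, 62, 70]), 2))  # about 1.2
```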
In summary, the person-item map combined with the expected score map provides curriculum designers with an indication of the relative strengths and weaknesses of new students across the different English competencies tested in the entrance examination. The probable scores shown in Figure 3 suggest that test takers who have exceeded the cut-score have a firm grounding in many of these competencies. In most cases, they have successfully answered the dichotomous items and received 2 out of the 3 possible points on items that rewarded partial knowledge. Test takers' poor performance on the 11 most difficult items (i.e. Soci1 to Soci4, Act1 to Act6, and Ling1), however, suggests a need to further develop students' actional and sociocultural competencies. A Rasch analysis thus provides curriculum designers with an informative means of identifying the kinds of modifications the curriculum might require to meet the needs of new students.


[ p. 119 ]

Entrance examination scores from the teachers' perspective

Teachers share many of the same concerns as curriculum designers. They are interested in meeting the needs of new students. Teachers, however, are often faced with the task of dealing with diverse student needs in their classrooms. A Rasch analysis can help teachers with this challenge by providing them with a diagnostic tool that identifies their students' strengths and weaknesses. Figure 4, for example, details the performance of test taker number 12, who received a score of 59 CHIPs on the entrance examination. The column of numbers in the middle of this figure represents the expected responses for a test taker with a score of 59. The Rasch model expects that test takers at this level of English ability would be unsuccessful on the most difficult item, Act1, and would thus score 0. However, the test taker's actual response, which has two period marks on either side of it, was correct. A similar response pattern can be observed with the other items assessing test takers' actional competence. This test taker either outperforms the Rasch model's expectations or receives the highest possible score.
Figure 4. Response pattern map for Test Taker Number 12 on the entrance examination.
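The comparison underlying such a response pattern map can be sketched in Python: the expected response for each item is set against the observed one, and large gaps are flagged for follow-up. The item difficulties, the observed scores, and the 0.7 flagging threshold below are illustrative assumptions, not test taker number 12's actual data.

```python
import math

CHIPS_PER_LOGIT = 5 / math.log(3)

def p_correct(person_chips: float, item_chips: float) -> float:
    """Dichotomous Rasch probability, working directly in CHIPs."""
    return 1.0 / (1.0 + math.exp(-(person_chips - item_chips) / CHIPS_PER_LOGIT))

person_score = 59  # entrance examination score in CHIPs

# Hypothetical dichotomous items: (item difficulty in CHIPs, observed score)
observed = {"Act1": (69, 1), "Soci3": (52, 0), "Soci7": (50, 0), "Ling3": (48, 1)}

for item, (difficulty, score) in observed.items():
    expected = p_correct(person_score, difficulty)
    residual = score - expected
    flag = "  <- unexpected" if abs(residual) > 0.7 else ""
    print(f"{item}: expected {expected:.2f}, observed {score}{flag}")
```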


In contrast, this test taker's performance was lower than expected on some of the items that assessed sociocultural and linguistic competencies. In the case of items Soci3 and Soci7, the test taker scored a very unexpected 0. One possible explanation for this unexpected result may be that these two items were at the end of the examination and the test taker did not have time to complete them. Another possibility may be that this student's sociocultural knowledge is less developed than his or her actional and linguistic knowledge. Teachers can determine which of these possibilities underlies this student's performance using targeted assessments and/or student interviews. This follow-up investigation draws upon the concept of triangulation, in which students' abilities are defined using a variety of different assessment techniques.
In summary, the response map provides teachers with a useful diagnostic tool to identify new students' strengths and weaknesses. This diagnostic tool also represents an important bridge between the English section of the entrance examination and the classroom. Establishing a link between the two creates a situation where teachers can examine their new students' response maps and follow up on any issues that need further examination. Teachers are thus in a better position to deliver instruction that builds upon students' strengths while addressing any weaknesses that may exist.

[ p. 120 ]



A comprehensive conception of entrance examination scores
"Information drawn from the examination and the test takers' performances must also be viewed critically, constantly seeking ways to improve the quality of future examinations . . ."

This paper argues for a more comprehensive use of test takers' scores on entrance examinations. With the assistance of a Rasch analysis, examination writers, curriculum developers, and teachers can gain valuable insights into the performance of the entrance examination and the test takers' abilities. However, the value of these insights ultimately relies upon the quality of the items used in the entrance examination. Thus, it is imperative that considerable care be taken when developing the examination because it serves as the foundation for all of the subsequent analyses. Information drawn from the examination and the test takers' performances must also be viewed critically, constantly seeking ways to improve the quality of future examinations (e.g. Shizuka, Takeuchi, Yashima, & Yoshizawa, 2006; Weaver, in press). The resulting cycle of writing, evaluating, and refining entrance examination question types can help produce more valid and reliable assessments of test takers' abilities. A better performing entrance examination in turn increases the quality of the information it provides. These benefits, however, require all of the parties involved in developing entrance examinations to invest a considerable amount of time, effort, and commitment. In some cases, there might be a need for a shift in mindset concerning the role of entrance examinations in the Japanese educational context. Hopefully, this paper illustrates that the potential benefits far outweigh the effort required to actualize a more comprehensive conception of entrance examination scores.

Acknowledgements:
I would like to thank the Japan Society for the Promotion of Science for funding this line of research (MEXT Grant No. 17520371).
I am deeply grateful to Yoko Sato, Mark Chapman, and Tim Newfields for their critical comments and suggestions on this position paper.


References

Beltyukova, S., Stone, G., & Fox, C. (2004). Equating student satisfaction measures. Journal of Applied Measurement, 5 (1), 62-69.

Bond, T., & Fox, C. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, N. J.: L. Erlbaum.

Brown, J. D. (1996). Testing in language programs. Upper Saddle River, NJ: Prentice Hall Regents.

Celce-Murcia, M., Dörnyei, Z., & Thurrell, S. (1995). Communicative competence: A pedagogically motivated model with content specifications. Issues in Applied Linguistics, 6 (2), 5-35.

Linacre, J. (2006). WINSTEPS Rasch measurement computer program (Version 3.60). Chicago: Winsteps.com.

Shizuka, T., Takeuchi, O., Yashima, T., & Yoshizawa, K. (2006). A comparison of three- and four-option English tests for university entrance selection purposes in Japan. Language Testing, 23 (1), 35-57.

Smith, E., Jr. (2000). Metric development and score reporting in Rasch measurement. Journal of Applied Measurement, 1 (3), 303-326.

Weaver, C. (in press). Evaluating the use of rating scales in a high-stakes Japanese university entrance examination. Spaan Fellow Working Papers in Second or Foreign Language Assessment. Ann Arbor, MI: University of Michigan.

Weaver, C., & Romanko, R. (2005). Assessing oral communicative competence in a university entrance examination. The Language Teacher, 29(1), 3-9.

Weaver, C., & Sato, Y. (2005). Kobetsu-gakuryoku-kensa (eigo) no Rasch bunseki [A Rasch analysis of the English section of a university entrance examination]. Daigaku Nyuushi Kenkyuu Journal, 15, 147-153.

Wright, B., & Stone, M. (1979). Best test design: Rasch measurement. Chicago: MESA Press.


[ p. 121 ]