Shiken: JALT Testing & Evaluation SIG Newsletter
Vol. 11 No. 2. Sep. 2007. (p. 20a) [ISSN 1881-5537]

Suggested Answers for Assessment Literacy Self-Study Quiz #3
by Tim Newfields

Possible answers for the questions about testing/assessment raised in the
September 2007 issue of this newsletter appear below.

Part I: Open Questions

1 Q: If you could include one additional statistic about this test, which would probably be most helpful to general readers? Also, what other statistics about this test should probably be mentioned to the public?

   A: Though some bare-bones descriptive statistics are offered, no inferential statistics appear. To help readers draw conclusions that extend beyond the immediate data, inferential statistics are needed. One statistic that should definitely be mentioned is the standard error of measurement for this test – the likelihood that a person's obtained score will differ from their "true score" due to factors other than their ability. Although the standard error of measurement (SEM) is mentioned in the TOEIC Technical Manual (n.d., p. 24), it does not appear in the TOEIC Users Guide or in any general brochures given to test takers in Japan.
* Details of how well the TOEIC® correlates with other measures of English proficiency should be widely distributed, and special care should be taken when equating the score of one test with another.
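The relationship between a test's score spread, its reliability, and its SEM can be sketched numerically. This is a minimal illustration of the standard textbook formula SEM = SD × √(1 − r); the SD, reliability, and score figures below are purely hypothetical, not actual TOEIC statistics:

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - r), where r is a reliability estimate for the test."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical figures: a test with SD = 100 and reliability r = .91
sem = standard_error_of_measurement(100, 0.91)
print(round(sem, 1))  # 30.0

# An approximate 95% confidence band around an observed score of 600:
low, high = 600 - 1.96 * sem, 600 + 1.96 * sem
print(round(low), round(high))  # 541 659
```

A band this wide is exactly the kind of information examinees rarely see: two scores that look quite different may not reflect a real difference in ability.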

Further reading:

Educational Testing Service. (n.d.). TOEIC Technical Manual. Retrieved September 1, 2007 from

Simmerok, B. D. (n.d.). Session 6 Lecture: Standard Error of Measurement. Retrieved September 2, 2007 from

2 Q: A teacher notices an error in her school entrance exam after the exam was administered: one multiple choice question had two possible correct answers. What, if anything, should be done?

   A: First, it would be good to consider how many items are in the test altogether. Ideally, a high stakes test such as this should have many items, so that even if a few items do not perform well, the overall test is robust. However, there are many shoddy entrance exams with just 25 - 35 items. In such cases no really "neat" solution exists: any decision made will involve some messy compromises.
* If we are dealing with a four-option multiple choice question, ordinarily examinees would have a 1 in 4 chance of guessing the correct answer. With two "correct" answers, however, there is a 1 in 2 chance of guessing a correct option. This might lead to a slight score inflation. Hence, it would be wise to conduct a post-hoc examination of the applicants who just passed the cut-off point for this test. Were any of their scores inflated because of this item? It doesn't really matter whether students well below or well above the cut-off point were affected by the bad item. However, the scores of students who just barely reached the cut-off point should be examined closely.
* At this point, it might be good to look at the item facility and item discrimination of the problematic item and also see how those with the highest scores and lowest scores performed on this particular item. Rather than decide a priori whether to accept or reject that item, it might be good to see how the item is functioning and also reflect on the test purpose and the social context of the test. Since the population of young people in Japan has been declining, many high school entrance exams hold less and less of a gatekeeping function: for many of the less prestigious schools, entrance exams now have two basic purposes: (1) income generation, and (2) demonstrating face validity to the public. In many contexts, the statistical reasons for accepting or rejecting a given test item matter less to decision makers than the impact that item might have on the overall face validity of the test and what it says about their institution.
* Actually, Section B, Item 4 of the ILTA Code of Practice addresses this issue. It states, "Malfunctioning or misfitting tasks or items should not be included in the calculation of individual test taker's scores." By that standard, the item should be cut.
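The score-inflation argument above can be made concrete with a toy calculation. The item weight and option counts below are hypothetical, chosen only to show how much a blind guesser gains from a double-keyed item:

```python
def expected_inflation(points: float, options_intended: int, options_correct: int) -> float:
    """Expected score gain for a blind guesser when a multiple choice item
    ends up with more than one keyed-correct option."""
    p_normal = 1 / options_intended               # e.g. 1/4 on a 4-option item
    p_flawed = options_correct / options_intended  # e.g. 2/4 with two correct keys
    return points * (p_flawed - p_normal)

# A 4-point item with two acceptable answers inflates a guesser's
# expected score by one point:
print(expected_inflation(4, 4, 2))  # 1.0
```

One point is trivial for examinees far from the cut-off, but for an applicant exactly at the boundary it can decide the outcome, which is why the borderline group deserves the closest scrutiny.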

Further reading:

International Language Testing Association. (2007). ILTA Draft Code of Practice. Retrieved September 1, 2007 from

Miner, B. (Winter 2004/2005). Testing Errors Plague Industry. Rethinking Schools. 19 (2).

3 Q: Any problem with the ranking system [proposed by Isao Kobayashi in which universities are rated on the basis of their entrance exam questions]?

   A: There are, in fact, a number of problems with this ranking system. First, the author does not explicitly state what the criteria for a good English entrance examination are: he is merely rating subjectively and in an opaque manner. Moreover, there appears to be a causality error in Kobayashi's rating scheme. No evidence is offered which conclusively links the quality of a university's English entrance exams with the quality of its English education. Indeed, the whole issue of how the quality of an English education should be rated is not covered with adequate precision.
* From a cultural perspective, what I find amazing is that a paper of this sort would even be published. The 2007 Nen Daigaku Rankingu [2007 University Ranking] by Asahi Shinbun is not a minor publication – it is a widely read Japanese almanac. For ideas about how to rank universities, it might be helpful to refer to the Berlin Principles on Ranking of Higher Education Institutions. Although the system of rating universities proposed by Shanghai Jiao Tong University is now widely cited, it is also problematic in many ways. Their system, for example, favors schools which are strong in science or technology or institutions with Nobel laureates. Though such schools might be strong in terms of technical research, that does not mean they are good in terms of overall education.

Further reading:

Jiao Tong University. (2007). Academic Ranking of World Universities. Retrieved September 1, 2007 from

UNESCO European Centre for Higher Education. (2006). Berlin Principles on Ranking of Higher Education Institutions. Retrieved September 3, 2007 from

4 Q: At a university in Japan a placement test was developed to stream science majors into different classes for an EFL reading program. Examinees read a 369-word passage about Alfred Nobel, then answered a series of true/false statements about the passage. One of the statements from that test was –

"Alfred Nobel lived his whole life in Sweden."

The response format was to circle either "true" or "false" on the answer sheet. Any problematic points concerning this question? Also, any issues with using T/F response formats for this sort of test?

   A: Two issues are involved in this question: (1) the use of absolute "all" or "none" statements in tests, and (2) the content validity of true-false test items. Each will be addressed separately.
* (1) Problems with absolute "all" or "none" statements - Essentially, T/F questions are two-option multiple choice tests. As such, each choice should be roughly as plausible to those who have not actually read the passage. Whereas hedge words such as "some of" or "most of" have a wide lexical range, statements such as "his whole life" have a very narrow lexical range. Even without reading the passage, readers could guess the statement is false. For such reasons, Kehoe (1995) recommends avoiding words such as "always," "never," and "all" (or phrases such as "his whole life") in multiple choice test items. Instead, use hedge words with a broader range or consider choices such as "over 40 years".
* (2) The content validity of true-false test items - The "true/false" mode of thinking is appropriate for some types of information. However, since a lot of scientific information is expanding and single text sources cannot provide sole proof of validity, I suggest that a "true / false / not mentioned" format is generally superior to the binary T/F format. To encourage critical thinking skills, it might be good to help readers widen their vision and realize that all of the information about a given subject is unlikely to exist in a single text passage. From a statistical point of view, a 3-choice "true/false/not mentioned" construct is also better than a 2-choice "true/false" construct because the chance of guessing is reduced. With well-constructed true/false questions, examinees have a 1 in 2 chance of guessing the correct answer. With the "true/false/not mentioned" framework, the chance of guessing correctly is 1 in 3. For such reasons, Davies et al. (1999) favor 3-to-5 option multiple choice questions over T/F questions.
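The guessing probabilities mentioned above are easy to verify with a short calculation; the 30-item test below is a hypothetical example:

```python
from fractions import Fraction

def blind_guess_expectation(n_items: int, n_choices: int) -> Fraction:
    """Expected number of correct answers from pure random guessing."""
    return Fraction(n_items, n_choices)

# On a hypothetical 30-item test, a blind guesser expects:
print(blind_guess_expectation(30, 2))  # 15  (true/false)
print(blind_guess_expectation(30, 3))  # 10  (true/false/not mentioned)
```

Adding the third option cuts the expected chance score by a third, which in turn raises the floor of scores that can be interpreted as evidence of actual reading ability.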

Further reading:

Kehoe, J. (1995). Writing multiple-choice test items. Practical Assessment, Research & Evaluation, 4 (9). Retrieved September 2, 2007 from

Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., & McNamara, T. (1999). True/False Item. In Studies in Language Testing, 7: Dictionary of language testing (p. 215). Cambridge, UK: Cambridge University Press.

5 Q: To compare the mean of a particular sub-group to the mean of a larger group that is within the same population, what test(s) should be performed?

   A: The answer depends on the type of test data you have as well as your sample size and the sub-groups you wish to compare.
* If you are dealing with parametric data (information in which a dependent variable is on an interval scale and has a normal distribution) and your sample size is under 30, a t-test might be called for. Garson (n.d.) offers a good concise summary of three major types of t-tests and the conditions for which they are appropriate. With sample sizes over 30, a z-test might be justified. And if you are among the growing number of people with a Rasch disposition, you might want to compare the overall item-person fit of the data.
* If you are dealing with non-parametric data, such as information from a Likert scale or information with an unknown distribution, then a Mann Whitney U test might be called for when comparing two groups. However, depending on the type of data you have and what your test purposes are, a Kruskal-Wallis Test or Friedman Test might actually be most appropriate.
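As one illustration of the parametric route, a Welch-style two-sample t statistic (one of the t-test variants surveyed by Garson, the one that does not assume equal variances) can be computed by hand with the standard library; the score lists below are invented for demonstration:

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate (Welch-Satterthwaite) degrees of
    freedom for two independent samples with possibly unequal variances."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)   # sample variances
    se2 = va / na + vb / nb                           # squared standard error
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical reading scores for a sub-group vs. the rest of the cohort:
group = [62, 58, 71, 64, 59, 66, 61, 68]
rest = [55, 60, 52, 58, 61, 54, 57, 59, 56, 60]
t, df = welch_t(group, rest)
print(round(t, 2), round(df, 1))
```

The resulting t value would then be checked against a t distribution with the estimated degrees of freedom; with real data a statistics package would report the p value directly.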

Further reading:

Garson, G. D. (n.d.). Student's t-Test of Difference of Means. Retrieved September 2, 2007 from

Cardone, R. (2005). Nonparametric: Distribution-Free, Not Assumption-Free. Retrieved September 2, 2007 from

Part II: Multiple Choice Questions

1 Q: Which of the following is not a common procedure to investigate test reliability?

   A: The correct answer is (B). There is no documented "equivalent scores method" of ascertaining test reliability. The "equivalent forms method", however, is widely used.
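As a reminder of what a legitimate reliability procedure looks like, a split-half estimate with the Spearman-Brown correction (a standard companion to the equivalent forms method) can be sketched as follows; the odd-item and even-item subtotals here are invented:

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def split_half_reliability(odd_scores, even_scores):
    """Spearman-Brown correction of the half-test correlation:
    r_full = 2r / (1 + r)."""
    r = pearson(odd_scores, even_scores)
    return 2 * r / (1 + r)

# Hypothetical odd-item and even-item subtotals for ten examinees:
odd = [12, 9, 15, 11, 8, 14, 10, 13, 9, 12]
even = [11, 10, 14, 12, 9, 13, 11, 12, 8, 13]
print(round(split_half_reliability(odd, even), 2))  # 0.94
```

The correction step is needed because the raw correlation describes a half-length test, and shorter tests are inherently less reliable.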

Further reading:

Garson, G. D. (1998, 2007). Reliability Analysis. Retrieved September 2, 2007 from

Mousavi, S.A. (2002). Reliability. In An Encyclopedic Dictionary of Language Testing. (3rd Ed.). (pp. 580-585). Taipei: Tung Hua Book Company.

2 Q: When reporting test scores to students, which of the following is/are unethical?

   A: Option (B) would be unethical because it involves a violation of privacy. Refer to Principle 2 of the ILTA Code of Ethics for details about privacy.

Further reading:

International Language Testing Association. (2000). ILTA Code of Ethics. Retrieved September 2, 2007 from

3 Q: To determine an EFL student's progress toward mastery of a classroom content area, ____ should be used.

   A: Option (C) is probably the best answer. It seems many university teachers in Japan choose Option (A). This is a valid choice for courses specifically about translation.

Further reading:

Mousavi, S.A. (2002). Formative test. In An Encyclopedic Dictionary of Language Testing. (3rd Ed.). (p. 262). Taipei: Tung Hua Book Company.

4 Q: What does this statistical symbol denote: R²?

   A: Unfortunately, there's a bit of inconsistency with respect to the use of this term. In Abdi (2007) and Easton and McColl (1997) it stands for Option (D) - the multiple correlation coefficient. However, according to Wikipedia and the Oxford High Beam Encyclopedia, it stands for Option (C) - the coefficient of determination. Rather than fret about the symbol, it is more important to understand the concept behind it. J. D. Brown (2003) offers a good description of the coefficient of determination in this publication. If you have the stomach for mathematics, you might enjoy how the Encyclopedia of Mathematics defines the multiple correlation coefficient. However, most readers will probably prefer the Wikipedia explanation of this concept.
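For readers who want to see the coefficient of determination in action, a minimal sketch of the usual formula, 1 − SS_res / SS_tot, follows; the observed and predicted scores are invented:

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.
    Measures the proportion of variance in the observed scores
    accounted for by the model's predictions."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Hypothetical observed scores and predictions from a regression model:
obs = [50, 60, 70, 80, 90]
pred = [52, 58, 71, 79, 88]
print(round(r_squared(obs, pred), 3))  # 0.986
```

A value of .986 would mean about 98.6% of the score variance is accounted for by the model; perfect predictions give exactly 1.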

Further reading:

Abdi, H. (2007). Multiple Correlation Coefficient. In N.J. Salkind (Ed.): Encyclopedia of Measurement and Statistics. (pp. 648-651). Thousand Oaks, California: Sage.

Brown, J. D. (2003). The coefficient of determination. Shiken: JALT Testing & Evaluation SIG Newsletter 7 (1) 15 - 17. Retrieved September 2, 2007, from

Coefficient of determination. (2007, August 25). In Wikipedia, The Free Encyclopedia. Retrieved September 2, 2007, from

Colman, A. M. (2001). Multiple correlation coefficient R. A Dictionary of Psychology. Oxford, UK: Oxford University Press. Cited in High Beam Encyclopedia under "Multiple correlation coefficient". Retrieved September 2, 2007, from

Multiple correlation coefficient. (1997). In V. J. Easton and J. H. McColl (Eds.) STEPS Statistics Glossary. Retrieved September 2, 2007, from

Prokhorov, A.V. (2002) Multiple-correlation coefficient. In M. Hazewinkel, (Ed.). Encyclopedia of Mathematics. [Online Edition]. Berlin, Heidelberg, New York: Springer-Verlag. Retrieved September 2, 2007, from
