Shiken: JALT Testing & Evaluation SIG Newsletter

Vol. 13 No. 2. May 2009 (p. 28 - 34) [ISSN 1881-5537]


Suggested Answers for Assessment Literacy Self-Study Quiz #7

by Tim Newfields


Here are some suggested answers for the questions about testing, statistics, and assessment from the
May 2009 issue of SHIKEN.
If any answer seems unclear or you have further questions, contact the author at newfields55 at yahoo dot com.

Even without a time-consuming validation study, however, we can speculate about what skills this test item might utilize. In addition to English reading and paragraph-level syntactical skills, it's quite possible that this test taps into psychological knowledge. Those interested in psychological matters are likely to do better on this item than those who aren't.

Is this item appropriate for 18-year-old EFL students trying to enter university? That's hard to say conclusively, but I suspect it might be better suited to graduate-level psychology students. This subject matter might be beyond the reach of most high school students, even in their native languages. A topic closer to their realm of experience would be better.

Another issue to explore is whether this item has any gender bias. It is quite likely that women will outscore men on this item to a significant degree, but solid data is needed to say for sure.

Further reading:
Kunnan, A. J. (Ed.) (1998). Validation in Language Assessment. Mahwah, NJ: Lawrence Erlbaum Associates.
Westen, D. & Rosenthal, R. (2003). Quantifying construct validity: Two simple measures. Journal of Personality and Social Psychology, 84 (3) 608-618. Retrieved April 10, 2009 from http://www.psychsystems.net/lab/Quant_Const.pdf


Further reading:
Baghaei, P. (2008). Local dependency and Rasch measures. Rasch Measurement Transactions, 21 (3) 1105-6. Retrieved April 11, 2009 from http://www.rasch.org/rmt/rmt213b.htm
Beguin, A. A. (2000). Robustness of equating high-stakes tests. Retrieved April 11, 2009 from http://www.cito.nl/share/poc/dissertaties/dissertationbeguin2000.pdf
Brannick, M. T. (n.d.) Item Response Theory. Retrieved April 11, 2009 from http://luna.cas.usf.edu/~mbrannic/files/pmet/irt.htm

The fact that the mean for both groups is below zero merits further investigation. Without knowing more about the data or doing any number crunching, it's hard to interpret this.

The trait measured by this scale doesn't vary widely between the groups. A quick glance reveals that the similarities between the two groups far outweigh their differences. As a consequence, this chart illustrates the sort of pattern researchers hope for when exploring the impact of a construct-irrelevant factor on a research design, such as how well males and females performed on a given foreign language test.

Other programs such as Aabel, fBasics, KaleidaGraph, Mathematica, MatLab, Stata, and S-PLUS are also reputed to handle box-percentile plots, but I have not tried them. Two good sources of information about statistical programs are Wikipedia's List of statistical packages (http://en.wikipedia.org/wiki/Statistical_software) and the list by Pezzullo of free products (http://www.statpages.org/javasta2.html).


Further reading:
Brown, J. D. Skewness and kurtosis. Shiken: JALT Testing & Evaluation SIG Newsletter, 1 (1) 20-23. Retrieved April 13, 2009 from http://jalt.org/test/PDF/Brown1.pdf
Esty, W. W. & Banfield, J. D. (2003, October). The Box-Percentile Plot. Journal of Statistical Software, 8 (17). Retrieved April 13, 2009 from http://www.jstatsoft.org/v08/i17/paper
Pezzullo, J. C. (2009). Free Statistical Software. Retrieved April 13, 2009 from http://www.statpages.org/javasta2.html
Wikipedia. (2009). List of statistical packages. Retrieved April 13, 2009 from http://en.wikipedia.org/wiki/Statistical_software

- The index is easily calculated and can be applied to any group of researchers in any field.
- By combining two metrics, it is superior to a mere list of publications.
- Small data collection errors are reputed to have little (if any) impact on *h*-values.
- Minor publications as well as single top publications are ignored.
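To make the points in the list above concrete, here is a minimal sketch of how an h-value can be computed, assuming the standard definition (the largest number h such that h publications have at least h citations each). The function name `h_index` and the citation counts are invented for illustration and do not come from the quiz.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the paper at this rank still has enough citations
        else:
            break         # every later paper has even fewer citations
    return h


# Hypothetical citation counts, invented for illustration only.
steady_author = [10, 8, 5, 4, 3]
one_hit_author = [100, 1, 0]

print(h_index(steady_author))   # 4
print(h_index(one_hit_author))  # 1
```

The second author shows why a single top publication is ignored: one paper with 100 citations cannot lift the h-value above 1 on its own.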

Because of these disadvantages, a number of alternative indices have been proposed. The


Further reading:
Egghe, L. (2006). An improvement to the h-index: The g-index. ISSI Newsletter 2 (1) 8-9. Retrieved April 14, 2009 from http://stat-athens.aueb.gr/~jpan/Egghe-ISSI-2006.pdf
Jin, B-H., Liang, L., Rousseau, R. and Egghe, L. (2007). The R- and AR- indices: Complementing the h-index. Chinese Science Bulletin, 52, 855-863. Retrieved April 14, 2009 from http://dx.doi.org/10.1007/s11434-007-0145-9
Kosmulski, M. (2007). MAXPROD - A new index for assessment of the scientific output of an individual, and a comparison with the h-index. International Journal of Scientometrics, Informetrics and Bibliometrics, 11 (1). Paper 5. Retrieved April 14, 2009 from http://cybermetrics.cindoc.csic.es/articles/v11i1p5.pdf
Panaretos, J. & Malesios, C. (2009, January 18). Assessing scientific research performance and impact with single indices. MPRA Paper No. 12842. Retrieved April 14, 2009 from http://mpra.ub.uni-muenchen.de/12842/
Rousseau, R. (2008, June). Reflections on recent developments of the h-index and h-type indices. In H. Kretschmer & F. Havemann (Eds.). Proceedings of WIS 2008, Berlin. Retrieved April 14, 2009 from http://www.tarupublications.com/journals/cjsim/7-Rousseau.pdf

Option (B) describes the opposite of what actually happens: as a sample becomes smaller, chances increase that the distribution will not approximate a normal curve. Conversely as the

Option (C) is incorrect since the sum of probabilities approaches 1 rather than zero.

The remaining option is generally true. For this reason Poisson modeling is used to calculate phenomena such as how frequently a word will likely appear or how often a particular type of error occurs.
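As a rough illustration of the kind of calculation mentioned above, the following sketch evaluates the standard Poisson probability mass function, P(k) = λ^k e^(-λ) / k!. The rate of 2 occurrences per text sample is an invented figure, and `poisson_pmf` is simply a name chosen for this example.

```python
from math import exp, factorial


def poisson_pmf(k, rate):
    """Probability of observing exactly k events when the average count is `rate`."""
    return (rate ** k) * exp(-rate) / factorial(k)


# Hypothetical assumption: a word appears on average 2 times per text sample.
rate = 2.0

for k in range(5):
    print(k, round(poisson_pmf(k, rate), 4))

# Summed over enough values of k, the probabilities approach 1.
total = sum(poisson_pmf(k, rate) for k in range(50))
print(round(total, 6))
```

The final line also bears on Option (C): the probabilities sum toward 1, not toward zero.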

Further reading:
Baytekin, O. (2002). A χ² analysis of the Poisson approximation to binomial distribution. Marmara University Journal of Pure and Applied Sciences, 18, 33-36. Retrieved April 15, 2009 from http://fbe.marmara.edu.tr/dergi/pdf/inga02004.pdf
Di Raimondo, T. et al. (2007). Discrete distributions: hypergeometric, binomial, and poisson. Retrieved April 15, 2009 from http://controls.engin.umich.edu/wiki/index.php/Discrete_Distributions:_hypergeometric,_ binomial, and_poisson
El Sherbiny, M. M. (2007, November 4). Discrete Probability Distributions. Retrieved April 15, 2009 from http://faculty.ksu.edu.sa/73212/Publications/Discrete%20Probability%20Distributions.ppt
Nandamurar, K. (n.d.). Poisson Distribution. Retrieved April 15, 2009 from http://www.cse.msu.edu/~nandakum/nrg/Tms/Probability/poisson.htm
West Virginia University Department of Statistics. (2006). The Poisson distribution. Retrieved April 15, 2009 from http://ideal.stat.wvu.edu:8080/ideal/resource/modules/1/Poisson/poisson.html


Since the plot curve in SSQ7c.gif is smooth, at first glance it does not appear to be a discrete frequency polygon, although discrete frequency polygons that have been statistically rounded off can indeed resemble this shape. Unrounded discrete frequency polygons are usually more angular. Option (B) is hence unlikely.

In a Q-Q plot the quantiles of two sets of data are compared with each other. Q-Q plots can exist in many formats. One common form is when theoretical data is contrasted with observed data. Another type is when two samples are juxtaposed in the same chart. Usually two distinct distributions are evident in Q-Q plots, but if both distributions are nearly in sync, one distribution might not be visible. So Option (D) seems unlikely, though not impossible.
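The two-sample comparison described above can be sketched numerically, without any plotting, by pairing up matching quantiles from each data set. This is only an illustrative sketch: `qq_pairs` is a hypothetical helper, the scores are invented, and `statistics.quantiles` requires Python 3.8 or later.

```python
from statistics import quantiles


def qq_pairs(sample_a, sample_b, n=4):
    """Pair the n-1 cut points of sample_a with the matching cut points of sample_b."""
    cuts_a = quantiles(sample_a, n=n)
    cuts_b = quantiles(sample_b, n=n)
    return list(zip(cuts_a, cuts_b))


# Hypothetical test scores, invented for illustration only.
scores_a = [52, 55, 60, 61, 64, 68, 70, 73, 77, 81]
scores_b = [score + 5 for score in scores_a]  # same shape, shifted upward by 5

for cut_a, cut_b in qq_pairs(scores_a, scores_b):
    print(cut_a, cut_b)
```

If these pairs were plotted, two samples with the same distribution would fall on the line y = x, and a constant shift like the one here would produce a parallel line.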

Further reading:
Cramster, Inc. (2009). Q-Q plot. Retrieved April 17, 2009 from http://www.cramster.com/ reference/wiki.aspx?wiki_name=Q-Q_plot
Simon K. (n.d.) Pareto Chart. Retrieved April 17, 2009 from http://www.gate2quality.com/quality-tools_2.html
United States Department of Commerce Information Technology Laboratory: Statistical Engineering Division. (2006, July 16). NIST/SEMATECH e-Handbook of Statistical Methods: 1.3.3.24. Quantile-Quantile Plot. Retrieved April 19, 2009 from http://www.itl.nist.gov/div898/handbook/eda/section3/qqplot.htm

The first three options are widely recognized ways to make tests more reliable. Option (D), however, is unlikely to impact test reliability, at least on the first administration. Arguably, using multiple forms could reduce the likelihood of cheating and hence enhance reliability. Although Option (D) is the least likely way of improving test reliability in the short term, even this option could help make subsequent revisions of the test more reliable as more information about which items function well becomes available.

Further reading:
Jacobs, L. C. (1991). Test Reliability. Retrieved April 22, 2009 from http://www.indiana.edu/~best/test_reliability.shtml
Winsteps. (2009). Winsteps Help for Rasch Analysis: Reliability and separation of measures. Retrieved April 22, 2009 from http://www.winsteps.com/winman/index.htm?reliability.htm


Actually, both the KR-20 and KR-21 measures have significant limitations. For this reason, more and more researchers are adopting IRT- and Rasch-based models of reliability. However, partly because of ease of calculation, and perhaps because of tradition, the KR-21 formula is not likely to disappear entirely any time soon. In fact, to cover both bases, it is not uncommon to report classical statistics along with IRT and/or Rasch measures of reliability (generally the "information function" for IRT or the "separation index" for Rasch).

Concerning robustness, both the KR-20 and KR-21 are reported to be robust even when the assumption of a single underlying trait is violated (Iacobucci and Duhachek, 2003), so Option (A) is false.

Option (B) states the inverse of what actually happens. As Bodner (1980, par. 4) states, the KR-21 "severely underestimates the reliability of an exam unless all questions have approximately the same level of difficulty." In other words, in many situations the KR-21 yields lower reliability estimates than the KR-20.

Option (C) is false since both these tests require just one administration.
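Bodner's point about the KR-21 underestimating reliability can be checked with a small sketch of the two classical formulas. It assumes dichotomously scored (0/1) items and uses the population variance of total scores; the response matrix is invented for illustration, and the function names are simply labels chosen here.

```python
from statistics import pvariance


def kr20(responses):
    """KR-20 for a list of examinee rows, each a list of 0/1 item scores."""
    n_examinees = len(responses)
    k = len(responses[0])
    var_total = pvariance([sum(row) for row in responses])
    sum_pq = 0.0
    for item in range(k):
        p = sum(row[item] for row in responses) / n_examinees  # item facility
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


def kr21(responses):
    """KR-21, which treats every item as having the same difficulty."""
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / len(totals)
    var_total = pvariance(totals)
    return (k / (k - 1)) * (1 - mean_total * (k - mean_total) / (k * var_total))


# Hypothetical data: 6 examinees x 5 items of clearly unequal difficulty.
responses = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

print(round(kr20(responses), 4))  # 0.8333
print(round(kr21(responses), 4))  # 0.7143
```

Both functions use data from a single administration, and because the items here differ in difficulty, the KR-21 estimate comes out lower than the KR-20 estimate.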

Further reading:
Andrich, D. (1982). An Index of Person Separation in Latent Trait Theory, the Traditional KR-20 Index, and the Guttman Scale Response Pattern. Education Research and Perspectives, 9 (1) 95-104. Retrieved April 24, 2009 from http://www.rasch.org/erp7.htm
Bodner, G. (1980). Statistical Analysis of Multiple Choice Exams: Coefficients of Reliability. Journal of Chemical Education, 57, 188-190. Retrieved April 24, 2009 from http://chemed.chem.purdue.edu/chemed/stats.html
Brown, J. D. (2002). Do cloze tests work? Or, it is just an Illusion? University of Hawaii Working Papers in Second Language Studies, 21 (1). Retrieved April 26, 2009 from http://www.hawaii.edu/sls/uhwpesl/21(1)/BrownCloze.pdf
Halle, C. D. (2009). Active Teaching, Learning, and Assessment: Unit 4: Validity and Reliability. Retrieved April 24, 2009 from charlesdennishale.com/books/eets_ap/ 3_Psychometrics_Reliatility_Validity_Sampling.pdf
Iacobucci, D. & Duhachek, A. (2003). Advancing alpha: Measuring reliability with confidence. Journal of Consumer Psychology, 13 (4), 478-487. Retrieved April 26, 2009 from http://mba.vanderbilt.edu/vanderbilt/data/research/2190full.pdf
Linacre, J. M. (1997). KR-20 or Rasch Reliability: Which Tells the "Truth"? Rasch Measurement Transactions, 11 (3) 580-581. Retrieved April 26, 2009 from http://www.rasch.org/rmt/rmt113l.htm


Let us assume, for the moment, that a decision must be made with the limited information provided. Brown (2003) suggests that .50 is the ideal figure for norm-referenced tests, regardless of the number of distractors. Other authors, however, have suggested that the number of answer choices should be taken into account when determining optimal item difficulty. Lord (1977) offers several different formulas to calculate the optimal difficulty level. In one relatively simple procedure, a perfect score is divided by the number of answer options to ascertain the random guessing level. This is then subtracted from a perfect score and the difference is divided by 2. When that result is added to the random guessing level, the optimal difficulty level is obtained. Let's consider how this would work with a 4-option MC question. The random guessing level is 1.00/4 = .25 and hence the optimal difficulty level would be .25 + (1.00 - .25) / 2 = .625. In the second example cited, the random guessing level is 1.00/2 = .50 and hence the optimal difficulty level would be .50 + (1.00 - .50) / 2 = .75. What is important to remember is that there is no single "correct" answer regarding difficulty level since several formulas exist.
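The arithmetic in the paragraph above can be written as a short function. This is only a sketch of the one procedure described here, not of the other formulas mentioned; `optimal_difficulty` is a name chosen for illustration, and a perfect score is taken to be 1.00.

```python
def optimal_difficulty(num_options):
    """Midpoint between the random guessing level and a perfect score of 1.00."""
    guessing_level = 1.0 / num_options
    return guessing_level + (1.0 - guessing_level) / 2


print(optimal_difficulty(4))  # 0.625
print(optimal_difficulty(2))  # 0.75
```

The fewer the answer options, the higher the guessing level, and so the higher the difficulty value this procedure treats as optimal.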

If we reflect on the social consequences of test washback, this whole line of thought breaks down. The traditional approach of favoring only items with high ID values might demotivate some learners. For this reason, many teachers try to include a sampling of easy and difficult items to give weaker students some sense of achievement, but not so many that the test itself loses all discriminating value.

Further reading:
Brown, J. D. (2003). Norm-referenced item analysis (item facility and item discrimination). Shiken: JALT Testing & Evaluation SIG Newsletter, 7 (2) 16-19. Retrieved April 16, 2009 from http://jalt.org/test/PDF/Brown17.pdf
Lord, F. M. (1977). Optimal number of choices per item: A comparison of four approaches. Journal of Educational Measurement, 14 (1), 33-38.
The University of Texas at Austin Division of Instructional Innovation and Assessment. (2007, July 16). Analyzing Multiple-Choice Item Responses. Retrieved April 16, 2009 from http://www.utexas.edu/academic/mec/scan/analysis.html
Whatley, M. A. (2007). Item Analysis Worksheet. Retrieved April 16, 2009 from http://chiron.valdosta.edu/ mawhatley/3900/itemanalysis.pdf


HTML: http://jalt.org/test/SSA7.htm / PDF: http://jalt.org/test/PDF/SSA7.pdf
