The Interface Between Interlanguage, Pragmatics and Assessment: Proceedings of the 3rd Annual JALT Pan-SIG Conference.
May 22-23, 2004. Tokyo, Japan: Tokyo Keizai University.

A comparison of holistic and analytic scoring methods in the assessment of writing

by Yuji Nakamura (Tokyo Keizai University)



Abstract


This paper examines the strengths and weaknesses of holistic and analytic scoring methods, using Weigle's adaptation of Bachman and Palmer's framework and its six qualities of test usefulness, and explores how holistic or analytic scales can be used to better assess student compositions.

Keywords: holistic scoring, analytic scoring, writing assessment, Bachman and Palmer framework, test usefulness



Theoretical background and rationale

In the assessment of writing, a major advantage of holistic over analytic scoring is that each writing sample can be evaluated quickly by more than one rater for the same cost that would be required for just one rater to score it against several analytic criteria (cf. Davies et al., 1999). One possible disadvantage of holistic judgment is that different raters may choose to focus on different aspects of the written product. On the other hand, an advantage of analytic scoring is that raters are required to attend to each of the assigned aspects of a writing sample, so that they all evaluate the same features of a student's performance. The practical disadvantage of analytic scoring, as Davies et al. (1999) indicate, is that it is more time-consuming than holistic scoring. The choice of scoring method is therefore not always easy.
". . . A test of writing used for research purposes should have reliability and construct validity as central concerns, and practicality and impact issues should be of lesser significance."

The Bachman and Palmer (1996) framework of test usefulness can help teachers decide which type of test to use. This framework proposes six qualities of test usefulness: Reliability, Construct Validity, Authenticity, Interactiveness, Impact, and Practicality. Bachman and Palmer suggest that test developers strike an appropriate balance among these qualities by setting minimum acceptable standards for each.
Weigle (2002) comments on the Bachman and Palmer (1996) framework by comparing holistic and analytic scales in terms of the same six qualities of test usefulness, as follows:



Table 1. A comparison of holistic and analytic scales in terms of six qualities of test usefulness (adapted from Weigle, 2002, p. 121).

Reliability
  • Holistic scales: lower than analytic, but still acceptable
  • Analytic scales: higher than holistic
Construct Validity
  • Holistic scales: assume that all relevant aspects of writing ability develop at the same rate and can thus be captured in a single score; holistic scores correlate with superficial aspects such as length and handwriting
  • Analytic scales: more appropriate for L2 writers, since different aspects of writing ability develop at different rates
Practicality
  • Holistic scales: relatively fast and easy
  • Analytic scales: time-consuming; expensive
Impact
  • Holistic scales: a single score may mask an uneven writing profile and may lead to misleading placements
  • Analytic scales: multiple scores provide useful diagnostic information for placement and/or instruction; more useful for rater training
Authenticity
  • Holistic scales: White (1995) argues that reading holistically is a more natural process than reading analytically
  • Analytic scales: raters may read holistically and adjust analytic scores to match holistic impressions
Interactiveness
  • Holistic scales: n/a
  • Analytic scales: n/a


Note that Interactiveness, as defined by Bachman and Palmer, relates to the interaction between the test taker and the test. This interaction may be influenced by the rating scale if the test taker knows how his or her writing will be evaluated; this is an empirical question that should be investigated directly in further research.
The present research focuses on Weigle's adaptation of Bachman and Palmer's framework because it applies the six qualities directly to writing evaluation scales.

Purpose of the research

Since not all teachers (native and non-native speakers alike) are skilled at rating compositions, understanding the details of holistic and analytic evaluation systems can benefit teachers both in their classroom assessment and in their training sessions. A clear-cut rating scale with detailed criteria can also lead to positive washback, giving students clear study goals. The purpose of the present research is therefore to serve as a starting point for rater training development and to help students set clearer study goals. This paper examines two scoring methods (holistic and analytic) by looking at their respective strengths and weaknesses using Weigle's adaptation of Bachman and Palmer's framework, and explores how holistic or analytic scales can better be used to assess student compositions.


Research design and method

Ninety students took a composition test in class (30 students per class), and their scripts were evaluated by three raters both holistically (using a single overall evaluation item) and analytically (using five rating items chosen by the author: grammar, vocabulary, organization, originality, and cohesion). The author referred to Cohen (1994) in deciding on the evaluation items and in drafting criteria for the four scale points (1, 2, 3, 4).
The two types of scoring were conducted on two different days (with a three-week interval) by the same raters. The scripts were evaluated by three raters on a four-point scale (1 = "poor" to 4 = "good"). The data were analyzed with a many-facet Rasch (FACETS) model so that the three facets (students, raters, and evaluation items) could be displayed on the same continuum.
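As background, the many-facet Rasch model underlying the FACETS analysis can be written in its usual rating-scale form (this is the standard formulation, following Linacre, and is shown here only as a reference sketch, not as output of the present study):

\[
  \log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
\]

where B_n is the writing ability of student n, D_i the difficulty of evaluation item i, C_j the severity of rater j, and F_k the threshold of step k on the four-point scale.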
As a rule of thumb, the acceptable range for infit and outfit statistics in this performance test was 0.6-1.4. (In the Rasch model, items below 0.6 fall into a category called "overfitting," items above 1.4 are called "underfitting," and items outside the acceptable range in general are called "misfitting.")
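For completeness, the infit and outfit statistics referred to here are mean-square residual statistics. In the standard Rasch notation (again a reference sketch, not study output), with observed rating x, expected rating E, and model variance W for each observation,

\[
  z = \frac{x - E}{\sqrt{W}}, \qquad
  \text{Outfit MnSq} = \frac{1}{N}\sum z^{2}, \qquad
  \text{Infit MnSq} = \frac{\sum W z^{2}}{\sum W}.
\]

Values near 1 indicate that ratings behave as the model expects; values below 0.6 flag overfit (ratings more predictable than modeled), and values above 1.4 flag underfit (erratic ratings).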
The students were all first-year university students aged between 18 and 20, and approximately half were male. Their English proficiency could be loosely described as intermediate, and the test they took counted toward their grade.
The raters were trained using ten sample compositions collected from students at the same university, so that they gained a rough idea of the students' proficiency level and a sense of how the criteria should be applied. The composition topic asked students to discuss the proposal to have a five-day school week. Details of the rating procedure are summarized below in Tables 2 and 3.

Table 2. Features of the holistic rating scale used in this study.
Raters: Three trained native speakers of English.
Items: One evaluation item (Overall)
Rating scale: A four-point scale (1, 2, 3, 4) was used with the following criteria:
4 points
  • the main idea was clearly stated
  • the essay was well organized
  • the choice of words was good
  • very few minor grammatical errors
3 points
  • the main idea was fairly clear
  • the essay was moderately well organized
  • the vocabulary was good
  • some minor grammatical errors
2 points
  • the main idea was indicated, but not clearly
  • the essay was not so well organized
  • the vocabulary choice was fair
  • some major grammatical errors
1 point
  • the main idea was hard to identify
  • the essay was poorly organized
  • the vocabulary was weak
  • many grammatical errors

Table 3. Features of the analytic rating scale used in this study.
Raters: Three trained native speakers of English.
Items: Five criteria were rated: (1) Originality of Content, (2) Organization, (3) Vocabulary, (4) Grammar, (5) Cohesion & Logical Consistency
Rating scale: A four-point scale (1, 2, 3, 4) was used with the following criteria:
Originality of Content
  • 4 points: interesting ideas were stated clearly
  • 3 points: interesting ideas were stated fairly clearly
  • 2 points: ideas somewhat unclear
  • 1 point: ideas not clear
Organization
  • 4 points: well organized
  • 3 points: fairly well organized
  • 2 points: loosely organized
  • 1 point: ideas disconnected
Vocabulary
  • 4 points: very effective choice of words
  • 3 points: effective choice of words
  • 2 points: fairly good vocabulary
  • 1 point: limited range of vocabulary
Grammar
  • 4 points: almost no errors
  • 3 points: few minor errors
  • 2 points: some errors
  • 1 point: many errors
Cohesion & Logical Consistency
  • 4 points: sentences logically combined
  • 3 points: sentences fairly logically combined
  • 2 points: sentences poorly combined
  • 1 point: many unfinished sentences


Results and discussion

Reliability issues

First let us consider the issue of reliability. Tables 4 and 5 summarize the pertinent reliability data for both tests.

Table 4. Raters' holistic assessments.
Rater  Obsvd Score  Obsvd Count  Obsvd Average  Fair-M Average  Measure  Model S.E.  Infit MnSq  Infit ZStd  Outfit MnSq  Outfit ZStd
A      169          78           2.2            2.12            .03      .26         1.15        1.0         1.08         .4
B      178          78           2.3            2.14            .56      .28         .61         -2.6        .47          -2.6
C      165          78           2.1            2.07            -.60     .27         1.00        .0          .93          -.2
Table 5. Raters' analytic assessments.
Rater  Obsvd Score  Obsvd Count  Obsvd Average  Fair-M Average  Measure  Model S.E.  Infit MnSq  Infit ZStd  Outfit MnSq  Outfit ZStd
A      945          440          2.1            2.08            .07      .08         1.14        2.0         1.16         2.0
B      957          440          2.2            2.11            .10      .09         .76         -3.8        .74          -3.9
C      931          440          2.1            2.09            -.18     .09         1.06        .9          1.08         1.1


Among the three raters shown in the Infit and Outfit columns of Table 4, only Rater B fell below the acceptable range of 0.6-1.4 (Outfit MnSq = .47). In other words, Rater B was misfitting (to be precise, overfitting).
As mentioned above, the rule of thumb for acceptable infit and outfit statistics in a performance test is 0.6-1.4 (cf. Nakamura, 2003). Items below 0.6 fall into the category called "overfitting," and those above 1.4 are labeled "underfitting."
Table 5 demonstrates that there were no misfitting raters when the analytic scale was used: as the Infit and Outfit statistics columns show, all three raters were within the acceptable range (0.6-1.4) and functioned well with this test format.
The Infit and Outfit MnSq statistics mainly reveal the raters' consistency in rating, and it is clear that the analytic scale is the more reliable of the two. In this regard, the analytic scale is preferable to the holistic scale. It can also be said that, under the Rasch model adjustment, raters function more consistently relative to one another when there are multiple rating items.

Construct validity issues

Now let us consider the issue of construct validity, which should be addressed when evaluating any test. Tables 6 and 7 compare the construct validity of the two scales.


Table 6. Analytic scale item measurement report.
Item          Obsvd Score  Obsvd Count  Obsvd Average  Fair-M Average  Measure  Model S.E.  Infit MnSq  Infit ZStd  Outfit MnSq  Outfit ZStd
Organization  577          264          2.2            2.13            .07      .11         .82         -2.1        .80          -2.3
Cohesion      493          264          1.9            1.79            -.86     .11         1.09        1.0         1.15         1.3
Vocabulary    607          264          2.3            2.24            .48      .11         .93         -.8         .89          -1.1
Grammar       571          264          2.2            2.10            .15      .10         1.15        1.7         1.15         1.5
Originality   585          264          2.2            2.18            .16      .11         .95         -.5         .98          -.1
Table 7. Holistic scale item measurement report.
Item     Obsvd Score  Obsvd Count  Obsvd Average  Fair-M Average  Measure  Model S.E.  Infit MnSq  Infit ZStd  Outfit MnSq  Outfit ZStd
Overall  512          234          2.2            2.11            .00      .16         .94         -.6         .83          -1.3


According to the Infit and Outfit statistics columns of Table 6, all five items in the analytic scale functioned well within the acceptable range of 0.6-1.4. The Infit and Outfit statistics in Table 7 show that the holistic scale also functioned well.
In terms of the difficulty of the five items in the rating scale, Cohesion & Logical Consistency was the most difficult and Vocabulary was the easiest.
Validity basically asks, "Does the test measure what it is supposed to measure?" Examining whether a test construct is valid is not an easy job: test methods, test tasks, and even marking schemes can all form part of the test construct.
The present discussion of the construct validity of the rating scales will focus specifically on the evaluation items. The holistic scale has one general factor, whereas the analytic scale has at least five component factors (cf. Weigle, 2002). To improve both the construct validity and the reliability of a test, analytic scales with multiple evaluation items are preferable.
The more items a test has, the higher its reliability (from the viewpoint of internal consistency). However, practicality and issues of test fatigue must also be considered.
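The familiar Spearman-Brown relationship makes this point concrete: if a one-item scale has reliability ρ₁, a scale lengthened to k comparable items is predicted to have reliability

\[
  \rho_{k} = \frac{k\,\rho_{1}}{1 + (k-1)\,\rho_{1}},
\]

so, for example, a single rating with reliability .50 would be expected to reach about .83 when five comparable items are combined. This is a general psychometric prediction, not a result computed from the present data.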
When we talk about the construct of a writing test, we should ask what its components are. One way to do this is to start from the nature of writing as a theoretical construct and to consult experienced teachers and readings in the field about methods of instruction. In this way we can collect a wide variety of data sources that contribute to the construct of writing, even if we eventually narrow these down to only one or two factors through principal component factor analysis.
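As a minimal sketch of the dimensionality check mentioned above, the following Python fragment runs a principal component analysis on five analytic item scores. The data are simulated and the variable names are hypothetical; this illustrates the procedure, not the analysis reported in this paper.

    # A minimal sketch with simulated data: how many components underlie the five
    # analytic rating items (originality, organization, vocabulary, grammar, cohesion)?
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    ability = rng.normal(size=(90, 1))                         # one shared "writing ability" factor
    noise = rng.normal(scale=0.5, size=(90, 5))                # item-specific variation
    ratings = np.clip(np.round(2.5 + ability + noise), 1, 4)   # 90 scripts x 5 items on a 1-4 scale

    pca = PCA()
    pca.fit(ratings)
    print(pca.explained_variance_ratio_)  # a dominant first component suggests one general factor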
Let us now take a look at a comparison of students' writing ability according to the two rating scales. This comparison is presented graphically in Figure 1.


Figure 1. A comparison of students' writing scores according to holistic and analytic rating scales.
Note: The X axis represents holistic scale scores (Holi) and the Y axis analytic scale scores (Ana).
Table 8, together with the graphical display in Figure 1, shows that the correlation coefficient between the students' scores on the two scales is .95 (shared variance over 90%), which is extremely high and is confirmed by the plot. It may be that holistic scale results can predict analytic scale results with a high degree of accuracy, and vice versa. In other words, a holistic scale can be an alternative to an analytic scale under one very important condition: that the test is construct valid. It is therefore important for test designers to consider the construct validity of a test as well as its reliability.
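The arithmetic behind this claim can be sketched with a short calculation of the kind below. The score vectors here are invented placeholders, not the study's data; only the form of the computation is being illustrated.

    # A minimal sketch with invented scores: Pearson correlation between holistic
    # totals and analytic totals, and the shared variance it implies.
    import numpy as np

    holistic = np.array([2, 3, 2, 4, 1, 3, 2, 3, 4, 2])           # one overall rating per script
    analytic = np.array([11, 14, 9, 18, 6, 15, 10, 13, 19, 8])    # sum of the five analytic items

    r = np.corrcoef(holistic, analytic)[0, 1]
    print(f"r = {r:.2f}, shared variance = {r * r:.0%}")
    # In the study, r = .95, so the shared variance is about 90%.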
The plot in Figure 1 indicates that the only difference between the two scales is one of precision. When students are rated by more than one rater, the rating precision is higher. In other words, the more ratings a person receives, the higher the precision will be.
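Roughly speaking, and assuming independent ratings of comparable precision, the standard error of a student's combined measure shrinks with the square root of the number of ratings:

\[
  SE_{k} \approx \frac{SE_{1}}{\sqrt{k}},
\]

which is why pooling three raters, or five analytic items, yields noticeably tighter person measures than a single judgment.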

Practicality issues

As we know from our teaching and rating experience (cf. Weigle, 2002), holistic rating is a relatively fast and easy way to assess students, whereas analytic rating is time-consuming and expensive. From the viewpoint of practicality, holistic scales are recommended.

Impact issues

When we consider the washback effect on instruction, placement, diagnosis for students, and rater training, a single holistic score is less informative than a set of analytic scores. In this sense, analytic scales are often better.
With respect to the washback effect on rater training, the situation is somewhat more complex; this might be an interesting area for further research.

Authenticity issues

In terms of authenticity of rating, as White (1995) claims, reading holistically is a more natural process than reading analytically. Thus, a holistic scale is more authentic than an analytic one, because in reality we usually read not to evaluate or rate, but to get information. In most classroom settings, however, teachers evaluate students' compositions against discrete categories such as "content." Teachers also read texts with certain expectations of good grammar and vocabulary, clarity of expression, logical organization of thought, and respect for academic conventions, and they are aware that students nowadays have rising expectations of meaningful feedback. This would seem to argue for analytic rating scales.


". . . the best practice is to have multiple raters and multiple rating items. The next best practice is to have one overall evaluation item and multiple raters."

Conclusions

Several conclusions can be drawn from this study.
  1. For practical and economical reasons, holistic (one item evaluation) assessment can be used, but to avoid risky idiosyncratic ratings, analytic assessment (with several evaluation items) is strongly recommended.
  2. In terms of rating options, the best practice is to have multiple raters and multiple rating items. The next best practice is to have one overall evaluation item and multiple raters. In order of preference, the third choice would be to have one rater and multiple items. The least recommended solution would be to have one rater and one item. Even worse than this, however, would be to have one rater and an impressionistic scale.
  3. This study suggests that it is very risky for one classroom teacher to judge students using a holistic rating system (cf. Table 1 and the discussion).
  4. The more ratings a person receives, the higher the rating precision, though one obvious condition is that construct and content validity must come before statistical reliability. Otherwise, we do not know what the test is measuring (cf. Table 3 and Table 4 and the discussion).
As Weigle (2002) further suggests:
  1. If large numbers of students need to be placed into writing courses with limited time and limited resources, a holistic scale may be the most appropriate choice in terms of practicality. Concerns about reliability, validity, and impact can be eased by the possibility of adjusting placements later.
  2. A test of writing used for research purposes should have reliability and construct validity as central concerns, and practicality and impact issues should be of lesser significance.
  3. The choice of testing procedures should involve finding the best possible combination of the qualities (reliability, validity, etc.) and deciding which qualities are most relevant in a given situation.

References

Allison, D. (1999). Language testing and evaluation. Singapore: Singapore University Press.

Alderson, J. C. (2000). Assessing reading. Cambridge: Cambridge University Press.

Alderson, J. C., Clapham, C. & Wall, D. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Bachman, L. F. (2002). Alternative interpretations of alternative assessments: Some Validity issues in educational performance assessments. Educational Measurement: Issues and Practice, 21 (3), 5-18.

Bachman, L. F. (forthcoming). Statistics for language assessment. Cambridge: Cambridge University Press.

Bachman, L. F. & A. S. Palmer. (1996). Language testing in practice: Designing and developing useful language tests. Oxford: Oxford University Press.

Brown, J. D. & Hudson, T. (2002). Criterion-referenced language testing. Cambridge: Cambridge University Press.

Buck, G. (2001). Assessing listening. Cambridge: Cambridge University Press.

CARLA (Center for Advanced Research on Language Acquisition). (2001). CoWA (Contextualized Writing Assessment). Available online: http://carla.acad.umn.edu/CoWa.html.



Cohen, A. (1994). Assessing language ability in the classroom. 2nd Edition. Boston: Heinle & Heinle.

Davies, A., Brown, A., et al. (Eds.). (1999). Dictionary of language testing. Cambridge: Cambridge University Press.

Davidson, F. (2000). Review: Standards for educational and psychological testing. Language Testing 17: 457-462.

Douglas, D. (2000). Assessing languages for specific purposes. Cambridge: Cambridge University Press.

Hamp-Lyons, L. & Kroll, B. (1997). TOEFL 2000-writing: Composition, community, and assessment. (TOEFL Monograph Series Report No. 5). Princeton, NJ: Educational Testing Service.

Hughes, A. (2003). Testing for language teachers. Second Edition. Cambridge: Cambridge University Press.

Lumley, T. (2002). Assessment criteria in a large-scale writing test: What do they really mean to the raters? Language Testing, 19, 246-276.

McNamara, T. F. (1996). Measuring second language performance. London: Longman.

McNamara, T. F. (2002). Language testing. Oxford: Oxford University Press.

Nakamura, Y. (2002). Effectiveness of paired rating in the assessment of English compositions. JLTA Journal, 5, 61-71.

Nakamura, Y. (2003). Two-Dimensional performance assessment: speaking and writing tests. Journal of Communication Studies 18, 7-15.

Nakamura, Y. (2003). An application of a multi-faceted Rasch model to writing test analysis. In A. S. Mackenzie and T. Newfields (Eds.), Curriculum innovation, testing and evaluation: Proceedings of the JALT Testing Conference 2002 (pp. 171-179). Tokyo: The Japan Association for Language Teaching.

Nakamura, Y. & Tobiwatari, H. (2004). An empirical, statistical and comparative study of students' writing abilities in Japanese and English. JLTA Journal, 6, 1-12.

Van Ek, J. A. & J. L. Trim. (2001a). Waystage 1991. Cambridge: Cambridge University Press.

Van Ek, J. A. & J. L. Trim. (2001). Threshold 1991. Cambridge: Cambridge University Press.

Van Ek, J. A. & J. L. Trim. (2001c). Vantage. Cambridge: Cambridge University Press.

Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing 15 (2): 263-287.

Weigle, S. C. (1999). Investigating rater/prompt interactions in writing assessment: Quantitative and qualitative approaches. Assessing Writing, 6 (2), 145-178.

Weigle, S. C. and Nelson, G. (2001). Academic writing for university examinations. In I. Leki (ed.), Academic writing programs (pp. 121-135). Alexandria, VA: TESOL.

Weigle, S. C. (2002). Assessing writing. Cambridge: Cambridge University Press.

White, E. M. (1995). An apologia for the timed impromptu essay test. College Composition and Communication 46, 30-45.


