Shiken: JALT Testing & Evaluation SIG Newsletter
Vol. 3 No. 1 April 1999 (p. 17 - 19) [ISSN 1881-5537]

Book Review

Measuring second language performance
by Tim McNamara (1996) ISBN: 0582089077
Harlow, Essex, UK: Addison Wesley Longman Ltd.

Quite soon into the first chapters of this book, I recalled a humorous little adage that often comes to mind in my work: "The truth will make you free, but first it'll make you miserable." The first two chapters present a thorough and critical history of developments in performance assessment as a test method, first in educational and occupational settings, then moving on to language testing theory and practice, particularly since the 1970s, when the ideas of communicative competence and authenticity of test tasks began to be incorporated into discussions about validity in language assessment. McNamara describes language performance assessment as the product of two traditions. The first is the fundamentally pragmatic 'work sample' approach influenced by sociolinguistic theory, which treats the performance itself as the target of assessment (the "strong" argument). The second stems from psycholinguistic theory, which views performance as merely a medium and the underlying knowledge and ability as the target (the "weak" argument).
The book's initial message is that language performance tests possess a seductive face validity that obscures the boundaries between what is to be observed, how the subject reacts to the task at hand, and how that reaction is registered as a score. Theories across and within the differing hues in the spectrum from the 'strong' to the 'weak' argument remain far from definitive, and lack empirical evidence to support their constructs. So okay, my most painful doubts during hands-on, in-shop experience were not only confirmed but, thanks to the author's comprehensive treatment, expanded. But in these first chapters, McNamara is simply telling us what we have to face.
McNamara proposes a 'three-pronged' approach to tackling the problems he identifies. The first task is to incorporate a model of communicative competence which can explain the interaction between all participants in performance assessment, including the interlocutor(s). Secondly, we must direct our research towards finding how significant each variable in our assessment method (tasks, participants, settings, topics, scales) is to our measurement. Lastly, once we have a picture of the impact of these variables, we must decide which of them will fit or inform our model of communicative competence and what the practical boundaries for "testability" are. McNamara uses the remaining two-thirds of the book to illustrate how this might be done, and in doing so makes the book a must for those who wish to gain a clear understanding of what Rasch-based analysis can do.

From here on, the book can be read like a detailed journal that an explorer might leave behind for others, telling us what to look for, how to find it, and what it means. McNamara uses as his primary example the development and data analysis of the Occupational English Test (OET), a test of ESL for health professionals in Australia which assessed speaking and writing in work-related simulations. He takes a chapter to explain the procedures for determining the OET's test content (analysis of needs, resources, and the communicative demands of the profession); writing up the specifications, materials and scoring protocols; recruiting and training evaluators; and piloting and revision. He includes here, and throughout the book, examples of the actual materials used in the decision-making process and in the test itself, all of which can serve as models for the reader to consider.
The final four chapters constitute what I think McNamara meant when he talked of directing our research to find which variables in performance assessment can inform our construct of communicative competence and which cannot. He begins with raters and ratings, presenting evidence of wide variation in how raters apply criteria, even when traditional methods to limit this have been employed. Here the author introduces the advantages of using multi-faceted measurement which, as a Rasch-based method, can process raw scores in such a way as to estimate factors such as candidate ability, item (criterion) difficulty, and rater severity all on the same scale. This not only allows the analyst to map the variables side by side to see how they introduce bias or interrelate, but, as demonstrated in this chapter, offers the test user an improved, fairer (more accurate) measurement than raw scores. McNamara then illustrates this with detail and clarity in his next chapter on the concepts and procedures of Rasch analysis.
Here begins the most 'cookbook-like' part of the book, resembling Grant Henning's treatment of the subject in his 1987 A Guide to Language Testing, in that it explains not only how the Rasch model works, but also how to interpret the results. What this reviewer appreciated most was that here and in subsequent chapters, McNamara uses print-outs and terms which are specific to the various Rasch-based computer programs that he uses. This allows new users who might otherwise be intimidated by mathematics or psychometric jargon to be on more familiar ground when they try to analyze their own data. The following two chapters present case studies and reports of related research to build upon the concepts and procedures of Rasch-based analysis introduced earlier.
The author demonstrates how Rasch-based analysis of test data can supply empirical evidence to reveal, at least in part, the impact of various factors, or 'facets', of performance assessment. This fleshes out the 'three-pronged' attack introduced at the beginning of the book with such tools as mapping out a test's purported abilities and skill levels against item difficulty to see how well the rating scale fits the model. Moving more deeply into the area of rating scales and rating criteria, he presents examples of research by himself and others into the impact of individual rater characteristics, descriptors and categories for rating scales, and rater interpretation of the criteria, all of which lead to a deeper, though admittedly unfinished, understanding of what performance tests measure.
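For readers who have not met the idea before, the notion of placing ability, criterion difficulty, and rater severity "on the same scale" can be sketched in a few lines of Python. This is only a simplified, dichotomous illustration of a many-facet Rasch model, not the rating-scale form or the software output presented in the book; the function name and all of the logit values are hypothetical.

```python
import math

def success_probability(ability, difficulty, severity):
    """Chance of a positive rating under a dichotomous many-facet Rasch model.

    All three facets are expressed in logits, so they can simply be
    subtracted from one another (this is what "the same scale" means).
    """
    logit = ability - difficulty - severity
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical values, chosen only for illustration.
candidate_ability = 1.0
criterion_difficulty = 0.5

# The same candidate, judged on the same criterion by two different raters.
lenient = success_probability(candidate_ability, criterion_difficulty, severity=-0.5)
harsh = success_probability(candidate_ability, criterion_difficulty, severity=0.5)

print(round(lenient, 3), round(harsh, 3))
```

With these made-up numbers the candidate's chance of a positive rating falls from roughly 0.73 under the lenient rater to 0.5 under the harsh one, although the candidate's ability has not changed. This is the kind of rater effect that a raw score hides and that an estimate of ability adjusted for severity is meant to remove.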

Perhaps the greatest value this book has for me personally is its last chapter, where I found clear explanations of the more esoteric areas of IRT and latent trait theory and application that had been bothering me for years. These include a breakdown of the various models (simple Rasch, two- and three-parameter, rating scale, partial credit, and multi-faceted) and their uses. I found McNamara's treatment of the issue of unidimensionality of data, as well as his explanation of the significance of 'specific objectivity' when choosing the number of parameters for modeling data, both very illuminating and reassuring. The explanations here and throughout the book have also helped me through several user manuals for IRT-based software such as Quest and Bilog, which offer me a myriad of choices about how to handle data, but don't seem to explain nearly so well how and why I should make them or how I should interpret the results. This book's comprehensive, straightforward and highly readable treatment of L2 performance assessment in general, and of how useful IRT can be in unlocking its mysteries, has made it one of my most treasured sources. I recommend that you get a copy and see for yourself.
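As a rough aid to the distinction between the simple Rasch model and the two- and three-parameter models mentioned above, the following Python sketch uses the standard logistic item response function, with 'specific objectivity' illustrated as the property that a comparison between two persons does not depend on which item is used. The function names and values are hypothetical, and nothing here is taken from McNamara's own examples or from Quest or Bilog.

```python
import math

def item_response(theta, b, a=1.0, c=0.0):
    """Probability of success for ability theta on an item.

    b = difficulty, a = discrimination, c = guessing (lower asymptote).
    Simple Rasch: a = 1, c = 0. Two-parameter: a varies by item.
    Three-parameter: a and c both vary by item.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def log_odds(p):
    return math.log(p / (1.0 - p))

def person_gap(theta_1, theta_2, b, a=1.0):
    """Difference in log-odds of success between two persons on one item."""
    return log_odds(item_response(theta_1, b, a)) - log_odds(item_response(theta_2, b, a))

# Rasch: the gap between two persons is the same whatever the item difficulty.
rasch_gaps = [person_gap(1.0, 0.0, b) for b in (-1.0, 0.0, 2.0)]

# Two-parameter: the gap now depends on each item's discrimination.
two_pl_gaps = [person_gap(1.0, 0.0, 0.0, a) for a in (0.5, 1.0, 2.0)]

print([round(g, 6) for g in rasch_gaps])
print([round(g, 6) for g in two_pl_gaps])
```

In the Rasch case the gap between the two hypothetical persons stays at about one logit for every item, whereas in the two-parameter case it stretches or shrinks with the item's discrimination. That trade-off between closer fit to the data and item-free comparison is one way to read the choice of how many parameters to model.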

- reviewed by Jeff Hubbell
