Shiken: JALT Testing & Evaluation SIG Newsletter
Vol. 15 No. 2 Oct. 2011 (p. 10 - 19) [ISSN 1881-5537]

The 2010 Revision of the TOEIC® Speaking Test
Bradley Irwin and Peter Nagy


This paper reviews the 2010 revision of the TOEIC® Speaking Test. The authors present a detailed description of the test content and discuss the objectives and purposes of that test. Using criteria established by Bachman and Palmer (1996) they investigate the usefulness of the TOEIC Speaking Test in terms of practicality, reliability, validity and authenticity. They find that while the TOEIC Speaking Test adequately addresses the issues of practicality and validity, both reliability and authenticity appear to be somewhat lacking. This article concludes with a discussion of Educational Testing Service's rating procedure and claims of a standard international English common across cultures.
Keywords: TOEIC, ETS, speaking test, practicality, reliability, validity, authenticity, world Englishes

    Originally administered only in Japan, this test is now available in 120 countries and used by over 10,000 organizations for hiring, placement and training decisions. All told, over 5 million TOEIC tests are administered each year. It wasn't until 2006 that the Speaking and Writing sections of this test were produced to allow for a more comprehensive understanding of test takers' English language ability. The TOEIC can now be thought of as two independently administered tests: the Listening and Reading (LR) test and the Speaking and Writing (SW) test.
    Although there had been several minor revisions since its inception, it wasn't until 2006 that the first significant revision of the TOEIC test was undertaken (Chapman & Newfields, 2008, p. 32). The 2006 revision was due in some part to mounting criticism that the TOEIC test provided an inadequate representation of 'real-life' English abilities and that some TOEIC test takers could obtain high listening and reading scores even if they lacked comprehensive communication skills (Powers, 2010, p. 4; Hirai, 2009). As Powers (2010) states:
It is the broader trait of communicative competence, not specific individual skills, that is critical in most academic and workplace settings and of most interest to users of tests like the TOEFL and TOEIC. It is important, however, to test for each of these four skills individually because each is a critical aspect of communicative competence. Furthermore, direct evidence of specific individual skills can provide at least indirect evidence of other skills. (p. 9)
    By expanding the test to include speaking and writing, Educational Testing Service (ETS) hoped to allow test takers to better demonstrate their holistic language skills.

Test Purpose

ETS (2010, p. 4) claims that the TOEIC Speaking Test serves six purposes:
  1. To verify current level of English proficiency.
  2. To qualify for a new position and/or promotion in a company.
  3. To enhance professional credentials.
  4. To monitor progress in English.
  5. To set learning goals.
  6. To involve employers in advancing English ability.

[ p. 10 ]

More specifically, Powers et al. (2009, p. 1) state that the new speaking component has the following three objectives:
  1. To create connected, sustained discourse appropriate to the typical workplace.
  2. To carry out routine social and occupational interactions such as giving and receiving directions, asking for information, and asking for clarification.
  3. To produce some language that is intelligible to native and proficient non-native English speakers.
    The remainder of the paper outlines the TOEIC Speaking Test and considers issues of practicality, authenticity, reliability and validity in terms of the framework developed by Bachman and Palmer (1996, p. 17).

Speaking Test Content

    The speaking test is approximately twenty minutes in length and consists of six different task types based on eleven questions. A sample of the task types and questions based on them appears in the Appendix. ETS states that Evidence-Centered Design (ECD), a methodology based on the important role of evidentiary reasoning in assessment design that highlights best practices for creating and developing an assessment tool, was used as the basis for the development of each task. ECD is structured around three main assumptions:
  1. Domain of interest analysis – An assessment must build around the important knowledge in the domain of interest, and an understanding of how that knowledge is acquired and put to use;
  2. Principles of evidentiary reasoning – The chain of reasoning from what participants say and do in assessments to inferences about what they know, can do, or should do next, must be based on the principles of evidentiary reasoning;
  3. Purpose built design – Purpose must be the driving force behind design decisions, which reflect constraints, resources, and conditions of use (Mislevy, Almond & Lukas, 2003a, p. 28).
    The ECD model allows test designers to focus on the relational aspects of competencies and tasks during the design process (Hines, 2010, p. 3).
    Hines (2010) claims that the six speaking test tasks from the TOEIC Speaking Test can be categorized into two types for each of the test's three ECD claims. Tasks 1 and 2 satisfy Claim 1 that, "the test taker can generate language intelligible to native and proficient nonnative English speakers" (p. 17). Tasks 3 and 4 satisfy Claim 2 that, "the test taker can select appropriate language to carry out routine social and occupational interaction (such as giving and receiving directions; asking for and giving information; asking for and providing clarification; making purchases; greetings and introductions; etc.)" (p. 17). To fully satisfy Claim 2 test takers must employ a combination of listening/reading or reading/listening/speaking tasks. Tasks 5 and 6 appear to fulfill Claim 3 that, "the test taker can create connected, sustained discourse appropriate to the typical workplace" (p. 17). The hierarchical nature of each claim assumes that test takers who perform well on Claim 3 will also perform well on Claims 1 and 2. However, the inverse may not be true. For a description of the test contents, time allotment, evaluation criteria, etc., please refer to Table 1.

[ p. 11 ]

Table 1
Rubrics for the 2010 Version of the TOEIC Speaking Test

Claim 1: Successful test takers can generate language intelligible to native and proficient nonnative English speakers
  Task Type 1: Read Aloud
    Question # and Content: Questions 1 - 2; Two Texts
    Prep. Time: 45 seconds
    Response Time: 45 seconds
    Evaluation Criteria: Pronunciation
    Evaluation Scale: 0 - 3
  Task Type 2: Describe a Picture
    Question # and Content: Question 3; One Photo
    Prep. Time: 30 seconds
    Response Time: 45 seconds
    Evaluation Criteria: Grammar, Vocabulary, Cohesion, Criteria from Tasks 1 and 2
    Evaluation Scale: 0 - 3

Claim 2: Successful test takers can select appropriate language to carry out routine social and occupational interaction
  Task Type 3: Respond to Questions (Survey type)
    Question # and Content: Questions 4 - 6; Two Short Questions, One Long Question
    Prep. Time: (none listed)
    Response Time: 15/15/30 seconds
    Evaluation Criteria: Relevance of content, Completeness of content, Criteria from Tasks 1 and 2
    Evaluation Scale: 0 - 3
    Integrated (listen & speak)
  Task Type 4: Questions about Schedule or Agenda
    Question # and Content: Questions 7 - 9; Two Basic Info. Questions, One Summary Question
    Prep. Time: (none listed)
    Response Time: 15/15/30 seconds
    Evaluation Criteria: Criteria from Tasks 1-3
    Evaluation Scale: 0 - 3
    Integrated (read, listen, speak)

Claim 3: Successful test takers can create connected, sustained discourse appropriate to the typical workplace
  Task Type 5: Problem Solving
    Question # and Content: Question 10; Respond to a dilemma
    Prep. Time: 30 seconds
    Response Time: 60 seconds
    Evaluation Criteria: Criteria from Tasks 1-3
    Evaluation Scale: 0 - 5
    Integrated (long listen & speak)
  Task Type 6: Expressing Opinions
    Question # and Content: Question 11; State an Opinion about a Paired Choice
    Prep. Time: 15 seconds
    Response Time: 60 seconds
    Evaluation Criteria: Criteria from Tasks 1-3
    Evaluation Scale: 0 - 5
    Independent

Source: compiled by the authors from material provided on the TOEIC ETS homepage.

[ p. 12 ]

    Now let us examine the TOEIC Speaking Test according to the previously mentioned Bachman and Palmer framework.

    The criterion of practicality has been defined as "the relationship between the resources that will be required in design, development, and use of the test, and the resources that will be available for these activities" (Bachman & Palmer, 1996, p. 36). In terms of its practicality, the protocol for marking the speaking component of the TOEIC SW is both efficient and simple. The recorded speaking test responses are sent to an online scoring network, rated either on a scale of 0 to 3 or 0 to 5 by certified raters, and then converted into a scale of 0 to 200. Allocated response times, evaluation criteria used, and task type and number of questions are all strictly regulated parameters of the evaluation process described in Table 1. In our view, ETS has adequately thought through the logistics of practically administering the Speaking Test.

    Bachman and Palmer (1996, p. 23) define authenticity as the degree to which the characteristics of a language test task correspond to the features of language in a real world target language use task. While ETS attempts to widen TOEIC's appeal by claiming that the speaking and writing tests "do not require test takers to have specialized knowledge of business" (ETS, 2008, p. 3), a cursory glance through their online publications reveals that the test is marketed and oriented towards a "generic" business audience in which idiomatic and casual expressions have been sanitized out.
    While the language tasks may often correspond to actual language use, there are nonetheless two authenticity concerns that are worth addressing. First, the use of a computer to deliver pre-recorded prompts omits opportunities for spontaneity and the co-construction of discourse, neglects important aspects of communicative competence (Stoynoff, 2009), and diminishes the authenticity claims made by ETS. For example, simple tasks that are performed every day in speech interactions, such as asking for repetition or clarification, cannot be replicated when using pre-recorded computer prompts as a testing platform. Moreover, providing thirty to forty-five seconds for test-takers to think about, prepare and formulate a response to a question does not follow the normal conventions expected in authentic communication situations. In addition, limiting the amount of time that a test-taker has to answer a question can produce unnecessary anxiety and deprive him or her of the opportunity to fully develop complex responses. Another point worth noting is that ETS strictly forbids note taking during the TOEIC SW test to limit copyright infringement and cheating. Unfortunately, not being able to use pencil and paper to organize one's thoughts can increase the cognitive load of a task, as test takers are forced to rely on memory recall and mental maps of a discourse.
    The rating method of the speaking component may also be a constraint on authenticity, as only raters from the US are selected to assess its results (Balogun, 2008; Everson & Hines, 2010, p. 6). Selecting TOEIC raters from other English-speaking backgrounds more representative of the world's population of English speakers would be one way of responding to the authenticity concerns directed at ETS tests that have been raised by Canagarajah (2006). We will return to the issue of world Englishes at the end of this paper.

    Reliability has been defined by Bachman as the answer to the question: "How much of an individual's test performance is due to measurement error, or to factors other than the language ability we want to measure?" (Bachman, 1990, p. 160). In his 2009 survey of TOEIC, Stoynoff bemoans the lack of reliability evidence for the speaking component of the TOEIC test (Stoynoff, 2009, p. 32). Since then, a useful report has been published by ETS that seeks to address that criticism (Liao & Qu, 2010). Of particular interest are the test-retest correlation coefficients reported by ETS for test-takers who sat the speaking test more than once (Liao & Qu, 2010, p. 6). The raw score coefficients are listed as .80 (agreement between first and second sittings of the test), .80 (agreement between second and third sittings), and .79 (agreement between third and fourth sittings). The median time intervals were 63 days between the first and second sittings, 45 days between the second and third sittings, and 28 days between the third and fourth sittings. However, the actual score level correlation coefficients were somewhat lower at .76, .77, and .74. For information on the sample size and length of time between sittings please refer to Table 2.

[ p. 13 ]

Table 2
Test-Retest Reliability for the TOEIC Speaking Test (Liao & Qu, 2010, p. 3-4)
Sitting | Sample Size | Median Time Interval Between Tests | Raw Score Correlation Coefficient | Score Level Correlation Coefficient
1 - 2 | 16,867 | 63 days | 0.80 | 0.76
2 - 3 | 5,129 | 45 days | 0.80 | 0.77
3 - 4 | 1,923 | 28 days | 0.79 | 0.74

    Hughes (2003) recommends a reliability coefficient of .79 for oral production, as a rule of thumb. Therefore, at first glance the coefficients reported by Liao and Qu seem to satisfy the reliability requirements suggested by Hughes. However, it should be noted that the score level coefficients, which reflect the actual levels awarded to test-takers, are lower than the benchmark recommended by Hughes. Hughes also notes that high stakes tests should be held to an even higher standard of reliability than his suggested minimum range (Hughes, 2003, p. 39). While the number of TOEIC SW test takers remains modest compared to the number taking the TOEIC LR test, it remains a high stakes test because of its impact on hire-ability and promotion-worthiness. Viewed in this light, it can be argued that the reliability coefficients are somewhat wanting.
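    The comparison made in the paragraph above can be laid out as a short Python sketch. The coefficients are those reported in Table 2, and the .79 benchmark is the figure attributed to Hughes (2003) in this paper. The names `TABLE_2`, `BENCHMARK`, `meets_benchmark`, and `summarize` are illustrative assumptions introduced here, not anything used by ETS or by Liao and Qu.

```python
# Coefficients as reported in Table 2 (Liao & Qu, 2010).
# Each entry maps a pair of sittings to
# (raw score coefficient, score level coefficient).
TABLE_2 = {
    "1 - 2": (0.80, 0.76),
    "2 - 3": (0.80, 0.77),
    "3 - 4": (0.79, 0.74),
}

# Rule-of-thumb benchmark for oral production as cited in this paper
# (Hughes, 2003). Treating it as a hard cut-off is an assumption of this sketch.
BENCHMARK = 0.79


def meets_benchmark(coefficient: float, benchmark: float = BENCHMARK) -> bool:
    """Return True when a coefficient is at or above the benchmark."""
    # Round to two decimals so floating point noise cannot flip a borderline case.
    return round(coefficient, 2) >= round(benchmark, 2)


def summarize(table: dict) -> dict:
    """Map each pair of sittings to (raw_ok, level_ok) flags."""
    return {
        sittings: (meets_benchmark(raw), meets_benchmark(level))
        for sittings, (raw, level) in table.items()
    }


if __name__ == "__main__":
    for sittings, (raw_ok, level_ok) in summarize(TABLE_2).items():
        print(f"Sittings {sittings}: raw score ok={raw_ok}, score level ok={level_ok}")
```

    Under these assumptions every raw score coefficient reaches the benchmark while every score level coefficient falls short of it, which is the pattern the discussion above relies on.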
    According to Bachman and Palmer, the term construct validity refers to the extent to which a test score can be interpreted as an indicator of the skills or constructs that the test developers want to measure (Bachman & Palmer, 1996, p. 21). The most common criticism of construct validity for the TOEIC test has been that it fails to take into account the communicative competency of the test taker (Cunningham, 2002). Previously, ETS tried to defend its claim that TOEIC assessed comprehensive English communication skills by using one measure (reading or listening) as a proxy for overall proficiency. However, research by Hirai (2002) undercut these claims. Hirai found an unconvincing correlation between TOEIC test-takers' high scores and their ability to communicate in English and concluded, "From the above analysis, the author maintains that TOEIC scores cannot be employed as a reliable measure of writing skills in business contexts" (p. 7), suggesting that "TOEIC scores be interpreted cautiously by businesses" (p. 7). The introduction of the speaking and writing components of the TOEIC test has gone some way in answering this criticism.
    One particular concern is using the reading section (questions 1 and 2) to evaluate a test-taker's pronunciation and intonation. While reading aloud provides test-raters with an easy and convenient indicator of what might appear to be 'good' and 'bad' pronunciation or intonation, research supporting the connection between reading and pronunciation skills is still lacking. Learners who study English conversationally or naturalistically may not have highly developed reading skills, which would negatively affect their pronunciation and intonation scores regardless of communicative ability. Therefore, some test-takers may struggle with task 1 while excelling at the subsequent tasks. Results like these may produce rater bias because they go against the basic assumption of the hierarchical nature of each task.

[ p. 14 ]

    A further concern is using examinee self-reported can-do lists as indicators of test validity because of the disparity between one's perceived and actual abilities. Even the ETS-produced report on the TOEIC Speaking Test's validity could not establish the soundness of using test-takers' self-reports as valid measures (Powers et al., 2010, p. 14). For example, 2% of examinees who scored at the lowest level (between 0 and 50 out of 200) felt that they could serve as interpreters for managers in business negotiations, while only 47% of examinees who scored at the highest level (between 190 and 200 out of 200) felt they could perform the same role. This particular task, the most difficult listed on the can-do report, produced a very low correlation (.32) with actual test scores. Overall, the correlation between speaking can-do reports and actual TOEIC Speaking Test scores was somewhat weak at .54 (Powers et al., 2010, p. 5).
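    One common way to read correlations of this size is to square them, which gives the proportion of variance that two measures share. The Python sketch below applies that standard calculation to the two figures quoted above. The helper name `shared_variance` is hypothetical, and the calculation is an illustration added here, not an analysis reported by Powers et al. (2010).

```python
def shared_variance(r: float) -> float:
    """Return r squared, the proportion of variance shared by two measures."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    return r * r


# Correlations quoted in this paper from Powers et al. (2010).
correlations = {
    "interpreter can-do item": 0.32,
    "overall speaking can-do reports": 0.54,
}

if __name__ == "__main__":
    for label, r in correlations.items():
        print(f"{label}: r = {r:.2f}, shared variance = {shared_variance(r):.0%}")
```

    On this reading, the overall can-do reports share roughly 29% of their variance with Speaking Test scores, and the interpreter item roughly 10%.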

Discussion and Conclusion

    While ETS has begun to address some of the previous criticisms leveled against the TOEIC test, there remains room for improvement. The inclusion of the TOEIC SW test means that test users no longer need to rely on reading and listening tests as proxies for all four skills. Even though ETS now promotes the notion of a comprehensive four-skills assessment of foreign language proficiency, several issues are worthy of further consideration.
    One of the first issues that should be addressed is that of test reliability. Although the reliability coefficients for raw scores are within an acceptable range, the lower coefficients for actual test scores do not meet the standard we expect to find for high-stakes tests such as the TOEIC SW. A second yet equally important issue is that of validity. Relying on can-do self-evaluations as a measure of validity is untenable. Future TOEIC SW Test research should explore the extent to which this test actually assesses the skills and constructs that ETS claims it measures. In addition, a criticism that has been leveled at previous TOEIC tests is that they use only a very limited range of world Englishes. Previous claims by ETS that the TOEIC represents a test of international English are missing from the published speaking and writing test material. Instead, the tasks that TOEIC SW candidates encounter are those that "people might perform in work-related situations or in familiar daily activities that are common across cultures" (ETS, 2008, p. 3). While the "work-related situations" and "familiar daily activities" may be common across some cultures, there is no indication that the language presented in the test is common across all or even most cultures and thus truly "international" (Kawahara, in press). The designation "Test of English for International Communication", therefore, might still be considered a misnomer by those who support the notion of a variety of world Englishes (Lowenberg, 2002; Kawahara & Yamamoto, 2010).
    Canagarajah (2006, p. 235) laments the fact that organizations such as ETS make claims that their test results represent comprehensive English ability when in fact only inner-circle English contexts (American, British, Australian, etc.) are assessed. Furthermore, proficiency level descriptors provided by ETS should be accepted cautiously by educational institutions, especially those in outer-circle English contexts (India, Pakistan, Singapore, etc.), as they may not fully illustrate one's overall language proficiency (Canagarajah, 2006, p. 235).
    A final concern, related to the notion of a standard English, is the decision to only use test-raters based in the United States. If ETS were serious about its commitment to authentic language testing there should be no compromise to save money by centralizing the raters into one geographic location. As English is clearly no longer the property of a single nation or culture and the notion of a standard English is being replaced by the idea of diverse world Englishes, raters from various backgrounds should be hired.

[ p. 15 ]


References

Bachman, L. (1990). Fundamental considerations in language testing. Oxford University Press.

Bachman, L., & Palmer, A. (1996). Language testing in practice. Oxford University Press.

Balogun, E. (2008, Sept. 16). Re: English and applied linguistics: Test rater, ETS, telecommute [Online job advertisement]. Retrieved from

Brown, H.D. (2001). Teaching by principles (2nd ed.). New York: Longman.

Canagarajah, S. (2006). Changing communicative needs, revised assessment objectives: Testing English as an international language. Language Assessment Quarterly, 3 (3), 229 - 242. Retrieved from

Chapman, M., & Newfields, T. (2008). The new TOEIC. Shiken: JALT Testing & Evaluation SIG Newsletter, 12 (3), 32 - 37. Retrieved from

Cunningham, C. (2002). The TOEIC test and communicative competence: Do test score gains correlate with increased competence? (Master's dissertation, University of Birmingham). Retrieved from

Educational Testing Service. (2007). TOEIC user guide - listening & reading. Retrieved from

Educational Testing Service. (2008). Speaking and writing sample tests. Retrieved from

Educational Testing Service. (2008b). 2008 TOEIC test taker data. Retrieved from

Educational Testing Service. (2010). User guide - speaking and writing. Retrieved from

Everson, P., & Hines, S. (2010). How ETS scores the TOEIC® Speaking and Writing Test responses. (Research Report No. TC-10-08) Retrieved from

Hines, S. (2010). Evidence-Centered Design: The TOEIC Speaking and writing Test. (Research Report No. TC-10-07) Retrieved from

Hirai, M. (2002). Correlations between active skill and passive skill test scores. Shiken: JALT Testing & Evaluation SIG Newsletter, 6 (3), 2-8. Retrieved from

Hirai, M. (2009). Correlation between STEP BULATS Speaking and TOEIC® Scores. In E. Skier & T. Newfields (Eds.) Proceedings of the 8th Annual JALT Pan-SIG Conference. Retrieved from

Hughes, A. (2003). Testing for language teachers (2nd ed.). Cambridge University Press.

Lawson, A. J. (2008). Testing the TOEIC: Practicality, reliability and validity in the Test of English for International Communication (Unpublished paper). Retrieved from

Liao, C., & Qu, Y. (2010). Alternate forms test-retest reliability and test score changes for the TOEIC® speaking and writing tests. (Research Report No. TC-10-14). Retrieved from

Liao, C., & Wei, Y. (2010). Statistical Analyses for the TOEIC speaking and writing pilot study. (Research Report No. TC-10-09). Retrieved from

Lowenberg, P. H. (2002). Assessing English proficiency in the Expanding Circle. World Englishes, 21 (3) 431-435. doi: 10.1111/1467-971X.00261.

[ p. 16 ]

Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2002). Design and analysis in task-based language assessment. Language Testing, 19 (4), 477-496. doi: 10.1191/0265532202lt241oa

Mislevy, R. J., Almond, R. G., & Lukas, J. F. (2003a). A brief introduction to evidence-centered design. Retrieved from

Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003b). On the structure of educational assessment. Measurement: Interdisciplinary Research and Perspectives, 1, 3-62.

Powers, D. et al. (2009). The TOEIC speaking and writing tests: Relations to test-taker perceptions of proficiency in English. (Research Report No. TC-09-18). Retrieved from

Powers, D. et al. (2010). The TOEIC speaking and writing tests: relations to test-taker perceptions of proficiency in English. (Research Report No. TC-10-11). Retrieved from

Powers, D. (2010). The case for a comprehensive, four skills assessment of English language proficiency. (Research Report No. TC-10-12). Retrieved from

Stoynoff, S. (2009). Recent developments in language assessment and the case of four large-scale tests of ESOL ability. Language Teaching 42(1), 1-40.

Appendix: Sample Items from a TOEIC Speaking Test

Based on the test design criteria established by ETS and online samples, the authors have created some sample items in order to convey to the reader the general 'look and feel' of the TOEIC Speaking Test. Please note that these items are not official and as such their difficulty level might differ somewhat from an official TOEIC Speaking Test.

Questions 1 - 2: Read aloud

This section of the speaking test evaluates examinees' pronunciation and intonation abilities. Examinees will read an onscreen text aloud. They have 45 seconds to prepare their answer and 45 seconds to complete the reading.

When you go on vacation to a foreign country there are many things you must consider. Many people travel to foreign countries to relax, but some vacations can be stressful. Properly preparing yourself for an upcoming vacation may help you avoid stressful situations. The kind of trip you take can also help reduce stress. Some people enjoy going to beach resorts to relieve tension. These types of places usually have a variety of activities to choose from. Diving, snorkeling, swimming and sunbathing are all common activities at beach resorts. Don't forget to bring your camera with you when you take a trip.

Question 3: Picture description

This section of the test evaluates examinees' grammar, vocabulary, and cohesion skills (in addition to the skills represented in questions 1 and 2). Examinees will describe a picture in as much detail as possible. They have 30 seconds to prepare their answer and 45 seconds to describe the picture.

A Possible TOEIC Speaking Test Photo

[ p. 17 ]

Questions 4 - 6: Question response

This section of the test evaluates examinees' ability to form relevant and complete ideas related to the test content (in addition to the skills found at questions 1 - 3). Examinees will answer three questions based on a scenario (2 short and 1 long). There will be no time to prepare answers. Examinees have 15 seconds to answer questions 4 and 5 and 30 seconds for question 6.

(Narrator): A Hollywood advertising agency is collecting information about movie viewing habits.
You are answering their questions about watching movies over the telephone.

Question 4: How often do you watch movies?

Question 5: What kind of movie do you usually watch?

Question 6: Describe your favorite kind of movie.

Questions 7 - 9: Question response using the information provided

This section of the test evaluates examinees' ability to form relevant and complete responses based on the information provided (in addition to the skills found at questions 1 - 6). Examinees will answer three questions based on information they are given (2 short and 1 long summary type). Examinees have 15 seconds to answer questions 7 and 8 and 30 seconds to answer question 9.

(Narrator): Hello, I'm calling to ask about an advertisement for a business seminar on March 1 that I saw in the newspaper.
It's about maintaining good relations with your customers.
I would like to get some information about the seminar.

A Possible TOEIC Speaking Test Photo

[ p. 18 ]

Question 7: What is the cost of attendance?

Question 8: Can you please tell me when the seminars start and finish?

Question 9: I may be somewhat late in the morning. Can you please tell me about the morning seminars?

Question 10: Solution proposal

This section of the test evaluates examinees' ability to respond to a dilemma or problem (in addition to the skills found at questions 1 - 6). Examinees should create a connected and sustained discourse appropriate to the situation presented. Examinees will have 30 seconds to prepare their answer and 60 seconds to respond.

Respond as if you work at a credit card company.

When you respond, please:
  1. demonstrate that you understand the problem, and
  2. offer a solution to the problem.
Please listen to the recorded message.

Examinees will hear: Hello, this is Dan Connor. I'm calling about my new credit card. Ah, actually, the problem is that I haven't received it yet. My old card expires this week and I would like to get my new card . . . I actually sent a letter to your office last month but I didn't receive a reply. Anyway, this card is very important for me because I use it for my business . . . In fact, I use a credit card from your company almost every day. I've been using your company for several years now and have always been very happy with the service . . . I hope that I can continue using your services in the future. Could you please call me back at work? I'll be out of my office this morning because of a business meeting but I would like to know how to get my new card as quickly as possible. This is Dan Connor. My office number is 555-4321. Thank you.

Question 11: Expressing opinion

This section of the test evaluates examinees' ability to express an opinion about a paired choice (in addition to the skills found at questions 1 - 6). Examinees should say as much as they can about the topic presented. They have 15 seconds to prepare their answer and 60 seconds to respond.

Question: (Narrator): Some people argue that environmental issues such as global warming and climate change are caused by human action. What is your opinion about environmental issues like climate change and global warming? Please provide reasons for your opinion.


[ p. 19 ]