Holistic grading

Holistic grading or holistic scoring, in standards-based education, is an approach to scoring essays using a simple grading structure that bases a grade on a paper's overall quality.^[1] This type of grading, which is also described as nonreductionist grading,^[2] contrasts with analytic grading,^[3] which takes more factors into account when assigning a grade. Holistic grading can also be used to assess classroom-based work. Rather than counting errors, a paper is judged holistically and often compared to an anchor paper to evaluate if it meets a writing standard.^[4] It differs from other methods of scoring written discourse in two basic ways. It treats the composition as a whole, not assigning separate values to different parts of the writing. And it uses two or more raters, with the final score derived from their independent scores. Holistic scoring has gone by other names: "non-analytic," "overall quality," "general merit," "general impression," "rapid impression." Although the value and validation of the system are a matter of debate, holistic scoring of writing is still in wide application.

Definition

In holistic scoring, two or more raters independently assign a single score to a writing sample. Depending on the evaluative situation, the score will vary (e.g., "78," "passing," "deserves credit," "worthy of A-level," "very well qualified"), but each rating must be unitary. If raters are asked to consider or score separate aspects of the writing (e.g., organization, style, reasoning, support), their final holistic score is not mathematically derived from that initial consideration or those scores. Raters are first calibrated as a group so that two or more of them can independently assign the final score to a writing sample within a pre-determined degree of reliability. The final score lies along a pre-set scale of values, and scorers try to apply the scale consistently. The final score for the piece of writing is derived from two or more independent ratings. Holistic scoring is often contrasted with analytic scoring.^[5]^[6]^[7]

Need

The composing of extended pieces of prose has been required of workers in many salaried walks of life, from science, business, and industry to law, religion, and politics.^[8] Competence in writing extended prose has also formed part of qualifying or certification tests for teachers, public servants, and military officers.^[9]^[10] Consequently, the teaching of writing is part of formal education in school and, in the US, in college. How can that competence in composing extended prose be best evaluated? Isolated parts of it can be tested with "objective", short-answer items: correct spelling and punctuation, for instance. Such items are scored with high degrees of reliability. But how well do item questions evaluate potential or accomplishment in writing coherent and meaningful extended passages? Testing candidates by having them write pieces of extended discourse seems a more valid evaluation method. That method, however, raises the issue of reliability. How reliably can the worth of a piece of writing be judged among readers and across assessment episodes? Teachers and other judges trust their knowledge of the subject and their understanding of good and bad writing, yet this trust in "connoisseurship"^[11] has long been questioned. Equally knowledgeable connoisseurs have been shown to give widely different marks to the same essays.^[12]^[13]^[14]^[15] Holistic scoring, with its attention to both reliability and validity, offers itself as a better method of judging writing competence. With attention to fairness, it can also focus on consequences of score use.^[16]

Model

While analytic grading involves criterion-by-criterion judgments, holistic grading appraises student works as integrated entities. In holistic grading, the learner's performance is approached as one and cannot be reduced or divided into several component performances.^[17] Here, teachers are required to consider specific aspects of the student's answer as well as the quality of the whole.^[18]

Holistic grading operates by distinguishing satisfactory performance from performance that is merely adequate or outstanding.^[2]

Four kinds of scoring

Although a wide variety of procedures for holistic scoring have been tried, four forms have established distinct traditions.^[19]

Pooled-rater

Pooled-rater scoring typically uses three to five independent readers for each sample of writing. Although the scorers work from a common scale of rates, and may have a set of sample papers illustrating that scale ("anchor papers"^[20]), usually they have had a minimum of training together. Their scores are simply summed or averaged for the sample's final score. In Britain, pooled-rater holistic scoring was first experimentally tested in 1934, employing ten teacher-raters per sample.^[21] It was first put into practice with 11+ examination scripts in Devon in 1939, using four teachers per essay.^[22] In the United States its rater reliability was validated from 1961 to 1966 by the Educational Testing Service;^[23] and it was used, sporadically, in the Educational Testing Service's English Composition Test from 1963 to 1992, employing from three to five raters per essay.^[24] A nearly synonymous term for "pooled-rater score" is "distributive evaluation".^[25]
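
The summing or averaging step described above can be illustrated with a minimal Python sketch. The function name `pooled_score` and its `method` parameter are hypothetical, chosen only for this illustration, and the example ratings are invented; no particular testing program is being reproduced.

```python
from statistics import mean


def pooled_score(ratings, method="mean"):
    """Combine independent ratings of one writing sample into a final score.

    `ratings` holds one score per independent reader, all on the same scale.
    The name and signature are illustrative assumptions, not a standard API.
    """
    if len(ratings) < 2:
        raise ValueError("pooled-rater scoring needs at least two independent ratings")
    if method == "sum":
        return sum(ratings)
    if method == "mean":
        return mean(ratings)
    raise ValueError(f"unknown method: {method!r}")


# Four readers score the same essay on a hypothetical 1-6 scale.
essay_ratings = [4, 5, 3, 4]
total = pooled_score(essay_ratings, method="sum")
average = pooled_score(essay_ratings, method="mean")
```

Summing keeps more distinct score values than any single reader's scale allows, while averaging keeps the final mark on the original scale; which of the two a program used is a design choice the sketch leaves open.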

Trait-informed

Trait-informed scoring trains raters to score to a scoring guide (also called a "rubric"^[26] or "checklist"^[27])—a short set of writing criteria each scaled in grid format to the same number of accomplishment levels. For instance, the scoring guide used in a 1969 City University of New York study of student writing had five criteria (ideas, organization, sentence structure, wording, and punctuation/mechanics/spelling) and three levels (superior, average, unacceptable).^[28] The rationale for scoring guides argues that it forces scorers to attend to a spread of writing accomplishments and not give undue influence to one or two (the "halo effect"). Trait-informed scoring comes close to analytic scoring methods that have raters score each trait independently of the other traits and then add up the scores for a final mark, as in the Diederich scale.^[29] Trait-informed holistic scoring, however, remains holistic at heart and asks raters only to take into some account all the traits before deciding on a single final score.

Adjusted-rater

Adjusted-rater scoring assumes that some scorers are more accurate in their scores than other raters. Each paper is read independently by two raters and if their scores disagree to a certain extent, usually by more than one point on the rating scale, then the paper is read by a third, more experienced reader. Scorers who cause too many third readings are sometimes re-trained during the scoring session, sometimes dropped from the reading corps.^[30]^[31] Adjusted-rater holistic scoring may have first been applied by the Board of Examiners for The College of the University of Chicago in 1943.^[32] Today large-scale commercial testing services sometimes use adjusted-rater scoring where one rater for an essay is a trained human and the other a computer programmed for automatic essay scoring, for instance in GRE testing.^[33]^[34]
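
The discrepancy rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not say how the final mark is computed, so the sketch assumes that two agreeing ratings are averaged and that the third reader's score stands when the first two disagree. The function name `adjusted_score`, the `third_reader` callback, and the `max_gap` parameter are hypothetical.

```python
from typing import Callable, Optional, Tuple


def adjusted_score(
    first: int,
    second: int,
    third_reader: Optional[Callable[[], int]] = None,
    max_gap: int = 1,
) -> Tuple[float, bool]:
    """Return (final_score, needed_third_reading) for one paper.

    Assumption (not from the source): agreeing ratings are averaged, and a
    discrepant pair is resolved by taking the third reader's score as final.
    """
    if abs(first - second) <= max_gap:
        # The two independent ratings are close enough to stand.
        return (first + second) / 2, False
    if third_reader is None:
        raise ValueError("ratings disagree; a third reading is required")
    # Scores differ by more than max_gap points: send to the experienced reader.
    return float(third_reader()), True


# Ratings of 4 and 5 are within one point, so no third reading is needed.
score_a, flagged_a = adjusted_score(4, 5)

# Ratings of 2 and 5 differ by three points; a third reader (here a stub) decides.
score_b, flagged_b = adjusted_score(2, 5, third_reader=lambda: 4)
```

Counting how often `needed_third_reading` is true for a given scorer would give the kind of signal the text describes for re-training or dropping raters.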

Single-rater

Single-rater monitored scoring trains raters as a group and may provide them with a detailed marking scheme. Each writing sample is scored, however, by only one rater unless, through periodic checking by a monitor, its score is deemed outside the range of acceptability and then it is re-rated, usually by the supervisor. This method, called "single marking" or "sampling," has long been standard in British school examinations, even though it has been shown to be less valid than double marking or multiple marking.^[35]^[36] In the United States, for the Writing Section of the TOEFL iBT,^[37] the Educational Testing Service now uses the combination of automated scoring and a certified human rater.

History

In Great Britain, formal pooled-rater holistic scoring was proposed as early as 1924^[38] and formally tested in 1934–1935.^[39] It was first applied in 1939 by Chief Examiner R. K. Robertson to 11+ scripts in the Local Examination Authority of Devon, England, and continued there for ten years.^[40] Although other LEAs in Great Britain tried the system during the 1950s and 1960s and its reliability and validity were much studied by British researchers, it failed to take hold. Multiple marking of school scripts, usually written to show competence in subject areas, largely gave way to single-rater monitored scoring with analytical marking schemes.^[41]^[42]

In the US, the first applied holistic scoring of writing samples was administered by Paul B. Diederich at The College of the University of Chicago as a comprehensive examination for credit in the first-year writing course. The method was adjusted-rater scoring with teachers of the course as scorers and members of the Board of Examiners as adjusters.^[43]^[44] Around 1956 the Advanced Placement examination of the College Board began an adjusted-rater holistic system to score essays for advance English credit. Raters were high-school teachers, who brought the rating system back to their schools.^[45] One teacher was Albert Lavin, who installed similar holistic scoring at Sir Francis Drake High School in Marin County, California, 1966–1972, at grades 9, 10, 11, and 12 in order to show progress in school writing over those years.^[46] In 1973 teachers in the California State University and Colleges system used the Advanced Placement adjusted-rater system to score essays written by matriculating students for advance English composition credit.^[47] Pooled-rater holistic scoring was tested as early as 1950 by the Educational Testing Service (using the term "wholistic").^[48] It was first applied in the College Board's 1963 English Composition Test.^[49] In higher education, the Georgia Regents' Testing Program, a rising-junior test for language skills, used it as early as 1972.^[50]

In the US, a rapid spread of holistic scoring took place from around 1975 to 1990, fueled in part by the educational accountability movement. In 1980 assessment of school writing was being conducted in at least 24 states, the large majority by writing samples rated holistically.^[51] In post-secondary education, more and more colleges and universities were using holistic scoring for advance credit, placement into first-year writing courses, exit from writing courses, and qualification for junior status and for an undergraduate degree. Writing teachers were also instructing their students in holistic scoring so they could judge one another's writing, a pedagogy taught in National Writing Projects.^[52]

Beginning in the last two decades of the 20th century, use of holistic scoring somewhat declined. Other means of rating a student's writing competence, perhaps more valid, were becoming popular, such as portfolios. Colleges were turning more and more to testing agencies, such as ACT and ETS, to score writing samples for them, and by the first decade of the 21st century those agencies were doing some of that by automatic essay scoring. But holistic scoring of essays by humans is still applied in large-scale commercial tests such as the GED, TOEFL iBT, and GRE General Test. It is also used for placement or academic progression in some institutions of higher education, for instance at Washington State University.^[53] For admission and placement into writing courses, however, most colleges now rely on the analytical scoring of writing skills in tests such as ACT, SAT, CLEP, and International Baccalaureate.

Validation

Holistic scoring is often validated by its outcomes. Consistency among rater scores, or "rater reliability," has been computed by at least eight different formulas, among them percentage of agreement, Pearson's r correlation coefficient, the Spearman-Brown formula, Cronbach's alpha, and quadratic weighted kappa.^[54]^[55] Cost of scoring can be calculated by measuring average time raters spend on scoring a writing sample, the percent of samples requiring a third reading, or the expenditure on stipends for raters, salary of session leaders, refreshments for raters, machine copying, room rental, etc. Occasionally, especially with high-impact uses such as in standardized testing for college admission, efforts are made to estimate the concurrent validity of the scores. For instance in an early study of the General Education Development test (GED), the American Council on Education compared an experimental holistic essay score with the existing multiple-choice score and found that the two scores measured somewhat different sets of skills.^[56] More often, predictive validity is measured by comparing a school student's holistic score with later achievement in college courses, usually first-semester GPA, end-of-course grade in a first-year writing course, or teacher opinion of the student's writing ability. These correlations are usually low to moderate.^[57]
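
Four of the rater-reliability measures named above can be computed for a pair of raters with a short Python sketch. The function names are hypothetical, the formulas are the standard textbook definitions rather than the procedure of any study cited here, and the ratings are invented. Scores are assumed to be integers on a shared scale running from `min_score` to `max_score`. Cronbach's alpha is omitted for brevity.

```python
import math


def percent_agreement(a, b, tolerance=0):
    """Share of papers on which two raters differ by at most `tolerance` points."""
    hits = sum(1 for x, y in zip(a, b) if abs(x - y) <= tolerance)
    return hits / len(a)


def pearson_r(a, b):
    """Pearson correlation between two raters' scores (undefined if either rater's scores never vary)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman_brown(r, k):
    """Predicted reliability of a score pooled from k raters, given single-rater reliability r."""
    return k * r / (1 + (k - 1) * r)


def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Chance-corrected agreement that penalizes disagreements by squared distance.

    Undefined (division by zero) when every paper gets one identical score from both raters.
    """
    k = max_score - min_score + 1
    n = len(a)
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            num += weight * observed[i][j]
            den += weight * hist_a[i] * hist_b[j] / n  # count expected by chance
    return 1.0 - num / den


# Invented ratings of eight essays by two raters on a 1-6 scale.
rater_1 = [4, 3, 5, 2, 6, 4, 3, 5]
rater_2 = [4, 4, 5, 2, 5, 3, 3, 6]

exact = percent_agreement(rater_1, rater_2)
adjacent = percent_agreement(rater_1, rater_2, tolerance=1)
r = pearson_r(rater_1, rater_2)
pooled_reliability = spearman_brown(r, k=2)
qwk = quadratic_weighted_kappa(rater_1, rater_2, min_score=1, max_score=6)
```

The `tolerance` parameter separates exact agreement from adjacent agreement (scores within one point), which is one reason reported agreement figures for the same data can differ widely.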

Criticism

Holistic scoring of writing attracted adverse criticism almost from the beginning. In the 1970s and 1980s and beyond, the criticism grew.^[58]^[59]^[60]^[61]

Cost. In the 1980s, when examinations were often scored entirely by humans, valid and reliable holistic scoring of a writing sample took more time and therefore more money than scoring of items. For instance, in the 1980-1981 Georgia Regents' Testing Program, holistic scoring cost $0.75 per essay and item scoring $0.53.^[62] Later, in terms of expense, holistic scoring of papers by humans could compete even less against machine-scored item tests or machine-rated essays, which cost from around half to a quarter of the cost of human scoring.^[63]
Diagnosis. The most common complaint about holistic scoring is the paucity of diagnostic information it provides. Scores of "passing"—or of "3" on a 4-point, 6-point, or 9-point scale—provide little concrete guidance for the student, the teacher, or the researcher. In educational barrier exams, holistic scoring may serve administrators in locating which students did not pass but little serve teachers in helping those students pass on a second try.^[64]^[65] The need to amplify diagnostic information was the reason why, in the second round of the National Assessment of Educational Progress (1973-1974), the Education Commission of the States supplemented holistic scoring with primary-trait scoring of writing samples.^[66] The same reason prompted the International English Language Testing System, run by the British Council and the Cambridge English Language Assessment for second-language speakers and writers, to adopt "profile scoring" in 1985.^[67]
Rubrics. As a pre-set checklist of a few writing traits each scaled equally on a few levels of accomplishment, the rubric has been criticized because it is simplistic, blind to cultural and developmental differences, and falsely premised. When a group of college composition teachers were asked for their "criteria for evaluation" of writing, they mentioned not 5 or 6 criteria but 124.^[68] While the rubric assumes that criteria are independent of one another, studies have shown that the scores readers give to one or two criteria influence the scores they give to the other criteria (the halo effect).^[69] Pre-set and equally valued criteria also do not fit the development of young adult writers, development which may be uneven, non-universal, and regressive.^[70]^[71]^[72] Most fundamentally, standardized rubrics propose a pre-determined language outcome, whereas language is never determined, never free of context. Rubrics use "deterministic formulas to predict outcomes for complex systems"^[73]—a critique that has been leveled at rubrics used for summative scores in large-scale testing as well as for formative feedback in the classroom.
De-contextualization. Traditional holistic scoring may erase vital context of the composing, for instance the influence on different writers responding in a timed, impromptu draft to different topics and different genres of writing.^[74] From the point of view of contrastive rhetoric, vital cultural differences of the writers may also be erased. For instance when researchers for the International Association for the Evaluation of Educational Achievement tried to create measures for rating essays composed by students from Finland, Korea, and the US, they found that "holistic scoring would be doomed at the outset because of the differences in communities".^[75] Holistic scoring—particularly trait-informed scoring with rater training strongly controlled to achieve high rater reliability—also may disregard the ecology of the scorers. The scoring system creates a set of readers artificially forced out of their natural reading response by an imposed consensus.^[76]^[77] Such concerns encouraged institutions such as Ohio University, the University of Louisville, and Washington State University to assess the writing competency of students with portfolios of their essays written from past classes.^[78]
Fairness. Although holistic scoring of writing has been defended as more fair for minorities and second-language writers than objective testing,^[79]^[80] evidence has also been gathered to show that holistic scoring has its own problems with fairness. Coaching was less affordable for low-income candidates.^[81] African American students had more problems with the essay portion of Florida's CLAST test.^[82] The essay prompts for the CUNY Writing Assessment Test were not "content-fair and culture-free" and posed more problems for Hispanic and other second-language writers.^[83] The Educational Testing Service has shown a long-standing concern about test fairness,^[84]^[85] although currently research into unfair outcomes of holistic scoring probably lags behind the intuitions of practitioners and probably needs to apply more discriminant statistical analysis to document those outcomes.^[86]

Projects using holistic grading

Many institutions use holistic grading when evaluating student writing as part of a graduation requirement.^[3] Some examples include:

The National Certificate of Educational Achievement is the New Zealand graduation certificate, which bases its score on holistic grading.^[87]
In the United States, the Graduate Record Examination (GRE) uses holistic grading.^[88]

References

^ Nordquist, Richard (March 7, 2017). "What Is Holistic Grading?". ThoughtCo. Retrieved 2018-12-11.
^ ^a ^b Bishop, Alan; Clements, M. A. (Ken); Keitel-Kreidt, Christine; Kilpatrick, Jeremy; Laborde, Colette (2012). International Handbook of Mathematics Education. Dordrecht: Springer Science & Business Media. pp. 354–355. ISBN 978-94-010-7155-0.
^ ^a ^b "Know Your Terms: Holistic, Analytic, and Single-Point Rubrics". Cult of Pedagogy. 2014-05-01. Retrieved 2018-12-11.
^ "Holistic Scoring in More Detail". writing.colostate.edu. Retrieved 2018-12-11.
^ Cooper, C. R. (1977). "Holistic Evaluation of Writing". In C. R. Cooper and L. Odell (Eds.), Evaluating Writing: Describing, Measuring, Judging, 3-31. Urbana, IL: National Council of Teachers of English.
^ Myers, M. (1980). A Procedure for Writing Assessment and Holistic Scoring. Urbana, IL: National Council of Teachers of English.
^ White, E. M. (1986). Teaching and Assessing Writing: Recent Advances in Understanding, Evaluating, and Improving Student Performance. San Francisco: Jossey-Bass Publishers.
^ Oliveri, M. E., & McCulla, L. (2019). Using the Occupational Network Database to Assess and Improve English Language Communication for the Workplace (Research Report No. RR-19-2). Princeton, NJ: Educational Testing Service.
^ Ballard, P. B. (1923). The New Examiner. London: University of London Press.
^ Elliot, N. (2005). On a Scale: A Social History of Writing Assessment in America. New York: Peter Lang.
^ Weir, C. J., Vidakovic, I., and Galaczi, E. D. (2013). Measured Constructs: A History of Cambridge English Language Examinations 1913-2012. Cambridge, England: Cambridge University Press
^ Edgeworth, F. Y. (1888). "The Statistics of Examinations". Journal of the Royal Statistical Society 51: 599-635.
^ Edgeworth, F. Y. (1890). "The Element of Chance in Competitive Examinations". Journal of the Royal Statistical Society 53: 460-475, 644-663.
^ Starch, D., and Elliott, E. C. (1912). "The Reliability of Grading High School Work". School Review 20: 254-259.
^ Thomas, C. W., et al. (1931). Examining the Examination in English: A Report to the College Entrance Examination Board. Cambridge, MA: Harvard University Press.
^ Slomp, D., Corrigan, J., and Sugimoto, T. (2014). "A Framework for Using Consequential Validity Evidence in Evaluating Large-Scale Writing Assessments." Research in the Teaching of English, 48(3): 276-302.
^ Joughin, Gordon (2008). Assessment, Learning and Judgement in Higher Education. Cham: Springer Science & Business Media. p. 48. ISBN 978-1-4020-8904-6.
^ Franck, Olof (2017). Assessment in Ethics Education: A Case of National Tests in Religious Education. Cham, Switzerland: Springer. p. 72. ISBN 978-3-319-50768-2.
^ The following names are taken from Haswell, R., & Elliot, N. (2019). Early Holistic Scoring of Writing: A Theory, a History, a Reflection. Logan, UT: Utah State University Press, pp. 24-25.
^ Coffman, W. (1971). "On the Reliability of Ratings of Essay Examinations in English". Research in the Teaching of English 5 (1): 24-36.
^ Hartog, P. J., Rhodes, E. C., and Burt, C. L. (1936). The Marks of Examiners: Being a Comparison of Marks Allotted to Examination Scripts by Independent Examiners and Boards of Examiners, Together with a Section on a Viva Voce Examination. London: Macmillan.
^ Wiseman, S. (1949). "The Marking of English Composition in Grammar School Selection". British Journal of Educational Psychology 19 (3): 200-209.
^ Godshalk, F. I., Swineford, F., and Coffman, W. E. (1966). The Measurement of Writing Ability. New York: College Entrance Examination Board.
^ Elliot (2005), pp. 158-165.
^ Whithaus, C. (2010). "Distributive evaluation". WPA-CompPile Research Bibliography, No. 3. WPA-CompPile Bibliographies.
^ Eley, E. G. (1956). "Testing the Language Arts." Modern Language Journal 40 (6): 310-315.
^ Scriven, M. (1974). "Checklist for the Evaluation of Products, Producers, and Proposals." In W. J. Popham (Ed.), Evaluation in Education: Current Applications, 7-33. Berkeley, CA: McCutchan Publishing Co.
^ Bossone, R. M. (1969). The Writing Problems of Remedial English Students in Community College of the City University of New York. New York: CUNY Research and Evaluation Unit for Special Programs. ERIC Document Reproduction Service, ED 028 778
^ Diederich, P. B. (1966). "How to Measure Growth in Writing Ability." English Journal 55 (4): 435-449.
^ White (1985), pp. 23-26.
^ Haswell and Elliot (2019), pp. 99-109
^ Diederich, P. B. (1946). "The Measurement of Skill in Writing." School Review 54 (10): 584-592.
^ Deane, P., Williams, F., Weng, V., and Trapani, C. S. (2013). "Automated Essay Scoring in Innovative Assessments of Writing from Sources". Journal of Writing Assessment 6 (2013). Retrieved 10 January 2022.
^ Educational Testing Service. Frequently Asked Questions About the e-rater Scoring Engine. Retrieved 9 January 2021.
^ Weir, Vidakovic, and Galaczi, p. 201.
^ Office of Qualifications and Examinations. (2014). Review of Double Marking Research, p. 10. Coventry: Ofqual.
^ "TOEFL iBT Test Writing Section". www.ets.org. Educational Testing Service. Retrieved 25 February 2022.
^ Boyd, W. (1924). Measuring Devices in Composition, Spelling and Arithmetic. London: Harran.
^ Hartog, P. J., Rhodes, E. C., and Burt, C. L. (1936).
^ Wiseman (1949).
^ Brooks, V. (1980). Improving the Reliability of Essay Marking: A Survey of the Literature with Particular Reference to the English Language Composition. Certificate of Secondary Education Research Project Report 5. Leicester: University of Leicester.
^ Hamp-Lyons, L. (2016). "Farewell to Holistic Scoring?" Assessing Writing 27: A1-A2; 29: A1-A5.
^ Diederich, P. B. (1946). "The Measurement of Skill in Writing." School Review 54 (10): 584-592.
^ Haswell and Elliot, pp. 99-109
^ Fuess, C. M. (1950). The College Board: Its First Fifty Years. New York: Columbia University Press.
^ Haswell and Elliot, pp. 160-163.
^ White, E., and English Council of the California State Universities and Colleges, with Friedrich, G., Burbank, R., and Cowell, W. (1973). Comparison and Contrast: The 1973 California State University and Colleges English Equivalency Examination. Los Angeles, CA: Office of the Chancellor, California State University and Colleges. ERIC Document Reproduction Service, ED 114 825.
^ Coward, A. F. (1952). "A Comparison of Two Methods of Grading English Compositions." Journal of Educational Research 46 (2): 81-93.
^ Elliot (2005), pp. 158-165.
^ Rentz, R. R. (1984). "Testing Writing by Writing." Educational Measurement: Issues and Practice 3 (4): 4.
^ McCready, M., and Melton, V. S. (1983). "Issues in Assessing Writing in a State-wide Program". Notes from the National Testing Network in Writing 2: 18, 22. Retrieved 9 January 2022.
^ National Writing Project. (2017). National Writing Project Offers High-Quality Writing Assessment Services. Retrieved 9 January 2022.
^ "University Writing Portfolio". writingprogram.wsu.edu. Washington State University. Retrieved 25 February 2022.
^ Cherry, R. D., and Meyer, P. R. (1993). "Reliability Issues in Holistic Scoring." In M. M. Williamson and B. A. Huot, Validating Holistic Scoring for Writing Assessment: Theoretical and Empirical Foundations, pp. 109-141. Cresskill, NJ: Hampton Press.
^ Williamson, D. M., Xi, X., and Breyer, F. J. (2012). "A Framework for Evaluation and Use of Automated Scoring." Educational Measurement: Issues and Practice 31 (1): 2-13.
^ Swartz, R., Patience, W. M., and Whitney, D. R. (1985). Adding an Essay to the GED Writing Skills Test: Reliability and Validity Issues. GED Testing Service Research Studies No. 7. General Education Testing Service: Washington, D. C. Retrieved 10 January 2022.
^ Hayes, J. R., Hatch, J. A., and Silk, C. M. (2000). "Does Holistic Assessment Predict Writing Performance? Estimating the Consistency of Student Performance on Holistically Scored Writing Assignments." Written Communication 17 (1): 3-26. Wiseman (1949), p. 206.
^ Mellon, J. C. (1972). "Review [of National Assessment of Educational Progress Reports 3 and 5]". Research in the Teaching of English 6 (1): 86-105.
^ Gray, J. R., and Ruth, L. R. (Eds.) (1982). Properties of Writing Tasks: A Study of Alternative Procedures for Holistic Assessment. National Institute of Education final report, NIE-G-80-0034. Berkeley, CA: University of California, Berkeley. ERIC Document Reproduction Service, ED 230 576. Retrieved 12 January 2022.
^ Charney, D. (1984). "The Validity of Using Holistic Scoring to Evaluate Writing: A Critical Overview". Research in the Teaching of English 18 (1): 65-83.
^ Purves, A. C. (1984). "In Search of an Internationally-Valid Scheme for Scoring Compositions". College Composition and Communication 35 (4): 426-438.
^ Hudson, S. A., and Veal, L. R. (1981). An Empirical Investigation of Direct and Indirect Measures of Writing: Report of the 1980-1981 Georgia Competency Based Education Writing Assessment Project. ERIC Document Reproduction Service, ED 205 993.
^ Topol, B., Olson, J., and Roeber, E. (2014). "Pricing study: Machine scoring of student essays". Getting Smart. Retrieved 20 January 2022.
^ Baron, J. B. (1984). "Writing Assessment in Connecticut: A Holistic Eye toward Identification and an Analytic Eye toward Instruction." Educational Measurement: Issues and Practice 3 (1): 27-28, 38.
^ Elliot, N., Plata, M., and Zelhart, P. F. (1990). A Program Development Handbook for the Holistic Assessment of Writing. Lanham, MD: University Press of America.
^ Mullis, I. V. S. (1967). The Primary Trait System for Scoring Writing Tasks. Education Commission of the States: Denver, CO. ERIC Document Reproduction Service, ED 202 761. Retrieved 12 January 2022.
^ Hamp-Lyons, L. (1987). "From Holistic Scoring to Profile Scoring of Specific Academic Writing". Notes from the National Testing Network 7 (7). Retrieved 12 January 2022.
^ Broad, B. (2003). What We Really Value: Beyond Rubrics in Teaching and Assessing Writing. Logan, UT: Utah State University Press.
^ Freedman, S. W. (1979). "How Characteristics of Student Essays Influence Teachers' Evaluations". Journal of Educational Psychology 73 (3): 328-338.
^ Feldman, D. H. (1980). Beyond Universals in Cognitive Development. Norwood, NJ: Ablex.
^ Bever, T. G. (Ed.) (1982). Regression in Mental Development: Basic Phenomena and Theories. Hillsdale, NJ: Erlbaum.
^ Knoblauch, C. H., and Brannon, L. (1984). Rhetorical Traditions and the Teaching of Writing. Upper Montclair, NJ: Boynton/Cook.
^ Wilson, M. (2018), Reimagining Writing Assessment: From Scales to Stories, p. xx. Portsmouth, NH: Heinemann.
^ Gray and Ruth (1982).
^ Wesdorp, H., Bauer, B. A., and Purves, A. C. (1982). "Toward a Conceptualization of the Scoring of Written Composition". Evaluation in Education 5 (3): 299-315.
^ Gere, A. R. (1980). "Written Composition: Toward a Theory of Evaluation." College English 42 (1): 44-58.
^ Raymond, J. C. (1982). "What we Don't Know about the Evaluation of Writing". College Composition and Communication 33 (4): 399-403.
^ Belanoff, P., and Dickson, M., Eds. (1991). Portfolios: Process and Product. Portsmouth, NH: Boynton/Cook Publishers.
^ White, E. M., and Thomas, L. L. (1981). "Racial Minorities and Writing Skills Assessment in the California State University and Colleges". College English 43 (3): 276-283.
^ Shaefer, R., and Rankin, D. (1985), "Statewide Teacher Certification Models". Notes from the National Testing Network in Writing 5: 5-6. Retrieved 12 January 2022.
^ Fallows, J. (1980). "The Test and the 'Brightest': How Fair are the College Boards?" The Atlantic 245 (2): 37-48.
^ Rubin, S. J. (1982). "The Florida College-Level Academic Skills Project: Testing Communication Skills Statewide". Notes from the National Testing Network in Writing 1 (5): 18. Retrieved 12 January 2022.
^ Ruiz, A., and Diaz, D. (1983). "Writing Assessment and ESL Students". Notes from the National Testing Network in Writing 3: 5. Retrieved 12 January 2022.
^ Breland, H., and Ironson, G. H. (1976). "DeFunis Reconsidered: A Comparative Analysis of Alternative Admissions Strategies". Journal of Educational Measurement 13 (1): 89-99.
^ Breland, H. M. (1977). Group Comparisons for the Test of Standard Written English. ERIC Document Reproduction Service ED 146 228. Retrieved 12 January 2022.
^ Poe, M., and Elliot, N. (2019). "Evidence of Fairness: Twenty-five Years of Research in Assessing Writing". Assessing Writing 42: 100418. Retrieved 10 January 2022.
^ "NCEA External Assessment: Grade Score Marking". www.nzqa.govt.nz. Retrieved 2018-12-11.
^ "How the GRE General Test is Scored (For Test Takers)". www.ets.org. Retrieved 2018-12-11.

[1] Nordquist, Richard (March 7, 2017). "What Is Holistic Grading?". ThoughtCo. Retrieved 2018-12-11.

[:1-2] Bishop, Alan; Clements, M. A. (Ken); Keitel-Kreidt, Christine; Kilpatrick, Jeremy; Laborde, Colette (2012). International Handbook of Mathematics Education. Dordrecht: Springer Science & Business Media. pp. 354–355. ISBN 978-94-010-7155-0.

[:0-3] "Know Your Terms: Holistic, Analytic, and Single-Point Rubrics". Cult of Pedagogy. 2014-05-01. Retrieved 2018-12-11.

[4] "Holistic Scoring in More Detail". writing.colostate.edu. Retrieved 2018-12-11.

[5] Cooper, C. R. (1977). "Holistic Evaluation of Writing". In C. R. Cooper and L. Odell (Eds.), Evaluating Writing: Describing, Measuring, Judging, 3-31. Urbana, IL: National Council of Teachers of English.

[6] Myers, M. (1980). A Procedure for Writing Assessment and Holistic Scoring. Urbana, IL: National Council of Teachers of English.

[7] White, E. M. (1986). Teaching and Assessing Writing: Recent Advances in Understanding, Evaluating, and Improving Student Performance. San Francisco: Jossey-Bass Publishers.

[8] Oliveri, M. E., & McCulla, L. (2019). Using the Occupational Network Database to Assess and Improve English Language Communication for the Workplace (Research Report No. RR-19-2). Princeton, NJ: Educational Testing Service.

[9] Ballard, P. B. (1923). The New Examiner. London: University of London Press.

[10] Elliot, N. (2005). On a Scale: A Social History of Writing Assessment in America. New York: Peter Lang.

[11] Weir, C. J., Vidakovic, I., and Galaczi, E. D. (2013). Measured Constructs: A History of Cambridge English Language Examinations 1913-2012. Cambridge, England: Cambridge University Press.

[12] Edgeworth, F. Y. (1888). "The Statistics of Examinations". Journal of the Royal Statistical Society 51: 599-635.

[13] Edgeworth, F. Y. (1890). "The Element of Chance in Competitive Examinations". Journal of the Royal Statistical Society 53: 460-475, 644-663.

[14] Starch, D., and Elliott, E. C. "The Reliability of Grading High School Work". School Review 20: 254-259.

[15] Thomas, C. W., et al. (1931). Examining the Examination in English: A Report to the College Entrance Examination Board. Cambridge, MA: Harvard University Press.

[16] Slomp, D., Corrigan, J., and Sugimoto, T. (2014). "A Framework for Using Consequential Validity Evidence in Evaluating Large-Scale Writing Assessments." Research in the Teaching of English, 48(3): 276-302.

[17] Joughin, Gordon (2008). Assessment, Learning and Judgement in Higher Education. Cham: Springer Science & Business Media. p. 48. ISBN 978-1-4020-8904-6.

[18] Franck, Olof (2017). Assessment in Ethics Education: A Case of National Tests in Religious Education. Cham, Switzerland: Springer. p. 72. ISBN 978-3-319-50768-2.

[19] The following names are taken from Haswell, R., & Elliot, N. (2019). Early Holistic Scoring of Writing: A Theory, a History, a Reflection. Logan, UT: Utah State University Press, pp. 24-25.

[20] Coffman, W. (1971). "On the Reliability of Ratings of Essay Examinations in English". Research in the Teaching of English 5 (1): 24-36.

[21] Hartog, P. J., Rhodes, E. C., and Burt, C. L. (1936). The Marks of Examiners: Being a Comparison of Marks Allotted to Examination Scripts by Independent Examiners and Boards of Examiners, Together with a Section on a Viva Voce Examination. London: Macmillan.

[22] Wiseman, S. (1949). "The Marking of English Composition in Grammar School Selection". British Journal of Educational Psychology 19 (3): 200-209.

[23] Godshalk, F. I., Swineford, F., and Coffman, W. E. (1966). The Measurement of Writing Ability. New York: College Entrance Examination Board.

[24] Elliot (2005), pp. 158-165.

[25] Whithaus, C. (2010). "Distributive Evaluation". WPA-CompPile Research Bibliography, No. 3. WPA-CompPile Bibliographies.

[26] Eley, E. G. (1956). "Testing the Language Arts." Modern Language Journal 40 (6): 310-315.

[27] Scriven, M. (1974). "Checklist for the Evaluation of Products, Producers, and Proposals." In W. J. Popham (Ed.), Evaluation in Education: Current Applications, 7-33. Berkeley, CA: McCutchan Publishing Co.

[28] Bossone, R. M. (1969). The Writing Problems of Remedial English Students in Community College of the City University of New York. New York: CUNY Research and Evaluation Unit for Special Programs. ERIC Document Reproduction Service, ED 028 778.

[29] Diederich, P. B. (1966). "How to Measure Growth in Writing Ability." English Journal 55 (4): 435-449.

[30] White (1985), pp. 23-26.

[31] Haswell and Elliot (2019), pp. 99-109.

[32] Diederich, P. B. (1946). "The Measurement of Skill in Writing." School Review 54 (10): 584-592.

[33] Deane, P., Williams, F., Weng, V., and Trapani, C. S. (2013). "Automated Essay Scoring in Innovative Assessments of Writing from Sources". Journal of Writing Assessment 6 (2013). Retrieved 10 January 2022.

[34] Educational Testing Service. Frequently Asked Questions About the e-rater Scoring Engine. Retrieved 9 January 2021.

[35] Weir, Vidakovic, and Galaczi, p. 201.

[36] Office of Qualifications and Examinations. (2014). Review of Double Marking Research, p. 10. Coventry: Ofqual.

[37] "TOEFL iBT Test Writing Section". www.ets.org. Educational Testing Service. Retrieved 25 February 2022.

[38] Boyd, W. (1924). Measuring Devices in Composition, Spelling and Arithmetic. London: Harrap.

[39] Hartog, P. J., Rhodes, E. C., and Burt, C. L. (1936).

[40] Wiseman, S.

[41] Brooks, V. (1980). Improving the Reliability of Essay Marking: A Survey of the Literature with Particular Reference to the English Language Composition. Certificate of Secondary Education Research Project Report 5. Leicester: University of Leicester.

[42] Hamp-Lyons, L. (2016). "Farewell to Holistic Scoring?" Assessing Writing 27: A1-A2; 29: A1-A5.

[43] Diederich, P. B. (1946). "Measurement of Skill in Writing." School Review 54 (10): 584-592.

[44] Haswell and Elliot, pp. 99-109.

[45] Fuess, C. M. (1950). The College Board: Its First Fifty Years. New York: Columbia University Press.

[46] Haswell and Elliot, pp. 160-163.

[47] White, E., and English Council of the California State Universities and Colleges, with Friedrich, G., Burbank, R., and Cowell, W. (1973). Comparison and Contrast: The 1973 California State University and Colleges English Equivalency Examination. Los Angeles, CA: Office of the Chancellor, California State University and Colleges. ERIC Document Reproduction Service, ED 114 825.

[48] Coward, A. F. (1952). "A Comparison of Two Methods of Grading English Compositions." Journal of Educational Research 46 (2): 81-93.

[49] Elliot (2005), pp. 158-165.

[50] Rentz, R. R. (1984). "Testing Writing by Writing." Educational Measurement: Issues and Practice 3 (4): 4.

[51] McCready, M., and Melton, V. S. (1983). "Issues in Assessing Writing in a State-wide Program". Notes from the National Testing Network in Writing 2: 18, 22. Retrieved 9 January 2022.

[52] National Writing Project. (2017). National Writing Project Offers High-Quality Writing Assessment Services. Retrieved 9 January 2022.

[53] "University Writing Portfolio". writingprogram.wsu.edu. Washington State University. Retrieved 25 February 2022.

[54] Cherry, R. D., and Meyer, P. R. (1993). "Reliability Issues in Holistic Scoring." In M. M. Williamson and B. A. Huot, Validating Holistic Scoring for Writing Assessment: Theoretical and Empirical Foundations, pp. 109-141. Cresskill, NJ: Hampton Press.

[55] Williamson, D. M., Xi, X., and Breyer, F. J. (2012). "A Framework for Evaluation and Use of Automated Scoring." Educational Measurement: Issues and Practice 31 (1): 2-13.

[56] Swartz, R., Patience, W. M., and Whitney, D. R. (1985). Adding an Essay to the GED Writing Skills Test: Reliability and Validity Issues. GED Testing Service Research Studies No. 7. General Education Testing Service: Washington, D. C. Retrieved 10 January 2022.

[57] Hayes, J. R., Hatch, J. A., and Silk, C. M. (2000). "Does Holistic Assessment Predict Writing Performance? Estimating the Consistency of Student Performance on Holistically Scored Writing Assignments." Written Communication 17 (1): 3-26. [1] Wiseman (1949), p. 206.

[58] Mellon, J. C. (1972). "Review [of National Assessment of Educational Progress Reports 3 and 5]". Research in the Teaching of English 6 (1): 86-105.

[59] Gray, J. R., and Ruth, L. R. (Eds.) (1982). Properties of Writing Tasks: A Study of Alternative Procedures for Holistic Assessment. National Institute of Education final report, NIE-G-80-0034. Berkeley, CA: University of California, Berkeley. ERIC Document Reproduction Service, ED 230 576. Retrieved 12 January 2022.

[60] Charney, D. (1984). "The Validity of Using Holistic Scoring to Evaluate Writing: A Critical Overview". Research in the Teaching of English 18 (1): 65-83.

[61] Purves, A. C. (1984). "In Search of an Internationally-Valid Scheme for Scoring Compositions". College Composition and Communication 35 (4): 426-438.

[62] Hudson, S. A., and Veal, L. R. (1981). An Empirical Investigation of Direct and Indirect Measures of Writing: Report of the 1980-1981 Georgia Competency Based Education Writing Assessment Project. ERIC Document Reproduction Service, ED 205 993.

[63] Topol, B., Olson, J., and Roeber, E. (2014). "Pricing study: Machine scoring of student essays". Getting Smart. Retrieved 20 January 2022.

[64] Baron, J. B. (1984). "Writing Assessment in Connecticut: A Holistic Eye toward Identification and an Analytic Eye toward Instruction." Educational Measurement: Issues and Practice 3 (1): 27-28, 38.

[65] Elliot, N., Plata, M., and Zelhart, P. F. (1990). A Program Development Handbook for the Holistic Assessment of Writing. Lanham, MD: University Press of America.

[66] Mullis, I. V. S. (1967). The Primary Trait System for Scoring Writing Tasks. Education Commission of the States: Denver, CO. ERIC Document Reproduction Service, ED 202 761. Retrieved 12 January 2022.

[67] Hamp-Lyons, L. (1987). "From Holistic Scoring to Profile Scoring of Specific Academic Writing". Notes from the National Testing Network in Writing 7 (7). Retrieved 12 January 2022.

[68] Broad, B. (2003). What We Really Value: Beyond Rubrics in Teaching and Assessing Writing. Logan, UT: Utah State University Press.

[69] Freedman, S. W. (1979). "How Characteristics of Student Essays Influence Teachers' Evaluations". Journal of Educational Psychology 73 (3): 328-338.

[70] Feldman, D. H. (1980). Beyond Universals in Cognitive Development. Norwood, NJ: Ablex.

[71] Bever, T. G. (Ed.) (1982). Regression in Mental Development: Basic Phenomena and Theories. Hillsdale, NJ: Erlbaum.

[72] Knoblauch, C. H., and Brannon, L. (1984). Rhetorical Traditions and the Teaching of Writing. Upper Montclair, NJ: Boynton/Cook.

[73] Wilson, M. (2018). Reimagining Writing Assessment: From Scales to Stories, p. xx. Portsmouth, NH: Heinemann.

[74] Gray and Ruth (1982).

[75] Wesdorp, H., Bauer, B. A., and Purves, A. C. (1982). "Toward a Conceptualization of the Scoring of Written Composition". Evaluation in Education 5 (3): 299-315.

[76] Gere, A. R. (1980). "Written Composition: Toward a Theory of Evaluation." College English 42 (1): 44-58.

[77] Raymond, J. C. (1982). "What We Don't Know about the Evaluation of Writing". College Composition and Communication 33 (4): 399-403.

[78] Belanoff, P., and Dickson, M. (Eds.) (1991). Portfolios: Process and Product. Portsmouth, NH: Boynton/Cook Publishers.

[79] White, E. M., and Thomas, L. L. (1981). "Racial Minorities and Writing Skills Assessment in the California State University and Colleges". College English 43 (3): 276-283.

[80] Shaefer, R., and Rankin, D. (1985). "Statewide Teacher Certification Models". Notes from the National Testing Network in Writing 5: 5-6. Retrieved 12 January 2022.

[81] Fallows, J. (1980). "The Test and the 'Brightest': How Fair are the College Boards?" The Atlantic 245 (2): 37-48.

[82] Rubin, S. J. (1982). "The Florida College-Level Academic Skills Project: Testing Communication Skills Statewide". Notes from the National Testing Network in Writing 1 (5): 18. Retrieved 12 January 2022.

[83] Ruiz, A., and Diaz, D. (1983). "Writing Assessment and ESL Students". Notes from the National Testing Network in Writing 3: 5. Retrieved 12 January 2022.

[84] Breland, H., and Ironson, G. H. (1976). "DeFunis Reconsidered: A Comparative Analysis of Alternative Admissions Strategies". Journal of Educational Measurement 13 (1): 89-99.

[85] Breland, H. M. (1977). Group Comparisons for the Test of Standard Written English. ERIC Document Reproduction Service ED 146 228. Retrieved 12 January 2022.

[86] Poe, M., and Elliot, N. (2019). "Evidence of Fairness: Twenty-five Years of Research in Assessing Writing". Assessing Writing 42: 100418. Retrieved 10 January 2022.

[87] "NCEA External Assessment: Grade Score Marking". www.nzqa.govt.nz. Retrieved 2018-12-11.

[88] "How the GRE General Test is Scored (For Test Takers)". www.ets.org. Retrieved 2018-12-11.
