I can tell you when I began losing faith in standardized testing: August 11, 2009. That was the date of my GRE. My writing prompt was as follows:
Explain the causes of war.
Wow! This could not have been any more perfect. Here I was taking the GRE so I could go to grad school and write a dissertation on the causes of war. ETS, the company that administers the GRE, threw me a softball!
Then I received my score: 4.5. While not terrible, the 4.5 corresponded to the 63rd percentile. According to ETS (roughly), more than a third of GRE takers could write my dissertation better than I could!
Maybe the essay was not that great. Maybe I am a terrible writer. Maybe I don’t understand the causes of war. The academic job market will likely be the ultimate arbiter of my abilities. But until then, the 4.5 seems silly.
Years later, I came across this revealing article on standardized testing graders’ incentive structure. Many tests use some sort of consensus method. Begin by giving the test to two graders. If the marks are close, average the grades and move to the next exam. If the marks are not close, bring in a third grader and average the three grades in some pre-defined way.
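To make the consensus method concrete, here is a minimal Python sketch. The names `consensus_grade` and `tolerance` are my own, the one-point cutoff for "close" is an assumption, and a plain average stands in for the unspecified "pre-defined way" of combining three grades. Real testing companies may well do something different.

```python
def consensus_grade(first, second, third_reader=None, tolerance=1.0):
    """Return (final_grade, readers_used) under a simple consensus rule.

    Assumptions (not taken from any real testing company):
    - two marks are "close" if they differ by at most `tolerance`
    - three marks are combined with a plain average
    """
    if abs(first - second) <= tolerance:
        # The two readers agree closely enough: average and move on.
        return (first + second) / 2, 2

    if third_reader is None:
        raise ValueError("marks disagree and no third reader is available")

    # The marks are too far apart: pay for a third reading.
    third = third_reader()
    return (first + second + third) / 3, 3


# Two readers who are one point apart: no third reader is needed.
agreed = consensus_grade(4, 5)

# Two readers who are four points apart: a third reader is called in.
disputed = consensus_grade(2, 6, third_reader=lambda: 4)
```

The second element of each result counts how many readers the test consumed, which is the number the company cares about.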
Standardized testing companies are not in the business of giving correct grades–they are in the business of grading tests as quickly as possible. The potential for a third grader merely appeases a school’s desire to have some semblance of legitimacy. For the company, the third grader is a speed bump. Every test that reaches a third reader requires 50% more labor–and a non-negligible decrease in profitability for that particular test.
Realizing this problem, testing companies pay careful attention to their graders’ agreement rate. A low agreement rate is the mark of a bad employee. Supervisors might creatively limit the number of tests such an employee can grade (thus inflating the supervisor’s overall agreement rate for his or her team), while managers might fire him or her.
The testing companies deserve respect for creating mechanisms to keep employees in line with company goals. Unfortunately, those company goals do not comport with what we as consumers want out of the standardized testing scores we buy.
I have loved game shows all my life. One of my earliest memories is of watching Family Feud. The game is simple: producers survey 100 people with a variety of questions. They then tally the responses. Contestants on the show must then guess which answer was most frequently given.
Note the emphasis here is on matching, not correctness. For example, suppose the prompt was “name a cause of war.” As a contestant, I would say “greed” or “irrationality” way before I said “private information” or “commitment problems.” The first two responses are terrible, terrible answers. The last two are fantastic. Yet, because your average survey taker has not read “Rationalist Explanations for War,” I would not expect many people to give sensible responses to the survey. So commitment problems would yield a smaller score than greed. In turn, I as a contestant pander to their ignorance and say the silly thing.
Suppose you are a test grader. Better yet, say you are the test grader. God has endowed you with absolute authority in test grading matters. You know a 10 essay is a 10 essay and a 1 essay is a 1 essay. Whereas others struggle to see the difference, your observations are perfect.
But you are also broke and need a job. I hand you a test. You recognize it is a clear 10. What grade do you give it?
In a world of justice, your answer is 10. But in a world where you need to eat, you take a different route. Perhaps the essay did something strange, something you have never seen before–something like arguing that bargaining problems cause war. You recognize that such an argument reflects scholarly consensus and would be the baseline for tenure at Stanford. But you also know that other graders will think that the argument is just plain bizarre. So you credit the writer for having decent organizational structure but not much substance and turn in a 7.
The other grader gives it a 6.5. The system counts it as a match and you do not get into trouble.
Grading systems are not grading systems–they are coordination games with multiple equilibria. As in Family Feud, a reader should not give the grade the writer actually deserves but rather the grade he or she thinks other readers will give it.
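For the game theory inclined, here is a toy version of that coordination game in Python. Everything in it is a simplifying assumption of mine: two graders, whole-number scores from 1 to 10, and a payoff of 1 for an exact match and 0 otherwise. The essay's true quality never enters the payoff function, which is the whole point.

```python
from itertools import product

SCORES = range(1, 11)  # hypothetical 1-10 grading scale


def payoff(mine, other):
    """A grader is rewarded for matching the other grader, nothing else."""
    return 1 if mine == other else 0


def is_nash(a, b):
    """True if neither grader can do better by unilaterally changing scores."""
    best_for_a = max(payoff(s, b) for s in SCORES)
    best_for_b = max(payoff(s, a) for s in SCORES)
    return payoff(a, b) == best_for_a and payoff(b, a) == best_for_b


# Enumerate every pair of scores and keep the pure-strategy equilibria.
equilibria = [(a, b) for a, b in product(SCORES, repeat=2) if is_nash(a, b)]
```

Under these assumptions, every pair of matching scores is an equilibrium: both graders turning in a 2 is just as stable as both turning in a 10, no matter what the essay deserves.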
But this leads to perverse equilibria. For example, if we all created the rule that essays beginning with a vowel are 10s and essays beginning with consonants are 2s, no one would want to break the system and risk the wrath of being labeled inefficient. So substance goes out the window. Graders instead look for focal points to coordinate their scores.
Those cues need not reflect anything of substance. To wit, consider the following review of a supervisor’s team:
[A] representative from a southeastern state’s Department of Education visited to check on how her state’s essays were doing. As it turned out, the answer was: not well. About 67 percent of the students were getting 2s.
That’s when the representative informed Farley that the rubric for her state’s scoring had suddenly changed.
“We can’t give this many 1s and 2s,” she told him firmly.
The scorers would not be going back to re-grade the hundreds of tests they’d already finished—there just wasn’t time. Instead, they were just going to give out more 3s.
3s magically appeared out of nowhere–partially because the testing company wanted a higher average, and partially because graders feared that giving a 1 or 2 would result in a mismatch and discipline from the supervisor.
The article is full of other lovely anecdotes and worth the read. And it is also terrifying to think about.
In retrospect, I should not have written about how bargaining problems cause war. From the point of view of a grader, it is just too bizarre. I deserve my 4.5 for not properly playing the game.
(I suppose three years of grad school have made me more cynical about life. Perhaps a better subtitle for this article is “How I Learned to Stop Worrying and Love Pandering.”)
Of course, that is precisely the problem. The incentive system for grading is perverse and rewards students for writing safe–and not particularly insightful–essays. In contrast, the academic job market rewards the opposite approach. Write a dissertation that has been done a thousand times before, and you won’t find employment. Do something revolutionary, and have your pick of the top jobs.
The test companies do not have final say over how standardized tests are graded. We do. It’s time we demand change in the system.