How Standardized Tests Are Scored (Hint: Humans Are Involved)

By Claudio Sanchez

Published July 8, 2015 at 4:30 PM EDT

We know very little about what goes into standardized tests, who really designs them, and how they're scored.

Standardized tests tied to the Common Core are under fire in lots of places for lots of reasons. But who makes them and how they're scored is a mystery.

For a peek behind the curtain, I traveled to the home of the nation's largest test-scoring facility: San Antonio.

The facility belongs to Pearson, the British-owned company that dominates the testing industry in the U.S. and is one of the largest publishing houses behind these mysterious standardized tests.

Scorers look over PARCC tests at a Pearson facility in San Antonio, Texas.

The company scores its test results in 21 centers across the country. The one in San Antonio is the largest.

The building is located in an office park right off a long stretch of highway northeast of the city.

Inside the cavernous 58,000-square-foot space, folding tables connected end to end fill the room. About 100 scorers from all walks of life work in silence, two per table.

They're here scoring PARCC tests, which are aligned with the Common Core Standards and meant to replace old state exams. PARCC, short for Partnership for Assessment of Readiness for College and Careers, is one of two testing consortia that have helped states develop these new tests with $360 million in federal funding.

More than 5 million elementary and high school students took the PARCC test this school year in math, reading and writing.

In the facility, scorers' eyes are glued to computer screens displaying students' work from 10 states and the District of Columbia. These states belong to a consortium that pays Pearson $129 million to write the test questions and score the results.

Donna Vickers is one of the Pearson employees scoring PARCC tests.

Donna Vickers, a retired elementary school teacher who has worked for Pearson for eight years now, says the writing portion of this test must be scored by humans, not machines.

"I'm scoring third-grade compositions, probably four questions out of maybe 50 questions, a very small portion of an entire test," Vickers says.

She looks for evidence that students understood what they read, that their writing is coherent and that they used proper grammar. But it's actually not up to Vickers to decide what score a student deserves. Instead, she relies on a three-ring binder filled with "anchor papers." These are samples of students' writing that show what a low-score or a high-score response looks like.

"I compare the composition to the anchors and see which score does the composition match more closely," Vickers says.

Pearson does not allow reporters to describe or provide examples of what students wrote because otherwise, company officials say, everybody would know what's on the test.

So here's a writing exercise Pearson did approve:

It's from a book titled Eliza's Cherry Trees: Japan's Gift to America. The task is for third-graders to describe how Eliza faced challenges to change something in America. Students must identify the main idea, draw evidence from the text and provide supporting details in what they write.

"Scoring supervisors" then make sure that the final scores are not out of whack with the so-called "true" scores from those anchor papers we mentioned earlier. Speed is also a concern, says Bob Sanders, Pearson's director of performance scoring.

"We monitor to make sure they're not scoring too fast, or too slow."

Sanders says some people need more training than others, but if scorers repeatedly fall short of the company's performance guidelines, they're fired. Since April, this scoring center has let 51 scorers go. They've not been hard to replace, though. Pay isn't bad — $12 to $15 an hour if you include bonuses. People without a four-year college degree need not apply.

Pearson officials say that last year, most of the 14,000 people it hired to score tests had at least one year of teaching experience.

But it's not required. So the job has attracted all kinds of folks. Many are stay-at-home parents and retired military who are allowed to work from home.

Then there are people like the ones I met: a former lawyer, a retired longshoreman and a bouncer who handles crowd control at concerts.

There are also people like Pat Squires, a college professor. She's been scoring tests for Pearson and other companies since 2002. She says some people approach the job thinking that scoring and grading are the same. They're not.

"In grading, oftentimes what we're doing is looking at what a student is doing and maybe marking them off for what they're not doing correctly," Squires says.

"But in scoring, you're working against a standard and you're looking to see what students have done correctly."

Still, some scorers point out that what test-makers say a third- or fifth-grader should be able to do sometimes doesn't seem right.

"We don't know how they decided whether this is a third-grade capable response," says Joe Bowker.

He did student-teaching in college and has been scoring tests on and off for several years.

"You have to leave your opinion outside the door," he says.

David Connerty-Marin, a spokesman for PARCC, says it's not up to a scorer or Pearson or PARCC to say, "Gee, we think this is too hard for a fourth-grader."

What is or is not developmentally appropriate, he says, is not an issue because the states have already made that decision based on the Common Core Standards.

"The states, with lots of educators, have reviewed the material and said, 'This is appropriate or not appropriate to the standards,' " says Connerty-Marin. "Our job is to write the test questions that measure whether the student is meeting those standards."

This week Pearson is supposed to wrap up its work on this batch of reading and writing tests. The client states will then get the raw scores, and together they must all agree on the same cut scores to determine which students are at grade level and which ones are not.

Andrew Thompson, the Pearson official who oversees the delivery of these raw scores to the states, says the crucial question is this: Will educators, parents and the public at large trust the results?

"They don't know what we're doing, so there's a lot of misconception about what we do," he says, "and we don't have a way right now to refute that [misconception] and show this is really what we're doing."

Most Americans have been in the dark, says Thompson. So the risk for Pearson, PARCC and the states is that by trying to be more transparent this late in the game, people may very well end up with more questions than answers.