One day, one test: Keeping score isn’t as easy as 1, 2, 3
Calculations about what it means to read at “grade level” can be confusing, and the complicated formulas can penalize children who may be on the cusp of meeting the standard. (Mark Weber/The Daily Memphian file)
In partnership with
The Institute for Public Service Reporting
The Institute for Public Service Reporting is based at the University of Memphis and supported financially by U of M, private grants and donations made through the University Foundation. Its work is published by The Daily Memphian through a paid-use agreement.
Of all the concerns about Tennessee’s new third-grade retention policies — policies that could result in thousands of local students being held back in the summer of 2023 — one of the most difficult issues to understand is the arcane way that standardized test scores are calculated.
So confusing are these calculations that there is widespread misunderstanding about what it even means to read at “grade-level.”
And maybe most concerning of all, these complicated formulas can penalize children just on the cusp of meeting the standard.
“A lot of those students just miss hitting the proficient mark by one or two questions and are labeled as significantly deficient in this bill,” Maryville City Schools Director Mike Winstead said after the new laws were passed in 2021.
One day, one test
Thousands of third-grade students in traditional public and charter schools in the Memphis area could be held back in the summer of 2023 as a result of a new Tennessee law focused on standardized reading tests. But similar laws in states across the country have shown mixed results. And in Tennessee, the law gives more power to a testing system whose methodology is widely questioned, whose approach largely ignores children with learning disabilities and emotional and socioeconomic challenges, and whose approach to scoring leaves experts frustrated and confused.
Read the full series:
New reading retention law goes into effect this month
Suburban superintendents wary of retention law
Here’s why some children struggle to read
Learning from the Mississippi Miracle
How the new retention law overlooks children with learning disabilities
That’s just one of the concerns school officials have with the new third-grade retention law: Large-scale standardized achievement tests don’t measure “grade level” performance.
Defining ‘grade level’
“The legislation is attempting to address third graders who can’t read at grade level, but the TCAP test doesn’t test to see if students can read at grade level,” Lakeland Supt. Ted Horrell told The Daily Memphian in 2021.
TNReady tests measure “performance levels” based on educated but very subjective judgments. They don’t determine whether students are reading at grade level.
The same is true of the National Assessment of Educational Progress (NAEP), which is divided into three performance levels — Basic, Proficient and Advanced.
NAEP results are reported as a percentage of students who read “at or above proficient.” Two-thirds of fourth graders nationally and in Tennessee fall below that standard.
“ ‘Proficient’ on NAEP does not mean grade level performance. It’s significantly above that,” wrote Tom Loveless, director of the Brown Center on Education Policy at the Brookings Institution. “Using NAEP’s proficient level as a basis for education policy is a bad idea.”
Lakeland Schools Superintendent Ted Horrell (at the grand opening of Lakeland Preparatory High School, Saturday, July 30, 2022) said in 2021, “The legislation is attempting to address third graders who can’t read at grade level, but the TCAP test doesn’t test to see if students can read at grade level.” (Lucy Garrett/Special to The Daily Memphian)
The National Academy of Sciences agrees.
“NAEP’s current achievement-level-setting procedures remain fundamentally flawed,” the academy concluded in 1999. “The judgment tasks are difficult and confusing; raters’ judgments of different item types are internally inconsistent; appropriate validity evidence for the cut scores is lacking; and the process has produced unreasonable results.”
The confusion is compounded by the way test scores are reported, which can lead to misunderstandings about what reading at “grade level” really means.
“Grade level has been defined as the average reading achievement at any particular grade,” wrote Richard Allington, a nationally known reading researcher at the University of Tennessee, and former president of the International Literacy Association. “As with any average, half the population is, by definition, above average and half is below average.”
But a “below-average” reader might be just slightly below or way below.
The same holds true for the state’s TNReady tests.
Scores may vary
Every spring, students in grades 3-8 take TNReady achievement tests in reading, math, social studies and science.
The tests are timed. Third graders get 195 minutes for the reading exam (known formally as the English language arts test), 115 minutes for math, 104 minutes for science and 50 minutes for social studies.
That adds up to 464 minutes of testing, or about 7¾ hours, spread over several school days. Each test is broken into parts that last no longer than 50 minutes.
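The total is easy to check. The short Python sketch below simply adds the four time limits listed above and converts the sum to hours; the names `MINUTES` and `total_testing_time` are illustrative only.

```python
# Minutes allotted per third-grade TNReady subject, as listed in the article.
MINUTES = {"reading": 195, "math": 115, "science": 104, "social studies": 50}


def total_testing_time(minutes_by_subject):
    """Return (total minutes, total hours) across all subjects."""
    total = sum(minutes_by_subject.values())
    return total, total / 60


total_minutes, total_hours = total_testing_time(MINUTES)
print(f"{total_minutes} minutes = {total_hours:.2f} hours")
# prints: 464 minutes = 7.73 hours
```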
On the reading test, students are asked to read brief passages of text, then answer multiple-choice questions about the text.
‘Grade-level’ and ‘proficient’
TNReady scores are cut into four levels that measure a student’s understanding and ability “to apply the grade/course level knowledge and skills defined by the Tennessee academic standards.”
Level 4 (Mastered) demonstrates “an extensive understanding and expert ability.”
Level 3 (On-Track) demonstrates a “comprehensive understanding and thorough ability.”
Level 2 (Approaching) demonstrates “approaching understanding and partial ability.”
Level 1 (Below) demonstrates “minimal understanding and ability.”
According to the new law, third-grade students “achieving a performance level rating of ‘Approaching’ or ‘Below’ on the reading portion of the student’s most recent TCAP test” will be held back.
Students achieving “On-Track” or “Mastered” levels are promoted to fourth grade.
Based on 2022 TCAP results, about 23% of Tennessee’s third graders scored “Below” grade-level expectations in reading, and about 41% scored “Approaching.”
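As a rough illustration, the rule as quoted can be written as a simple lookup. The sketch below covers only the language cited here and ignores any exemptions or alternatives the law may provide; the function name `flagged_for_retention` is hypothetical, and the percentages are the approximate 2022 figures given above.

```python
# Performance levels named in the article, lowest to highest.
LEVELS = ("Below", "Approaching", "On-Track", "Mastered")
# Levels the quoted statute language singles out for retention.
RETENTION_LEVELS = frozenset({"Below", "Approaching"})


def flagged_for_retention(level: str) -> bool:
    """Return True if a reading level falls under the retention rule as quoted."""
    if level not in LEVELS:
        raise ValueError(f"unknown performance level: {level!r}")
    return level in RETENTION_LEVELS


# Approximate 2022 third-grade reading shares cited in the article (percent).
SHARES_2022 = {"Below": 23, "Approaching": 41}
share_flagged = sum(
    pct for level, pct in SHARES_2022.items() if flagged_for_retention(level)
)
print(share_flagged)  # 64
```

Taken together, roughly 64% of third graders scored in one of the two levels named in the law.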
Students who score Level 3 “On Track” are generally working above grade level, school officials say.
“Indeed, a high level of performance is required to achieve the ‘On Track’ rating,” Dale Lynch, executive director of the Tennessee superintendents group, wrote in 2021. “Even third graders who are performing at grade level could be subject to retention. A lot of our teachers believe you can be proficient but still be categorized as ‘Approaching’ grade level.”
It’s also notable that standardized achievement tests like TNReady aren’t graded like regular classroom tests.
Consider a regular classroom test. If a third grader correctly answers 85 out of 100 questions, the student’s “raw score” of 85% is converted directly to a letter grade. A score between 80% and 90% generally converts to a B.
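The classroom conversion is simple enough to write down. The sketch below assumes a common 10-point grading scale (90 and up is an A, 80 to 89 a B, and so on); that scale and the name `letter_grade` are assumptions for illustration, since individual teachers and districts set their own.

```python
from bisect import bisect_right

# Hypothetical classroom scale: the lowest percentage that earns each grade.
GRADE_FLOORS = [60, 70, 80, 90]
GRADES = ["F", "D", "C", "B", "A"]


def letter_grade(correct: int, total: int) -> str:
    """Convert a raw count of correct answers to a letter grade."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total, and total > 0")
    percent = 100 * correct / total
    # bisect_right counts how many grade floors the percentage has reached.
    return GRADES[bisect_right(GRADE_FLOORS, percent)]


print(letter_grade(85, 100))  # B
```

The point of the contrast is that the raw score maps straight to the grade, with no adjustment for how hard this particular test was.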
However, on standardized tests, “a raw score by itself has no meaning ... because tests may differ in content and difficulty (year to year),” the Mississippi Department of Education explains on its website.
Changing the questions and cutting the scores
Yet another challenge for standardized tests is that test makers change up to a third of the questions every year. Otherwise, the answers to last year’s questions might find their way into this year’s classrooms.
That’s why, each year, test makers try to write new questions that are equally challenging. Yet that also means that this year’s questions might randomly be a tad easier, or a tad more difficult.
“On some tests, the student is lucky (knows more answers) and gets higher scores. On other tests, the student is unlucky (knows fewer answers) and gets lower scores,” Mississippi explains.
In other words, a third grader who scores a 69 on this year’s reading questions might have scored a 70 on last year’s questions. What does that mean? If test results aren’t “equalized” from year to year, that one point could be the difference between being promoted to fourth grade or not.
And there are even more elements of subjectivity built into the testing process.
For instance, to ensure that each year’s questions are equally valid, testing officials “cut” raw scores into scale scores. This means that if a third-grader correctly answers 85% of the questions on the TNReady test, his raw score of 85 will be converted or “cut” into “performance levels.”
The “cut score” separates one performance level from another.
The old TCAP test cut scores into four “performance” categories: below, basic, proficient and advanced.
The current TNReady cuts scores into below, approaching, on-track and mastered categories.
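To see how much rides on a cut score, consider the sketch below. The three cut values are invented purely for illustration and are not Tennessee’s actual cut scores; only the four level names come from TNReady. The function name `performance_level` is likewise hypothetical.

```python
from bisect import bisect_right

# TNReady level names, lowest to highest.
LEVELS = ["Below", "Approaching", "On-Track", "Mastered"]
# HYPOTHETICAL cut scores: the lowest score that reaches each higher level.
HYPOTHETICAL_CUTS = [40, 70, 90]


def performance_level(score: float, cuts=HYPOTHETICAL_CUTS) -> str:
    """Map a score to a performance level using ascending cut scores."""
    if list(cuts) != sorted(cuts) or len(cuts) != len(LEVELS) - 1:
        raise ValueError("need three ascending cut scores")
    # Each cut the score has reached moves the student up one level.
    return LEVELS[bisect_right(cuts, score)]


print(performance_level(69))  # Approaching
print(performance_level(70))  # On-Track
```

With these made-up cuts, a single point separates a student labeled “Approaching” from one labeled “On-Track.”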
“There is no ‘right’ way to set cut scores, and different methods have various strengths and weaknesses,” wrote Andrew J. Rotherham, co-founder and co-director of Education Sector.
Or, as Boston College explained in its “Scores May Vary” study of cut scores: “Regardless of the method, the cut-score setting process is subjective.”
Tennessee’s ‘bookmarked’ scores
Experts also question other parts of Tennessee’s TNReady approach to standardized tests. One is an arcane standard-setting technique grounded in item response theory and more commonly known as the Bookmark method, developed by CTB/McGraw-Hill in 1996.
Under the Bookmark method, test questions are ordered along a scale of difficulty, from easiest to hardest, based on that year’s actual scores.
A question that 99% of students answered correctly would be considered least difficult.
A question that 1% of students answered correctly would be considered most difficult.
A committee of 16 or so judges (all certified grade-level teachers) evaluates each question along the scale of difficulty.
Then the judges “bookmark” the question they believe separates one performance level from another — based on the definitions of each level.
This process continues for three rounds. Only then does the cutting of scores come back into play: the final cut score is the median of the judges’ bookmarks.
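The mechanics described above can be sketched in a few lines. This is a simplified illustration, not the state’s actual procedure: the question labels, percentages and judges’ bookmarks are invented, the function names are hypothetical, and the sketch stops at a position in the ordered list of questions rather than translating it into a scale score.

```python
from statistics import median


def order_by_difficulty(percent_correct: dict) -> list:
    """Order questions easiest to hardest (highest share correct first)."""
    return sorted(percent_correct, key=percent_correct.get, reverse=True)


def bookmark_cut(final_round_bookmarks: list) -> float:
    """Return the median of the judges' final-round bookmark positions."""
    if not final_round_bookmarks:
        raise ValueError("at least one bookmark is required")
    return median(final_round_bookmarks)


# Invented example: share of students answering each question correctly.
percent_correct = {"q1": 0.99, "q2": 0.62, "q3": 0.85, "q4": 0.01, "q5": 0.40}
print(order_by_difficulty(percent_correct))
# ['q1', 'q3', 'q2', 'q5', 'q4']

# Invented example: where five judges placed one bookmark in the final round.
print(bookmark_cut([2, 3, 3, 4, 3]))  # 3
```

Using the median rather than the average means one judge with an unusually high or low bookmark has little pull on the final cut.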
David Waters
David Waters is Distinguished Journalist in Residence and assistant director of the Institute for Public Service Reporting at the University of Memphis.