By Michael Lillis
President, Lakeland Federation of Teachers
Dear Future New York State Education Commissioner:
Welcome and congratulations on being hired for such an important position. We wish you the very best success, as the hopes and dreams of millions of school children, their parents, and their teachers hang in the balance.
You are inheriting a department that has operated without the transparency, respect, or responsiveness stakeholders deserve. No area represents this debacle more than the 3-8 Math and ELA assessments. It is important that you quickly become familiar with this problem, and we do not believe you can rely on your staff to inform you of the most salient issues. This is partly because the source of the problem goes back to 2013, and the State Education Department has had extraordinary turnover that has disrupted crucial institutional memory. There has also been a marked lack of interest among SED's assessment staff in having an honest discussion with stakeholders about the serious assessment issues we face.
Issues concerning standardized tests are complex, and it is even debatable whether we should be using annual tests, grade level bracketed tests, or any tests at all. Though the issues concerning the state’s 3-8 assessments are vast – and we encourage you to cast a wide net as you seek input from across the state – we will focus for now on a single aspect that is crucial to fostering equity within our classrooms – the testing benchmarks.
The cut points determine who will receive a 1, 2, 3, or 4 on the test, and they are based on benchmarks established by NYSED. The benchmarks behind the cut points are the hardest part for the general public to understand, and they reveal NYSED’s deeply flawed metrics hiding behind complicated statistics and pretentious language. These metrics do real damage to children, because parents and educators misinterpret the results, assuming that a score of 3 means “performing at grade level,” which it does not. To parents, the most important question is whether their child is developing at grade level. Our 3-8 tests could have measured this, but as you will see, NYSED chose a different, more detrimental direction.
To begin, how do we define what a score of 1, 2, 3, or 4 even means on these assessments? The most recent technical report we have was issued in May of 2019, and it covers the 2017 assessments:
Student performance is classified as Level I, Level II, Level III, or Level IV for the Grades 3–8 ELA and Mathematics Tests. The definitions of performance levels are as follows:
- NYS Level I: Students performing at this level are well below proficient in standards for their grade. They demonstrate limited knowledge, skills, and practices embodied by the New York State P–12 Learning Standards for English Language Arts/Literacy or Mathematics that are considered insufficient for the expectations at this grade.
- NYS Level II: Students performing at this level are below proficient in standards for their grade. They demonstrate knowledge, skills, and practices embodied by the New York State P–12 Learning Standards for English Language Arts/Literacy or Mathematics that are considered partial but insufficient for the expectations at this grade.
- NYS Level III: Students performing at this level are proficient in standards for their grade. They demonstrate knowledge, skills, and practices embodied by the New York State P–12 Learning Standards for English Language Arts/Literacy or Mathematics that are considered sufficient for the expectations at this grade.
- NYS Level IV: Students performing at this level excel in standards for their grade. They demonstrate knowledge, skills, and practices embodied by the New York State P–12 Learning Standards for English Language Arts/Literacy or Mathematics that are considered more than sufficient for the expectations at this grade.
The performance level cut scores used to distinguish between Levels I, II, III, and IV were established during the process of standard setting in Summer 2013. The process is described in detail in Section 8 and Appendix P in the 2013 technical report (NYSED, 2013). (NYSED, page 2)
There is a lot to unpack here, but if we focus simply on the difference between a 2 and a 3, the point becomes clear. The descriptor for a 3 says that students are “proficient” in the standards and, in the same breath, that their skills are merely “sufficient” for grade-level expectations. For a group of psychometricians writing a technical report, this conflation is too sloppy to be unintentional, and it gets to the heart of the matter at hand. The definitions of these terms are widely understood among testing specialists, and they cannot be used this way unless the intention is to confuse.
Webster’s defines “proficient” as “well-advanced,” while it defines “sufficient” as “meets the needs.” A student who meets the needs of grade-level performance cannot automatically be considered proficient, or well-advanced, at grade level.
So which is it? Are students scoring a 3 well-advanced, or merely sufficient to be at grade level? The source of the confusion goes back to the 2013 Technical Report, where the actual definition of proficient was established. That definition has been indefensible ever since, which is why psychometricians have been forced to confuse readers with contradictory language.
Here is the methodological summary of the study NYSED commissioned to establish the definition of “proficient.” In short, NYSED hired the College Board to do the work, and the College Board came back with a study indicating that a proficient student would score at least a 1630 on the 2013 SAT (in 2013 the SAT was scored out of 2400 points). In 2013, a 1630 fell at the 66th percentile. Therefore, a student who earns a low 3 is deemed on track to perform among the top third of SAT test takers, and a student with a high 2 is not. Applied back to the SAT itself, this definition of proficient has NYSED saying that any SAT test taker outside the top third is below grade level. That is obviously a ridiculous statement, but it is the exact standard we are applying to our students in grades 3-8.
To understand how deeply flawed the College Board’s study for NYSED was, one need only review the study the College Board conducted to advise colleges and universities on interpreting their own test scores. For its own purposes, in 2013 the College Board defined “college and career readiness” as a score of 1550. Inexplicably, in the same year it defined college and career readiness as a score of 1630 for NYSED. The difference amounts to a 9-percentile-point increase in expectations for students. This discrepancy has no justification, and it is a significant contributor to both the confusion and the mismeasurement on our 3-8 assessments.
Much has been made in recent years of the degree to which teachers have participated in the test construction and development process. But when we look at the language of the technical reports, we quickly see that all of this participation has concerned the surface of the tests: teachers were structurally prevented from changing the cut points, or the test difficulty, in any meaningful way. Teachers simply have had no means to make these tests more developmentally appropriate or more accurate to grade-level expectations.
Page 8 of the 2017 Technical Report defined the role of teacher participants as:
New York State educators are actively involved in ELA and Mathematics test development. New York State educators provide critical input throughout all stages of the test development process, which include rangefinding, educator item review, operational forms construction, passage selection, item writing, and a “Final Eyes” meeting (a final review of the test books prior to printing).
Noticeably absent is any teacher involvement in setting cut points, or any discussion of the grade-level appropriateness of the exam as a whole. You simply cannot have meaningful input into a test’s difficulty by looking at each item in isolation; you must look at the test overall and examine the cut points. Page 16 of the 2017 Technical Report explains that the cut points were established in the 2013 process, and teachers have had no input into them since:
In Summer 2013, after the operational administration of the 2013 tests, a standard setting meeting occurred in Albany where 95 New York State educators went through a rigorous process, guided by the best practices indicated by this intensely studied process, to recommend performance standards for the new tests measuring the CCLS. These recommendations were presented to the Commissioner and the Board of Regents, who, in turn, adopted the recommended standards set forth by the committees. For additional details, see Section 8 and Appendix P in the 2013 technical report (NYSED, 2013).
Here is a link to the relevant Technical Report from 2013; Appendix P begins on page 237. Appendix P reads as a manual on how to manage a committee toward a foregone conclusion. In summary, Pearson psychometricians and NYSED staff generated Performance Level Descriptors (PLDs) based on the flawed College Board study cited above, and a panel of educators was then brought in to select test items corresponding to those PLDs. These educators were not selecting items that they, as professionals, judged to be grade-level appropriate; rather, they were tasked with finding test items that correlated with what NYSED and Pearson thought was grade-level appropriate.
There was one final stage where educators were given some freedom to adjust cut points, and it offers the most insulting and damning insight into the process NYSED created. In hundreds of pages of technical reports from 2013 through 2017, the only place where teachers could make an adjustment to the test cut points is on page 244 in Appendix P of the 2013 Technical Report, where vertical articulation is discussed. The full extent of teacher input on cut points is Step 5, which reads: “If adjustments were deemed necessary, participants were provided constraints on how much they could move the cut scores (This was 4 raw-score points, which was the rounded overall test’s standard error of measurement).”
Teachers’ input on cut points, and therefore on test difficulty, was limited to a range no larger than the test’s own measurement error: variation NYSED deemed so insignificant that it could not be bothered to control for it. That sentence summarizes much of what is wrong with both the tests and the relationship between NYSED and the state’s teachers.
It should be clear to you as the new Commissioner of Education that our 3-8 tests have generated significant controversy, perhaps the largest controversy your office confronts. What should also be clear after wading through these technical reports is that we have a testing regimen that is highly reliable, but deeply invalid.
The lack of validity is not new, and unfortunately no longer shocking. You cannot find a district in the state that has alignment between the 3-8 test scores and the scores earned by those students on high school state exams. All systems of measurement have error, but there is an extra burden that should exist when what is being measured are children, especially young children. We have a system that annually mismeasures hundreds of thousands of children as not being “sufficient” at grade level performance, when we know that they will be on track to pass their high school exams. It should not need to be stated how harmful it is for NYSED to tell parents in 3rd, 4th, 5th, 6th, 7th, and 8th grade that their child is below grade level. It is all the more damaging if that statement is, in fact, inaccurate.
NYSED has never conducted an external validation study to determine whether any of the assessments administered since 2013 actually measure what they are intended to measure. Two relevant studies have been conducted by the Benjamin Center at SUNY New Paltz. The first is a serious critique of the assumptions that went into the initial College Board study that set New York’s cut points. The second is an analysis of the dramatic increase in students receiving a score of 0 on test items since the cut points were changed in 2013. Together, the two studies use solid data to paint a stark picture of an assessment system with erroneously high cut points. Additionally, the Hechinger Report analyzed every state’s test benchmarks and found New York’s to be the highest. Note that it did not say they were the most accurate: “I found that 26 states set expectations that were three or more grade levels behind the eighth-grade standards of New York State, the state that had set the highest expectations back in 2013, as an early adopter of Common Core.”
The majority of the country sets expectations for eighth graders three or more grade levels lower than New York’s. The fact that New York’s standards are the highest is not what makes them inaccurate; it is the fact that they are so much higher, and that they in no way correlate with actual student success in high school. That is what makes them inaccurate.
Fixing the test benchmarks is not the only change that needs to occur; there are many others, not least the test length. However, it is impossible to salvage any benefit for children, parents, or educators if the results remain in their current invalid state. We are required to administer Math and ELA tests annually in grades 3-8, but we are not required to administer these particular tests.
It is our sincere hope that you help us work toward a more beneficial future for students in New York, and not cling to the flawed approaches of the past.