Though there is ample evidence that states are retreating from high-stakes teacher evaluations (the topic of a forthcoming NCTQ report), you'd be hard-pressed to find anyone who doesn't think evaluations ought to be fair and accurate, no matter how they are used. That includes being reasonably consistent.
A new study calls attention to consistency problems, even in an evaluation system that appears to align with best practices, that are serious enough to warrant additional study and, perhaps, some adjustments in practice.
Examining New Mexico's NMTEACH evaluation system, researchers Sy Doan of Vanderbilt University and Jonathan Schweig and Kata Mihaly of the RAND Corporation find that as many as 40 percent of teachers would have received a different composite rating if evaluated again that same school year (based on simulated scores built from the typical range of scores teachers earn).
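The intuition behind a result like this can be sketched with a toy simulation. This is not the study's actual methodology; the cut scores, score distribution, and amount of measurement noise below are all illustrative assumptions, chosen only to show how ordinary measurement error can flip a teacher's rating between two evaluations of the same underlying performance.

```python
import random

random.seed(0)

def rating(score, cuts=(40, 60)):
    """Map a composite score to a rating level (0, 1, or 2)
    using fixed, hypothetical cut scores."""
    return sum(score >= c for c in cuts)

def mismatch_rate(n_teachers=10_000, noise_sd=10.0):
    """Fraction of simulated teachers whose rating differs between
    two noisy measurements of the same latent performance level."""
    changed = 0
    for _ in range(n_teachers):
        true_score = random.gauss(50, 10)   # assumed latent performance
        first = rating(true_score + random.gauss(0, noise_sd))
        second = rating(true_score + random.gauss(0, noise_sd))
        changed += first != second
    return changed / n_teachers

print(f"{mismatch_rate():.0%} of simulated teachers change rating")
```

Even with generous assumptions, a sizable share of teachers land on different sides of a cut score purely by chance, which is why teachers near the middle of the distribution are most exposed.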
High rates of inconsistency like these could be reduced by improving the reliability of the individual measures feeding into the evaluations. For example, consistency can be improved by adding more observations conducted by multiple raters or, when student surveys are used, by setting a minimum number of responses before survey results are considered valid.
Value-added measures (VAMs) of teacher performance produced the most inconsistency, especially when they made up a larger share of the summative rating. That inconsistency, too, can be reduced by basing the composite VAM score on a three-year average rather than a single year. In addition to this long-recommended practice, the researchers suggest setting a minimum number of students contributing to the VAM score in order to generate more accurate results.
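The statistical logic behind multi-year averaging is straightforward: the noise in a mean of k independent yearly estimates shrinks by a factor of roughly the square root of k. A minimal sketch, with an entirely hypothetical teacher effect and noise level:

```python
import random
import statistics

random.seed(1)

def vam_draws(true_effect=0.0, noise_sd=0.15, years=1, n=5000):
    """Simulate n VAM estimates for the same teacher, each the mean
    of `years` noisy single-year estimates (assumed noise level)."""
    return [
        statistics.mean(true_effect + random.gauss(0, noise_sd)
                        for _ in range(years))
        for _ in range(n)
    ]

one_year = statistics.stdev(vam_draws(years=1))
three_year = statistics.stdev(vam_draws(years=3))
print(f"1-year SD: {one_year:.3f}, 3-year SD: {three_year:.3f}")
```

The three-year estimates cluster much more tightly around the teacher's true effect (the spread falls by about 1/√3), which is the same logic behind requiring a minimum number of students per VAM score or a minimum number of survey responses: more data points per estimate, less year-to-year churn.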
One finding suggests that using four or five rating levels may have a downside: a five-level system exposes the evaluation to more inconsistency than a three-level system, with the majority of the inconsistencies found in the middle of the scale, where most teachers' ratings fall. Fewer rating categories may be prudent, but importantly, each category needs to serve a real purpose (e.g., each is attached to specific interventions). This finding also suggests that summative ratings may still be useful for high-stakes decisions about teachers at the high and low extremes of the scale, where ratings are more consistent.
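The level-count effect can also be illustrated with a toy simulation (again, the cut scores and noise level are illustrative assumptions, not values from the study): more cut points mean more boundaries a noisy score can cross, so a five-level scale reclassifies more teachers than a three-level scale under identical measurement error.

```python
import random

random.seed(2)

def rate(score, cuts):
    """Rating level = number of cut scores at or below the score."""
    return sum(score >= c for c in cuts)

def change_rate(cuts, n=20_000, noise_sd=8.0):
    """Share of simulated teachers whose rating differs between two
    noisy measurements of the same underlying performance."""
    changed = 0
    for _ in range(n):
        true = random.gauss(50, 10)
        a = rate(true + random.gauss(0, noise_sd), cuts)
        b = rate(true + random.gauss(0, noise_sd), cuts)
        changed += a != b
    return changed / n

three = change_rate(cuts=(40, 60))           # 3 levels, 2 cut points
five = change_rate(cuts=(35, 45, 55, 65))    # 5 levels, 4 cut points
print(f"3-level change rate: {three:.0%}, 5-level: {five:.0%}")
```

Teachers far above or below every cut point rarely flip, which matches the study's observation that ratings are most consistent at the extremes of the scale.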
While this analysis offers helpful practical advice, it is important to note that the study isolated consistency without examining its overlap with accuracy. (The researchers themselves note that teachers' ratings should naturally be somewhat inconsistent from year to year, as teachers improve over time.) One solution available to schools is to place heavier emphasis on observations in order to achieve more consistent results, but because observations are more open to bias, this step could produce a less accurate rating.
The emerging research from an early-adopter state like New Mexico suggests that states and districts would be wise to build nimble, flexible systems that are amenable to mid-course corrections.