RESEARCH: When Principals Rate Teachers
By Brian Jacob and Lars Lefgren
The best—and the worst—stand out
Elementary- and secondary-school teachers in the United
States traditionally have been compensated according to salary schedules
based solely on experience and education. Concerned that this system makes
it difficult to retain talented teachers and provides few incentives for
them to raise student achievement
while in the classroom, many policymakers have proposed merit-pay
programs that link teachers’ salaries directly to their apparent
impact on student achievement.
Until recently, only a handful of isolated districts
had attempted such programs. Now entire state systems are moving toward
merit pay, with new policies established recently in Florida and Texas
requiring districts to set teachers’ salaries based in part on the
gains their students are making on the state’s accountability exam.
Implementing a merit-pay system, however, comes with
challenges. Students often have more than one teacher but take only one
high-stakes test. How do we know which teacher to reward? If students are
not tested annually in each subject, how do we determine the merit of a
teacher in a year without testing? How do we fairly assess the impact of a
teacher during a testing year if we do not know how students performed
during the previous school year? Can a merit-pay system overcome these
obstacles?
One option is to turn to principals and ask them to
help determine the size of pay raises. Such subjective performance
assessments are already used to evaluate untenured teachers, and they play
a large role in promotion and compensation decisions in other occupations.
While principals can and do judge teachers’ performance, however,
there is little good evidence on the accuracy of their judgments.
The research reported in this paper fills this gap. We
found that principals in a western school district did a good job of
assessing teachers’ effectiveness. In fact, principals are quite good
at identifying those teachers who produce the largest and smallest
standardized achievement gains in their schools (the top and bottom
10–20 percent). They are less able to distinguish among teachers in
the middle of this distribution (the middle 60–80 percent),
suggesting that merit-pay programs that reward or sanction teachers should
be based on evaluations by principals and should be focused on the highest-
and lowest-performing teachers.
A Representative Sample
We surveyed all 13 elementary-school principals in a
midsized school district in the western United States that asked to
remain anonymous. We asked them to rate the teachers in their schools on a
variety of performance dimensions. The survey, conducted in February 2003,
provides evaluations by their principals of 202 elementary-school teachers
in grades 2 through 6.
The teachers included in the study are fairly
representative of elementary-school teachers nationwide. Sixteen percent of
them are men, the average age is 42, and average teaching experience is 12
years. Most of these teachers attended a local university; 10 percent
attended another in-state college; and 6 percent attended a school out of
state. Seventeen percent of them have a master’s degree or higher,
and most are licensed in either early childhood education or elementary
education. Finally, 8 percent of the teachers in our sample taught in a
mixed-grade classroom in 2002–03, and 5 percent were in a
“split” classroom, sharing a single contract and dividing the
school day with another teacher. The students in grades 2 through 6 in the
district are predominantly white (73 percent), with a sizable ethnic
minority (Latino students compose 21 percent of the elementary population);
48 percent of them receive a free or reduced-price lunch. Achievement
levels in the district are almost exactly at the national average
(49th percentile on the Stanford Achievement Test).
All elementary-school students in the district take a
set of exams each year, in reading and math. These multiple-choice,
criterion-referenced tests cover topics that are closely linked to the
district’s learning objectives. While student achievement results
have not been linked to rewards or sanctions for schools until recently,
the results of the exams have been distributed to parents annually for at
least the past decade, years before implementation of the No Child Left
Behind law. This latter fact is important because our study relies on a
consistent data set covering the years 1998 through 2003. The district has
not had a merit-pay program for teachers at any time during this period.
To ensure that we could link student achievement data
to the appropriate teacher, we limited our sample to classroom teachers,
omitting music and gym teachers as well as librarians. We excluded
kindergarten and first-grade teachers because earlier achievement exams
were not available for their students; this prevented us from developing a
“value-added” measure of student learning. We retain in our
analysis the small number of teachers who share a contract, each teaching
only half of the school day. For our analysis, the gains made by students
in these classes count toward the estimated value added of each of the two
teachers.
Can Principals Identify Effective Teachers?
Principals were asked not only to provide a rating of
overall teacher effectiveness, but also to assess, on a scale from one
(inadequate) to ten (exceptional), specific teacher characteristics (ten
altogether), including dedication and work ethic, classroom management,
parent satisfaction, positive relationship with administrators, and ability
to improve math and reading achievement. Principals were assured that their
responses would be completely confidential and would not be revealed to the
teachers or to any other employee of the school district.
While there was some variation among principals, the
overall assessments they gave teachers were generally quite high, with an
average of 8.1. Only 10 percent of the assessments fell below a 6, and the
average rating for the least-generous principal was still a 6.7. At the
same time, principals did not simply assign similar scores to each of their
teachers. In fact, the principals generally used 5 to 6 different ratings
for the teachers in their school.
Because principals differ in the generosity and degree
of variation in the ratings they give, we placed all the ratings on the
same scale by subtracting from each teacher’s rating the average
rating given by that teacher’s principal and then dividing by the
principal’s standard deviation. We did this separately for each
specific aspect of teacher performance about which principals were asked.
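The rescaling described above amounts to a within-principal z-score. A
minimal sketch in Python, assuming the ratings arrive as (principal,
teacher, rating) tuples; the function name, the data layout, the use of the
population standard deviation, and the handling of a principal who gives
identical scores are all illustrative assumptions rather than the
authors' procedure:

```python
from collections import defaultdict
from statistics import mean, pstdev

def standardize_by_principal(ratings):
    """Return {teacher: z-score}, centering each rating on the rating
    principal's own mean and scaling by that principal's own spread.

    `ratings` is a list of (principal, teacher, rating) tuples.
    """
    by_principal = defaultdict(list)
    for principal, _teacher, rating in ratings:
        by_principal[principal].append(rating)

    z_scores = {}
    for principal, teacher, rating in ratings:
        scores = by_principal[principal]
        spread = pstdev(scores)
        # Assumption: a principal who gives every teacher the same score
        # provides no ranking information, so those teachers get 0.0.
        z_scores[teacher] = (rating - mean(scores)) / spread if spread else 0.0
    return z_scores

# Hypothetical data: principal A is generous, principal B is strict.
example = [("A", "t1", 9), ("A", "t2", 10), ("A", "t3", 8),
           ("B", "t4", 6), ("B", "t5", 7), ("B", "t6", 5)]
z = standardize_by_principal(example)
```

In this made-up example, teacher t2 (rated 10 by the generous principal)
and teacher t5 (rated 7 by the strict one) end up with the same standardized
score, because each sits equally far above the typical rating given by his
or her own principal.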
We compared a principal’s assessment of how
effective a teacher is at raising student reading or math achievement, one
of the specific items principals were asked about, with that
teacher’s actual ability to do so as measured by their value added,
the difference in student achievement that we can attribute to the teacher.
To estimate the value added by a teacher, we examine the performance of her
students after accounting for a wide variety of student and classroom
characteristics that could affect achievement independent of the
teacher’s ability. These characteristics include race, gender,
eligibility for the federal lunch program, limited English proficiency,
and, most important, previous student achievement. We also take advantage
of the availability of data on the same teachers from as far back as the
1996–97 school year; this enables us to distinguish long-term teacher
quality from the possibly idiosyncratic performance of a class in any one
year.
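The core idea of value added can be sketched in a few lines of Python. This
is a deliberately stripped-down illustration, not the model used in the
study: it controls only for prior achievement, ignores demographics,
classroom characteristics, and multiple years of data, and the function
name and student records are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def simple_value_added(records):
    """Estimate a crude value added per teacher.

    `records` is a list of (teacher, prior_score, current_score) tuples,
    one per student. Step 1 fits current = a + b * prior by ordinary
    least squares across all students. Step 2 averages each teacher's
    student residuals, i.e. how far her students land above or below
    what their earlier scores predicted.
    """
    priors = [p for _, p, _ in records]
    currents = [c for _, _, c in records]
    p_bar, c_bar = mean(priors), mean(currents)
    var_p = sum((p - p_bar) ** 2 for p in priors)
    cov_pc = sum((p - p_bar) * (c - c_bar) for p, c in zip(priors, currents))
    slope = cov_pc / var_p  # assumes prior scores are not all identical
    intercept = c_bar - slope * p_bar

    residuals = defaultdict(list)
    for teacher, prior, current in records:
        residuals[teacher].append(current - (intercept + slope * prior))
    return {teacher: mean(r) for teacher, r in residuals.items()}

# Hypothetical classes: teacher Y starts with stronger students, but
# teacher X's students gain more over the year.
students = [("X", 30, 37), ("X", 40, 47), ("X", 50, 57),
            ("Y", 60, 63), ("Y", 70, 73), ("Y", 80, 83)]
va = simple_value_added(students)
```

Here teacher Y's class has the higher average score, yet teacher X receives
the higher value-added estimate because her students outperform what their
prior scores would predict.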
We find a positive correlation between a
principal’s assessment of how effective a teacher is at raising
student achievement and that teacher’s success in doing so as
measured by the value-added approach: 0.32 for reading and 0.36 for math.
These correlations are based not on a principal’s overall rating of
the teacher, but rather on the principal’s personal assessment of how
effective the teacher is at “raising student math (or reading)
achievement.” Previous studies of evaluations by principals have used
only the overall rating of the teacher, a less direct assessment of a
teacher’s ability to raise student performance. Using the overall
rating in that way could compromise the accuracy of subjective performance
evaluations, especially if principals value characteristics of teachers
that are unrelated to their effect on student performance. Our findings
lead us to conclude that principals are able to identify accurately this
dimension of teacher effectiveness.
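For readers who want to see the computation, the sketch below calculates a
correlation coefficient in Python, assuming the reported figures are
ordinary Pearson correlations between two per-teacher series. The function
name and the numbers are made up for illustration and are not the study's
data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    # Assumes neither series is constant, so the denominator is nonzero.
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical standardized principal ratings and value-added estimates
# for five teachers, listed in the same teacher order.
principal_rating = [-1.0, -0.5, 0.0, 0.5, 1.0]
value_added = [-0.2, 0.1, -0.1, 0.3, 0.1]
r = pearson(principal_rating, value_added)
```

A value near 1 would mean principals rank teachers almost exactly as the
value-added measure does; a value near 0 would mean the two rankings are
essentially unrelated.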
Why aren’t these correlations even higher? One
possible explanation is that principals focus on the average test scores in
a teacher’s classroom rather than on student improvement. There is some evidence
for this conjecture. The correlation between ratings by principals and the
average test scores of a teacher’s students is significantly higher
than the correlation between ratings by principals and the teacher’s
value-added rating in reading (0.56 versus 0.32), though not in math.
Another reason could be that principals focus on their
most recent observations of teachers. We do find, for example, that the
average achievement gains in a teacher’s classroom in 2002–03
are a modestly stronger predictor of the principal’s rating than the
gains in any previous year. In theory, it is possible that principals are
correct in assuming that a teacher’s effectiveness changes over time
so that teachers’ most recent experience is the best indicator of
their actual effectiveness. If that were the case, however, we would expect
to find that principals’ ratings are more highly correlated with
value-added measures that have been adjusted to account for the fact that
teachers tend to be less effective in their first one or two years in the
classroom. In fact, the correlation between principals’ ratings and
experience-adjusted value-added measures is no higher than the correlation
with our baseline value-added measures. The bigger mistake principals make,
it seems, is not adequately accounting for students’ incoming
ability.
While informative about principals’ overall
abilities, a simple correlation does not tell us whether principals are
more or less effective at identifying teachers at certain points on the
ability distribution. We therefore estimated the percentage of teachers
that a principal can correctly identify in the top group within his or her
school. We found that the teachers identified by principals as being in the
top category were, in fact, in the top category according to the
value-added measures about 52 percent of the time in reading and 69 percent
of the time in mathematics. If principals randomly assigned ratings to
teachers, we would expect the corresponding probabilities to be 14 and 26
percent, respectively. This suggests that principals have considerable
ability to identify teachers in the top of the distribution. The results
are similar if one examines principals’ ability to identify teachers
in the bottom of the ability distribution.
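One way to picture this comparison is sketched below in Python. The article
does not spell out how the top group was defined or how ties were handled,
so the sketch assumes the top group is simply the k highest-scoring
teachers in a school on each measure; the function name and the data are
hypothetical.

```python
def top_group_hit_rate(principal_rating, value_added, k):
    """Compare a principal's top-k picks with the value-added top k.

    Both arguments map teacher -> score for a single school. Returns
    (hit_rate, chance): the share of the principal's top-k teachers who
    are also in the value-added top k, and the share expected if the
    principal had picked k teachers at random.
    """
    def top_k(scores):
        # Assumption: ties are broken arbitrarily by sort order.
        ranked = sorted(scores, key=scores.get, reverse=True)
        return set(ranked[:k])

    picked = top_k(principal_rating)
    actual = top_k(value_added)
    hit_rate = len(picked & actual) / k
    chance = k / len(value_added)
    return hit_rate, chance

# Hypothetical school with six teachers and a top group of two.
ratings = {"a": 1.5, "b": 1.1, "c": 0.2, "d": -0.3, "e": -1.0, "f": -1.5}
gains = {"a": 0.30, "b": 0.02, "c": 0.15, "d": -0.05, "e": -0.10, "f": -0.32}
hit_rate, chance = top_group_hit_rate(ratings, gains, k=2)
```

In this made-up school the principal's two favorites include one of the two
teachers with the highest value added, a hit rate of one-half against a
chance rate of one-third.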
Despite their success with the top and bottom of the
distribution, principals are significantly less successful at
distinguishing among teachers in the middle of the ability distribution.
Principals correctly identify only 49 percent of teachers as being better
than the median teacher in their school in boosting students’ reading
scores, relative to the 33 percent that one would expect if
principals’ ratings were randomly assigned. Principals appear
somewhat better at distinguishing between teachers in the middle of the
distribution in math (they correctly placed 54 percent of teachers above
the median, compared with the 26 percent expected if ratings were random),
but they again appear to be better at identifying the best and worst
teachers.
One reason that principals might have difficulty
distinguishing between teachers in the middle is that the distribution of
teachers’ value-added ratings is highly compressed. However, our
analysis of the data suggests that this is not the case. Teachers who
receive ratings at or close to the median in the school have estimated
value-added measures that are quite widely dispersed.
What Characteristics of Teachers Do Principals Value?
Of course, the effects of moving to a system of
compensation based on assessment by principals depend on the relative
importance they place on a teacher’s ability to raise standardized
test scores when making overall assessments of teachers’
effectiveness. While such preferences could theoretically be set by
district administrators or other policymakers, it is likely that principals
would retain some autonomy over personnel decisions, so their preferences
are important to investigate. We therefore compared principals’
overall rating of each teacher with their assessment of various teacher
attributes to examine how principals value different dimensions of quality
in teachers.
Perhaps not surprisingly, teachers’ ratings on
many (though not all) of the individual survey items are highly correlated.
Based on the relationships between the questions, we created three groups
of teachers’ quality characteristics and reanalyzed the results. The
first group captures what might be described as traditional teaching
ability and includes the ratings of classroom management, organization, and
ability to improve students’ test scores. The second, including the
principal’s assessments of a teacher’s relationship with
colleagues and administrators, measures a teacher’s collegiality. The
third measures student satisfaction and includes the principal’s
ratings of student satisfaction and the teacher as a role model.
Ability, collegiality, and student satisfaction all
contribute independently to a principal’s overall evaluation of a
teacher, but principals weigh the set of questions measuring
teachers’ ability to improve student achievement and to manage a
classroom most heavily. An increase of one standard deviation in a
principal’s evaluation of a teacher’s management and teaching
ability, for example, is associated with an increase of 0.56 standard
deviations in the principal’s overall rating. In comparison, an
increase of one standard deviation in teacher collegiality is associated
with an increase of roughly one-third of a standard deviation in the
overall rating. Meanwhile, teachers scoring one standard
deviation higher in student satisfaction score just 0.15 standard
deviations higher in their overall rating, all else being equal.
Predicting Performance
We should care about the quality of principals’
assessments of teacher quality not just for their reliability in a
merit-pay system, but also for their ability to identify teachers who will
continue to improve student achievement. In order to get a sense of how
well principals’ assessments forecast teachers’ performance, we
examined how well these assessments predict future student achievement
gains. For our February 2003 survey of principals, that meant evaluating
scores on the spring 2003 tests. We compared the predictive accuracy of a
principal’s assessment of teacher effectiveness with the predictive
accuracy of a teacher’s value-added rating. We also measured the
accuracy of the traditional determinants of teachers’ salaries,
experience and education, in predicting those scores. Throughout, we
accounted for differences in previous student achievement, student
demographics, and classroom characteristics.
Our findings suggest that ratings by principals,
both overall ratings and ratings of a teacher’s ability to improve
achievement, effectively predict a student’s future achievement gains
(see Figure 1). Students whose teachers receive an overall rating one
standard deviation above the mean are predicted to score roughly 0.06
standard deviations higher in reading than students whose teachers receive
an average rating. By way of comparison, students receiving free or
reduced-price lunch in the same district experience achievement gains
approximately 0.16 standard deviations lower than similar students who are
not eligible for such programs. Assignment to a teacher with a favorable
evaluation by her principal appears to be more important for math
performance. An increase of one standard deviation in the principal’s
evaluation predicts an increase of 0.14 standard deviations in math
performance, roughly on par with the disadvantage associated with coming
from a low-income family.

Measures of teachers’ value added in previous
years are an even better predictor of future gains in students’
achievement than are principal ratings. These results, which are similar
for math and reading, suggest that teachers’ impact on student
achievement, as measured by simple value-added measures of teacher
effectiveness, remains fairly stable over time and that principals’
ratings effectively capture a substantial fraction of these stable
differences in teachers’ effectiveness.
We do not find any statistically significant
relationship between the number of years a teacher has taught and
students’ achievement, though this is probably due to the necessary
omission of first-year teachers (because we cannot measure their value
added for a previous school year). Other studies have found that first-year
teachers tend to perform worse on average than experienced teachers.
Education does have some predictive power. Teachers with advanced degrees
have students who score roughly 0.10 standard deviations higher. We
hesitate to say that education itself is producing these gains, because a
teacher’s level of education is likely to be associated with personal
characteristics not accounted for in our analysis, and these may be the
very factors responsible for the improvements in student achievement.
Perhaps our most interesting finding is that the
salaries teachers in this district received in 2002–03 bore no
relation at all to their impact on student achievement. Students with
highly paid teachers made no more progress than those with teachers who had
low salaries.
Conclusions
In sum, our results suggest that student achievement
(as measured by standardized test scores) would probably improve more under
a system based on principals’ assessments than in systems where
compensation is based solely on education and experience. This is because
principals would be able to identify and reward the very best teachers
while, at the same time, identifying the least competent teachers for
remediation or dismissal.
To the extent that the most important staffing
decisions involve sanctioning incompetent teachers and rewarding the very
best teachers, a principal-based assessment system may affect achievement
as positively as a merit-pay system based solely on student test results.
Moreover, evaluation by the principal has the potential to offset some of
the potential negative consequences of test-based accountability systems.
If principals can observe inputs as well as outputs, they may be able to
ensure that teachers increase student achievement through improvements in
pedagogy, classroom management, or curriculum rather than teaching to the
test. Principals can also evaluate teachers on the basis of a broader
spectrum of educational outputs in addition to test scores that parents may
value. At the same time, the inability of principals to distinguish between
a broad middle range of teacher quality suggests caution in relying on
principals for fine-grained performance determinations, as might be
required under certain merit-pay policies.
There are two important caveats to consider when interpreting
our results. First, we conducted our analysis in a context where principals
were not being evaluated on the basis of their ability to identify
effective teachers. It is possible that principals’ ability to
identify the best-performing teachers would be enhanced by a school system
where the principals had more responsibility for monitoring teachers’
effectiveness. At the same time, social or political pressures might make
principals less willing to assess teachers honestly if their judgments
directly influenced teachers’ compensation. Second, our analysis
focuses on the source of the teacher assessment; we do not address the type of
rewards or sanctions associated with teacher performance. This is clearly
an important dimension of any performance management system, and one would
not expect either a principal-based or a test-based assessment system to
have a substantial impact on student outcomes unless it were accompanied by
meaningful consequences.
Brian Jacob is assistant professor of public policy at
the John F. Kennedy School of Government, Harvard University, and a
faculty research fellow with the National Bureau of Economic Research. Lars
Lefgren is assistant professor of economics, Brigham Young University.