Rethinking teacher evaluation in Chicago: Lessons learned from classroom observations, principal-teacher conferences, and district implementation


Conducted by Lauren Sartain, Sara Ray Stoelinga, and Eric R. Brown with the assistance of Stuart Luppescu, Kavita Kapadia Matsko, Frances K. Miller, Claire E. Durwood, Jennie Y. Jiang, and Danielle Glazer.

Purpose of study

To summarize findings from Chicago’s Excellence in Teaching Pilot, which aimed to improve instruction by providing teachers with feedback on their strengths and weaknesses, and highlight broad implications for districts and states working to design and develop more effective teacher evaluation systems.

Research questions

  • What are the characteristics of principal ratings of teaching practice?
    • Do evaluators rate the same lesson in the same way? Do principals rate teaching practice consistently across schools?
    • Are the classroom observation ratings valid measures of teaching practice? Is there a relationship between ratings and student learning outcomes?
  • What are principal and teacher perceptions of the evaluation tool and conferences?
    • Do participants find the system to be useful? To be fair?
    • What is the perceived impact on teacher practice?
  • What factors facilitated or impeded implementation of the teacher evaluation system?


Randomly selected teachers and administrators at half of the elementary schools in four areas of Chicago participated in the first year of the study (2008-09), while the others joined the following year. Sample sizes and participants varied by aspect of the study: 499 observations of 257 teachers were made by principals and highly trained external observers to assess the reliability of ratings on the Danielson Framework for Teaching (FfT), while principals made 955 observations of 501 teachers to assess the validity of observations. Teacher value-added scores were calculated for 417 reading teachers and 340 math teachers to determine the relationship between observational assessments and student learning. Thirty-seven pilot and control principals completed surveys to determine their engagement. All pilot schools also completed principal interviews (39), teacher interviews (26), and principal focus groups (23). A subset of eight case study schools provided additional in-depth administrator interviews, teacher focus groups, and observations.

Major results

  • The data showed a strong relationship between classroom observation ratings on the FfT and value-added measures of student learning growth in both reading and math. The students of highly rated teachers showed the most growth in their test scores, while the students of teachers with low ratings on the FfT showed the least growth. These results support the validity of observational ratings of teaching practice using the FfT.
  • In terms of score reliability, most principals assigned the same ratings to observed teachers as highly trained external observers did, although small percentages consistently rated teachers lower (11%) or higher (17%) than the external observers. Administrators tended to rate teaching practice reliably at the low end of the scale (Unsatisfactory and Basic) but rated teachers’ practice as Distinguished more often than the external observers did.
  • Some principals struggled to learn and engage with the process of using the FfT to observe and rate their teachers’ practice.
  • Qualitatively, administrators and teachers thought the FfT helped lead to more reflective, evidence-based discussions about teaching practice during post-observation conferences. The FfT provided a shared language about instructional practice and improvement that guided conversations.
  • The effectiveness of conferences varied, however, depending on the principal’s skills and buy-in to the evaluation process. Many principals struggled to engage in deep coaching conversations, instead dominating conversations and/or using low-level questions that required minimal responses and didn’t push teachers’ understanding.


Classroom observations using the Danielson FfT can provide valid and reliable assessments of teaching practice while also assessing specific strengths and weaknesses. Use of the FfT helped to guide more meaningful conversations about instruction. While most principals were engaged in the evaluation process, there is a need for more training and support for deep coaching conversations that translate ratings on an instructional rubric into improved performance in the classroom. Another challenge is the time and energy required to conduct evidence-based evaluation well, from analyzing observations to preparing for thoughtful conferences. Both teachers and administrators must make a long-term commitment to using evaluation to promote teacher development and improve practice in the classroom.

FfT focus

The study validated the FfT as being associated with teacher effectiveness, as measured by student growth on test scores in reading and math, and established that principals can rate teachers reliably when compared with highly trained external observers. Use of the evidence-based rubric provided a common language for discussions about instructional improvement and fostered more objective and meaningful conversations. However, the study also raised some concerns about use of the FfT as a tool for teacher development and instructional improvement, including principal buy-in and engagement, political concerns about teacher ratings, time constraints, and the need for support for administrators to translate FfT ratings into deep coaching for their teachers.
