Gathering Feedback For Teaching: Combining High-Quality Observations With Student Surveys And Achievement Gains

October 7, 2012

Conducted by the Bill & Melinda Gates Foundation.

Purpose of study

To test the value of five instruments for conducting teacher observations by comparing them on reliability and on their association with a range of student outcomes, such as gains on achievement tests and self-reported effort and enjoyment in class. The five instruments were:

  • Framework for Teaching (or FFT, developed by Charlotte Danielson of the Danielson Group),
  • Classroom Assessment Scoring System (or CLASS, developed by Robert Pianta, Karen La Paro, and Bridget Hamre at the University of Virginia),
  • Protocol for Language Arts Teaching Observations (or PLATO, developed by Pam Grossman at Stanford University),
  • Mathematical Quality of Instruction (or MQI, developed by Heather Hill of Harvard University), and
  • UTeach Teacher Observation Protocol (or UTOP, developed by Michael Marder and Candace Walkington at the University of Texas-Austin).

Research questions

  • Is it possible to describe high- and low-quality questioning techniques sufficiently clearly so that observers can be trained to recognize strong questioning skills?
  • Would different observers come to similar judgments?
  • Are those judgments related to student outcomes measured in different ways?


Almost 7,500 videos of the classroom practice of 1,333 teachers from Charlotte-Mecklenburg, Dallas, Denver, Hillsborough County, New York City, and Memphis were observed and rated at least three times by trained observers. This group of teachers represents the subset of MET project volunteers who taught math or English language arts (ELA) in grades 4 through 8 and who agreed to participate in random assignment during year 2 of the project. Each video-recorded lesson was scored with both cross-subject instruments (CLASS and the FFT) and a third time with one of the subject-specific instruments, either MQI or PLATO. A subset of 1,000 videos of math lessons was scored a fourth time with the UTOP instrument. Data on state test scores, supplemental tests, and student surveys from more than 44,500 students were included in the analyses.

Major results

  • All five observation instruments were positively associated with student achievement gains.
  • Reliably characterizing a teacher’s practice required averaging scores over multiple observations.
  • Combining observation scores with student achievement gains and survey data improved reliability and predictive power. When teachers were ranked on the combined measure, the difference in student achievement gains between having a top-quartile and a bottom-quartile teacher was equivalent to nearly 8 months of schooling in math and 2.5 months in ELA.
  • Teaching experience and graduate degrees were associated with much smaller gains in state test scores than was strong performance on the combined measure.
  • Teachers with strong performance on the combined measure also performed well on other student outcomes—the tests of conceptual understanding and student self-reported levels of effort and enjoyment in class.
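
As a rough illustration of why averaging over multiple observations helps, the sketch below applies the Spearman-Brown formula, a standard psychometric result for the reliability of the mean of several parallel measurements. This is not presented as the study's own method, and the single-observation reliability of 0.35 is a hypothetical value chosen only for illustration, not a figure from the report. The function name reliability_of_mean is likewise an assumption.

```python
def reliability_of_mean(single_obs_reliability: float, n_observations: int) -> float:
    """Spearman-Brown formula: reliability of the mean of n parallel observations.

    Assumes every observation is equally reliable and that the errors of
    different observations are independent (a simplifying assumption).
    """
    if not 0.0 <= single_obs_reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    if n_observations < 1:
        raise ValueError("need at least one observation")
    r = single_obs_reliability
    return n_observations * r / (1 + (n_observations - 1) * r)


# Hypothetical single-observation reliability (illustrative only, not a MET figure).
HYPOTHETICAL_RELIABILITY = 0.35

for n in (1, 2, 4):
    print(n, round(reliability_of_mean(HYPOTHETICAL_RELIABILITY, n), 2))
# prints:
# 1 0.35
# 2 0.52
# 4 0.68
```

Under these assumptions, each added observation raises the reliability of the averaged score, but by a smaller amount each time.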


The authors highlight three implications of their findings:

  • Achieving high reliability in classroom observations will require quality assurance measures such as observer training and certification, system-level “audits” by a second set of observers, and multiple observations of each teacher, especially when stakes are high.
  • Evaluation systems should include multiple measures, not just value-added scores or classroom observations.
  • Classroom observations have the potential to identify specific strengths and weaknesses in teachers’ practice. Schools should look for ways to use observations for teacher development.

FFT focus

The Framework for Teaching was one of the five observational instruments assessed for reliability and association with student outcomes. The researchers found that the FFT was related to student achievement growth on state test scores, tests of conceptual understanding, and student surveys of effort and enjoyment in the classroom. Several subsequent analyses focused on the FFT as a component of the combined measure.