The Reliability of Classroom Observations by School Personnel


Conducted by Andrew Ho and Thomas Kane.

Purpose of study

To evaluate the accuracy and reliability of school personnel in performing classroom observations.

Research questions

  • What characteristics of lessons and observers are associated with accurate and reliable observation scores?
  • Do lessons chosen by teachers (a proxy for notification prior to an observation) earn higher observation scores than those chosen at random?
  • Are observation scores based on short segments of instruction (15 minutes) comparable to observation scores based on an entire lesson?
  • Does the order in which various lessons and segments are observed influence scores?


Video recordings of eight lessons (four chosen by the teachers and four chosen at random) by 67 teachers in Hillsborough County, Florida, were observed and scored by 53 administrators and 76 teacher peers. The teachers and administrators represented 32 schools at the elementary, middle, and high school levels. Each of the 129 observers rated four video-recorded lessons from each of six teachers (24 lessons in total) on 10 items from Domains 2 (Classroom Environment) and 3 (Instruction) of Danielson’s Framework for Teaching (FFT), yielding more than 3,000 video scores. Each teacher’s instruction was scored an average of 46 times by different types of observers.
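
The counts above are consistent with one another, as a short arithmetic check shows. This is only a sketch: it assumes every observer completed all 24 assigned ratings, which the summary does not state, and the variable names are illustrative.

```python
# Figures taken from the study description above.
administrators = 53
peers = 76
teachers = 67
lessons_per_observer = 4 * 6  # four lessons from each of six teachers

# Assumption: every observer completed every assigned rating.
observers = administrators + peers
total_scores = observers * lessons_per_observer
scores_per_teacher = total_scores / teachers

print(f"observers: {observers}")
print(f"total video scores: {total_scores}")
print(f"average scores per teacher: {scores_per_teacher:.1f}")
```

Under that assumption, the total lands just above 3,000 and the per-teacher average rounds to 46, matching the figures reported in the text.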

Major results

  • Observers rated five percent of lessons as “unsatisfactory” and only two percent as “advanced.” Most scores fell in the middle two categories, “basic” and “proficient.” Because the scale was so compressed, a score difference of 0.1 points could move a teacher up or down 10 points in percentile rank.
  • Administrator scores differentiated more among teachers than scores assigned by peer teachers. The standard deviation among teacher scores was 50 percent larger when scored by administrators than when scored by peers.
  • Administrators rated their own teachers 0.1 points higher than administrators from other schools did and 0.2 points higher than peers did. Although these differences look small, they are magnified by the compressed scale of scores noted above.
  • Although administrators scored their own teachers higher, they ranked their teachers similarly to administrators from other schools. This implies that administrator scores were not driven by prior impressions, favoritism, or personal bias. The correlation between same-school and different-school administrators’ scores for a given teacher was .87.
  • Allowing teachers to choose their own videos generated higher average scores, but their ranking remained the same as when videos were not chosen. That is, choosing their own videos did not mask the differences in their practice. In fact, variance in observation scores was greater among self-selected lessons.
  • A positive (or negative) impression of a teacher in the first several videos tended to linger—especially when one observation immediately followed the other.
  • Using multiple observers improves reliability. The reliability of a single observation by a single observer ranged between .27 and .45. One observation by an administrator and another by an external peer increased reliability to .59, and additional observations and observers increased it further, up to .72.
  • The cost of involving multiple observers can be mitigated by supplementing observations of full lessons with shorter observations. The reliability of a 15-minute observation was 60 percent of that of a full-lesson observation, and it required less than one-third of the time.


Findings highlight the importance of multiple observations by multiple observers to achieve an acceptable level of reliability, especially when the scores inform consequential decisions.
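
To see why averaging several observations raises reliability, the classical Spearman-Brown formula offers a rough illustration. This is not the method behind the study's reported figures: the formula treats all ratings as interchangeable (parallel) and ignores the distinction between adding observers and adding lessons, so it should not be expected to reproduce the .59 or .72 values. The function name `spearman_brown` is illustrative.

```python
def spearman_brown(single_reliability: float, k: int) -> float:
    """Reliability of the mean of k parallel ratings (Spearman-Brown).

    Assumption: the k ratings are parallel, i.e. equally reliable and
    interchangeable. Real observations by different observers of
    different lessons only approximate this.
    """
    if not 0.0 <= single_reliability <= 1.0:
        raise ValueError("single_reliability must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    r = single_reliability
    return k * r / (1.0 + (k - 1) * r)


if __name__ == "__main__":
    # Single-observation reliabilities reported in the study: .27 to .45.
    for r in (0.27, 0.45):
        for k in (1, 2, 4):
            print(f"single = {r:.2f}, k = {k}: {spearman_brown(r, k):.2f}")
```

The qualitative pattern matches the findings: each added rating helps, with diminishing returns as the number of ratings grows.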

Inconsistencies between observers also contributed to score variance, which highlights the importance of training, certification tests, and regular calibration. To ensure a fair and reliable system for teachers, districts should employ multiple observers and set up a system to check and compare feedback given to teachers by different observers.

Findings also suggest that the element of surprise in teacher evaluations does not increase reliability or accuracy, yet it heightens anxiety and reinforces the impression that evaluations are primarily concerned with teacher accountability. In fact, self-selected lessons showed greater variation among teachers.

Finally, the observation instrument (the Framework for Teaching) did not discern large absolute differences in practice. Most teachers fell in the middle of the scale, so that small differences in scores translated into large differences in percentile rankings. This may reflect observers’ reluctance to make sharp distinctions among teachers, a need for finer distinctions in performance level standards, or a lack of variance in teaching practice on the scales used in this study. The authors assert that the field needs instruments that allow for clearer distinctions, such as content-based observational protocols, those that assess subject-specific pedagogical best practices, or instruments for assessing instruction on specific standards in the Common Core State Standards.
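
The link between a compressed scale and large percentile swings can be sketched numerically. The example below assumes teacher scores are approximately normally distributed, which the summary does not claim, and uses hypothetical helper names; it illustrates the mechanism rather than reconstructing the study's analysis.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def percentile_shift(delta: float, sd: float, start_percentile: float = 50.0) -> float:
    """Percentile points gained when a score rises by `delta`.

    Assumption: scores are normally distributed with standard deviation `sd`.
    """
    z_start = STD_NORMAL.inv_cdf(start_percentile / 100.0)
    z_end = z_start + delta / sd
    return 100.0 * STD_NORMAL.cdf(z_end) - start_percentile


def implied_sd(delta: float, shift_points: float, start_percentile: float = 50.0) -> float:
    """Score standard deviation at which `delta` moves a teacher by `shift_points`."""
    z_start = STD_NORMAL.inv_cdf(start_percentile / 100.0)
    z_end = STD_NORMAL.inv_cdf((start_percentile + shift_points) / 100.0)
    return delta / (z_end - z_start)


if __name__ == "__main__":
    # Reported effect: a 0.1-point difference moves a teacher about 10 percentile points.
    sd = implied_sd(delta=0.1, shift_points=10.0)
    print(f"implied score SD under normality: {sd:.3f}")
    print(f"shift for 0.1 points at the median: {percentile_shift(0.1, sd):.1f}")
    print(f"shift for 0.1 points at the 90th percentile: {percentile_shift(0.1, sd, 90.0):.1f}")
```

The narrower the spread of scores, the more percentile ranks a fixed score difference covers, and the effect is strongest near the middle of the distribution, where most teachers in the study fell.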

FFT focus

The teacher observations in this study scored lessons on 10 items from Domains 2 and 3 of the Framework for Teaching. Analyses support the reliability and validity of the FFT to differentiate teaching practice, especially under the conditions recommended.