Researchers Bao, Howard, Spielholz, Silverstein, and Polissar conducted a study designed to investigate interrater reliability — the ability of different observers/assessors to reach the same conclusions when visually estimating posture. In their introduction, the authors review the three postural observation methods that are available to ergonomists:
- direct measurement, which involves instrumentation (e.g., goniometers) that can increase the cost and complexity of an evaluation, but is likely to produce the most valid and reliable results when performed properly;
- self-reported posture, collected through survey or interview, which is economical, but has lower validity and reliability; and
- visual postural observations, conducted by an observer (a “rater”) on site or via video, the validity and reliability of which is the central question for this research.
The researchers tested a small group of professional ergonomists with little experience using the posture rating tool/method, comparing them to a small group of technicians with limited theoretical knowledge but more experience applying the rating tool. The central question they investigated was how reliably the raters reached the same conclusions when recording postures. It is important to note that they did not test how accurately the raters assessed actual body postures, but rather how consistent the raters were with one another, whether or not they were accurate.
The authors present a detailed review of their methods and results and the possible reasons for those results, and interested readers are encouraged to review the entire article. However, the following select findings are presented here:
- Raters were in greater agreement (more reliable) when categorizing posture angles into larger categories, 30° increments vs. 10° increments.
- Raters were in greater agreement when estimating larger body segment angles (e.g., trunk) than smaller segment angles (e.g., wrist).
- A variety of factors influence raters’ ability to estimate a posture angle, including such things as video quality, rater location relative to the subject being observed, rater experience level, and more.
- With a few exceptions, raters were in greater agreement when rating postures that were near neutral than they were when rating non-neutral joint postures.
- When considering interrater reliability for 10° increment estimates, the percentage of agreement between raters ranged from a low of 16% (forearm supination/pronation) to a high of 96% (neck lateral flexion).
- When considering interrater reliability for 30° increment estimates, the percentage of agreement between raters ranged from a low of 32% (forearm supination/pronation) to a high of 100% (neck lateral flexion).
- On average, the percentage of agreement was 78% lower for the 10° increment estimates than for the 30° categories.
- More experience with the rating tool produced greater consistency among the raters, suggesting that training and experience with the posture tool may outweigh theoretical knowledge. The technicians demonstrated more consistency than the professional ergonomists, but when all data were categorized, there was typically less than 10% difference between the groups for most postures.
The Bottom Line — How This Applies to Ergonomists
Visual observation and recording of posture is something that occupational ergonomists apply regularly in the assessment/evaluation of ergonomic risk. Observational assessment tools like RULA, for example, rely on such postural estimates. This study indicates that the width of the angle range categories (10° vs. 30° in this study) has a significant effect on interrater reliability, as does the training and experience one has applying an observational postural recording tool. Other factors, such as camera placement, will also affect how consistent raters are when assessing the very same posture.
Although not explicitly discussed by the authors, my own interpretation of these results is that visual observational posture recording tools are subject to substantial variation between observers, bringing their validity into question. Further, this study does not address the question of accuracy in recording the actual postures. Therefore, even if various raters were to reach consistent posture estimates, there is no way to know if those estimates accurately capture the true posture. If you need accuracy, you are much better off carefully using direct measurement techniques (e.g., goniometers) than you are applying an observational tool.
The researchers tested seven observers: three of them professional ergonomists with extensive theoretical background, but limited experience applying the postural recording tool/method; and four technicians with limited theoretical background, but more experience applying the posture tool. Each participant was asked to record postures from 37-38 randomly selected video frames taken from four different video-recorded jobs using a posture recording system developed by Bao. The system included two camera angles, set as close to perpendicular to each other as possible within the constraints of the industrial environments. The raters estimated the approximate joint angles of the various body parts by clicking on a point on a posture diagram displayed on a computer screen instead of entering a numerical angle value in degrees. The system automatically entered the numerical value, in degrees, and the participant was able to modify the value if desired.
In their analysis, the authors categorized the results three different ways:
- fixed width category with 10° increments;
- fixed width category with 30° increments; and
- a predefined category method with ranges such as <-5°, between -5° and 30°, >90°.
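To make the difference between these categorization schemes concrete, here is a minimal Python sketch. The function names and the predefined boundaries are my own illustration, not the authors' actual cut points for any particular joint:

```python
import math

def bin_fixed(angle_deg, width):
    """Assign a joint angle to a fixed-width category index.
    width=10 or width=30 mirrors the study's two fixed schemes."""
    return math.floor(angle_deg / width)

def bin_predefined(angle_deg, boundaries=(-5, 30, 90)):
    """Assign a joint angle to a predefined-range category.
    The boundaries here are illustrative placeholders only."""
    for i, b in enumerate(boundaries):
        if angle_deg <= b:
            return i
    return len(boundaries)

# A coarser scheme merges bins, so two raters whose estimates differ
# slightly can still land in the same category:
bin_fixed(37, 10), bin_fixed(42, 10)   # different 10° bins (3 vs. 4)
bin_fixed(37, 30), bin_fixed(42, 30)   # the same 30° bin (1 and 1)
```

This illustrates why wider categories tend to produce higher agreement: small estimation differences between raters are absorbed by the bin width.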
A great deal of discussion is provided regarding the various statistical tests that can be applied to characterize and understand interrater reliability. The authors reject the kappa statistic used by some other researchers in favor of raw percentage agreement among participants, which is easy and straightforward to calculate and understand, and the intraclass correlation coefficient (ICC), which provides a more complex analysis but is sensitive to the postural variation from frame to frame: jobs with greater postural variation may yield higher ICCs than those with little variation, even when participants demonstrate a high percentage of agreement.
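As a concrete illustration of the raw percentage-agreement measure the authors favor, here is a minimal sketch. The function name and data layout are my own, not code from the paper:

```python
def percent_agreement(ratings_a, ratings_b):
    """Raw percentage agreement between two raters: the share of
    frames on which both assigned the same posture category."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both raters must score the same frames")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)

# Hypothetical category assignments for four video frames:
print(percent_agreement([1, 1, 2, 3], [1, 2, 2, 3]))  # 75.0
```

Unlike the ICC, this measure does not depend on how much the postures vary from frame to frame, which is one reason the authors found it easier to interpret.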
Bao, Stephen; Howard, Ninica; Spielholz, Peregrin; Silverstein, Barbara; Polissar, Nayak. Interrater Reliability of Posture Observations. Human Factors, Volume 51, Number 3, June 2009, pp. 292-309(18). Retrieved January 25 from http://www.ingentaconnect.com/content/hfes/hf/2009/00000051/00000003/art00003.