This study investigated the interobserver reproducibility of the assessment of active and passive RoM of the knee in patients after TKA. The results show considerable variation in the agreement between observers for all movements tested. The narrowest limits of agreement were found for active extension. For flexion, the narrowest limits were found for the measurement of passive flexion in a sitting position with the hip flexed to 90°.
With respect to individual measurements over time, the ICC values have to be considered poor.
For group comparisons, however, the reliability of the flexion measurements was satisfactory, with the exception of the extension measurement. This may have been caused by the wider RoM when measuring flexion. The ICC depends on the range of the true quantity in a sample: if this range is wide, the correlation will be greater than if it is narrow [23].
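As an illustration of this dependence (the exact ICC model used may differ between studies), the ICC can be expressed as the ratio of between-subject variance to total variance:

\mathrm{ICC} = \frac{\sigma^2_{\mathrm{between}}}{\sigma^2_{\mathrm{between}} + \sigma^2_{\mathrm{error}}}

A wide range of true values increases the between-subject variance relative to the measurement error, which raises the ICC even when the error itself is unchanged.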
A similar explanation can be given for the narrower limits of agreement found for extension: the range of outcomes for extension is generally narrow.
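For reference, the 95% limits of agreement are commonly calculated from the mean difference between observers and the standard deviation of those differences, following Bland and Altman:

\mathrm{LoA} = \bar{d} \pm 1.96 \cdot s_d

On this reasoning, a narrow range of extension values constrains the size of the between-observer differences, keeping their standard deviation, and hence the limits of agreement, small.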
The results show that both observers measured less flexion RoM with patients in the supine position than in the sitting position. A possible explanation for this difference between positions may be that the supine position is less stable and allows more degrees of freedom of movement in both hip and knee than the sitting position, in which the upper leg is fixed on the examination table, as already indicated by Gajdosik [4].
Linden-Peters et al. [11] were the first to report on agreement in the measurement of knee RoM.
They reported better results for extension measurements. We believe this is probably caused by differences in the selection of the study population: our population had a wider range of extension, leading to wider limits of agreement.
Although the measurement procedure was standardised, differences in the effort exerted by the patient are a potential source of bias in these measurements, and measuring passive RoM adds the influence of the examiner as a further source of variation. The amount of force used by different examiners to reach full RoM may well influence the reproducibility of the measurements [4].
Unlike those of others [5, 10, 11], our results do not show any differences between active and passive measurements. This might be explained by the level of experience and the training of the participating therapists prior to the study, as well as by the standardisation of measurements used by all therapists working with TKA patients in our hospital. Another explanation may be that in the acute phase after surgery, patients tend to guide the passive measurement, for fear of severe pain at the extremes of flexion or extension. This may well cause measurements of passive and active RoM to be more alike.
Although we concentrated on measuring in a clinical situation, our results with respect to reliability are similar to those reported in the literature, which were obtained in a laboratory environment [5, 6, 10, 12, 17]. This may have been partly due to the use of a predefined testing procedure by experienced therapists, who had practised the procedures on healthy subjects and patients before the start of the study. Van Genderen et al. [16] also described positive effects of including these steps in the measurement design.
Hence, we believe that RoM measurement in a clinical situation is possible without loss of reliability. However, it may be argued that agreement, rather than reliability, is of greater importance in clinical measurement situations.
Limitations
The decision to use only two testers in this study may be debatable. For a correct simulation of everyday practice, we would have preferred to include all staff members involved in the follow-up of TKA patients at our clinic. We chose to include only two testers, however, because we believe that assessing agreement is of the utmost importance when analysing measurements in a clinical situation, and including limits of agreement as an outcome was only possible with two testers.
Our research focused on the differences between observers and did not include intraobserver reproducibility. We believe that studying intraobserver reproducibility would involve more interference with everyday practice, as individual observers would need to perform multiple measurements on the same patient and would have to be blinded to the outcome of each of their measurements. This would interfere with our intention to mimic clinical measurement procedures rather than create a laboratory environment for our measurements.
Despite our use of a standardised measurement procedure and pre-study training, we still found differences in RoM measurement between the two observers. Observer A consistently measured greater flexion RoM. We believe that, despite the training of the observers, this might still be caused by persistent differences between the testers in the choice of the fulcrum of rotation. Brosseau et al. [5, 9] already mentioned the choice of the fulcrum as being the Achilles heel of RoM measurement in the knee. This problem might be overcome by using the parallelogram goniometer introduced by Brosseau et al. [5, 9].
We did not standardise the amount of force used for the measurement of passive RoM. This may be one cause of the differences we found between observers.
Since this study was conducted in one physical therapy department, by therapists who measure RoM in TKA patients daily, the results may not necessarily be generalisable to all physical therapists.
Relation between reproducibility and responsiveness
To be useful for outcome assessment in clinical practice or research, an instrument should have high responsiveness, which is strongly related to the level of agreement [24]. Limits of agreement should be smaller than the minimum clinically relevant difference one wants to detect. With regard to clinical practice, the wide limits of agreement for all measurements in our study indicate that we should be very careful when comparing and interpreting results obtained by different examiners. With regard to research, we suggest using only one observer whenever possible. Unfortunately, practical reasons make it very difficult to investigate the level of intraobserver reproducibility. Future studies should investigate whether further standardisation or the use of other measurement tools, such as a parallelogram goniometer, might lead to narrower limits of agreement.
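As an illustrative benchmark for such studies (assuming no systematic difference between observers), the smallest detectable change for an individual patient can be approximated from the standard error of measurement (SEM):

\mathrm{SDC} \approx 1.96 \cdot \sqrt{2} \cdot \mathrm{SEM}

Agreement can be considered sufficient for individual follow-up when this value, or equivalently the half-width of the limits of agreement, is smaller than the minimum clinically relevant difference.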