In summary, based on the predefined cut points for Kappa and ICC values, clinically acceptable inter-rater reliability was found for most hypermobility tests, inter-segmental mobility with pain response and lumbar end range pain in flexion and extension. Results for FFD in forward flexion were difficult to interpret, and all other test variables showed poor to slight Kappa values or unacceptable ICCs and LoA for inter-rater assessment. On average, the intra-rater values lay midway between the inter-rater values and 1.00, indicating discrepancies both between the observers and within a single observer.
The two tests for scoliosis did not perform well in our study, with slight inter-rater reliability and moderate intra-rater reliability with very wide CI. The examiners, however, did not report any contradictory results, where one assessed a higher left and the other a higher right shoulder (data not shown). To our knowledge, no comparable studies have investigated the reliability of the shoulder height difference test. One study evaluated the reliability of Adam’s Forward Bend Test in a population already diagnosed with scoliosis. It reported a Kappa value somewhat higher than in the current study; however, this is likely explained by the difference in study populations. The prevalence in that study exceeded 70%, whereas in our study population it was around 10%, resulting in small cell sizes and thus less trustworthy Kappa values. The poor result for the Adam’s Forward Bend Test might also have been due to the subjects wearing a shirt during the assessment.
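The dependence of Kappa on prevalence can be illustrated with a minimal sketch. The counts below are hypothetical and are not data from either study; they are chosen only so that raw agreement is 90% in both tables while prevalence is roughly 10% in one and 70% in the other. The function name `cohen_kappa` and its parameters are illustrative assumptions.

```python
def cohen_kappa(both_pos, only_r1, only_r2, both_neg):
    """Cohen's kappa for a 2x2 table of two raters' yes/no findings."""
    n = both_pos + only_r1 + only_r2 + both_neg
    observed = (both_pos + both_neg) / n            # raw agreement
    r1_pos = (both_pos + only_r1) / n               # rater 1 positive rate
    r2_pos = (both_pos + only_r2) / n               # rater 2 positive rate
    # Agreement expected by chance from the marginal rates
    expected = r1_pos * r2_pos + (1 - r1_pos) * (1 - r2_pos)
    return (observed - expected) / (1 - expected)

# Hypothetical counts (NOT study data): 100 subjects, 90% raw agreement in both.
kappa_low_prev = cohen_kappa(both_pos=5, only_r1=5, only_r2=5, both_neg=85)
kappa_high_prev = cohen_kappa(both_pos=65, only_r1=5, only_r2=5, both_neg=25)

print(f"prevalence ~10%: kappa = {kappa_low_prev:.2f}")
print(f"prevalence ~70%: kappa = {kappa_high_prev:.2f}")
```

With identical raw agreement, the low-prevalence table yields the lower Kappa, because chance agreement on negative findings is already high and the few positive cells carry most of the weight.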
Most of the tests for hypermobility were reliable. The exceptions were the tests for knee and fifth finger extension. It should be noted that the prevalence of positive findings for these two tests was very low, with cell sizes ranging from 1 to 10, which could contribute to the low Kappa values. Knee hyperextension was probably influenced by the subjects wearing pants while being assessed. Earlier studies have shown good reliability when the tests were evaluated as a whole (index sum score), but these studies were either performed on adults or used other statistical methods [11–14]. Since the cut-off level for a positive index score is debatable, we calculated reliability using the Beighton score with cut-off points at both 4 and 5, and found moderate to almost perfect agreement for both, which is in line with another study using the same cut-off points. However, we observed that even if the raters agreed at a cut-off point of 5, they did not necessarily agree on which joints contributed to this score. For example, in the inter-rater reliability study, one rater scored one elbow, two thumbs and two fifth fingers while the other scored two knees, two elbows and two fifth fingers; that is, they agreed on a positive finding in only three of the nine items, yet both recorded a Beighton score ≥ 5. Although this might discredit the individual tests, it shows that the index score is robust.
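The example above can be retraced in a few lines. This is only a sketch: the item labels are hypothetical, and which elbow the first rater scored as positive is an assumption, since the text does not specify sides.

```python
# Positive Beighton items per rater, following the example in the text.
# Labels are hypothetical; the side of rater A's single elbow is assumed.
rater_a = {"elbow_L", "thumb_L", "thumb_R", "finger5_L", "finger5_R"}
rater_b = {"knee_L", "knee_R", "elbow_L", "elbow_R", "finger5_L", "finger5_R"}

CUT_OFF = 5  # one of the two cut-off points used in the study

score_a, score_b = len(rater_a), len(rater_b)
shared_positive = rater_a & rater_b  # items both raters scored positive

both_hypermobile = score_a >= CUT_OFF and score_b >= CUT_OFF
print(score_a, score_b, both_hypermobile)
print(sorted(shared_positive))
```

Both raters classify the subject as hypermobile at a cut-off of 5, although only three positive items are shared between them.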
Some studies demonstrated excellent inter- and intra-rater reliability for FFD in both forward and lateral flexion [15–18]. However, these studies used adult subjects and, furthermore, applied a different approach to performing the tests and/or different statistical measures. The inter-rater reliability of FFD in forward flexion was also high in our study, indicating an ability of this test to distinguish subjects from each other. However, the LoA were wide compared with the range of assessments, implying that the scores of repeated measurements differed substantially. Because ICC is affected by the total variance, a high variance in our subject population could somewhat obscure the measurement error in the ICC value, explaining why a clinically unacceptable LoA is accompanied by a high ICC. This means that the positive results should be interpreted with caution.
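How a high ICC can coexist with wide LoA may be easier to see in a small constructed example. The sketch below uses made-up measurements (not study data) in which the rater differences are identical in two samples, while the between-subject spread differs. It assumes a one-way random-effects ICC for single measurements, which may not be the ICC model used in this study; the function names are illustrative.

```python
from statistics import mean, stdev

def limits_of_agreement(r1, r2):
    """Bland-Altman 95% limits of agreement for paired measurements."""
    diffs = [a - b for a, b in zip(r1, r2)]
    m, s = mean(diffs), stdev(diffs)
    return m - 1.96 * s, m + 1.96 * s

def icc_oneway(r1, r2):
    """One-way random-effects ICC, single measurement, two raters (assumed model)."""
    n = len(r1)
    subj_means = [(a + b) / 2 for a, b in zip(r1, r2)]
    grand = mean(subj_means)
    # Between-subject and within-subject mean squares for k = 2 raters
    msb = 2 * sum((m - grand) ** 2 for m in subj_means) / (n - 1)
    msw = sum((a - m) ** 2 + (b - m) ** 2
              for a, b, m in zip(r1, r2, subj_means)) / n
    return (msb - msw) / (msb + msw)

# Made-up values in cm: the same rater offsets are applied to both samples.
offsets = [2, -2, 2, -2, 2, -2, 2, -2]
wide = [0, 5, 10, 15, 20, 25, 30, 35]       # heterogeneous subjects
narrow = [14, 15, 16, 17, 18, 19, 20, 21]   # homogeneous subjects

def two_raters(true_values):
    r1 = [t + o for t, o in zip(true_values, offsets)]
    r2 = [t - o for t, o in zip(true_values, offsets)]
    return r1, r2

loa_wide = limits_of_agreement(*two_raters(wide))
loa_narrow = limits_of_agreement(*two_raters(narrow))
icc_wide = icc_oneway(*two_raters(wide))
icc_narrow = icc_oneway(*two_raters(narrow))

print(f"wide sample:   ICC = {icc_wide:.2f}, LoA = {loa_wide[0]:.1f} to {loa_wide[1]:.1f}")
print(f"narrow sample: ICC = {icc_narrow:.2f}, LoA = {loa_narrow[0]:.1f} to {loa_narrow[1]:.1f}")
```

The LoA are the same in both samples because the rater differences are the same, whereas the ICC is much higher in the heterogeneous sample, where between-subject variance dominates the measurement error.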
The poor reliability and large measurement error in lateral flexion in our study might be related to the difficulty in performing a pure spinal lateral flexion. We suggest modifying this test to have the subject standing against a wall during the assessment, as performed in another study. This would probably reduce the negative influence of combined flexion/extension or rotation with lateral flexion on measurement accuracy, as was observed in many cases.
Regarding the Schober test, a study using a similar approach in adult men with known ankylosing spondylitis showed excellent reliability: ICC = 0.93 and 0.96 for inter- and intra-rater reliability, respectively. For intra-rater reliability and measurement error, both ICC and LoA differed substantially between the raters, probably because Rater 2 mistakenly rounded measurements up or down to the nearest whole cm; for example, the LoA were -1.3 to 1.3 for Rater 1 vs. -2.9 to 2.8 for Rater 2. We believe that the results in our study may also have been negatively influenced by a slight variation in starting position due to lack of agreement on how to locate the bony landmarks, a difficulty described elsewhere.
Assessment for inter-segmental mobility showed better reliability when assessing for pain than when assessing for restricted movement. In general, the intra-rater reliability was higher than the inter-rater reliability. This is consistent with an earlier review.
The outcome for end range pain in the cervical and lumbar spine showed inconsistent results. Again, better results were seen with intra-rater assessment. An earlier study evaluating inter-rater reliability for end range pain in the cervical spine also demonstrated high variability of results and unacceptable ICCs for most of the variables. Lumbar pain in extension scored the highest Kappa values, but pain in this position was relatively common, reaching a prevalence of 28% (cell size: 30), while a maximum of 8% (cell size: 9) was reported for pain in the other lumbar and cervical movement directions. This could partly explain the difference in Kappa values. However, there were also some difficulties and discrepancies in the performance of these tests, which led to decreased reliability. The observers noted frequent uncertainty when classifying the responses. Some subjects used the term “soreness”, which the raters interpreted differently. We believe that there is more to be gained with refined practice and further standardisation of these tests.
The study’s main strength is its school-based population which reflects the target population, i.e. the age where the prevalence of spinal pain escalates. In addition, we nearly reached a 100% participation rate with an almost equal distribution between genders, minimising possible bias due to gender disproportion.
The pre-study training session, the presence of the observers and our use of two well-experienced chiropractors as raters could have contributed to the relatively high reliability measures of this study. One could argue that this is not representative of the true situation in a school screening setting; however, we think the tests are easy to perform and relatively easy to interpret (“yes”/“no”), and that they require no special skills except for inter-segmental mobility, where long-term experience is beneficial. One would assume that, after a few training sessions, the tests could be used by any practitioner dealing with spinal examinations.
The major limitation of our study is the sample size. Although we exceeded our goal of 100 subjects, the prevalence of positive findings was very low for some of the tests. We believe that a larger sample, and thus more positive findings, would yield more precise reliability estimates, which could be either higher or lower than those estimated in this study. Despite the standardisation of the tests, the observers occasionally noted discrepancies in the instructions given for performing a maximal forward flexion. This could have resulted in less precise estimates of ICC and measurement error for forward flexion finger-floor-distance and the Schober test than if the protocol had been followed strictly. On the other hand, the same would probably occur in a screening setting at schools or in clinics, and our results therefore probably give a more realistic estimate of the tests’ reliability and measurement error. The relatively short interval between examinations in the intra-rater part of the study could also be a limitation, as it increases the risk of raters recalling an individual’s test results from the earlier examination. However, we judged this influence to be less detrimental than the risk of changes in the subjects’ biomechanical state over a longer interval between the two examinations, during which injuries and/or new onsets of spinal pain could occur. Furthermore, the large battery of tests in the protocol and the many subjects assessed between the first and the second examination of each subject are deemed to have minimised the risk of memory bias. The time limit could also have negatively affected our results. The raters, however, believed that their performance would not have differed under other time conditions, as they seldom needed the full 4 minutes to complete a single examination. We therefore consider the time limit to be of minor relevance.