Inter- and intra-observer reliability of clinical movement-control tests for marines
© Monnier et al.; licensee BioMed Central Ltd. 2012
Received: 9 March 2012
Accepted: 23 December 2012
Published: 29 December 2012
Musculoskeletal disorders, particularly in the back and lower extremities, are common among marines. Movement-control tests are considered clinically useful for screening and follow-up evaluation in this group. However, few studies have addressed the reliability of such clinical tests, and no such published data exist for marines. The present aim was therefore to determine the inter- and intra-observer reliability of clinically convenient tests emphasizing movement control of the back and hip among marines. A secondary aim was to investigate the sensitivity and specificity of these clinical tests for discriminating musculoskeletal pain disorders in this group of military personnel.
This inter- and intra-observer reliability study used a test-retest approach with six standardized clinical tests focusing on movement control for back and hip. Thirty-three marines (age 28.7 yrs, SD 5.9) on active duty volunteered and were recruited. They followed an in-vivo observation test procedure that covered both low- and high-load (threshold) tasks relevant for marines on operational duty. Two independent observers simultaneously rated performance as “correct” or “incorrect” following a standardized assessment protocol. Re-testing followed 7–10 days thereafter. Reliability was analysed using kappa (κ) coefficients, while discriminative power of the best-fitting tests for back- and lower-extremity pain was assessed using a multiple-variable regression model.
Inter-observer reliability for the six tests was moderate to almost perfect, with κ-coefficients ranging from 0.56 to 0.95. Three tests reached almost perfect inter-observer reliability, with mean κ-coefficients > 0.81. However, intra-observer reliability was fair to moderate, with mean κ-coefficients between 0.22 and 0.58. Three tests achieved moderate intra-observer reliability, with κ-coefficients > 0.41. Combinations of one low- and one high-threshold test best discriminated prior back pain, but results were inconsistent for lower-extremity pain.
Our results suggest that clinical tests of movement control of back and hip are reliable for use with marines in screening protocols that involve several observers. However, test-retest reproducibility was lower, which should be considered in follow-up evaluations. The results also indicate that combinations of low- and high-threshold tests have discriminative validity for prior back pain, but were inconclusive for lower-extremity pain.
Musculoskeletal disorders, especially in the back and lower extremities, are common in marines [1, 2], both during their basic military training and later during service. For marines, as for many other military branches, this could reduce operational efficiency and end their service prematurely [4, 6–8]. Since such problems commonly appear to lead to shift changes, the use of back-up personnel or increased workloads for remaining personnel, one marine's disorders could affect the operational efficiency of an entire unit. Back and lower-extremity disorders have also been found to be a major contributor to the reduction of marine unit strength before deployment, due to medical downgrading of the sufferers to non-deployment status. Further, during deployment, musculoskeletal disorders are the most common cause of medical evacuation [9, 10], and marines who suffer incidents of musculoskeletal disorder or spinal pain show less than 20% likelihood of returning to operational duty. Early physical screening tests focusing on recruits' musculoskeletal health and function in relation to military duty are commonly used in modern armed forces. In the literature, however, no reliability data exist for such clinical tests in marines.
Several studies in civilian populations have demonstrated a link between musculoskeletal disorders, pain and the ability to adequately control movements and muscular activation in clinical tests [11–13]. Some of these clinical tests are designed to focus on movement control of a defined body region while an adjacent region is actively moved. Such tests of movement control, also referred to in the literature as motor-control tests [14–16] or as low-load or low-threshold movement-control tests, are suggested for identifying deficits associated with repetitive low-load activity or static positioning. It is suggested that such non-fatiguing movement-control tests predominantly recruit slow motor units activated at a low threshold. Clinical tests that include high load or speed, on the other hand, involve recruitment of fast motor units, which are activated at a higher threshold and are less resistant to fatigue. These high-threshold tests have therefore been suggested for identifying the risk of injuries in activities involving fatiguing or repeated high loads. In our experience, based on clinical findings and empirical field observation with marines, tests covering low- and high-threshold movement control of the lower back and hip may adequately challenge weak links in marines' musculoskeletal system that are relevant to their operational duty. We believe such assessments are suitable for inclusion in screening protocols for deficits that may relate to musculoskeletal disorders induced by exposures from various work tasks or postures in marines. Therefore, the tests included in this study were selected to evaluate marines' ability to control or prevent defined movements of the lumbar spine and hip while performing specific lower-extremity movements.
However, as the results of clinical testing may influence the testee's future service or career, e.g. by resulting in medical downgrading, it is of great importance that the tests are reliable and valid for the specific group and purpose. Specifically, since screening tests for military purposes are commonly used by multiple testers, good inter-observer reliability is required. If a test is to be used in follow-up evaluations, it also needs to show good intra-observer reproducibility. Here, three important aspects influence measurement variability: variation related 1) to the observer(s), 2) to the instrument and the measuring procedure, and 3) to the subject tested.
Further, a clinically convenient test should show good validity, i.e. measure the entity that it purports to measure. Specifically, evidence of discriminative validity is required to justify the use of clinical tests of musculoskeletal pain, i.e. how well the tests differentiate between those with back and lower-extremity pain and those without. Such testing accuracy may complement simple pain ratings, particularly in the early recognition of disorders, and possibly in planning further clinical examination and intervention. Useful clinical tests that aim to screen marines' physical function also need to be simple and fairly brief, since many personnel are generally tested. In addition, methodological evaluations of clinical tests should preferably take account of clinical contextual factors that reflect the testees' natural environment. Although clinical experience suggests that findings of impaired movement control relate to musculoskeletal disorders and pain episodes in the back and lower extremities, we have found no studies on movement-control tests that address such discriminative validity. Further, a few studies in civilian populations (subjects with back pain, subjects with musculoskeletal pain but not back pain, and healthy controls) have evaluated the reliability of movement-control tests for the lower back and/or extremities, with inter-tester reliability ranging from poor to almost perfect/excellent [14, 15, 21–23] and intra-tester reliability from fair to excellent. However, to our knowledge, there are no such published data on the reliability of clinical tests in marines. The present aim was therefore to determine the inter- and intra-observer reliability of clinically convenient tests for assessing movement control of back and hip in marines.
A secondary aim was to investigate the discriminative validity of the best fitting combination of tests for identifying back and lower-extremity pain disorders in this group of military personnel.
This inter- and intra-observer reliability study used a test-retest approach and in-vivo testing methodology. The study protocol included six standardized clinical tests that emphasize active movement control for back and hip. Performances were scored simultaneously by two experienced physiotherapists (observers) who were familiar with the tests. The observers were blinded to each other's scores and to the subjects' health and background information. The procedure was repeated 7–10 (mean = 7.4) days later. Each of the six tests was assessed as “correct” (pass) or “incorrect” (fail), thus generating binomial (pass/fail) data. Based on this, the required sample size was calculated to be approximately 34 subjects at a presumed agreement of 90% (CI 20%; chance agreement: 50%), and enrolment was planned to meet this criterion. Written informed consent was obtained from all subjects, who received both written and oral information prior to participation. Confidentiality and voluntary participation were strongly stressed. The study was approved in advance by the Regional Medical Research Ethics Committee, Stockholm.
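The agreement figures assumed in the sample-size calculation can be illustrated with a small sketch of Cohen's kappa for two observers rating pass/fail. The function names and the 20 example ratings below are hypothetical and are not study data; the interpretation bands follow the widely used Landis and Koch scale, which we assume (it is not stated here) underlies labels such as "moderate" and "almost perfect".

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two observers rating the same subjects."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: share of subjects given the same rating by both.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each observer's marginal rating frequencies.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


def landis_koch_label(kappa):
    """Verbal label for kappa (Landis and Koch bands; an assumption here)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical ratings: 90% observed agreement, 50% chance agreement.
observer_a = ["pass"] * 10 + ["fail"] * 10
observer_b = ["pass"] * 9 + ["fail"] + ["fail"] * 9 + ["pass"]
kappa = cohens_kappa(observer_a, observer_b)
print(round(kappa, 2), landis_koch_label(0.56), landis_koch_label(0.95))
```

With balanced pass/fail rates, 90% observed agreement against 50% chance agreement corresponds to a kappa of about 0.8, which shows how far kappa can sit below raw percentage agreement.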
Thirty-three marines on active duty (assault infantry, combat craft crews and coastal rangers) were recruited from a combined company of the 2nd Amphibious Battalion, 1st Marine Regiment, Berga, Sweden, the main marine regiment in Sweden. Eligible subjects had to be in service during the test period. Excluded were subjects on limited duty due to illness (full- or part-time sick leave) and subjects temporarily posted or under training at the 2nd Amphibious Battalion. After receiving oral and written information, volunteering subjects were enrolled in the study and scheduled for testing. Of the 33 subjects enrolled, 32 were male and one female. Means (SD) for age, weight and height were: 28.7 (5.9) yrs, 82.5 (9.4) kg and 1.81 (0.059) m.
Standardized self-report questionnaires were used to collect demographic information and medical history for the previous six months, including numerical rating scales of ‘pain at present’ and pain for ‘the previous six months’, specified by anatomical body region. The numerical pain-rating scale has been found reliable and sensitive for the assessment of pain and has been suggested to be appropriate for use in clinical practice and research. For the purpose of this study, back (lumbar, thoracic) and lower-extremity (hip/thigh, knee, ankle/foot regions) pain was defined as any reported pain experience (pain, ache or discomfort); for pain at present, this meant a rating of ≥ 1 on the numerical pain-rating scale.
After filling in the initial questionnaires, each subject was instructed to wear only underwear so that movements of the lumbar spine, hips and lower extremities could be properly observed. The tests were performed in a standardized order (as shown in Figures 1–6). At test one, the two observers alternated as instructor for every other enrolled subject as scheduled (the starting observer was randomized for the first subject), and this order was kept in the re-testing procedure. First, the subjects received standardized oral instructions for the test, and the instructing observer also demonstrated the test (see Additional file 1 for examples). All subjects then performed one trial with feedback and, if needed, received further visual, oral and/or manual instructions or guidance from the instructing observer. This was done to ensure full understanding of the test performance and to ensure that the test result did not reflect the subject's unfamiliarity with the movement. Subjects then performed the tests, and the performance was assessed simultaneously by the two observers using a standardized assessment rating protocol (see Additional file 1 for details) and then dichotomized (i.e. “fail” if one direction in any region was uncontrolled). The test protocol took approximately 30 minutes to conduct. To familiarize themselves with the test procedure and protocol, the observers trained together on in-vivo observations of a total of nine subjects, spread over two occasions before the study started. On these occasions, the observers discussed and synchronized test performance, instructions and rating procedure. During the testing, however, there was no such communication, and the observers were blinded to each other's scores.
Regression models were compared using the Akaike information criterion (AIC):

$$\mathrm{AIC} = 2k - 2L$$

where k is the number of variables (or tests) in the statistical model and L is the log likelihood of the model. A regression model with more than three variables/tests or a sensitivity of less than 60% was considered to be of limited value. A subject with both back and lower-extremity pain was analysed in both groups. To check for systematic differences between tests one and two (systematic bias), McNemar analyses were applied, with a p value below 0.05 considered significant.
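The two calculations named in this paragraph can be sketched as follows. This is a minimal illustration and not the authors' analysis code: the log-likelihood values and discordant-pair counts are invented, and the exact (binomial) form of the McNemar test is our assumption, since the variant used is not specified.

```python
from math import comb


def aic(num_variables, log_likelihood):
    """Akaike information criterion, AIC = 2k - 2L (L = log likelihood)."""
    return 2 * num_variables - 2 * log_likelihood


def mcnemar_exact(pass_then_fail, fail_then_pass):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Only subjects whose rating changed between test one and test two
    contribute; under no systematic bias each direction has probability 0.5.
    """
    n = pass_then_fail + fail_then_pass
    if n == 0:
        return 1.0
    smaller = min(pass_then_fail, fail_then_pass)
    tail = sum(comb(n, i) for i in range(smaller + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical candidate models: (number of tests, log likelihood).
candidates = {"one test": (1, -20.0), "two tests": (2, -17.5),
              "three tests": (3, -17.2)}
scores = {name: aic(k, ll) for name, (k, ll) in candidates.items()}
best = min(scores, key=scores.get)  # lowest AIC is preferred
print(best, scores[best])

# Hypothetical change pattern: 1 subject passed then failed, 8 the reverse.
print(mcnemar_exact(1, 8) < 0.05)
```

A lower AIC indicates a better trade-off between fit and the number of tests in the model, so adding a third test is only worthwhile if it raises the log likelihood by more than one unit.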
Table 1. Rated back and lower-extremity (LE) pain the previous six months and at present (n = 33). Columns: pain previous six months; pain at present. Rows: only back pain; only LE pain; back and LE pain.
Table 2. Inter-observer reliability: kappa coefficient, 95% confidence intervals, percent agreement and standard error. Columns include kappa coefficient (95% CI) and mean kappa coefficient.
Table 3. Intra-observer reliability: kappa coefficient, 95% confidence intervals, percent agreement and standard error. Columns include kappa coefficient (95% CI) and mean kappa coefficient.
Table 4. Test results presented for each observer/test and number of cases identified that failed the test. Sub-heading: number of cases identified at test one with pain that failed the test. Columns: back pain prev. 6 months/at present; LE pain prev. 6 months/at present.
Table 5. Discriminative analysis: Akaike information criterion (AIC), p-value and sensitivity/specificity of model variables for pain ratings (previous six months).
We sought to determine the inter- and intra-observer reliability of six clinical tests intended for screening and following up marines' ability to perform accurate movement control. The tests had moderate-to-almost-perfect inter-observer reliability, while intra-observer reliability was fair to moderate. Discriminative regression revealed that combinations of low- and high-threshold tests had discriminative validity for previous back pain, but were inconclusive for lower-extremity pain.
Since the recruited marines were on active duty, and not recruited from subjects seeking care, the external validity extends only to a population of marines on active service. This was felt to be a strength of the study since the selected tests were intended and limited for this operational group. The results could however be of interest for researchers and clinicians alike, particularly for those working with similar military units. Further, we used an in-vivo study procedure similar to our clinical or preventive work in respect of settings and rating criteria, hence strengthening the ecological validity of the study protocol. Here, also, a large number of military personnel are usually tested and screened in a short time frame, and we therefore applied one practice round for each test. Some of our tests included sub-scores on observations and ratings for several body regions (SLKB+LL and DLL-ALE) and more than one direction of movement (SLKB+LL, DLL-L, DLL-ALE and DSLL). We believe, however, that when applying clinical tests for screening purposes, for logistical reasons an overall screening examination should be used, for example, to set priorities for further individual clinical action, but also to collect data for epidemiological analyses and follow-up. Notably, the procedure with two observers scoring the same subject simultaneously, here with one observer instructing the subject, limits the inter-observer reliability to test performance only. Considering our discriminative regression, the number of subjects was rather small (n = 33), and this was why we pooled lumbar and thoracic pain as back pain, and hip/thigh, knee, ankle/foot pain as lower-extremity pain, respectively, in this analysis. Further, for defining pain at present, a cut-off of ≥1 NRS may seem low. 
However, our experience with marines is that they underestimate their level of pain, as has also been found in other groups, and therefore a cut-off point of 1, which equals “any pain experience”, was selected. In addition, a study on US Army soldiers showed that the prevalence of back pain history may be underestimated in long-term recall surveys compared to monthly follow-ups. Developing a standardized operational definition of functional limitation, including pain ratings and pain interference with activity (operational efficiency), may improve the reliability of a future outcome construct in marines, and thus possibly its potential for discriminative and predictive validity.
Our data on inter-observer reliability ranged from moderate to almost perfect agreement (Table 2). While no such reliability data exist on movement-control tests in marines, our results on agreement between observers are consistent with [14, 15, 22], or somewhat better than [21, 23], most other reliability studies of movement control conducted in the civilian population. Here, Enoch et al. and Roussel et al. presented good (moderate-to-excellent) inter-observer reliability with their in-vivo collected data, though few tests were similar to ours in terms of test protocol (cf. BKFO and SB). We believe, however, that our results indicate that the present six clinical tests are reliable for use in screening programs with multiple observers in marines.
Our results on intra-observer reliability were fair to moderate. Surprisingly few studies report on the intra-observer reproducibility of movement-control tests, even though such clinical tests are commonly used for follow-up evaluation. The results of Luomajoki et al. ranged from fair to excellent intra-observer reliability for ten movement-control tests. Two of our tests (cf. BKFO and SB) were similar to theirs, though our corresponding kappa coefficients indicated lower reproducibility. However, their test-retest ratings were based on video recordings of one test occasion.
Interestingly, for two of the tests in the present study, i.e. DLL-L and DLL-ALE, more subjects “passed” the re-test procedure than the initial test occasion (Table 4). Such results may reflect a learning effect of the tests themselves (or systematic bias), which can only be revealed by observations from repeated testing. This probably also explains our lower kappa coefficients for these tests. There were no clear indications that any specific test was more difficult to instruct or evaluate in a way that would explain poor re-test reproducibility. Further, only one other study of movement-control tests discloses how many practice trials the subjects were allowed for each test. However, that study design did not include test-retest measurements, and thus no intra-observer reliability analyses. Even so, we believe that repeated practice rounds may reduce the learning effect and thus influence test reproducibility positively. In addition, improvement on the repeated test emphasizes the importance of including within-subject variation in test-retest data relevant for clinical interpretation. Future studies, however, need to consider a trade-off between a “realistic” number of practice rounds in relation to clinical work and sufficient elimination of learning effects.
One of the tests, the SLKB+LL, showed substantial inter-observer reliability at the re-test, with a kappa coefficient of 0.63, but with a percentage agreement as high as 94%. This discrepancy was probably due to the uneven numbers that passed and failed the test (Table 4), and it demonstrates how the kappa coefficient can be affected by such prevalence. Different types of adjustment for prevalence effects on kappa values have been discussed. For example, with the prevalence-adjusted bias-adjusted kappa (PABAK), the adjusted kappa may be calculated with a maintained level of agreement, hence creating a “hypothetical population” with an optimal distribution of the pass/fail ratio. Such adjusted coefficients may indeed add to the understanding of external validity extended to other populations, such as other military units or possibly civilian contexts. However, prevalence effects on kappa coefficients are themselves informative in a particular population and, within the present study aim, we elected to report conventional kappa only. Further, for the SLKB+LL and the DSLL, the lower 95% confidence limit of the kappa coefficient was below 0.2 on one test occasion each, thus indicating an increased risk of measurement error. For intra-observer reliability, this was also the case for most of the tests for both observers, here probably affected by the learning effects noted above. This should be considered in follow-up evaluation and interpretation with the present clinical tests.
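The prevalence effect described above can be made concrete with a short sketch that computes percentage agreement, conventional kappa and PABAK from a 2×2 table. The cell counts are hypothetical (they were chosen only to give about 94% agreement among 33 subjects and should not be read as the study's data), and the function name is ours; for two rating categories, PABAK reduces to 2 × observed agreement − 1.

```python
def agreement_stats(both_fail, a_fail_b_pass, a_pass_b_fail, both_pass):
    """Return (observed agreement, Cohen's kappa, PABAK) for a 2x2 table."""
    n = both_fail + a_fail_b_pass + a_pass_b_fail + both_pass
    observed = (both_fail + both_pass) / n
    # Marginal fail counts for observer A and observer B.
    a_fail = both_fail + a_fail_b_pass
    b_fail = both_fail + a_pass_b_fail
    expected = (a_fail * b_fail + (n - a_fail) * (n - b_fail)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    pabak = 2 * observed - 1  # two-category case
    return observed, kappa, pabak


# Hypothetical skewed table: almost everyone passes (31 of 33 agreements).
skewed = agreement_stats(both_fail=2, a_fail_b_pass=1,
                         a_pass_b_fail=1, both_pass=29)
# Hypothetical balanced table with the same number of agreements.
balanced = agreement_stats(both_fail=15, a_fail_b_pass=1,
                           a_pass_b_fail=1, both_pass=16)

for name, (po, kappa, pabak) in [("skewed", skewed), ("balanced", balanced)]:
    print(f"{name}: agreement={po:.2f} kappa={kappa:.2f} PABAK={pabak:.2f}")
```

Both tables have identical percentage agreement and PABAK, yet conventional kappa is markedly lower in the skewed table, because chance agreement is high when nearly all subjects receive the same rating.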
Regarding discriminative validity, our results indicate that combinations of low- and high-threshold movement-control tests had some discriminative validity for previous back pain, but not for present pain. Concerning lower-extremity pain, there were differences in sensitivity/specificity between observers A and B, also for tests included in the best-fitting model, thus limiting the discriminative power of these observations. While our experience is that the AIC-based regression rather accurately excludes tests that are unrelated to the dependent variable, pre-selection of tests with good kappa coefficients might have strengthened our regression model. However, we believe our discriminative findings are an important complement to pain ratings, particularly since altered motor control may persist after pain relief and long-term recall of pain may be underestimated, as indicated above. Our results lend some support to the use and interpretation of test combinations rather than information from single tests. Since the BKFO and DSLL model discriminated prior back pain if the BKFO (low-threshold test) was passed and the DSLL (high-threshold test) was failed, the clinical and physiological implications of such results should be further validated. Even so, this is interesting since Roussel et al. showed that two movement-control tests could predict injury in the back or lower extremity over six months in professional ballet dancers. However, within the limits of the present study, the direction of causality is uncertain. In other words, does the pain experience cause certain results in movement control, or vice versa? Also, our results say little about future incidents, and we therefore believe that further research should address the predictive validity of movement-control tests for musculoskeletal disorders in marines. Such knowledge would certainly contribute to the evidence for use of such screening tests in this group of military personnel.
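How sensitivity and specificity follow from a combined test rule can be sketched as below, using the pattern described above (BKFO passed and DSLL failed counts as a positive screen for prior back pain). The ten subjects and all variable names are hypothetical, invented purely for illustration; the study's regression-based estimates were obtained differently and are not reproduced here.

```python
def sensitivity_specificity(screen_positive, has_condition):
    """Sensitivity and specificity of a binary screen against a reference."""
    pairs = list(zip(screen_positive, has_condition))
    true_pos = sum(s and c for s, c in pairs)
    false_neg = sum((not s) and c for s, c in pairs)
    true_neg = sum((not s) and (not c) for s, c in pairs)
    false_pos = sum(s and (not c) for s, c in pairs)
    if true_pos + false_neg == 0 or true_neg + false_pos == 0:
        raise ValueError("need at least one subject with and one without")
    return (true_pos / (true_pos + false_neg),
            true_neg / (true_neg + false_pos))


# Hypothetical pass/fail results and pain history for ten subjects.
bkfo_pass = [True, True, True, False, True, True, False, True, True, False]
dsll_pass = [False, False, True, False, False, True, True, True, False, True]
prior_back_pain = [True, True, True, False, False,
                   False, False, False, True, False]

# Combined rule: low-threshold test passed AND high-threshold test failed.
screen = [b and not d for b, d in zip(bkfo_pass, dsll_pass)]
sens, spec = sensitivity_specificity(screen, prior_back_pain)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this invented example the rule would clear the 60% sensitivity threshold that the study used to judge whether a model was of limited value.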
Clinical tests that emphasize movement control for back and hip had moderate-to-almost-perfect inter-observer reliability, indicating that these tests are reliable as screening tests using several observers with marines. However, test-retest reproducibility was not as accurate, with intra-observer reliability ranging from fair to moderate. This should be considered in follow-up evaluation. Our results also indicated that combinations of low- and high-threshold movement-control tests had discriminative validity for earlier back pain, but were inconclusive for lower-extremity pain. Further studies should emphasize predictive validity with clinically convenient tests for musculoskeletal disorders among marines.
Funding from the Swedish Armed Forces PhD programme and financial support from Svenska Militärläkarföreningen are gratefully acknowledged. We also thank the 1st Marine Regiment, Swedish Armed Forces, for funding and overall support, and we especially extend our thanks to the marines who participated. The funding organizations had no authority over or input into any part of the study. We would also like to thank Mark Comerford, Sarah Mottram and Movement Performance Solutions for their help with test design and modification.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.