Reliability, construct and discriminative validity of clinical testing in subjects with and without chronic neck pain



The reliability of clinical tests for the cervical spine has not been adequately evaluated. Six cervical clinical tests, which are low cost and easy to perform in clinical settings, were tested for intra- and inter-examiner reliability, and two performance tests were assessed for test-retest reliability in people with and without chronic neck pain. Moreover, construct and between-group discriminative validity of the tests were examined.


Twenty-one participants with chronic neck pain and 21 asymptomatic participants were included. Intra- and inter-reliability were evaluated for the Cranio-Cervical Flexion Test (CCFT), Range of Movement (ROM), Joint Position Error (JPE), Gaze Stability (GS), Smooth Pursuit Neck Torsion Test (SPNTT), and neuromuscular control of the Deep Cervical Extensors (DCE). Test-retest reliability was assessed for Postural Control (SWAY) and Pressure Pain Threshold (PPT) over tibialis anterior, infraspinatus and the C3-C4 segment.


Intraclass Correlation Coefficient (ICC) for intra- and inter-examiner reliability was highest for ROM (range: 0.80 to 0.94), DCE (0.75 to 0.90) and CCFT (0.63 to 0.86). JPE had the lowest ICC (0.02 to 0.66). Intra- and inter-examiner reliability for GS and SPNTT showed kappa ranging from 0.66 to 0.92, and 0.57 to 0.78 (prevalence adjusted), respectively. For the test-retest study, ICC was 0.83 to 0.89 for PPT and 0.39 to 0.79 for SWAY. Construct validity was satisfactory for all tests, except JPE. Significant between-group discriminative validity was found for CCFT, ROM, GS, SPNTT and PPT; however, differences were within the limits of the minimal detectable change.


The majority of the tests evaluated showed satisfactory reliability and construct validity supporting their use in the clinical evaluation of patients with chronic neck pain.


Musculoskeletal disorders are the most common form of long-term illness and neck pain is a frequent complaint [1]. The point prevalence of neck pain is around 20% [2, 3] and the one-year prevalence around 35% [2, 4].

People with chronic neck pain present with a number of objective findings including alterations in the structure and function of the deep cervical flexor [5, 6] and extensor muscles [7, 8], reduced range of neck motion [9], proprioceptive deficits [10, 11], oculomotor disturbances [12, 13], impaired postural control [14, 15], and general sensitization of the central nervous system [16, 17].

Several clinical tests have been described to test for these deficits; however, the reliability of such tests has not been adequately evaluated, or has only been evaluated when implemented with advanced technologies that would not be available within a clinical setting. For instance, several studies conducted on people with chronic neck pain, cervicogenic headache or asymptomatic controls [18–21] have found satisfactory intra- and inter-examiner reliability of the cranio-cervical flexion test, a low-load test measuring the patient’s ability to activate the deep cervical flexor muscles. However, a systematic review concluded that the reliability of this test was below the acceptable level [22]. In contrast, no standardized clinical test has been described to test the neuromuscular control of the deep cervical extensors. Reliability of measuring repositioning error during tests of relocation accuracy has been examined in people with whiplash-induced neck pain and in asymptomatic controls with advanced equipment only, and the results are conflicting, ranging from high levels of reliability [23] to very low levels [24]. Gaze stability and the smooth pursuit neck torsion test, tests of oculomotor control, have been widely described and applied in the assessment of people with neck pain [12, 25]. The test-retest reliability of gaze stability has been reported to be fair to good in asymptomatic controls when using wireless 3D sensors to monitor neck movement [12, 26]. Postural control has been examined extensively on a force platform, but only one study has evaluated the reliability of measuring postural control in adults in a clinical setting (using a Wii Balance Board), and this study evaluated healthy individuals only [27]. Satisfactory reliability has been reported for the measure of pressure pain threshold (PPT) using a hand-held algometer in patients with acute neck pain [28]; however, this has not been replicated in patients with chronic neck pain.

Thus, although widely used clinically, very few clinical tests applied during the assessment of a person with chronic neck pain have been evaluated for their reliability.

This study, therefore, investigates intra- and inter-examiner reliability of six clinical tests: Cranio-Cervical Flexion Test (CCFT), Cervical Range of Movement (ROM), Joint Position Error (JPE), Gaze Stability (GS), Smooth Pursuit Neck Torsion Test (SPNTT) and neuromuscular control of the deep cervical extensors (DCE). It also investigates test-retest reliability of postural control (SWAY) and Pressure Pain Threshold (PPT) in patients with and without chronic neck pain. As a secondary aim, the construct and between-group discriminative validity of these tests are examined.



The study was a reproducibility study of six clinical tests with two examiners, and a test-retest study of two physical performance tests, conducted by a third examiner. The study followed a strict three-phase reproducibility protocol, including a “training”, an “overall agreement”, and a “study” phase, as recommended for nominal and ordinal data [29]. Since the clinical tests included primarily ratio interval data, the protocol was adjusted to a two-phase study by excluding the overall agreement phase [30]. This standardized protocol included a case as well as a control group, to confirm that both groups could be tested reliably. Tests were described in detail by examiner C (Experienced Physiotherapist (PT) and Manual Therapist) during phase one. Afterwards, examiner A and B (final year bachelor PT students), and examiner C tested 10 subjects with and without neck pain in an open study, to become familiar with and to standardise and equalise the test procedure and interpretation of results. During phase two, the examiners applied all tests on included subjects. Examiners were blinded to the status of the subjects, except for examiner C, since this examiner was involved in the recruitment of cases and controls. Although examiner C was aware of the subject’s status, this examiner only performed the PPT and SWAY tests, which are two fairly objective tests, thus limiting potential bias. All examiners were mutually blinded to the results of other examiners.

Study sample

Patients were recruited at physiotherapy clinics and controls via local advertisements. Inclusion criteria for neck pain patients: adults (>18 years), neck pain >6 months, reduced neck function (Neck Disability Index; NDI, minimum 10/50), pain primarily in the neck, and ability to read and understand Danish.

Inclusion criteria for controls: no present pain in neck, shoulder, elbow or hand, no neck pain lasting more than one week during the last year, matched on gender and age (+/-3 years to one of the patients), and ability to read and understand Danish.

Exclusion criteria for both groups: neuropathies/radiculopathies (defined by positive Spurling, cervical traction and plexus brachialis tests) [31], neurological deficits, being in an unstable social and/or working situation, pregnancy, known fractures, and depression according to the Beck Depression Inventory (BDI) (score >29) [32].

Overall, 31 patients with chronic neck pain were recruited, and 21 were included in the final sample (Figure 1). All 21 controls were matched on gender and age (+/-3 years). Subjects received oral and written information about the project and gave their written informed consent to participate. The Regional Scientific Ethical Committee of Southern Denmark approved the study (S-20100069). The study conformed to The Declaration of Helsinki 2008.

Figure 1. Participant flow and retention.

Questionnaires and self-reported outcomes

Subjects completed self-reported questionnaires prior to enrolment, and demographics (age, gender, height and weight, type of accident, medication, symptom development over the last two months, employment and educational status) were registered. Questionnaires included the NDI (range: 0 to 50) [33], Medical Outcomes Study Short Form 36 (SF36) (range: 0 to 100), with emphasis on the Physical Component Score (SF36-PCS) [34, 35], Numeric Rating Scale (NRS) for present pain (P.P.) and average pain during the last week (Week) [36], and Modified Global Perceived Effectiveness (GPE) to evaluate stability of the condition with the question: “Compared to your first visit, how would you describe your neck today?” (-5 = vastly worse, 0 = unchanged, and 5 = completely recovered) [37]. Only subjects answering 0, representing unchanged, were included for intra-examiner and test-retest reliability.

Clinical tests

Subjects were not allowed to practice the tests, except for the tests of neuromuscular control (CCFT and DCE). For the CCFT, subjects performed three practice trials, and for the DCE one practice trial of a maximum of 30 s. Test instruction followed an instruction manual; however, the amount of instruction and feedback varied among subjects, depending on the subject’s ability to understand the procedure.

Cranio-Cervical Flexion Test (CCFT) was performed using a Pressure Biofeedback Unit (Stabilizer; Chattanooga Group, South Pacific), as described by Jull et al. [38]. The subject was asked to perform cranio-cervical flexion in five incremental stages guided by the pressure sensor. The activation score has six scoring options: 20, 22, 24, 26, 28 and 30 mmHg.

Deep Cervical Extensor (DCE) test was performed with the subject lying prone with the head over the edge of the bed. A laser was fixed to the top of the subject’s head and was projected to a target. The duration of time the laser beam was kept within the centre of the target was measured in seconds (sec.).

Range of movement (ROM) was examined using a bubble inclinometer (Baseline Bubble Inclinometer, Fabrication Enterprises Inc, USA) for flexion/extension and lateral flexion, and custom-made equipment for neck rotation (Figure 2). All scores were registered to the nearest degree, except for rotation, which was registered to the nearest 5 degrees.

Figure 2. Device for the measurement of rotation ROM.

Joint Position Error (JPE) was examined following return from active rotation, flexion, and extension movements by measuring the repositioning error. A laser was positioned 1 meter behind the subject, and the beam was projected onto a centimetre ruler attached to a cap which the subjects wore. Data were registered in millimetres (mm).

Gaze stability (GS) was registered during rotation, flexion and extension movements as positive/negative based on the patient’s report of symptoms such as nausea, dizziness or disturbed vision.

Smooth Pursuit Neck Torsion Test (SPNTT) was tested in both a neutral head position and with the trunk rotated 45 degrees, and was registered as positive/negative based on the patient’s report of symptoms such as nausea, dizziness or disturbed vision.

Postural control was measured during one-legged stance (eyes open and eyes closed) using a Wii balance board (Nintendo, Kyoto, Japan) and quantified with the SwayWithWii software program. Data was registered in millimetres (mm).

Pressure Pain Threshold (PPT) was examined at three sites (neck, m. infraspinatus and m. tibialis anterior) using a hand-held algometer (Wagner, FPX algometer, USA) and was registered in kilogram-force (kgf). For further test descriptions, see Table 1 and Additional file 1: Test description.

Table 1 Summary of tests included in the reproducibility and test-retest study


Questionnaires were sent to participants before their first appointment. At the first visit, participants reported NRS (P.P./Week). Examiner C then screened for in- and exclusion criteria, after which SWAY and PPT tests were conducted. Following a rest period of ~2 min., tests for intra- and inter-reliability were performed. Testing order of examiner A and B was randomized for the first test round, and test order was always CCFT, ROM, JPE, GS, SPNTT and DCE. After a rest period of ~2 min. the other examiner performed the same tests in the same order on the same subjects (Figure 3). Duration between the two test occasions was 1–7 days. At the second visit GPE was added, and test order of examiner A and B was reversed. Cases and controls followed the same procedure throughout the testing session.

Figure 3. Test procedure.

Sample size

Sample size was calculated based on the JPE test [23], since larger standard deviations were expected for this test. Sample size was estimated based on the 95% confidence interval according to the recommendation from Hopkins [39]. In a two one-sided test analysis for additive equivalence of paired means with bounds -5 and +5 for the mean difference and a significance level of 0.05, assuming a mean difference of 0, a common standard deviation of 16 and a correlation of 0.9, a sample size of 19 pairs was required to obtain a power of at least 0.8.

Statistical analysis

Data analysis was performed blinded. Summary statistics are based on whole-group mean scores from examiner A and examiner B at the first test occasion. For the test-retest study, whole-group means are based on the first test examination. Mean values from the three repetitions for DCE, JPE, PPT and SWAY were used for analysis of reproducibility.

For calculation of intra- and inter-examiner reproducibility for ratio interval data, ICC (2,1) and Bland and Altman plots with 95% limits of agreement (LOA) were used. Interpretation of ICC was 1.00-0.76 (good to excellent), 0.75-0.41 (fair to good), and 0.40-0.00 (poor) [40]. The minimal detectable change (MDC), the smallest change that is not due to measurement error, was calculated for all parametric tests as 1.96 × √2 × SEM [41]. The standard error of measurement (SEM) was calculated as the standard deviation of the differences between examiner A and examiner B divided by √2 [42].
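The reliability statistics described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' STATA analysis: the function names `icc_2_1` and `agreement_stats` and the example scores are hypothetical, and the ICC is assumed to be the standard two-way random-effects, absolute-agreement, single-measure form of ICC (2,1).

```python
import math
from statistics import mean, stdev


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    `ratings` holds one row per subject; each row holds one score per
    examiner (or per test occasion).
    """
    n = len(ratings)       # number of subjects
    k = len(ratings[0])    # number of examiners / occasions
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Sums of squares from a two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)               # between-subject mean square
    ms_cols = ss_cols / (k - 1)               # between-examiner mean square
    ms_err = ss_err / ((n - 1) * (k - 1))     # residual mean square
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def agreement_stats(scores_a, scores_b):
    """Bias, 95% limits of agreement, SEM and MDC for two paired score lists."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    bias = mean(diffs)
    sd_diff = stdev(diffs)                    # sample SD of the differences
    sem = sd_diff / math.sqrt(2)              # SEM = SD of differences / sqrt(2)
    mdc = 1.96 * math.sqrt(2) * sem           # MDC = 1.96 * sqrt(2) * SEM
    loa = (bias - 1.96 * sd_diff, bias + 1.96 * sd_diff)
    return {"bias": bias, "loa": loa, "sem": sem, "mdc": mdc}


if __name__ == "__main__":
    # Hypothetical rotation ROM scores (degrees) from two examiners.
    examiner_a = [70, 65, 80, 75, 60, 85]
    examiner_b = [72, 60, 78, 80, 62, 83]
    print(icc_2_1([list(pair) for pair in zip(examiner_a, examiner_b)]))
    print(agreement_stats(examiner_a, examiner_b))
```

With SEM defined this way, the MDC reduces to 1.96 times the standard deviation of the differences, which is the half-width of the 95% LOA around the bias. This is why wide LOA and high MDC values appear together.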

For ordinal data, Cohen’s κ statistics with 95% confidence interval were calculated, with the interpretation 1.00-0.81 (almost perfect), 0.61-0.80 (substantial), 0.41-0.60 (moderate), 0.21-0.40 (fair), 0.00-0.20 (slight) and below 0.00 (poor) [43]. Furthermore, observed agreement, prevalence and expected agreement were calculated. Prevalence of the index condition was calculated as (a + (b + c)/2)/n, where a, b, c and d are the cells of the 2 × 2 contingency table and n is the total number of subjects. The prevalence-adjusted-bias-adjusted kappa (PABAK) was calculated for the SPNTT, in which the values in cells a and d from the contingency table are replaced with the mean value of these cells, and the values in cells b and c are replaced with the mean value of these cells [44]. Whole-group results are displayed if there is no systematic bias for cases or controls.
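The agreement statistics for the positive/negative tests can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the authors' analysis: the function name `kappa_stats` and the example counts are hypothetical, cell a is taken as "both examiners positive" and cell d as "both negative", and the prevalence formula is read as (a + (b + c)/2)/n. PABAK is computed as 2 × observed agreement − 1, which gives the same value as replacing a and d, and b and c, with their respective means and recomputing κ.

```python
def kappa_stats(a, b, c, d):
    """Agreement statistics for a 2x2 table of two examiners' ratings.

    a: both positive, d: both negative, b and c: the two discordant cells.
    Raises ZeroDivisionError if expected agreement is exactly 1.
    """
    n = a + b + c + d
    observed = (a + d) / n
    # Chance agreement from the marginal totals of each examiner.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    prevalence = (a + (b + c) / 2) / n   # assumed reading of the prevalence formula
    pabak = 2 * observed - 1             # prevalence- and bias-adjusted kappa
    return {
        "observed": observed,
        "expected": expected,
        "kappa": kappa,
        "prevalence": prevalence,
        "pabak": pabak,
    }


if __name__ == "__main__":
    # Hypothetical counts: few positive tests, as with a low-prevalence finding.
    print(kappa_stats(a=2, b=1, c=1, d=16))
```

In the hypothetical table above, observed agreement is high while κ is clearly lower than PABAK, which is the low-prevalence effect the adjustment is meant to address.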

Construct validity was assessed by correlating each of the clinical tests with NDI, NRS and SF36 using Spearman’s correlation coefficient (rho), due to non-normal distributions. Correlations were interpreted as for the ICC: 1.00-0.76 (strong), 0.75-0.41 (moderate), and 0.40-0.00 (weak). Positive correlation coefficients indicate positive associations, and negative coefficients indicate negative associations. Since Spearman’s rho is known to be difficult to interpret, statistical significance testing was included. Between-group discriminative validity of the clinical tests was evaluated by a t-test and a Mann–Whitney U-test in normally and non-normally distributed data, respectively. Calculations of construct and discriminative validity were based on mean scores from the first test occasion. STATA statistical software was used for all analyses (Stata Corp., 2000, College Station, TX).
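As an illustration of the construct validity analysis, the following Python sketch computes Spearman's rho as the Pearson correlation of average ranks and labels it with the interpretation bands given above. The function names and example values are hypothetical, the significance test is omitted, and the handling of values that fall between the stated bands (for example 0.405) is an assumption.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def interpret(rho):
    """Label the strength of a correlation using the bands quoted in the text."""
    size = abs(rho)
    if size >= 0.76:
        return "strong"
    if size >= 0.41:
        return "moderate"
    return "weak"   # assumption: values just above 0.40 are also labelled weak


if __name__ == "__main__":
    # Hypothetical data: pressure pain threshold (kgf) against NDI score.
    ppt = [3.1, 2.4, 4.0, 1.8, 2.9, 3.5]
    ndi = [12, 20, 10, 25, 18, 11]
    rho = spearman_rho(ppt, ndi)
    print(rho, interpret(rho))
```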


A total of 42 subjects (age: 45.0 ± 15.6 years) were recruited, with 21 in each group. The groups did not differ in demographics (age, gender, height or weight). The patient group (cases) reported higher scores on the NDI (p < 0.01) and BDI (p < 0.01), and lower scores on the PCS (p < 0.01) and Mental Component Score of the SF36 (SF36-MCS) (p < 0.01), compared to controls (Table 2). Eight cases had whiplash-induced neck pain and 13 had idiopathic neck pain. A total of 11 cases and 16 controls had a GPE of 0 and were included for intra-examiner and test-retest reliability. Summary statistics of the clinical tests are shown in Table 3.

Table 2 Self-reported demographic data for cases and controls
Table 3 Summary statistics (mean, median, SD and Range) of clinical and performance tests

Bland Altman plots revealed that differences between examiners did not depend systematically on mean score for any of the tests, but LOA were generally wide. Highest ICC for clinical tests were found for ROM (ICC: 0.80 to 0.93), DCE (0.75 to 0.90) and CCFT (0.63 to 0.86) and lowest for JPE (0.02 to 0.66) (Table 4). Intra- and inter-reliability for GS and SPNTT showed kappa ranging from 0.66 to 0.92, and 0.57 to 0.78 (prevalence adjusted), respectively (Table 5). Overall agreement and κ-values were generally high for GS and SPNTT. PABAK calculation for SPNTT (low prevalence) increased kappa from 0.46 and 0.74, to 0.57 and 0.78 (Table 5). In the test-retest study of performance tests highest reliability was obtained for PPT (ICC: 0.83 to 0.89) compared to SWAY (0.39 to 0.79) (Table 6).

Table 4 Intra- and inter examiner reliability of Cranio-Cervical Flexion Test (CCFT), Deep Cervical Extensor test (DCE), Range of Movement (ROM) and Joint Position Error (JPE)
Table 5 Inter- and intra examiner reliability of Gaze Stability (GS) and Smooth Pursuit Neck Torsion Test (SPNTT)
Table 6 Test-retest reliability of Pressure Pain Threshold (PPT) and Balance/Postural Control (SWAY)

All tests, except for JPE, and to some extent SWAY, correlated significantly with self-reported variables of NRS, NDI and SF36-PCS (Table 7). CCFT, ROM, GS, SPNTT and PPT showed significant between group differences; however, all differences were within the limits of MDC (Table 8).

Table 7 Construct validity of clinical and performance tests
Table 8 Discriminative validity of clinical and performance tests


This study evaluated the reliability of clinical and performance tests commonly applied in the assessment of individuals with neck pain disorders, in a group of people with neck pain and a group of age- and gender-matched asymptomatic volunteers. Bland Altman plots revealed no systematic bias for any of the tests, but LOA were generally wide, with high MDC for most tests, indicating a relatively high degree of inherent variability. Highest ICC values were found for ROM and PPT variables, and lowest for the JPE variables. Overall agreement and κ-values were generally high for GS and SPNTT. All tests, except for JPE, correlated significantly with at least one of the self-reported variables, meaning that poor clinical values correlated with subjective responses of poor conditions. However, the mean differences between cases and controls fell within the respective MDC on all tests.

Cranio-cervical flexion test

Bland Altman plots revealed no systematic differences depending on mean scores. The intra- and inter-examiner reliability for the CCFT was between “fair to good” and “good to excellent” (ICC: 0.63 to 0.86), in line with previous studies [45, 46]. Studies of reliability on the CCFT in asymptomatic subjects have reported slightly higher ICC values, ranging from 0.81 to 0.98 [18, 20, 21]. Studies including both symptomatic and asymptomatic populations usually have higher within-subject and day-to-day variation. The relatively large LOA and MDC (intra-examiner: 2.9 and 3.9 mmHg; inter-examiner: 3.1 and 4.7 mmHg) (Table 4) are a future challenge for intervention studies. An MDC of two target levels (4.0 mmHg) is considered to be insufficient in a test with only five target levels. However, the significant correlation between CCFT and NDI, SF36-PCS and NRS makes the test clinically relevant. The test needs improved psychometric properties for clinical use if implemented as in the present study. However, it should be noted that scoring of the CCFT may also include a measurement of endurance, that is, the number of 10-second holds that the subject can do at their achieved pressure level, to generate a performance index. For example, if a patient can achieve the second level of the test (24 mmHg, an activation score of 4 above the 20 mmHg starting pressure) and perform six 10-second holds with the correct action of cranio-cervical flexion, their performance index is 4 × 6 = 24. The highest activation score is 10 and the highest performance index is 100. Different results may have been achieved with this scoring approach.
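The performance index arithmetic described above can be written out as a small Python sketch. The function names are hypothetical, and two details are inferred from the text rather than stated outright: the activation score is taken as the achieved pressure minus a 20 mmHg starting pressure, and the number of holds is capped at 10, which is what a maximum performance index of 100 implies.

```python
BASELINE_MMHG = 20                           # assumed starting pressure
PRESSURE_LEVELS = (20, 22, 24, 26, 28, 30)   # scoring options listed in the methods
MAX_HOLDS = 10                               # assumed cap implied by a maximum index of 100


def activation_score(pressure_mmhg):
    """Pressure increase above baseline at the highest level achieved."""
    if pressure_mmhg not in PRESSURE_LEVELS:
        raise ValueError(f"pressure must be one of {PRESSURE_LEVELS}")
    return pressure_mmhg - BASELINE_MMHG


def performance_index(pressure_mmhg, holds):
    """Activation score multiplied by the number of correct 10-second holds."""
    if not 0 <= holds <= MAX_HOLDS:
        raise ValueError(f"holds must be between 0 and {MAX_HOLDS}")
    return activation_score(pressure_mmhg) * holds


if __name__ == "__main__":
    # The worked example from the text: 24 mmHg held six times.
    print(performance_index(24, 6))
```

Because the index multiplies two quantities, it spans 0 to 100 rather than the six discrete pressure levels, which is one reason the scoring approach could yield different reliability results.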

Deep cervical extensors

Bland Altman plots revealed no systematic differences depending on mean scores. ICC for the intra- and inter-examiner reliability measures ranged from 0.75 to 0.90 (“good to excellent”). This is the first study to examine the reliability of this test. Although the results are promising, the large LOA and MDC, ranging from 37 to 59 seconds, indicate higher variation than expected from the ICC. The high ICC is probably due to large between subject variability, thus disguising large test-retest differences [47]. Although testing in a cranio-cervical neutral position, as recommended for the deep neck extensors [48, 49], the validity of this test has not been confirmed. Lower scores on the test correlated with higher levels of pain (NRS) and disability (NDI), but not with SF36. The present DCE test needs improved psychometric properties for clinical use.

Range of movement

Bland Altman plots revealed no systematic differences depending on mean scores. ICC measures for the intra- and inter-examiner reliability ranged from 0.80 to 0.93 (“good to excellent”), in line with previous studies using an inclinometer [50–52]. Since “good to excellent” reproducibility was obtained using the custom-made rotation device, future clinical use of this device seems promising. LOA and MDC were large for neck flexion and extension (13 to 21°), reflecting some variation. Significant correlation was found between ROM variables and NDI, SF36-PCS and NRS. This test has satisfactory psychometric properties and can be recommended for clinical use.

Joint position error

Bland Altman plots revealed no systematic differences depending on mean scores. ICC for the intra- and inter-examiner reliability measures ranged from 0.02 (“poor”) to 0.52 (“fair to good”). Previous studies have reported varying results; however, most studies report ICC above 0.75 [23, 46, 53–55]. In this study the laser light was positioned behind the subject whilst in most previous studies the laser light was attached to the head of the subject, which may explain such differences. Other explanations for differences in results could be the equipment used, or the current number of three repetitions performed, since six repetitions are recommended for stable estimates and higher reliability [53]. This was further supported by the fact that other studies using three repetitions have reported ICC similar to or lower than the present study [24, 56]. JPE revealed large LOA and MDC ranging from approximately 7 to 10 mm, and did not reach significant correlations with any of the self-reported outcome measures. As described, this current test cannot be recommended for further use.

Gaze stability

GS κ-values ranged from 0.66 (“substantial”) to 0.92 (“almost perfect”) for intra- and inter-reliability. This is the first study to examine reproducibility of the GS test in a clinical setting without sophisticated equipment. However, the present results are in line with previous studies using advanced equipment, such as wireless 3D sensors with ICC ranging from 0.40 to 0.89 [12, 26]. GS discriminated significantly between cases and controls, in line with other studies [12, 13]. Since the GS test showed significant correlations with NDI, SF36-PCS and NRS, this test has satisfactory psychometric properties and can be recommended for clinical use.

Smooth pursuit neck torsion test

κ ranged from 0.46 (“moderate”) to 0.74 (“substantial”) for intra- and inter-examiner reliability of the SPNTT. No other studies have evaluated the reproducibility of the SPNTT in a clinical setting. The κ-values obtained for SPNTT were lower than expected, probably due to the low prevalence of the condition, known to affect the κ-value [29]. Therefore, PABAK was used to adjust for this, and κ ranging from 0.70 to 0.78 and 0.57 to 0.72 was obtained for intra- and inter-examiner reliability, respectively. The SPNTT was able to discriminate significantly between cases and controls, as also shown in earlier studies of whiplash patients [57–60], while other studies reported no differences [25, 61, 62]. A positive test correlated with higher NDI and NRS scores, and lower SF36-PCS. The SPNTT has satisfactory psychometric properties and can be recommended for clinical use.


Postural control

Bland Altman plots revealed no systematic differences depending on mean scores. Highest ICC values were obtained for the Romberg eyes closed condition with 0.79 (“good to excellent”) for 95% Confidence Ellipse Area (CEA), 0.60 (“fair to good”) for Anterior/Posterior (A/P), 0.69 (“fair to good”) for Medial/Lateral (M/L) and 0.77 (“good to excellent”) for Centre of Pressure path length (COP). All COP variables were above 0.75 (“good to excellent”). These results are in line with one previous study using the Wii balance board in healthy subjects reporting ICC above 0.75 [63]. The present study found no systematic bias, and significant correlations were found between NDI/SF36-PCS/NRS and area/range of displacement in the Romberg eyes closed condition and single-leg stance, but not for COP. Overall, the SWAY test has satisfactory psychometric properties and can be recommended for clinical use.

Pressure pain threshold

Bland Altman plots revealed no systematic differences depending on mean scores. ICC was “good to excellent” for all variables (Tibialis anterior: 0.86, C3-C4: 0.89, Infraspinatus: 0.83), in concordance with previous studies on patients with acute neck pain [28, 64]. MDC was 1.90 kgf (Tibialis Anterior), 0.90 kgf (C3-C4), and 1.65 kgf (Infraspinatus), also in accordance with an earlier study using hand-held algometry [28]. Since significant correlations were found between PPT and NDI/NRS (all sites) and SF36-PCS (C3-C4), this test has satisfactory psychometric properties and can be recommended for clinical use.


A strength of the study is that it followed a standardized protocol, including a training phase, in which examiners were able to standardise and calibrate performance and interpretation of tests. The inclusion of a thorough training phase enabled inexperienced examiners to obtain satisfactory results. Further, by including a case as well as a control group, we demonstrated that both groups could be tested in a clinical manner reliably. Since ICC did not differ between groups, except for a minor tendency in the CCFT for neck pain subjects to have lower ICC, only the pooled data set was presented. Furthermore, primarily using quantifiable variables may have reduced variability which is usually introduced by more subjective estimates e.g. presence of co-contraction, breathing etc. It is unclear whether superior results would have been obtained if more experienced examiners were involved.

A weakness of the study may have been the duration of the test procedure. The entire test procedure had a length of approximately 1.5 hours, possibly imposing a fatigue effect. However, post-hoc analysis of data from the CCFT and DCE tests, comparing mean values for each of the tests in the order they were performed for each of the examiners, revealed that a significant fatigue effect was not present. Additionally, no significant learning effect was evident. The GPE was included in order to control for within-subject change between test occasions. Nevertheless, it cannot be entirely excluded that residual fatigue, not altering the subject’s perception, biased the results on the second test occasion.

Most patients were receiving treatment, and some may have been familiar with the tests. Since one purpose of a reliability study is to examine all aspects of a clinical test, including information, instruction and position of subjects, this familiarity might have biased the data, possibly resulting in higher reliability estimates. It may also have decreased the possible difference between cases and controls, thus resulting in lower estimates for discriminative validity.

Generally, interpretation of the data on discriminative validity must be performed with caution, since the sample size was not powered to estimate this. The general finding of the present study was that although significant differences were found for most variables, they were all within the MDC. From a clinical perspective, this naturally complicates the interpretation of tests, since “positive” findings may thus be attributed to measurement error. Sufficiently powered future studies are therefore needed, especially on discriminative and predictive validity, in addition to responsiveness, and establishment of relevant cut-off points for abnormality by investigating the normal variation in a healthy population.


The majority of the examined clinical and performance tests were reliable and showed satisfactory construct validity. Although examiners were inexperienced with the tests, this standardised protocol showed that, with training, high reliability measures were obtained for most tests. Wide LOA and high MDC values were found, indicating a relatively high degree of inherent variability. All tests, except for JPE, correlated with variables such as NRS, NDI and SF36-PCS. Although several tests differed significantly between groups, none of the between-group differences exceeded the respective MDC. Future challenges are to test the discriminative and predictive validity, in addition to responsiveness, of each test in different patient populations.


  1. National Institute of Public Health: Folkesundhedsrapporten. 2007, Denmark

  2. Huisstede BM, Wijnhoven HA, Bierma-Zeinstra SM, Koes BW, Verhaar JA, Picavet S: Prevalence and characteristics of complaints of the arm, neck, and/or shoulder (CANS) in the open population. Clin J Pain. 2008, 24 (3): 253-259. 10.1097/AJP.0b013e318160a8b4.

  3. Picavet HS, Hazes JM: Prevalence of self reported musculoskeletal diseases is high. Ann Rheum Dis. 2003, 62 (7): 644-650. 10.1136/ard.62.7.644.

  4. Fejer R, Kyvik KO, Hartvigsen J: The prevalence of neck pain in the world population: a systematic critical review of the literature. Eur Spine J. 2006, 15 (6): 834-848. 10.1007/s00586-004-0864-4.

  5. Jull G, Kristjansson E, Dall’Alba P: Impairment in the cervical flexors: a comparison of whiplash and insidious onset neck pain patients. Man Ther. 2004, 9 (2): 89-94. 10.1016/S1356-689X(03)00086-9.

  6. Falla DL, Jull GA, Hodges PW: Patients with neck pain demonstrate reduced electromyographic activity of the deep cervical flexor muscles during performance of the craniocervical flexion test. Spine (Phila Pa 1976). 2004, 29 (19): 2108-2114. 10.1097/01.brs.0000141170.89317.0e.

  7. Elliott J, Jull G, Noteboom JT, Darnell R, Galloway G, Gibbon WW: Fatty infiltration in the cervical extensor muscles in persistent whiplash-associated disorders: a magnetic resonance imaging analysis. Spine (Phila Pa 1976). 2006, 31 (22): 847-855. 10.1097/01.brs.0000240841.07050.34.

    Article  Google Scholar 

  8. Schomacher J, Farina D, Lindstroem R, Falla D: Chronic trauma-induced neck pain impairs the neural control of the deep semispinalis cervicis muscle. Clin Neurophysiol. 2012, 123 (7): 1403-1408. 10.1016/j.clinph.2011.11.033.

    Article  PubMed  Google Scholar 

  9. Woodhouse A, Vasseljen O: Altered motor control patterns in whiplash and chronic neck pain. BMC Musculoskelet Disord. 2008, 9: 90-10.1186/1471-2474-9-90.

    Article  PubMed  PubMed Central  Google Scholar 

  10. Treleaven J, Jull G, Sterling M: Dizziness and unsteadiness following whiplash injury: characteristic features and relationship with cervical joint position error. J Rehabil Med. 2003, 35 (1): 36-43. 10.1080/16501970306109.

    Article  PubMed  Google Scholar 

  11. Feipel V, Salvia P, Klein H, Rooze M: Head repositioning accuracy in patients with whiplash-associated disorders. Spine (Phila Pa 1976). 2006, 31 (2): 51-58. 10.1097/01.brs.0000194786.63690.54.

    Article  Google Scholar 

  12. Treleaven J, Jull G, Grip H: Head eye co-ordination and gaze stability in subjects with persistent whiplash associated disorders. Man Ther. 2011, 16 (3): 252-257. 10.1016/j.math.2010.11.002.

    Article  PubMed  Google Scholar 

  13. Grip H, Sundelin G, Gerdle B, Karlsson JS: Variations in the axis of motion during head repositioning--a comparison of subjects with whiplash-associated disorders or non-specific neck pain and healthy controls. Clin Biomech (Bristol, Avon). 2007, 22 (8): 865-873. 10.1016/j.clinbiomech.2007.05.008.

    Article  Google Scholar 

  14. Sjostrom H, Allum JH, Carpenter MG, Adkin AL, Honegger F, Ettlin T: Trunk sway measures of postural stability during clinical balance tests in patients with chronic whiplash injury symptoms. Spine (Phila Pa 1976). 2003, 28 (15): 1725-1734.

    Google Scholar 

  15. Madeleine P, Nielsen M, Arendt-Nielsen L: Characterization of postural control deficit in whiplash patients by means of linear and nonlinear analyses - A pilot study. J Electromyogr Kinesiol. 2011, 21 (2): 291-297. 10.1016/j.jelekin.2010.05.006.

    Article  PubMed  Google Scholar 

  16. Scott D, Jull G, Sterling M: Widespread sensory hypersensitivity is a feature of chronic whiplash-associated disorder but not chronic idiopathic neck pain. Clin J Pain. 2005, 21 (2): 175-181. 10.1097/00002508-200503000-00009.

    Article  PubMed  Google Scholar 

  17. Sterling M, Hodkinson E, Pettiford C, Souvlis T, Curatolo M: Psychologic factors are related to some sensory pain thresholds but not nociceptive flexion reflex threshold in chronic whiplash. Clin J Pain. 2008, 24 (2): 124-130. 10.1097/AJP.0b013e31815ca293.

    Article  PubMed  Google Scholar 

  18. James G, Doe T: The craniocervical flexion test: intra-tester reliability in asymptomatic subjects. Physiother Res Int. 2010, 15 (3): 144-149. 10.1002/pri.456.

    Article  PubMed  Google Scholar 

  19. Chiu TT, Law EY, Chiu TH: Performance of the craniocervical flexion test in subjects with and without chronic neck pain. J Orthop Sports Phys Ther. 2005, 35 (9): 567-571. 10.2519/jospt.2005.35.9.567.

    Article  PubMed  Google Scholar 

  20. Arumugam A, Mani R, Raja K: Interrater reliability of the craniocervical flexion test in asymptomatic individuals–a cross-sectional study. J Manipulative Physiol Ther. 2011, 34 (4): 247-253. 10.1016/j.jmpt.2011.04.011.

    Article  PubMed  Google Scholar 

  21. Jull G, Barrett C, Magee R, Ho P: Further clinical clarification of the muscle dysfunction in cervical headache. Cephalalgia. 1999, 19 (3): 179-185. 10.1046/j.1468-2982.1999.1903179.x.

    Article  CAS  PubMed  Google Scholar 

  22. de Koning CH, van den Heuvel SP, Staal JB, Smits-Engelsman BC, Hendriks EJ: Clinimetric evaluation of methods to measure muscle functioning in patients with non-specific neck pain: a systematic review. BMC Musculoskelet Disord. 2008, 9: 142-10.1186/1471-2474-9-142.

    Article  PubMed  PubMed Central  Google Scholar 

  23. Loudon JK, Ruhl M, Field E: Ability to reproduce head position after whiplash injury. Spine (Phila Pa 1976). 1997, 22 (8): 865-868. 10.1097/00007632-199704150-00008.

    Article  CAS  Google Scholar 

  24. Strimpakos N, Sakellari V, Gioftsos G, Kapreli E, Oldham J: Cervical joint position sense: an intra- and inter-examiner reliability study. Gait Posture. 2006, 23 (1): 22-31. 10.1016/j.gaitpost.2004.11.019.

    Article  PubMed  Google Scholar 

  25. Kongsted A, Jorgensen LV, Bendix T, Korsholm L, Leboeuf-Yde C: Are smooth pursuit eye movements altered in chronic whiplash-associated disorders? A cross-sectional study. Clin Rehabil. 2007, 21 (11): 1038-1049. 10.1177/0269215507075519.

    Article  CAS  PubMed  Google Scholar 

  26. Grip H, Jull G, Treleaven J: Head eye co-ordination using simultaneous measurement of eye in head and head in space movements: potential for use in subjects with a whiplash injury. J Clin Monit Comput. 2009, 23 (1): 31-40. 10.1007/s10877-009-9160-5.

    Article  PubMed  Google Scholar 

  27. Clark RA, McGough R, Paterson K: Reliability of an inexpensive and portable dynamic weight bearing asymmetry assessment system incorporating dual Nintendo Wii Balance Boards. Gait Posture. 2011, 34 (2): 288-291. 10.1016/j.gaitpost.2011.04.010.

    Article  PubMed  Google Scholar 

  28. Walton DM, Macdermid JC, Nielson W, Teasell RW, Chiasson M, Brown L: Reliability, standard error, and minimum detectable change of clinical pressure pain threshold testing in people with and without acute neck pain. J Orthop Sports Phys Ther. 2011, 41 (9): 644-650. 10.2519/jospt.2011.3666.

    Article  PubMed  Google Scholar 

  29. FIMM Academy of Manual/Musculoskeletal Medicine: Protocol Format for Diagnostic Procedures in Manual/Musculoskeletal Medicine. 2007, []

    Google Scholar 

  30. Enoch F, Kjaer P, Elkjaer A, Remvig L, Juul-Kristensen B: Inter-examiner reproducibility of tests for lumbar motor control. BMC Musculoskelet Disord. 2011, 12: 114-10.1186/1471-2474-12-114.

    Article  PubMed  PubMed Central  Google Scholar 

  31. Rubinstein SM, Pool JJ, van Tulder MW, Riphagen II, de Vet HC: A systematic review of the diagnostic accuracy of provocative tests of the neck for diagnosing cervical radiculopathy. Eur Spine J. 2007, 16 (3): 307-319. 10.1007/s00586-006-0225-6.

    Article  PubMed  Google Scholar 

  32. Carter CL, Dacey CM: Validity of the Beck Depression Inventory, MMPI, and Rorschach in assessing adolescent depression. J Adolesc. 1996, 19 (3): 223-231. 10.1006/jado.1996.0021.

    Article  PubMed  Google Scholar 

  33. Vernon H, Mior S: The Neck Disability Index: a study of reliability and validity. J Manipulative Physiol Ther. 1991, 14 (7): 409-415.

    CAS  PubMed  Google Scholar 

  34. Bjorner JB, Damsgaard MT, Watt T, Groenvold M: Tests of data quality, scaling assumptions, and reliability of the Danish SF-36. J Clin Epidemiol. 1998, 51 (11): 1001-1011. 10.1016/S0895-4356(98)00092-4.

    Article  CAS  PubMed  Google Scholar 

  35. McCarthy MJ, Grevitt MP, Silcocks P, Hobbs G: The reliability of the Vernon and Mior neck disability index, and its validity compared with the short form-36 health survey questionnaire. Eur Spine J. 2007, 16 (12): 2111-2117. 10.1007/s00586-007-0503-y.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  36. Cleland JA, Childs JD, Whitman JM: Psychometric properties of the Neck Disability Index and Numeric Pain Rating Scale in patients with mechanical neck pain. Arch Phys Med Rehabil. 2008, 89 (1): 69-74. 10.1016/j.apmr.2007.08.126.

    Article  PubMed  Google Scholar 

  37. Kamper SJ, Ostelo RW, Knol DL, Maher CG, de Vet HC, Hancock MJ: Global Perceived Effect scales provided reliable assessments of health transition in people with musculoskeletal disorders, but ratings are strongly influenced by current status. J Clin Epidemiol. 2010, 63 (7): 760-766. 10.1016/j.jclinepi.2009.09.009.

    Article  PubMed  Google Scholar 

  38. Jull GA, O’Leary SP, Falla DL: Clinical assessment of the deep cervical flexor muscles: the craniocervical flexion test. J Manipulative Physiol Ther. 2008, 31 (7): 525-533. 10.1016/j.jmpt.2008.08.003.

    Article  PubMed  Google Scholar 

  39. Hopkins WG: Measures of reliability in sports medicine and science. Sports Med. 2000, 30 (1): 1-15. 10.2165/00007256-200030010-00001.

    Article  CAS  PubMed  Google Scholar 

  40. Fleiss J: Reliability of Measurement. The Design and Analysis of Clinical Experiments. 1986, New York: John Wiley & Sons, 1

    Google Scholar 

  41. Kovacs FM, Abraira V, Royuela A, Corcoll J, Alegre L, Tomas M, Mir MA, Cano A, Muriel A, Zamora J, Del Real MT, Gestoso M, Mufraggi N, Spanish Back Pain Research Network: Minimum detectable and minimal clinically important changes for pain in patients with nonspecific neck pain. BMC Musculoskelet Disord. 2008, 9: 43-10.1186/1471-2474-9-43.

    Article  PubMed  PubMed Central  Google Scholar 

  42. de Vet HC, Terwee CB, Knol DL, Bouter LM: When to use agreement versus reliability measures. J Clin Epidemiol. 2006, 59 (10): 1033-1039. 10.1016/j.jclinepi.2005.10.015.

    Article  PubMed  Google Scholar 

  43. Landis JR, Koch GG: The measurement of observer agreement for categorical data. Biometrics. 1977, 33 (1): 159-174. 10.2307/2529310.

    Article  CAS  PubMed  Google Scholar 

  44. Sim J, Wright CC: The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Phys Ther. 2005, 85 (3): 257-268.

    PubMed  Google Scholar 

  45. Hudswell S, Von Mengersen M, Lucas N: The cranio-cervical flexion test using pressure biofeedback: A useful measure of cervical dysfunction in the clinical setting?. Int J Osteopath Med. 2005, 8: 98-105. 10.1016/j.ijosm.2005.07.003.

    Article  Google Scholar 

  46. Juul T, Langberg H, Enoch F, Sogaard K: The intra- and inter-rater reliability of five clinical muscle performance tests in patients with and without neck pain. BMC Musculoskelet Disord. 2013, 14: 339-10.1186/1471-2474-14-339.

    Article  PubMed  PubMed Central  Google Scholar 

  47. Bland JM, Altman DG: Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986, 1 (8476): 307-310.

    Article  CAS  PubMed  Google Scholar 

  48. Elliott JM, O’Leary SP, Cagnie B, Durbridge G, Danneels L, Jull G: Craniocervical orientation affects muscle activation when exercising the cervical extensors in healthy subjects. Arch Phys Med Rehabil. 2010, 91 (9): 1418-1422. 10.1016/j.apmr.2010.05.014.

    Article  PubMed  Google Scholar 

  49. O’Leary S, Cagnie B, Reeve A, Jull G, Elliott JM: Is there altered activity of the extensor muscles in chronic mechanical neck pain? A functional magnetic resonance imaging study. Arch Phys Med Rehabil. 2011, 92 (6): 929-934. 10.1016/j.apmr.2010.12.021.

    Article  PubMed  Google Scholar 

  50. Cleland JA, Childs JD, Fritz JM, Whitman JM: Interrater reliability of the history and physical examination in patients with mechanical neck pain. Arch Phys Med Rehabil. 2006, 87 (10): 1388-1395. 10.1016/j.apmr.2006.06.011.

    Article  PubMed  Google Scholar 

  51. Hole DE, Cook JM, Bolton JE: Reliability and concurrent validity of two instruments for measuring cervical range of motion: effects of age and gender. Man Ther. 1995, 1 (1): 36-42. 10.1054/math.1995.0248.

    Article  CAS  PubMed  Google Scholar 

  52. Williams MA, McCarthy CJ, Chorti A, Cooke MW, Gates S: A systematic review of reliability and validity studies of methods for measuring active and passive cervical range of motion. J Manipulative Physiol Ther. 2010, 33 (2): 138-155. 10.1016/j.jmpt.2009.12.009.

    Article  PubMed  Google Scholar 

  53. Swait G, Rushton AB, Miall RC, Newell D: Evaluation of cervical proprioceptive function: optimizing protocols and comparison between tests in normal subjects. Spine (Phila Pa 1976). 2007, 32 (24): 692-701. 10.1097/BRS.0b013e31815a5a1b.

    Article  Google Scholar 

  54. Pinsault N, Fleury A, Virone G, Bouvier B, Vaillant J, Vuillerme N: Test-retest reliability of cervicocephalic relocation test to neutral head position. Physiother Theory Pract. 2008, 24 (5): 380-391. 10.1080/09593980701884824.

    Article  PubMed  Google Scholar 

  55. Kristjansson E, Dall’Alba P, Jull G: Cervicocephalic kinaesthesia: reliability of a new test approach. Physiother Res Int. 2001, 6 (4): 224-235. 10.1002/pri.230.

    Article  CAS  PubMed  Google Scholar 

  56. Lee HY, Teng CC, Chai HM, Wang SF: Test-retest reliability of cervicocephalic kinesthetic sensibility in three cardinal planes. Man Ther. 2006, 11 (1): 61-68. 10.1016/j.math.2005.03.008.

    Article  PubMed  Google Scholar 

  57. Treleaven J, Jull G, LowChoy N: Smooth pursuit neck torsion test in whiplash-associated disorders: relationship to self-reports of neck pain and disability, dizziness and anxiety. J Rehabil Med. 2005, 37 (4): 219-223.

    Article  PubMed  Google Scholar 

  58. Treleaven J, LowChoy N, Darnell R, Panizza B, Brown-Rothwell D, Jull G: Comparison of sensorimotor disturbance between subjects with persistent whiplash-associated disorder and subjects with vestibular pathology associated with acoustic neuroma. Arch Phys Med Rehabil. 2008, 89 (3): 522-530. 10.1016/j.apmr.2007.11.002.

    Article  PubMed  Google Scholar 

  59. Heikkila HV, Wenngren BI: Cervicocephalic kinesthetic sensibility, active range of cervical motion, and oculomotor function in patients with whiplash injury. Arch Phys Med Rehabil. 1998, 79 (9): 1089-1094. 10.1016/S0003-9993(98)90176-9.

    Article  CAS  PubMed  Google Scholar 

  60. Montfoort I, Van Der Geest JN, Slijper HP, De Zeeuw CI, Frens MA: Adaptation of the cervico- and vestibulo-ocular reflex in whiplash injury patients. J Neurotrauma. 2008, 25 (6): 687-693. 10.1089/neu.2007.0314.

    Article  PubMed  Google Scholar 

  61. Prushansky T, Dvir Z, Pevzner E, Gordon CR: Electro-oculographic measures in patients with chronic whiplash and healthy subjects: a comparative study. J Neurol Neurosurg Psychiatry. 2004, 75 (11): 1642-1644. 10.1136/jnnp.2003.031278.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  62. Dispenza F, Gargano R, Mathur N, Saraniti C, Gallina S: Analysis of visually guided eye movements in subjects after whiplash injury. Auris Nasus Larynx. 2011, 38 (2): 185-189. 10.1016/j.anl.2010.08.007.

    Article  PubMed  Google Scholar 

  63. Clark RA, Bryant AL, Pua Y, McCrory P, Bennell K, Hunt M: Validity and reliability of the Nintendo Wii Balance Board for assessment of standing balance. Gait Posture. 2010, 31 (3): 307-310. 10.1016/j.gaitpost.2009.11.012.

    Article  PubMed  Google Scholar 

  64. Persson AL, Brogardh C, Sjolund BH: Tender or not tender: test-retest repeatability of pressure pain thresholds in the trapezius and deltoid muscles of healthy women. J Rehabil Med. 2004, 36 (1): 17-27. 10.1080/16501970310015218.

    Article  PubMed  Google Scholar 


The authors wish to thank the outpatient clinics, students and patients for their participation.

Author information

Corresponding author

Correspondence to René Jørgensen.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

RJ was involved with the planning of the study and acquisition of data, the data analysis, the writing and revision of the paper. BJK, IR and DF were involved in the planning, methodological considerations, analysis of data and revision of the paper. All authors read and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Jørgensen, R., Ris, I., Falla, D. et al. Reliability, construct and discriminative validity of clinical testing in subjects with and without chronic neck pain. BMC Musculoskelet Disord 15, 408 (2014).


  • Neck pain
  • Reliability
  • Validity assessment