Inter- and intra-rater reliability for measurement of range of motion in joints included in three hypermobility assessment methods

Background: Comparisons across studies of generalized joint hypermobility (GJH) are often difficult because several classification methods exist and the tests are performed in methodologically different ways. The Beighton score is the most commonly used method and has been tested for inter- and intra-rater reliability. The Contompasis score and the Hospital del Mar criteria have not yet been evaluated for reliability. The aim of this study was to investigate the inter- and intra-rater reliability for measurements of range of motion in joints included in these three hypermobility assessment methods using a structured protocol.
Methods: The study was planned in accordance with guidelines for reporting reliability studies. Healthy adults were consecutively recruited (49 for inter- and 29 for intra-rater assessments). Intra-class correlations, two-way random effects model (ICC 2.1), with 95% confidence intervals, standard error of measurement, percentage of agreement, Cohen’s kappa (κ) and prevalence-adjusted bias-adjusted kappa were calculated for single-joint measurements in degrees and for total scores.
Results: The inter- and intra-rater reliability for total scores was ICC 2.1: 0.72–0.82 and 0.76–0.86, and for single-joint measurements in degrees 0.44–0.91 and 0.44–0.90, respectively. The difference between ratings was within 5 degrees in all but one joint. Standard error of measurement ranged from 1.0 to 6.9 degrees. For prevalence of positive hypermobility findings, the inter- and intra-rater Cohen’s κ was 0.54–0.78 and 0.27–0.78 for total scores and 0.21–1.00 and 0.19–1.00 for single joints, respectively. The prevalence-adjusted bias-adjusted kappa was higher than Cohen’s κ for all but two values.
Conclusions: Following a structured protocol, the inter- and intra-rater reliability was good-to-excellent for total scores and in all but two single joints, measured in degrees. The inter- and intra-rater reliability for prevalence of positive hypermobility findings was fair-to-almost perfect for total scores and slight-to-almost perfect in single joints. By using a structured protocol, we attempted to standardize the assessment of range of motion in clinical and research settings. This standardization could be a helpful first step in the process of standardizing the tests, reducing the risk that the assessment of GJH is based on chance.
Electronic supplementary material: The online version of this article (10.1186/s12891-018-2290-5) contains supplementary material, which is available to authorized users.


Background
Generalized joint hypermobility (GJH), defined as an increased range of motion (ROM) in several joints [1], is associated with longstanding musculoskeletal problems [2]. Many people with GJH seek primary care for pain and activity limitations [3,4].
Joint ROM varies greatly in the general population [5,6] and a joint ROM above two standard deviations from the average is suggested to be hypermobile [7]. The prevalence of GJH varies across gender, age, ethnicity and according to assessment methods and their cut-off points [8]. In Sweden, GJH is estimated to be present in approximately 10% of the general population [9].
Although GJH is an important criterion in the diagnosis of many heritable connective tissue disorders [3,5], no agreed criteria exist [5,10,11]. Furthermore, which joints to include in diagnosing GJH has been debated [12]. The Beighton score (BeS) [13], which is a development of the Carter and Wilkinson score [14], is the most common diagnostic test for GJH worldwide [8,15]. The BeS demonstrates good inter- and intra-rater reliability [15-17], but the evidence is conflicting and the studies have methodological flaws [18]. Advantageously, the BeS is quick and easy to perform. However, the BeS only covers five joints, particularly hinge joints, and is an "all-or-none" test with no indication of the degree of hypermobility [13]. Commonly used cut-off levels in the BeS vary between ≥4 and ≥5 for diagnosing GJH in adults [18].
Another assessment method is the Contompasis score (CS), a modification of the BeS which includes one additional joint. The CS is measured by grading the ROM and might be considered more time-consuming [19]. Furthermore, the Hospital del Mar criteria (HdM), which are a development of the Rotés-Querol criteria, offer a wider view of joint mobility by assessing nine joints, including ball-and-socket joints [12]. To the best of our knowledge, the inter- and intra-rater reliability of the CS and the HdM scores has not yet been evaluated.
Comparisons across studies of GJH assessments are hampered because a structured protocol is often lacking [20-23]. Neither the literature nor the criteria for diagnosis of GJH [3] and heritable connective tissue disorders [24] describe the test performances in detail [10,18]. Although ROM measured in degrees using a goniometer has shown better inter-rater reliability, assessment of GJH is often based on visual assessment [15,17,25] with a dichotomous principle of judgement. The reliability is also affected by the joint structure and by the level of pre-training and experience among the raters [26].
To identify people with GJH and subsequently tailor suitable interventions, reliable clinical assessment methods are important. Thus, there is a need for international consensus regarding performance, cut-off levels and interpretation of clinical assessments based on reliability studies of high quality [11,18] to reduce the likelihood that the assessment of GJH is based on chance. Before deciding on the validity of these tests the reliability needs to be investigated in a standardized manner [18].
The aim of this study was to investigate the inter- and intra-rater reliability for measurements of ROM in joints included in three hypermobility assessment methods using a structured protocol.

Design
An inter- and intra-rater reliability study.
This study was planned and developed in accordance with "Guidelines for Reporting Reliability and Agreement Studies" (GRRAS) and "Quality Appraisal of Reliability Studies" (QAREL) [27,28].

Structured protocol and instruments
This study assessed the inter- and intra-rater reliability of three hypermobility assessment methods, the BeS, the CS and the HdM, for measuring joint ROM using a test-retest design which comprised 12 single joints in total. A protocol was developed to standardize the measurement of joint ROM (Additional file 1), expanding on the original versions of the BeS, the CS and the HdM (Additional file 2). Starting position, positioning of the goniometer, anatomical landmarks, stabilization of adjacent structures and performance, using active or passive movement, were described and illustrated using photographs in the new protocol.
The BeS [13] comprises assessments of five joints: passive dorsiflexion of the fifth finger metacarpophalangeal joint, passive apposition of the thumb, passive hyperextension of the elbow and knee, and forward flexion of the trunk. The first four joints are assessed bilaterally, yielding a total score ranging from 0 to 9 [13]. BeS scores of ≥4 and ≥5 points were used as cut-off levels for GJH.
The CS [19] comprises the assessment of six joints and is similar to the BeS but with one additional test, the foot flexibility test. Five joints are assessed bilaterally, with each joint graded from two to six or eight points, giving a total score ranging from 22 to 72 [19]. A cut-off level of ≥30 points for the CS was used to define GJH. The CS scoring was modified because ROM in degrees for the elbow, knee and fifth finger was insufficiently graded and some degrees were represented in two score levels in the original description (Additional file 2).
The HdM [12] comprises the assessment of 10 items: passive apposition of the thumb, passive dorsiflexion of the fifth finger, passive hyperextension of the elbow, external shoulder rotation, hip abduction, patella hypermobility, ankle and foot hypermobility, first metatarsophalangeal joint, knee hyperflexion and easy bruising. Nine joints are assessed unilaterally on the non-dominant side. The last item deals with bruising: "Do you get bruises easily after minimal trauma?" Each hypermobile item scores one point, yielding a total score ranging from 0 to 10. HdM scores of ≥4 and ≥5 were set as cut-off levels for GJH [12]. The HdM measurement was modified by measuring passive apposition of the thumb with a goniometer instead of a ruler, where < 15 degrees on the goniometer corresponds to < 21 mm on the ruler as used in the original description. Due to the lack of a reference value regarding a positive hypermobility finding for the ankle and the patella, ≥45 degrees was considered hypermobile for the ankle [29,30]. In addition, the measurement of the patella was standardized to make objective assessment possible (Additional file 1).
A goniometer (Medema Brodin, Kista, Sweden, 31 cm or 21 cm with a 180° protractor and movable arms) was used. The small goniometer was used for measurements of the fifth finger and the big toe. Each joint measurement was registered to the nearest degree.
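The scoring rule described above (four tests scored on each side plus one trunk test, giving 0 to 9 points) can be illustrated with a minimal Python sketch. The function names, the test labels and the example findings are hypothetical and are not part of the study protocol; the sketch only assumes that each test has already been judged positive or negative.

```python
# Hypothetical labels for the four bilaterally assessed BeS tests.
BILATERAL_TESTS = ("fifth_finger", "thumb", "elbow", "knee")


def beighton_total(bilateral, trunk):
    """Sum positive findings: one point per side per bilateral test, plus one for the trunk.

    bilateral: dict mapping each test label to a (left_positive, right_positive) pair.
    trunk: True if forward flexion of the trunk is judged positive.
    """
    total = 0
    for test in BILATERAL_TESTS:
        left, right = bilateral[test]
        total += int(bool(left)) + int(bool(right))
    return total + int(bool(trunk))


def gjh_by_cutoffs(total, cutoffs=(4, 5)):
    """Classify GJH at each cut-off level (>= 4 and >= 5 points were used in this study)."""
    return {cutoff: total >= cutoff for cutoff in cutoffs}


# Illustrative findings for one fictional participant (not study data).
findings = {
    "fifth_finger": (True, True),
    "thumb": (True, False),
    "elbow": (False, False),
    "knee": (False, True),
}
score = beighton_total(findings, trunk=False)
print(score, gjh_by_cutoffs(score))
```

Reporting the classification at both cut-off levels side by side makes it visible when a participant is classified differently depending on the cut-off chosen.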

Raters
Two physiotherapists, rater A (KA) and rater B (AS) assessed all of the participants. Both raters had clinical experience in the physical examination of patients with joint hypermobility attending primary care (27 and 24 years of experience respectively). To standardize the performance and to assure similar interpretations of the assessments, the two raters trained un-blinded on three occasions until consensus was reached, for a total of 24 h, before data collection. The training cohort included 21 persons.

Participants
Information regarding the study was sent by e-mail to all 250 employees in a rehabilitation company within primary care in Stockholm, Sweden. The inclusion criteria were men and women aged between 18 and 65 years. For the inter-rater reliability study, we recruited the first 50 consecutive individuals who agreed to participate and who met the inclusion criteria. Of these, the first 30 participants were included in the intra-rater reliability study. Individuals with joint inflammatory signs, spasticity, joint replacement or musculoskeletal injuries during the past 3 months, and those who were not fluent in the Swedish language, were excluded.

Procedures
Self-reported sociodemographic data concerning gender, age and country of birth were obtained using a questionnaire. The raters examined the participants in separate examination rooms without the presence of other employees. The participants wore shorts and tank tops. No warming-up sessions were done before assessments. Reference dots were marked by the assessing rater on anatomical landmarks (Additional file 1), and were removed after each assessment session. The rater started each assessment with both oral and visual instructions about how the test would be performed.
The rater instructed the participant to stop the passive movement when they felt that the joint was at an end-range position. The rater then examined whether it was possible to move the joint further without causing pain. In measurement of active ROM, the participant was asked: "Is this your maximum ROM?" For inter-rater reliability, the raters assessed the same participant with a minimum of 30 min and a maximum of 7 h between assessments. The raters were blinded with respect to each other's results. To avoid recall bias in the intra-rater reliability study, rater B conducted the repeated assessments 7 to 14 days after the first assessment. The second assessment was performed at the same time of the day as the first.
A timetable assured that the time intervals between assessments were achieved and that the order in which raters assessed the participants varied. The order of the joint assessments changed every third assessment day for both inter- and intra-rater reliability examinations by starting from the end of the protocol (Additional file 1).

Statistical analysis
Statistical analysis was conducted with R 3.3.1 (The R Project for Statistical Computing, Vienna, Austria). Intra-class correlations, two-way random effects model, ICC (2.1) with 95% confidence intervals (CI), were used to measure the inter- and intra-rater reliability for the quantitative measurements of joint ROM (degrees) and total scores of the three hypermobility assessment methods [31]. The two-way models allow the error to be partitioned between systematic and random error [31,32]. The ICC specific to the total score of the hypermobility assessment methods was used as the majority of these values were based on measured degrees. An ICC score of < 0.40 was considered poor, 0.40-0.59 = fair/moderate, 0.60-0.74 = good and ≥ 0.75 = excellent [32]. The standard error of measurement (SEM) quantifies absolute reliability [33] and is referred to as the "typical" error [34]. The SEM was calculated using the residual mean square error from two-way repeated measures ANOVA. The SEM is important since a smaller SEM indicates more reliable results [33]. The value of an accepted SEM is a clinical decision.
For binary variables, the total percentage of agreement (Pa) for prevalence of positive findings was calculated. To assess the proportion of agreement beyond that expected by chance, Cohen's kappa (κ) was used [35]. A kappa value of κ < 0.00 is considered poor, 0.00-0.20 = slight, 0.21-0.40 = fair, 0.41-0.60 = moderate, 0.61-0.80 = substantial and ≥ 0.81 = almost perfect [36]. Since prevalence and bias affect the magnitude of the kappa coefficient, the prevalence-adjusted bias-adjusted kappa (PABAK) was calculated in addition to the obtained value of kappa [35].
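As an illustration of the quantities described above, the following Python sketch computes a two-way random effects, single-measurement ICC and an SEM from the mean squares of a two-way ANOVA. It is not the analysis code used in this study (which was run in R); the function name and the example degrees are hypothetical, the ICC follows the commonly cited Shrout and Fleiss formula for this model, and the SEM is taken as the square root of the residual mean square. Confidence intervals are not computed.

```python
from math import sqrt


def icc_2_1_and_sem(ratings):
    """Return (ICC, SEM) for a table with one row per participant and one column per rating.

    Columns are raters (inter-rater) or sessions (intra-rater). Assumes at least two rows,
    at least two columns, no missing values and some variation in the data.
    """
    n = len(ratings)        # participants
    k = len(ratings[0])     # raters or sessions
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares (one observation per cell).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = max(ss_total - ss_rows - ss_cols, 0.0)  # guard against rounding below zero

    ms_rows = ss_rows / (n - 1)                # between participants
    ms_cols = ss_cols / (k - 1)                # between raters: systematic error
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual: random error

    icc = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    sem = sqrt(ms_error)
    return icc, sem


# Illustrative degrees for five fictional participants measured by two raters.
rater_a = [12, 8, 15, 3, 10]
rater_b = [10, 9, 17, 2, 12]
icc, sem = icc_2_1_and_sem([list(pair) for pair in zip(rater_a, rater_b)])
print(f"ICC = {icc:.2f}, SEM = {sem:.1f} degrees")
```

Because the between-rater mean square enters the denominator, a systematic difference between raters lowers this ICC, whereas the SEM as defined here reflects only the residual (random) error.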
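The agreement statistics for binary findings can be sketched in Python as follows. This is a minimal illustration rather than the code used in the study: the function name and the example ratings are hypothetical, kappa is computed as observed agreement corrected for the agreement expected from each rater's marginal proportions, and PABAK for two categories is computed as twice the observed agreement minus one.

```python
def agreement_stats(rater_1, rater_2):
    """Return (percentage agreement, Cohen's kappa, PABAK) for two lists of 0/1 findings."""
    if len(rater_1) != len(rater_2) or not rater_1:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_1)

    # Observed proportion of agreement.
    p_obs = sum(1 for a, b in zip(rater_1, rater_2) if a == b) / n

    # Agreement expected by chance from each rater's proportion of positive findings.
    p1_pos = sum(rater_1) / n
    p2_pos = sum(rater_2) / n
    p_exp = p1_pos * p2_pos + (1 - p1_pos) * (1 - p2_pos)

    # Kappa is undefined when chance agreement is 1 (both raters give one and the same category).
    kappa = float("nan") if p_exp == 1 else (p_obs - p_exp) / (1 - p_exp)

    # PABAK for two categories depends only on the observed agreement.
    pabak = 2 * p_obs - 1
    return 100 * p_obs, kappa, pabak


# Illustrative positive (1) and negative (0) findings for ten fictional participants.
rater_a = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
rater_b = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
pa, kappa, pabak = agreement_stats(rater_a, rater_b)
print(f"Pa = {pa:.0f}%, kappa = {kappa:.2f}, PABAK = {pabak:.2f}")
```

When positive findings are rare, the chance-expected agreement is high and kappa can be low despite a high percentage of agreement; PABAK removes this dependence on prevalence and on differences between the raters' marginal proportions.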
With a significance level at 0.05 and a power of 80%, the sample size in this study was based upon an ICC score of at least 0.82 where a score of 0.6 or higher would be acceptable [37].

Results
Forty-nine adults, 38 women and 11 men, mean (SD) age 39.8 (13.5) years participated in the inter-rater reliability study. Twenty-nine adults, 23 women and 6 men, mean (SD) age 39.9 (12.5) years participated in the intra-rater reliability study. The majority were Europeans, 96% and 97% respectively. One participant was excluded because of injury. The time interval from assessments in the inter-rater reliability study varied from 30 min to 7 h and between eight to 8 days in the intra-rater reliability study.
The inter- and intra-rater reliability for the total score of all assessment methods, using ICC 2.1, was good-to-excellent, 0.72-0.82 and 0.76-0.86, respectively (Table 1).
The inter-rater reliability for measurements of joint ROM in degrees was good-to-excellent in all but three of the assessed joints (ICC 2.1: 0.67-0.91). For the hips and right calcaneus the reliability was moderate (ICC 2.1: 0.44-0.59). The differences between raters were within 5 degrees (0.1-4.3) in all but one measurement. The SEM ranged from 1.1 to 6.2 degrees (Table 2).
The intra-rater reliability for measurements of joint ROM in degrees was good-to-excellent in all but three of the assessed joints (ICC 2.1: 0.60-0.90). For left hip and the calcaneus bilaterally the reliability was moderate (ICC 2.1: 0.44-0.51). The differences between test-retest assessments were within 3 degrees (0.0-2.7) in all but one of the measurements. SEM ranged from 1.0 to 5.7 degrees (Table 3).
Regarding prevalence of positive hypermobility findings for separate joint assessments, the Pa ranged from 80 to 100%, except for the calcaneus. Cohen's κ was substantial-to-almost perfect for 13 of the 21 joint assessments (κ = 0.63-1.00), while the PABAK was substantial-to-almost perfect in all but three joint assessments (κ = 0.63-1.00) (Table 4).
For intra-rater reliability, the Pa for prevalence of positive hypermobility findings ranged from 72 to 97% for all total assessment scores.
The inter- and intra-rater reliability for the prevalence of positive hypermobility findings for hip abduction is not reported since none of the participants reached the cut-off limit of > 85 degrees (Tables 4 and 5).

Discussion
To the best of our knowledge, this is the first study to investigate the inter- and intra-rater reliability of the Beighton score, the Contompasis score and the Hospital del Mar criteria. We used a structured protocol including descriptions of testing positions, starting positions, goniometer positions, anatomical landmarks, stabilization of adjacent structures and performance illustrated by photos.
Following this structured protocol with use of a goniometer, all three hypermobility assessment methods, the BeS, the CS and the HdM, showed good-to-excellent inter- and intra-rater reliability for the total scores and for the majority of the single-joint measurements in degrees. The SEM for inter- and intra-rater reliability ranged from 1.0 to 6.2 degrees.
Previous reliability studies of the BeS using a protocol have presented similar results to those in this study [12, 15-17, 25, 38, 39]. However, comparisons with these studies are complicated as the testing procedures vary. This will affect the measurement of joint ROM [40] and thus influence the results [10]. In addition, many studies reported the use of no [21,22] or an insufficient protocol [23,25,41,42]. Comparisons are further hampered due to differences regarding the use or lack of use of a goniometer, reference lines for the goniometer and for anatomical landmarks, insufficient stabilization of adjacent structures, active or passive testing, testing positions, cut-off levels and statistical methods [15-17, 21, 22, 25, 38, 39, 41, 42].
To the best of our knowledge, this was the first inter- and intra-rater reliability study of the CS and the HdM and the first reliability study using measurement in degrees for joints included in the three hypermobility assessment methods. The inter- and intra-rater reliability was good-to-excellent for the majority of the single-joint assessments. Since prevalence and bias affect the magnitude of Cohen's κ, it is recommended to also calculate the PABAK [35]. Due to adjusting for prevalence and bias, a higher PABAK than Cohen's κ was found across all the results (Tables 4 and 5).
The difference between and within the raters in the present study was less than five degrees in all but one measurement which is in accordance with other studies [38,43]. This is within an acceptable measure, as a variation of ±5 degrees in goniometric measurements is generally accepted in the clinic [44,45].
The inter- and intra-rater reliability was moderate for some joints, indicating difficulties in the performance of these assessments. Joints without ROM end points, such as the elbow, the fifth finger and the knee, might be considered more challenging to measure. This could be the reason why these joints in the BeS showed the lowest kappa values and the lowest Pa for the prevalence of positive hypermobility findings in this study as well as in other studies [15,17,25,42]. We stabilized the wrist and the fourth finger when measuring the fifth finger ROM since the test phase showed an increased ROM when the adjacent structures were not stabilized. This may affect the prevalence. Therefore, there is a need for consensus on how the tests are performed.
We have not found any documentation regarding the selection of joints for the criteria of the GJH.
In addition to studying the reliability of the BeS with a structured protocol, this study also aimed to establish the inter- and intra-rater reliability for the measurement of ROM in joints other than those included in the BeS. Children with joint hypermobility assessed with the BeS were equally hypermobile in their ball-and-socket joints [43]. Thus, the importance of ball-and-socket joints in adults with GJH requires further study.
Following this structured protocol with standardized assessments provided an excellent inter- and intra-rater reliability for the measurement of external rotation of the shoulder, ICC 2.1: 0.89-0.90 and 0.86-0.87, respectively. In accordance with another study [15], we reported low inter- and intra-rater reliability in measurements of hip abduction, which may be due to insufficient stabilization of the pelvis. Furthermore, as for hip abduction, measurements of the elbow and calcaneus showed wide confidence intervals. The lack of precision in these measurements, as displayed by the wide CIs, suggests that the reliability should be interpreted with care. For the elbow, this could depend on a large valgus angle that might falsely give an impression of hypermobility [17]. Moreover, it is difficult to evaluate the reliability of the calcaneus tilt since the ROM is within the measurement error of the goniometer. This finding suggests that the calcaneus tilt should be excluded in the assessment of GJH. Other disputable tests included in the HdM are the knee hyperflexion and the big toe extension tests. Most participants scored positive on these tests even though they were not hypermobile in other joints, suggesting that the risk of a false positive finding in the general population is high. Despite good-to-excellent inter- and intra-rater reliability, these tests are not adequate to identify joint hypermobility, as also confirmed in another study [23]. We therefore propose that these tests should be removed from the HdM. The remarkably high prevalence of positive hypermobility findings for knee flexion and big toe extension may have resulted in a higher prevalence of hypermobility in the HdM compared to the BeS in this study.
There was a difference in big toe-extension between right and left side for both inter- and intra-rater reliability, indicating a systematic error. This may be explained by the fact that both raters were right-handed.
None of the participants had hypermobile hip abduction and few had hypermobile external rotation of the shoulder even though measurements showed hypermobility in other joints. This may indicate that the cut-off value for hypermobility in these joints is too high in the HdM. A cut-off value that is too high increases the risk of underdiagnosing hypermobility. In accordance with another study [15], cut-off levels for hypermobility above 55 degrees for hip abduction [30,46] and above 68 degrees for external rotation of the shoulder [46] are supported.
We defined cut-off levels for the three hypermobility assessment methods. A cut-off level of ≥30 for the CS was used to define GJH in this study, which corresponds to the BeS cut-off level of ≥4 points [47]. Previous reliability studies concerning the CS also used other cut-off levels [47,48] than in the original description [19]. A cut-off level of ≥30 for the CS had a lower kappa value compared to a cut-off level of ≥4 or ≥5 when using the BeS and the HdM in this study. This may be due to the fine-scale grading of the CS, suggesting that the CS is more sensitive to measurement differences. Another possible explanation could be the small ROM of the calcaneus tilt and the cut-off levels for hypermobility making the judgement less reliable, as mentioned above.
The strength of this study is that it was planned and developed in accordance with GRRAS [27] and QAREL [28]. It included a structured protocol with use of size-adjusted goniometers and a comprehensive description of the procedures for performing the assessments illustrated by photographs, as recommended [18]. Two experienced physiotherapists, who had trained before the study, performed the measurements. The experience of the rater is important [15], as confirmed in another study showing that inter-rater variability increased as the level of medical education decreased [42]. Furthermore, the stability of joint ROM was taken into account for time intervals of assessments.
The raters stabilized adjacent structures to reduce the risk of false positive hypermobility findings and mainly used passive tests to assure that the end-range position was reached, since passive ROM is greater than active [30].
This study described testing positions since these affect the ROM, and an optimal position should facilitate reaching the end-range position. The testing position of adjacent joints is also important. For example, the position of the wrist and the elbow will impact the ROM of the thumb and the fifth finger [13,38,39].
Table 4 Inter-rater reliability for prevalence of positive hypermobility findings for total score and for single joints
A limitation of the present study is that a required degree of agreement in the training phase, set at 80% as recommended by "The International Federation for Manual/Musculoskeletal Medicine" (FIMM) [49], was not specified. Each rater measured each participant only once to imitate clinical practice. Additionally, another study reported that mobility of joints increased significantly in consecutive measurements [38]. Furthermore, our aim was to measure each participant at the same time of day on all testing occasions, as this might affect joint ROM. However, about half of the participants were not assessed at the same time of the day. This may have influenced the results.
Since both raters were experienced, the use of a third, less experienced rater might have increased the generalizability in a clinical context. However, the generalizability also depends on the raters' ability to follow the testing procedures in a structured protocol. In our study, the raters were experienced. Still, the reliability was not excellent for all measures. For instance, a ROM measurement close to the cut-off level for a positive hypermobility finding could be interpreted as positive by one rater and negative by the other. Future implementation of new