Test-retest reliability and measurement errors of grip strength test in patients with traumatic injuries in the upper extremity: a cross-sectional study

Background The grip strength (GS) test is an essential part of clinical practice with patients with upper extremity injuries. The random error of the GS test was hypothesized to be proportional to the level of GS. The purpose of the current study was to estimate a precise range for the measurement error of GS in patients following traumatic injuries in the upper extremity. Methods Following traumatic injuries in the upper extremity, 109 participants completed GS tests twice, one weekend apart. Bland-Altman plot analysis was adopted to estimate precise limits of agreement with 95% confidence interval (CI). Results In injured upper extremities, the mean of three consecutive trials had a higher intraclass correlation coefficient, 0.974 (95% CI = 0.963, 0.982), than one trial or the mean of the first two trials. When GS was ≤20 kg, the upper limit of agreement with 95% CI was estimated as (0.41 × average GS + 1.24), while the lower limit was estimated as (− 0.41 × average GS − 0.39). A table of one-to-one matches between averaged GS ≤ 20 kg and transformed ranges of random errors with 95% certainty was created; the standard error of measurement and minimal detectable change with 95% certainty of the GS test were 1.8 and 4.9 kg, respectively. When GS was > 20 kg, the limits of agreement with 95% CI ranged from − 4.9 to 5.3 kg, and the standard error of measurement and minimal detectable change with 95% certainty were 1.8 and 5.1 kg, respectively. Conclusion When GS is ≤20 kg, the one-to-one match table can be considered a practical tool for judging whether a change in GS score is real or due to random error.


Background
In clinical practice, therapists are concerned with several aspects of care for patients with upper extremity injuries, including pain, scarring, swelling, range of motion of involved and adjacent joints, sensibility, muscle strength, and fine motor abilities. Among these aspects, grip strength (GS) is an essential indicator of hand function because it is a basic requirement for the performance of sports, daily activities, and work tasks [1][2][3]. Additionally, GS can reflect general health status and, more specifically, is negatively associated with cardiovascular mortality, myocardial infarction, and stroke [4]. Therefore, reliable GS measures are important for evaluating the severity of a disability and for monitoring clinical progress.
The intraclass correlation coefficient (ICC) is traditionally used to estimate the agreement between two repeated administrations [5,6]. Previous studies of the measurement properties of GS showed that the hand dynamometer has satisfactory test-retest reliability in upper extremities with physical dysfunction [7][8][9]. To determine whether a patient's change on a specific measurement is real or due to random error, the minimal detectable change with 95% certainty (MDC 95) is used to estimate the size of random errors [10][11][12]. Therefore, by knowing the MDC 95 of GS in patients with upper extremity injuries, clinicians can determine whether a change in GS score is likely to reflect a real improvement or random measurement error. Schreuders et al. [8] estimated the test-retest reliability of the GS test in patients with hand injuries and reported an ICC of 0.97, with an MDC 95 of 61 N (≈ 6.22 kg). This means that differences between two consecutive measurements greater than 61 N can be interpreted as real changes in GS with 95% certainty. Nevertheless, in clinical practice, many patients with upper extremity injuries at the sub-acute stage have a very low GS of only a few kilograms. In the present authors' view, an MDC 95 of 61 N may be too large for such patients, because it is implausible that patients with extremely poor GS would show such relatively large random errors. Although clinicians can be highly confident that a change is real when it exceeds this large MDC 95, applying it to patients with low GS carries a high risk of false-negative interpretations, in which real changes smaller than the MDC 95 are dismissed as random error.
The Bland-Altman plot complements the role of the ICC and MDC in determining the test-retest reliability of measurement tools. The plot, usually presented as the differences between two measurements against the mean of the two measurements, can reveal the 95% limits of agreement (LoA 95), which is the width of the differences with 95% certainty. The LoA 95 defines a range within which most differences will lie, and a narrow LoA 95 indicates that the scores of the two measurements are close together [13]. Whether and how the differences are related to the mean can be identified through statistical analysis and visual inspection [13]. In a study evaluating the test-retest reliability of the Jamar Dynamometer in a healthy population of 76 participants, the Bland-Altman plot seemed to indicate that the differences were proportional to the mean [14]. In addition, we found similar scatters in the Bland-Altman plot of another study with 19 healthy participants [15]. Therefore, we hypothesized that the random errors between two administrations are also proportional to GS in patients with upper extremity injuries. If so, it is not appropriate to judge whether a change in GS score is real or due to random error by using the MDC 95 alone in patients whose GS ranges from several kilograms to tens of kilograms.
The purpose of this study was to estimate the test-retest reliability and the precise range of measurement errors of the GS test in patients with upper extremity injuries using Bland-Altman plot analysis, to help clinical practitioners determine whether patients' changes in GS indicate real progress or are due to random error.

Design
This research was a clinical measurement and cross-sectional study. Participants received GS tests twice; more specifically, the first test was administered on Friday and we carried out the second test on the following Monday.

Participants
Patients with upper extremity dysfunction due to traumatic occupational injuries were recruited in a rehabilitation center. All patients were receiving inpatient rehabilitation services in the rehabilitation center when they were recruited. The following inclusion criteria were applied: (1) aged 18 years or above; (2) having a traumatic injury in unilateral upper extremity; (3) being capable of being evaluated for GS, confirmed by an occupational therapist experienced in hand therapy; (4) remaining dysfunction in injured upper extremities; and (5) having good compliance with occupational therapists' daily treatment instructions. The following exclusion criteria were applied: (1) having concurrent injuries in any other parts of the body; (2) experiencing pain when performing maximal isometric GS (visual analogue scale > 3); and (3) not being able to attend the second GS test.
All participants signed an informed consent form in accordance with the Declaration of Helsinki, and the study was approved by the medical ethics committee of the rehabilitation center.

Procedures
In this rehabilitation center, all patients receive rehabilitation services five days per week, from Monday to Friday. During weekends, they usually go home or stay in wards and do not receive any formal rehabilitation services from clinical practitioners. The aim of this study was to estimate the test-retest reliability and the range of measurement errors of GS test. To avoid any bias from interventions, we arranged the first test on Friday and the second test on the following Monday. Therefore, we hypothesized that because no effective interventions were delivered in the short interval between the two tests, none of the participants would have experienced a real change in GS. After signing the consent form, demographic data including gender, marital status, age, height, body weight, and dominant hand were collected from each participant. In addition, injured sides, injury sites, and the number of days since injuries were confirmed.

GS test
Prior to starting the first test, participants were instructed to sit on a chair and maintain the posture recommended by the American Society of Hand Therapy [16] and Roberts et al. [17]. The participants sat with their feet flat on the floor, the shoulder adducted at 0 degrees, the elbow flexed at 90 degrees, the forearm in a neutral position, and the wrist extended to 30 degrees. The dynamometer used in this study was a calibrated Jamar Hydraulic Hand Dynamometer (model SH5001, Saehan Corp, Masan, Korea), the most commonly used dynamometer, which showed excellent reliability for the measurement of GS in previous studies [17]. Verbal instructions and a demonstration of how to perform the GS test were provided to each participant prior to the test. Once everything was ready, the participants were instructed to exert maximum grip at the second handle position and to maintain the contraction for five seconds. Three consecutive trials were performed with both the injured and the healthy upper extremity, with a 15-s rest period between trials to prevent muscle fatigue. All participants started the test with their healthy hands. The value at which the needle of the dynamometer stopped was recorded for each trial. The second test followed the above procedures and used the same dynamometer for all patients. In the current study, the same occupational therapist, who was experienced in hand therapy, performed all participants' GS tests.

Statistical analysis
Descriptive statistics were computed to illustrate participants' demographic characteristics. Both the one-sample Kolmogorov-Smirnov test and histogram plots were applied to check the normality of continuous variables. We used the data of the first trial, the mean of the first two trials (mean 2), and the mean of the three trials (mean 3) to estimate the test-retest reliability and the measurement error of GS of injured and healthy upper extremities. ICC 2,1 values and their 95% confidence intervals (CI) were calculated [5]. An ICC value higher than 0.9 was considered excellent. In addition, a paired t-test was applied to verify whether there was any systematic bias between the first and second tests. The MDC 95 and standard error of measurement (SEM) were calculated using the following formulas [18]:
$$\mathrm{SEM} = \mathrm{SD} \times \sqrt{1 - \mathrm{ICC}} \qquad (1)$$
$$\mathrm{MDC}_{95} = 1.96 \times \sqrt{2} \times \mathrm{SEM} \qquad (2)$$
To verify whether there were any other relationships between GS and measurement errors, Bland-Altman plots were created based on the values of mean 3. A systematic error is confirmed if the 95% CI for the mean value of differences does not include 0. The LoA 95 was calculated by using the Bland-Altman plots, which present the scatter of differences between the first and second tests (y-axis) against the average of the first and second GS tests (average GS) (x-axis) [19]. If the differences are normally distributed and do not show any association with the average GS, the limits of the LoA 95 are computed as
$$\mathrm{LoA}_{95} = \text{mean difference} \pm 1.96 \times \text{SD difference} \qquad (3)$$
where mean difference is the mean of differences between the two tests, and SD difference is the standard deviation of the differences. This implies that 95% of the differences will lie between the upper and lower limits.
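As a minimal sketch of the quantities described above, the following Python functions compute the SEM, the MDC 95, and the constant-width LoA 95. The sketch assumes the conventional definitions SEM = SD × √(1 − ICC) and MDC 95 = 1.96 × √2 × SEM; the function names and the paired scores are hypothetical and are not study data (the study itself used SPSS).

```python
import math
import statistics


def sem(sd: float, icc: float) -> float:
    """Standard error of measurement, assuming SEM = SD * sqrt(1 - ICC)."""
    return sd * math.sqrt(1.0 - icc)


def mdc95(sem_value: float) -> float:
    """Minimal detectable change with 95% certainty: 1.96 * sqrt(2) * SEM."""
    return 1.96 * math.sqrt(2.0) * sem_value


def loa95(test1, test2):
    """Return (mean difference, lower limit, upper limit) for paired scores.

    Differences are taken as second test minus first test.
    """
    diffs = [b - a for a, b in zip(test1, test2)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    return mean_diff, mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff


# Hypothetical paired GS scores in kg (illustration only, not study data).
first = [12.0, 25.5, 8.0, 30.0, 18.5, 22.0]
second = [13.0, 24.5, 9.5, 31.0, 18.0, 23.5]
print(loa95(first, second))
print(round(mdc95(1.8), 1))
```

Under these assumed definitions, a rounded SEM of 1.8 kg corresponds to an MDC 95 of approximately 5.0 kg.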
In injured upper extremities, the Spearman's correlation coefficient ρ between the observed differences, which were not normally distributed, and the average GS was 0.118 (p = 0.310). Because the differences themselves showed no significant association with the average GS, residuals were defined as the differences between the observed differences and the mean of differences. It was observed that the absolute values of residuals (|R|), which were the distances between the observed differences and the mean difference, tended to increase as the average GS increased in upper extremities with poor GS. However, in upper extremities with high GS, this trend was not distinct. To identify the most appropriate cutoff point on the average GS to separate these two conditions, the Spearman's correlation coefficient ρ between the |R| and the average GS was calculated among cases with an average GS lower than each possible cutoff point. Spearman's coefficient was used because the |R| was not normally distributed. The cutoff was defined as the point at which the relationship between the |R| and the average GS had the highest Spearman's correlation coefficient. The Bland-Altman plots were then constructed again for the two conditions according to Bland and Altman's recommendations [13]. First, we regressed the |R| on the average GS to derive
$$\widehat{|R|} = c_0 + c_1 \times \text{average GS} \qquad (4)$$
where c_0 and c_1 are the intercept and slope of the regression. Second, the LoA 95 was calculated using the following formula:
$$\mathrm{LoA}_{95} = \text{mean difference} \pm 2.46 \times (c_0 + c_1 \times \text{average GS}) \qquad (5)$$
Once the upper and lower limits of the LoA 95 were calculated, one-to-one matches between integral GS scores and transformed ranges of random errors with 95% certainty were created for convenience in clinical application. The transformed lower and upper limits of the ranges of random errors were calculated using the integral GS scores plus the upper and lower limits of LoA 95, respectively.
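The cutoff search and the regression-based limits described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' SPSS procedure: the helper names are hypothetical, cases are selected with average GS ≤ cutoff, a minimum subgroup size (`min_n`) is imposed for stability, and the multiplier 2.46 is taken as 1.96 × √(π/2), as in Bland and Altman's regression approach [13].

```python
import math
import statistics


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def spearman(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def best_cutoff(avg_gs, abs_resid, candidates, min_n=5):
    """Return (cutoff, rho) maximising rho among cases with average GS <= cutoff."""
    best = None
    for c in candidates:
        sub = [(a, r) for a, r in zip(avg_gs, abs_resid) if a <= c]
        if len(sub) < min_n:  # too few cases for a meaningful correlation
            continue
        rho = spearman([a for a, _ in sub], [r for _, r in sub])
        if best is None or rho > best[1]:
            best = (c, rho)
    return best


def regression_loa(avg_gs, diffs):
    """Fit |R| = c0 + c1 * average GS and return a function a -> (lower, upper)."""
    mean_diff = statistics.mean(diffs)
    abs_resid = [abs(d - mean_diff) for d in diffs]
    mx, my = statistics.mean(avg_gs), statistics.mean(abs_resid)
    c1 = (sum((a - mx) * (r - my) for a, r in zip(avg_gs, abs_resid))
          / sum((a - mx) ** 2 for a in avg_gs))
    c0 = my - c1 * mx
    factor = 1.96 * math.sqrt(math.pi / 2.0)  # approximately 2.46

    def limits(a):
        half_width = factor * (c0 + c1 * a)
        return mean_diff - half_width, mean_diff + half_width

    return limits


# Hypothetical data in kg (illustration only, not study data).
avg = [2.0, 4.0, 6.0, 8.0]
diff = [1.0, -2.0, -3.0, 4.0]
print(regression_loa(avg, diff)(5.0))
```

Because the half-width grows linearly with the average GS, the resulting limits fan out as GS increases instead of running parallel to the mean difference.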
All statistical analyses were performed with the IBM SPSS Statistics 20. The level of significance was set at p < 0.05 for all statistical analyses performed.

Sample characteristics
A total of 111 patients participated in the current study between February and September 2017, of whom two patients did not attend the second test. Therefore, 109 patients were included in the statistical analysis, and their median age was 37 years. In terms of the types of injuries, 52 (47.7%) patients had fractures, followed by 18 (16.5%), 14 (12.8%), 14 (12.8%), and 11 (10.1%) for finger replantation, finger amputation, complex injuries involving tendons, and complex injuries involving nerves, respectively. The median days since being injured was 133. Characteristics of the participants are presented in full in Table 1.

The test-retest reliability of GS test
The indices of test-retest reliability of the GS test in injured and healthy upper extremities based on the data of the first trial, mean 2, and mean 3 are presented in Table 2. No significant differences were observed between the first and second test scores; the mean differences ranged from 0.1 (95% CI = − 0.7, 0.8) kg to 0.7 (95% CI = 0, 1.4) kg. The test-retest reliabilities of GS in injured and healthy upper extremities were excellent, with high ICCs ranging from 0.936 (95% CI = 0.908, 0.956) to 0.974 (95% CI = 0.963, 0.982). Although mean 3 had the highest ICC of 0.956 (95% CI = 0.936, 0.970) in healthy upper extremities, it did not differ significantly from those of mean 2 and the first trial, 0.949 (95% CI = 0.926, 0.965) and 0.936 (95% CI = 0.908, 0.956), respectively. In contrast, in injured upper extremities, mean 3 had the highest ICC of 0.974 (95% CI = 0.963, 0.982), which was significantly higher than that of the first trial, 0.945 (95% CI = 0.920, 0.962). However, the ICC of mean 3 was not significantly different from that of mean 2, 0.970 (95% CI = 0.956, 0.979). In injured upper extremities, mean 3 had the lowest SEM and MDC 95, 1.8 kg and 5.0 kg, respectively, compared with mean 2 and the first trial (Table 2).
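For readers who wish to see how an ICC 2,1 of the kind reported above is obtained, the following is a minimal sketch of the two-way random-effects, absolute-agreement, single-measurement ICC computed from ANOVA mean squares. The function name and the scores are hypothetical; the study's values were produced with SPSS, and this sketch does not compute the confidence intervals.

```python
def icc_2_1(scores):
    """ICC(2,1) for a table with one row per participant and one column per session."""
    n = len(scores)        # number of participants
    k = len(scores[0])     # number of sessions (2 for test-retest)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares for participants, sessions, and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-participants mean square
    msc = ss_cols / (k - 1)              # between-sessions mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical test-retest GS scores in kg (illustration only, not study data).
gs = [[12.0, 13.0], [25.5, 24.5], [8.0, 9.5], [30.0, 31.0], [18.5, 18.0]]
print(round(icc_2_1(gs), 3))
```

Because the between-sessions mean square enters the denominator, a systematic shift between the first and second tests lowers this coefficient, which is why it is paired with the t-test for systematic bias.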

The Bland-Altman plot analysis based on mean 3
The Bland-Altman plot analysis was conducted for further examination of the differences. In healthy upper extremities, the Bland-Altman plot showed no systematic trend (Fig. 1 a).
The mean difference between the second and first tests was 0.2 (95% CI = − 0.4, 0.7) kg. The width of LoA 95 was − 5.6 to 5.9 kg, and 101 (92.7%) cases fell within the 95% limits of agreement. In injured upper extremities, the mean difference between the second and first tests was 0.4 (95% CI = − 0.1, 0.8) kg. The width of LoA 95 was − 4.6 to 5.3 kg (Fig. 1 b), and 97 (89.0%) cases fell within the 95% limits of agreement. Visual inspection suggested that the absolute value of residuals was proportional to the average GS in injured upper extremities with poor GS. However, in injured upper extremities with high GS, this trend was not distinct.
The Spearman's correlation coefficients between the absolute values of residuals and average GS scores lower than possible cutoffs were calculated (Fig. 2). The results indicated that 20 kg was the most appropriate cutoff, with the highest correlation coefficient (Spearman's ρ = 0.566). As shown in Fig. 3, for injured upper extremities with average GS ≤ 20 kg, the Bland-Altman plot showed a narrow LoA 95 in upper extremities with poor GS, and the LoA 95 widened as the average GS increased. The mean difference between the second and first tests was 0.43 (95% CI = − 0.15, 1.00) kg. The estimated limits of agreement were:
$$\text{Upper limit} = 0.41 \times \text{average GS} + 1.24 \qquad (6)$$
$$\text{Lower limit} = -0.41 \times \text{average GS} - 0.39 \qquad (7)$$
One-to-one matches between GS scores ≤20 kg and ranges of random errors with 95% certainty were created and are presented in Table 3.
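The estimated limits for average GS ≤ 20 kg can be evaluated directly, as in the following sketch. It uses the upper limit given above and the lower limit stated in the abstract (− 0.41 × average GS − 0.39); the function names are hypothetical. It evaluates the limits at a given average GS only and does not reproduce the transformed ranges of Table 3.

```python
CUTOFF_KG = 20.0  # the limits below apply only when average GS <= 20 kg


def limits_low_gs(average_gs: float):
    """Return (lower, upper) limits of agreement in kg for average GS <= 20 kg."""
    if average_gs > CUTOFF_KG:
        raise ValueError("these limits were estimated for average GS <= 20 kg")
    lower = -0.41 * average_gs - 0.39
    upper = 0.41 * average_gs + 1.24
    return lower, upper


def within_random_error(first_test: float, second_test: float) -> bool:
    """True if the change (second minus first) lies inside the limits."""
    average_gs = (first_test + second_test) / 2.0
    lower, upper = limits_low_gs(average_gs)
    return lower <= (second_test - first_test) <= upper


# Hypothetical scores in kg (illustration only, not study data).
print(limits_low_gs(10.0))
print(within_random_error(8.0, 10.0))
print(within_random_error(5.0, 12.0))
```

A change for which `within_random_error` returns `True` cannot be distinguished from random error at the 95% level under these limits.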
In injured upper extremities with GS scores > 20 kg, the Bland-Altman plot based on mean 3 showed no systematic difference between the first and second tests and no distinct trend between residuals and average GS scores. The mean difference between the second and first tests was 0.2 (95% CI = − 0.7, 1.1) kg, and the width of LoA 95 was − 4.9 to 5.3 kg. In total, 104 (95.4%) cases

Discussion
The inter-rater reliability of the GS test has been shown to be excellent in a previous study [20]. The current study estimated the test-retest reliability of the GS test based on the results of the first trial, the mean of the first two trials, and the mean of three consecutive trials, and calculated the SEM and the MDC. In addition, Bland-Altman plot analysis was adopted to explore the relationship between measurement errors and GS in healthy and injured upper extremities.
In healthy upper extremities, our findings were consistent with previous studies that evaluated the test-retest reliability of hand-held dynamometers [21][22][23][24][25]. Our study confirmed that the GS test using the Jamar Dynamometer had excellent reliability and was not affected by a practice effect. In addition, although mean 3 had the highest ICC, it was not significantly different from those of the first trial and mean 2. This indicated that the three methods had comparable reliability and supported the one-trial protocol for assessing GS in healthy upper extremities [26]. Visual inspection of the Bland-Altman plot for healthy upper extremities did not show signs of any systematic relationship between the differences and GS scores. Therefore, the MDC 95 could be considered an ideal criterion for determining whether changes in the GS of healthy upper extremities are real or due to random error.
However, in injured upper extremities, mean 3 had a significantly higher ICC than the first trial, although its difference from mean 2 was non-significant. The SEM and MDC 95 of mean 3 were also the lowest in injured upper extremities. In particular, the paired t-test for the first trial of injured upper extremities showed a p-value close to the significance threshold, and the lower limit of the 95% CI for the mean difference was zero. This suggested that a systematic bias might have influenced the reliability of the first trial. Kennedy et al. [27] found that one trial and mean 3 had comparable test-retest reliability in a sample of 25 participants with rheumatoid arthritis. In contrast, we recruited 109 participants in the current study, which resulted in narrow 95% CIs for the ICC. Consequently, although the difference between the ICC values of the first trial (ICC = 0.945) and mean 3 (ICC = 0.974) was not large, it was statistically significant. Accordingly, we considered that the first trial, mean 2, and mean 3 all had excellent test-retest reliability, among which the mean 3 method was the most reliable. Therefore, we support the use of the mean 3 method to test patients' GS in clinical practice, as recommended by the American Society of Hand Therapy [16], even though it entails extra time for the test.
The Bland-Altman plot is a graphical method for identifying relationships between the differences and the averages of scores on two tests [13]. In our study, a trend was observed whereby the difference was proportional to the average of the two GS tests in injured upper extremities with poor GS. However, the width of the differences in injured upper extremities with high GS was stable. We used the Spearman's correlation coefficient to identify the ideal cutoff point, defined as the point at which the relationship between the absolute values of residuals and average GS had the highest coefficient.
Our findings showed that 20 kg was the most appropriate cutoff point to separate injured upper extremities into these two conditions. Additionally, we estimated the width of LoA 95 for injured upper extremities with GS ≤ 20 kg according to the recommendations of Bland and Altman [13]. The graph of LoA 95 looked like a "horn," which indicated that the measurement error increased as GS increased when GS was ≤20 kg. To take the example of a GS score of 5 kg, the width of the measurement error was − 3.5 to 3.4 kg according to the horn-like LoA 95. However, according to the MDC 95, the width of the measurement error was − 4.9 to 4.9 kg. Therefore, if the MDC 95 were used to determine whether such a patient's change was beyond the threshold of random error, the possibility of a false-negative interpretation would be increased.
To simplify clinical application of the equations we proposed, we transformed them into a table giving one-to-one matches between GS scores and ranges of random errors with 95% certainty. For the clinical application of this table, clinicians can first find their patients' current level of GS in the left column of the table, and then the lower and upper limits of the corresponding range of random errors can be determined. Specifically, a patient's change may not be real if the result of the second test lies within the corresponding range. In addition, upon combining the Bland-Altman plot for GS ≤ 20 and GS > 20 kg, a higher percentage of cases was found to fall within the 95% limits of agreement.
Our study may be the first with a large sample size to construct separate Bland-Altman plots for poor and high GS scores. In the past, many authors adopted Bland-Altman plots to analyze the reliability of the GS test, but they did not conduct the analysis presented in our study. The most important reason might be their small sample sizes [15,28,29]. Scatter plots of small samples rarely reveal obvious relationships, and statistical analyses of such samples readily yield non-significant results, which may explain why such analyses were not pursued in previous studies. In our study, the Spearman's ρ between the absolute values of residuals and the average of the first and second tests was 0.566 in 76 patients with GS ≤ 20 kg, and post hoc analysis showed a statistical power of 0.999. In addition, previous researchers commonly evaluated the reliability of devices for the GS test in healthy populations with high GS scores [14]. In contrast, the GS scores of injured upper extremities in our study ranged from several kilograms to tens of kilograms, which covered the full range of GS scores. This was another reason why we could identify the relationship between measurement errors and GS scores.
This study also had some limitations. Firstly, the participants received rehabilitation services on weekdays. We hypothesized that they did not undergo any real change over the weekend because they did not receive any formal interventions during this period. However, two confounding factors might have influenced the results of the current study: the lasting effect of interventions received during weekdays and additional exercises done by the participants during the weekend. Generally speaking, the lasting effect and additional exercises could improve participants' GS scores and increase the extent of disagreement between the two tests. This may be the reason why the paired t-test for the first trial of injured upper extremities showed a p-value close to the significance level. Secondly, only 33 participants had a GS score > 20 kg. Therefore, we could not conclude with strong confidence whether the appropriate cutoff point would have changed if we had recruited more participants with GS > 20 kg. Thirdly, to avoid any learning effect, some researchers employed a warm-up practice prior to the GS test in addition to verbal instructions and demonstration [30,31]. However, in the current study, we provided verbal instructions and demonstration only, and this may have had a negative influence on the reliability of the GS test. Lastly, we only sampled participants who had traumatic injuries and only used one commercial hand-held dynamometer to estimate the measurement error in the current study. Therefore, we cannot be certain that our results can be generalized to other disorders or to other devices used to assess GS.
Table 3 One-to-one matches between grip strength and transformed ranges of random errors with 95% certainty when the grip strength is ≤20 kg

Conclusions
In summary, the GS test was found to have excellent test-retest reliability in healthy and injured upper extremities. We recommend that clinical practitioners use mean 3 for the GS test, particularly in cases with injured upper extremities. When the GS is ≤20 kg, clinicians can use the one-to-one match table to judge whether a change in GS is real or due to random error.