Research questions
The study had the following objectives:
1. To investigate the psychometric properties of the FreBAQ-G using Item Response Theory (IRT). Based on IRT modelling, item functioning characteristics such as item difficulty, discrimination and item information were examined. In addition, the distribution of the items over the scale and test-score properties, including reliability parameters and measurement error, were evaluated. Finally, item invariance of the FreBAQ-G was assessed to examine whether the questionnaire behaves in the same way in different subgroups of the German-speaking population.
2. To investigate cross-cultural validity / item invariance of the FreBAQ-G in patients with NSCLBP. Using IRT techniques, differential item functioning (DIF) was assessed to evaluate whether the translated version behaves in the same way in the German-speaking population as the original version does in the English-speaking population.
3. To investigate hypothesis-based construct validity of the FreBAQ-G by evaluating the correlations of back specific self-perception with other back pain related parameters such as pain intensity, function and fear avoidance beliefs, corresponding to the English validation study [20].
Study design
The study was designed as a multicentre, cross-sectional study. Data were collected as part of a study evaluating lumbar movement control in persons with NSCLBP. All participants provided written informed consent and all procedures conformed to the Declaration of Helsinki. To investigate cross-cultural validity, the data of this study were pooled with those collected by Wand et al. [20].
Setting
Participants were recruited in seven physiotherapy practices in Germany between April and September 2019.
Participants
Participants had to meet the following inclusion criteria: age ≥ 18 years; sufficient German language ability to complete the questionnaire; current NSCLBP with or without leg pain (any leg pain had to be above the knee, and the main pain had to be localized below the costal margin and above the inferior gluteal folds); and duration of symptoms ≥ 3 months. The pain level, calculated as the mean of the current pain intensity and the average pain intensity during the last 3 months, measured on an 11-point numeric rating scale (NRS), had to be above 0. Participants were excluded if they had signs and symptoms indicating specific spinal pathology [24].
Bias
Data collection and data analysis were conducted by different persons to minimize the potential risk of bias.
Procedure
Participants provided basic demographic information and completed a self-developed questionnaire on LBP characteristics. Pain-related disability during daily activities, leisure time and work, as well as pain intensity, were assessed using 11-point numerical rating scales (NRS; 0 = no pain / disability, 10 = worst imaginable pain / disability). Overall pain-related disability was calculated as the mean of the impairment scores for daily activities, leisure time and work. Pain-related fear was estimated using the German version of the Fear Avoidance Beliefs Questionnaire (FABQ) [25]. Finally, the participants completed the FreBAQ-G [22]. The FreBAQ-G consists of nine items measuring back specific self-perception, each rated on a five-point scale (0–4), giving a total score of 0–36 (higher values indicating greater levels of impairment).
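The scoring described above can be sketched in a few lines of Python. This is an illustration only: the function names and the example responses are hypothetical, and the sketch assumes that each FreBAQ-G item is coded 0 to 4.

```python
from statistics import mean

def frebaq_total(responses):
    """Sum nine item responses (each coded 0-4) into a total score of 0-36."""
    if len(responses) != 9:
        raise ValueError("the FreBAQ-G has nine items")
    if any(r not in range(5) for r in responses):
        raise ValueError("each item is coded 0, 1, 2, 3 or 4")
    return sum(responses)

def overall_disability(daily, leisure, work):
    """Mean of the three 0-10 NRS disability ratings."""
    return mean([daily, leisure, work])

example = [2, 1, 0, 3, 2, 1, 4, 0, 2]   # made-up responses, not study data
print(frebaq_total(example))             # total score for this respondent
print(overall_disability(4, 6, 5))       # overall pain-related disability
```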
Sample size
For questionnaires with ordinal scaled items, polytomous item response models are recommended [26]. The COSMIN (Consensus-based Standards for the selection of health status measurement instruments) checklist advocates a minimum sample size of 200 participants for IRT-based Rasch analyses [23]. For polytomous IRT models, however, the sample size should be at least 250, and 500 is recommended for accurate parameter estimates [27]. To assess the psychometric properties of the German version, we aimed to recruit a sample of more than 250 participants. To evaluate cross-cultural validity, we pooled our German data set with the English-language data set collected by Wand et al. [20], which comprises 251 participants with NSCLBP. The overall sample size for investigating cross-cultural validity therefore meets the recommendation of 500 participants.
Data analysis
Descriptive statistics
Descriptive statistics were used to describe the demographic and clinical characteristics of the sample. The FreBAQ-G was summarized using range, median, mean and standard deviation for the total score. The frequencies in each response category were also reported.
IRT modelling was used to assess cross-cultural validity and the psychometric properties of the FreBAQ-G. Because the nine items of the FreBAQ-G are ordinal scaled, a polytomous IRT model should be used [26]. Based on statistical analysis, the graded response model (GRM) was selected [26]. The assumptions of the IRT model (local independence and dimensionality) and the model fit statistics were investigated. Details of the model selection and the tests of the IRT assumptions are given in the Appendix.
Psychometric properties of the FreBAQ-G
Psychometric properties of the FreBAQ-G, including scalability, internal consistency, item characteristics, test characteristics and test reliability, were calculated. Differential item functioning (DIF) was used to evaluate item invariance, that is, whether respondents from different subgroups of the German-speaking sample who have the same level of the latent trait have the same probability of giving a particular response to the items of the FreBAQ-G.
Internal consistency was estimated using Cronbach’s α. Internal consistency is considered acceptable if α > 0.7 [28]. Loevinger’s Hj scalability coefficient is reported as a measure of homogeneity. The coefficient can be regarded as a measure of how accurately the items order the respondents on the measured latent trait (back specific self-perception) [29]. As a rule of thumb, values of Loevinger’s Hj < 0.3 indicate poor/no scalability, values between 0.3 and 0.4 indicate useful but weak scalability, values between 0.4 and 0.5 indicate moderate scalability and values > 0.5 indicate strong scalability [30].
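As a rough illustration of these two criteria, the sketch below computes Cronbach’s α from a respondent-by-item matrix and applies the quoted rule of thumb to a given Hj value. It is not the Stata routine used in the study. The function names and data are made up, and the handling of values lying exactly on a boundary (0.3, 0.4, 0.5) is an assumption, since the rule of thumb does not specify it.

```python
from statistics import variance

def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (complete cases only)."""
    k = len(rows[0])                                    # number of items
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])    # variance of sum scores
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def scalability_label(h):
    """Rule-of-thumb label for Loevinger's H; boundary handling is assumed."""
    if h < 0.3:
        return "poor/no"
    if h < 0.4:
        return "weak"
    if h <= 0.5:
        return "moderate"
    return "strong"

# Made-up responses of five respondents to three items (not study data).
data = [[0, 1, 0], [1, 1, 2], [2, 3, 2], [3, 3, 4], [4, 4, 3]]
alpha = cronbach_alpha(data)
print(round(alpha, 2), "acceptable" if alpha > 0.7 else "below 0.7")
print(scalability_label(0.45))
```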
After fitting the GRM, the test and item characteristics were evaluated. In IRT modelling, a person’s ability in the latent trait (in this study, “back specific self-perception”) is measured on a logit scale that follows a standard normal (z) distribution with a mean of 0 and an SD of 1 (range from −4 to 4) [26]. This logit scale is called theta (θ) and is represented on the x-axis of every IRT graph. The θ scale is not sample specific [26, 31, 32], so that even when the questionnaire is administered to other groups or in other languages, the items should have the same properties and yield comparable scores. Hence, the item and test characteristics of the current study should be comparable to those of the original English version reported by Wand et al. [20].
The test characteristic curve visualizes the relationship between the IRT-based estimated ability in the latent trait “back specific self-perception” for each person and the expected classical sum-score, based on the classical test theory [26]. This helps to understand which FreBAQ-G sum-score is expected for a person with NSCLBP with a certain trait level on the current scoring system.
The test information function shows how precisely the FreBAQ-G can estimate the respondent’s level of ability in the latent trait [26]. It thereby helps to identify which regions of the latent trait continuum are estimated most precisely and which most poorly. This concept is closely related to the concept of reliability [32]; therefore, the test information function also visualizes the standard error (SE). In IRT, the SE varies with the level of the latent trait. The SE can be used to calculate the estimated overall mean reliability, often described as marginal reliability, using the formula: reliability = 1 − mean(SE)² [33].
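The link between test information, SE and marginal reliability can be sketched as follows. The sketch relies on the standard IRT relation SE(θ) = 1/√(information) and reads the formula as one minus the mean of the squared SEs, with θ scaled to a variance of 1. That reading is an assumption, since the notation could also mean the square of the mean SE. The function names and information values are hypothetical.

```python
from math import sqrt
from statistics import mean

def standard_errors(information):
    """SE(theta) = 1 / sqrt(test information at theta)."""
    return [1 / sqrt(i) for i in information]

def marginal_reliability(se_values):
    """One minus the mean squared SE (assumed reading of the formula)."""
    return 1 - mean([se ** 2 for se in se_values])

# Made-up test information at several points along the theta scale.
info = [1.5, 4.0, 6.5, 7.0, 5.0, 2.0]
se = standard_errors(info)
print([round(s, 2) for s in se])
print(round(marginal_reliability(se), 2))
```

In practice the SEs would be those of the estimated person scores, so that the average reflects where respondents actually lie on the latent trait.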
The item characteristics include item discrimination (slope), item difficulty (threshold) and item information [26]. The item discrimination parameter (a) describes the slope of the item characteristic curve. Higher values indicate better item discrimination, meaning that items with higher values are more sensitive to differences in the latent trait (back specific self-perception). Values > 1 are desirable [26]. Item discrimination and item information are very closely related [26, 32]. The item difficulty parameter (b) describes the point on the x-axis (θ value) at which the probability of choosing a given response option or a higher one is 50% (threshold); because of the underlying statistical nature of the GRM, the item difficulty parameters are cumulative [26]. Item difficulty parameters are calculated for each item. A person whose back specific self-perception is not impaired will choose response option 0 (never feels like that), whereas a person with highly impaired back specific self-perception should have a high probability of choosing response option 4 (always, or most of the time, feels like that). The response option that a person with a certain trait level is most likely to choose is visualized in the category characteristic curve.
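To make the roles of the a and b parameters concrete, the following sketch evaluates the category probabilities of a single GRM item. The parameter values are hypothetical and are not estimates from this study. Each threshold defines a cumulative curve for "this category or higher", and a category probability is the difference between two adjacent cumulative curves.

```python
from math import exp

def grm_category_probs(theta, a, thresholds):
    """Graded response model: probability of each response category.

    thresholds: ordered b parameters, one per step between categories.
    """
    # Cumulative probability of responding in category k or higher.
    cum = [1.0] + [1 / (1 + exp(-a * (theta - b))) for b in thresholds] + [0.0]
    # A category probability is the difference of adjacent cumulative curves.
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

def expected_item_score(theta, a, thresholds):
    """Expected response (0-4 for four thresholds) at a given theta."""
    probs = grm_category_probs(theta, a, thresholds)
    return sum(k * p for k, p in enumerate(probs))

# Hypothetical item: a = 1.5, four thresholds for a 0-4 response scale.
a, b = 1.5, [-1.0, 0.0, 1.0, 2.0]
for theta in (-2, 0, 2):
    probs = grm_category_probs(theta, a, b)
    print(theta, [round(p, 2) for p in probs],
          round(expected_item_score(theta, a, b), 2))
```

Summing the expected item scores over all nine items at each θ would give the expected sum-score, which is what a test characteristic curve plots.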
Finally, differential item functioning (DIF) was used to assess the assumption of item invariance [26]. Item invariance implies that the FreBAQ-G is independent of particular sample characteristics. DIF is present for a given item if individuals with the same ability level (back specific self-perception) but belonging to different groups (e.g. gender) do not have the same probability of responding to the item with the same rating [26]. Item invariance can therefore be considered a measure of fairness.
Cross-cultural validity
Cross-cultural validity refers to the equivalence of measurement across different cultural groups [28]. Cross-cultural validity was investigated using IRT techniques. In a first step, we pooled the data of the German version (FreBAQ-G, N = 271) with those collected for the English-language validation study (FreBAQ, N = 251) in an Australian study population [20]. To detect differential item functioning (DIF), we first investigated the item properties (difficulty and discrimination) separately for the German and English versions using the graded response model (GRM). To differentiate between uniform DIF (difference in item difficulty only) and non-uniform DIF (difference in item difficulty and discrimination), the mean item difficulty was calculated per polytomous item with the slopes of all items set to 1 [34]. The calibrated mean item difficulties were plotted with the German items on the y-axis and the English items on the x-axis. To facilitate interpretation, an identity line with a slope of 1 was drawn through the origin of the plot. Additionally, control lines representing the 95% CI were drawn around the identity line. Items that fall outside these control lines are suspected of demonstrating DIF [28, 31]. The item discrimination parameters were plotted in the same way. In addition, we used the IRT-LR test (likelihood ratio test) to confirm both uniform and non-uniform DIF [34, 35]. The IRT-LR test procedure compares hierarchically nested IRT models: one model that fully constrains the IRT parameters to be equal between the German and the English versions, and other models that allow the item parameters to be freely estimated between groups. Finally, we used a multiple-group GRM with a correction for the observed DIF to validate the performance of the classical sum-score of the English and German versions [34].
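The logic of a likelihood ratio comparison between nested models can be sketched as below. This is a generic illustration, not the Stata or IRTPRO procedure used here. The log-likelihood values are invented, and the helper `chi2_sf` is a hypothetical name for a small standard-library implementation of the chi-square upper-tail probability for integer degrees of freedom.

```python
from math import erfc, exp, gamma, sqrt

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2
    # Closed-form start value for df = 2 (even) or df = 1 (odd) ...
    q, nu = (exp(-half), 2) if df % 2 == 0 else (erfc(sqrt(half)), 1)
    # ... then step up two degrees of freedom at a time.
    while nu < df:
        q += half ** (nu / 2) * exp(-half) / gamma(nu / 2 + 1)
        nu += 2
    return q

def lr_test(loglik_constrained, loglik_free, df):
    """Likelihood ratio test for nested models; df = extra free parameters."""
    g2 = 2 * (loglik_free - loglik_constrained)
    return g2, chi2_sf(g2, df)

# Invented log-likelihoods: one item's five GRM parameters (1 slope,
# 4 thresholds) are freed between the two language groups.
g2, p = lr_test(-6250.4, -6243.1, df=5)
print(round(g2, 1), round(p, 3))
```

A small p-value would suggest that freeing the item's parameters improves fit, that is, that the item functions differently in the two groups.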
Construct validity: associations of self-perception of the back with back pain related parameters
The relationships between the IRT-based estimated FreBAQ-G score (θ) and pain intensity, disability and fear avoidance beliefs were calculated using Pearson’s correlation coefficient (r). Finally, multiple linear regression with the FreBAQ-G score (estimated as θ) as the dependent variable was performed to identify the best predictors.
Stata 16.1 (StataCorp LLC, USA) was used for the statistical analyses. The IRT model fit statistics were calculated using the student version of IRTPRO 4.2 (Scientific Software International Inc., USA).