Patients and setting
Data was collected in the Department of Physical Medicine and Rehabilitation at the Medical Faculty of Ankara University, Turkey, from February 2007 to November 2007. A total of 399 outpatients with low back pain were included in the study. Patients with non-mechanical back pain resulting from inflammatory, infectious, malignant or visceral diseases were excluded. In the first stage of the study 266 patients answered all the questions in the total item set obtained from the selected questionnaires (given below). After development of the item banks, the second stage involved another group of 133 patients completing the item banks (items determined after Rasch analysis) under a CAT version and by 'paper and pencil'.
In all cases, questionnaires were either self-completed by literate patients, or where patients were illiterate, the questionnaires were administered by one of the authors (DÖ). At the CAT stage, the same author also helped the patients who were unfamiliar with computer use. All patients gave informed consent to take part in the study and the study was carried out in compliance with Helsinki Declaration.
Selection of questionnaires
Initially, contents of both generic and specific questionnaires commonly used for outcome measurement in LBP were reviewed. The candidate item sets to be used as an item bank in CAT was designed to be applicable to patients with a spectrum of LBP problems and to represent the ICF components of disability [7] and the ICF core set for LBP [25]. Another requirement was the existence of a validated Turkish version of the outcome measure to be selected. After considering these requirements, 4 questionnaires were selected: The Oswestry Disability Index (ODI), the Roland Morris Low Back Pain Disability Questionnaire (RDQ), the World Health Organization Disability Assessment Schedule (WHODAS II), and the Nottingham Health Profile (NHP).
The WHODAS II was developed by the World Health Organisation to assess functioning and disability [26]. Based on the ICF model, it is a 36-item, generic, multidimensional questionnaire which is used for measuring the levels of disability in terms of activities and participation. It includes six domains: understanding and communicating (6 items), getting around (5 items), self care (4 items), getting along with others (5 items), household and work activities (8 items), and participation in society (8 items). It has a 5-point rating scale on all items in which "1" indicates no difficulty and "5" indicates extreme difficulty or inability to perform the activity. Raw scores are transformed into standardized scores. The total score and subscale scores range between 0–100, with higher scores reflecting greater disability. A previously adapted Turkish version of the WHODAS II instrument was used [27].
The Oswestry Disability Index (ODI) is a self-completed questionnaire designed for assessing the degree of functional limitation and pain in patients with LBP [28]. It includes 10 items (pain intensity, personal care, lifting, walking, sitting, standing, sleeping, sex life, social life, and travelling), each of which has 6 ordinal responses. The scale has a total score ranging between 0 and 100 with a high score showing higher disability. The Turkish adaptation was used in this study [29].
The Roland & Morris Disability Questionnaire (RDQ) is a self-completed questionnaire designed to assess physical disability due to LBP [28]. It includes 24 items, each with a dichotomous response category of yes or no. The scale has a total score ranging between 0 and 24 with a high score showing higher disability. The Turkish version of the RMDQ was used in this study [30].
The Nottingham Health Profile (NHP) is a generic health status measure developed to record the perceived distress of patients in physical, emotional and social domains [31]. It comprises 38 statements (answered 'yes' or 'no') in six sections: physical mobility (8 items), pain (8 items), sleep (5 items), emotional reactions (9 items), social isolation (5 items) and energy level (3 items). The Turkish version of NHP was used [32]. In this version the score on each section of the NHP is the percentage of items affirmed by the respondent (that is, the number of 'yes' responses multiplied by 100 and divided by the number of items in that section). Possible scores could range from 0 to 100, with a higher score indicating greater distress.
As seen above, response options and corresponding scores of items across the scales were different. While the items of WHODAS II and ODI were polytomous, those of RDQ and NHP were dichotomous.
The contents of these questionnaires were examined by the investigators regarding their links to the categories of ICF components [6, 33] and also the ICF LBP core set [25]. This examination revealed that some of the items had links with categories covered in both "body functions" and "activities and participation." Another issue at this stage was that some ICF core set categories from the body functions component (mobility and stability of joint functions, muscle power and muscle tone), and one category from the activities and participation component (toileting), were not covered in the contents of the questionnaires. However, as uncovered body function categories require a physical examination, it was impossible to include them in a self-report questionnaire. Regarding the toileting activity, which was the only "activities and participation" category missing, the investigators decided that it was not an essential deficit as most of the components of toileting activity such as sitting, rising from sitting position and dressing were already covered in other items. Furthermore none of the other questionnaires used in LBP were assessing toileting activity. Thus the four chosen scales gave 108 items as candidate items for the item bank.
Data analysis
Initial unidimensionality testing
The 108 items were submitted to an exploratory factor analysis (EFA) for categorical data using weighted least square methods [34] to investigate the dimensionality of the item set. Model fit was evaluated using the root-mean-square error of approximation (RMSEA) that accounts for model parsimony. RMSEA values < 0.08 suggest adequate fit; values < 0.05 indicate good fit [10].
When more than one dimension was found according to the results of EFA, separate item sets were constructed and named. Items, whose factor loadings below 0.40, were eliminated from the item set(s) [11]. After the determination of the dimensions of the total item set by EFA, the next step was to calibrate these items onto their appropriate dimensions using an IRT model.
IRT model selection
The Rasch model, sometimes referred to as the one-parameter IRT model, produces latent trait person estimates that are independent of the distribution of the population, and item difficulty estimates which are independent of the ability of the person [35]. These are requirements for obtaining interval scale estimates [36]. This then allows, for example, the calculation of person change scores from what was originally ordinal data [37]. Master's partial credit model (PCM) is an extension of the Rasch dichotomous model which can accommodate items with different response categories, such as those proposed for the LBP item bank [38]. The PCM equation, in the logit form is:
where P
nik
is the probability of person n affirming category k in item i, compared with an adjacent category (k-1); θ
n
is person ability, b
ik
is the difficulty of the k
ththreshold which is the probabilistic midpoint (i.e., 50/50) between any 2 adjacent categories in item i.
The resulting Rasch analysis, as with all versions of the Rasch model, is mostly concerned with testing the underlying assumptions of the model; that of the probabilistic relationship between items, unidimensionality and local independence [39]. In addition, item bias or differential item functioning can be examined.
Unidimensionality and local independence
The PCM is a unidimensional measurement model, therefore the assumption is that the items summed together form a unidimensional scale. There are various ways to test this assumption, and these can be thought of as a series of indicators to support the assumption. Rasch programs usually provide a principal component analysis of the residuals. The absence of any meaningful pattern in the residuals will also be deemed to support the assumption of unidimensionality. A test for unidimensionality, proposed by Smith EV [19], takes the patterning of items in the residuals, examining the correlation between items and the first residual factor, and uses these patterns to define two subsets of items (i.e., the positively and negatively correlated items). These two sets of items are then used to make separate person estimates, and, using an independent t-test for the difference in these estimates for each person, the percentage of such tests outside the range -1.96 to 1.96 should not exceed 5%. A confidence interval for a binomial test of proportions is calculated for the proportion of observed number of significant tests, and the lower bound should overlap the 5% expected value for the scale to be unidimensional. Given that the differences in estimates derived from the two subsets of items are normally distributed, this approach is robust enough to detect multidimensionality [40] and appears to give a test of strict unidimensionality, as opposed to essential unidimensionality [41]. In the latter case a dominant factor occurs, and although other factors exist, they are not deemed to compromise measurement.
The assumption of local independence implies that when the 'Rasch factor' has been extracted, that is, the main scale, there should be no leftover patterns in the residuals. This assumption was tested by performing a PCA analysis of the residuals obtained from PCM. If a pair of items had a residual correlation of 0.30 or more, one of the items that showed a higher accumulated residual correlation with the remaining items was eliminated [42].
Correct ordering of response categories
Before evaluation of item fit, where polytomous items are involved, the response categories should be examined for correct ordering. This involves the examination of the threshold pattern, the threshold being the transition point between adjacent categories. This ordering of thresholds is graphically demonstrated in the category probability curves by using the RUMM2020 software [43]. For an item with an appropriate ordering of thresholds each response option would demonstrate the highest probability of endorsement at a specific range of the scale, with successive thresholds found at increasing levels of the construct being measured. One of the most common sources of item misfit concerns respondents' inconsistent use of these response options. This results in what is known as disordered thresholds and usually, although not always, collapsing of categories where disordered thresholds occur improves overall fit to the model [44].
Item fit
In the current analysis, individual item fit statistic and individual person fit statistic are presented, both as residuals and as a chi square statistics. The individual item fit statistic is based on the standardised residuals (differences between the observed and expected responses divided by square root of variance and calculated for each patient for a given item). To obtain an overall statistic for an item, the standardised residuals are squared and summed over the patients. The individual item fit statistic is calculated by transforming this overall statistic to make it more nearly approximate a standard normal deviate under the hypothesis that the data fit the model. Thus, it is concluded that the deviations between the responses and the model are no more than random errors. Residuals between ± 2.5 are deemed to indicate adequate fit to the model. A person fit statistic is constructed for each person in a way similar to that of each item. A chi-square test is also available for each item. The chi-square statistics compares the difference in observed values with expected values across groups representing different ability levels (called class intervals) across the trait to be measured. Consequently, for a given item, several chi-squares are computed (the number of groups depend on sample size), and then these chi-square values are summed to give the overall chi-square for the item, with degrees of freedom being the number of groups minus 1. If the p value calculated from the overall chi-square is less than 0.05 (or Bonferroni-adjusted value) then the item is deemed to misfit to the model [45].
In addition to these individual fit statistics explained above, overall item fit statistics, overall person fit statistics and item-trait interaction statistics are presented. If the data accord to the model expectation, the mean of the overall item and the overall person fit statistics should be close to 0 and their standard deviation close to 1. A third summary fit statistics is an item-trait interaction statistics reported as a Chi-Square, reflecting the property of invariance across the trait. This statistic sums the chi-squares for individual items across all items. A significant chi-square indicates that the hierarchical ordering of the items varies across the trait, compromising the required property of invariance. A wide variety of texts are available to help the reader understand fit and the other relevant topics discussed in this article [35, 45–48].
Differential item functioning
DIF, or item bias, can also affect fit to the model. This occurs when different groups within the sample (e.g., younger and older persons) respond in a different manner to an individual item, despite having equal levels of the underlying characteristic being measured. Therefore, this does not preclude a different score between younger and older persons, but rather indicates that, given the same level of, for example, pain, the expected score on any item should be the same, irrespective of age. Two types of DIF may be identified. One is where the group shows a consistent systematic difference in their responses to an item, across the whole range of the attribute being measured, which is referred to as uniform DIF [20]. When there is non-uniformity in the differences between the groups (e.g., differences vary across levels of the attribute), then this is referred to as non-uniform DIF. The analysis of DIF has been widely used to examine cross-cultural validity, and readers can find an explanation of the approach, including the analysis of variance-based statistical analysis used in RUMM2020 software [43], in several recent reports [21, 49, 50]. In the current analysis, DIF was tested by age and gender.
Thus items to be entered into the item bank are required to satisfy Rasch model expectations, be free of DIF, and meet strict unidimensionality and local independence assumptions. This applies to the 'item bank' in total.
Reliability
An estimate of the internal consistency reliability of the item bank was tested by Person Separation Index (PSI). This is equivalent to Cronbach's alpha [51] but has the linear transformation from the Rasch model substituted for the ordinal raw score [52].
Computerized adaptive testing (CAT)
Given the calibrated item bank, the next stage is to apply the CAT application. We have developed new CAT software, Smart CAT™ (v1.0) [53], following the logic of Thissen and Mislevy [22] during this study.
In CAT, when a test is administered to a patient by using a package program via the computer, the program estimates the patient's ability after each question, and then that ability estimate can be used in the selection of subsequent items. For each item, there is an item information function (centred on item difficulty in the dichotomous case), and the next item chosen is usually that which maximises this information. The items are calibrated by their difficulty levels from the item bank. Figure 1, which is adapted from Wainer et al. [54], shows the sequence of steps inherent in CAT administrations in our study. Initially, the question with the median difficulty level in the item bank is administered (Step 1) and the patient's ability level (θCAT) and its standard error (SE) is estimated (Step2). The maximum likelihood estimation method with the Newton-Raphson iteration technique is used for this estimate in the current study [55, 56]. Given this estimate, the next most appropriate item (which maximizes the information for the current θ estimate) is chosen (Step 3) and then presented to the patient and θCAT and its SE are re-estimated (Step 4). If the predefined stopping rule (the SE of 0.5 or less) is not satisfied, Step 5 involves repeating Steps 3–4 until the stopping rule is met. When the stopping rule is satisfied, another dimension is measured or the assessment is completed.
Simulated and real-CAT applications
CAT was applied in two ways: A simulated and a real CAT application. In the simulated CAT, responses for 10000 patients derived from the RUMMss simulation program [57] were taken to represent the responses the patient would have given, had the item been administered in the context of a CAT. These data were simulated to meet Rasch model expectations using the item difficulty estimates from the item bank. Patient's disability level was normally distributed with a mean of 0 and standard deviation of 2. It was assumed that the mode of administration (i.e. paper and pencil which gave estimates for the item bank or the CAT application) would not substantially have affected item responses when the CAT estimated the disability level (θS-CAT) and its SE for each patient. These estimations (θS-CAT) were compared with the disability levels (θS-PCM) generated by the simulation program using all the items based upon the original calibration using the PCM.
In the real-CAT application, 133 patients were asked to complete both a paper-and-pencil test of the full item bank, and the CAT version. Estimations from the real-CAT application (θR-CAT) were compared with the disability levels generated using the response to all items analyzed with a PCM (θR-PCM), with item difficulties anchored to the original calibration of 266 cases.
At the final stage, the estimates derived from the real CAT application (θR-CAT) were compared with those derived from all the original questionnaires, including subscale scores, in order to demonstrate a limited form of convergent validity [14].
To summarize the approach used in this study; questionnaires that had been adapted in the Turkish language were chosen to include the ICF components of disability and the ICF categories listed in the ICF core set for LBP. The dimensionality of the total item set was explored using EFA for categorical data and the psychometric properties of the resulting item set were then evaluated by the Rasch (PCM) model [38]. The calibrations of the items which satisfied the model expectations then formed the item bank which was subsequently included in the CAT process. The CAT process involved both simulated and real (i.e. patient completed) responses. A comparison was made between the simulated CAT (θS-CAT) and the original estimate provided by the simulation programme (θS-PCM). A further comparison was made between the disability levels estimated from the item bank (θR-PCM) and those generated using real (observed) CAT (θR-CAT). And for the last stage, a form of convergent validity between the real CAT derived estimates and the scores from the original questionnaires were also examined. The response burden of the CAT process in terms of the number of items was compared to the 'paper and pencil' approach.
Sample size and statistical software
For the Rasch analysis it is reported that a sample size of 266 patients will estimate item difficulty, with α of 0.05, to within ± 0.3 logits [58]. This sample size is also sufficient to test for DIF where, at α of 0.05 a difference of 0.3 within the residuals can be detected for any 2 groups with β of 0.20. Bonferroni corrections are applied to both fit and DIF statistics due to the number of tests undertaken [59]. A value of 0.05 is used throughout, and corrected for the number of tests. Convergent validity between the real CAT derived estimates and the scores from the original questionnaires, including the subscales, were tested by the Spearman's correlation coefficient (r). The Intraclass correlation coefficient [ICC (2,1)] [60] and the Bland-Altman method [61] were used for evaluating the agreement between PCM and CAT derived θ estimations.
Statistical analysis was undertaken with SPSS 11.5; exploratory factor analysis with the MPlus program [34]; Rasch analysis with the RUMM2020 package [43] and the simulation were undertaken with RUMMss [57]. The CAT application used Smart CAT™ (v1.0) [53].