Patients and setting
Data were collected in the Department of Physical Medicine and Rehabilitation at the Medical Faculty of Ankara University, Turkey. A total of 100 outpatients diagnosed with knee and/or hip OA according to the American College of Rheumatology criteria for the classification and reporting of OA of the knee and hip were included in the study [9, 10]. Patients with concomitant uncontrolled or severe systemic diseases, any recent surgery that might affect their health status, or any cognitive impairment that would preclude participation in the study were excluded. The study was approved by the Ethical Committee of the Faculty of Medicine, Ankara University. All patients gave informed consent, and the study was carried out in compliance with the Declaration of Helsinki.
Assessment
The assessment included the administration of the ICF Comprehensive Core Set for OA, the Western Ontario and McMaster Universities Index of Osteoarthritis (WOMAC, V3.1) [11] and the Short Form-36 Health Survey v1.0 (SF-36®) [12]. The ICF Core Set was scored for all patients by physical and rehabilitation medicine specialists who had been trained in a structured one-day workshop organized by the researchers of the WHO ICF Collaborating Center at the Ludwig-Maximilian University in Munich. These specialists had taken part in the International Validation Studies of Core Sets and were experienced in the scoring system, having collected data from many patients with various musculoskeletal conditions such as osteoarthritis, low back pain, rheumatoid arthritis and chronic widespread pain. The WOMAC and SF-36 questionnaires were self-completed by the patients or, for patients who were illiterate, administered by the assessors. Sociodemographic data (age, gender, educational level, employment status) and clinical data (disease duration, location, comorbidities) were also recorded.
The ICF Comprehensive Core Set for OA consists of 13 categories from the component Body Functions (BF), 6 from the component Body Structures (BS), 19 from the component Activities and Participation (AP), and 17 from the component Environmental Factors (EF) [6]. A generic qualifier scale was used to evaluate the extent of a patient's problem in each of the ICF categories. The qualifier scale of the components BF, BS and AP has five response levels, ranging from 0 to 4: no/mild/moderate/severe/complete problem. The qualifier scale of the component EF has nine response levels, ranging from -4 to +4. A specific environmental factor can be a barrier (-1 to -4) or a facilitator (+1 to +4), or can have no influence (0) on the patient's life. If a factor has an influence, the extent of the influence (either positive or negative) can be coded as mild, moderate, severe, or complete. In addition, there are two other response options, "8 (not specified)" and "9 (not applicable)", for all ICF categories.
The WOMAC is a disease-specific index developed for OA of the knee or hip [11]. It consists of 24 items in three subscales: pain (5 items), stiffness (2 items), and physical function (17 items). Every question has five response options in Likert form ('0' none, '1' mild, '2' moderate, '3' severe and '4' extreme). In this study, the validated Turkish version of the WOMAC [13] was used, and the score for each WOMAC subscale was expressed on a 0-10 scale after a normalization procedure [11, 14]. Summing the three equally weighted subscales provided a single WOMAC total score ranging from 0 to 30.
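As an illustration of how subscale scores on a 0-10 scale and a total on a 0-30 scale can arise, the following minimal sketch assumes a simple linear rescaling of each raw subscale sum by its maximum possible value. The text does not spell out the normalization procedure of [11, 14], so this rescaling, together with the function and variable names, is an assumption made for illustration only.

```python
# Items per WOMAC subscale as stated in the text; every item is scored 0-4.
WOMAC_ITEMS = {"pain": 5, "stiffness": 2, "function": 17}
MAX_ITEM_SCORE = 4


def normalize_subscale(name, item_scores):
    """Rescale a raw subscale sum to 0-10 (assumed linear normalization)."""
    expected = WOMAC_ITEMS[name]
    if len(item_scores) != expected:
        raise ValueError(f"{name} expects {expected} items, got {len(item_scores)}")
    if any(score not in range(MAX_ITEM_SCORE + 1) for score in item_scores):
        raise ValueError("item scores must be integers from 0 to 4")
    raw_max = expected * MAX_ITEM_SCORE
    return 10.0 * sum(item_scores) / raw_max


def womac_total(responses):
    """Sum the three equally weighted normalized subscales (range 0-30)."""
    return sum(normalize_subscale(name, responses[name]) for name in WOMAC_ITEMS)
```

Under this assumption, each subscale contributes equally to the total regardless of how many items it contains.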
Health-related quality of life (HRQoL) was evaluated using the SF-36 questionnaire [15]. It contains 36 items that measure perceived health in 8 scales, namely, physical functioning (PF), role-physical (RP), bodily pain (BP), general health (GH), vitality (V), social functioning (SF), role-emotional (RE), and mental health (MH), with higher scores (range 0-100) reflecting better perceived health. Additionally, two summary scores can be obtained: the Physical Component Summary (PCS) score and the Mental Component Summary (MCS) score. The Turkish version of the SF-36 was used in the study [16].
Internal construct validity
The internal construct validity of the items of the ICF Core Set for OA, proposed as a scale for each ICF component, was tested by Rasch analysis. This is the formal testing of an assessment or a scale against a mathematical measurement model which defines how interval scale measurement can be derived from ordinal questionnaires [17–19]. This model assumes that the probability of a given respondent affirming an item is a logistic function of the relative distance between the item difficulty and the person ability on a linear scale. Thus, for example, in the case of mobility, the probability of a person affirming a (dichotomous) item about mobility is a logistic function of the relative distance between the level of mobility expressed by the item (the item difficulty) and the level of mobility of the person (the person ability). The model estimates person ability independently of the distribution of the population, and item difficulty independently of the person ability [20]. Masters' partial credit model (PCM), which is an extension of the dichotomous Rasch model to polytomous items (items with more than two response categories), was used in this study [21].
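The logistic relationship described above can be sketched as follows. This is a minimal illustration of the standard dichotomous Rasch and partial credit model formulas, not the estimation routine of any particular software; the function names and the representation of an item by a list of threshold locations are assumptions.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of affirming a dichotomous item (both arguments in logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def pcm_probabilities(ability, thresholds):
    """Category probabilities for one polytomous item under the PCM.

    `thresholds` holds the item's m threshold locations in logits; the
    result has m + 1 entries, one per response category 0..m.
    """
    # The exponent for category x is the sum of (ability - threshold_k), k <= x.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (ability - tau))
    peak = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]
```

With a single threshold the PCM reduces to the dichotomous model, and when the person ability equals a threshold location the two adjacent categories are equally probable.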
The process of Rasch analysis is iterative: certain pathways are applied to each scale where an item set is intended to be summated to give a score. Initially, where polytomous items are involved, the response categories are examined for correct ordering. This is reflected by successive thresholds (a threshold being the point at which the probabilities of responding in two adjacent categories are equal) demonstrating increasing levels of the construct being measured. Inconsistent use of the response options by respondents can result in disordered thresholds and, in these circumstances, collapsing categories usually improves overall fit to the model [22].
Following this, a range of tests are undertaken with respect to local dependency, probabilistic ordering (fit), unidimensionality and differential item functioning (DIF). The assumption of local independence implies that when the 'Rasch factor' has been extracted, that is, the main scale, there should be no leftover patterns in the residuals [23]. When the residual correlation of a pair of items exceeds the average residual correlation by 0.20 or more, this is indicative of local response dependency between the items [24]. Such dependency inflates reliability as the items are, in practice, near replications of each other. This issue is dealt with by creating testlets, that is, summary scores from the items that are locally dependent, which are then treated as one new larger variable [25]. Testlets were created by considering the content of the items (what they assess) and their response dependency; mostly clinically relevant items were found to be locally dependent.
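The residual correlation criterion can be sketched as follows, assuming that standardized person-item residuals are already available for each item (for example, exported from the Rasch software). The data layout, the function names and the use of Pearson correlations over all item pairs are assumptions for illustration.

```python
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def flag_local_dependency(residuals, margin=0.20):
    """Return item pairs whose residual correlation is at least `margin`
    above the average residual correlation.

    `residuals` maps an item name to its list of residuals, one per person.
    """
    pairs = {
        (a, b): pearson(residuals[a], residuals[b])
        for a, b in combinations(sorted(residuals), 2)
    }
    cutoff = sum(pairs.values()) / len(pairs) + margin
    return [(a, b, r) for (a, b), r in pairs.items() if r >= cutoff]
```

Pairs flagged in this way would be candidates for combination into a testlet, subject to the content considerations described above.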
A variety of fit statistics are used to test if the data conform to Rasch model expectations. In the RUMM2030 programme [26], two are item-person interaction statistics transformed to approximate a z score, representing a standardized normal distribution. If the items and persons fit the model, these interaction statistics would have a mean of approximately zero and a standard deviation (SD) of one. A third summary statistic is a summed chi-square within groups defined by their position on the trait, where the overall chi-square for items is summed to give the item-trait interaction statistic, testing the property of invariance across the trait. A significant chi-square indicates that the hierarchical ordering of the items varies across the trait, so compromising the required property of invariance. The significance of all chi-square fit statistics is Bonferroni adjusted to account for multiple testing [27]. In addition to these overall summary fit statistics, individual person- and item-fit statistics are presented as (a) residuals (a summation of individual person and item deviations), (b) a chi-square statistic, and (c) an analysis of variance (ANOVA) with the residuals summed across the main effects of class intervals. Fit residuals within the range ± 2.5 are deemed to be adequate. These are summated within ability groups to provide the basis of the ANOVA.
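The two item-level screening rules above (fit residuals within ± 2.5 and a Bonferroni-adjusted chi-square probability) can be sketched as follows. The sketch assumes that the number of tests equals the number of items and that fit residuals and chi-square probabilities have already been obtained from the Rasch software; the function names and data layout are hypothetical.

```python
def bonferroni_alpha(alpha, n_tests):
    """Significance level per test after Bonferroni adjustment."""
    return alpha / n_tests


def misfitting_items(fit, alpha=0.05, residual_limit=2.5):
    """Return names of items that fail either screening rule.

    `fit` maps an item name to a (fit_residual, chi_square_p_value) pair.
    Assumption: one chi-square test per item, so n_tests = len(fit).
    """
    threshold = bonferroni_alpha(alpha, len(fit))
    return sorted(
        item
        for item, (residual, p_value) in fit.items()
        if abs(residual) > residual_limit or p_value < threshold
    )
```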
A formal test of the assumption of unidimensionality is undertaken by performing a principal component analysis (PCA) of the residuals. Items with the highest positive and negative correlations on the first residual factor are used to construct two smaller scales, anchored to the item difficulties of the main analysis [28]. The person estimates derived from these two subsets of items are then contrasted for each individual by a t test. A significant difference would be expected to occur by chance in 5% of the cases. Consequently, the percentage of tests outside the range ± 1.96 is reported, together with a 95% binomial confidence interval. This interval should overlap 5% for a non-significant finding to confirm unidimensionality.
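The final step of this procedure can be sketched as follows, assuming that person estimates and their standard errors from the two item subsets are already available. The text does not state which binomial interval is used, so the normal-approximation interval, the reading of "overlap" as a lower bound at or below 5%, and all names are assumptions.

```python
import math


def proportion_significant_t(est_a, se_a, est_b, se_b, critical=1.96):
    """Proportion of persons whose two subset estimates differ significantly,
    with a normal-approximation 95% binomial confidence interval."""
    significant = 0
    for a, ea, b, eb in zip(est_a, se_a, est_b, se_b):
        # Independent t test on one person's pair of estimates.
        t = (a - b) / math.sqrt(ea ** 2 + eb ** 2)
        if abs(t) > critical:
            significant += 1
    n = len(est_a)
    proportion = significant / n
    half_width = critical * math.sqrt(proportion * (1 - proportion) / n)
    lower = max(0.0, proportion - half_width)
    upper = min(1.0, proportion + half_width)
    return proportion, lower, upper


def supports_unidimensionality(lower_bound, expected=0.05):
    """True when the interval's lower bound does not exceed the 5% expected by chance."""
    return lower_bound <= expected
```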
Items are also tested for DIF. In the framework of Rasch measurement, the scale should be free of item bias or DIF [29]. DIF occurs when different groups within the sample (e.g., males and females), despite equal levels of the underlying characteristic being measured, respond in a different manner to an individual item. For example, men and women with equal levels of mobility may respond systematically differently to a mobility item such as walking 100 metres unaided. DIF can be detected both statistically and graphically. In the current analysis, DIF was tested by age, gender, years of education, and disease duration. The statistical test for DIF is an ANOVA with two main effects: the contextual factor under investigation (e.g., gender) and ability level. A significant main effect for gender identifies uniform DIF, where the difference between groups is constant across the trait. A significant interaction between ability level and the contextual factor identifies non-uniform DIF, where the difference between groups varies across the trait.
For item sets which constitute a potential new scale, all the above Rasch assumptions are considered together to determine which items are most suitable for retention. Poor items are removed, and the data are refitted to the model until an adequate locally independent, unidimensional scale, free of DIF, is achieved. Finally, the targeting and the Person Separation Index (PSI) reliability of the scale are considered. A scale is perfectly targeted when the mean of the persons is the same as the mean of the items on their shared common metric. The PSI is an estimate of internal consistency reliability and can be interpreted in much the same way as Cronbach's alpha, but uses the linear person estimates from the Rasch model in place of the ordinal raw scores [30].
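Targeting and the PSI can be sketched as follows, assuming person location estimates (in logits) and their standard errors are available. The PSI is written here in its commonly cited form, the proportion of the variance of the person estimates that is not attributable to measurement error; the use of the sample variance and the function names are assumptions, and the software may differ in detail.

```python
from statistics import mean, variance


def targeting_offset(person_locations, item_locations):
    """Mean person location minus mean item location (0 means perfect targeting)."""
    return mean(person_locations) - mean(item_locations)


def person_separation_index(person_locations, standard_errors):
    """Share of the variance in person estimates not attributable to error."""
    observed = variance(person_locations)  # sample variance of the estimates
    error = mean(se ** 2 for se in standard_errors)  # mean error variance
    return (observed - error) / observed
```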
Reliability
The reliability of the ICF components or proposed scales was initially tested by internal consistency, which is an estimate of the degree to which the constituent items of a scale are interrelated and is assessed by Cronbach's alpha coefficient [31]. Usually a reliability of 0.70 is required for analysis at the group level, and values of 0.85 and higher for individual use [32]. Subsequently, reliability was further tested by the PSI from the Rasch analysis. Where the distribution is normal, these two reliability indicators are equivalent, but where distributions are skewed, the PSI gives a more accurate indication of internal consistency reliability.
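Cronbach's alpha can be computed from item-level scores as in the following minimal sketch, which uses the standard formula based on the item variances and the variance of the total score. The data layout (one list of scores per item, with respondents in the same order) and the function name are assumptions; complete data are assumed.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a scale.

    `item_scores` is a list with one entry per item; each entry lists that
    item's scores for all respondents, in the same respondent order.
    """
    k = len(item_scores)
    totals = [sum(person) for person in zip(*item_scores)]
    item_variance_sum = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))
```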
External construct validity
External construct validity was determined by testing for expected associations of the ICF components or proposed scales with the WOMAC and SF-36, that is, by assessing convergent construct validity [33]. In this study, the degree of association was analyzed using Spearman's correlation coefficient.
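Spearman's coefficient is the Pearson correlation of the ranked scores, which can be sketched as follows. Tied values are given their average rank, a common convention that is assumed here; the function names are hypothetical, and in the study the computation was carried out in the statistical software.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation between two equally long score lists."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    ss_x = sum((a - mean_x) ** 2 for a in rx)
    ss_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (ss_x * ss_y) ** 0.5
```

Because higher WOMAC scores indicate more severe symptoms whereas higher SF-36 scores indicate better perceived health, the sign of an expected association depends on the direction of the instruments being compared.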
Sample size and statistical software
For the Rasch analysis, a sample size of 100 patients allows item difficulty to be estimated, with α of 0.05, to within ± 0.5 logits [34]. Bonferroni correction was applied to both fit and DIF statistics because of multiple testing [27]. Statistical analysis was undertaken with SPSS 11.5, and Rasch analysis with the RUMM2030 package [26].