An initial application of computerized adaptive testing (CAT) for measuring disability in patients with low back pain
© Elhan et al. 2008
Received: 12 May 2008
Accepted: 18 December 2008
Published: 18 December 2008
Recent approaches to outcome measurement involving Computerized Adaptive Testing (CAT) offer an approach for measuring disability in low back pain (LBP) in a way that can reduce the burden upon patient and professional. The aim of this study was to explore the potential of CAT in LBP for measuring disability as defined in the International Classification of Functioning, Disability and Health (ICF) which includes impairments, activity limitation, and participation restriction.
A total of 266 patients with low back pain answered questions from a range of widely used questionnaires. An exploratory factor analysis (EFA) was used to identify disability dimensions, which were then subjected to Rasch analysis. Reliability was tested by internal consistency and the person separation index (PSI). Discriminant validity of disability levels was evaluated by the Spearman correlation coefficient (r), the intraclass correlation coefficient [ICC(2,1)] and the Bland-Altman approach. A CAT was developed for each dimension, and the results were checked against simulated and real applications from a further 133 patients.
Factor analytic techniques identified two dimensions named "body functions" and "activity-participation". After deletion of some items for failure to fit the Rasch model, the remaining items were mostly free of Differential Item Functioning (DIF) for age and gender. Reliability exceeded 0.90 for both dimensions. The disability levels generated using all items and those obtained from the real CAT application were highly correlated (i.e. > 0.97 for both dimensions). On average, 19 and 14 items were needed to estimate the precise disability levels using the initial CAT for the first and second dimensions, respectively. However, allowing a marginal increase in the standard error of the estimate across successive iterations substantially reduced the number of items required to make an estimate.
Using a combination approach of EFA and Rasch analysis this study has shown that it is possible to calibrate items onto a single metric in a way that can be used to provide the basis of a CAT application. Thus there is an opportunity to obtain a wide variety of information to evaluate the biopsychosocial model in its more complex forms, without necessarily increasing the burden of information collection for patients.
Low back pain (LBP) is a frequently reported musculoskeletal problem causing much disability. The economic burden of LBP on society is great due to both its high prevalence and chronicity. The main goals in the management of LBP are to control pain, maintain and improve function and consequently prevent disability. Thus the assessment of disability is essential for both planning and monitoring therapeutic interventions. There are many questionnaires available to assess disability for outcome measurement in LBP [4, 5] and, most recently, 'core sets' of items have been proposed based upon the International Classification of Functioning, Disability and Health (ICF).
The ICF, developed by the World Health Organization (WHO), aims to provide a unified and standard language and framework for the description of health and health-related conditions. It describes a model which systematically classifies the health and health-related domains into two components: 1) body functions and structures; 2) activities and participation. According to this model, functioning is an umbrella term encompassing all body functions, activities and participation; similarly disability is an umbrella term including both impairments and activity limitations or participation restriction. Impairments catalogue the problems in body structure (e.g. displacement of vertebral disks) or body functions (e.g. pain in back) such as a significant deviation or loss. Activity is defined as the execution of a task or action by an individual whereas participation is involvement in a life situation. Activity limitations are difficulties an individual may have in executing such activities. Participation restrictions are problems an individual may experience in involvement in life situations. The ICF also lists environmental factors that interact with functioning and disability as contextual factors. The unit of classification in the ICF is called a 'category'. Within each component, there are various individual categories arranged in a stem/branch/leaf scheme. In order to capture the integration of various aspects of functioning, the ICF uses a biopsychosocial approach including biological, individual and social perspectives. Impairments such as displacement of vertebral disks or pain in back can cause limitations in individual activities such as dressing or walking, and/or restriction in societal participation such as work or leisure. These domains may be further mediated by environmental factors such as terrain, or the provision of assistive devices.
Clinicians and other health professionals could be faced with using a substantive range of outcome measures if even part of the ICF model is to be routinely implemented. This potentially presents a considerable burden to patients, as well as a formidable administrative burden to hard pressed health care professionals. One solution for this problem is to make use of a relatively new approach to outcome measurement, built upon existing work, such that patient and professional burden can be reduced or, where necessary, information collected can be increased at no extra burden. The mechanism by which this solution can be obtained is to implement a Computerized Adaptive Testing (CAT) approach for measuring disability in LBP. CAT, an outcome measurement approach for comprehensive and precise assessment of patient-related outcomes, is being used with increasing frequency in the health care field [8–15]. The approach uses a computer to administer test items to patients. In doing so, using a previously calibrated set of items called an item bank, it selects the most informative items for each individual patient according to their level on the construct being measured . This avoids the administration of a large number of questionnaire items by selecting items close to the person's ability level, effectively constructing a "tailored test" for each individual. The CAT approach allows for the collection of precise outcome information that can simply be applied in both clinical and research settings [8, 16–18].
Thus the CAT approach depends on a calibrated set of item difficulties, the calibrations of which are derived from a particular Item Response Theory (IRT) model [17, 19–21]. This calibration and the associated item information derived are the most important elements in CAT applications [10, 22]. IRT models are statistical models that describe the probability of choosing each response on a questionnaire item as a function of the construct (latent trait) being measured [16, 23]. With IRT, item calibrations and person estimates are located on the same metric. As such, the items are inherently linked to the metric both in terms of ability of the person and the amount of information that an item provides at each point along the trait. This property supports an efficient selection of items during a CAT administration. Thus the combination of IRT and CAT creates considerable flexibility in administering tests in an adaptive approach for each patient .
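Because items and persons share one metric, item selection can be reduced to a search for the item that is most informative at the current estimate. The following is a minimal sketch of that idea, assuming dichotomous Rasch items for simplicity; the function names and the item difficulties are hypothetical and are not taken from the study or its software.

```python
import math

def rasch_probability(theta, difficulty):
    """Probability of affirming a dichotomous Rasch item at ability theta."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def item_information(theta, difficulty):
    """Fisher information of a dichotomous Rasch item: p * (1 - p)."""
    p = rasch_probability(theta, difficulty)
    return p * (1.0 - p)

def most_informative_item(theta, difficulties, administered):
    """Index of the not-yet-administered item with the largest information."""
    candidates = [i for i in range(len(difficulties)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, difficulties[i]))

# Hypothetical item difficulties in logits (not values from the study).
bank = [-2.0, -0.5, 0.3, 1.1, 2.4]
next_item = most_informative_item(theta=0.4, difficulties=bank, administered={2})
```

Under these assumptions information peaks where the item difficulty equals the person estimate, so the sketch picks the unused item whose difficulty lies closest to the current estimate.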
Several recent studies have reported the use of CAT in lumbar spine disorders. In the earliest study, Hart et al. developed a CAT assessing lumbar functional status in terms of the activities domain of the ICF. Similarly, in another study CAT was applied to measure self-care and mobility activities in an orthopaedic outpatient physical therapy setting. Most recently, Kopec et al. used a CAT program to measure 5 domains of health-related quality of life: daily activities, walking, handling objects, pain and feelings. To our knowledge, no study has yet reported the use of a CAT program assessing disability in LBP in a comprehensive manner as defined in the ICF.
Therefore the aim of this study was to explore the potential of CAT for measuring disability in patients with LBP based on the definition of disability in the ICF, which includes impairments, activity limitation, and participation restriction. In order to achieve this aim, item banks were developed from currently used questionnaires. The internal construct validity of each item bank was examined by testing the assumptions of unidimensionality, local independence and the absence of Differential Item Functioning (DIF) by age and gender, within the framework of the Rasch measurement model. CAT software was then developed to utilise the calibrated items from each item bank. Real and simulated CAT applications were applied and the correlation between the disability levels generated by CAT, and those generated from the responses to all items in the item bank, was determined. Finally, convergent validity between the CAT derived estimates and the scores from each original questionnaire was examined.
Data was collected in the Department of Physical Medicine and Rehabilitation at the Medical Faculty of Ankara University, Turkey, from February 2007 to November 2007. A total of 399 outpatients with low back pain were included in the study. Patients with non-mechanical back pain resulting from inflammatory, infectious, malignant or visceral diseases were excluded. In the first stage of the study 266 patients answered all the questions in the total item set obtained from the selected questionnaires (given below). After development of the item banks, the second stage involved another group of 133 patients completing the item banks (items determined after Rasch analysis) under a CAT version and by 'paper and pencil'.
In all cases, questionnaires were either self-completed by literate patients, or where patients were illiterate, the questionnaires were administered by one of the authors (DÖ). At the CAT stage, the same author also helped the patients who were unfamiliar with computer use. All patients gave informed consent to take part in the study and the study was carried out in compliance with Helsinki Declaration.
Initially, the contents of both generic and specific questionnaires commonly used for outcome measurement in LBP were reviewed. The candidate item set to be used as an item bank in CAT was designed to be applicable to patients with a spectrum of LBP problems and to represent the ICF components of disability and the ICF core set for LBP. Another requirement was the existence of a validated Turkish version of the outcome measure to be selected. After considering these requirements, 4 questionnaires were selected: the Oswestry Disability Index (ODI), the Roland Morris Low Back Pain Disability Questionnaire (RDQ), the World Health Organization Disability Assessment Schedule (WHODAS II), and the Nottingham Health Profile (NHP).
The WHODAS II was developed by the World Health Organisation to assess functioning and disability . Based on the ICF model, it is a 36-item, generic, multidimensional questionnaire which is used for measuring the levels of disability in terms of activities and participation. It includes six domains: understanding and communicating (6 items), getting around (5 items), self care (4 items), getting along with others (5 items), household and work activities (8 items), and participation in society (8 items). It has a 5-point rating scale on all items in which "1" indicates no difficulty and "5" indicates extreme difficulty or inability to perform the activity. Raw scores are transformed into standardized scores. The total score and subscale scores range between 0–100, with higher scores reflecting greater disability. A previously adapted Turkish version of the WHODAS II instrument was used .
The Oswestry Disability Index (ODI) is a self-completed questionnaire designed for assessing the degree of functional limitation and pain in patients with LBP . It includes 10 items (pain intensity, personal care, lifting, walking, sitting, standing, sleeping, sex life, social life, and travelling), each of which has 6 ordinal responses. The scale has a total score ranging between 0 and 100 with a high score showing higher disability. The Turkish adaptation was used in this study .
The Roland & Morris Disability Questionnaire (RDQ) is a self-completed questionnaire designed to assess physical disability due to LBP. It includes 24 items, each with a dichotomous response category of yes or no. The scale has a total score ranging between 0 and 24 with a high score showing higher disability. The Turkish version of the RDQ was used in this study.
The Nottingham Health Profile (NHP) is a generic health status measure developed to record the perceived distress of patients in physical, emotional and social domains . It comprises 38 statements (answered 'yes' or 'no') in six sections: physical mobility (8 items), pain (8 items), sleep (5 items), emotional reactions (9 items), social isolation (5 items) and energy level (3 items). The Turkish version of NHP was used . In this version the score on each section of the NHP is the percentage of items affirmed by the respondent (that is, the number of 'yes' responses multiplied by 100 and divided by the number of items in that section). Possible scores could range from 0 to 100, with a higher score indicating greater distress.
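The section scoring rule described above is simple arithmetic; a minimal sketch follows. The function name and the example answers are hypothetical.

```python
def nhp_section_score(responses):
    """Percentage of items affirmed: number of 'yes' answers * 100 / items."""
    if not responses:
        raise ValueError("a section needs at least one item")
    return sum(1 for r in responses if r) * 100.0 / len(responses)

# Hypothetical answers to an 8-item section (True = 'yes').
pain_score = nhp_section_score(
    [True, False, True, True, False, False, False, False]
)
```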
As seen above, response options and corresponding scores of items across the scales were different. While the items of WHODAS II and ODI were polytomous, those of RDQ and NHP were dichotomous.
The contents of these questionnaires were examined by the investigators regarding their links to the categories of ICF components [6, 33] and also the ICF LBP core set . This examination revealed that some of the items had links with categories covered in both "body functions" and "activities and participation." Another issue at this stage was that some ICF core set categories from the body functions component (mobility and stability of joint functions, muscle power and muscle tone), and one category from the activities and participation component (toileting), were not covered in the contents of the questionnaires. However, as uncovered body function categories require a physical examination, it was impossible to include them in a self-report questionnaire. Regarding the toileting activity, which was the only "activities and participation" category missing, the investigators decided that it was not an essential deficit as most of the components of toileting activity such as sitting, rising from sitting position and dressing were already covered in other items. Furthermore none of the other questionnaires used in LBP were assessing toileting activity. Thus the four chosen scales gave 108 items as candidate items for the item bank.
The 108 items were submitted to an exploratory factor analysis (EFA) for categorical data using weighted least square methods  to investigate the dimensionality of the item set. Model fit was evaluated using the root-mean-square error of approximation (RMSEA) that accounts for model parsimony. RMSEA values < 0.08 suggest adequate fit; values < 0.05 indicate good fit .
When more than one dimension was found according to the results of EFA, separate item sets were constructed and named. Items with factor loadings below 0.40 were eliminated from the item set(s). After the determination of the dimensions of the total item set by EFA, the next step was to calibrate these items onto their appropriate dimensions using an IRT model.
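In its partial credit model (PCM) form, the Rasch model can be written as

$$\ln\left(\frac{P_{nik}}{P_{ni(k-1)}}\right) = \theta_n - b_{ik}$$

where $P_{nik}$ is the probability of person $n$ affirming category $k$ in item $i$, compared with an adjacent category ($k-1$); $\theta_n$ is person ability; and $b_{ik}$ is the difficulty of the $k$th threshold, which is the probabilistic midpoint (i.e., 50/50) between any 2 adjacent categories in item $i$.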
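To make the threshold definition concrete, the sketch below computes the category probabilities of a single polytomous item under the standard partial credit formulation, in which the log-odds of category k against category k-1 equal the person ability minus the kth threshold. The threshold values are hypothetical, and the function is an illustration rather than the routine used by any Rasch package.

```python
import math

def pcm_probabilities(theta, thresholds):
    """Category probabilities (0..m) for one partial credit item."""
    # Cumulative sums of (theta - b_k); category 0 has an empty sum of 0.
    cumulative = [0.0]
    for b in thresholds:
        cumulative.append(cumulative[-1] + (theta - b))
    weights = [math.exp(c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical thresholds (logits) for a 4-category item.
probs = pcm_probabilities(theta=0.0, thresholds=[-1.0, 0.0, 1.5])
```

At a person location equal to a threshold, the two adjacent categories are equally likely, which is the 50/50 midpoint property; here that holds for categories 1 and 2 because the second threshold is 0.0.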
The resulting Rasch analysis, as with all versions of the Rasch model, is mostly concerned with testing the underlying assumptions of the model; that of the probabilistic relationship between items, unidimensionality and local independence . In addition, item bias or differential item functioning can be examined.
The PCM is a unidimensional measurement model, therefore the assumption is that the items summed together form a unidimensional scale. There are various ways to test this assumption, and these can be thought of as a series of indicators to support the assumption. Rasch programs usually provide a principal component analysis of the residuals. The absence of any meaningful pattern in the residuals will also be deemed to support the assumption of unidimensionality. A test for unidimensionality, proposed by Smith EV , takes the patterning of items in the residuals, examining the correlation between items and the first residual factor, and uses these patterns to define two subsets of items (i.e., the positively and negatively correlated items). These two sets of items are then used to make separate person estimates, and, using an independent t-test for the difference in these estimates for each person, the percentage of such tests outside the range -1.96 to 1.96 should not exceed 5%. A confidence interval for a binomial test of proportions is calculated for the proportion of observed number of significant tests, and the lower bound should overlap the 5% expected value for the scale to be unidimensional. Given that the differences in estimates derived from the two subsets of items are normally distributed, this approach is robust enough to detect multidimensionality  and appears to give a test of strict unidimensionality, as opposed to essential unidimensionality . In the latter case a dominant factor occurs, and although other factors exist, they are not deemed to compromise measurement.
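The t-test procedure above can be sketched as follows. This is an illustration under stated assumptions: each person has an estimate and a standard error from each item subset, the t statistic is the difference divided by the pooled standard error, and the confidence interval uses a normal approximation to the binomial. All names and data are hypothetical.

```python
import math

def proportion_significant(est_a, est_b, se_a, se_b, critical=1.96):
    """Share of persons whose two subset estimates differ significantly."""
    flagged = 0
    for a, b, sa, sb in zip(est_a, est_b, se_a, se_b):
        t = (a - b) / math.sqrt(sa ** 2 + sb ** 2)
        if abs(t) > critical:
            flagged += 1
    return flagged / len(est_a)

def binomial_ci(proportion, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    half = z * math.sqrt(proportion * (1.0 - proportion) / n)
    return max(0.0, proportion - half), min(1.0, proportion + half)

def supports_unidimensionality(proportion, n, expected=0.05):
    """True when the lower confidence bound does not exceed the expected 5%."""
    lower, _ = binomial_ci(proportion, n)
    return lower <= expected

# Hypothetical estimates from the two item subsets for four persons.
share = proportion_significant(
    [0.0, 1.0, 2.0, -1.0], [0.1, 1.2, 0.0, -1.1],
    [0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5],
)
```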
The assumption of local independence implies that when the 'Rasch factor' has been extracted, that is, the main scale, there should be no leftover patterns in the residuals. This assumption was tested by performing a PCA of the residuals obtained from the PCM. If a pair of items had a residual correlation of 0.30 or more, the item of the pair that showed the higher accumulated residual correlation with the remaining items was eliminated.
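The screening step for locally dependent pairs can be sketched as below, assuming the standardised residuals are already available per item and that no item has constant residuals. The data layout, function names and residual values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def locally_dependent_pairs(residuals, cutoff=0.30):
    """Item pairs whose residual correlation reaches the cutoff.

    residuals maps an item name to its list of per-person residuals.
    """
    names = sorted(residuals)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(residuals[a], residuals[b])
            if r >= cutoff:
                pairs.append((a, b, r))
    return pairs

# Hypothetical residuals for three items and four persons.
flagged = locally_dependent_pairs({
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [1.0, 2.0, 3.0, 5.0],
    "C": [4.0, 1.0, 3.0, 2.0],
})
```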
Before evaluation of item fit, where polytomous items are involved, the response categories should be examined for correct ordering. This involves the examination of the threshold pattern, the threshold being the transition point between adjacent categories. This ordering of thresholds is graphically demonstrated in the category probability curves by using the RUMM2020 software . For an item with an appropriate ordering of thresholds each response option would demonstrate the highest probability of endorsement at a specific range of the scale, with successive thresholds found at increasing levels of the construct being measured. One of the most common sources of item misfit concerns respondents' inconsistent use of these response options. This results in what is known as disordered thresholds and usually, although not always, collapsing of categories where disordered thresholds occur improves overall fit to the model .
In the current analysis, individual item fit statistics and individual person fit statistics are presented, both as residuals and as chi-square statistics. The individual item fit statistic is based on the standardised residuals (differences between the observed and expected responses divided by the square root of the variance, calculated for each patient for a given item). To obtain an overall statistic for an item, the standardised residuals are squared and summed over the patients. The individual item fit statistic is calculated by transforming this overall statistic to make it more nearly approximate a standard normal deviate under the hypothesis that the data fit the model. Residuals between ± 2.5 are deemed to indicate adequate fit to the model; in that case it is concluded that the deviations between the responses and the model are no more than random errors. A person fit statistic is constructed for each person in a way similar to that of each item. A chi-square test is also available for each item. The chi-square statistic compares the difference in observed values with expected values across groups representing different ability levels (called class intervals) across the trait to be measured. Consequently, for a given item, several chi-squares are computed (the number of groups depends on sample size), and then these chi-square values are summed to give the overall chi-square for the item, with degrees of freedom being the number of groups minus 1. If the p value calculated from the overall chi-square is less than 0.05 (or the Bonferroni-adjusted value) then the item is deemed to misfit the model.
In addition to the individual fit statistics explained above, overall item fit statistics, overall person fit statistics and item-trait interaction statistics are presented. If the data accord with the model expectation, the mean of the overall item and the overall person fit statistics should be close to 0 and their standard deviation close to 1. A third summary fit statistic is the item-trait interaction statistic, reported as a chi-square, reflecting the property of invariance across the trait. This statistic sums the chi-squares for individual items across all items. A significant chi-square indicates that the hierarchical ordering of the items varies across the trait, compromising the required property of invariance. A wide variety of texts are available to help the reader understand fit and the other relevant topics discussed in this article [35, 45–48].
DIF, or item bias, can also affect fit to the model. This occurs when different groups within the sample (e.g., younger and older persons) respond in a different manner to an individual item, despite having equal levels of the underlying characteristic being measured. Therefore, this does not preclude a different score between younger and older persons, but rather indicates that, given the same level of, for example, pain, the expected score on any item should be the same, irrespective of age. Two types of DIF may be identified. One is where the group shows a consistent systematic difference in their responses to an item, across the whole range of the attribute being measured, which is referred to as uniform DIF . When there is non-uniformity in the differences between the groups (e.g., differences vary across levels of the attribute), then this is referred to as non-uniform DIF. The analysis of DIF has been widely used to examine cross-cultural validity, and readers can find an explanation of the approach, including the analysis of variance-based statistical analysis used in RUMM2020 software , in several recent reports [21, 49, 50]. In the current analysis, DIF was tested by age and gender.
Thus items to be entered into the item bank are required to satisfy Rasch model expectations, be free of DIF, and meet strict unidimensionality and local independence assumptions. This applies to the 'item bank' in total.
The internal consistency reliability of the item bank was estimated by the Person Separation Index (PSI). This is equivalent to Cronbach's alpha but has the linear transformation from the Rasch model substituted for the ordinal raw score.
Given the calibrated item bank, the next stage is to apply the CAT application. During this study we developed new CAT software, SmartCAT™ (v1.0), following the logic of Thissen and Mislevy.
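One common way to express a separation-type reliability is as the share of the observed variance in person estimates that is not measurement error. The sketch below assumes that formulation, with person estimates in logits and their standard errors as inputs. It is not necessarily the exact computation performed by RUMM2020, and the numbers are hypothetical.

```python
def person_separation_index(estimates, standard_errors):
    """(observed variance - mean error variance) / observed variance."""
    n = len(estimates)
    mean = sum(estimates) / n
    observed_var = sum((e - mean) ** 2 for e in estimates) / (n - 1)
    error_var = sum(se ** 2 for se in standard_errors) / n
    return (observed_var - error_var) / observed_var

# Hypothetical person estimates (logits) and their standard errors.
psi = person_separation_index([-2.0, -1.0, 0.0, 1.0, 2.0], [0.5] * 5)
```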
CAT was applied in two ways: a simulated and a real CAT application. In the simulated CAT, responses for 10,000 patients derived from the RUMMss simulation program were taken to represent the responses the patients would have given, had the items been administered in the context of a CAT. These data were simulated to meet Rasch model expectations using the item difficulty estimates from the item bank. Patients' disability levels were normally distributed with a mean of 0 and standard deviation of 2. It was assumed that the mode of administration (i.e. paper and pencil, which gave estimates for the item bank, or the CAT application) would not substantially affect item responses when the CAT estimated the disability level (θS-CAT) and its SE for each patient. These estimates (θS-CAT) were compared with the disability levels (θS-PCM) generated by the simulation program using all the items based upon the original calibration using the PCM.
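A simulated CAT run of this general kind can be sketched as follows. The sketch makes several assumptions that do not come from the study: it uses dichotomous Rasch items rather than the PCM, a hypothetical 31-item bank, expected a posteriori (EAP) estimation on a grid, a stopping rule of SE ≤ 0.5, and 50 simulated patients. It illustrates the adaptive loop only and is not a reconstruction of SmartCAT or RUMMss.

```python
import math
import random

GRID = [x / 5.0 for x in range(-30, 31)]  # quadrature points, -6 to 6 logits

def p_yes(theta, b):
    """Rasch probability of a 'yes' for an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def eap_estimate(responses, prior_sd=2.0):
    """Posterior mean and SD of theta given (difficulty, answer) pairs."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * (t / prior_sd) ** 2)  # unnormalised normal prior
        for b, x in responses:
            p = p_yes(t, b)
            w *= p if x else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_cat(true_theta, bank, se_stop, rng):
    """Administer items until the SE target is met or the bank is exhausted."""
    responses, remaining = [], list(bank)
    theta, se = eap_estimate(responses)
    while remaining and se > se_stop:
        # For Rasch items the most informative item lies closest to theta.
        b = min(remaining, key=lambda d: abs(d - theta))
        remaining.remove(b)
        answer = rng.random() < p_yes(true_theta, b)  # simulated response
        responses.append((b, answer))
        theta, se = eap_estimate(responses)
    return theta, se, len(responses)

rng = random.Random(2008)
bank = [-3.0 + 0.2 * i for i in range(31)]  # hypothetical difficulties
true_thetas = [rng.gauss(0.0, 2.0) for _ in range(50)]  # N(0, SD 2)
results = [run_cat(t, bank, se_stop=0.5, rng=rng) for t in true_thetas]
mean_items = sum(r[2] for r in results) / len(results)
```

Relaxing `se_stop` shortens the test at the cost of precision, which is the trade-off between standard error and number of items administered.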
In the real-CAT application, 133 patients were asked to complete both a paper-and-pencil test of the full item bank, and the CAT version. Estimations from the real-CAT application (θR-CAT) were compared with the disability levels generated using the response to all items analyzed with a PCM (θR-PCM), with item difficulties anchored to the original calibration of 266 cases.
At the final stage, the estimates derived from the real CAT application (θR-CAT) were compared with those derived from all the original questionnaires, including subscale scores, in order to demonstrate a limited form of convergent validity .
To summarize the approach used in this study: questionnaires that had been adapted into the Turkish language were chosen to include the ICF components of disability and the ICF categories listed in the ICF core set for LBP. The dimensionality of the total item set was explored using EFA for categorical data and the psychometric properties of the resulting item set were then evaluated by the Rasch (PCM) model. The calibrations of the items which satisfied the model expectations then formed the item bank, which was subsequently included in the CAT process. The CAT process involved both simulated and real (i.e. patient completed) responses. A comparison was made between the simulated CAT (θS-CAT) and the original estimate provided by the simulation programme (θS-PCM). A further comparison was made between the disability levels estimated from the item bank (θR-PCM) and those generated using the real (observed) CAT (θR-CAT). In the last stage, a form of convergent validity between the real CAT derived estimates and the scores from the original questionnaires was also examined. The response burden of the CAT process in terms of the number of items was compared to the 'paper and pencil' approach.
For the Rasch analysis it is reported that a sample size of 266 patients will estimate item difficulty, with α of 0.05, to within ± 0.3 logits. This sample size is also sufficient to test for DIF where, at α of 0.05, a difference of 0.3 within the residuals can be detected for any 2 groups with β of 0.20. Bonferroni corrections are applied to both fit and DIF statistics due to the number of tests undertaken. A value of 0.05 is used throughout, and corrected for the number of tests. Convergent validity between the real CAT derived estimates and the scores from the original questionnaires, including the subscales, was tested by Spearman's correlation coefficient (r). The intraclass correlation coefficient [ICC (2,1)] and the Bland-Altman method were used for evaluating the agreement between PCM and CAT derived θ estimations.
Statistical analysis was undertaken with SPSS 11.5; exploratory factor analysis with the MPlus program; Rasch analysis with the RUMM2020 package; and the simulation with RUMMss. The CAT application used SmartCAT™ (v1.0).
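The Bland-Altman approach summarises agreement between two sets of estimates by the mean difference (bias) and the limits of agreement, conventionally the bias ± 1.96 standard deviations of the differences. A minimal sketch with hypothetical paired estimates follows.

```python
import math

def bland_altman_limits(x, y):
    """Bias and 95% limits of agreement for paired measurements."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    bias = sum(diffs) / n
    sd = math.sqrt(sum((d - bias) ** 2 for d in diffs) / (n - 1))
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical paired estimates in logits (e.g. full bank versus CAT).
bias, lower, upper = bland_altman_limits(
    [1.0, 2.0, 3.0, 4.0], [0.5, 2.5, 2.5, 4.5]
)
```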
A total of 266 patients with low back pain answered 108 items from the four original questionnaires. The mean age of the patients was 52.2 years (standard deviation (SD) 12.5), 16% were men, and patients had a mean complaint time of 8.24 years (minimum: 1 month; maximum: 40 years). Prior to detailed analysis, it was observed that few patients worked (13%) and only half of the group had an active sexual life (50%). Thus a total of 6 work and sexual life related items (5 from the WHODAS II and 1 from the ODI) were removed from the item set.
An Exploratory Factor Analysis (EFA) was conducted with the remaining 102 items. Due to highly negative correlations (< -0.99) with other items, three items were removed from the analysis and a new EFA was conducted with 99 items. This analysis produced a two-factor solution. When the items were examined regarding their links with the ICF categories, it was seen that items in the first dimension were related to pain, sleep, cognitive and emotional aspects of health, therefore this dimension was named as "body functions". The second dimension included items concerned with activities and participation (e.g., mobility, self-care activities, domestic life, social life), and was therefore named as "activity-participation". The factor loadings varied from 0.425 to 0.883 for the body functions and 0.413 to 0.935 for activity-participation. At this stage, none of the items loaded on both dimensions with a factor loading of 0.40 or above, but five items failed to load on either dimension, and so were removed from the item set. The RMSEA value for the two-factor solution was 0.087. Although this RMSEA value is a little high, it was concluded that the 40-item "body functions" set and the 54-item "activity-participation" set represented good starting points to create a unidimensional item bank for each construct.
Table: Fit of the "body functions" item bank to the Rasch model (after rescoring) (n = 266)
Item | Individual Item Fit Residual | Chi-Square Test Statistics
WHODAS II – 1.1. In the last 30 days, how much difficulty did you have in concentrating on doing something for ten minutes?
WHODAS II – 1.3. In the last 30 days, how much difficulty did you have in analyzing and finding solutions to problems in day to day life?
WHODAS II – 1.5. In the last 30 days, how much difficulty did you have in generally understanding what people say?
WHODAS II – 1.6. In the last 30 days, how much difficulty did you have in starting and maintaining a conversation?
WHODAS II – 4.1. In the last 30 days, how much difficulty did you have in dealing with people you do not know?
WHODAS II – 4.2. In the last 30 days, how much difficulty did you have in maintaining a friendship?
WHODAS II – 4.3. In the last 30 days, how much difficulty did you have in getting along with people who are close to you?
WHODAS II – 4.4. In the last 30 days, how much difficulty did you have in making new friends?
WHODAS II – 6.3. In the last 30 days, how much of a problem did you have living with dignity because of the attitudes and actions of others?
WHODAS II – 6.5. In the last 30 days, how much have you been emotionally affected by your health condition?
RDQ 13. My back is painful almost all of the time
RDQ 18. I sleep less well because of my back
RDQ 22. Because of back pain, I am more irritable and bad tempered with people than usual
NHP 1. I'm tired all the time
NHP 2. I have pain at night
NHP 4. I have unbearable pain
NHP 6. I've forgotten what it's like to enjoy myself
NHP 7. I'm feeling on edge
NHP 9. I feel lonely
NHP 13. I'm waking up in the early hours of the morning
NHP 15. I'm finding it hard to make contact with people
NHP 16. The days seem to drag
NHP 20. I lose my temper easily these days
NHP 21. I feel there is nobody that I am close to
NHP 22. I lie awake for most of the night
NHP 23. I feel as if I'm losing control
NHP 28. I'm in constant pain
NHP 29. It takes me a long time to get to sleep
NHP 30. I feel I am a burden to people
NHP 31. Worry is keeping me awake at night
NHP 32. I feel that life is not worth living
NHP 34. I'm finding it hard to get along with people
NHP 37. I wake up feeling depressed
Finally, using the PCA of residuals obtained from PCM, taking the highest positively and negatively correlated items to the first residual factor to make two subsets, no significant difference in person estimates (t = 6.8%; CI 4.2%–9.4%) was found between the two subsets, thus supporting the unidimensionality of the item bank. When the assumption of local independence was examined, there was no pair of items which had a residual correlation of 0.30 or more.
Starting with 54 items, many polytomous items displayed disordered thresholds, necessitating collapsing of categories. Following this, items ODI 2, ODI 3 and ODI 5 did not fit the model (given a Bonferroni-adjusted fit level of 0.001) and were removed. Overall mean item fit residual was -0.239 (SD 1.411) and mean person fit residual was -0.412 (SD 0.959). Item-trait interaction was non-significant, suggesting the invariance of items (chi-square 204.46 (df = 153), p = 0.0035). The PSI was good (0.94), indicating the ability of the scale to differentiate more than 4 groups of patients.
Fit of "activity-participation" item bank to Rasch model (after rescoring) (n = 266)
Individual Item Fit Residual
Chi-Square Test Statistics
WHODAS II – 2.1. In the last 30 days, how much difficulty did you have in standing for long periods such as 30 minutes?
WHODAS II – 2.2. In the last 30 days, how much difficulty did you have in standing up from sitting down?
WHODAS II – 2.3. In the last 30 days, how much difficulty did you have in moving around inside your home?
WHODAS II – 2.4. In the last 30 days, how much difficulty did you have in getting out of your home?
WHODAS II – 2.5. In the last 30 days, how much difficulty did you have in walking a long distance such as a kilometer (or equivalent)?
WHODAS II – 3.1. In the last 30 days, how much difficulty did you have in washing your whole body?
WHODAS II – 3.2. In the last 30 days, how much difficulty did you have in getting dressed?
WHODAS II – 3.3. In the last 30 days, how much difficulty did you have in eating?
WHODAS II – 3.4. In the last 30 days, how much difficulty did you have in staying by yourself for a few days?
WHODAS II – 5.3. In the last 30 days, how much difficulty did you have in doing most important household tasks well?
WHODAS II – 5.4. In the last 30 days, how much difficulty did you have in getting all the household work done that you needed to do?
WHODAS II – 5.5. In the last 30 days, how much difficulty did you have in getting your household work done as quickly as needed?
WHODAS II – 6.1. In the last 30 days, how much of a problem did you have in joining in community activities (for example, festivities, religious or other activities) in the same way as anyone else can?
WHODAS II – 6.2. In the last 30 days, how much of a problem did you have because of barriers or hindrances in the world around you?
WHODAS II – 6.8. In the last 30 days, how much of a problem did you have in doing things by yourself for relaxation or pleasure?
ODI 4. Walking
ODI 6. Standing
ODI 9. Social Life
ODI 10. Travelling
RDQ 1. I stay at home most of the time because of my back
RDQ 2. I change position frequently to try to get my back comfortable
RDQ 3. I walk more slowly than usual because of my back
RDQ 4. Because of my back, I am not doing any jobs that I usually do around the house
RDQ 5. Because of my back, I use a handrail to get upstairs
RDQ 6. Because of my back, I lie down to rest more often
RDQ 7. Because of my back, I have to hold on to something to get out of an easy chair
RDQ 8. Because of my back, I try to get other people to do things for me
RDQ 9. I get dressed more slowly than usual because of my back
RDQ 10. I only stand up for short periods of time because of my back
RDQ 11. Because of my back, I try not to bend or kneel down
RDQ 12. I find it difficult to get out of a chair because of my back
RDQ 14. I find it difficult to turn over in bed because of my back
RDQ 16. I have trouble putting on my socks (or stockings) because of the pain in my back
RDQ 17. I can only walk short distances because of my back pain
RDQ 19. Because of my back pain, I get dressed with the help of someone else
RDQ 21. I avoid heavy jobs around the house because of my back
RDQ 23. Because of my back, I go upstairs more slowly than usual
RDQ 24. I stay in bed most of the time because of my back
NHP 8. I find it painful to change position
NHP 11. I find it hard to bend
NHP 12. Everything is an effort
NHP 18. I find it hard to reach for things
NHP 19. I'm in pain when I walk
NHP 24. I'm in pain when I'm standing
NHP 25. I find it hard to get dressed by myself
NHP 26. I soon run out of energy
NHP 27. I find it hard to stand for long (e.g., at the kitchen sink, waiting in a line)
NHP 36. I'm in pain when going up or down stairs
NHP 38. I'm in pain when I'm sitting
The unidimensionality of the item bank was supported by the individual t-test showing 7.5% of tests as significant (CI 4.9%–10.2%). When the assumption of local independence was examined, there was no pair of items having residual correlation of 0.30 or more.
Internal consistencies of the item banks were adequate at the dimension level with Cronbach's alphas of 0.91 and 0.93 and the PSI values of 0.91 and 0.94 for the first and second item banks, respectively.
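To make the link between these reliability values and the number of distinguishable patient groups concrete, the sketch below applies the commonly used separation and strata formulas, G = sqrt(R / (1 - R)) and strata = (4G + 1) / 3. Treating the PSI as the reliability R in these formulas is an assumption here; the paper does not say how its "more than 4 groups" statement was derived.

```python
import math

def strata_from_reliability(reliability):
    """Number of statistically distinct strata implied by a reliability value.

    Assumption: the PSI is plugged in as the reliability coefficient R.
    """
    separation = math.sqrt(reliability / (1.0 - reliability))
    return (4.0 * separation + 1.0) / 3.0

for psi in (0.91, 0.94):
    print(psi, round(strata_from_reliability(psi), 2))
```

Under these assumptions both item banks come out above four strata, in line with the interpretation given for the PSI.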
For the simulated CAT application, the 95% limits of agreement between θS-CAT and θS-PCM according to the Bland-Altman approach were -0.695 to 1.174 for the body functions dimension and -1.038 to 1.213 for the activity-participation dimension. Furthermore, 8566 of the 9056 and 9456 of the 9916 converged estimates were within the 95% limits of agreement for the first and second dimensions, respectively. θS-PCM and θS-CAT correlated well (r = 0.96 and ICC = 0.95 for the first dimension; r = 0.97 and ICC = 0.96 for the second). The initial CAT setting used a median of 19 items for the body functions dimension and 15 items for the activity-participation dimension.
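The Bland-Altman comparison can be sketched as follows. The theta values are hypothetical and only illustrate the calculation; the direction of the difference (CAT estimate minus full-bank estimate) is also an assumption. For reference, the counts reported above correspond to roughly 94.6% (8566/9056) and 95.4% (9456/9916) of converged estimates lying within the limits.

```python
import statistics

def bland_altman_limits(full_bank, cat, z=1.96):
    """Return (lower, upper, share_inside) for the 95% limits of agreement."""
    # Assumption: differences are taken as CAT minus full-bank estimate.
    diffs = [c - f for f, c in zip(full_bank, cat)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    lower = mean_diff - z * sd_diff
    upper = mean_diff + z * sd_diff
    inside = sum(lower <= d <= upper for d in diffs)
    return lower, upper, inside / len(diffs)

# Hypothetical person estimates in logits, not data from the study.
theta_pcm = [-1.2, -0.4, 0.0, 0.3, 0.9, 1.5, 2.1, -2.0]
theta_cat = [-1.0, -0.5, 0.2, 0.1, 1.1, 1.4, 2.4, -2.3]

lower, upper, share_inside = bland_altman_limits(theta_pcm, theta_cat)
print(round(lower, 3), round(upper, 3), share_inside)
```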
A total of 133 patients with low back pain completed 108 items from the four original questionnaires and the CAT version. The mean age of these patients was 53.0 years (standard deviation (SD) 13.9), 19.5% were men, and patients had a mean complaint time of 7.0 years (minimum: 1 month; maximum: 30 years).
For the real initial CAT application, the 95% limits of agreement according to the Bland-Altman approach were -0.487 to 0.659 and -0.734 to 0.776 for the body functions and activity-participation dimensions, respectively. A total of 126 of the 133 patients were within the 95% limits of agreement for body functions, and 126 of the 133 patients were within the 95% limits of agreement for the activity-participation dimension. The ICC (2,1) values were 0.98 and 0.97, respectively. The CAT used a median of 19 and 14 items to estimate θ for the body functions and activity-participation dimensions, respectively. θR-PCM and θR-CAT correlated well for the body functions and activity-participation dimensions (r = 0.98 and r = 0.97, respectively).
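For readers unfamiliar with ICC(2,1), the following is a minimal sketch of the two-way random effects, absolute agreement, single-measure coefficient computed from the usual ANOVA mean squares. Here each patient is a row and the two "raters" are the full-bank estimate and the CAT estimate. The data are hypothetical, and the function name is illustrative rather than taken from any package used in the study.

```python
def icc_2_1(ratings):
    """ICC(2,1) from a list of rows (one per subject, k measurements each)."""
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for subjects, measurement methods, and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical (full-bank theta, CAT theta) pairs in logits.
pairs = [(-1.2, -1.0), (-0.4, -0.5), (0.0, 0.2), (0.3, 0.1),
         (0.9, 1.1), (1.5, 1.4), (2.1, 2.4), (-2.0, -2.3)]
print(round(icc_2_1(pairs), 3))
```

Unlike a correlation coefficient, this form penalises a systematic offset between the two sets of estimates, which is why it is reported alongside r.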
As would be expected, respondent burden was substantially greater for those who completed all items in the scales, in comparison with those for whom scores were estimated using CAT. CAT assessments initially reduced the number of items administered to 19 and 14 per patient for the first and second item banks. This reduction in number of items administered translated into estimated reductions in response times from an average of 15 to 6 minutes.
The initial CAT application included a standard error of 0.50 or less as a stopping rule. We increased the standard error threshold to 0.55 and 0.60 to test whether this further reduced the burden. As a result, the average number of items administered fell to 15 and 12 for the body functions dimension, and to 12 and 10 for the activity-participation dimension, for standard errors of 0.55 and 0.60 respectively.
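The effect of relaxing the stopping rule can be illustrated with a deliberately simplified CAT loop. This is a sketch under several assumptions, none of which describe the software used in the study: items are dichotomous Rasch items rather than partial credit items, theta is estimated by expected a posteriori (EAP) on a grid with a normal prior instead of maximum likelihood, the item bank is an evenly spaced hypothetical one, and the simulated respondent answers deterministically.

```python
import math

def rasch_prob(theta, difficulty):
    """Probability of endorsing a dichotomous Rasch item."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def eap_estimate(responses, grid, prior_sd=2.0):
    """Posterior mean and SD of theta on a grid (normal prior, an assumption)."""
    weights = []
    for t in grid:
        w = math.exp(-0.5 * (t / prior_sd) ** 2)
        for difficulty, score in responses:
            p = rasch_prob(t, difficulty)
            w *= p if score == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)

def run_cat(bank, answer, se_stop=0.50):
    """Administer items until the standard error is at or below se_stop."""
    grid = [g / 10.0 for g in range(-60, 61)]
    remaining = list(bank)
    responses = []
    theta, se = 0.0, float("inf")
    while remaining and se > se_stop:
        # For the Rasch model, information p(1 - p) is largest for the item
        # whose difficulty is closest to the current theta estimate.
        item = min(remaining, key=lambda b: abs(b - theta))
        remaining.remove(item)
        responses.append((item, answer(item)))
        theta, se = eap_estimate(responses, grid)
    return theta, se, len(responses)

# Hypothetical bank of 41 items spread from -3 to +3 logits.
bank = [-3.0 + 0.15 * i for i in range(41)]
true_theta = 0.8

def answer(difficulty):
    # Deterministic respondent: endorses items easier than true_theta.
    return 1 if true_theta > difficulty else 0

results = {se: run_cat(bank, answer, se_stop=se) for se in (0.50, 0.55, 0.60)}
for se, (theta, reached_se, n_items) in results.items():
    print(se, round(theta, 2), round(reached_se, 2), n_items)
```

Because the item sequence is the same for every threshold in this sketch, a looser threshold can only stop the test at the same point or earlier, which mirrors the direction of the reduction reported above.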
Convergent validity between θR-CAT and subscale and total scores of the four original questionnaires.
θR-CAT Body Functions
WHODAS II – Understanding and communicating
WHODAS II – Getting around
WHODAS II – Self care
WHODAS II – Getting along with people
WHODAS II – Getting along with people (without sexual activities item)
WHODAS II – Life activities (without work items)
WHODAS II – Participation in society
WHODAS II – Total (without work and sexual items)
NHP Social Isolation
NHP Physical Mobility
This study is the first to explore the potential for applying CAT in the assessment of ICF-related disability for outcome measurement in LBP. Using a combination approach of EFA and Rasch analysis, based upon the disability definition in the ICF, together with new developments in CAT software, we have been able to show that items can be calibrated onto a single metric and that they can be used to provide the basis of a CAT application that maps onto the ICF. In this way, a simple, precise estimate of the person's ability can be determined and, given the use of the Rasch model, one that is interval scaled. Furthermore, the combination of items from different questionnaires makes a wider 'ruler' of ability than any single scale, reducing the risk of floor and ceiling effects, and providing continuity of measurement across the acute-community divide.
The development and implementation of such an approach has raised, and continues to raise, several developmental and application challenges. At the conceptual level, for example, not all items within the ICF core set are accommodated within our item banks. Consequently, further expansion to make these item banks inclusive, at least of the brief core sets, would be advantageous. However, there is no guarantee that additional items would satisfy strict unidimensional requirements, as there is no empirical evidence to support the dimensionality of the published core sets. It is also true to say that the way in which tasks are operationalised in some scales can reflect both cognitive and physical components, and can potentially straddle both body functions and activities within the ICF categorisation. The task of developing a measurement system to map onto the ICF is thus an ongoing challenge, and the current study offers one potential way of providing measurement that facilitates an ICF-based CAT approach. The grouping of items into body functions and activity-participation is based upon rigorous tests of unidimensionality but, for example, the latter does not attempt to separate activities from participation. Indeed, there is still considerable debate about the distinction between activities and participation as defined by the ICF. A recent paper has suggested that these need further differentiation into 'acts', 'tasks' and 'societal involvement'.
We have adopted rigorous tests of unidimensionality as there is evidence that even small deviations from this can lead to substantive and significant differences in person estimates. CAT would be particularly vulnerable to this influence as only a relatively small set of items is administered. Even then, we need to gather more data to undertake a confirmatory factor analysis on the final sets of items to have greater confidence in the unidimensionality of the item banks. An EFA approach was used because traditional factor analysis may overestimate the number of factors and underestimate the factor loadings when analyzing skewed categorical data. Nevertheless, our indicator of unidimensionality (RMSEA) for the item banks was higher than we would have wanted, and suggests some fragility in the dimensionality of the structure.
In Turkey there is an educational and income gradient by age, including illiteracy and a lack of computer experience. Consequently, most of the patients required help with the CAT application. The computer set-up was traditional, including a mouse; touch screen technology may have improved independence for some, and is an obvious next step. The illiteracy problem is likely to remain for another 20 years or so, and so this is a particular challenge to CAT application in Turkey and other countries where there are similar problems, though possibly less so in northern European countries or the USA. Nevertheless, despite these problems, internet-based CAT applications, where patients can log in, should offer further opportunities for community-based follow-up.
There are also technical issues which require further thought and development. From the simulated data it was not possible to obtain an estimate of the persons' body functions or activity-participation dimensions in all cases. The CAT application failed to converge in 9.4% and 0.8% of cases for the first and second dimensions, respectively. This is a known problem with the Newton-Raphson algorithm, which was used in the current study, but the next version of SmartCAT™ will include the modified maximum likelihood estimation procedure which should eliminate this problem [64, 65]. This will leave only the estimate of extreme persons (i.e. at the floor or ceiling of the entire item bank) where additional information will be required to obtain a person estimate. In the current study this was obtained from the RUMM2020 programme as the person estimate for extremes in the item bank calibration. The actual number of extreme cases was low, with none in the body functions and 0.01% in the activity-participation dimensions of the persons in the real CAT application. Furthermore, only 1 of 133 real CAT applications failed to converge.
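The extreme-score case mentioned above can be illustrated with a small sketch of Newton-Raphson maximum likelihood estimation. It uses dichotomous Rasch items with hypothetical difficulties, which is a simplification of the partial credit model used in the study, and the divergence bound and tolerance are arbitrary illustrative choices. For a response pattern at the ceiling (every item endorsed) the likelihood has no finite maximum, so the iterations keep stepping upward and the routine reports failure.

```python
import math

def rasch_probs(theta, difficulties):
    return [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]

def newton_raphson_theta(difficulties, scores, start=0.0,
                         tol=1e-4, max_iter=50, bound=10.0):
    """Return (theta, se) on convergence, or None if the estimate diverges."""
    theta = start
    for _ in range(max_iter):
        probs = rasch_probs(theta, difficulties)
        gradient = sum(x - p for x, p in zip(scores, probs))
        information = sum(p * (1.0 - p) for p in probs)
        step = gradient / information
        theta += step
        if abs(theta) > bound:
            return None  # running off towards infinity: no finite estimate
        if abs(step) < tol:
            final_info = sum(p * (1.0 - p)
                             for p in rasch_probs(theta, difficulties))
            return theta, 1.0 / math.sqrt(final_info)
    return None  # did not settle within max_iter iterations

difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical item locations

mixed = newton_raphson_theta(difficulties, [1, 1, 1, 0, 0])
ceiling = newton_raphson_theta(difficulties, [1, 1, 1, 1, 1])
print(mixed, ceiling)
```

A mixed pattern converges in a few iterations, whereas the ceiling pattern returns no estimate, which is why extreme persons need information from outside the iteration itself.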
The number of cases used in the current study is lower than the average by CAT standards. Previously published work on CAT has been based on sample sizes ranging from fewer than one hundred to several thousand cases [9, 11, 14, 18, 66]. Some of this variability may be due to the use of different IRT models as the basis of this work. Generally the Rasch model is far less demanding in terms of sample size than other IRT models, although it is much more demanding in terms of quality of data as it requires the scales to satisfy conjoint theory axioms. The key issue is the degree of precision required of the person estimate, and this raises further interesting issues as to whether this might vary across different diagnoses and situations, for example, where estimates might be used as the basis of clinical management decisions (e.g. to start a particular treatment).
It is known that each pair of adjacent categories in a polytomous item serves as a single dichotomous item, so a polytomous item bank contributes more to the test information function than a dichotomous item bank. Also, the information is typically distributed across a wider range of the trait being measured when polytomous items contribute to item banks. For this reason, even when there is a relatively small item bank with polytomous items, CAT works well. Since our item banks were relatively small and most of the items in the item banks were dichotomous, the number of items used to estimate the thetas with SE < 0.5 was higher in our CAT application than in other CATs [8–10, 12, 13, 69]. However, Haley et al. achieved the same SE of 0.5 with a 20-item CAT application, and another study also concluded that a 20-item adapted test was successful in achieving accurate estimates of physical functioning scores and age-based centiles. These findings were similar to the present study in terms of number of items administered and precision of the estimated theta.
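The information argument can be made concrete with a short sketch. Under the partial credit model the information an item provides at a given theta equals the variance of the item score at that theta; a dichotomous Rasch item is the special case with a single threshold, where the maximum is 0.25. The threshold values below are hypothetical and chosen only for illustration.

```python
import math

def pcm_category_probs(theta, thresholds):
    """Category probabilities for a partial credit item."""
    cumulative = [0.0]
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (theta - tau))
    weights = [math.exp(c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

def pcm_information(theta, thresholds):
    """Item information = variance of the item score at theta."""
    probs = pcm_category_probs(theta, thresholds)
    mean = sum(k * p for k, p in enumerate(probs))
    return sum((k - mean) ** 2 * p for k, p in enumerate(probs))

# A perfectly targeted dichotomous item versus a hypothetical 4-category item.
dichotomous = pcm_information(0.0, [0.0])
polytomous = pcm_information(0.0, [-1.0, 0.0, 1.0])

# SE = 1 / sqrt(information), so SE <= 0.5 needs total information >= 4.
target_information = 1.0 / 0.5 ** 2
items_needed = math.ceil(target_information / dichotomous)
print(dichotomous, round(polytomous, 3), items_needed)
```

Under these assumptions at least 16 perfectly targeted dichotomous items are needed to reach SE = 0.5 by maximum likelihood, so item counts in the high teens are what one would expect from a largely dichotomous bank.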
Using a combination approach of EFA and Rasch analysis, this study has shown that it is possible to calibrate items onto a single metric in a way that can be used to provide the basis of a CAT application. Recent applications of CAT in other medical outcomes suggest that many others are working on these issues at the present time, and we could expect to see a rapid growth in the scientific basis and the ease of application during the coming years. All these developments mean that, at the present time, there is the opportunity to obtain a wide variety of information to evaluate the biopsychosocial model in its more complex forms, without increasing the burden of information collection for patients. Alternatively, it will be possible to minimize the burden of data collection further compared with existing data collection protocols. Both scenarios will be based upon scientifically rigorous measurement which offers greater breadth of measurement than the traditional single scale approach.
This study was supported by a grant from the Ankara University, Scientific Research Unit with a project number of 2006/0809241.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.