The Visual Analogue WOMAC 3.0 scale - internal validity and responsiveness of the VAS version

Background: Many people suffer with Osteoarthritis (OA) and subsequent morbidity. Therefore, measuring outcome associated with OA is important. The Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) has been a widely used patient reported outcome in OA. However, there is relatively little evidence to support the use of the Visual Analogue Scale (VAS) version of the scale. We aimed to explore the internal validity and responsiveness of this VAS version of the WOMAC.
Methods: Patients with chronic hip or knee pain of mechanical origin, waiting for a hip or knee joint replacement, completed the WOMAC as part of a study to investigate the effects of acupuncture and placebo controls. Validity was tested using factor analysis and Rasch analysis, and responsiveness using standardised response means.
Results: Two hundred and twenty one patients (mean age 66.8, SD 8.29, 58% female) were recruited. Factor and Rasch analysis confirmed unidimensional Pain and Physical Functioning scales, capable of transformation to interval scaling and invariant over time. Some Differential Item Functioning (DIF) was observed, but this cancelled out at the test level. The Stiffness scale fitted the Rasch model but adjustments for DIF could not be made due to the shortness of the scale. Using the interval transformed data, Standardised Response Means were smaller than when using the raw, ordinal data.
Conclusions: The WOMAC Pain and Physical Functioning subscales satisfied unidimensionality and ordinal scaling tests, and the ability to transform to an interval scale. Some Differential Item Functioning was observed, but this cancelled out at the test level and, by doing so, at the same time removed the disturbance of unidimensionality. The scaling characteristics of sets of items which use VAS require further analysis, as it would appear that they can lead to spurious levels of responsiveness and scale compression because they exaggerate the distortion of the ordinal scale.
Trial number UKCRN study ID: 4881 ISRCTN78434638


Background
The prevalence of osteoarthritis (OA) has been reported to be as high as 8.5 million people in the UK [1], and many patients suffer a considerable amount of pain and functional limitation [2,3]. Therefore, the evaluation of patients' health status is important in supporting individual treatment decisions and assessing quality of care and treatment [4,5]. In recent years we have seen an ever increasing number of patient reported outcome measures (PROMs) to aid in this process, which are now routinely used to monitor health care provision in the UK [4]. One commonly used measure in osteoarthritis is the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) [6]. The scale has three subscales (Table 1): pain (5 items), stiffness (2 items) and physical functioning (17 items). Numerous studies have reported on its reliability and validity [7][8][9]. Several studies have also raised issues about the factorial validity of the subscales [5][10][11][12][13]. Evidence from research using the Rasch measurement model [14] seems to be consistent in observing a lack of fit to the Rasch model, a need to reduce the item set to achieve fit, or lack of confirmation of the distinct subscale structure [15][16][17][18][19]. The WOMAC is unusual in offering both a Likert-style version and a Visual Analogue Scale (VAS) version. However, much of the validation work appears to have been undertaken on the Likert version. One study compared the Likert and VAS versions, suggesting differential efficiency for subscales depending on which version was used [20], but it did not report on factorial validity. Consequently, there is little evidence to support the reliability and factorial validity of the VAS version of the scale. Yet VASs are increasingly used within scales and as single items in clinical practice and research. The VAS tends to be analysed as an interval scale but there is no scientific evidence that this is a reasonable assumption.
The little evidence that exists on the psychometric properties of the VAS suggests that VAS data are ordinal, that people do not tend to use the full range of the scale, and that the actual design of the VAS can differ when measuring the same construct and thus could benefit from standardisation [21,22]. Further, if people do not use the full range of the VAS this might have implications for its responsiveness. Thus, whilst the WOMAC is a popular measure to assess impairment and activity limitation in patients with osteoarthritis, we lack evidence on the internal construct (factorial) validity of the VAS version [23]. In addition, further evidence on the extent to which the WOMAC VAS version can detect change over time (responsiveness) [24] is required. This paper examines the key concepts of internal validity and responsiveness of the WOMAC v3.0 VAS scale with factor and Rasch analysis.

Methods
WOMAC v3.0 data (VAS version) were collected as part of a prospective randomised controlled trial, which investigated the relative effects of acupuncture and different acupuncture placebo controls on osteoarthritis (OA) patients waiting for hip or knee replacement. OA was diagnosed by orthopaedic consultants both clinically and radiographically. Patients were included if they had chronic pain of mechanical origin, predominantly from a single joint (hip or knee), scored a minimum of 30 on a 100 mm VAS for pain, and were not on active treatment (apart from their normal analgesia). Patients were excluded if they had serious co-morbidity (such as cancer, rheumatoid arthritis or severe low back pain), were pregnant, had prolonged or current steroid use, or were waiting for a joint revision. WOMAC data were collected at two time points: on entry into the study and at the end, six weeks later.

Data analysis
Given that the WOMAC is an established outcome measure with three subscales, we conducted all analyses on each of the subscales separately. To avoid spurious precision, where the thickness of a mark upon a VAS may exceed one millimetre, or the interpretation of the exact location may vary by a millimetre, WOMAC data were divided by 2, thus reducing the range of each item to 0-50. For the purpose of this paper we will refer to these raw data as 'ordinal data'. Internal reliability of each of the subscales was examined with Cronbach's alpha, deemed acceptable for group use if >0.7 [25]. Each subscale was also subjected to factor analysis, where Monte-Carlo parallel analysis was employed to determine significant eigenvalues [26]. Parallel analysis compares the eigenvalues observed in the data with those obtained from Monte-Carlo simulated random data sets with the same sample size and number of items. An observed eigenvalue is treated as significant only if it exceeds the corresponding eigenvalue from the random data. Default rules in some statistical packages, such as retaining eigenvalues greater than one, do not take this into account and can generate spurious factors. Factor analysis and Cronbach's alpha calculations were carried out using SPSS 15 [27].
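The reliability and parallel analysis steps described above can be illustrated with a minimal Python sketch. This is not the SPSS procedure used in the study: the function names, the use of correlation-matrix eigenvalues, the 95th percentile cut-off, the number of simulations and the simulated example data are all assumptions made for illustration.

```python
import numpy as np


def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    sum_item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - sum_item_variances / total_variance)


def sorted_eigenvalues(data):
    """Eigenvalues of the item correlation matrix, largest first."""
    return np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]


def parallel_analysis(items, n_sims=200, percentile=95, seed=0):
    """Count leading eigenvalues that exceed those of same-shaped random data."""
    items = np.asarray(items, dtype=float)
    n, k = items.shape
    rng = np.random.default_rng(seed)
    observed = sorted_eigenvalues(items)
    simulated = np.empty((n_sims, k))
    for s in range(n_sims):
        simulated[s] = sorted_eigenvalues(rng.standard_normal((n, k)))
    threshold = np.percentile(simulated, percentile, axis=0)
    # Retain factors only while the observed eigenvalue beats the random one.
    n_factors = 0
    for obs, thr in zip(observed, threshold):
        if obs <= thr:
            break
        n_factors += 1
    return n_factors, observed, threshold


# Hypothetical data: 221 respondents, 5 items driven by one common trait.
rng = np.random.default_rng(1)
trait = rng.standard_normal((221, 1))
scores = trait + 0.6 * rng.standard_normal((221, 5))

alpha = cronbach_alpha(scores)
n_factors, observed, threshold = parallel_analysis(scores)
print(f"alpha = {alpha:.2f}, factors retained = {n_factors}")
```

Comparing each observed eigenvalue with a percentile of the simulated eigenvalues, rather than with a fixed value of one, is what protects against the spurious factors mentioned above.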
Data were fitted to the Rasch measurement model to determine if the individual subscales satisfied the expectation of the Rasch model [14,28]. The RUMM2020 software was used for this purpose [29]. The Rasch model is a mathematical algorithm that expresses the probabilistic expectations of item and person performances/estimates [30]. Specifically, the probability of a correct response or endorsement is a logistic function of the difference between the person and item parameter. Where data satisfy the expectations of the Rasch model, the summed subscale scores can be transformed into interval scale measurement [31] (for the purpose of this paper we will refer to these Rasch transformed data as 'interval data'). A number of tests are performed to determine if the data meet the assumptions of the Rasch model. A summary chi-square interaction statistic should be non-significant, showing no deviation from model expectation. Person and item fit residuals should be within the range of +/-2.5 and mean person/item fit residuals should be close to zero (values of zero indicate perfect fit) [28]. Individual item chi-squares should be non-significant (Bonferroni adjusted).
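For reference, the logistic relationship described above can be written out for the simplest (dichotomous) case. The notation is an assumption introduced here and does not appear in the original text: \theta_n denotes the location of person n and b_i the location (difficulty) of item i, both in logits.

```latex
P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}
```

When \theta_n = b_i the probability of endorsement is 0.5, and it rises towards 1 as the person location exceeds the item location.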
Inconsistent use of response options (disordered thresholds), item bias across groups of respondents (Differential Item Functioning, DIF), multidimensionality, or local dependence may contribute to misfit:
• The thresholds between response categories (i.e. the transition points between adjacent categories, where a response in either category is equally likely) should reflect an increase in the underlying trait (e.g. pain). In the case of the VAS every millimetre (mm) is a response category, resulting in 100 thresholds. However, since we divided scores by two, the number of thresholds was reduced to 50. Disordered thresholds can be observed and dealt with by grouping response categories.
• The scale should be invariant and not be influenced by bias (Differential Item Functioning or DIF). For example we wish to see that people from different groups, with equal amounts of the underlying trait under investigation (i.e. pain, physical functioning or stiffness), respond to items in the same manner. This requirement of invariance is indicated by a non-significant ANOVA of the residuals where the key group is the main factor. DIF can be uniform and present consistently across the trait (see below how to deal with this), or non-uniform where bias is not consistent across the trait. Items which display non-uniform DIF often need to be removed from the scale [32,33]. Invariance across key groups (age, gender, joint affected, previous experience of acupuncture, which practitioner they were allocated to, and treatment allocation) was examined using an analysis of variance of the residuals where the group is the main effect.
• Unidimensionality is a requirement for summating any set of items [34]. It is examined by creating two subsets of items that are identified by a principal component analysis of the item residuals; those loading negatively forming one set and those loading positively the second set [35]. T-tests on the two estimates derived from the subtests for each respondent are then performed to see if they differ statistically; if the 95% confidence interval of the proportion of significant tests includes 5%, unidimensionality is supported [35,36].
• The correlation matrix of item residuals is explored to ensure that examinee item responses depend only on their trait level (local independence, residual correlations <0.30) and not on their responses to other test items.
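The t-test approach to unidimensionality in the list above can be sketched in Python as follows. This is an illustrative sketch, not the RUMM2020 implementation: the function name, the inputs (person estimates and standard errors from the two item subsets), the normal-approximation confidence interval and the 1.96 critical value are assumptions. A proportion lying entirely below 5% is also treated as supporting unidimensionality.

```python
import numpy as np


def unidimensionality_check(est_a, se_a, est_b, se_b, critical_t=1.96):
    """Share of respondents whose two subtest estimates differ significantly."""
    est_a, se_a = np.asarray(est_a, float), np.asarray(se_a, float)
    est_b, se_b = np.asarray(est_b, float), np.asarray(se_b, float)
    # One t-test per respondent on the difference between the two estimates.
    t = (est_a - est_b) / np.sqrt(se_a**2 + se_b**2)
    n = t.size
    proportion = float(np.mean(np.abs(t) > critical_t))
    # Normal-approximation 95% confidence interval for a binomial proportion.
    half_width = 1.96 * np.sqrt(proportion * (1.0 - proportion) / n)
    lower = max(0.0, proportion - half_width)
    upper = min(1.0, proportion + half_width)
    # Supported when the interval reaches down to 5% or lies below it.
    supported = bool(lower <= 0.05)
    return proportion, (lower, upper), supported
```

For small samples an exact binomial interval would be a more cautious choice than the normal approximation used here.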
Where items display uniform DIF they are grouped together into a testlet [37]. Essentially this combines the responses of the offending items into a 'super item'. Thus, we see if the bias is cancelled out at the test level and if so this allows an unbiased estimate of the person estimate. Similarly, where local dependency is found to exist, the locally dependent items are added into a testlet to explore if this removes the dependency in the data [37].
The person separation index (PSI) is an indicator of how precisely subjects have been spread out along the measurement construct defined by the items (ranges from 0 to 1) [28]. Values ≥0.70 allow for group comparisons but for individual clinical use values should be ≥0.85. If the scale is found to fit, we explore how well the scale is targeted to the sample, using item-person threshold maps.
For polytomous data two different parameterisations of the Rasch model can be used. The Rating Scale version assumes that the distance between thresholds is equal across items [38]. The Unrestricted (Partial Credit) model does not make this assumption [39]. If results from these two models are significantly different (using a log-likelihood test) the Partial Credit model should be used, as was the case with our data (Pain subscale χ² = 53.84, p < 0.001; Physical Functioning subscale χ² = 206.83, p < 0.001; Stiffness subscale χ² = 19.47, p < 0.001). Bonferroni corrections were applied throughout the analysis to allow for multiple testing [40].
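To make the role of thresholds in the Partial Credit model concrete, the following minimal Python sketch computes category probabilities for a single polytomous item. The function name and the example threshold values are hypothetical; the formula is the standard Partial Credit parameterisation, in which each item has its own set of thresholds.

```python
import numpy as np


def pcm_category_probabilities(theta, thresholds):
    """Partial Credit Model probabilities for scores 0..m on one item.

    theta: person location in logits.
    thresholds: the item's m threshold locations in logits.
    """
    thresholds = np.asarray(thresholds, dtype=float)
    # Cumulative sums of (theta - tau_k); the score-0 term is defined as 0.
    exponents = np.concatenate(([0.0], np.cumsum(theta - thresholds)))
    exponents -= exponents.max()  # guard against overflow for many categories
    weights = np.exp(exponents)
    return weights / weights.sum()


# Hypothetical item with three ordered thresholds (four response categories).
probs = pcm_category_probabilities(theta=0.0, thresholds=[-1.0, 0.0, 1.0])
print(probs)
```

In this example the person is located exactly at the second threshold, so categories 1 and 2 are equally likely. Under the Rating Scale parameterisation the spacing between thresholds would additionally be constrained to be the same for every item.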
Responsiveness was examined using both the observed, ordinal scores on the VAS and those derived from the Rasch analysis (log transformed interval data). For the latter purpose we obtained log transformed data for both the pre- and post-data. Standardised Response Means (SRMs) were used to evaluate the subscales' responsiveness. SRMs are derived by dividing the mean change score by the pooled standard deviation [41]. This accounts for different levels of variance in the data at baseline and follow-up. Bootstrapped standard errors were generated within the STATA programme to provide confidence intervals to ascertain whether the SRMs differed significantly [42].

Results
Participants' median VAS pain score (over seven days before the commencement of the study) was 59.4 (IQR 48.0 to 68.9). Table 1 displays participants' raw scores (ordinal data) on each of the subscales, pre and post, and demonstrates that significant changes occurred over time on all subscales.
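The SRM calculation and the bootstrap comparison described in the Methods can be sketched in Python as follows. This is not the STATA routine used in the study. The function names, the definition of the pooled standard deviation as the root mean of the pre and post variances, the sign convention (pre minus post, so that improvement is positive) and the percentile confidence interval, used here in place of bootstrapped standard errors, are all assumptions.

```python
import numpy as np


def srm(pre, post):
    """Mean change divided by the pooled SD of the pre and post scores."""
    pre, post = np.asarray(pre, float), np.asarray(post, float)
    pooled_sd = np.sqrt((pre.var(ddof=1) + post.var(ddof=1)) / 2.0)
    return (pre - post).mean() / pooled_sd


def bootstrap_srm_difference(pre_a, post_a, pre_b, post_b, n_boot=2000, seed=0):
    """Percentile 95% CI for SRM(a) - SRM(b).

    'a' and 'b' are two scorings of the same patients (e.g. ordinal and
    interval), so the same resampled patients are used for both.
    """
    arrays = [np.asarray(x, float) for x in (pre_a, post_a, pre_b, post_b)]
    n = arrays[0].size
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample patients with replacement
        pa, qa, pb, qb = (x[idx] for x in arrays)
        diffs[i] = srm(pa, qa) - srm(pb, qb)
    return np.percentile(diffs, [2.5, 97.5])
```

A confidence interval for the difference that excludes zero would indicate that the two scorings yield significantly different SRMs.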

Pain subscale
Factor analysis of the WOMAC Pain subscale (pre-data) demonstrated a unidimensional construct, with 70.6% of the variance attributable to the first factor.
Fit to the Rasch model was demonstrated by satisfactory summary statistics and t-tests for unidimensionality (table 2, analysis 1). Individual item fit was good. There were no significant residual correlations between the items, suggesting absence of local dependence. Only two out of the five items were disordered (items 3 and 4). However, due to the large number of response categories (i.e. 51) it was not possible to determine a sensible rescoring method. The PSI of the pain subscale was 0.86 and Cronbach's alpha was 0.82.
Two items (2 and 4) showed uniform DIF by 'joint' in opposite directions: people with the same level of pain tended to score higher on item 2 if they were waiting for a knee replacement than those waiting for a hip replacement. The reverse was the case for item 4. Combining these two items into a testlet and comparing them against the remaining three items resulted in a fit to the Rasch model and unidimensionality (table 2, analysis 2). This is an indication that the DIF is cancelled out at the subtest level. The resulting item fit statistics are shown in table 3.
Despite the potential 250 raw score points (ordinal data) derived from the 5 items, the scale demonstrated a substantial lack of range (figure 1). This is consistent with the moderate reliability and indicates that increments in raw (ordinal) score points across the centre of the scale are associated with only marginal increments on the underlying metric construct (interval data).
No DIF over time was observed when the pre- and post-data were combined, indicating that the scale is invariant over time, and the items were well targeted to the population. The SRM was 0.55 for the ordinal data (raw scores) and 0.35 for the interval data (Rasch transformed scores), suggesting that the ordinal SRM overestimates the true responsiveness of the WOMAC (table 1). However, the confidence interval for the difference between the two SRMs overlapped zero, indicating that the difference was not significant.

Physical Functioning (PF) subscale
Factor analysis of the 17 item PF subscale supported a unidimensional construct, with 63.4% of the variance attributable to the first (and only significant) factor.
The pre-data PF items initially deviated significantly from the Rasch model with a chi-square probability >0.003 (table 2, analysis 3) and a lack of unidimensionality. Five items showed significant DIF by joint (item 1, 2, 5, 9 and 11). In addition, items 1 and 5 had high fit residuals. As these two items also showed DIF they were combined into a testlet and compared with the remaining 15 items. This resulted in a fit to the Rasch model and unidimensionality (table 2, analysis 4; table 3), suggesting DIF was responsible for the lack of fit and unidimensionality. Cronbach alpha was 0.95.
As with the pain scale, the PF scale (Rasch transformed scores) had a limited distribution (figure 2) and the ordinality of the raw score was accentuated. For example, a change of 25 points out of a total of 850 (17 items each ranging from 0-50 as scores were halved) at the margins of the raw total (ordinal) VAS physical functioning subscale scores is reflected in a real, interval equivalent change of 311 points (622 mm) (table 4). The person fit residual standard deviation was high. We used a regression analysis to explore independent variables that might be predictive of this. Variables entered into this analysis were gender, age, joint and time on the waiting list. None correlated significantly with the person fit residuals.
Combining pre- and post-data showed that the Physical Functioning subscale was invariant over time (no DIF observed). The SRM using the ordinal data was 0.49 and using the interval data 0.37 (table 1). In this instance the confidence interval for the difference between the SRMs did not overlap zero (0.017-0.206), indicating a significantly different effect size.

Stiffness subscale
Since the stiffness subscale consists of two items it was not appropriate to subject it to Factor analysis. Rasch analysis showed that the subscale fitted the Rasch model (table 2, analysis 5). The reliability of this subscale was low (0.81), which is not unexpected considering the

Discussion
Data from the three WOMAC subscales were assessed by factor and Rasch analysis, which largely supported the structure of the Pain and Physical Function subscales. There was some bias in item response, but this tended to cancel out at the scale level. Of significance is that this bias, when uncorrected, gave rise to the appearance of multidimensionality, and misfit to the Rasch model. This is consistent with earlier findings about the impact of DIF on dimensionality [43]. It is thus possible that earlier Rasch analyses of the WOMAC, which did not adjust for this bias, may have indicated that item reduction was necessary to obtain fit to the model and/or unidimensionality [15][16][17][18][19]. The use of testlets as a mechanism to evaluate the potential cancelling effect of bias appears to be a useful strategy to avoid unnecessary and possibly incorrect item deletion. Classical factor analysis may also have led to a conclusion of multidimensionality if parallel analysis was not applied [10][11][12][13]. In the current analysis, the default rule of an eigenvalue of greater than one as significant would have led to a multidimensional solution for the Physical Function scale. Although many items cross loaded across two factors, at least two items would have been candidates for removal under these circumstances. Therefore, it is easy to see how slight differences in methodological approaches may have given rise to different solutions regarding the subscale structures of the WOMAC.
In addition, the inclusion of OA patients in different stages of their disease in other studies may have given rise to valid multidimensional conclusions and consequently careful testing of the structure of scales across all stages (and disease groups) is a prerequisite for confidence in the robustness of any generic scale [44].
Although the stiffness subscale only consists of two items it was shown to fit the Rasch model. However, we were not able to employ strategies to overcome observed DIF and reliability was low. The usefulness of this scale should therefore be reconsidered.
The Rasch model is strict in terms of satisfying the requirement for transformation to interval scaling [45,46]. The iterative process of Rasch analysis requires unidimensionality tests to be done at each stage. Thus, factor analysis and Rasch analysis provide their own hierarchical ordering of scalability with the assumption of unidimensionality and finally the potential for interval scale transformation. The WOMAC Pain and Physical Function scales satisfy all of these conditions in this sample of those awaiting hip or knee replacement.
Responsiveness of the WOMAC has been reported to be good, both for the Likert and the VAS versions [47][48][49][50][51][52]. However, these studies make no attempt to adjust for the ordinal nature of the Likert scale or VAS, and the resulting differential deviation from the interval scale metric. As the calculation of responsiveness involves mathematical operations which are not supported by ordinal data, the results based upon ordinal data may be spurious [53]. Clinicians and others may be tempted to choose the VAS version of the scale because it seems more responsive than a Likert version. Figure 1 showed that a wide range of ordinal raw score points in the middle of the score range is associated with a very small number of actual metric points, and that at the margins the converse is true. In other words, the distance between data points in the middle of a visual analogue scale (in millimetres), as deduced from the raw (ordinal) data, is in fact much smaller once data are transformed into interval level data. The calculation of the SRM thus provides a good example of the impact of the misuse of ordinal data. Consequently, the level of responsiveness is spurious, as evidenced by the fall in SRM on all subscales when calculated using the interval data (where the technique is valid). Therefore, when using raw ordinal data, researchers and clinicians run the risk of misinference regarding the magnitude of change in pain and physical functioning [54]. Other studies employing Rasch analyses of visual analogue scales have not reported on the logit range and we can therefore not compare these findings to others. Further work needs to be undertaken to evaluate the effect of scale units (i.e. ordinal versus interval) upon statistics such as the SRM, and upon routine interpretation of outcome.
There are a number of limitations to the study. The sample is taken from those awaiting arthroplasty and therefore may be reflective of only those with moderate or severe pain and functional limitations. Consequently the findings need replication in those with lesser severity. The high person fit residual SD found in the Physical Functioning subscale was puzzling and could not be explained by the effects of a number of independent variables such as gender, age, time on the waiting list and joint. It is possible that these may also be a function of the large number of data points, and the associated sample size and again this will require further work.

Figure 1 (Pain subscale): The y-axes display the raw scores (top y-axis), which range from 0 to 250 as we divided the VAS scores by half for the analysis and the subscale contains five items, and the frequencies of item thresholds and participants (bottom y-axes). The figure also shows the location of study participants along the construct of Pain. Data for this figure represent the unbiased person estimates derived from Analysis 2 (see also Table 1), which combined biased items 2 and 4 into a testlet and left the remaining items unchanged.

Conclusions
In conclusion, the WOMAC Pain and Physical Functioning subscales were found to fit Rasch model expectations, and thus be internally valid and unidimensional. Factor analysis using parallel analysis also confirmed the unidimensionality. Consequently the raw score is a sufficient statistic for estimating the person's level of pain and physical functioning at the ordinal level. We were also able to transform the ordinal data (constrained to a 0-50 range for each item) to an interval scale through fit to the Rasch model. Some Differential Item Functioning was observed, but this cancelled out at the test level and, by doing so, at the same time removed the disturbance of unidimensionality. Therefore, we do not recommend changes to the item structure of the subscales. However, the scaling characteristics of sets of items which use Visual Analogue Scales do require further analysis, as it would appear that responsiveness using ordinal data is under-reported when people move along the margins of the scale and over-reported when they move across the middle of the scale. Clinically this means that change over time on the WOMAC for patients on the margins, using the raw ordinal data, cannot be directly compared with those who score in the middle of the scale, consistent with the lack of validity of performing mathematical operations on ordinal data. Finally, the utility of the Stiffness subscale should be reconsidered.

* Scores were halved for the analysis, therefore the pain total score ranges from 0-250 and the physical functioning total score from 0-850.

Figure 2 (Physical Functioning subscale): The graph displays the person-item threshold distribution map with the x-axes displaying location or difficulty of item thresholds (lower half) and location or level of physical functioning reported by participants (upper half). The y-axes display the frequencies of item thresholds (lower half) and participants (upper half). Data for this figure represent the unbiased person estimates derived from Analysis 2 (see also Table 2), which combined biased items 1 and 5 into a testlet and left the remaining items unchanged.