Exploring differential item functioning in the SF-36 by demographic, clinical, psychological and social factors in an osteoarthritis population

Background The SF-36 is a very commonly used generic measure of health outcome in osteoarthritis (OA). An important, but frequently overlooked, aspect of validating health outcome measures is to establish whether items work in the same way across subgroups of a population. That is, if respondents have the same ‘true’ level of outcome, does the item give the same score in different subgroups, or is it biased towards one subgroup or another? Differential item functioning (DIF) methods can identify items that may be biased for one group or another and have been applied to patient reported outcome measures. Items may show DIF for different conditions and between cultures; however, the SF-36 has not been specifically examined in an osteoarthritis population nor in a UK population. Hence, the aim of the study was to apply the DIF method to the SF-36 for a UK OA population. Methods The sample comprised a community sample of 763 people with OA who participated in the Somerset and Avon Survey of Health. The SF-36 was explored for DIF with respect to demographic, social, clinical and psychological factors. Well developed ordinal regression models were used to identify DIF items. Results DIF items were found by age (6 items), employment status (6 items), social class (2 items), mood (2 items), hip v knee (2 items), social deprivation (1 item) and body mass index (1 item). Although the DIF items rarely had a significant effect on the conclusions of group comparisons, in most cases there was a significant change in effect size. Conclusions Overall, the SF-36 performed well with only a small number of DIF items identified, a reassuring finding in view of the frequent use of the SF-36 in OA. Nevertheless, where DIF items were identified it would be advisable to analyse data taking account of DIF items, especially when age effects are the focus of interest.


Background
Osteoarthritis (OA) is one of the most common causes of disability and, with ageing populations, ever more treatments and procedures are being carried out. In order to evaluate the effectiveness of such treatments and procedures, it is essential to have accurate measures of outcome. Without good measures we cannot distinguish those who benefit from treatments from those who do not and for whom other, possibly less invasive, treatments may be more appropriate.
The 36 item SF-36 is the most commonly used generic measure of outcome used in OA [1]. The SF-36 is based on a multidimensional model of health and reflects eight important health concepts. These concepts are limitations in Physical Functioning, Role Limitations due to physical problems, Social Functioning, Bodily Pain, General Mental Health, Role Limitations due to emotional problems, Vitality and General Health Perceptions. There is also a single question on reported Health Transition. While considerable effort has been invested in developing the SF-36 to high psychometric standards, improving the quality of the measure and its interpretation for specific populations, such as OA, is an ongoing scientific task.
An important, but frequently overlooked, aspect of establishing the validity of a measure is to establish whether items and measures work in the same way across subgroups of a population, e.g. certain socio-economic groups or genders. That is, if respondents have the same underlying level of an attribute, such as disability, does the measure give the same score in different populations, or is it biased in some groups? For example, it has been shown that for the Center for Epidemiologic Studies Depression Scale (CES-D), women are more likely than men to endorse an item about having crying spells even though they have the same underlying level of depression [2]. Thus, while this item may truly reflect differences between men and women in likelihood of crying, it exhibits gender bias with respect to the measurement of depression, and scores for women might be inflated compared to men. Hence, apparent group differences in depression scores may be due to measurement bias rather than true differences. Alternatively, where no group differences are found, DIF items, if they exist, might mask true group differences. Although a measure may appear equivalent at the measure level, biases may still be present at the individual item level [3]. Thus, item level analyses are now seen as central to establishing measurement equivalence across subgroups of a population [4].
The techniques are known as differential item functioning (DIF) methods and biased items are said to exhibit DIF. DIF items have been identified in health outcome measures with respect to gender, age, race, ethnicity, socio-economic status, language, nationality and health care setting [5]. For example, several items from cognitive screening measures were shown to be poor items for those with low education levels [6]. Use of these items may exaggerate the problems of more deprived individuals.
If DIF items are found in measures in development then these items could be re-written or the item could be removed and an alternative DIF-free item with similar item properties could be substituted. If DIF items are found in an existing measure then it may be preferable to select an alternative measure (with no DIF). If data has already been collected or in situations where there is an established use of a measure such as the use of the SF-36 in OA, then DIF items could be removed and analyses repeated without the DIF items or using an analysis method that can take into account the DIF items; these analyses allow more accurate interpretation of results obtained.
Importantly, it has been shown that SF-36 items may work in different ways for different clinical conditions [7], including some evidence of DIF even between arthritic conditions (psoriatic arthritis and rheumatoid arthritis) [8]. Hence, given its frequency of use in OA, there is a need to examine items for DIF specifically in an OA population. Furthermore, it has been shown that cultural differences may impact the validity of SF-36 items [9]. However, DIF in the SF-36 has only been explored in US, Danish, Dutch, Israeli and Chinese patients. It is clearly important to achieve this level of validation for the SF-36 in a UK population. Additionally, only demographic factors have been previously explored; DIF items have been identified for the SF-36 with respect to age, education, gender, race, condition and language in other conditions [7,10-12]. Hence, the aim of the study was to examine DIF in the SF-36 for an OA population with respect to demographic, social, clinical and psychological factors.

Design
Statistical techniques were applied to SF-36 data from a community-based UK population of people with OA to explore DIF items in the SF-36 with respect to demographic, social, psychological and clinical factors.

Participants and data collection
The sample comprised a community sample of 763 people who had been diagnosed with OA from 1359 people with hip and/or knee symptoms who completed the SF-36 during a follow-up assessment of health outcome measures as part of the Somerset and Avon Survey of Health (SASH) [13,14]. SASH is a large scale survey of the population aged 35+. The age-sex stratified survey of 28,080 people registered with 40 general practices in Avon and Somerset yielded 2703 people reporting hip and/or knee symptoms at baseline (1994-1995). At follow-up assessment (2002-2003), 763 had OA diagnosed by a clinician assessing X-rays using the Kellgren-Lawrence classification [15]. Written informed consent was obtained from all participants. Ethics approval was obtained from the South West Research Ethics Committee (MREC/01/6/51) and the study was conducted in accordance with the Helsinki Declaration.

Measures
SF-36
The 36 item SF-36 is the most commonly used generic measure of outcome used in OA [1]. The SF-36 was developed from the Medical Outcomes Study (MOS) based on a multidimensional model of health [16]. The SF-36 is a shorter, 36 item measure that reflects the eight most important health concepts of the MOS. The concepts are limitations in Physical Functioning (10 items), Role Limitations due to physical problems (4 items), Social Functioning (2 items), Bodily Pain (2 items), General Mental Health (5 items), Role Limitations due to emotional problems (3 items), Vitality (4 items) and General Health Perceptions (5 items). There is also a single question on reported Health Transition. Only subscale scores are calculated (i.e. no total score). We used the UK SF-36 version in this study.

Validation measures
The Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC [17]) was used to validate the DIF-free SF-36. The WOMAC is the most commonly used disease-specific measure of outcome used in OA. The WOMAC was based on the objective of defining the dimensionality of pain and disability in OA of the hip and/or knee.

Grouping factors
Median splits were used where appropriate. The sociodemographic factors explored were gender, age (median = 70.88), social deprivation (measured by Townsend Index group; median = −1.47), social class (Registrar General classes 1, 2, 3i v 3ii, 4, 5) [18] and employment status (paid work v not in paid work). The psychological variable investigated was mood, assessed using a single item on the EuroQol [19] (no anxiety/depression v moderate or extreme). The clinical factors were Body Mass Index (BMI; underweight/normal/overweight v obese, i.e. <30 v 30+), the number of affected OA joints (1 or 2 v 3 or 4) and type of OA (hip v knee).
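As an illustration of the median split used to form binary grouping factors, the following is a minimal sketch. The function name, the example values and the rule that values equal to the median fall into the lower group are assumptions made for illustration; they are not taken from the study.

```python
from statistics import median

def median_split(values):
    """Return (cut_point, groups), where 0 = at or below the median, 1 = above.

    Assumption: ties at the median are placed in the lower group.
    """
    cut = median(values)
    return cut, [1 if v > cut else 0 for v in values]

# Hypothetical ages, for illustration only (not study data)
ages = [52.0, 61.5, 68.0, 70.9, 74.2, 80.1]
cut, groups = median_split(ages)
```

The resulting 0/1 vector can then serve as the binary grouping factor in the DIF models.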

Statistical analysis
First, t-tests were used to examine differences in SF-36 scores between the groups defined by each grouping factor. Then the assumption of unidimensionality was tested before testing for DIF.

Testing assumptions: unidimensionality
Ordinal factor analysis was carried out to explore unidimensionality. We used the FACTOR computer program [20] using common factor analysis with polychoric correlations. Unidimensionality was supported if there was a large difference in eigenvalues between factors 1 and 2 and a small difference in eigenvalues between factors 2 and 3 [21]. A widely cited criterion for acceptable unidimensionality is that ≥20% of the variance is explained by the first factor [22]. Another commonly reported method for exploring acceptable unidimensionality is to look at the ratio of the first to the second eigenvalue. The first eigenvalue should be significantly higher than the second eigenvalue. If this ratio is 3:1 or 4:1 then there would appear to be a dominant first factor [23]. If both of these criteria were met, unidimensionality was accepted.
The number of factors was also evaluated using the MAP procedure proposed by Velicer (1976) which examines the matrix of partial correlations [24].
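The eigenvalue-based criteria described above can be sketched as a small check. This is an illustrative sketch only: it assumes the eigenvalues have already been obtained (for example from the FACTOR program), that the proportion of variance for the first factor is the first eigenvalue divided by the sum of all eigenvalues, and that a ratio threshold of 3.0 is used. The function name and example eigenvalues are hypothetical, and the MAP procedure is not implemented here.

```python
def unidimensionality_checks(eigenvalues, min_variance=0.20, min_ratio=3.0):
    """Summarise simple eigenvalue criteria for unidimensionality.

    Assumptions: eigenvalues are positive; variance explained by the first
    factor is eig[0] / sum(eig); thresholds are illustrative defaults.
    """
    eig = sorted(eigenvalues, reverse=True)
    if len(eig) < 3 or eig[1] <= 0:
        raise ValueError("need at least three eigenvalues with a positive second value")
    variance_first = eig[0] / sum(eig)
    ratio = eig[0] / eig[1]
    return {
        "variance_first": variance_first,   # proportion explained by factor 1
        "ratio_1_to_2": ratio,              # first-to-second eigenvalue ratio
        "gap_1_2": eig[0] - eig[1],         # should be large
        "gap_2_3": eig[1] - eig[2],         # should be small
        "accepted": variance_first >= min_variance and ratio >= min_ratio,
    }

# Hypothetical eigenvalues for a four-item subscale (not study data)
result = unidimensionality_checks([2.8, 0.5, 0.4, 0.3])
```

The two gap values are returned so that the comparison of the drop between factors 1 and 2 with the drop between factors 2 and 3 can be inspected alongside the two thresholds.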

DIF testing
Details of the method used have been described elsewhere [25]. Briefly, ordinal logistic regression (OLR) was used to explore DIF. In DIF analyses it is crucial to control for the underlying attribute that the item is supposed to be measuring, since different groups may have different ability levels (i.e. it is necessary to 'match' on ability levels). The total score on the relevant subscale was used as the matching variable. Hence, each item was tested within its own subscale.
DIF analysis was carried out for each SF-36 item by testing the effect of the grouping factor and the interaction term (matching variable by grouping factor) once the matching variable had already been added into the model [26,27]. A macro was written in SPSS to facilitate the DIF analysis. Specifically, the following OLR steps were carried out for each item.

Ordinal logistic regression model
i) General procedure for DIF testing
Three OLR models were calculated for each item (the dependent variable).
1) Model 1: The total score (matching variable) was entered as a predictor variable.
2) Model 2: The grouping factor (i.e. binary variable) was added into Model 1 as a second predictor variable.
3) Model 3: The interaction (i.e. grouping factor by total) was added into Model 2 as the third predictor variable.
The difference in Chi-square between Model 1 and Model 3 was tested for significance as a Chi-square test with 2 degrees of freedom [26]. If significant, this indicated DIF. The difference in Chi-square between Model 2 and Model 1 gave a test of uniform DIF (same DIF effect over the construct) and the difference in Chi-square between Model 3 and Model 2 gave a test of non-uniform DIF (uneven DIF effect over the construct), each with 1 degree of freedom [26].
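The three Chi-square difference tests can be illustrated with a short sketch. It assumes that the model Chi-square statistics for Models 1 to 3 have already been obtained from fitted ordinal logistic regressions (fitting is not shown), and it uses the closed-form upper-tail probabilities of the Chi-square distribution for 1 and 2 degrees of freedom. The function names and the example values are hypothetical and are not the SPSS macro used in the study.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a Chi-square statistic for df = 1 or 2."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or df = 2 are supported in this sketch")

def dif_tests(chi2_model1, chi2_model2, chi2_model3):
    """Return (difference, p-value) for overall, uniform and non-uniform DIF."""
    overall = chi2_model3 - chi2_model1       # Model 3 v Model 1, 2 df
    uniform = chi2_model2 - chi2_model1       # Model 2 v Model 1, 1 df
    nonuniform = chi2_model3 - chi2_model2    # Model 3 v Model 2, 1 df
    return {
        "overall": (overall, chi2_sf(overall, 2)),
        "uniform": (uniform, chi2_sf(uniform, 1)),
        "nonuniform": (nonuniform, chi2_sf(nonuniform, 1)),
    }

# Hypothetical model Chi-square values for one item (not study data)
tests = dif_tests(310.0, 316.5, 317.2)
```

In this hypothetical case the overall and uniform tests fall below p = 0.05 while the non-uniform test does not, which would be read as uniform DIF before any correction for multiple testing.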
Significance testing and item level effect sizes
Different criteria have been suggested to classify items as exhibiting DIF and, as previously described [25], we classified DIF items using the criteria proposed by Swaminathan and Rogers (SR, 1990) [27]. Swaminathan and Rogers use a criterion of p < 0.05 for the difference in Chi-square between Model 3 and Model 1. Bonferroni corrections were also applied to minimise Type 1 error [10,28]. For uniform DIF the odds ratios were calculated to examine the direction of the bias.
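A minimal sketch of the two calculations mentioned here follows. The number of tests used for the Bonferroni correction is not stated in this section, so the value below is purely hypothetical, as are the function names and the example coefficient. The odds ratio is taken as the exponential of the grouping-factor coefficient from Model 2, which is the standard interpretation of a logistic regression coefficient.

```python
import math

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold after a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests

def odds_ratio(group_coefficient):
    """Odds ratio for the grouping factor; values above 1 and below 1
    indicate opposite directions of uniform DIF."""
    return math.exp(group_coefficient)

# Hypothetical values, for illustration only
threshold = bonferroni_alpha(0.05, 10)
direction = odds_ratio(0.40)
```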

ii) Assumption testing: Proportional Odds
An assumption of ordinal logistic regression is that the parameter coefficients are equivalent across the levels of the dependent variable (i.e. proportional odds). If for any model the assumption of proportional odds was violated then k-1 dichotomous variables were created for that item where k is the number of response categories.
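One way to create the k-1 dichotomous variables is by cumulative splits of the ordinal response, sketched below. The source does not state how the dichotomies were formed, so the cumulative coding (response above each cut-point v at or below it) is an assumption, and the function name and example responses are hypothetical.

```python
def cumulative_dichotomies(responses, categories):
    """Map each of the k-1 lower categories c to a 0/1 variable coding
    whether the response is above c.

    Assumption: cumulative splits are used to dichotomise the item.
    """
    cats = sorted(categories)
    return {c: [1 if r > c else 0 for r in responses] for c in cats[:-1]}

# Hypothetical responses to a three-category item (not study data)
binary_items = cumulative_dichotomies([1, 2, 3, 2], categories=[1, 2, 3])
```

Each resulting binary variable could then be analysed with a binary logistic regression using the same three-model sequence.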
Purification
If DIF items were found then they were removed from the total score (i.e. the matching variable) and all the analyses for the items in that measure were re-run. As standard, the item with DIF was included in the total score used in testing that item as this has been shown to reduce bias [29]. Purification was an iterative process, so the analyses may have been re-run a number of times until no changes in identified DIF items were seen on two consecutive analyses.
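The iterative purification loop can be sketched as follows. This is an illustrative outline under stated assumptions: `flags_dif` is a hypothetical callable standing in for the full DIF test of one item against a matching total built from the listed items, and the iteration cap is an added safeguard that is not part of the described method.

```python
def purify(items, flags_dif, max_iter=20):
    """Repeat DIF testing with flagged items removed from the matching total
    until the flagged set is identical on two consecutive runs.

    flags_dif(item, matching_items) -> bool is a hypothetical stand-in for
    the DIF test of `item` using a total score over `matching_items`.
    """
    flagged = frozenset()
    for _ in range(max_iter):
        new_flagged = set()
        for item in items:
            # Matching total: unflagged items plus the item under test itself
            matching = [i for i in items if i not in flagged or i == item]
            if flags_dif(item, matching):
                new_flagged.add(item)
        if frozenset(new_flagged) == flagged:
            return flagged
        flagged = frozenset(new_flagged)
    raise RuntimeError("purification did not stabilise within max_iter runs")

# Toy detector: "a" always shows DIF; "b" only appears to show DIF while
# the contaminated item "a" is still part of its matching total.
def toy_detector(item, matching):
    return item == "a" or (item == "b" and "a" in matching)

final_dif_items = purify(["a", "b", "c", "d"], toy_detector)
```

In the toy example, item "b" is flagged on the first run but cleared once "a" is removed from the matching total, which is the kind of change purification is intended to reveal.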
Effects of covariates
Where DIF items were identified, the analyses were repeated (steps i-ii above) with age additionally entered as a covariate in the logistic regressions to explore whether apparent DIF effects in other grouping factors were confounded by age.
Examination of the impact of DIF at the measure and subscale level
Modified measures were constructed with DIF items removed and compared to the original measure or subscale. The effect of DIF on group differences was explored using t-tests to see whether different conclusions would result if a DIF-free measure was used. The difference between the tests was also explored by repeated measures ANOVA, examining the interaction between the two different total scores and the grouping factor, i.e. does the effect size change by a significant amount depending on the total used? All totals were recalculated as averages due to the different number of items in each total.
The validity and reliability of DIF-free measures
The validity and reliability of the DIF-free measures were explored by carrying out standard psychometric tests. Construct validity was explored by examining the relationship of the DIF-free measures with subscales from the WOMAC. Cronbach's alpha was calculated for the original measure and for the DIF-free measure.
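For reference, Cronbach's alpha for a subscale with k items can be computed from the item variances and the variance of the total score, as sketched below. The function name and the example scores are hypothetical; sample variances (n-1 denominator) are assumed.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals).

    item_scores is a list of items, each a list of scores for the same
    respondents in the same order.
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("alpha requires at least two items")
    totals = [sum(scores) for scores in zip(*item_scores)]
    sum_item_variances = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - sum_item_variances / variance(totals))

# Hypothetical scores for three items and four respondents (not study data)
alpha = cronbach_alpha([[1, 2, 3, 4], [2, 1, 4, 3], [1, 3, 2, 4]])
```

Running the function on the original item set and again on the item set with DIF items removed gives the comparison of reliability described above.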
Power
Based on Crane's (2006) [28] suggestion for the number of participants in each subgroup, we required at least 80 participants per subgroup (based on the maximum of 6 response categories). All subgroups had more participants than the minimum required.

Demographics
The participants' characteristics are presented in Table 1.

Testing assumptions: unidimensionality
The ordinal factor analysis supported unidimensionality for all subscales of the SF-36, with a large difference in eigenvalues between factors 1 and 2 and a small difference in eigenvalues between factors 2 and 3. The MAP procedure also suggested only one dimension for all of the subscales. Hence there was evidence of unidimensionality for all subscales (see Table 2).

DIF items
DIF items were found across 8 of the 9 grouping factors (the exception being the grouping factor 'number of affected joints'). Of the 35 items, 16 items showed DIF for at least one of the grouping factors (without counting the mental subscale items for the mood grouping factor as these should exhibit DIF) (see Table 3). The greatest number of DIF items were identified for age and employment status (each with 6 items). For age, the items that showed DIF were: from the physical functioning subscale, PF1 'vigorous activities' and PF6 'bending, kneeling, stooping'; from the role physical subscale, RP2 'accomplishing less'; from the social functioning subscale, SF2 'interference with social activities'; and from the vitality subscale, V2 'energy' and V3 'worn out'. The items RP2 'accomplishing less' and V2 'energy' had uniform DIF, with older people reporting more limitations than they would have with an unbiased item, but the item V3 'worn out' showed uniform DIF in the other direction, with older people reporting less limitation than their overall level would suggest. The other items that showed DIF by age had non-uniform DIF. It appeared that older people with good overall physical functioning reported more problems with vigorous activities than younger people with the same level of overall physical functioning, whereas older people with mid-range or poor overall physical functioning reported fewer problems with vigorous activities than younger people with the same level of overall physical functioning (see Figure 1).
However, for PF6 'bending, kneeling and stooping', older people with very poor overall physical functioning reported having more problems with bending, kneeling and stooping than younger people who also had poor physical functioning, whereas older people with mid-range to very good physical functioning reported having fewer problems with bending, kneeling and stooping than younger people who had the same level of overall physical functioning (see Figure 2).
Older people with mid-range to very good overall social functioning reported having fewer problems with SF1 'interference with social activities' than younger people who had the same level of overall social functioning, whereas older people with poor overall social functioning reported having more problems than younger people who had the same level of overall social functioning.
For employment status, DIF was also identified for 6 items. The DIF items were both the items in the pain subscale (P1 and P2); RP1 'cut down time on work and activities' and RP4 'difficulty performing work and activities' from the role physical subscale; SF1 'interference with social activities' from the social functioning subscale; and V3 'worn out' from the vitality subscale. Uniform DIF was identified for employment status, with those not working reporting less 'difficulty performing work and activities' and being less 'worn out' than people who were working and had the same level of functioning on the relevant subscales.
The other items that showed DIF by employment status had non-uniform DIF. It appeared that at the worse end of the pain subscale, those not working reported greater intensity of pain (P1) than those working, whereas over the rest of the subscale those not working reported less pain than those working at a similar overall pain score.
The other pain item (P2) did not quite reach significance for uniform DIF (although overall DIF was significant), but it appeared that those not working reported fewer problems across the whole range than those working with the same level of overall pain. For the item RP1 'cut down time on work and activities', over most of the construct range there was little difference between the employment groups; however, at the better end of the subscale those not working reported fewer limitations than those working with similar levels of function. For those with similar social functioning in the mid-range of the subscale, those not working reported fewer limitations than those working on the item SF1 'interference with social activities', with responses similar over the rest of the subscale (see Figure 3). After adjusting for age, 3 items still showed DIF for employment status: P1 'intensity of bodily pain', RP1 'cut down time on work and activities' and SF1 'interference with social activities'.
For the other grouping factors, 2 items were identified as having DIF for gender (PF3 'lifting/carrying groceries' and M1 'nervous'), 2 items for social class (P2 'extent pain interferes with normal work' and M2 'down in the dumps'), 1 item for deprivation (RP1 'cut down time on work and activities'), 2 items for mood (excluding the mental functioning subscale; PF10 'bathing and dressing' and GH2 'get ill more easily than others'), 1 item for BMI (PF1 'vigorous activities'), and 2 items for hip v knee OA (RP1 'cut down time on work and activities' and, from role emotional, RE2 'accomplished less'). Only gender had consistently uniform DIF; women were more likely than men to report limitations in carrying groceries and nervousness at the same overall level of limitation. Uniform DIF was also identified for M2 'down in the dumps', with those in the lower social class group more likely to report more problems than those in the higher social class group at the same level of overall mental health. The other DIF item for social class, P2 'extent pain interferes with normal work', showed non-uniform DIF, with similar responses to the item between the groups over most of the low to mid range, but across the mid-range those in the lower social class group reported more limitations than those in the higher social class group. However, at the high end of the subscale, those in the lower social class group reported less pain interference on this item than those in the higher social class group. Uniform DIF was also identified for GH2 'get ill more easily than others', with those with low mood reporting more difficulties than those without low mood who had the same level of overall general health.
The item PF10 'bathing and dressing' showed non-uniform DIF for mood; those with low mood at the best end of physical functioning reported fewer problems than those without low mood with the same level of overall physical functioning. However, over the rest of the scale those with low mood reported more difficulties than those without low mood at similar levels of physical functioning.
For hip v knee OA, those with hip OA reported better scores on the item RP1 'cut down time on work and activities' than those with knee OA with the same scores on the role physical subscale across the whole range of role physical; differences with knee OA were even greater at the better end of the subscale. For the item RE2 'accomplished less' from the role emotional subscale, those with hip OA reported more limitations across the whole range of the subscale than those with knee OA with the same scores on the role emotional subscale; differences with knee OA were even greater at the more limited end of the subscale. After adjusting for age, there were no changes in the identified DIF items for gender, social class and deprivation; however, only 1 DIF item remained for mood (GH2 'get ill more easily than others') and 1 item for hip v knee (RE2 'accomplished less'), and the one item for BMI, PF1, was no longer identified as having DIF (see Table 3).

Testing for group differences using original and DIF-free measures
If DIF-free measures were used, then the apparent conclusions for two group differences would have changed. Using the original scoring method, older people had worse scores on physical functioning, role physical and role emotional but better scores on mental functioning and general health than the younger group. When the DIF-free totals were used, older people had significantly worse scores for social functioning compared to the younger group. Although the conclusions did not change for the other DIF-free subscales by age group, they all had significant changes in effect size.
Those in the lower social group, with worse deprivation scores and not working had worse scores on all subscales (other than for mental functioning in the employment group). However, when the role physical subscale was DIF adjusted, there was no longer a significant difference between deprivation groups, with a trend towards a significant change in effect size. No other DIF adjustments changed conclusions, although most changed the effect size.
With the original scoring, women had worse physical functioning, pain, social functioning, mental functioning and vitality. When the DIF-free totals were used, no conclusions changed but the effect sizes changed significantly.
High BMI was associated with worse scores on all subscales but no conclusions changed with DIF adjustment although there was a significant change in effect size.
Those with lower mood had worse scores on all of the subscales but there was no change in conclusion where DIF free totals were used and no change in effect size.
Mean subscale scores were not significantly different for those with one or two affected joints compared to those with 3 or 4 affected joints, nor for those with hip OA compared to knee OA. These conclusions did not change where DIF adjusted scores were used and no change in effect size was found (see Table 4).

The validity and reliability of DIF-free measures
The removal of the DIF items from the subscales (of more than two items) resulted in only small changes to Cronbach's alpha, except for the vitality subscale where alpha reduced from 0.86 to 0.70. This was probably due to the number of items halving from four to two. The strength of the correlations of the SF-36 physical functioning and pain subscales with the physical and pain WOMAC subscales was only slightly reduced (not shown).

Discussion
Overall, the SF-36 performed well. However, each subscale showed some evidence of DIF by at least one grouping factor: Physical functioning (4/10 items), Pain (2/2), Role physical (3/4), Social functioning (1/2), Role emotional (1/3), Mental health (excluding mood as a grouping factor, 2/5), Vitality (2/4) and General health (1/5). Previous studies that explored DIF in the SF-36 by sociodemographic factors also found many subscales with DIF items. DIF items were found in all the SF-36 subscales in a US general population and in all except social functioning and role emotional in a chronic condition population [10]. Other US based studies have examined particular subscales for the presence of DIF; DIF items were found in the physical functioning and mental health subscales in people with chronic diseases [30] and in the physical functioning subscale in people with fibromyalgia [12]. DIF items were also found when only the general health subscale was examined in a Danish general population study [11].
DIF items were found across all of the grouping factors except for 'number of affected joints'. Of the 35 items, 16 items showed DIF for at least one of the grouping factors. DIF was most commonly found for age and employment status (6 items each) and so DIF may be less of a problem within samples that are homogeneous for employment or age although when controlling for age, there was less evidence of DIF for employment (3 items). In previous studies, more items showed DIF for age and education than for other grouping factors [10], although items with DIF were also identified for gender, marital status and income [30].
In common with previous US general and chronic condition population studies, we identified DIF by age for the items 'Vigorous activities' and 'Bending, kneeling, stooping' in the physical functioning subscale [10,30] and for the items 'energy' and 'worn out' in the vitality subscale [10]. We also identified DIF by gender for the item 'Lifting, carrying groceries' from the physical functioning subscale, in common with the study of people with chronic diseases [30]. We also found six items that exhibited DIF by employment, but this was not found in the only other study that included employment status [30]. These six items included both the items from the pain subscale. The other main difference between the current study and previous studies was for the general health subscale, where previously most of the items showed DIF for age [10,11], education, gender and race [10], while in our UK OA population no DIF by socio-demographic factors was identified. We also found a small number of DIF items across the subscales that were not identified in other studies. There are many possible explanations for why we found different DIF items from the other studies. None of the previous studies focused solely on an OA population, and hence differences may be attributed to the particular difficulties and challenges faced by people with OA. Additionally, this is the first study to examine the SF-36 for DIF in a UK population. Also, many of the previous studies used general population datasets with wide age ranges, whereas the present study employed an older population only. At test level, in common with other studies, few changes in conclusions were found when DIF-free subscales were compared to the original subscales although, in most cases, effect sizes changed significantly. Two changes of conclusion were made after re-analysis: one for comparisons by age and the other for a comparison by social deprivation (Townsend index) score.
Hence, we would suggest for these comparisons, it would be particularly advisable to re-run analyses without the DIF items or use an analysis method that takes account of the DIF items. As many changes in effect size were found between the original and DIF free scales, it would also be prudent to re-analyse for all the comparisons where DIF items were found.
These results have implications for the interpretation of SF-36 results and for the understanding of the process and pattern of disablement in OA. Some of the findings suggest that individual aspects of OA may affect different groups in different ways. So, for example, women are more likely to report being nervous and having difficulty in lifting and carrying groceries than their actual level of function would suggest. There also appear to be differential effects on older people; they are more likely to report accomplishing less and having low energy than younger people with similar limitations of function, but are less likely to describe themselves as worn out. Older people with overall poor function report having fewer problems with vigorous activity than their level of overall physical functioning would suggest, but report having more problems with bending and kneeling and with interference in social activities than their level of overall function would suggest. This pattern of findings is compatible with models of ageing that suggest older people select activities they wish to preserve and work to optimise their performance of the selected activities [31], but have more problems with activities that are essential rather than chosen.
Employed people reported more work-related difficulties and feeling more worn out than would be expected for their level of functioning on the respective subscales, compared with those who are not in employment. In addition, other limitations may be more pronounced for employed people with poor function. Again, this suggests that the methods of accommodating to impairments may be important in determining the pattern of limitations and that continuing in employment may bring a specific pattern of difficulties which follow logically from the work itself.
In the study we took the approach of removing the DIF items. However, removing items may affect the content validity of the measure and comparability with other studies. Using more complex Item Response Theory-based analyses, DIF items do not need to be removed as adjusted scores can be calculated for each subgroup. Alternatively, researchers may choose to stratify by gender, age etc. in the design or analysis of studies using the SF-36. If the measure is in development, an alternative to deleting the DIF items may be to substitute similar but DIF-free items, either by re-writing or by choosing an alternative item with similar item properties. Re-writing could be facilitated by the identification of the source of DIF, for example by cognitive interviewing or by having the item reviewed by groups of experts.
The study has limitations. The sample was a community sample and thus had relatively mild OA compared to, say, an arthroplasty sample; hence the generalisability of these results to all levels of OA would need exploring. We created some groups by using median splits and it is possible that other splits may have produced different results. Additionally, although we adjusted for age, it is possible that DIF effects could be due to differences in other covariates between groups and that there are no real differences in the underlying response probabilities. We also carried out a large number of statistical tests and, although we applied a Bonferroni correction, it is possible that some findings were due to chance and thus replication would be desirable. Also, 2 scales (pain and social functioning) only had 2 items, so could not be purified, and the total score was based on a very small number of items.
In this study we used OLR to explore DIF due to the accessibility, flexibility and practicality of this method. However, another approach to DIF detection is to use the more complex item response theory (IRT) approach, including Rasch models. There is still much debate over the advantages and disadvantages of different methodological approaches to DIF [31-33]. IRT does have advantages, in particular the use of the latent variable as the matching variable rather than the sum scores used in OLR. However, IRT is a complex statistical method requiring the use of specialist software and yet produces similar results to OLR. Additionally, IRT requires good model fit, as poor model fit can contribute to false DIF detection, and yet the methods for assessing model fit are not fully established [32,33]. Nevertheless, it is possible that we may have obtained different results if an alternative DIF method had been used. It is also possible that by using different significance criteria for the OLR method we may have reached different conclusions.

Conclusions
Overall a small number of DIF items were identified, a reassuring finding in view of the frequent use of the SF-36 in OA. Although individual items exhibited DIF, this rarely extended to the measure level, although in most cases the effect sizes changed significantly. Nevertheless, where DIF items were identified it would be advisable to analyse data taking into account DIF items especially when age effects are the focus of interest. The results demonstrate the importance of DIF detection as a standard part of validity testing for measures of health outcome.