Evaluation of three patient reported outcome measures following operative fixation of closed ankle fractures

Background: Several patient reported outcome measures (PROMs) are available for assessing the outcomes of ankle fracture but few have been compared for recommended measurement properties. This study compares the measurement properties of the Lower Extremity Functional Scale (LEFS), Olerud Molander Ankle Score (OMAS) and Self-Reported Foot and Ankle Score (SEFAS) following ankle surgery.
Methods: The retrospective cohort study included 959 patients aged 18 years and over who underwent surgical treatment (ORIF) for unstable and closed ankle fractures in SE Norway. The PROMs were included in a postal questionnaire sent to patients' homes in 2015, three to six years after surgery. Missing data, structural validity, internal consistency, test-retest reliability and construct validity were assessed.
Results: Confirmatory factor analysis results showed model fit for the SEFAS and a bi-dimensional LEFS with scales of easy and difficult items. The OMAS performed less satisfactorily. Cronbach's alpha and test-retest correlations ranged from 0.82 to 0.96 and 0.91 to 0.93 respectively. The smallest detectable changes for individual and group comparisons were 14.1 to 20.6 and 0.93 to 1.55 respectively; the SEFAS performed best. As hypothesised, instrument scores were highly correlated with each other and with those for the EQ-5D and SF-36 physical functioning. Mean imputation where half or more items are completed increased usable scores by 1.4–15.7% without affecting measurement properties.
Conclusions: The three instruments largely performed satisfactorily in relation to important measurement properties but the LEFS had evidence for two dimensions relating to easier and more difficult aspects of function. Mean imputation where half or more items are completed increased the number of usable responses for all three instruments. The three instruments represent different approaches to measuring outcomes and their content should be considered carefully when choosing between them. The SEFAS is designed for a range of foot disorders including ankle fractures and has the best measurement properties in this population.


Background
Ankle fractures constitute approximately 9% of all fractures, with an incidence of 122 per 100,000 people [1] and an incidence requiring hospitalisation of 83 per 100,000 people [2]. A systematic review concluded that there was insufficient evidence as to whether conservative management or surgery gives the best long-term outcomes in adult patients [1]. Moreover, a systematic review of competing surgical technologies concluded that further evaluation, including greater consideration of long-term outcomes, was necessary [3].
Ankle fractures reduce quality of life and, particularly in older people, may cause loss of independence. It is important that studies evaluating outcomes in these patients include valid and reliable patient reported outcome measures (PROMs) that reflect important concerns of patients [4]. There are a large number of ankle-specific PROMs [5] but few have been developed with the input of fracture patients or sufficiently evaluated for measurement properties [4]. Clinicians and researchers wishing to select an ankle-specific PROM are faced with a confusing array of instruments with little information on measurement properties in ankle fracture patients. When a choice of instruments exists, the concurrent evaluation of their measurement properties in the patients and health care setting of interest is highly informative [6]. Systematic reviews of measurement properties are also informative; however, the two published reviews focus on ankle problems more generally [5] or ligament injuries [7].
The Lower Extremity Functional Scale (LEFS) [8], Olerud and Molander Ankle Score (OMAS) [9] and Self-Reported Foot and Ankle Score (SEFAS) [10] have been widely applied but have undergone limited testing for measurement properties in patients with ankle fractures. These three instruments represent different approaches to measuring outcomes in patients with ankle fractures; the LEFS focuses on the lower limb, the OMAS is ankle fracture specific, and the SEFAS is foot and ankle specific. None of them has been tested for structural validity, which would provide evidence to support their scoring as unidimensional scales. This study compares important measurement properties of these instruments, including reliability and validity [11].

Study population
The retrospective cohort study included 959 patients who underwent surgical treatment (ORIF) for unstable and closed ankle fractures at two hospitals in SE Norway [12]. Patients were 18 years of age and over and were treated in a three-year period from January 1, 2009. They received a postal questionnaire that included the LEFS, OMAS and SEFAS in January 2015; 299 respondents received a test-retest questionnaire at six weeks. Non-respondents received a reminder at four weeks.

Patient-reported outcome measures
Norwegian translation of the three instruments followed international guidelines [11], including two independent forward translations and one independent backward translation, with a meeting to agree on the final Norwegian versions.
The LEFS comprises 20 items relating to physical function and daily activities with a five-point scale from 'extreme difficulty or unable to perform' to 'no difficulty' [8]. Items are summed to give a score from 0 to 80 where 80 is the best possible score. The mean of the completed items is used when up to four items are missing [8,13] and normative data is available to aid the interpretation of LEFS scores [14]. Two studies have assessed the measurement properties of LEFS in patients with ankle fracture and there is evidence for reliability, validity and responsiveness in Australian patients [15] and in Finnish patients undergoing surgery due to musculoskeletal pathology of the foot and ankle [16].
The OMAS comprises nine items relating to symptoms, physical function and daily activities [9]. The response scales vary from binary to five-point, with clinical scoring that reflects the level of disability for individual items. Item responses are summed to give a score from 0 to 100 where 100 is the best possible score. The instrument has evidence for test-retest reliability and construct validity in patients with ankle fracture in Sweden [17] and Turkey [18].
The SEFAS comprises twelve items relating to pain, limping, swelling, use of orthotics and walking. The five-point scales reflect item content and sum to give a score from 12 to 60 where the former represents normal function [10,19]. The mean of the completed items is used when one item is missing. The instrument has not been evaluated solely in patients with ankle fracture but has evidence for reliability, validity and responsiveness in Swedish patients with foot and ankle disorders undergoing surgery [10,19]. For purposes of comparison, scores for the LEFS and SEFAS are also presented on a 0 to 100 scale where higher scores represent better outcomes.
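The rescaling to a common 0 to 100 metric can be illustrated with a minimal sketch. The function name `rescale_0_100` and its arguments are illustrative assumptions and are not taken from the instrument manuals; the endpoints in the examples are the score ranges stated in this section.

```python
def rescale_0_100(raw, worst, best):
    """Linearly map a raw score onto 0-100 so that `best` becomes 100.

    `worst` and `best` are the raw scores for the worst and best possible
    health; `best` may be numerically lower than `worst` (reversed scale).
    """
    if worst == best:
        raise ValueError("worst and best must differ")
    lo, hi = min(worst, best), max(worst, best)
    if not lo <= raw <= hi:
        raise ValueError("raw score outside the instrument range")
    return 100.0 * (raw - worst) / (best - worst)


# LEFS: 0 (worst) to 80 (best), as described above.
lefs_100 = rescale_0_100(60, worst=0, best=80)

# SEFAS: the 12 to 60 range stated above, with 12 taken as normal function.
sefas_100 = rescale_0_100(24, worst=60, best=12)
```

Passing the worst and best endpoints explicitly, rather than a minimum and maximum, lets the same function handle scales on which a lower raw score denotes better health.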
Two generic instruments were also included in the questionnaire. The EQ-5D-3L includes five items with a three-point response scale which are scored to give a single index [20]. The SF-36 physical function scale comprises ten items with a three-point scale which sum to a 0 to 100 scale where 100 is the best possible health [21].

Statistical analysis
The measurement properties tested and related terminology follow the COSMIN checklist [11]. Levels of missing data were assessed at the item and scale level with the latter also including imputation for missing data where half or more item responses were present. For comparison, all items were recoded from 0 to 4 where 4 is the best possible health.
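The imputation rule described above can be sketched as follows. This is a hedged illustration rather than the analysis code used in the study: the function name `scale_score`, the use of `None` for a missing response and the default threshold argument are all assumptions made for the example.

```python
from typing import Optional, Sequence


def scale_score(items: Sequence[Optional[float]],
                min_fraction_complete: float = 0.5) -> Optional[float]:
    """Sum score with person-mean imputation.

    Missing items (None) are replaced by the mean of the respondent's
    completed items, provided at least `min_fraction_complete` of the
    items were answered; otherwise no score is returned.
    """
    if not items:
        return None
    answered = [x for x in items if x is not None]
    if len(answered) < min_fraction_complete * len(items):
        return None
    person_mean = sum(answered) / len(answered)
    # Equivalent to filling each missing item with the person mean and summing.
    return person_mean * len(items)


# Items recoded 0 to 4; two of four items completed, so a score is produced.
partial = scale_score([4, 3, None, None])

# Only one of four items completed, so no score is produced.
too_sparse = scale_score([4, None, None, None])
```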
Confirmatory factor analysis (CFA) with weighted least squares estimation was used to assess structural validity. Model fit was assessed with the comparative fit index (CFI), Tucker-Lewis Index (TLI) and root mean square error of approximation (RMSEA) [22-24]. The CFI and TLI should be greater than 0.90 and the RMSEA between 0.06 and 0.08 for acceptable fit [24,25]. Internal consistency was assessed using item-total correlation, which should exceed 0.4, and Cronbach's alpha, which should exceed 0.7 and 0.9 for use in groups and individual patients, respectively [26]. The intraclass correlation coefficient was used for estimating reliability within a two-way mixed effects model with absolute agreement. Weighted kappa was used for assessing individual item reliability [27]. The standard error of measurement (SEM) and smallest detectable change (SDC) were calculated. The former is the square root of the total error variance. For individuals the SDC is 1.96 × √2 × SEM and for groups, the SDC for individuals is divided by √n [26].
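Two of the quantities defined above can be illustrated with a short sketch. It is not the code used in the study, which relied on statistical packages; the function names are assumptions, Cronbach's alpha is computed with the standard formula from complete responses only, and the SEM value in the example is hypothetical.

```python
import math
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha from complete data.

    `rows` holds one list of item scores per respondent.
    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(rows[0])
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def sdc_individual(sem):
    """Smallest detectable change for an individual: 1.96 * sqrt(2) * SEM."""
    return 1.96 * math.sqrt(2) * sem


def sdc_group(sem, n):
    """Smallest detectable change for a group of n: individual SDC / sqrt(n)."""
    return sdc_individual(sem) / math.sqrt(n)


# Hypothetical SEM of 5.0 points on a 0-100 scale, group of 100 patients.
individual_change = sdc_individual(5.0)
group_change = sdc_group(5.0, n=100)
```

Because the group SDC divides the individual SDC by √n, it is always the smaller of the two for any group of more than one patient.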
Hypothesis testing was used to assess the validity of the three ankle instrument scores through comparisons with those for the EQ-5D and SF-36 physical functioning and with clinical variables. These instruments were included in previous tests of validity for the LEFS [13] and SEFAS [10] and continue to be the most widely evaluated and applied PROMs [6]. It was hypothesised that scores for the three instruments would be highly correlated, over 0.7. High levels of correlation were expected with SF-36 physical functioning and particularly for the LEFS, given the overlap in content. The three instruments include items that overlap with two or more EQ-5D items and hence high levels of correlation were expected for EQ-5D scores and the EQ-5D mobility item. Moderate levels of correlation in the range 0.5 to 0.7 were expected for the EQ-5D usual activities and pain items. Lower levels of correlation in the range 0.3 to 0.5 were expected for the EQ-5D self-care and anxiety/depression items. The lowest levels of correlation, under 0.3, were hypothesised for the clinical variables including ASA classification, BMI, duration of operation and fracture classification.
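The hypothesised ranges above can be expressed as a simple classification, sketched here for illustration only. The band labels, the function names and the handling of values that fall exactly on a boundary are assumptions; the section does not specify them.

```python
def correlation_band(r):
    """Classify the absolute size of a correlation into the hypothesised bands.

    Labels are illustrative: "high" (over 0.7), "moderate" (0.5 to 0.7),
    "low" (0.3 to 0.5) and "very_low" (under 0.3). Boundary values are
    assigned to the higher band, which is an assumption.
    """
    size = abs(r)
    if size > 0.7:
        return "high"
    if size >= 0.5:
        return "moderate"
    if size >= 0.3:
        return "low"
    return "very_low"


def hypothesis_met(r, expected_band):
    """True when the observed correlation falls in the expected band."""
    return correlation_band(r) == expected_band


# Hypothetical observed correlation checked against a "high" hypothesis.
met = hypothesis_met(0.75, "high")
```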
LISREL was used for the CFA and PASW Statistics 18.0 was used for the remainder of the statistical analysis.

Study population
The questionnaire was returned by 567 (59.1%) patients. Table 1 shows the characteristics of respondents. There were 182 (60.9%) respondents to the test-retest questionnaire.

Statistical analysis
Levels of missing data ranged from 1.2 to 6.2% across the three instruments (Table 2). Levels of missing data were highest for items assessing higher levels of function including 'hopping' and 'running' for the LEFS and 'jumping' for the OMAS. For the LEFS, the 'getting in and out of bath' item had the highest level of missing data. Use of mean imputation for missing data increased the number of useable scores by 1.4, 6.9 and 15.7% for the LEFS, SEFAS and OMAS respectively (Table 2).
Item mean scores were mostly skewed towards the best possible scores across instruments (Table 2). For the LEFS, the lowest scores denoting poorer health were for 'hopping' and the highest scores were for 'walking between rooms'. For the OMAS, the lowest scores were for 'stiffness' and the highest scores were for 'assistive devices'. For the SEFAS the highest scores were for 'getting up from a chair' and the lowest scores were for 'usual pain level'.
Model fit for the unidimensional SEFAS was good according to all criteria (Table 3). The LEFS and OMAS had an RMSEA that was over the criterion of 0.08. There was support for a bi-dimensional LEFS with scales relating to easy and difficult items. Item-total correlations were over 0.4 for all items with the exceptions of 'assistive devices' and 'use of special innersoles/shoes' for the OMAS and SEFAS respectively (Table 2). Cronbach's alpha ranged from 0.82 to 0.96 for the OMAS and LEFS respectively. Table 4 shows that there were small but statistically significant (p < 0.05) score improvements across instruments at retest. Weighted kappa for the individual items indicated [...]
Table 5 shows that the hypotheses used in validity testing were largely met but some correlations were higher than expected. The lowest correlation between the three ankle instruments was 0.84 (LEFS and SEFAS) and the highest was 0.89 (SEFAS and OMAS). High levels of correlation were found for SF-36 physical functioning scores, the highest being for the LEFS, which were comparable to those between the LEFS and the other specific instruments. Moderate to high levels of correlation were found for the EQ-5D mobility and pain/discomfort items. For the three instruments, the correlations with the EQ-5D usual activities item were of a similar moderate level and those with the remaining two items of self-care and anxiety/depression were of a similar low level. Correlations with the clinical variables were all of a low level; the lowest were for BMI and the highest were for the duration of operation. The use of adjusted scores had very little effect on the size of the correlations.

Discussion
There was evidence that the LEFS might be bidimensional in this group of patients, which contrasts with its use in applications as a unidimensional measure of lower extremity function. Exploratory factor analysis (data not shown) showed that the items loaded onto two clearly discernible factors relating to easier and more difficult aspects of function, which gave better results in the CFA. The LEFS, with 20 items, is a good deal longer than the OMAS and SEFAS, and such a lengthy instrument that assesses one aspect of health is unusual for PROMs. The OMAS and SEFAS are shorter and have acceptable levels of internal consistency and test-retest reliability. The SEFAS has a lower SEM and hence is more capable of measuring change in individuals and groups of patients.
The current study followed previous studies in treating the LEFS as unidimensional in other aspects of testing but results should be treated with caution until further evidence becomes available. This study is a long-term follow-up of patients and the evidence may be different for patients in the shorter term post-surgery. LEFS items may differ in their relevance in these patients. For example, more difficult items including 'running', 'squatting' and 'walking a mile' might have greater relevance at follow-up, as shown by their much lower ceiling effects compared to the remaining items. The inclusion of easier items in the same scale might mask important effects at follow-up. Eighty percent or more of patients had the best possible score on seven LEFS items compared to just one item in each of the OMAS and SEFAS. If long-term outcomes are the focus, then these two instruments might be more responsive to change than the LEFS.
There were low levels of missing data at the item level with very few items having more than 5% missing. These items tended to relate to more difficult activities undertaken less frequently. Hence, the levels of missing data may reflect uncertainty on the part of the patients regarding their performance, or that they may have held back from undertaking such activities due to concerns about the ankle. Such items include running for the LEFS and jumping for the OMAS. The LEFS item 'getting in and out of bath' denotes low levels of function and had relatively high levels of missing data, which may be because many Norwegians do not have a bathtub at home.
All items, with the exception of the assistive devices item in the OMAS and the special innersoles/shoes item in the SEFAS, had acceptable item-total correlations. This indicates that these two items might not be adequately contributing to the construct being measured. For example, patients might be using assistive devices, innersoles and shoes for reasons other than the severity of their ankle problem, including other health problems. These items might be considered for removal if future studies also find that they make a limited contribution in a similar patient population. Cronbach's alpha and test-retest correlations were acceptable for the three instruments. Alpha is dependent on the number of items and hence the highest level was expected for the LEFS.
Scores for the three instruments were highly correlated, which is evidence that they are assessing very similar aspects of health and have convergent validity. The highest correlations were found between scores for the OMAS and SEFAS, which reflects their ankle-specific focus compared to the focus on lower extremity function of the LEFS. These two instruments also had slightly higher correlations with the EQ-5D, including individual EQ-5D items. The LEFS had the highest correlations with SF-36 physical functioning scores, and several LEFS items that are not covered by the OMAS and SEFAS have similar content to this SF-36 scale.
For the LEFS, Cronbach's alpha, the test-retest reliability correlation coefficient and the SEM were similar to those previously reported [16]. The OMAS had a slightly higher Cronbach's alpha than in the previous study [17]. The test-retest reliability correlation coefficient was slightly lower and the SEM slightly higher than those previously reported [17]. It follows that the smallest detectable change was larger: 19-20 compared to 16 [17]. The SEFAS had a higher alpha and a similar level of test-retest reliability compared to the Swedish study that included patients with hind foot and ankle disorders [10]. The SEM and SDC were not reported in that study.
Recommendations for handling missing data were not available for the OMAS. The conventional approach is mean imputation when half or fewer items are missing. Compared to the approach that has been recommended for the SEFAS [10], this form of imputation increases the number of patients with final scores by 7%. This reduces the number of patients that must be recruited in evaluative studies including clinical trials. Mean scores and the results of testing were very similar irrespective of the methods of handling missing data. For example, levels of correlation with the EQ-5D scores were virtually unchanged. Based on these study findings, the conventional approach will increase useable scores by up to 16% for the OMAS.
Clinicians and researchers selecting PROMs for this group of patients should consider using the SEFAS in preference to the LEFS and OMAS. There is uncertainty surrounding the structural validity of the LEFS; it also has greater respondent burden and a broader focus on lower limbs rather than the foot and ankle. The broader focus was reflected in correlations between the LEFS and SF-36 physical function scores which were higher than those between the LEFS, OMAS and SEFAS scores. The OMAS has more complex scoring, performed less satisfactorily than the SEFAS in terms of structural validity and had a higher SEM. The use of mean imputation where half or more items are completed reduces the number of patients needed for recruitment with negligible effects on measurement properties.

Study limitations
Important limitations include the follow-up period, potential respondent bias, choice of instruments and lack of testing for other measurement properties. The median time between surgery and questionnaire completion was 4.3 (IQR 3.9-5.1) years [12], and it is important that the measurement properties of the three instruments are assessed at other clinically important follow-up periods. This limitation means that it was not possible to recommend modifications to the instruments including the use of a bi-dimensional LEFS and removal of items across the instruments. The 59% response rate to the questionnaire is acceptable for this type of study but there were some statistically significant differences between respondents and non-respondents to the questionnaire [12]. Other instruments are available that have undergone limited testing in patients with ankle fracture [5], but respondent burden meant that only three instruments could be included in this study. The design of the study also meant that instrument responsiveness to change could not be assessed. This is an important criterion which further aids the selection of instruments for evaluative studies including clinical trials [11].
Assessment of the SEM and SDC followed the COSMIN checklist [11] and both have been previously reported for the LEFS and OMAS [16,17]. The SDC is the level of change that can be considered real change above measurement error and does not consider whether the change is important. The minimal clinically important difference (MCID) and minimal important change (MIC) are levels of change that patients consider important and further help score interpretation [11]. The MIC has not been reported for the study instruments in this patient population and assessment was not possible within the current study design. It is recommended that the MIC be reported in future studies.

Conclusion
This is the first study that has concurrently evaluated these instruments in patients following surgery for ankle fracture. Moreover, the LEFS and SEFAS have not been previously evaluated solely in patients with ankle fracture. The three instruments have acceptable evidence for internal consistency, test-retest reliability and construct validity. However, there are some doubts about the unidimensionality of the LEFS in this population and it has a relatively large number of items with the largest ceiling effects representing the highest level of functioning. Further testing of these instruments is recommended in patients with ankle fracture including shorter-term follow-up following surgery. Responsiveness to changes in health should also be assessed with instrument completion taking place before and after an intervention of known efficacy. Instrument content should be carefully considered when choosing between these three instruments. The LEFS is specific to the lower extremities and includes a relatively large number of items. The OMAS is designed to be ankle fracture specific and includes clinical weightings whereas the other two instruments are based on simple summed scales. The SEFAS is designed for a range of foot disorders including ankle fractures and has the best measurement properties in this population. Finally, it is recommended that mean imputation is used for missing responses when half or more items are completed by patients.