Details of the RCT on which this work is based are reported in Hellum et al. Between April 2004 and September 2007, 172 patients with diagnosed chronic low back pain (LBP) and degenerative disc disease were randomized to either surgery with total disc replacement or multidisciplinary rehabilitation. The results from this study have been published previously.
Briefly, data were collected in a multicentre randomized controlled trial involving the five university hospitals in Norway. Inclusion criteria included age between 25 and 55 years, LBP for more than a year, degenerative changes in the intervertebral disc in one of the two lowest levels of the lumbar spine and an Oswestry Disability Index (ODI) score of 30 percentage points or more. Exclusion criteria included generalized chronic pain syndrome and degeneration established in more than two levels. Part of this study was an economic evaluation of chronic low back pain treatment. Patients were randomized to either surgery with insertion of an artificial disc or to non-surgical treatment (a multidisciplinary back rehabilitation program).
The outcomes of patients who completed the SF6D, EQ5D, and ODI at baseline and at 2-year follow-up were included in this study.
The ODI is a back-specific questionnaire [16, 17]. Patients rate physical disability in activities of daily living due to low back pain in 10 questions, each of which has verbal response alternatives. Ratings are summed to yield a score ranging from 0 (not disabled at all) to 100 (completely disabled). We used the Norwegian translation of the validated questionnaire (version 2.0).
The SF6D utility index comprises 11 items from the SF-36 that were revised into a six-dimensional health state classification system. The six dimensions are physical functioning, role limitations, social functioning, pain, mental health, and vitality. It is a continuous outcome scored on a 0.29–1.00 scale, with 1.00 indicating full health. SF6D health states were evaluated against a normal population using the Standard Gamble (SG) method. We used the United Kingdom (UK) tariff. The SF6D was calculated from the Norwegian SF-36 (version 2) using syntax files in SPSS 17 (SPSS, New York, US). The syntax files were kindly provided by Dr J. Brazier, University of Sheffield, UK.
For the EQ5D utility index, responses to a questionnaire with five dimensions, each comprising three levels, are converted into an index ranging from −0.59 to 1.00, with 1.00 indicating full health. The 243 possible health states on the EQ5D are evaluated against a normal population using the time trade-off (TTO) method [20, 21]. We used the Norwegian version of the EQ5D and syntax files obtained from the EQ5D society using the UK tariff to calculate the index.
Seven-point scale for patient assessment
Many authors suggest a seven-point scale to assess patient outcome in terms of a global score. To the question “How much benefit do you think you have had from the treatment you have received?”, patients answered on a 7-category response scale that ranged from “I am completely disabled” to “I am completely recovered”.
We followed the definitions and recommendations of the COSMIN (COnsensus-based Standards for the selection of health Measurement INstruments) checklist when analyzing the psychometric properties of the two utility indexes and the ODI in this study.
Unless otherwise stated, SPSS version 17 was used for the statistical analysis.
Measurement error concerns the systematic and random error of a patient's score that is not attributed to true changes in the construct to be measured [24]. We used the standard error of measurement (SEM) to express instrument imprecision [22]. The advantage of using the SEM is that it is considered to be an attribute of the measure and not a characteristic of the sample itself [28]. The SEM can be calculated from a test-retest study or in a group of stable patients. The SEM in this study was calculated as:
$$\mathrm{SEM} = s_w = \sqrt{\frac{\sum_{i=1}^{n} d_i^{2}}{2n}}$$
where $s_w$ is the within-subject standard deviation, $d_i$ is the difference between the two observations for patient $i$ among patients who reported “unchanged” on a four-point scale between the 3- and 6-month follow-ups, and $n$ is the number of subjects. The $s_w$ statistic is also called the SEMconsistency.
The smallest change that exceeds measurement error and noise at a 95% confidence level is defined as:
$$\mathrm{MDC}_{95} = 1.96 \times \sqrt{2} \times \mathrm{SEM}$$
Here, the $\sqrt{2}$ is introduced because there are two measurements for each patient. The minimum detectable change (MDC) at a 95% confidence level is denoted MDC95. With a change in scale value ≥MDC95, we can be 95% certain that a change in the measured underlying construct has really occurred.
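The SEM and MDC95 calculations can be illustrated with a minimal sketch. It assumes the common within-subject form, SEM = sqrt(sum of squared paired differences / 2n), and MDC95 = 1.96 × sqrt(2) × SEM. The function names and the example differences are hypothetical and are not taken from the study data or the SPSS syntax.

```python
import math


def sem_from_differences(differences):
    """Within-subject SD (SEM) from paired differences d_i.

    Assumed form: sqrt(sum(d_i^2) / (2 * n)), with n the number of subjects.
    """
    n = len(differences)
    if n == 0:
        raise ValueError("at least one paired difference is required")
    return math.sqrt(sum(d * d for d in differences) / (2 * n))


def mdc95(sem):
    """Minimum detectable change at 95% confidence: 1.96 * sqrt(2) * SEM."""
    return 1.96 * math.sqrt(2) * sem


# Hypothetical 3- vs 6-month differences for patients who reported "unchanged".
example_differences = [0.05, -0.10, 0.02, 0.08, -0.04]
example_sem = sem_from_differences(example_differences)
print(f"SEM = {example_sem:.3f}, MDC95 = {mdc95(example_sem):.3f}")
```

A change score would then be compared with the MDC95 of the same instrument, since the threshold is expressed in that instrument's own scale units.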
To assess the agreement between the EQ5D and SF6D, a Bland-Altman plot was constructed. For each patient, the difference between the EQ5D and SF6D change scores was plotted against the mean of the two change scores. Limits of agreement (LoA), defined as the mean difference ± 1.96 × SDdifference, were also constructed.
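The quantities behind a Bland-Altman plot can be sketched as follows. This is an illustration only: the function name and the change scores are hypothetical, and the sample standard deviation (n − 1 denominator) is an assumption about how SDdifference is computed.

```python
import statistics


def bland_altman(scores_a, scores_b):
    """Return per-patient means, differences, the bias and the limits of agreement."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both instruments need one score per patient")
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    means = [(a + b) / 2 for a, b in zip(scores_a, scores_b)]
    bias = statistics.mean(differences)
    sd_difference = statistics.stdev(differences)  # sample SD (assumption)
    limits = (bias - 1.96 * sd_difference, bias + 1.96 * sd_difference)
    return means, differences, bias, limits


# Hypothetical change scores for two utility indexes (not study data).
eq5d_change = [0.30, 0.10, 0.45, -0.05, 0.20]
sf6d_change = [0.12, 0.08, 0.20, 0.01, 0.10]
means, differences, bias, limits = bland_altman(eq5d_change, sf6d_change)
print(f"bias = {bias:.3f}, LoA = ({limits[0]:.3f}, {limits[1]:.3f})")
```

Plotting `differences` on the vertical axis against `means` on the horizontal axis, with horizontal lines at the bias and the two limits, gives the usual display.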
Structural validity concerns the degree to which the scores of an instrument are an adequate reflection of the dimensionality of the construct to be measured. Both the EQ5D and the SF6D are constructed to measure general health-related quality of life (HRQoL) along a continuous scale (from low to high). Using Item Response Theory (IRT), the unidimensionality of the two utility indexes was tested. The category ordering of the questionnaire items (the probability of moving from an easier to a harder accomplished category of item answers in parallel with being increasingly disabled) was also tested.
We employed the unrestricted (partial credit) polytomous Rasch model (for general information about fit to the Rasch model, see Additional file 1) and the test proposed by Smith to reveal unidimensionality. The SF6D and EQ5D were tested for unidimensionality in a principal component analysis (PCA). We performed a test equating procedure with baseline values from the SF6D and the EQ5D. The response of each patient to a question was tested against what was predicted by the Rasch model. Deviation from the model is expressed in residuals. Independent t-tests were used to test whether the magnitude of the residuals represents a significant deviation. The CI calculated for this was 95%. We carried out a binomial test for the proportion of t-tests outside the range of −1.96 to 1.96. The software used in the Rasch analysis was RUMM 2020 (RUMM Laboratory Pty Ltd.).
Criterion validity concerns the degree to which the scores of an instrument are an adequate reflection of a “gold standard” when this is present. In this analysis we compared the scores of the EQ5D and SF6D to the disease-specific instrument ODI. The rationale was that the ODI has been found to be a responsive and valid measure for patients with LBP [16, 18, 36] and that an improvement assessed by the ODI should be correlated with an improvement assessed by the two utility indexes.
The Spearman rank correlation coefficient (r), with 1000 bootstrap replications of the baseline scores, was calculated to assess the correlation between the EQ5D and ODI scores and between the SF6D and ODI scores.
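A bootstrapped Spearman correlation can be sketched in plain Python as below. The text does not state how the bootstrap replications were summarized, so the percentile interval used here is an assumption, as are the function names, the seed and the example scores.

```python
import math
import random
import statistics


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman r computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when one variable is constant
    return sxy / math.sqrt(sxx * syy)


def bootstrap_spearman(x, y, replications=1000, seed=0):
    """Point estimate plus a 95% percentile interval (assumed summary)."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(replications):
        idx = [rng.randrange(n) for _ in range(n)]  # resample patients with replacement
        r = spearman([x[i] for i in idx], [y[i] for i in idx])
        if not math.isnan(r):
            estimates.append(r)
    estimates.sort()
    lower = estimates[int(0.025 * len(estimates))]
    upper = estimates[int(0.975 * len(estimates)) - 1]
    return spearman(x, y), (lower, upper)


# Hypothetical baseline scores: higher ODI (more disability), lower utility.
odi = [32, 40, 46, 52, 58, 64, 70, 38]
eq5d = [0.69, 0.59, 0.52, 0.16, 0.09, -0.02, 0.08, 0.62]
r, interval = bootstrap_spearman(odi, eq5d)
print(r, interval)
```

Patients are resampled as pairs, so each replication keeps a patient's two scores together.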
Responsiveness is defined as the ability of an instrument to detect change over time in the construct to be measured. Responsiveness was assessed by using the ODI and the seven-point global scores at 2-year follow-up as “gold standard”. First, we calculated the Spearman rank correlation coefficient (r) with 1000 bootstrap replications for the correlation between change scores from baseline to 2-year follow-up for the EQ5D, SF6D and ODI. Second, we analyzed the area under the receiver operating characteristic (ROC) curve for the change scores of the EQ5D, SF6D and ODI by using a dichotomization of the patient global scores as follows: categories 1 to 3 were considered “improved” and categories 4 to 7 were considered “non-improved”. Sensitivity was defined as the proportion of patients who were correctly classified as “improved” and specificity was defined as the proportion of patients who were correctly classified as “non-improved”. A ROC curve was then calculated by plotting sensitivity against 1 − specificity for every possible change score from baseline to 2-year follow-up for the EQ5D, SF6D and ODI, using the global score as an anchor [37, 38]. The area under the ROC curve (AUC) was then calculated. This value corresponds to the probability of correctly classifying a patient as having improved when this is really the case and reflects how responsive the instruments are in detecting a change in the underlying construct.
The calculation of ROC curves was performed with MedCalc statistical software (version 11.1.1 for Windows; Brussels, Belgium).
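For illustration only, and not a reproduction of the MedCalc procedure, the anchor dichotomization and the AUC can be sketched as follows. The AUC is computed through its rank interpretation: the probability that a randomly chosen improved patient has a larger improvement score than a randomly chosen non-improved patient, with ties counted as one half. The sketch assumes scores are oriented so that higher means more improvement (for the ODI this would mean using the decrease in score). All names and data are hypothetical.

```python
def is_improved(global_category):
    """Dichotomize the seven-point global score: 1-3 improved, 4-7 non-improved."""
    if not 1 <= global_category <= 7:
        raise ValueError("global category must be between 1 and 7")
    return global_category <= 3


def auc(improvement_scores, improved_flags):
    """Area under the ROC curve via pairwise comparison of the two groups."""
    improved = [s for s, flag in zip(improvement_scores, improved_flags) if flag]
    not_improved = [s for s, flag in zip(improvement_scores, improved_flags) if not flag]
    if not improved or not not_improved:
        raise ValueError("both groups must contain at least one patient")
    wins = 0.0
    for p in improved:
        for q in not_improved:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties contribute one half
    return wins / (len(improved) * len(not_improved))


# Hypothetical utility change scores and global categories (not study data).
change_scores = [0.35, 0.20, 0.10, 0.15, 0.02, -0.05]
global_categories = [1, 2, 3, 4, 5, 7]
flags = [is_improved(c) for c in global_categories]
print(auc(change_scores, flags))
```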
Interpretability concerns the qualitative meaning of quantitative scores or change in scores. A core question is: “What is the smallest change in score in the construct to be measured which patients consider important?” This is expressed as the Minimal Important Change (MIC) value, and is calculated based on the sensitivity and specificity results from the ROC analysis described above. The cut-off value for differentiating between patients with or without improvement at optimum sensitivity and specificity was determined using ROC analysis. This corresponds to the upper left point on the ROC curve and it can be interpreted as the point or value that yields the lowest overall misclassification [25, 39].
The study was evaluated and approved by the regional Committee for Medical Research Ethics in east Norway. Storage of data was allowed by the Norwegian Data Inspectorate. The study was conducted in accordance with the Helsinki Declaration and the ICH-GCP guidelines and registered at ClinicalTrials.gov under the identifier NCT00394732.
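One way to locate such a cut-off is sketched below. It assumes that the "upper left point" is operationalized as the candidate threshold whose (1 − specificity, sensitivity) pair lies closest to the corner (0, 1); other criteria, such as maximizing sensitivity + specificity, are also in use and may give a different value. Scores are assumed to be oriented so that higher means more improvement, and the function name and data are hypothetical.

```python
def mic_cutoff(improvement_scores, improved_flags):
    """Return (cut-off, sensitivity, specificity) closest to the upper left ROC corner.

    A patient is classified as improved when the score is >= the cut-off.
    """
    n_improved = sum(1 for flag in improved_flags if flag)
    n_not_improved = len(improved_flags) - n_improved
    if n_improved == 0 or n_not_improved == 0:
        raise ValueError("both groups must contain at least one patient")
    best = None
    for cutoff in sorted(set(improvement_scores)):
        true_pos = sum(1 for s, f in zip(improvement_scores, improved_flags) if f and s >= cutoff)
        true_neg = sum(1 for s, f in zip(improvement_scores, improved_flags) if not f and s < cutoff)
        sensitivity = true_pos / n_improved
        specificity = true_neg / n_not_improved
        distance = (1 - sensitivity) ** 2 + (1 - specificity) ** 2
        if best is None or distance < best[0]:
            best = (distance, cutoff, sensitivity, specificity)
    return best[1], best[2], best[3]


# Hypothetical change scores with the anchor-based classification (not study data).
scores = [0.30, 0.25, 0.20, 0.05, 0.00, -0.05]
improved = [True, True, True, False, False, False]
print(mic_cutoff(scores, improved))
```

When several thresholds are equally close to the corner, this sketch keeps the lowest one; that tie-breaking rule is a design choice here, not something stated in the text.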