
Predicting patient-reported outcomes following lumbar spine surgery: development and external validation of multivariable prediction models



This study aimed to develop and externally validate prediction models of spinal surgery outcomes based on a retrospective review of a prospective clinical database, uniquely comparing multivariate regression and random forest (machine learning) approaches, and identifying the most important predictors.


Outcomes were change in back and leg pain intensity and Core Outcome Measures Index (COMI) from baseline to the last available postoperative follow-up (3–24 months), defined as minimal clinically important change (MCID) and continuous change score. Eligible patients underwent lumbar spine surgery for degenerative pathology between 2011 and 2021. Data were split by surgery date into development (N = 2691) and validation (N = 1616) sets for temporal external validation. Multivariate logistic and linear regression, and random forest classification and regression models, were fit to the development data and validated on the external data.


All models demonstrated good calibration in the validation data. Discrimination ability (area under the curve) for MCID ranged from 0.63 (COMI) to 0.72 (back pain) in regression, and from 0.62 (COMI) to 0.68 (back pain) in random forests. The explained variation in continuous change scores spanned 16–28% in linear regression, and 15–25% in random forest regression. The most important predictors included age, baseline scores on the respective outcome measures, type of degenerative pathology, previous spinal surgeries, smoking status, morbidity, and duration of hospital stay.


The developed models appear robust and generalisable across different outcomes and modelling approaches but produced only borderline acceptable discrimination ability, suggesting the need to assess further prognostic factors. External validation showed no advantage of the random forest approach.



Chronic back pain is the single greatest cause of years lived with disability worldwide [1], with annual direct healthcare costs in the UK of £1,632 million [2]. Spinal surgery is the largest component of healthcare expenditure for managing low back pain (costing £5000–£10,000 per patient) and its rate has doubled over 15 years, from 2.5 to 4.9 per 10,000 adults [3]. However, success rates of lumbar spine surgery are highly variable: only about 60% of patients achieve reductions in pain of at least a minimal clinically important difference (MCID), and one in five experience persistent long-term pain after surgery [3,4,5].

Reliable predictive factors could maximise patient benefit and cost-effectiveness of surgery. Although several systematic reviews concluded that medical, sociodemographic, and psychological factors are linked to improvement in pain and disability following lumbar spine surgery [6,7,8,9,10], we lack clear clinical guidelines on the reliable predictors [11] and predictive factors are rarely formally documented.

Statistical prediction models enable probabilistic estimation of treatment outcome given a set of preoperative patient data. Previously developed regression-based models demonstrated good ability to discriminate between patients who did and did not achieve MCID in pain or disability after lumbar spine surgery [12,13,14,15,16]. However, the performance of existing prediction models has rarely been quantified in patient data other than that used to develop them. In the few available examples, external validation revealed poorer discrimination ability in the new data [17, 18], or poor calibration leading to under- and overestimation of outcome probabilities [19].

Machine learning has been found to improve predictive performance relative to regression [20], including for spinal surgery outcomes [21,22,23]. This may stem from the ability to capture nonlinear or interactive effects, often characteristic of clinical data. Conversely, machine learning algorithms are prone to overfitting smaller datasets and comparisons to standard regression have lacked external validation; thus, such predictions may not extrapolate well to new data. External validation is rare but necessary before any prediction models can be implemented in clinical practice [24, 25].

Our primary objective was to develop and externally validate prediction models of patient-reported spinal surgery outcomes based on routinely collected prospective data, for the first time comparing the performance of multivariate regression and machine learning approaches, the latter hypothesised to improve prediction accuracy. Secondarily, we identify the most relevant predictors from the available medical and patient-reported information, since existing models (particularly those from the US [14,15,16, 21]) may not translate well to UK cohorts due to different healthcare systems, type of data recorded, and cultural and demographic differences.


The present article follows the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines [26].

Source of data

This study was based on retrospective review of a prospective clinical database from a single Neurosurgery Department at the Walton Centre NHS Foundation Trust (UK). The Walton Centre contributes data to the Eurospine’s international Spine Tango registry [27], which governs standardised data collection protocols. Specifically, patients complete a self-assessment form (Core Outcome Measures Index, COMI) [28, 29] at the preoperative consultation and postoperative follow-ups at 3, 12, and 24 months. The surgeon completes the surgery form before discharge, detailing the patient’s history, type of pathology and surgery, and hospital stay [30]. The Walton Centre’s database was reviewed once on 26/04/2021 to extract lumbar spine surgery cases with degenerative disease as the main pathology. This data included patients operated between 4/04/2011 and 30/03/2021, with the last follow-up dated 21/04/2021.


Eligible patients had lumbar disc herniation and/or stenosis and underwent elective spinal decompression surgery with or without fusion. These most common degenerative pathologies and surgical measures were selected to obtain a representative sample and minimise clinical heterogeneity. Eligible patients completed a preoperative and at least one postoperative self-assessment form. If there were multiple surgery cases per patient, only the chronologically first eligible surgery was included.


The spinal surgery outcomes were defined as reduction in back and leg pain intensity and reduction in COMI from baseline to the last available follow-up. Due to the self-reported nature of outcomes, their assessment was not blinded.

Back and leg average pain intensity in the past week was measured on 0 (no pain) to 10 (worst pain I can imagine) numerical rating scales embedded within COMI. This scale is a recommended outcome measure of pain in studies evaluating effectiveness of treatments for chronic pain [31]. COMI [28] is a multidimensional instrument consisting of measures of pain, function, symptom-specific well-being, quality of life, social disability, and work disability. Average COMI score can range from 0 to 10, where higher scores indicate worse level of functioning. This instrument is an official outcome measure of Eurospine’s international spine registry and has been extensively validated in patients with back pain [29, 32].

Both continuous and dichotomous outcomes were considered for their precision and clinical utility, respectively. Indeed, relative to dichotomisation, continuous outcomes provide greater statistical power, minimise information loss and risk of false positive results, and allow more accurate estimation of variability in outcomes (e.g. those close to and far from the MCID cut-off) [33]. Here continuous outcomes represent numerical differences between the baseline and follow-up back pain, leg pain, and COMI. Positive scores correspond to improvement (i.e., reduction in pain and functional impairment). Dichotomous outcomes represent achievement of MCID between baseline and follow-up back pain and leg pain (reduction of ≥ 2 points [34]), and COMI (reduction of ≥ 2.2 points [35]).
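The outcome definitions above can be sketched in code. This is a minimal illustrative example (the study itself analysed data in R; the function and variable names here are hypothetical), showing the continuous change score and the MCID dichotomisation with the thresholds stated in the text:

```python
# Illustrative sketch of the outcome definitions (hypothetical names;
# the original analysis was conducted in R).

def change_score(baseline, follow_up):
    """Continuous outcome: positive values indicate improvement
    (i.e., reduction in pain or functional impairment)."""
    return baseline - follow_up

def achieved_mcid(baseline, follow_up, threshold):
    """Dichotomous outcome: True if the reduction meets the MCID threshold."""
    return change_score(baseline, follow_up) >= threshold

# MCID thresholds from the text: >= 2 points for back/leg pain [34],
# >= 2.2 points for COMI [35]
MCID_PAIN, MCID_COMI = 2.0, 2.2

print(change_score(7, 4))                  # 3: back pain improved by 3 points
print(achieved_mcid(7, 4, MCID_PAIN))      # True: reduction of 3 >= 2
print(achieved_mcid(6.5, 5.0, MCID_COMI))  # False: reduction of 1.5 < 2.2
```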


As candidate predictors we assessed preoperative factors, and controlled for potential intra- and postoperative confounders (see Table 1 for a full list). Assessment of all predictors was blinded to outcomes assessed ≥ 3 months later, but predictors were not blinded to each other, as they were either self-reported by the patients before surgery on a single form, or recorded by the surgeon on a single form before discharge.

Table 1 Multivariate regression model results in the development data and model performance in the validation data

Sample size

To estimate the minimum required sample size and minimise the risk of overfitting, at least 20 participants per predictor should be included for continuous outcomes [36], and 10 or more events (i.e., achieving MCID) per predictor for dichotomous outcomes [37]. We considered 19 predictor variables; however, since several categorical factors had more than two levels, each additional factor level was counted separately, resulting in 34 predictors in total. Therefore, a minimum of 680 participants would be required for continuous outcomes, and 340 participants who achieve MCID for dichotomous outcomes.
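The arithmetic behind these minimums is simply the rule-of-thumb multiplier applied to the 34 predictor parameters:

```python
# Minimum sample-size rules of thumb from the text:
# 20 participants per predictor for continuous outcomes [36],
# 10 or more events per predictor for dichotomous outcomes [37].

n_predictors = 34  # 19 variables, expanded to 34 parameters for multi-level factors

min_n_continuous = 20 * n_predictors        # participants required
min_events_dichotomous = 10 * n_predictors  # MCID events required

print(min_n_continuous)        # 680
print(min_events_dichotomous)  # 340
```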

Data processing and statistical analysis

Data was processed and analysed using R software, version 4.1.1 [38]. Details of data processing and handling of predictors are described in Methods S1, Additional file 1. Except for categorising a portion of the BMI data to match the variable types across the two versions of the surgery form, all other continuous factors were treated as continuous in the analysis. Due to its severely skewed distribution, duration of hospital stay was log-transformed (base 10).

Handling missing data

As per our eligibility criteria, only patients with complete outcome data (i.e. with a preoperative and at least one postoperative assessment) were included in the analysis; thus, we did not impute any missing outcome data. We report attrition and summary statistics for the excluded portion of the data, and between-group comparisons with the final included sample.

We did not exclude any patients based on missing predictor data. This data was assumed to be missing at random and addressed through Multivariate Imputation by Chained Equations (MICE) with 40 iterations using the mice R package [39]. The MICE procedure is explained in Methods S2, Additional file 1. Values were missing in 11 out of 19 predictor variables, with partial missing rates from < 1% to 36% (see Results for details), and a total missing rate of 3.75% across all predictors. Multiple imputation has been found to remain unbiased at up to 50% missing rates [40, 41]. A series of diagnostic checks detailed in Methods S2, Additional file 1 demonstrated good convergence and no apparent biases of the MICE algorithm, with overlapping ranges and distributions of the imputed and observed data (Figs. S1-S4, Additional file 1), and the missingness of each variable was associated with other factors in the imputation dataset (Table S1, Additional file 1), supporting the missing at random assumption. In addition to the main analysis using imputed predictor data, we also conducted a separate sensitivity complete case analysis excluding cases with missing data on any of the predictors.
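The chained-equations idea (iteratively modelling each incomplete variable from the others) can be illustrated outside R. The sketch below is not the study's mice workflow but an analogous procedure using scikit-learn's IterativeImputer on synthetic data, with the study's 40 iterations as the only borrowed setting:

```python
# Illustrative analogue of chained-equations imputation (the study used the
# mice package in R; this uses scikit-learn's IterativeImputer on synthetic data).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[rng.random(X.shape) < 0.1] = np.nan  # set ~10% of values missing at random

# Each variable with missing values is modelled from the others, iteratively
imputer = IterativeImputer(max_iter=40, random_state=0)  # 40 iterations, as in the study
X_imputed = imputer.fit_transform(X)

print(np.isnan(X_imputed).any())  # no missing values remain after imputation
```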

Model development and validation data

The dataset was split into model development and validation samples based on the date of surgery (2011–2017 and 2018–2021, respectively). For large datasets, a non-random split has been recommended, e.g., by time [37], so that temporal external validation could be performed [42]. The chosen time-split coincided with the introduction of a new version of the surgery form. The setting, eligibility criteria, predictors, and outcomes used were consistent across the development and validation data, except for some differences between the surgery form versions (see Methods S1, Additional file 1).
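The temporal split amounts to partitioning cases on the surgery-date cutoff. A minimal pandas sketch (hypothetical column names; the study performed this in R):

```python
# Sketch of the temporal split by surgery date: 2011-2017 for development,
# 2018-2021 for validation (hypothetical column names).
import pandas as pd

df = pd.DataFrame({
    "surgery_date": pd.to_datetime(
        ["2012-05-01", "2016-11-20", "2018-02-14", "2020-07-30"]
    ),
    "comi_change": [3.1, 0.5, 2.8, 4.0],
})

development = df[df["surgery_date"].dt.year <= 2017]
validation = df[df["surgery_date"].dt.year >= 2018]

print(len(development), len(validation))  # 2 2
```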

Multivariate regression analysis

As our primary analysis, we developed multivariate logistic regression models for each dichotomous outcome, and multivariate linear regression models for each continuous outcome. Models were fitted using the R functions glm (with binomial family and logit link function) and lm (for linear regression) within the stats package [38]. A full-model approach (i.e., including all candidate predictors) was used to estimate prediction accuracy based on the available set of routinely collected data in combination, and to assess the relative contribution of specific preoperative factors while controlling for potential confounders. Each of the six models (one model per outcome) based on the development sample was then fitted to the observed predictor values in the validation data to predict the outcomes of interest in new, untrained data. The predicted outcomes were compared with the actual (observed) outcome values.
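The develop-then-validate workflow can be sketched as follows. This is an illustrative scikit-learn analogue on synthetic data, not the study's R code (which used glm with a binomial family and logit link): fit the full model on development data, then apply it unchanged to external validation predictors.

```python
# Sketch of the full-model approach: fit logistic regression on development
# data, then generate predictions on held-out validation data (illustrative
# analogue of R's glm; synthetic data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_dev = rng.normal(size=(200, 5))           # all candidate predictors
y_dev = rng.integers(0, 2, 200)             # MCID achieved (1) or not (0)
X_val = rng.normal(size=(80, 5))            # external validation predictors

model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)

# Predicted MCID probabilities in the untrained validation sample, to be
# compared against the observed outcomes
p_val = model.predict_proba(X_val)[:, 1]

print(p_val.shape)  # (80,)
```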

Random forests

As our secondary analysis, we used random forests (RF), a particularly flexible machine learning approach that can combine different data types, be applied to both dichotomous and continuous outcomes, and which in some studies demonstrated superior predictive performance over other machine learning algorithms [22, 43,44,45]. Compared to statistical regression models, RF showed increased accuracy in the presence of a larger number of predictors, nonlinear relationships, and a smaller number of observations [20]. RF is a non-linear classification and regression algorithm based on an ensemble of deep decision trees. Multiple decision trees are trained on different randomly selected bootstrap samples of the same training dataset. Each time a decision tree split is performed, the best split variable [43] is chosen from a random subset of the original predictor set. Each tree gives an outcome prediction on the leftover data which was not used during training (out-of-bag, OOB). The predictions are then aggregated (bagging) by assigning class labels (MCID vs. no-MCID) by majority vote, or by averaging the continuous dependent variable (change score) across all the trees. Prediction errors are calculated as OOB error (misclassification rate) for classification or OOB mean square error for regression. While aggregating the predictions of multiple decision trees reduces the interpretability of a model, it also reduces variance and minimises overfitting (a major problem of individual decision trees), thus improving the RF's ability to generalise to new data. We used Breiman's implementation of RF in the randomForest R package [43, 46]. The RF algorithm's hyperparameters were tuned to maximise predictive accuracy as described in Methods S3 and Table S2, Additional file 1. Using the final hyperparameters and 500 trees for each model, we applied RF classification to predict MCID outcomes, and RF regression to predict continuous outcomes in the development data.
Then we used the parameters of each of the six developed models (one model per outcome) to predict the same outcomes in the validation data. Although the leg pain outcome was characterised by a moderate class imbalance, with MCID rates of 67–68%, no adjustments such as downsampling non-events were made, as these could distort the true outcome rates, lead to inadequate clinical predictions, and increase the risk of overfitting.
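The RF training and OOB-error mechanics described above can be sketched with scikit-learn (an analogue of the randomForest R package actually used in the study; synthetic data, but with the study's 500 trees):

```python
# Sketch of RF classification with out-of-bag error (illustrative analogue
# of R's randomForest; synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))     # candidate predictors
y = rng.integers(0, 2, 300)        # MCID vs. no-MCID

rf = RandomForestClassifier(
    n_estimators=500,   # 500 trees per model, as in the study
    oob_score=True,     # evaluate each tree on its out-of-bag samples
    random_state=0,
).fit(X, y)

# OOB misclassification rate = 1 - OOB accuracy
oob_error = 1 - rf.oob_score_
print(0 <= oob_error <= 1)  # True
```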

Assessment of model performance

Model performance on the development and validation data was expressed by calibration and discrimination measures with 95% confidence intervals (CI) where applicable. Note that although several internal validation methods are available, such as bootstrapping [47], our primary focus was on validation in an external sample, which is a preferred, more robust method than those based on resampling the internal data [37]. For calibration plots, we divided each dataset into 10 deciles according to the predicted probability of MCID (or predicted change score for linear and RF regression), and plotted the mean predicted versus mean observed probabilities (or change scores) for each decile [48]. Perfect fit would be reflected by all points being aligned on a line with intercept 0 and slope 1. We tested whether the 95% CIs of the intercept and slope of the calibration line of best fit include these values. While this method is applicable to both dichotomous and continuous outcomes, other approaches have been proposed as more robust for assessing calibration of outcome probabilities, such as flexible calibration curves [49], which we present in Additional file 1 for reference. We additionally calculated calibration-in-the-large (mean observed – mean predicted outcome probabilities or values), and for MCID models, reported the expected-to-observed events ratio (E/O; 1 would indicate perfect calibration), Brier score (mean squared error between the observed and predicted outcome probabilities; 0 would indicate perfect calibration), and estimated calibration index (ECI; average squared difference between predicted and observed outcome probabilities transformed into a single number [0–1] summarising a flexible calibration curve, with 0 indicating perfect calibration) [49, 50]. For linear regression models, the correlation between ungrouped observed and predicted outcomes was also expressed as Pearson's r.
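The decile-based calibration check can be sketched as follows (an illustrative numpy/pandas analogue of the R analysis; the predictions here are synthetic and well calibrated by construction):

```python
# Sketch of decile-based calibration: group predictions into 10 deciles and
# compare mean predicted vs. mean observed probabilities per decile
# (illustrative analogue of the R analysis; synthetic data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
predicted = rng.random(1000)
# Simulate observed events so that P(event) equals the predicted probability
observed = (rng.random(1000) < predicted).astype(float)

df = pd.DataFrame({"pred": predicted, "obs": observed})
df["decile"] = pd.qcut(df["pred"], 10, labels=False)
calib = df.groupby("decile")[["pred", "obs"]].mean()

# A perfectly calibrated model puts all 10 points on the identity line
# (calibration line with intercept 0 and slope 1)
print(calib.shape)  # (10, 2)
```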
Discrimination of the logistic regression and RF classification models was assessed via Receiver-Operating Characteristic curve (ROC) plots with estimated Area Under the Curve (AUC; c-index), and classification accuracy expressed as sensitivity and specificity estimates at the optimal ROC cut-off (probability threshold maximising both indices). AUC can range from 0 to 1 and a value of 0.5 corresponds to chance discrimination, while 0.7–0.8 is considered acceptable, and > 0.8 excellent discrimination [51]. Nagelkerke pseudo-R2 and deviance were also reported as overall performance measures for logistic regression, and OOB errors for RF classification. For linear and RF regression, discrimination was quantified by R2 (pseudo- R2 for RF) and root mean square error (RMSE), and also an F-test for linear regression models. To express adjusted contribution of each predictor to the outcome of interest, we presented log odds and odds ratios (for logistic regression) and unstandardised and standardised regression coefficients (for linear regression) with 95% CIs. For RF models, we presented relative variable importance based on the mean decrease in accuracy (loss in prediction performance) when a particular variable is omitted from the training data for each development model.
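The AUC and "optimal ROC cut-off" (the probability threshold maximising sensitivity and specificity jointly, i.e. the Youden index) can be sketched with scikit-learn on synthetic scores (illustrative only; the study computed these in R):

```python
# Sketch of AUC and the optimal ROC cut-off (threshold maximising
# sensitivity + specificity); illustrative analogue on synthetic data.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(4)
y_true = rng.integers(0, 2, 500)                      # MCID vs. no-MCID
y_prob = np.clip(y_true * 0.3 + rng.random(500) * 0.7, 0, 1)  # informative scores

auc = roc_auc_score(y_true, y_prob)                   # c-index
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
optimal = thresholds[np.argmax(tpr - fpr)]            # Youden: max(sens + spec - 1)

print(0.5 < auc <= 1.0)  # True: better than chance discrimination
```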



Fig. S5, Additional file 1 illustrates the flow of participants through the eligibility screening process and summarises reasons for exclusion. Out of 6810 screened surgery cases, 4307 unique patients were included in the analysis. Most of these patients underwent an open surgery assisted with a microscope. Descriptive characteristics, missingness rates, and statistical comparisons between included participants and those excluded due to a missing baseline and/or follow-up assessment are reported in Results S1 and Table S3, Additional file 1. The most frequent latest available follow-up interval was 24 months and the outcomes did not vary depending on the duration of follow-up. Although there were statistically significant differences on 10 predictors, as expected in large datasets even when the effect sizes of these differences are small, the included data were representative: the included data covered the full range of possible predictor values in the excluded data, and all levels of categorical factors present in the excluded data were well-represented in the included data.

The development dataset included 2691 patients and the validation dataset 1616, so each was more than sufficient to fit regression models with the 34 specified predictors. Patient characteristics are presented in Table S4, Additional file 1, and any group differences and additional post-hoc sample size considerations based on the observed outcome rates and means are described in Results S1, Additional file 1. On average, patients in the development compared to the validation sample achieved less reduction in back pain (mean [SD]: 2.15 [3.30] vs. 2.51 [3.50]) and leg pain (3.68 [3.69] vs. 3.94 [3.75]), but did not differ in COMI change (3.22 [3.07] vs. 3.34 [3.18]) or MCID rates in COMI (57% vs. 58%), back pain (53% vs. 57%), or leg pain (67% vs. 68%). There were slight statistically significant differences on most of the candidate predictors; yet, importantly, the validation data was within the range of the development data for all variables.

Development and validation of regression models

Regression diagnostic checks are detailed in Results S2 and Figs. S7-11, Additional file 1. In summary, there was no multicollinearity among the predictors, no severe deviations from linearity between the continuous predictors and logit of MCID or continuous outcomes, residual variance was homogenous across different levels of categorical predictors, there were no highly influential values, models had good fit across the range of observations, and standardised residuals in linear models showed acceptable homoscedasticity and normal distribution, with slight deviations on the tails.

Binary outcomes (MCID)

Table 1 presents odds ratios with 95% CIs for each candidate predictor in the development models for achievement of MCID in COMI, back pain, and leg pain intensity (see also Fig. S12, Additional file 1). Independent of other factors included in the models, older age was associated with higher odds of achieving MCID after surgery across all outcomes. Decompression with fusion surgery was related to higher odds of MCID in COMI and leg pain, whereas blood loss of 100–500 ml was related to higher odds of MCID in back pain. Additionally, higher baseline COMI, back pain, and leg pain predicted better odds of improvement in their corresponding outcomes. In contrast, patients with spinal stenosis, history of previous surgeries, currently smoking, and with higher morbidity class had lower odds of achieving MCID after surgery across all outcomes. Disc herniation with stenosis, higher baseline COMI, and presence of any complications also predicted lower odds of MCID in back pain. Finally, patients with higher baseline back pain and longer hospital stay had lower odds of MCID in COMI and leg pain. Adjusted effect sizes of these predictors were small (odds ratios > 0.4 and < 2.5).

The AUC in the development and validation data bordered between no-better-than-chance and acceptable discriminability (Table 1). The COMI MCID model had the worst discrimination, whereas the highest, acceptable, discrimination ability was found for the back pain MCID model in the validation data. Using the optimal ROC cut-offs for each outcome, the development models generally had good sensitivity (ability to detect true MCID), while specificity (detecting true no-MCID) oscillated near chance classification. The pattern of classification in the validation data was consistent for leg pain MCID; however, COMI and back pain MCID classification presented the opposite pattern, with good specificity but poor sensitivity. The ROC curves are presented in Fig. S13a, Additional file 1 for the development data, and Fig. 1a for the validation data.

Fig. 1

Discrimination ability of the (a) logistic regression and (b) random forest classification models when fitted to the validation data for MCID in COMI, back pain, and leg pain. Plots illustrate Receiver-Operating Characteristic (ROC) curves with an optimal probability threshold (black point on the ROC curve; specificity and sensitivity indicated in brackets). Area Under the Curve (AUC) is reported for each ROC with 95% confidence interval

The proportion of explained variation in the development models ranged from 10% for COMI to 17% for back and leg pain intensity. Residual deviance was lower than null deviance, indicating that the included variables allow each outcome to be predicted better than by intercept-only (null) models. Calibration-in-the-large was near zero and E/O equal to or approaching 1, suggesting no overall differences between mean observed and predicted outcomes. Brier scores ranged from 0.19 to 0.23, consistently across the development and validation models (Table 1). Calibration plots indicated good model fit for MCID outcomes in the development data (Fig. S14a, Additional file 1). In the validation data, considering the visual inspection (Fig. 2a) and the fact that in all cases the intercept of the calibration lines did not significantly differ from 0, and their slope did not significantly differ from 1, we conclude that the models showed good external calibration. On a more granular level, flexible calibration curves for COMI and leg pain MCID consistently showed good calibration, whereas that for the back pain model, accompanied by a higher ECI, had a positive intercept and suggested a small degree of underestimation, particularly in the range of 0.3–0.5 predicted MCID probabilities (Fig. S15a, Additional file 1).

Fig. 2

Calibration plots for (a) logistic regression, (b) linear regression, (c) random forest (RF) classification, and (d) RF regression models when fitted to the validation data for each outcome. Points correspond to the mean predicted and observed probabilities of MCID or change scores in each decile with 95% confidence intervals (CI) of the mean observed probabilities or change scores. Calibration lines of best fit are plotted in red with 95% CI in grey and their intercept (α) and slope (β) estimates with 95% CIs are presented in the top-left corner of each plot. *95% CI of the intercept do not include 0, or 95% CI of the slope do not include 1, indicating significant deviation from the perfect fit

Continuous outcomes

Results of the linear regression models on change in COMI, back pain, and leg pain are presented in Table 1 (standardised regression coefficients) and Fig. S12, Additional file 1. After adjusting for other factors included in the models, older age, male gender, and decompression with fusion (moderate effect size) were associated with greater improvement after surgery across all outcomes. Additionally, higher baseline COMI, back pain, and leg pain predicted greater improvement in their corresponding outcomes (moderate-large effects). In contrast, patients with spinal stenosis and disc herniation with stenosis, history of previous surgeries, currently smoking (moderate effect), with higher morbidity class (moderate effect), and longer hospital stay had less improvement after surgery across all outcomes. Additionally, higher baseline COMI predicted less improvement in back and leg pain, higher baseline back pain predicted less improvement in COMI and leg pain, and operation time > 3 h (moderate effect) predicted less improvement in COMI and back pain. Adjusted effect sizes of these predictors were small (betas < 0.25), unless specified otherwise.

Discrimination performance in the development models ranged from 13% (COMI change) to 22% (back and leg pain change) and the included set of predictors explained each outcome better than intercept-only models (F-test ps < 0.001) (Table 1). The unadjusted proportion of explained variance was higher in the validation than the development data for COMI and back pain change, but lower for the leg pain change outcome. Prediction accuracy was slightly worse in the validation compared to the development data, with approximately 0.1 higher RMSEs across all outcomes.

Model calibration was very good in the development data (Fig. S14b, Additional file 1), and plots in the validation data also showed close agreement between mean observed and predicted outcomes, although there was some degree of underestimation of the predicted changes in back and leg pain (Fig. 2b), also apparent in flexible calibration curves (Fig. S16a, Additional file 1). However, only the leg pain model calibration line significantly deviated from the perfect fit. Calibration-in-the-large indicated that average predicted changes in COMI and pain outcomes were 0.25–0.35 points lower than observed changes (Table 1).

Individual predictions of outcomes based on the developed logistic and linear regression models can be made according to the equations provided in Results S3, Additional file 1 and log-odds and unstandardised regression coefficients presented in Table S5, Additional file 1.

Compared to the primary analyses using the imputed predictor data, sensitivity complete case analyses presented in Results S4 and Table S6, Additional file 1 showed similar or worse model performance in the development and validation datasets, and no systematic differences in significant predictors except for the current smoking status which did not significantly predict any outcomes in the sensitivity analyses.

Development and validation of random forest models

Table 2 provides an overview of the RF performance measures across MCID (classification) and continuous change (regression) in COMI, back pain, and leg pain outcomes in the development and validation data.

Table 2 Random forests performance measures in the development and validation data across all outcomes

Binary outcomes (MCID)

Misclassification rates ranged from 30% for leg pain to 40% for back pain and COMI MCID outcomes. Discrimination performance was in a similar range for development and validation data, with AUC consistently over 0.60, but still below acceptable discrimination across all outcomes, and consistently lower than AUC values obtained from logistic regression models. Outcome classification at the optimal probability threshold in the development and validation data across back and leg pain outcomes appeared to be biased towards higher sensitivity at the expense of specificity which oscillated near or below chance classification of no-MCID cases. This pattern was reversed for COMI outcome, where classification showed better specificity but near-chance sensitivity. The ROC curves are presented in Fig. S13b, Additional file 1 for development data, and Fig. 1b for validation data.

Calibration plots demonstrated good agreement between mean observed and predicted MCID probabilities in the development data, however, the calibration line for the COMI outcome significantly deviated from the perfect fit suggesting some underestimation of the predicted MCID (Fig. S14c, Additional file 1). Calibration of the COMI and back pain models in the validation data also suggested a small degree of underestimation of predicted outcome probabilities, but without significant deviations from the perfect fit (Fig. 2c). However, higher-resolution flexible calibration curve for back pain MCID with a positive intercept and slope > 1 further suggested some degree of underestimation of predicted probabilities in the 0.4–0.6 range of the validation data (Fig. S15b, Additional file 1). Back and leg pain calibration curves were also accompanied by higher ECIs compared to the logistic regression models. Nonetheless, RF models were characterised by comparable calibration-in-the-large, E/O, and Brier scores, and calibration plots supported good agreement between the mean observed and predicted probabilities of MCID for most quantiles in the validation data, similar to the calibration of the logistic regression models.

Continuous outcomes

The proportion of explained variance in change in COMI, back pain and leg pain ranged from 11% for COMI to 21% for back pain in the development data, and increased in the validation data (15–25%). These values were lower compared to R2 from the linear regression models in the development data, but did not differ in the validation data except for the back pain model. RMSEs were only minimally higher in the RF regression, but with less discrepancy between the development and validation data.

RF regression models showed very good calibration in the development data (Fig. S14d, Additional file 1). While there was also good agreement between mean observed and predicted outcomes in the validation data, calibration plots showed some underestimation of predicted relative to observed changes, particularly in the higher deciles, with the slope of the leg pain calibration line deviating significantly from the perfect fit (Fig. 2d). Similar trends are apparent in the flexible calibration curves (Fig. S16b, Additional file 1). The tendency to underestimate predicted changes in the validation data was similar to that seen in the calibration of the linear regression models, although underestimation of leg pain change was more pronounced in the lower deciles. Correlation coefficients between ungrouped observed and predicted outcomes in the RF models were overall higher in the validation than in the development data, and marginally lower than in the linear regression models. Calibration-in-the-large indicated smaller differences (relative to the linear regression models) between average observed and predicted outcomes, in the range of 0.14–0.28 points.
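The grouped calibration plots described here compare mean observed with mean predicted outcomes within deciles of the predictions. A small sketch of that grouping step, on synthetic data constructed to be perfectly calibrated (illustrative only; the study's plots were produced in R):

```python
import numpy as np

def decile_calibration(y_obs, y_pred, n_groups=10):
    """Mean observed vs. mean predicted outcome per quantile group of predictions.

    Returns an (n_groups, 2) array of (mean predicted, mean observed) pairs,
    i.e. the points behind a grouped calibration plot for a continuous outcome.
    """
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    order = np.argsort(y_pred)
    groups = np.array_split(order, n_groups)  # near-equal-sized decile groups
    return np.array([[y_pred[g].mean(), y_obs[g].mean()] for g in groups])

# Synthetic example constructed to be perfectly calibrated: the observed
# change equals the prediction plus zero-mean noise, so the decile means
# should lie close to the identity (perfect-fit) line.
rng = np.random.default_rng(3)
pred = rng.normal(size=4000)
obs = pred + rng.normal(scale=0.8, size=4000)

cal = decile_calibration(obs, pred)
print(cal.round(2))
```

Systematic departures of the observed-mean column from the predicted-mean column in particular deciles correspond to the under- or overestimation patterns described above.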

Variable importance

The highest variable importance in RFs was generally assigned to the baseline scores on the corresponding outcome measures (except for COMI MCID); for instance, baseline back pain was most important for classifying back pain MCID and for predicting change in back pain intensity (Fig. S17, Additional file 1). While for the pain intensity outcomes these baseline scores were the single most relevant predictors, the COMI outcomes showed a broader distribution of importance across predictors. Across all outcomes, relatively high importance was also attributed to duration of hospital stay, age, baseline scores on the other outcome measures, current smoking status, type of degenerative disease, history of previous spinal surgeries, and morbidity. The same factors had significant prognostic effects in the logistic and linear regression models.
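As an illustration of how such importance rankings are derived, the sketch below computes permutation importance (the decrease in model fit when one predictor's values are shuffled) with scikit-learn; the study itself used the ranger implementation in R [46], and the data here are synthetic, with feature 0 standing in for a dominant baseline score:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(4)

# Synthetic data in which feature 0 (a "baseline score" analogue) carries
# most of the signal, feature 1 a little, and the remaining three none.
n = 1500
X = rng.normal(size=(n, 5))
y = 1.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.8, size=n)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Permutation importance: drop in R^2 when one feature's values are shuffled.
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("Features ranked by importance (most important first):", ranking)
```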


Discussion

We developed and externally validated multivariate regression and RF models to predict patient-reported outcomes 3–24 months after lumbar spine surgery, based on prospectively recorded medical and patient data. The models demonstrated good calibration in the temporal validation data, while their discrimination ability ranged from acceptable to no better than chance. Linear and logistic regression models performed better than the RF algorithms in both the development and validation data. The most important predictors included age, baseline COMI and pain scores, type of degenerative disease, previous surgeries, smoking, morbidity, and duration of hospital stay.

This study makes a novel contribution to the field by assessing and comparing the performance of linear and logistic regression models against RF regression and classification algorithms, and by validating them on external data. Previous spinal surgery studies focused solely on comparing different machine learning and regression approaches for binary outcomes and involved only internal validation of the developed prediction models [21,22,23]. It was therefore important to identify whether a non-linear modelling strategy (RF) could outperform a linear approach on this type of data, with reference to an external validation dataset, in order to improve our ability to predict individual outcomes precisely (i.e. the magnitude of reduction in COMI and pain intensity after surgery). The high numbers of participants and events per variable, which were limited in previous clinical prediction models [44], add to the strength of the present work. Comparable performance and consistency in the identified predictors demonstrate the robustness and generalisability of our models across different patient-reported outcomes (COMI, back, and leg pain MCID and continuous change scores) and modelling approaches.

Model performance

There was no substantial decrease in the models’ performance on the new data relative to the development data, indicating no overfitting issues. Regression models predicting changes in back pain showed the best external validity, with acceptable discrimination (AUC 0.72) and 28% explained variance, followed by the models predicting leg pain outcomes (AUC 0.68, 22% explained variance). These metrics are comparable to or better than those of similar externally validated models predicting pain-related outcomes (AUC 0.52–0.83, 6–19% explained variance [17,18,19]). COMI models showed poorer discrimination (AUC 0.63, 16% explained variance), although comparable with the external validity of another model relying on the same measure (17% [18]), suggesting that composite outcomes like COMI may be more difficult to predict than, for instance, specific disability measures (AUC 0.71 [19]).

Similar studies relying on internal validation generally reported better discrimination (AUC 0.64–0.84, 23–49% explained variance [12,13,14,15,16]), highlighting a potential degree of over-optimism when model performance is assessed only on resampled or randomly split data. Furthermore, model calibration was not always assessed [12, 16, 17], yet accurate prediction of outcomes can be particularly problematic in external data, and both underestimation of leg pain outcomes and overestimation of back pain and disability outcomes have been reported upon external validation [18, 19]. Our models showed good calibration in the validation data across all outcomes, although there was a mild tendency to underestimate back pain MCID and leg pain reduction. The range of the calibration measures in the current study (see Table 1) indicated similar or better calibration compared with other externally validated regression models in the field (E/O 0.77–1.20; Brier score 0.12–0.22; ECI 0.41–0.67; r 0.31–0.44 [18, 19]).

We found that RF did not outperform the linear and logistic regression models. RF showed similarly good calibration in the validation data, with overall calibration metrics very close to those of the conventional regression models, and even slightly better calibration-in-the-large for the continuous outcomes. Nonetheless, the calibration slope indicated some underestimation of leg pain reduction, and the flexible calibration curve and ECI suggested potential underestimation of back pain MCID (consistent with the underestimation tendencies observed for the statistical regression models). Brier scores in the validation data were at the upper limit of the range reported in previously published RF prediction models (0.14–0.20), which only assessed internal calibration [21, 22]. Despite overall good external calibration, none of the RF models reached acceptable discriminability in the validation (or development) data (AUC 0.62–0.68). This may reflect the difficulty RF algorithms have in extrapolating to new, unseen data, although previous relevant studies only achieved RF discrimination of 0.64–0.72 in internal validation [21, 22]. Various machine learning classification approaches (e.g. elastic net penalised regression, deep neural networks, extreme gradient boosting, RF) have previously shown predictive performance superior to logistic regression [20, 22, 23]. These approaches have also been found to outperform RF for some outcomes [21, 23]; it is therefore possible that more complex machine learning models could further improve prediction accuracy. However, consistent with our findings, a recent meta-analysis concluded that, based on studies at low risk of bias, the performance of machine learning clinical prediction algorithms, including RF, does not differ from that of logistic regression (their advantage was only found in studies at high risk of bias) [44]. Our results extend this conclusion to linear regression versus RF regression.

There could be several reasons why RFs did not outperform statistical regression in the present study. Previous work demonstrating an advantage of machine learning over logistic regression did not include external validation [20, 22, 23], while regression models are likely to generalise better. Furthermore, machine learning works best for problems with a high signal-to-noise ratio, which rarely characterises clinical data. Finally, since RFs show improved performance on data with nonlinear and nonadditive effects, any nonlinearities in the present data were likely not severe enough to be detrimental to statistical regression. Therefore, RFs may not show superior performance on sufficiently large datasets that satisfy the regression assumptions.

Relevant predictors

According to the regression models, greater odds of achieving MCID and larger reductions in COMI, back, and leg pain were significantly associated with older age, a higher baseline score on the respective outcome measure, decompression surgery with fusion, no stenosis, no history of previous spinal surgeries, lower morbidity class, not smoking, and a shorter hospital stay. The same factors (except for the surgical measures) were the most important predictors in the RF analyses, with preoperative COMI, back pain, or leg pain scores leading across all models. The relevance of several of the identified predictors is also supported by previous research on patient-reported outcomes from the Spine Tango registry in other countries [18, 52]. Our results are likewise consistent with systematic reviews supporting the prognostic value of age, preoperative pain intensity and disability, type of spinal pathology, previous surgeries, and smoking [6,7,8,9,10]. In contrast, we did not find any effect of symptom duration, recorded here indirectly as the duration of previous treatment.
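The associations above are expressed as odds ratios, i.e. the exponentiated coefficients of the logistic regression model. A minimal Python sketch on synthetic data with two assumed predictors, a continuous baseline score and a smoking indicator constructed to lower the odds of MCID (illustrative only, not the study's fitted model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)

# Synthetic predictors: a continuous "baseline score" and a binary "smoker"
# flag, with smoking constructed to lower the odds of achieving MCID.
n = 4000
baseline = rng.normal(size=n)
smoker = rng.binomial(1, 0.3, size=n)
logit = 0.6 * baseline - 0.8 * smoker
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([baseline, smoker])
# A very large C makes the fit close to unpenalised maximum likelihood.
model = LogisticRegression(C=1e6).fit(X, y)

# Exponentiated coefficients are odds ratios: >1 raises, <1 lowers the odds.
odds_ratios = np.exp(model.coef_[0])
print("OR baseline:", round(odds_ratios[0], 2),
      "| OR smoker:", round(odds_ratios[1], 2))
```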

Low back pain and spinal surgery are complex clinical issues in which multifactorial data are necessary to make accurate individualised predictions of treatment outcomes. Suboptimal model performance, particularly for the COMI and leg pain outcomes, suggests that factors beyond those already recorded in registries such as Spine Tango are likely needed to improve predictive accuracy. For instance, other predictors identified in the above-mentioned systematic reviews, but not available in our data, included education level, compensation, duration of sick leave, sensory loss, comorbidities, and psychological pain-related and affective factors. Previous prediction models that achieved better discrimination (at least in internal validation) incorporated additional predictors such as unemployment, medical insurance (although not applicable in the UK context), opioid use, antidepressants, mental functioning, optimism, control over pain, catastrophising, and postoperative psychomotor therapy [12,13,14, 21].


Limitations

The present study is not without limitations. Class imbalance is a common problem for machine learning classification algorithms: favouring the more frequently occurring outcome increases overall classification accuracy, while accuracy for the less frequent outcome may remain poor [53]. This could potentially account for the sensitivity/specificity trade-off apparent in some of the MCID models, although it was not specific to RF.
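One common mitigation, not applied in the present study, is class weighting, which penalises errors on the rarer class more heavily during training and may shift the sensitivity/specificity balance towards the minority class. A hedged scikit-learn sketch on synthetic imbalanced data (illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(6)

# Imbalanced synthetic outcome: roughly 80% of cases achieve "MCID" (y = 1).
n = 3000
X = rng.normal(size=(n, 4))
p = 1 / (1 + np.exp(-(1.2 * X[:, 0] + 1.8)))
y = rng.binomial(1, p)

X_tr, X_te, y_tr, y_te = X[:2000], X[2000:], y[:2000], y[2000:]

def sens_spec(clf):
    """Sensitivity and specificity of a fitted classifier on the test split."""
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    sens = np.mean(pred[y_te == 1] == 1)
    spec = np.mean(pred[y_te == 0] == 0)
    return sens, spec

sens_p, spec_p = sens_spec(RandomForestClassifier(n_estimators=200, random_state=0))
sens_w, spec_w = sens_spec(RandomForestClassifier(n_estimators=200, random_state=0,
                                                  class_weight="balanced"))
print(f"unweighted: sens={sens_p:.2f} spec={spec_p:.2f}")
print(f"weighted:   sens={sens_w:.2f} spec={spec_w:.2f}")
```

With `class_weight="balanced"`, errors on the rarer no-MCID class carry more weight, typically trading some sensitivity for specificity; the exact effect depends on the data.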

Furthermore, there were missing data on several predictors, and although imputation diagnostics did not indicate any biases, the complete case sensitivity analysis was inconsistent with respect to the predictive value of smoking status, which had the highest missingness rate. This suggests that the imputed data may not accurately reflect the true smoking status in the population of interest, and that the missingness of smoking status could be related to other, unmeasured variables. However, the significant prognostic value of smoking is consistent with several previous prediction models of spinal surgery outcomes [14, 16,17,18].
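The chained-equations idea behind the imputation used here, modelling each incomplete variable on the others and iterating, can be sketched with scikit-learn's IterativeImputer, a MICE-style single-imputation analogue (the study used the mice package in R [39]; the data and parameters below are synthetic and illustrative):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(7)

# Two correlated predictors; values in the second are set missing at random.
n = 1000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)
X = np.column_stack([x1, x2])

X_miss = X.copy()
mask = rng.random(n) < 0.25          # ~25% missingness in x2
X_miss[mask, 1] = np.nan

# Chained-equations-style imputation: each feature is modelled on the others
# and re-imputed over several iterations.
imputer = IterativeImputer(random_state=0, max_iter=10)
X_imp = imputer.fit_transform(X_miss)

# Imputed values should track the truth via the x1-x2 relationship.
err = np.abs(X_imp[mask, 1] - X[mask, 1]).mean()
print("mean absolute imputation error:", round(err, 2))
```

Unlike mice proper, this sketch produces a single completed dataset rather than multiple imputations with pooled estimates, so it understates imputation uncertainty.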

To maximise the length of postoperative follow-up that could be included in the analyses, in cases where individual patients underwent multiple surgeries, we only selected the data related to the chronologically first eligible surgery. This could be considered a limitation in terms of neglecting potential effects of subsequent surgeries on the recorded treatment outcomes. However, since the proportion of eligible patients who underwent a subsequent surgery within the included follow-up interval of the first surgery was very small (< 4%), this factor was not included in the models as a potential confounder.

We cannot rule out that more complex machine learning approaches, not assessed in the current study, could further improve our ability to predict spinal surgery outcomes and potentially outperform statistical regression (cf. [44]). However, future studies making similar comparisons should consider assessing the performance of such models on external data, as well as their accuracy in predicting continuous outcomes.

Finally, although the developed models performed relatively well in the temporal validation, geographic validation is often more problematic [54]. Thus, future research could include external validation of the prediction models across different neurosurgery centres to further assess their generalisability.


The developed models demonstrated good ability to predict spinal surgery outcomes from new data, thus in practice, they could help identify patients at risk of poor outcomes. Such patients could be considered for additional interventions to improve their chance of recovery [12]. While all models appeared to be well-calibrated, and those predicting change in back pain showed the best performance on external validation, the discrimination ability of the leg pain and COMI models could be further improved, for instance, by including factors that previously demonstrated important contributions to spinal surgery outcomes. Modifiable preoperative predictors could be particularly useful for prospectively maximising the treatment benefit. The proposed models can therefore serve as a benchmark to inform future studies aimed at improving the accuracy of individual outcome prediction and potential revision of routinely collected information for spinal surgery registries.


Conclusions

We found comparable performance and consistent predictors across different outcomes, modelling approaches, and datasets. Regression models showed good calibration and acceptable-to-no-better-than-chance discrimination in the validation data. For similar datasets (with a comparable set of predictors, sufficient sample size, and satisfied regression assumptions), RFs do not appear to outperform statistical regression. A strong advantage of statistical regression is its explanatory value and its more easily interpretable prediction rules, readily applicable in the clinical context. RFs, however, allow the relative importance of predictors to be established, which may assist in prioritising complex multifactorial data. Nonetheless, there is still room for improvement in terms of the recorded predictor data.

Availability of data and materials

The data that support the findings of this study are available from the Walton Centre NHS Foundation Trust but restrictions apply to the availability of these data, which were used under information sharing agreement for the current study, and so are not publicly available. Access to this data may be requested by contacting the Information Governance Team at the Walton Centre NHS Foundation Trust in Liverpool, United Kingdom. Materials such as analysis scripts used in the current study are available at



Abbreviations

AUC: Area under the curve
BMI: Body mass index
CI: Confidence interval
COMI: Core Outcome Measures Index
E/O: Expected-to-observed events ratio
ECI: Estimated calibration index
MCID: Minimal clinically important difference
MICE: Multivariate imputation by chained equations
RF: Random forest
RMSE: Root mean square error
ROC: Receiver-operating characteristic curve
TRIPOD: Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis


References

  1. Rice AS, Smith BH, Blyth FM. Pain and the global burden of disease. Pain. 2016;157:791–6.

  2. Maniadakis N, Gray A. The economic burden of back pain in the UK. Pain. 2000;84:95–103.

  3. Weir S, Samnaliev M, Kuo T-C, Choitir CN, Tierney TS, Cumming D, et al. The incidence and healthcare costs of persistent postoperative pain following lumbar spine surgery in the UK: a cohort study using the Clinical Practice Research Datalink (CPRD) and Hospital Episode Statistics (HES). BMJ Open. 2017;7:e017585.

  4. Hegarty D, Shorten G. Multivariate prognostic modeling of persistent pain following lumbar discectomy. Pain Physician. 2012;15:421–34.

  5. Zweig T, Enke J, Mannion AF, Sobottke R, Melloh M, Freeman BJ, et al. Is the duration of pre-operative conservative treatment associated with the clinical outcome following surgical decompression for lumbar spinal stenosis? A study based on the Spine Tango Registry. Eur Spine J. 2017;26:488–500.

  6. Dorow M, Lobner M, Stein J, Konnopka A, Meisel HJ, Gunther L, et al. Risk factors for postoperative pain intensity in patients undergoing lumbar disc surgery: a systematic review. PLoS ONE. 2017;12:e0170303.

  7. Halicka M, Duarte R, Catherall S, Maden M, Coetsee M, Wilby M, et al. Predictors of pain and disability outcomes following spinal surgery for chronic low back and radicular pain: a systematic review. Clin J Pain. 2022;38:368–80.

  8. Rushton A, Zoulas K, Powell A, Staal JB. Physical prognostic factors predicting outcome following lumbar discectomy surgery: systematic review and narrative synthesis. BMC Musculoskelet Disord. 2018;19:326.

  9. Wilhelm M, Reiman M, Goode A, Richardson W, Brown C, Vaughn D, et al. Psychological predictors of outcomes with lumbar spinal fusion: a systematic literature review. Physiother Res Int. 2017;22:1–12.

  10. Wilson CA, Roffey DM, Chow D, Alkherayf F, Wai EK. A systematic review of preoperative predictors for postoperative clinical outcomes following lumbar discectomy. Spine J. 2016;16:1413–22.

  11. de Campos TF. Low back pain and sciatica in over 16s: assessment and management NICE Guideline [NG59]. J Physiother. 2017;63:120.

  12. Abbott AD, Tyni-Lenné R, Hedlund R. Leg pain and psychological variables predict outcome 2–3 years after lumbar fusion surgery. Eur Spine J. 2011;20:1626–34.

  13. Cobo Soriano J, Sendino Revuelta M, Fabregate Fuente M, Cimarra Díaz I, Martínez Ureña P, Deglané MR. Predictors of outcome after decompressive lumbar surgery and instrumented posterolateral fusion. Eur Spine J. 2010;19:1841–8.

  14. Khor S, Lavallee D, Cizik AM, Bellabarba C, Chapman JR, Howe CR, et al. Development and validation of a prediction model for pain and functional outcomes after lumbar spine surgery. JAMA Surg. 2018;153:634–42.

  15. McGirt MJ, Sivaganesan A, Asher AL, Devin CJ. Prediction model for outcome after low-back surgery: individualized likelihood of complication, hospital readmission, return to work, and 12-month improvement in functional disability. Neurosurg Focus. 2015;39:E13.

  16. McGirt MJ, Bydon M, Archer KR, Devin CJ, Chotai S, Parker SL, et al. An analysis from the Quality Outcomes Database, Part 1. Disability, quality of life, and pain outcomes following lumbar spine surgery: predicting likely individual patient outcomes for shared decision-making. J Neurosurg Spine. 2017;27:357–69.

  17. Janssen ER, Osong B, van Soest J, Dekker A, van Meeteren NL, Willems PC, et al. Exploring associations of preoperative physical performance with postoperative outcomes after lumbar spinal fusion: a machine learning approach. Arch Phys Med Rehabil. 2021;102:1324–1330.e3.

  18. Staub LP, Aghayev E, Skrivankova V, Lord SJ, Haschtmann D, Mannion AF. Development and temporal validation of a prognostic model for 1-year clinical outcome after decompression surgery for lumbar disc herniation. Eur Spine J. 2020;29:1742–51.

  19. Quddusi A, Eversdijk HAJ, Klukowska AM, de Wispelaere MP, Kernbach JM, Schröder ML, et al. External validation of a prediction model for pain and functional outcome after elective lumbar spinal fusion. Eur Spine J. 2020;29:374–83.

  20. Couronné R, Probst P, Boulesteix A-L. Random forest versus logistic regression: a large-scale benchmark experiment. BMC Bioinformatics. 2018;19:270.

  21. Karhade AV, Fogel HA, Cha TD, Hershman SH, Doorly TP, Kang JD, et al. Development of prediction models for clinically meaningful improvement in PROMIS scores after lumbar decompression. Spine J. 2021;21:397–404.

  22. Siccoli A, de Wispelaere MP, Schröder ML, Staartjes VE. Machine learning-based preoperative predictive analytics for lumbar spinal stenosis. Neurosurg Focus. 2019;46:E5.

  23. Staartjes VE, de Wispelaere MP, Vandertop WP, Schröder ML. Deep learning-based preoperative predictive analytics for patient-reported outcomes following lumbar discectomy: feasibility of center-specific modeling. Spine J. 2019;19:853–61.

  24. Moons KGM, Altman DG, Reitsma JB, Ioannidis JPA, Macaskill P, Steyerberg EW, et al. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med. 2015;162:W1–73.

  25. Siontis GCM, Tzoulaki I, Castaldi PJ, Ioannidis JPA. External validation of new risk prediction models is infrequent and reveals worse prognostic discrimination. J Clin Epidemiol. 2015;68:25–34.

  26. Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): the TRIPOD statement. Ann Intern Med. 2015;162:55–63.

  27. Röder C, El-Kerdi A, Grob D, Aebi M. A European spine registry. Eur Spine J. 2002;11:303–7.

  28. Deyo RA, Battie M, Beurskens A, Bombardier C, Croft P, Koes B, et al. Outcome measures for low back pain research: a proposal for standardized use. Spine. 1998;23:2003–13.

  29. Ferrer M, Pellisé F, Escudero O, Alvarez L, Pont A, Alonso J, et al. Validation of a minimum outcome core set in the evaluation of patients with back pain. Spine. 2006;31:1372–9.

  30. Röder C, Chavanne A, Mannion AF, Grob D, Aebi M. SSE Spine Tango: content, workflow, set-up. Eur Spine J. 2005;14:920–4.

  31. Dworkin RH, Turk DC, Farrar JT, Haythornthwaite JA, Jensen MP, Katz NP, et al. Core outcome measures for chronic pain clinical trials: IMMPACT recommendations. Pain. 2005;113:9–19.

  32. Mannion AF, Elfering A, Staerkle R, Junge A, Grob D, Semmer NK, et al. Outcome assessment in low back pain: how low can you go? Eur Spine J. 2005;14:1014–26.

  33. Altman DG, Royston P. The cost of dichotomising continuous variables. BMJ. 2006;332:1080.

  34. Farrar JT, Young JP, LaMoreaux L, Werth JL, Poole RM. Clinical importance of changes in chronic pain intensity measured on an 11-point numerical pain rating scale. Pain. 2001;94:149–58.

  35. Mannion AF, Porchet F, Kleinstück FS, Lattig F, Jeszenszky D, Bartanusz V, et al. The quality of spine surgery from the patient’s perspective: part 2. Minimal clinically important difference for improvement and deterioration as measured with the Core Outcome Measures Index. Eur Spine J. 2009;18:374–9.

  36. Harrell FE. Multivariable modeling strategies. In: Regression modeling strategies. Springer; 2015. p. 63–102.

  37. Moons KGM, de Groot JA, Bouwmeester W, Vergouwe Y, Mallett S, Altman DG, et al. Critical appraisal and data extraction for systematic reviews of prediction modelling studies: the CHARMS checklist. PLoS Med. 2014;11:e1001744.

  38. R Core Team. R: a language and environment for statistical computing [Internet]. Vienna, Austria: R Foundation for Statistical Computing; 2021. Available from:

  39. van Buuren S, Groothuis-Oudshoorn K. mice: multivariate imputation by chained equations in R. J Stat Softw. 2011;45:1–67.

  40. Haji-Maghsoudi S, Haghdoost A, Rastegari A, Baneshi MR. Influence of pattern of missing data on performance of imputation methods: an example using national data on drug injection in prisons. Int J Health Policy Manag. 2013;1:69.

  41. Marshall A, Altman DG, Royston P, Holder RL. Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study. BMC Med Res Methodol. 2010;10:7.

  42. Moons KGM, Kengne AP, Grobbee DE, Royston P, Vergouwe Y, Altman DG, et al. Risk prediction models: II. External validation, model updating, and impact assessment. Heart. 2012;98:691–8.

  43. Breiman L. Random forests. Mach Learn. 2001;45:5–32.

  44. Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12–22.

  45. van der Ploeg T, Nieboer D, Steyerberg EW. Modern modeling techniques had limited external validity in predicting mortality from traumatic brain injury. J Clin Epidemiol. 2016;78:83–9.

  46. Wright MN, Wager S, Probst P. ranger: a fast implementation of random forests [Internet]. 2021 [cited 2021 Dec 1]. Available from:

  47. Moons KGM, Kengne AP, Woodward M, Royston P, Vergouwe Y, Altman DG, et al. Risk prediction models: I. Development, internal validation, and assessing the incremental value of a new (bio)marker. Heart. 2012;98:683–90.

  48. Sanchez-Santos MT, Garriga C, Judge A, Batra RN, Price AJ, Liddle AD, et al. Development and validation of a clinical prediction model for patient-reported pain and function after primary total knee replacement surgery. Sci Rep. 2018;8:3381.

  49. Van Calster B, Nieboer D, Vergouwe Y, De Cock B, Pencina MJ, Steyerberg EW. A calibration hierarchy for risk models was defined: from utopia to empirical data. J Clin Epidemiol. 2016;74:167–76.

  50. Riley RD, Ensor J, Snell KIE, Debray TPA, Altman DG, Moons KGM, et al. External validation of clinical prediction models using big datasets from e-health records or IPD meta-analysis: opportunities and challenges. BMJ. 2016;353:i3140.

  51. Hosmer DW Jr, Lemeshow S, Sturdivant RX. Applied logistic regression. John Wiley & Sons; 2013.

  52. Sobottke R, Herren C, Siewe J, Mannion AF, Röder C, Aghayev E. Predictors of improvement in quality of life and pain relief in lumbar spinal stenosis relative to patient age: a study based on the Spine Tango registry. Eur Spine J. 2017;26:462–72.

  53. Staartjes VE, Schröder ML. Letter to the Editor. Class imbalance in machine learning for neurosurgical outcome prediction: are our models valid? J Neurosurg Spine. 2018;29:611–2.

  54. König IR, Malley JD, Weimar C, Diener H-C, Ziegler A. Practical experiences on the necessity of external validation. Stat Med. 2007;26:5499–511.



Acknowledgements

Not applicable.


Funding

This research was funded via the Translational Research Access Programme (TRAP), Faculty of Health & Life Sciences, University of Liverpool, UK. The funding body approved the overall objective and design of the study, and had no role in the collection, analysis, or interpretation of data, or in the preparation of the manuscript.

Author information




Contributions

Conceptualization: CB, MW, RD, MH; Data Curation: MH; Formal Analysis: MH; Funding Acquisition: CB, MW, RD; Investigation: MH; Methodology: MH, CB; Project Administration: MH, CB; Resources: MH, MW; Software: MH; Supervision: CB; Validation: MH, CB; Visualization: MH; Writing – Original Draft Preparation: MH, CB; Writing – Review & Editing: MH, CB, MW, RD. All authors discussed the results and commented on the manuscript, and read and approved its final version.

Corresponding author

Correspondence to Christopher Brown.

Ethics declarations

Ethics approval and consent to participate

This research has been conducted in accordance with the Declaration of Helsinki and received approval from University of Liverpool Health and Life Sciences Research Ethics Committee (ref. 8224). Patients partaking in the Spine Tango registry provided written informed consent for their pseudonymised data being used for research purposes by other researchers, collected when they agreed to be included in the registry. Also considering the retrospective nature of the current study, no further informed consent was required for the use of this data. Due to confidentiality, the data are not publicly available.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1:

Methods S1. Data processing and handling of predictors.

Methods S2. Handling missing data, including: Fig. S1. Trace lines representing means and SDs of imputed values at each iteration. Fig. S2. Kernel density estimates for the marginal distributions of the observed data and the densities per variable calculated from the imputed data. Fig. S3. Each variable with missing data plotted against its propensity score for observed and imputed values. Fig. S4. Distributions of the residuals of regression models of each variable with missing data on its propensity score, grouped by observed and imputed data. Table S1. Results overview of the logistic regression models on the probability of missingness of each variable with missing data.

Methods S3. Random forest hyperparameter tuning, including: Table S2. Random forest hyperparameters.

Results S1. Participants, including: Fig. S5. The flow of surgery cases and participants through the selection process. Table S3. Characteristics of included and excluded participants on continuous and categorical factors. Fig. S6. Mean with 95% CI change in COMI, back pain, and leg pain outcomes at each follow-up interval; proportions of patients who achieved MCID on these outcomes at each follow-up interval. Table S4. Participant characteristics and differences between the development and validation samples.

Results S2. Regression diagnostics, including: Fig. S7. Scatterplots with fitted smooth loess regression lines with 95% confidence intervals, and Pearson's correlation coefficients, illustrating relationships among continuous predictors and between continuous predictors and outcomes. Fig. S8. Scatterplots with fitted smooth loess regression lines with 95% confidence intervals illustrating the relationships between each continuous predictor and the logit of outcome in the development data. Fig. S9. Diagnostic plots for logistic regression models on Minimal Clinically Important Difference (MCID) in Core Outcome Measures Index (COMI), back pain, and leg pain. Fig. S10. Component residual plots. Fig. S11. Diagnostic plots for linear regression models on change in COMI, back pain, and leg pain.

Fig. S12. Odds ratios and standardised regression coefficients with 95% confidence intervals for predictors of MCID and continuous change scores in COMI, back pain, and leg pain intensity.

Fig. S13. Discrimination ability of the logistic regression and random forest classification models when fitted to the development data for MCID in COMI, back pain, and leg pain.

Fig. S14. Calibration plots for logistic regression, linear regression, random forest (RF) classification, and RF regression models when fitted to the development data for each outcome.

Fig. S15. Flexible calibration curves fitted to ungrouped observed vs. predicted MCID probabilities for logistic regression and random forest (RF) classification on the validation data.

Fig. S16. Flexible calibration curves fitted to ungrouped observed vs. predicted change outcome values for linear and random forest (RF) regression on the validation data.

Results S3. Individual predictions formulae, including: Table S5. Results of the multivariate logistic regression models on MCID outcomes (log-odds) and multivariate linear regression models on continuous outcomes (unstandardised beta coefficients) in the development data.

Results S4. Sensitivity complete case analysis, including: Table S6. Comparison of the results from the primary regression analyses using imputed predictor data and from the sensitivity analyses using only cases with complete predictor data.

Fig. S17. Random forests relative variable importance plots for COMI, back pain, and leg pain MCID, and continuous change on these outcomes in the development data.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Halicka, M., Wilby, M., Duarte, R. et al. Predicting patient-reported outcomes following lumbar spine surgery: development and external validation of multivariable prediction models. BMC Musculoskelet Disord 24, 333 (2023).