Prognostic ability of the German version of the STarT Back tool: analysis of 12-month follow-up data from a randomized controlled trial

Background Stratified care is an up-to-date treatment approach suggested for patients with back pain in several guidelines. A comprehensively studied stratification instrument is the STarT Back Tool (SBT). It was developed to stratify patients with back pain into three subgroups, according to their risk of persistent disabling symptoms. The primary aim was to analyse the disability differences in patients with back pain 12 months after inclusion according to the subgroups determined at baseline using the German version of the SBT (STarT-G). Moreover, the potential to improve prognosis for disability by adding further predictor variables, an analysis for differences in pain intensity according to the STarT-Classification, and discriminative ability were investigated. Methods Data from the control group of a randomized controlled trial were analysed. Trial participants were members of a private medical insurance with a minimum age of 18 and indicated as having persistent back pain. Measurements were made for the risk of back pain chronification using the STarT-G, disability (as primary outcome) and back pain intensity with the Chronic Pain Grade Scale (CPGS), health-related quality of life with the SF-12, psychological distress with the Patient Health Questionnaire-4 (PHQ-4) and physical activity. Analysis of variance (ANOVA), multiple linear regression, and area under the curve (AUC) analysis were conducted. Results The mean age of the 294 participants was 53.5 (SD 8.7) years, and 38% were female. The ANOVA for disability and pain showed significant differences (p < 0.01) among the risk groups at 12 months. Post hoc Tukey tests revealed significant differences among all three risk groups for every comparison for both outcomes. AUC for STarT-G’s ability to discriminate reference standard ‘cases’ for chronic pain status at 12 months was 0.79. A prognostic model including the STarT-Classification, the variables global health, and disability at baseline explained 45% of the variance in disability at 12 months. Conclusions Disability differences in patients with back pain after a period of 12 months are in accordance with the subgroups determined using the STarT-G at baseline. Results should be confirmed in a study developed with the primary aim to investigate those differences.


Background
Back pain is one of the most prevalent symptoms encountered in primary care [1]. Its treatment is challenging for primary care practitioners, such as physiotherapists or general practitioners [2,3]. It has been discussed that determining the prognosis of back pain is a priority for research and practice and may have the potential to supersede diagnosis in its relevance [4]. For the purpose of prediction, several risk factors were identified and differentiated as being modifiable (e.g. psychological factors) or non-modifiable (e.g. pain history) [5][6][7]. Despite this knowledge, and although planning of treatment on such a basis seems ineffective, prediction in clinical practice mainly relies on experience and clinical judgement [1,8].
Several prognosis-based approaches to subgroup patients with back pain have been developed [9]. One which has demonstrated feasibility, clinical applicability and cost-effectiveness is the STarT Back Approach (STarT = Subgroups for Targeted Treatment) [10,11]. To establish the prognosis of an unfavourable treatment outcome for patients with back pain, biomedical and modifiable psychosocial factors are determined using the STarT Back Tool (SBT) [6]. With nine items and mainly dichotomous answer categories, it is easy to complete [6] (http://www.keele.ac.uk/sbst/ startbacktool/downloads/), and its utilisation is recommended by national and international guidelines [12][13][14]. Evaluation of the tool results in a biopsychosocial total score and a sub-score focusing on psychosocial constructs. A classification comprised of low-, medium-and high-risk subgroups can be derived. For groups at higher risk more severe disability is expected over time [6].
The SBT was originally developed in the UK [6]. Following an internationally agreed upon process, it was cross-culturally adapted for German-speaking countries, and several important psychometric properties were examined [15][16][17]. The German version is called STarT-G [17]. Although the properties are not in total agreement with those of the original version, they can be considered acceptable to good [16,17]. Predictive ability has been investigated internationally in a few studies, but not for the German version [18][19][20].
The SBT was specifically developed for primary care and supports physicians and therapists in the clinical decision-making process. In addition to the advantage of better quality of care, in the sense of a more patient-oriented approach, it entails possibilities for cost savings, and, as a result of the utilisation of the tool, a reduction of unnecessary treatment may lead to a reduced burden on the patient [10,11,21]. In addition to its main area of application in clinical practice, the tool could be used by stakeholders such as health insurances to provide clients with better targeted suggestions for prevention. A corresponding programme called 'initiative.rücken' ('back.initiative') was developed by a German private health insurance company. It includes treatment by an interdisciplinary network of therapists, and individual coaching by phone is offered. An evaluation study was planned in parallel with the implementation of this proactive integrated treatment programme, and among other instruments, the STarT-G was applied. However, since the predictive ability of the German version has yet to be established, the STarT-Classification was only used for the purpose of description and not to derive treatment recommendations [22].
The primary aim of the a priori planned analysis of data from a pragmatic trial in health services research was to analyse the disability differences in patients with back pain 12 months after inclusion, according to the subgroups determined at baseline using the STarT-G. Moreover, several secondary aims were set out. The potential for improving prognosis for disability by adding further psychosocial and lifestyle variables was investigated. It was analysed whether the patients' pain intensity 12 months after inclusion differs in accordance with the subgroups determined at baseline, and the properties for discriminative ability and floor and ceiling effects were determined.

Methods
Baseline and 12-month follow-up data of a Zelen randomized controlled trial (RCT) conducted in 2015 to evaluate the efficacy of a health programme for persistent back pain were analysed. Participants of the RCT were members of a German private medical insurance with a minimum age of 18 and with back pain persisting for a period of more than 3 months. For case identification, at least two entries, according to ICD-10 codes M40-M54, had to be given in the insurance database. Additionally, one case of temporary work disability in the previous 12 months was due to ICD codes M40-M54, two opioid prescriptions or data for other specified diagnoses. Inclusion and exclusion criteria can be found in Table 1 (for an extended version with category names, see Appendix 1). A random sample of eligible insured persons was randomly allocated to the trial groups prior to their giving consent. The analysed data included in this study came from those allocated to the control group, and those people were invited to participate in a survey dealing with persistent back pain while receiving usual care. In this context, 'usual care' indicates that patients received care at the behest of their treating general practitioner or specialist, and the investigators had no influence on the treatment. All persons who gave written informed consent received online questionnaires at baseline and again at 12 months after inclusion. Ethical approval for the trial was granted by the Ethics Committee of the University of Luebeck (registration ID: 14/249). Analysing the properties of the STarT-G was planned at the same time as the development of the primary RCT, which is described in more detail in Hüppe et al. [13].
The investigated instrument is the STarT-G. It consists of nine items and is used to determine an individual's prognosis related to disability. The first four items relate to biomedical factors and the remaining five identify modifiable psychosocial risk factors [6,19]. A total-score ranging from 0 to 9 points and a sub-score for the psychosocial risk factors ranging from 0 to 5 points were calculated. Patients were then allocated to one of three prognostic groups using established scoring cut-offs (low-risk: total score ≤ 3 points; medium-risk: total score > 3 and sub-score < 4 points; high-risk: sub-score ≥ 4 points). Disability and back pain intensity were measured using the Chronic Pain Grade Scale (CPGS). The CPGS disability score served as the patient's primary outcome (CPGS-DS). It shows back-pain-related disability, determined by the amount the pain interfered with daily, social and work activities and ranges from 0 to 100, with higher scores indicating more disability [23,24]. Additionally, health-related quality of life, psychological distress and physical activity were measured. For this purpose, the Short-Form-12 Physical and Mental Health Summary Scales (SF-12; range 0 to 100, higher score indicating better health) [25], the Patient Health Questionnaire-4 (PHQ-4; 0 to 12 points, higher score indicating more severe psychological distress) [26] and activity-specific items according to DEGS1 were used (dichotomized into rather inactive = 0, rather active = 1; for details see [27]).

Statistics
Descriptive statistics were calculated to characterize the study population, and completeness of follow-up data was stated. To address the primary objective of the analysis, an ANOVA using the STarT-Classification as a grouping variable and the CPGS-DS at 12 months as the outcome variable was conducted [19,23]. Tukey tests were done post hoc to analyse the mean differences among the three subgroups. Deviation measures (SD) were reported to estimate the variability of disability within the risk groups at 12 months. This procedure, without the inclusion of covariates such as age or other scores, was chosen as the primary approach because the 9-item STarT-G is simple, allowing for use in clinical practice as a single predictive instrument.
While the SBT comprises only modifiable risk factors, the literature states that inclusion of other factors could improve prediction [20,28]. Therefore, the potential to improve prognoses for disability by adding further variables was investigated as a secondary approach. This was only done for our primary outcome disability at 12 months based on the CPGS-DS as the outcome variable. The procedure encompassed two main steps. First, univariate linear regression analyses were carried out to check for, at least, a minimal dependence. Considered were disability at baseline, depression/anxiety based on the PHQ-4, patient self-prognosis for workability, physical activity and global health. All these variables were derived from the survey. STarT-Classification was included as two dummy coded variables, medium and high risk (patients with a medium and high risk coded as 1 respectively), in each case against the other two groups (coded as 0). Patients with a low-risk were used as the reference, being 0 in both dummy variables. For the variables to pass to step two (multiple linear regression analysis), a result of p ≤ 0.2 was necessary from univariate analysis. For the multiple linear regression analysis, the variables were included block-wise. First, the STarT-Classification variables were included to determine the variance explained by the STarT-G. Second, the covariates from the univariate analyses were included before the final model was determined by applying a backward stepwise method.
For pain intensity at 12 months as the outcome variable, an ANOVA was carried out as described for the primary objective. Table 1 In-and exclusion criteria checked in the insurance database Inclusion -Age ≥ 18 years -≥ 2 cases of settlement data due to ICD codes M40-M54 -Additionally, one of the following three -≥ 1 cases of temporary work disability in the past 12 months due to ICD codes M40-M54, -two opioid prescriptions or -settlement data due to one of the following ICD-10 codes: F32. To quantify discriminative ability for patients with different disability and depression/anxiety statuses based on the PHQ-4, receiver operating characteristic (ROC) curves with area under the curve (AUC) and 95% confidence interval (CI) for CPGS against STarT-G total score and PHQ-4 against STarT-G sub-score were computed for baseline and at 12 months. The CPGS was calculated following the standards given by von Korff et al. [23]. To compute ROC curves, a dichotomous reference standard is needed. To define reference standard 'cases' , the GCPS-categories established by von Korff et al. [23] were used: Grades 0, I and II (low disability) versus Grades III and IV (high disability cases). For definition of PHQ-4 cases, the predefined groups of none and mild versus moderate and strong burden were combined. Adjectives that can be used to describe AUC values have been proposed by Hosmer and Lemeshow with an AUC = 0.5 indicating 'no discrimination' , 0.7 to < 0.8 as 'acceptable discrimination' , 0.8 to 0.9 as 'excellent discrimination' and > 0.9 as 'outstanding discrimination' [29].
Additional information on the relationships among the instruments was acquired by calculating Spearman correlation coefficients for the STarT-G total and sub-scores against the CGPS-DS, the PHQ-4-score and the SF-12 sub-scores for physical function and mental health, based on the original metrically scaled variables. As stated for the original version, higher correlations were expected for the total score versus the physical aspects and for the sub-score versus the psychological aspects [6].
Floor and ceiling effects were defined as present if more than 15% of the responders achieved the lowest or highest possible STarT-G total score [30].
Statistical tests were two-sided, and a significance level of alpha = 5% was used, unless otherwise stated. The analyses were performed using IBM SPSS Statistics 25.0 (IBM, Armonk, NY, USA). Figures were produced using the R language and environment for statistical computing software version 3.4.1 [31].

Results
From the database of the private medical insurance company, a random sample of 1499 eligible members was randomly allocated to the control group and invited to take part in a survey. Of those, 294 gave informed consent and answered the baseline questionnaire, and 243 (82.7%) participated in the 12-month follow-up. The mean age of participants was 53.5 (SD 8.7) years, and 38% were female. There were no significant differences for age (p = 0.56) and gender (p = 0.12) between responders and non-responders. The mean STarT-G total and sub-scores were 3.2 (SD 2.3) and 1.3 (SD 1.4), respectively. The risk group distribution was 62.6% for low, 27.6% for medium and 9.9% for high risk. Further baseline values are given in Table 2.
The ANOVA for disability at 12 months indicated significant differences (df total = 242, F = 51.7, p < 0.001) among the groups (Fig. 1). Post hoc Tukey tests revealed significant differences among all three risk groups for every comparison (Table 3).
For the multivariable regression analysis, four variables were adopted along with the STarT-Classification variables. Following the univariate step the p-values of the variables patient self-prognosis for workability, global health, disability at baseline and depression/anxiety, satisfied the set threshold of p ≤ 0.2 ( Table 4). As a result of the first block of the regression analysis with only the STarT-Classification variables included, 28% of the variance in disability at 12 months was explained (adjusted R 2 ). During the backward stepwise procedure, subjective work prognosis and depression/anxiety were excluded. The final model included the STarT-Classification variables, global health and disability at baseline (Table 5). ANOVA for the final regression model resulted in p < 0.001. The model predicted 45% of the variance in disability at 12 months (adjusted R 2 ). The resulting model reads as follows: Disability at 12 months (CPGS-DS) = -5.61 + 5.59*STarT group medium risk + 14.26*STarT group high risk + 0.41*Disability at baseline + 6.44*General health. Semi-partial correlation coefficients ranged from 0.08 to 0.28.
The ANOVA for pain intensity at 12 months indicated significant differences between the risk groups (df total = 242, F = 50.3, p < 0.001). Post hoc Tukey tests revealed significant differences for every comparison among all three risk groups ( Table 3).
Spearman coefficients for the STarT-G total and sub-score versus CPGS-DS, SF-12 Physical Health, SF-12 Mental Health and PHQ-4 scores are given in Table 6 (for graphical relationships between the STarT-G total score and CGPS-DS and between the STarT-G sub-score and PHQ-4, see Appendix 2).

Discussion
In this study, data from nearly 300 patients were analysed. Used on its own, the STarT-Classification was of great value to predict differences in back-related disability 12 months  after recruitment. By including additional variables in a prognostic model, it was possible to explain nearly half of the variance in disability at 12 months. Moreover, properties already determined in previous studies were confirmed, in part, with even higher coefficient values in the present sample [18,19]. Primary users of the SBT are physiotherapists and primary care physicians [2,3,10]. In addition, utilisation of the tool by health insurance companies might have the potential for a more targeted disbursement of resources. Knowledge from previous research about the psychometric properties of the tool have been given, although not for its prognostic ability [18,19]. Since such information is vital for clinicians [20], as well as for institutions, the given results have the potential to foster effective implementation. For health insurance companies, the presented analyses indicate possibilities for utilization of the instrument in their field.
Comparing the risk group distribution from the present study, which had many low-risk and a few high-risk patients, with those from other studies conducted in German-speaking countries, differences can be observed [6,18,19]. Various factors might influence the distribution. One might be the time point of administration [32][33][34]. Beneciuk et al. determined that more than three-quarters of high-risk patients changed categories after 4 weeks [33]. An alternative explanation for the differences in risk group distribution might be the recruitment strategy used for the study to provide the data analysed in this article. Participants were selected from the database of a health insurance and not in routine care.
The strength of the SBT is that the prognosis can be derived in clinical practice without gathering covariates. This approach worked well in this study's chosen sample and is, therefore, in line with both the results from the developers' external sample and another study including patients with chronic complaints [6,21]. In contrast, Kongsted et al. used the SBT without covariates and described low accuracies for the Danish version of the tool [8]. In the present study, multiple linear regression analysis was only used as a secondary approach, but inclusion of covariates considerably improved prediction leading to an explanation of nearly half of the variance. Toh et al. stated that covariates were needed to obtain a better predictive model for pain scores at follow-up, whereas Medeiros et al. directly chose to include covariates, resulting in no predictive capability when using the baseline STarT-Classification [28,35]. Consequently, the effect of including covariates needs further examination, while also remembering that the aim is to keep prediction as simple as possible for practice by focusing on modifiable factors.  A consideration of the results from the regression analysis revealed that other factors, as well as the STarT-classification, provided predictive information; here, these were disability at baseline and general health. The relevance of baseline scores is a common phenomenon in the literature [7,32,33] and is underlined by the determined semi-partial correlation coefficient in this study. In regard to general health, one simple item was identified which should be kept in mind for future predictive models. Also, the result indicating that the baseline score is a prominent predictor agrees with results from other studies [7,32]. In the presented analysis, the STarT-Classification was used. Other researchers discuss the potential of the sub-score which might lead to different conclusions [28]. To improve predictive ability, different researchers have suggested a repeated application of the SBT, e.g. pre-and post-treatment, including the notion of change of the STarT risk group [33,35].
Our work focused on disability being one of the core domains for patients with back pain [36], and pain intensity was chosen as a secondary outcome. This is in line with the strength of the SBT [20], but it has to be kept in mind, for example, when working with patients with pain as the major complaint, since predictive factors may differ between pain and disability [32].
The correlations determined in the presented analyses also showed similarities to results for the original version of the SBT. Stronger correlations were determined between the total score and physical measures and the sub-scores and psychological measures rather than vice versa [6].  The STarT Back approach depends on the SBT. To conduct stratified care, therapists should be upskilled through a training course in order to successfully address the complex needs of high-risk patients through the delivery of 'psychologically informed physiotherapy' [37]. Evidence suggests that physiotherapists trained in this manner are effective in managing around 85% of this high-risk complex patient group, but training might still be helpful to encourage therapists [38][39][40][41][42].

Strengths and weaknesses
For this article, data from a control group of an RCT were analysed. In the literature, several advantages and disadvantages are described for such a proceeding [43]. An important aspect that reduced information bias was that the analysis of the STarT-G properties was intended from the design period forward: the analyses were planned by SK and KH before receiving access to the data.
The questionnaires chosen for the study were comprehensively validated, and established cut-offs were available [26,45,46]. On the other hand, the study sample was specifically selected. Only patients from one private health insurance were included, and, although approximately only 10% of the population are similarly insured [47], with nearly 350,000 customers, the company is large enough to represent a relevant group of people. Fewer than a fifth of the patients were lost to follow-up. This does not exceed the benchmark of 30% set for long term follow-up by the Cochrane Back and Neck Group. However, since the benchmark is set arbitrary, bias is still possible [48].
Patients with nonspecific complaints is the group which the SBT targets. In the present study, inclusion of the patients was conducted on the grounds of a search in the database of the insurance company using ICD-codes. Since a comprehensive range of diagnosis was undertaken by using codes M40 to M54, the possibility that patients with serious complaints had been included cannot be discounted. Nevertheless, inclusion of the patients for the development study of the SBT was also based on a computerized search. In the latter study, as in ours, red flag diagnoses, such as cancer, were defined as exclusion criteria [6,44]. Despite the possibility of a heterogeneous sample being analysed, it is notable that the tool still demonstrated acceptable properties.
In various studies on the SBT, missing data were identified as a challenge when using the instrument in research. For approximately up to one-tenth of the patients in this previous work, it was not possible to determine the risk group [19,49]. In the present study, it was possible to identify the risk group for all participants because there were no missing values, simply because the online tool used did not accept unanswered questions.

Conclusions
Differences in the disability of patients with back pain after a period of 12 months agree with the subgroup classification determined by using the STarT-G at baseline. By adding prognostic variables to the STarT-Classification in a prognostic model, it is possible to explain nearly half of the variance observed in disability at the 12-month follow-up. Considering this information, the instrument can be used more purposefully by practitioners. Further studies to examine the predictive ability and timing of the application of the STarT-G in clinical practice and when it would be best for insurers to implement it should be conducted. Table 7 In-and exclusion criteria checked in the insurance database, version including ICD category names