A cohort study with a two-week retest was conducted. Patients with LBP were recruited from primary care (general practices and orthopaedic practices, both with free access to care). The first questionnaire (t0) was completed in the practice before the consultation, and the second questionnaire (t1) was sent by post 10 days later. Patients who did not respond to the postal questionnaire were contacted by telephone.
Nine general practices and two orthopaedic practices participated; eight were single-handed and three had more than one physician. Before patients were recruited, each practice received training from the study team covering:
- An introduction to the STarT Back Tool,
- The informed consent procedure,
- Information about the data collection procedure,
- Information on transferring collected data from the practice to the study center,
- Information about study reimbursement.
Patients aged 18 to 60 years with non-specific LBP were eligible for inclusion. Low back pain was defined as specific if a patient had cauda equina syndrome, an inflammatory disorder such as ankylosing spondylitis, or a suspected serious pathology such as a tumor or vertebral fracture. No restrictions were placed on the duration of a patient’s back pain symptoms. Patients were excluded if they had consulted the physician within the last twelve weeks, had undergone spinal surgery within the last six months, or were unable to complete the study questionnaires due to poor German language skills. Anonymized information on eligible patients’ age and gender was obtained regardless of study participation (“consent list”).
The retest material was sent from the study center to patients 10 days after the baseline assessment. This interval was chosen to counter memory effects. Since health status was likely to change in at least some patients, an additional question was added asking patients whether, in their own estimation, their complaints had changed over this period [18]. Patients who did not respond to the postal questionnaire within two weeks were telephoned and reminded to return the questionnaire, or alternatively asked to answer a limited set of questions. The retest process was managed with a purpose-built database to ensure that the predefined time intervals were maintained.
Ethical approval was granted by the Ethics Committee of the University of Heidelberg (registration ID: S-414/2013). All patients gave their written informed consent for participation before entering the study in the participating practice.
Instruments
In addition to the STarT-G, several validated German versions of reference standard instruments were included in the study questionnaire. Disability was operationalized using the Roland and Morris Disability Questionnaire (RMDQ) [19], fear avoidance beliefs with the 17-item version of the Tampa Scale of Kinesiophobia (TSK) [20], catastrophizing with the Pain Catastrophizing Scale (PCS) [21] and depression with the Hospital Anxiety and Depression Scale (HADS) [22]. Pain intensity was measured using the mean of three eleven-point box scales for least, average (over the previous two weeks), and current pain [23, 24]. Standardized questions were used to document the patients’ age, gender and body mass index (BMI), type of employment, days off work due to LBP and the duration of the back pain episode [25, 26].
The wording of two questions of the STarT-G was slightly modified to lower their item difficulty. Because items 5 and 8 showed very high difficulty in the first study conducted in Switzerland, they were reworded in agreement with the developers of the SBT [17]. The STarT-G can be obtained from the authors via email.
The definitions for reference standard cases were catastrophizing (PCS score ≥ 20), fear (TSK score ≥ 41), depression (HADS-D score ≥ 8) and disability (RMDQ score ≥ 7) [11, 22]. Furthermore, a composite reference standard (CRS; ‘distress’) was determined, defined by individuals who were a ‘case’ simultaneously on the three psychosocial reference standard questionnaires: TSK, PCS and HADS depression. Following pretesting with selected LBP patients, the estimated time to complete the entire study questionnaire was 15 minutes.
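The case definitions above can be expressed as a short sketch. The cut-off values are those stated in the text; the function name, dictionary keys and example scores are hypothetical and chosen only for illustration.

```python
# Cut-offs as stated in the text: a patient is a 'case' at or above the value.
CUTOFFS = {"pcs": 20, "tsk": 41, "hads_d": 8, "rmdq": 7}


def classify_cases(scores):
    """Return case status per reference standard plus the composite 'distress'.

    `scores` is a dict with the (hypothetical) keys pcs, tsk, hads_d and rmdq.
    """
    cases = {name: scores[name] >= cutoff for name, cutoff in CUTOFFS.items()}
    # Composite reference standard: case on TSK, PCS and HADS depression at once.
    cases["distress"] = cases["tsk"] and cases["pcs"] and cases["hads_d"]
    return cases


# Illustrative patient (made-up scores, not study data).
example = classify_cases({"pcs": 25, "tsk": 41, "hads_d": 8, "rmdq": 3})
print(example)
```

Note that disability (RMDQ) does not enter the composite; only the three psychosocial instruments do.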
Statistical analyses
Descriptive statistics were calculated to characterize the study population. The baseline characteristics of study participants were described to allow interpretability of the study sample, together with data about drop-outs, missing data and recruitment rate.
Discriminative ability was assessed by computing receiver operating characteristic curves with areas under the curves (AUC) and 95 % confidence intervals (CI). Consistent with the original validation of the English SBT, this was done for disability, catastrophising and distress [27]. Adjectives for describing AUC values have been proposed by Hosmer and Lemeshow, with an AUC = 0.5 suggesting ‘no discrimination’, 0.7 to < 0.8 considered ‘acceptable discrimination’, 0.8 to < 0.9 considered ‘excellent discrimination’ and ≥ 0.9 considered ‘outstanding discrimination’ [28]. To determine if a patient was a ‘case’ on reference standard instruments, the individual’s scores were compared to the cut-off values given under the subheading Instruments (see “definitions for reference standard cases”). Since the CI determined by Hill et al. did not fall short of AUC = 0.7 [10], equivalence was expected if the lower CI bound did not fall short of the same cut-off.
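As an illustration of the AUC criterion, the following minimal sketch computes the AUC as the Mann-Whitney probability that a case scores higher than a non-case, and attaches an approximate 95 % CI. The text does not state which CI method was used; the Hanley-McNeil standard error used here is an assumption, as are all function names and the handling of values that lie exactly on a band boundary.

```python
import math


def auc_mann_whitney(scores, is_case):
    """AUC as the share of case/non-case pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, c in zip(scores, is_case) if c]
    neg = [s for s, c in zip(scores, is_case) if not c]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg)), len(pos), len(neg)


def hanley_mcneil_ci(auc, n_pos, n_neg, z=1.96):
    """Approximate CI from the Hanley-McNeil standard error (assumed method)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    se = math.sqrt(max(var, 0.0))
    return max(0.0, auc - z * se), min(1.0, auc + z * se)


def describe_auc(auc):
    """Descriptive bands as cited in the text; boundary values go to the higher band."""
    if auc >= 0.9:
        return "outstanding discrimination"
    if auc >= 0.8:
        return "excellent discrimination"
    if auc >= 0.7:
        return "acceptable discrimination"
    return "below acceptable"  # wording for this band is not taken from the text


def meets_equivalence(ci_lower, cutoff=0.7):
    """Equivalence criterion from the text: lower CI bound not below 0.7."""
    return ci_lower >= cutoff
```

With real data, `scores` would hold the STarT-G scores and `is_case` the case status on one reference standard.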
In addition to the AUC, and to help interpret the relations between the instruments, Spearman’s correlation coefficients were calculated between the STarT-G total and subscale scores and the RMDQ, TSK, PCS and HADS depression scores, consistent with the approach of the original SBT authors.
To test whether the psychosocial subscale could be regarded as one factor, a principal components analysis was undertaken. In general, the loadings of at least four items should exceed 0.6 [29]. For the original version of the SBT, factor loadings between 0.6 and 0.8 were calculated; therefore, equivalence was expected if the STarT-G loadings exceeded 0.6 for these five psychosocial items.
To determine internal consistency and item redundancy for the psychosocial subscale, Cronbach’s alpha was calculated (poor internal consistency was defined as α < 0.70, item redundancy as α > 0.90) [30]. Since the original SBT validation study reported values ranging between 0.7 and 0.9, equivalence was expected if alpha was within this same range.
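The internal consistency check can be sketched with the standard formula for Cronbach’s alpha, α = k/(k − 1) · (1 − Σ item variances / variance of the total score). The function names and the tiny data set are hypothetical; the thresholds are those given in the text.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def interpret_alpha(alpha):
    """Apply the thresholds stated in the text."""
    if alpha < 0.70:
        return "poor internal consistency"
    if alpha > 0.90:
        return "item redundancy"
    return "within the expected range"


# Made-up answers of six respondents to five dichotomous items (not study data).
responses = [
    [1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
]
print(round(cronbach_alpha(responses), 2))
```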
To investigate test-retest reliability, Cohen’s quadratic weighted Kappa was calculated for the overall and subscale scores [31]. Because health status was expected to change between t0 and t1 in at least some patients, and because the STarT-G is responsive, test-retest calculations were limited to patients who self-reported their health problems to be unchanged between the two time points [32]. A Kappa between 0.6 and 0.8 was defined as good agreement. The values of 0.79 for the SBT total score and 0.76 for the subscale score calculated by Hill et al. lay within this range [10]. Therefore, equivalence was expected with a Kappa of > 0.6. A sensitivity analysis excluding retest data gathered via telephone was planned.
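The test-retest procedure can be sketched as follows: restrict the sample to patients reporting unchanged complaints, then compute the quadratic weighted Kappa with disagreement weights (i − j)² / (k − 1)². This is a minimal illustration, not the R code used in the study; the function names and the integer coding of scores from 0 to k − 1 are assumptions.

```python
def stable_pairs(t0, t1, unchanged):
    """Keep (t0, t1) score pairs of patients who reported unchanged complaints."""
    return [(a, b) for a, b, u in zip(t0, t1, unchanged) if u]


def quadratic_weighted_kappa(pairs, n_categories):
    """Cohen's Kappa with quadratic weights; scores are integers in 0..k-1."""
    k = n_categories
    n = len(pairs)
    observed = [[0] * k for _ in range(k)]
    for a, b in pairs:
        observed[a][b] += 1
    row_totals = [sum(observed[i]) for i in range(k)]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    if weighted_expected == 0:
        raise ValueError("Kappa is undefined when all ratings fall in one category")
    return 1 - weighted_observed / weighted_expected
```

A sensitivity analysis of the kind described would call the same functions after additionally dropping pairs whose retest was collected by telephone.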
Floor and ceiling effects were defined as present if more than 15 % of the responders achieved the lowest or highest possible STarT-G total score [33].
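The 15 % rule can be checked with a few lines of code. In this sketch the score range of 0 to 9 is an assumption based on the nine-item tool and is not stated in the text; the function name and example scores are hypothetical.

```python
def floor_ceiling_effects(total_scores, min_score=0, max_score=9, threshold=0.15):
    """Share of responders at the lowest/highest score and whether it exceeds 15 %.

    min_score and max_score are assumed defaults, not values from the text.
    """
    n = len(total_scores)
    floor_share = sum(1 for s in total_scores if s == min_score) / n
    ceiling_share = sum(1 for s in total_scores if s == max_score) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share > threshold,
        "ceiling_effect": ceiling_share > threshold,
    }


# Made-up total scores of ten responders (not study data).
print(floor_ceiling_effects([0, 0, 1, 2, 3, 4, 5, 6, 7, 9]))
```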
All statistical tests were two-sided and a significance level of alpha = 5 % was used. Analyses were performed using SPSS version 20.0, except for principal component analyses and Kappa calculations, which were performed using the R language and environment for statistical computing, version 3.1.1 [34].
Sample size
Principal component analysis was expected to be the procedure requiring the largest sample size. The sample size was calculated with the formula given by Bortz and Schuster [29]. With a minimum expected factor loading of 0.4 and a stability of 0.9, the required sample size was n = 180. The sample size of 200 defined for the original SBT validation study was therefore considered sufficient [10].
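The formula itself is not reproduced in the text. As a hedged illustration, the sketch below assumes the factor-stability approximation attributed to Guadagnoli and Velicer, FS = 1 − (1.10 / √n − 0.12 · a + 0.066), where a is the minimum factor loading; solved for n, this assumed formula reproduces the stated n = 180. The function name is hypothetical.

```python
import math


def required_sample_size(min_loading, stability):
    """Solve FS = 1 - (1.10 / sqrt(n) - 0.12 * a + 0.066) for n (assumed formula)."""
    denominator = (1.0 - stability) - 0.066 + 0.12 * min_loading
    if denominator <= 0:
        raise ValueError("No finite sample size reaches this stability")
    return math.ceil((1.10 / denominator) ** 2)


# Values from the text: minimum loading 0.4, stability 0.9.
print(required_sample_size(0.4, 0.9))
```

Under this assumption, lower minimum loadings or higher target stability increase the required n sharply, which is consistent with treating the principal component analysis as the sample-size-driving procedure.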