Clinical classification in low back pain: best-evidence diagnostic rules based on systematic reviews

Background: Clinical examination findings are used in primary care to give an initial diagnosis to patients with low back pain and related leg symptoms. The purpose of this study was to develop best-evidence Clinical Diagnostic Rules (CDRs) for the identification of the most common patho-anatomical disorders in the lumbar spine, i.e. intervertebral discs, sacroiliac joints, facet joints, bone, muscles, nerve roots, peripheral nerve tissue, and central nervous system sensitization.
Methods: A sensitive electronic search strategy using MEDLINE, EMBASE and CINAHL databases was combined with hand searching and citation tracking to identify eligible studies. Criteria for inclusion were: persons with low back pain with or without related leg symptoms, history or physical examination findings suitable for use in primary care, comparison with acceptable reference standards, and statistical reporting permitting calculation of diagnostic value. Quality assessments were made independently by two reviewers using the Quality Assessment of Diagnostic Accuracy Studies tool. Clinical examination findings that were investigated by at least two studies were included, and results that met our predefined threshold of positive likelihood ratio ≥ 2 or negative likelihood ratio ≤ 0.5 were considered for the CDR.
Results: Sixty-four studies satisfied our eligibility criteria. We were able to construct promising CDRs for symptomatic intervertebral disc, sacroiliac joint, spondylolisthesis, disc herniation with nerve root involvement, and spinal stenosis. Single clinical tests appear not to be as useful as clusters of tests that are more closely in line with clinical decision making.
Conclusions: This is the first comprehensive systematic review of diagnostic accuracy studies that evaluate clinical examination findings for their ability to identify the most common patho-anatomical disorders in the lumbar spine. In some diagnostic categories we have sufficient evidence to recommend a CDR. In others, we have only preliminary evidence that needs testing in future studies. Most findings were tested in secondary or tertiary care. Thus, the accuracy of the findings in a primary care setting has yet to be confirmed.
Electronic supplementary material: The online version of this article (doi:10.1186/s12891-017-1549-6) contains supplementary material, which is available to authorized users.


Background
Identifying diagnostic, prognostic and treatment orientated subgroups of patients with low back pain (LBP) has been on the research agenda for many years [1,2]. Diagnostic reasoning with a structural/pathoanatomical focus is common among clinicians [3], and it is regarded as an essential component of the biopsychosocial model [4][5][6]. Within this model, emphasis has been on the role of psychosocial considerations and how these factors can interfere with recovery. Indeed, there is good quality evidence for the predictive value of a set of psychosocial factors for poorer outcome in patients with LBP [7,8]. These factors are multifactorial, interrelated, and only weakly associated with the development and prognosis of LBP [9]. This might be one of the explanations why the effects of treatments targeting those risk factors have been reported to be small and mostly short term, with little evidence that psychosocial treatments are superior to other active treatments [7,10].
Maybe it is time to swing the pendulum towards the "bio" in the biopsychosocial model. There are many examples in medicine where the pathology has been identified prior to any effective treatments being developed making it an ongoing challenge to generate new diagnostic knowledge on which to base more effective treatment strategies in the future. Alongside clinicians, many researchers within the field of LBP feel that choosing the most effective treatment for the individual patient is not possible without better understanding of the biological component of the biopsychosocial model [4].
In 2003 the present authors suggested a diagnostic LBP classification system based on a review of the literature [11,12]. This system has been fully or partly used in prognostic and outcome studies by other research groups [13][14][15]. The present study is driven by the obvious need for an update based on recent evidence. The relevance of an updated diagnostic classification is as follows: First, diagnostic patterns of signs and symptoms from history and physical examination may assist the clinician in explaining the origin of pain to the patient and in directing treatment at the painful structure. Patients with persistent LBP often have misconceptions about what is going on [16], and may have been given all sorts of speculative explanations for their symptoms, resulting in anxiety and confusion. These patients often seek an explanation about what is wrong [17], and new evidence suggests that offering clear explanations and information about aetiology, prognosis and interventions may improve patient outcomes [7]. Giving an explanation based on best evidence may contribute to 1) reducing the patient's confusion and conceptual chaos, 2) reassurance that the clinician knows what is going on, 3) visualizing the potential benefit of treatment directed at the painful structure (mental imagery has been suggested to have potential in pain management [18,19]), and 4) provided that the above efforts are successful, motivating the patient to open a therapeutic window.
Second, the need for studies testing the effect of treatment strategies for subgroups of patients with LBP in primary care has been emphasized in consensus-papers [1,20] as well as current European guidelines [21]. Targeting treatment to classifications merely based on prognostic patient characteristics has not been convincingly successful in finding treatment modalities that are more beneficial than others [22]. A diagnostic classification may assist in generating hypotheses as to which treatment modalities are more likely to target the pain source for future testing in randomized trials.
Finally, an evidence-based clinical diagnosis with acceptable accuracy will reduce the need for invasive or expensive diagnostic methods (often with substantial waiting time and expense).
The focus of this review is to outline the diagnostic value of signs and symptoms for use in primary care without access to confirmatory paraclinical methods. The clinician must not mislead the patient, so it is important to distinguish between diagnostic labels that can be given to patients with reasonable confidence and those only suggesting suspected best evidence pathoanatomy. Therefore, it is of interest to identify signs and symptoms with the potential to diagnose common sources and causes of LBP i.e. intervertebral discs, sacroiliac joints, facet joints, bones, nerve roots, muscles, peripheral nerve tissue, and central nervous system sensitization.
Throughout this review, we use the term Clinical Diagnostic Rule (CDR) meaning that we have applied a clinical decision rule to the field of clinical diagnostics. A clinical decision rule "is a clinical tool that quantifies the individual contributions that various components of the history, physical examination, and basic laboratory results make toward the diagnosis, prognosis, or likely response to treatment in a patient. Clinical decision rules attempt to formally test, simplify, and increase the accuracy of clinicians' diagnostic and prognostic assessments" [23].
The aim of this paper was to develop multi-faceted Clinical Diagnostic Rules (CDRs) for the lumbar spine using individual diagnostic accuracy scores based on best evidence for use in primary care clinical practice and research. If possible, single clinical examination findings would be clustered in CDRs based on well-defined criteria.

Methods
The reporting of this review was based on the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) statement [24].

Eligibility criteria and study selection
To be included, studies were required to meet the following criteria: 1) participants had LBP with or without leg pain; 2) an appropriate reference standard, as listed in Table 1, was used; 3) at least one clinical finding available to primary care clinicians was evaluated; and 4) data were presented that enabled calculation of sensitivity and specificity.
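As a hedged illustration of the fourth criterion, the sketch below shows how sensitivity and specificity follow from a 2x2 comparison of an index test against a reference standard. The class name `TwoByTwo`, the function names, and the counts are hypothetical and are not taken from any included study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from comparing an index test with a reference standard."""
    tp: int  # index test positive, reference standard positive
    fp: int  # index test positive, reference standard negative
    fn: int  # index test negative, reference standard positive
    tn: int  # index test negative, reference standard negative


def sensitivity(t: TwoByTwo) -> float:
    """Proportion of reference-positive patients with a positive index test."""
    return t.tp / (t.tp + t.fn)


def specificity(t: TwoByTwo) -> float:
    """Proportion of reference-negative patients with a negative index test."""
    return t.tn / (t.tn + t.fp)


# Hypothetical counts, for illustration only.
table = TwoByTwo(tp=30, fp=10, fn=20, tn=40)
print(sensitivity(table), specificity(table))
```

A study that reports all four cells, or sensitivity and specificity together with group sizes, would permit this calculation.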
For some diagnostic categories, recent systematic reviews were found covering our topic. These were included if they complied with the principles recommended by the Cochrane Collaboration [25]. In other categories, where searches in included systematic reviews were terminated before 2011, our searches were performed up to May 2015 from the date where the search of those reviews was terminated. In categories where no systematic reviews were found, we conducted systematic searches in the electronic databases PubMed, Embase, and CINAHL. Details of the search strategy are presented in Additional files 1, 2, 3 and 4. One of the authors (TP) reviewed the search results from the databases (titles and abstracts). Any titles and abstracts from studies that appeared to compare the results of clinical examination findings on patients with LBP with those of diagnostic reference standards were selected for full text review. Reference lists of selected studies were reviewed for additional studies. If necessary, authors were contacted for clarification of unclear reporting. The data extraction from the selected studies was prepared by one author (TP) and the second author (ML) reviewed the complete data extraction form for accuracy. Any disagreements were resolved by discussion. In diagnostic findings where no studies presenting sensitivity and specificity were found, studies presenting predictive values (sensitivity only) were included. We extracted values of diagnostic accuracy for clinical examination findings that were investigated by at least two studies.

Reference standards
In this review, we used the best available reference standards for diagnosis of the relevant source and cause of LBP. See Table 1. Index tests results were reported if they were investigated in at least two studies using the best available reference standard.

Quality assessment
Original studies were retrieved in full text and independently scored for quality and risk of bias using Quality Assessment of Diagnostic Accuracy Studies (QUADAS) in accordance with the recommendations of the Cochrane Handbook for Systematic Reviews of DTA [26]. Any disagreements were resolved by discussion. In a few cases, one of the present authors was a coauthor of a paper, or we were not able to acquire the original papers included in previous reviews. In these cases the results of QUADAS were transferred from the review in question to the present paper.

Grading of recommendations
There is currently no consensus regarding criteria to assess the quality of evidence of diagnostic tests [27]. In this study, diagnostic values that were in agreement in more than two thirds of studies were included in our final recommendations. Downgrading of recommendations from strong to weak was made in cases with serious risk of bias due to verification bias, partial verification bias, differential verification, incorporation bias, or test review bias.

Diagnostic accuracy measures
Table 1 Reference standards by diagnostic category
Intervertebral disc: Provocation discography with control disc verification [171]
Facet joint: Double block procedure in joint space or at nerve supply [148]
Sacroiliac joint: Double block procedure in joint space [172]
Nerve root involvement: Magnetic resonance imaging, myelography, or surgical findings with or without clinical findings [173]
Spinal stenosis: Expert opinion based on radiographs, magnetic resonance imaging or surgical findings with or without clinical findings [75,174]
Spondylolisthesis: Sagittal plane rotation or translation movement on functional radiograph or translation on static radiograph [152,155]
Fracture: Radiographs, computed tomography or magnetic resonance imaging [155]
Myofascial structures: None available
Peripheral nerve: None available
Central sensitization: Expert consensus

In order to be clinically useful, we considered the cut-off for a clinical finding to rule in the disorder to be a positive likelihood ratio (LR) above 2.0 [28], meaning that a positive index test will at least double the ratio of having the disorder compared to not having the disorder. This means that if the pretest probability is 0.3, the pretest odds is 0.3/0.7 = 0.43 and if the LR is 2.0 the posttest odds is 2*0.43 = 0.86 and the posttest probability can then be estimated to 0.46. For a useful clinical finding to rule out the disorder, we considered the cut-off to be a negative LR below 0.5 [28], meaning that a negative index test will reduce the odds of having the disorder at least by half compared to not having the disorder. Overall, the change from pretest to posttest chance of having the disorder in question depends on the pretest probability.
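The conversion from pretest to posttest probability described above can be sketched as follows. This is a minimal illustration of the standard odds form of Bayes' theorem; the function name `posttest_probability` is our own and is not part of the review.

```python
def posttest_probability(pretest_probability: float, likelihood_ratio: float) -> float:
    """Convert a pretest probability to a posttest probability via odds."""
    if not 0.0 < pretest_probability < 1.0:
        raise ValueError("pretest probability must lie strictly between 0 and 1")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood ratio must be non-negative")
    pretest_odds = pretest_probability / (1.0 - pretest_probability)
    posttest_odds = likelihood_ratio * pretest_odds
    return posttest_odds / (1.0 + posttest_odds)


# Worked example from the text: pretest probability 0.3 and LR 2.0.
rule_in = posttest_probability(0.3, 2.0)
# The same pretest probability with a negative LR of 0.5.
rule_out = posttest_probability(0.3, 0.5)
print(round(rule_in, 2), round(rule_out, 2))
```

With a pretest probability of 0.3, an LR of 2.0 gives a posttest probability of about 0.46, and an LR of 0.5 gives about 0.18, which shows why the clinical gain of a finding depends on the starting prevalence.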
In summary, clinical examination findings that were investigated by at least two studies were included. Diagnostic values that were in agreement in more than two thirds of studies and met our predefined threshold of positive likelihood ratio ≥ 2 or negative likelihood ratio ≤ 0.5 were considered for the CDR.
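The selection logic summarized above might be expressed as in the sketch below. It rests on an assumption that is ours, not the authors': "in agreement" is read as the proportion of studies whose LR meets the threshold. The function names and the example values are hypothetical.

```python
from typing import Sequence, Tuple

POSITIVE_LR_THRESHOLD = 2.0  # rule-in cut-off stated in the text
NEGATIVE_LR_THRESHOLD = 0.5  # rule-out cut-off stated in the text


def likelihood_ratios(sensitivity: float, specificity: float) -> Tuple[float, float]:
    """Return (positive LR, negative LR) from sensitivity and specificity."""
    if not (0.0 <= sensitivity <= 1.0 and 0.0 < specificity < 1.0):
        raise ValueError("need 0 <= sensitivity <= 1 and 0 < specificity < 1")
    return sensitivity / (1.0 - specificity), (1.0 - sensitivity) / specificity


def considered_for_cdr(lr_values: Sequence[float], kind: str) -> bool:
    """Apply the two-study minimum and the two-thirds agreement rule.

    Assumption: a study 'agrees' when its LR meets the relevant threshold.
    """
    if len(lr_values) < 2:
        return False
    if kind == "positive":
        meets = [lr >= POSITIVE_LR_THRESHOLD for lr in lr_values]
    elif kind == "negative":
        meets = [lr <= NEGATIVE_LR_THRESHOLD for lr in lr_values]
    else:
        raise ValueError("kind must be 'positive' or 'negative'")
    # Integer comparison avoids floating-point issues with 2/3.
    return 3 * sum(meets) > 2 * len(meets)


# Hypothetical positive LRs from three studies of one finding.
print(considered_for_cdr([2.4, 5.0, 3.1], "positive"))
```

Under this reading, a finding with two of three studies meeting the threshold would not qualify, because two thirds is not more than two thirds.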

Statistics
A meta-analysis was considered if evidence of clinical homogeneity could be established. Clinical heterogeneity was assessed by comparing the similarity of patient samples, performance of tests, and reference standards. However, a qualitative synthesis of studies according to principles of best-evidence synthesis [29] was performed if studies were clinically heterogeneous. Table 2 outlines the findings in each of the diagnostic categories that are supported by more than one study. Characteristics of the included studies are presented in Additional file 5. Results of the quality assessments are presented in Additional file 6. Results of the searches of the literature are presented in Additional files 7, 8, 9, 10, 1, 2, 3 and 4.

Results
Because of heterogeneous study populations, performance of index tests, and choice of reference standards, only descriptive statistics were used to summarize findings across studies. The diagnostic value of findings in each category is presented below.

Intervertebral disc
A previous systematic review of clinical diagnosis of lumbar intervertebral discs (ID) terminated the literature search at February 2006 [30]. Therefore, databases were searched by the present authors from that date up to May 2015. The results of the search are presented in Additional file 7. Three studies [31][32][33] from the Hancock review and one study [34] from our updated search were included (Table 2).
The evidence is sufficient to constitute a Clinical Diagnostic Rule (CDR). We recommend the use of centralization of symptoms during physical examination. Two studies using strict criteria for centralization (change of pain in the furthermost whole body region) reported high levels of positive LR [32,33], meaning that a positive test is useful for ruling in the diagnosis. One study used less strict criteria for centralization (change in any furthermost extent of pain) [31]. However, a positive LR of 2.1 even in this study indicates the presence of relatively few false positive tests.

Facet joint
A previous systematic review of clinical diagnosis of facet joints (FJ) terminated the literature search at February 2006 [30]. The current search started from that date up to May 2015. The results are presented in Additional file 7. Seven studies [32,[35][36][37][38][39][40] from the Hancock review and three studies [41][42][43] from our updated search were included in this review ( Table 2).
The evidence is insufficient to constitute a CDR. No studies supporting Revel's suggested rule [35] or part thereof were identified.
The only negative findings from studies with single block reference standards that appeared potentially useful for ruling out FJ pain were centralization [32,39] and no relief with recumbency [37,38].

Sacroiliac joint
A previous systematic review of clinical diagnosis of sacroiliac joints (SIJ) terminated the literature search at February 2006 [30]. The current search started from that date up to May 2015. Results are presented in Additional file 7. Four studies [32,[44][45][46] from the Hancock review and three studies [47][48][49] from our updated search were included (Table 2).
The evidence is sufficient to constitute a CDR. We recommend the use of the Laslett rule [44] comprising at least 3 positive out of 5 of the following findings from physical examination: distraction, compression, thigh thrust, Gaenslen's test, or sacral thrust.
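A minimal sketch of the "at least 3 positive out of 5" count for the five provocation tests named above is given below. The dictionary keys and the function name are hypothetical labels chosen for illustration, and a test that was not performed is treated as not positive, which is our assumption.

```python
from typing import Mapping

# The five provocation tests listed in the text (keys are hypothetical labels).
SIJ_PROVOCATION_TESTS = (
    "distraction",
    "compression",
    "thigh_thrust",
    "gaenslen",
    "sacral_thrust",
)


def laslett_rule_positive(findings: Mapping[str, bool]) -> bool:
    """Return True when at least 3 of the 5 provocation tests are positive.

    Assumption: tests missing from `findings` count as not positive.
    """
    positives = sum(bool(findings.get(test, False)) for test in SIJ_PROVOCATION_TESTS)
    return positives >= 3


# Hypothetical examination result.
exam = {"distraction": True, "thigh_thrust": True, "sacral_thrust": True}
print(laslett_rule_positive(exam))
```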
The rule was supported by two additional studies where composites of at least 3 positive out of 5 tests resulted in high levels of positive LR [45,48]. There is only a slight difference in tests included in the composites.
We recommend the addition of no centralization from the "Laslett composite" to the CDR as it increases the positive LR without compromising the negative LR. The value of centralization for screening out SIJ pain was supported by one more study with single block reference standards reporting an acceptable negative LR [32].
Furthermore, we recommend the use of the physical examination finding of dominant pain in the posterior superior iliac crest (PSIS) area. This finding was only investigated in one study using the double block standard [49]. However, the usefulness is supported by the fact that all included studies comprised patients with pain location in the PSIS area, and it is a logical assumption that a strict interpretation of pain location, i.e. dominant pain in the PSIS area as opposed to any level of pain, will increase the specificity of this finding.

Disc herniation with nerve root involvement
A systematic review in the field of clinical diagnosis of disc herniation with lumbar nerve root involvement (NRI) terminated the search of literature at October 2008 [50], and an update is in progress [51]. Therefore, no search of the literature was performed by the present authors. However, we reviewed the included studies and the reference lists of those studies for additional clinical findings. Thirteen studies [52][53][54][55][56][57][58][59][60][61][62][63][64] were included from the systematic review and one study was excluded due to lack of a reference standard negative population [65].
The evidence is sufficient to constitute a CDR. We recommend initial screening by use of the straight leg raise (SLR) test in combination with the Hancock rule [52] comprising at least 3 positive out of 4 of the following findings: dermatomal pain location in concordance with a nerve root, and corresponding sensory deficit, reflex and motor weakness.
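One possible reading of this two-step recommendation is sketched below: a negative SLR is treated as a screen, and the 3-of-4 count is applied only when the SLR is positive. The text does not spell out how the two steps combine, so the gating logic, the function name, and the returned labels are all assumptions made for illustration.

```python
def nerve_root_cdr(
    slr_positive: bool,
    dermatomal_pain: bool,
    sensory_deficit: bool,
    reflex_impairment: bool,
    motor_weakness: bool,
) -> str:
    """Hypothetical two-step sketch: SLR screen, then the 3-of-4 count."""
    if not slr_positive:
        # Assumption: a negative SLR is used to screen out nerve root involvement.
        return "screened out by negative SLR"
    positives = sum([dermatomal_pain, sensory_deficit, reflex_impairment, motor_weakness])
    if positives >= 3:
        return "rule met"
    return "rule not met"


# Hypothetical patient: positive SLR with three concordant neurological findings.
print(nerve_root_cdr(True, True, True, False, True))
```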
The CDR was supported by another study [74] that reported the diagnostic value of a composite of 3 neurological signs in patients with monoradicular pain.
The value of a negative SLR test for screening out nerve root involvement was supported by the vast majority of single studies reporting acceptable levels of negative LRs regardless of level of nerve root involvement [55-58, 62-64, 71, 72].
Furthermore, we recommend the use of crossed SLR that was supported by acceptable positive LRs in the vast majority of studies [55,58,59,62,70].
The single findings included in the Hancock rule were supported by most studies reporting diagnostic value. Findings were supported by studies reporting acceptable levels of positive LRs: dermatomal S1 pain location [54], L2-L5 sensory deficits [55][56][57], L4 patellar reflex weakness [56,58], S1 Achilles reflex weakness [55][56][57][58], L4 knee extension weakness [56], L5 dorsiflexion weakness of ankle and toes [55,56,58], or S1 plantarflexion weakness of ankle [55,56]. One study reported acceptable level of negative LR: any nerve dermatomal pain location [53]. The diagnostic value of dermatomal pain location in the Hancock rule was supported by only one additional study and only regarding S1 distribution [54]. However, the usefulness is supported by the fact that 11 out of 14 studies included a patient population with radicular pain location, and it is a logical assumption that a strict interpretation of radicular pain, i.e. dermatomal distribution with corresponding neurological findings, will increase the specificity of this finding.

Spinal stenosis
A recently updated systematic review in the field of clinical diagnosis of lumbar spinal stenosis (SS) terminated at March 2011 [75]. Therefore, no search of the literature was performed by the present authors. Nine studies [76][77][78][79][80][81][82][83][84] were included from the systematic review (Table 2). Two of the nine studies included the same population [82,84] and we chose to use values from one [82] because it reported diagnostic accuracy of questionnaire items not necessarily part of the reference standard based on physical examination and imaging. In addition, we included one study that was identified by our hand search of reference lists [85].
The evidence is sufficient to constitute a CDR. We recommend the use of the Cook rule [76] comprising at least 3 positive out of 5 of the following findings from patient history: age more than 48 years, bilateral symptoms, leg pain more than back pain, pain during walking/standing, and pain relief upon sitting (Table 2). Furthermore, we recommend the use of improved walking tolerance with the spine in flexion that was supported by two studies with acceptable levels of positive LRs [83,85], and the patient history report of relief by forward bending that was supported by two studies with acceptable levels of positive LRs [77] or negative LRs [79].
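The history-based count described above might be sketched as follows. The dataclass and field names are hypothetical; only the five items and the "at least 3 of 5" cut-off come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StenosisHistory:
    """Patient history items used by the rule (field names are hypothetical)."""
    age_years: float
    bilateral_symptoms: bool
    leg_pain_more_than_back_pain: bool
    pain_during_walking_or_standing: bool
    pain_relief_upon_sitting: bool


def cook_rule_count(history: StenosisHistory) -> int:
    """Count how many of the five history findings are present."""
    return sum([
        history.age_years > 48,  # "age more than 48 years"
        history.bilateral_symptoms,
        history.leg_pain_more_than_back_pain,
        history.pain_during_walking_or_standing,
        history.pain_relief_upon_sitting,
    ])


def cook_rule_positive(history: StenosisHistory) -> bool:
    """Return True when at least 3 of the 5 findings are present."""
    return cook_rule_count(history) >= 3


# Hypothetical patient.
patient = StenosisHistory(67, False, True, True, False)
print(cook_rule_count(patient), cook_rule_positive(patient))
```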
The single findings included in the Cook rule were supported by other studies reporting diagnostic value. Some findings were supported by studies reporting high levels of positive LRs: age above 50 years [77], bilateral pain [78], severe leg pain [79], leg pain worse with walking [77,80], pseudoclaudication [81], pain worse with standing [77], and symptoms improved when seated [79]. Other studies reported acceptable levels of negative LRs: no leg pain [77,81], pain not worse when walking or standing [82,83], and sitting not best posture [83].

Spondylolisthesis
A recently updated systematic review of clinical diagnosis of lumbar spondylolisthesis terminated at March 2010 [86]. Therefore, databases were searched by the present authors from that date up to May 2015. Results of the search are presented in Additional file 8. Three studies from the systematic review [87][88][89] and five studies from our updated search [90][91][92][93][94] were included ( Table 2).
The evidence is sufficient to constitute a CDR. We recommend a combination of two positive physical examination findings: intervertebral slip by inspection or palpation and segmental hypermobility by use of manual passive physiological intervertebral motion test (Table 2). Furthermore, we recommend the use of the passive lumbar extension test as a supplement for the identification of degenerative spondylolisthesis in the elderly. All tests were supported by two studies with acceptable levels of positive LRs.

Fracture
The evidence is insufficient to constitute a CDR. Best evidence synthesis indicates the potential benefit of the Henschke rule [96] comprising at least 1 negative out of 3 of the following findings from patient history: age >70 years, prolonged use of corticosteroids, and significant trauma (Table 2). This rule presented with the lowest negative LR, meaning that when none of these findings are present, the clinician will be able to rule out a lumbar fracture with acceptable confidence.
Regardless of setting in which the studies were conducted, single studies provided inconsistent results, and the Henschke rule has not been validated in other studies.

Myofascial pain
There is no available evidence regarding diagnostic value. We have conducted a systematic search of the literature to May 2015 revealing that studies in the field are hampered by the lack of an adequate diagnostic reference standard. The results of the search are presented in Additional file 9. It appears that clinical criteria are in fact the reference standard. Firm manual pressure applied to the muscle and elicited feedback from the patient appears to be the only means to establish the diagnosis. However, there is considerable variability of criteria used to diagnose a Myofascial Pain Syndrome [104]. The criteria for a myofascial trigger point (TrP) originally proposed by Travell and Simons [105] have been revised based on clinical experience and results from reliability studies, but neither the original nor the revised criteria have been rigorously validated [104].
We suggest a composite of three minimum criteria that support the diagnosis: 1) presence of a palpable taut band within a skeletal muscle, 2) presence of a hypersensitive spot within the taut band with or without reproduction of a distinct referred pain sensation with stimulation of the spot, 3) patient recognition of the elicited pain. These criteria are based on a strict interpretation of the nine criteria currently under debate by The International Association for the Study of Pain (IASP) [106].
We have found no accepted reference standard by which a TrP can be diagnosed. However, several methods have been suggested in order to at least demonstrate construct validity of the clinical criteria. The results of our search revealed some attempts to demonstrate construct validity when TrPs were compared to electromyography [107][108][109][110][111], sonoelastography [112], and quantitative sensory testing [113,114]. Methodological quality is generally low due to lack of blinding, differences in definition of active and latent TrPs, and all studies but two [108,113] investigated the shoulder and neck region making generalizability questionable when results are transposed to the low back.
In the absence of evidence regarding diagnostic accuracy, physical examination findings should demonstrate inter-rater reliability in order to be considered clinically meaningful. Two recent systematic reviews conclude that physical examination findings cannot identify TrPs with an acceptable degree of reliability [115,116]. However, the authors state that if diagnostic criteria were revised to include only a palpable tender spot in the muscle that when palpated reproduces the patients' familiar pain in that spot or in a distinct pattern, then the present evidence indicates that worthwhile agreement might be achieved. This reasoning is in line with our suggestion of including three of the IASP criteria.
There are significant issues in relation to the intra- and inter-observer reliability of identifying a muscle containing a TrP, and there are no data supporting the ability of different examiners to agree on the exact location of a TrP within a specific muscle.
Taken together, no conclusions can be made based on the present evidence although our suggested criteria to be used in future diagnostic studies appear to have face validity.

Peripheral nerve
There is no available evidence regarding diagnostic value. We have conducted a systematic search of the literature up to May 2015 revealing that all studies in the field are hampered by the lack of an adequate diagnostic reference standard. The results of the search are presented in Additional file 10. It appears that clinical criteria are in fact the reference standard. We suggest the following criteria to be used in future diagnostic studies: Patient recognition of usual lumbar or leg pain with at least two stages of sensitizing maneuvers, i.e. knee extension, ankle dorsiflexion, or neck flexion during SLR or slump test.
Although it has not been possible to report rigorous diagnostic validity of our suggested criteria, they appear to have some degree of face validity across authors. However, there is considerable variability of criteria used to diagnose increased peripheral neural mechanosensitivity [117]. Most commonly used are SLR and slump, but the interpretation of a positive test response differs. Authors may put emphasis on provocation of any lumbar or leg pain, patient recognition of their usual pain, and/or restriction of movement during testing [118].
Our search identified no studies that made comparisons between peripheral nerve mechanosensitivity testing and diagnostic procedures that appear to have the potential to be considered as reference standard (i.e. nerve conduction electrodiagnostics, ultrasound imaging, or magnetic resonance neurography). However, our literature searches identified a number of studies attempting to demonstrate construct validity of particular aspects of the clinical representation of peripheral nerve pain.
Several studies found that reduction in range of movement (ROM) during SLR or slump as criterion for increased neural mechanosensitivity had no proven value in discriminating between patients with LBP and asymptomatic persons [119][120][121][122][123][124]. Also, the hypothesis that increased muscle tension might be responsible for the changes in ROM during SLR and slump test has been refuted by electromyographic studies [122,[125][126][127]. These studies found that muscle tension is an unlikely source of ROM reduction during SLR and slump, but they did not address the main concern, that is, that any fascial network in the back and legs would be an equally plausible source of pain provocation during neural sensitizing maneuvers. Taken together, the data support the view of Shacklock [118], who claimed that reproduction of the patient's usual symptoms should be an integral part of the diagnostic criteria. In the absence of an accepted reference standard, physical examination findings should demonstrate inter-rater reliability in order to be considered clinically meaningful. Our search did not identify any reviews exploring the inter-tester reliability of SLR or slump in patients with LBP. However, we found three individual studies in which the inter-tester reliability of patient recognition of lumbar or leg pain with at least two stages of sensitizing maneuvers was investigated. In all studies, Kappa values (K) indicated substantial agreement between examiners [128]. Walsh et al. [129] reported K = 0.80 (CI 0.39-0.94) for SLR and 0.71 (CI 0.33-0.71) for Slump, Philip et al. [130] reported K = 0.89 (CI 0.81-0.97) for Slump, and Petersen et al. [12] reported K = 0.59 (CI 0.39-0.79) for SLR and Slump.
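For readers less familiar with the kappa statistic reported above, the sketch below computes Cohen's kappa for two examiners who each rate a test as positive or negative. It illustrates the standard formula only; the counts are hypothetical and do not come from the cited studies.

```python
def cohens_kappa(both_pos: int, only_a_pos: int, only_b_pos: int, both_neg: int) -> float:
    """Cohen's kappa for two raters and a binary rating."""
    n = both_pos + only_a_pos + only_b_pos + both_neg
    if n == 0:
        raise ValueError("at least one rated patient is required")
    observed = (both_pos + both_neg) / n
    a_pos = (both_pos + only_a_pos) / n  # proportion rated positive by examiner A
    b_pos = (both_pos + only_b_pos) / n  # proportion rated positive by examiner B
    expected = a_pos * b_pos + (1.0 - a_pos) * (1.0 - b_pos)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical counts: 50 patients, examiners agree on 40 of them.
kappa = cohens_kappa(both_pos=20, only_a_pos=5, only_b_pos=5, both_neg=20)
print(round(kappa, 2))
```

In this example the observed agreement is 0.80 and the agreement expected by chance is 0.50, giving a kappa of 0.60.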
To summarize, no conclusions can be made based on the present evidence although our suggested criteria to be used in future diagnostic studies appear to have face validity and acceptable level of intertester reliability.

Central sensitization
There is insufficient evidence to generate a diagnostic rule to identify patients with a condition characterized by "increased responsiveness of nociceptive neurons in the central nervous system to their normal or subthreshold afferent input" [131]. We have not conducted a systematic search of the literature inasmuch as studies in the field are hampered by the lack of an adequate diagnostic reference standard because the underlying mechanisms behind localized, regional and widespread pain are not fully understood [132,133]. In the absence of anything better, we suggest the consensus-based Nijs rule to support the diagnosis of central sensitization (CS) [134].
The first step in the rule is to exclude a neuropathic pain source by use of the IASP criteria [135] and NeuP-SIG guidelines [136]. The next step is to make sure that the following criterion 1 is satisfied in combination with either criterion 2 or 3:
Criterion 1. Pain experience disproportionate to the nature and extent of injury or pathology, i.e. not sufficient evidence of injury, pathology, or objective dysfunctions capable of generating nociceptive input consistent with the patient's severity of pain and disability.
Criterion 2. At least one of the following patterns present: bilateral pain/mirror pain (i.e., symmetrical pain pattern); pain varying in (anatomical) location/travelling pain to anatomical locations unrelated to the presumed source of nociception, e.g., hemilateral pain; large pain areas with non-segmental (i.e., neuroanatomically illogical) distribution; widespread pain (defined as pain located axially, on the left and right side of the body and both above and below the waist); allodynia/hyperalgesia outside the segmental area of (presumed) nociception. These findings are based on testing of light touch by means of a swab or cold items (allodynia) as well as testing by pin prick or pressure (hyperalgesia).
Criterion 3. Hypersensitivity of senses unrelated to the muscular system. These findings are based on a score of at least 40 on the Central Sensitization Inventory [137,138].
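The decision logic of the steps above can be sketched as a boolean expression. The function and parameter names are hypothetical, the clinical judgement behind each input is outside the scope of the sketch, and a missing Central Sensitization Inventory score is treated as criterion 3 not being met, which is our assumption.

```python
from typing import Optional, Sequence

CSI_CUTOFF = 40  # "a score of at least 40" as stated in the text


def nijs_rule_suggests_cs(
    neuropathic_pain_excluded: bool,
    disproportionate_pain: bool,           # criterion 1
    pain_patterns_present: Sequence[bool],  # criterion 2: one flag per listed pattern
    csi_score: Optional[float],             # criterion 3 input; None if not administered
) -> bool:
    """Return True when the rule, as described in the text, supports CS."""
    if not neuropathic_pain_excluded:
        return False
    criterion_2 = any(pain_patterns_present)
    criterion_3 = csi_score is not None and csi_score >= CSI_CUTOFF
    return disproportionate_pain and (criterion_2 or criterion_3)


# Hypothetical patient: criterion 1 met, no criterion 2 pattern, CSI score of 45.
print(nijs_rule_suggests_cs(True, True, [False, False, False, False, False], 45))
```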
Our suggested criteria are based on a consensus report by researchers from different professions [134] and are in line with the views of other experts in neurophysiology [139-141]. Thus, although it has not been possible to report the diagnostic value of the criteria, and only aspects of construct validity have been reported [142], they appear to have face validity. Results of systematic reviews are not consistent with respect to the prevalence of generalized or widespread sensitization when quantitative sensory testing is used as a stand-alone test in patients with chronic LBP [142,143]. However, a composite of criteria fairly similar to those of the Nijs rule for separating CS from nociceptive and peripheral neuropathic pain sources has been reported to have acceptable levels of inter-tester reliability (kappa = 0.77, CI 0.57-0.96) [144] and discriminative validity (positive LR 40.6, CI 20.4-80.8) [145].
Taken together, no conclusions can be drawn from the present evidence. However, our suggested criteria, intended for use in future diagnostic studies, appear to have face validity, and promising aspects of construct validity and inter-tester reliability have been reported.

Discussion
We found no composites of clinical findings that were able to fully substitute for the respective reference standards. Thus, in cases where a patho-anatomical diagnosis is of crucial importance for the clinician or the patient, the patient must be referred for more sophisticated diagnostic procedures, which may include high tech imaging or minimally invasive, controlled and guided injection procedures.

Intervertebral disc
Our recommendation for the disc CDR is strong because a risk of partial verification bias was present in only one [32] of the three studies investigating the finding of centralization. In all studies, a high risk of selection bias is present, because they included patients from secondary care referred for diagnostic invasive procedures. Consequently, the studies are likely to overestimate the diagnostic gain of using the CDR in comparison to primary care settings where the prevalence is somewhat lower.
In addition to the discography studies, our search identified two studies reporting the diagnostic value of centralization for identifying patients with MRI findings of extruded or sequestrated discs [146,147]. Results of these studies were not in concordance and warrant further investigation.

Facet joint
It was not possible to constitute a CDR for the identification of painful FJ. A double block procedure in the joint space or at the nerve supply was judged to be acceptable as reference standard when at least one of the following criteria was satisfied: a positive controlled block, i.e. the anesthetic block definitely reduced the pain from the injected joint, whereas a block in a non-painful joint had no marked effect on pain; a positive confirmatory block, i.e. the anesthetic block definitely reduced the pain from the injected joint on two separate occasions 1 to 2 weeks apart; or a positive comparative dual block, i.e. a short-acting followed by a long-acting anesthetic significantly reduced pain in the predicted time periods [148].
The only negative finding from studies with single block reference standards that supported single tests of the Revel rule for ruling out FJ pain was no relief with recumbency [37,38]. However, the quality of evidence for this finding was downgraded due to serious risk of test review bias in both studies.
We found two additional studies investigating the diagnostic value of non-centralization using a single block reference standard [32,39]. Both studies reported acceptable levels of sensitivity (0.96 and 0.97 respectively) and negative LRs (0.22 and 0.28 respectively). However, the quality of evidence for this finding was downgraded due to risk of partial or differential verification bias in the two studies. Although validated with only a single block reference standard, a finding of centralization might have preliminary merit for ruling out a symptomatic facet joint. There is no point in giving patients with a negative screening block a second block: even if the second block were positive, the same conclusion would be reached, namely non-FJ pain. The same reasoning applies to the value of no relief in recumbency.
The results regarding no relief with recumbency and non-centralization appear promising, but they need verification in future studies.
It is unclear whether the three studies by Manchikanti et al. [35,36,41] might include the same populations. However, this issue would have no influence on the conclusion.

Sacroiliac joint
Our recommendation for the SIJ CDR is strong. Only one out of three studies supporting the diagnostic value of the composite of tests displayed risk of differential verification bias [44]. In all studies, however, a high risk of selection bias is present, because they included patients from secondary or tertiary care referred for diagnostic invasive procedures. The CDR is supported by an additional two out of three studies in which composites of at least 3 positive out of 5 tests resulted in high positive LRs [45,48]. Although the content of the composites is comparable, there is a slight difference in the use of Patrick's FABER test and Mennell's test. The fact that one study did not support the rule [47] might be explained by the double blocks being performed only 30 min apart, which increases the risk of false positive findings. Furthermore, the quality of this study suffered from the risk of test review bias.
The recommendation of no centralization during physical examination was weak, based on two studies [32,44]. One of these reported an acceptable negative LR for centralization using a single block reference standard, making non-centralization useful for ruling out a symptomatic SIJ [32]. However, both studies suffered from risk of partial verification bias, leading to a downgrading of the quality of evidence.
We found two additional studies investigating the diagnostic value of SIJ area pointing, without indication of whether or not the pain was dominant, using an insufficient reference standard in the form of single or periarticular SIJ blocks [46,149]. The results were not in concordance and warrant further investigation.

Nerve root involvement
The strength of our recommendation for the CDR is weak based on mediocre methodological quality in most of the studies. Studies revealed serious risk of bias in relation to differential verification, incorporation, or test review.
The included studies used surgical or imaging findings as a reference standard. We found no differences in diagnostic values when results from surgical and imaging studies were compared, which indicates that the findings are similar across the reference standards used. Readers interested in results from pooling of studies exclusively using surgery as reference standard are referred to the most recent systematic reviews [50,66].
The reference standards have an influence on the diagnostic value of index tests. Studies using surgery as the reference standard obtained their results in a patient population with a high prevalence of severe disc herniations, and thus the results cannot be generalized to primary care populations where prevalence is much lower. Studies using imaging may display a prevalence more like that found in primary care, but at the expense of more false positive findings [150]. Consequently, uncertainty remains as to the generalizability of the results to primary care settings. Only two studies [53,68] included patients representative of those seen in primary care.
As suggested by others [66] we have tried to increase the performance of tests in clinical practice by recommending a CDR using a combination of tests with high levels of sensitivity and specificity. Other combinations of tests have been suggested [53,69,72,151], but these are not summarized in the format of CDRs and they are not supported as well by single studies as the Hancock rule.
When possible, we chose to report one level disc or nerve root as reference standard in order to reduce the number of false positives due to noise from other nonrelevant levels. This choice reflects the clinical reasoning process in daily practice. The clinician needs to compare dermatomal pain distribution with corresponding motor or reflex weakness in order to make a meaningful diagnostic pattern.

Spinal stenosis
The strength of our recommendation for the CDR is weak, based on low methodological quality of studies. Many of the quality items revealed serious risks of bias. First, the index test was part of the reference standard (incorporation bias) in all studies resulting in a high risk of overestimation of the diagnostic value of findings. Most studies used expert opinion based on a combination of physical examination findings and imaging even though data suggest that imaging is probably not sufficient as a reference standard in comparison with surgical findings [150]. Only two studies used surgical verification of diagnosis as part of the reference standard [77,78]. Second, the majority of studies had problematic reporting of blinding (test review bias) i.e. whether the reference standard result was interpreted blind to those of the index test and vice versa [76-78, 82, 83, 85]. Third, all studies included patients from secondary or tertiary settings with a high prevalence of patients with SS. Consequently, there is a high risk of selection bias that is likely to overestimate the diagnostic gain of using the CDR in comparison to primary care settings where the prevalence is dramatically lower.

Spondylolisthesis
The strength of our recommendation for the CDR is strong based on the methodological quality of studies. Although several of the studies displayed risk of disease progression bias and poor description of index tests, the quality items reveal serious risks of bias in few cases [90,94].
In the present review, functional dynamic radiographs were accepted to identify segmental instability if index tests were pain provocation or movement tests and plain static radiographs if index tests were palpation of slip.
Flexion-extension functional radiographs are considered the "gold standard" in degenerative spondylolisthesis, and a disc angle change > 10° or a change in translation > 3 mm are generally used as cut-offs [152]. Plain radiographs with lateral views are useful in the initial investigation of isthmic spondylolisthesis [153]. A slip of > 3 mm has been suggested as cut-off [154], but the literature is lacking as to what degree of slip is significant [153]. Instead, the descriptive Meyerding classification [154] is often reported.
All studies used a definition of spondylolisthesis similar to the above, except Abbott et al. [88], who used a cut-off of 2 standard deviations beyond the mean of a sample of pain-free individuals.
Even though the positive LRs across single studies are only of moderate levels, the magnitude of LRs will probably rise to a level sufficient to be useful in clinical practice when they are used in combination.
All studies except one [88] were performed in tertiary settings, resulting in a high risk of selection bias that is likely to overestimate the diagnostic gain of using the CDR when applied to primary care.

Fracture
It was not possible to constitute a CDR for the identification of a painful fracture. Results of single studies were not in concurrence and the majority of studies had serious risks of bias with respect to differential verification, test review, and uninterpretable results/withdrawals. A symptomatic fracture is considered a 'red flag' warranting referral to secondary care. Consequently we have emphasized findings that are able to exclude patients with this condition.
The Henschke rule [96] has the potential to be a useful screening tool in primary care. However, the results need confirmation in future studies as the results of the only other primary care study included in this review were not in concordance [100]. Overall, the results from these two studies did not differ markedly from the rest.
Trauma (major in young persons and minor in the elderly) is a highly plausible mechanism that can lead to fracture, and a highly increased prevalence of osteoporotic fractures is seen in patients, mainly female, above 75 years of age [97]. Both of these features contribute to the diagnostic value of the rule although they have not been validated as stand-alone findings.
The inconsistency of results may be influenced by the method of imaging. Radiography was used in all studies with the addition of CT-scan in only one study [102]. No study used MRI. Radiographs may be adequately sensitive, but their ability to distinguish acute from chronic fractures is poor. MRI is more specific because it identifies marrow edema or an associated hematoma, which may indicate a symptomatic fracture [155].

Myofascial pain
The suggested criteria should be regarded as the first step in defining a common set of diagnostic criteria for selection of patients to be included in future reliability and validity studies.
Our literature searches identified a number of studies attempting to demonstrate construct validity, but we did not perform a systematic search for additional studies in reference lists. Therefore, the included studies must be regarded as important examples of attempts at validation rather than a systematic review of this type of literature. The studies used TrPs found by manual palpation as the reference standard, meaning that the purpose of these studies was to identify the underlying physiological mechanisms behind the presence of TrPs rather than a diagnostic validation of palpation findings. Several hypothetical theories have been suggested in order to explain the formation and persistence of TrPs [156].
It is a matter of controversy whether TrPs should be regarded as stand-alone entities that are a primary pain source or whether they are secondary to other painful disorders [106,157]. Consequently, a myofascial pain syndrome may coexist with several other syndromes in our proposed classification system. It is essential to exclude underlying disorders capable of causing reproduction of a referred pain sensation with stimulation of a hypersensitive spot in the muscle before a conclusion can be made as to whether the myofascial TrP is the dominant source of the patient's pain.

Peripheral nerve
While diagnostic value of the SLR and slump is demonstrated in patients with lumbar radiculopathy, the value in relation to painful peripheral nerve tissue is unknown. Our search did not identify any studies investigating the ability of these tests to discriminate patients with peripheral nerve pain from other competing disorders. The suggested criteria should be regarded as an attempt to define a common set of diagnostic criteria for selection of patients to be included in future validity studies.
The spread of sensitizing effects along the nerve is a plausible explanation for why movement of a distant body part can change sensory responses. However, it has been argued that the fascial network in the back and legs may account for positive findings in terms of pain and limited range of movement during the SLR and slump tests [127,158]. Therefore, structural differentiation between neural tissues and musculoskeletal connective tissues has been proposed. When lumbar or leg pain increases during the SLR test with dorsiflexion of the ankle or flexion of the neck, a neural pain source is alleged to be identified [118]. The same applies to the slump test, with the addition that the pain decreases with the release of neck flexion [118,159]. Our search of the literature did not identify any studies that specifically tested this hypothesis.
In line with other authors [160,161], we suggest the term "Increased neural mechanosensitivity" to describe a condition where the patient's usual pain is reproduced by sensitizing maneuvers. Increased neural mechanosensitivity has been given several other labels, i.e. adverse neural tension, neurodynamics, and neural tension dysfunction [118,160].
The issues discussed in the myofascial pain section above, concerning coexistence with several other syndromes in our proposed classification system, apply to peripheral nerve as a pain source as well.

Central sensitization
Although the Nijs rule is the result of a consensus process, caution is warranted because the participating experts are a selective sample within the field of neuroscience. Therefore, the suggested criteria should be regarded as an attempt to define a common set of diagnostic criteria for selection of patients to be included in future validity studies. A possible use of the Nijs rule in clinical practice has been exemplified in a recent paper [162]. CS might be explained by "an amplification of neural signaling within the central nervous system that elicits pain hypersensitivity" [139]. However, controversy exists as to the nature of CS and whether it is possible to identify this condition in clinical practice [140,163].
The pathophysiological mechanisms are not fully understood, but there is increasing evidence that CS and chronic widespread musculoskeletal pain are associated with plasticity changes in the central nervous system leading to hypersensitivity that can explain the clinical findings in chronic widespread LBP [133,139,141]. The main clinical manifestations are widespread lowered pain thresholds, exaggerated pain response to stimuli, and enlargement of pain referral areas. Most studies in the field have used clinical manifestations as the reference standard, meaning that the purpose of these studies was to identify the underlying physiology behind the presence of CS and widespread pain rather than a diagnostic validation of clinical findings.
In patients with chronic LBP it has been reported that 25-38% develop chronic widespread pain [164-166], and the condition is closely associated with systemic comorbidity and psychological disorders [167].
In our opinion, the suggested rule is useful for increasing the likelihood of identifying patients with CS in primary care. Central sensitization may coexist with other structure-specific syndromes in our diagnostic classification system because it is generally recognized that there is a structural pain generator behind initial nociception and peripheral sensitization [132]. However, we would not expect a patient with CS to fit any of the clinical patterns of specific pain producing structures in the classification system. In order to choose the best treatment strategy, the clinician has to decide which pain source is dominant in the individual patient with LBP [140,163].

Reference standards
At the present time it seems obvious that there are no 'gold' standards, either in the form of clinical tests, high tech imaging or other procedures. What is available are reference standards that, while not perfect, are appropriate and quite adequate for the majority of patients, and for use as comparators with clinical tests in diagnostic accuracy studies. The diagnostic utility of discography and FJ or SIJ blocks is a matter of controversy. Some consensus reports do not support the use of these procedures due to insufficient evidence of validity [168], the main problem being the absence of gold standards for identifying a "true" pain source. In this review we have tried to reduce the possible false positive rate by using the strictest available criteria for the reference standards as a requirement for inclusion of studies.
What is apparent from our systematic review is that there generally is sufficient published data that can form a framework for an intelligent use of clinical examination procedures and more expensive and invasive diagnostic investigations when required. Diagnosis of the source and cause of presenting back pain remains a challenge, and only further high quality research will improve certainty for clinicians and patients alike.
It is true that for a large proportion of patients in the acute or subacute phase, an accurate patho-anatomic diagnosis is not required, even though possible with some degree of confidence. However for patients whose symptoms are not improving after several months, the need for a more precise diagnosis becomes increasingly valuable as a guide to more effective and targeted management. To this extent, the recommendations from this systematic review might be helpful, in that patient selection for expensive high tech imaging and minimally invasive diagnostic injection procedures is facilitated, with consequent better utilization of resources.

Implication for practice
Our recommendations are based on considerations of the consequences of false positives and false negatives. In most diagnoses, we put the most emphasis on tests with high specificity, indicating few false positives, and on positive LRs, which express the ratio of the true positive rate to the false positive rate. The consequence is that the clinician will be quite certain that a patient would actually have the disorder if the reference standard procedure were to be performed. Often, high specificity comes at the expense of low sensitivity, meaning that a substantial proportion of patients with the disorders are not identified and remain unclassified. However, the consequences in primary care are not serious inasmuch as the patient remains in the category of non-specific LBP. In daily clinical practice, referral to further diagnostics most often depends on assessment of red flags, severity of symptoms and functional limitations rather than diagnostic classification.
Only in cases of an unidentified spinal fracture do primary treatment methods have the potential to harm the patient. Consequently, we have prioritized the recommendation of tests with a high sensitivity and low negative LRs in this diagnosis.
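The relationship between sensitivity, specificity and the likelihood ratios referred to above can be made concrete with a minimal sketch. The function name `likelihood_ratios` and the input values in the usage note are hypothetical and are not taken from any study in this review; only the standard definitions of the positive and negative LR are used.

```python
def likelihood_ratios(sensitivity: float, specificity: float) -> tuple[float, float]:
    """Return (positive LR, negative LR) from sensitivity and specificity.

    Positive LR = true positive rate / false positive rate.
    Negative LR = false negative rate / true negative rate.
    """
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError("sensitivity must lie between 0 and 1")
    if not 0.0 < specificity < 1.0:
        # Specificity of exactly 0 or 1 would make one of the ratios undefined.
        raise ValueError("specificity must lie strictly between 0 and 1")
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative
```

With a hypothetical test of sensitivity 0.40 and specificity 0.95, the positive LR is about 8 while the negative LR is only about 0.63, which illustrates the trade-off described above: a positive result is informative, but a negative result leaves many patients with the disorder unidentified.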
For the clinician, the diagnostic considerations do not stop here. The diagnostic certainty that a positive test will identify a pathological disorder is dependent on the prevalence of the disorder. The prevalence of categories like nerve root involvement, spinal stenosis, spondylolisthesis, and fracture is generally much lower in primary care settings than in the secondary or tertiary settings of the vast majority of diagnostic studies. This means that the diagnostic certainty following a positive test is likely to be much lower when the index tests are applied in primary care settings. For example, the pre-test probability of having a symptomatic spinal stenosis in primary care is estimated to be only 3% [168]. By use of the Cook rule, the post-test probability will rise to 7%. When improved walking tolerance with the spine in flexion or a patient history report of relief by forward bending is added to the rule, we would expect the post-test probability to rise further. By means of the LRs presented in this review, the clinician can use Fagan's nomogram [169] as a graphical tool for estimating how much the result of a diagnostic test changes the probability that a patient has the disorder in question.
In daily practice, it is unlikely that clinicians draw conclusions based on a single finding. This practice is supported by our results, which generally show the most promising accuracy in diagnoses for which a composite of findings can be identified. Some studies do report diagnostic accuracy of test combinations and clusters, but this does not totally reflect the reasoning process of expert clinicians. Clinicians do not use individual tests or clusters of tests out of context from the total clinical picture. Sometimes pattern recognition is used, and sometimes a sequential, algorithmic or staged approach is used. Another way to utilize multiple test results is to consider the probability of specific disorders based on prevalence within a defined group or subgroup. Prevalence is equal to pre-test probability, so the probability of any given disorder is equivalent to its prevalence in any given setting. The process of progressively reducing the size of the group labelled as 'non-specific', by abstracting out those cases with very high probability of a known condition, may be called 'Diagnosis by Subtraction'.
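The calculation that Fagan's nomogram performs graphically can be sketched in a few lines. The function names `post_test_probability` and `implied_likelihood_ratio` are hypothetical helpers written for this illustration. The only figures used are the 3% pre-test and 7% post-test probabilities quoted above; the likelihood ratio is back-calculated from them rather than taken from the original study.

```python
def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Convert a pre-test probability to a post-test probability via odds."""
    if not 0.0 < pre_test_probability < 1.0:
        raise ValueError("pre-test probability must lie strictly between 0 and 1")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood ratio cannot be negative")
    pre_test_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1.0 + post_test_odds)


def implied_likelihood_ratio(pre_test_probability: float, post_test_probability_: float) -> float:
    """Back-calculate the likelihood ratio linking two probabilities."""
    pre_test_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_test_odds = post_test_probability_ / (1.0 - post_test_probability_)
    return post_test_odds / pre_test_odds


# Figures quoted in the text: 3% before and 7% after applying the rule.
lr = implied_likelihood_ratio(0.03, 0.07)
print(round(lr, 2))
print(round(post_test_probability(0.03, lr), 2))
```

Under these figures the implied positive LR is between 2 and 3. The same likelihood ratio applied to a higher pre-test probability, as in a secondary or tertiary care population, would yield a much higher post-test probability, which is why results from specialist settings cannot be transferred directly to primary care.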

Diagnosis by subtraction
To illustrate, assume for the present purpose that in a specific setting the prevalence of 'centralizers' is 0.5 or 50%. The high specificity of this clinical finding for discogenic pain confirmed by discography indicates that these patients do not have 'non-specific' back pain but a 'specific' anatomical source of pain [33]. Whatever the prevalence of the remaining possible causes of pain in the whole group, it is twice as high in the 'non-centralizer' group. Thus the probability that a non-centralizer has, say, sacroiliac joint pain or facet joint pain is doubled. This review has shown that certain CDRs have high specificity for sacroiliac joint pain, spondylolisthesis, disc herniation with nerve root involvement, and spinal stenosis. If we sequentially subtract those cases satisfying the CDRs for these conditions, the prevalence or probability of other conditions being the cause of pain progressively rises as the size of the non-specific low back pain category reduces.
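The arithmetic behind this reasoning can be shown with a minimal sketch. It rests on a simplifying assumption that is ours for the purpose of illustration: the subtracted subgroups contain no cases of the condition of interest, which corresponds to perfect specificity of the findings used for subtraction. The function name `prevalence_after_subtraction` and the 15% starting prevalence in the usage note are hypothetical.

```python
def prevalence_after_subtraction(prevalence: float, *subtracted_fractions: float) -> float:
    """Prevalence of a condition among patients left after subgroups are removed.

    `prevalence` and every subtracted fraction are expressed as proportions
    of the original whole group. Assumes that none of the removed patients
    has the condition of interest.
    """
    removed = sum(subtracted_fractions)
    if not 0.0 <= removed < 1.0:
        raise ValueError("subtracted fractions must sum to at least 0 and less than 1")
    remaining = 1.0 - removed
    if not 0.0 <= prevalence <= remaining:
        raise ValueError("prevalence cannot exceed the share of patients remaining")
    return prevalence / remaining


# Removing the 50% of patients who are centralizers doubles the prevalence
# of any other condition among the remaining non-centralizers.
print(prevalence_after_subtraction(0.15, 0.5))
```

If a condition had a hypothetical prevalence of 15% in the whole group, it would account for 30% of the non-centralizers. Subtracting a further subgroup, for example 10% of the whole group satisfying another CDR, would raise it again, to 0.15 / 0.40.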

Limitations of this review
One of the main limitations in this review is that the search of the literature was not updated to year 2015 in all diagnostic categories. Due to limited resources, this has not been possible for the present authors. If an existing review fulfilled the criteria of being current, relevant, and of high-quality, then we chose to use our resources to conduct systematic searches within fields where recent reviews had not been published.
The vast majority of patients are most likely not representative of those who present for treatment in primary care. Almost all patients were preselected by referral to specialist centers for specific diagnostic evaluation, making them likely to have the target disorder in question.
Although some of the included reviews have used a QUADAS score of 10/14 as a marker for high versus low quality studies, we agree with the developers of the tool that no meaningful cut off exists [170].
It is our judgment that pooling of data was not feasible due to great variability across studies: The patient characteristics and prevalence of the target disorders varied considerably, the same reference standard was seldom used across studies, definition of a positive reference standard was not often specified, and execution of index tests was likely to vary among studies. Though it is tempting to pool data and perform a meta-analysis, we chose not to do this since in our opinion, pooling systematically homogenizes studies that are in fact acknowledged as heterogeneous. We chose to put emphasis on the results of those studies that had satisfactory quality assessments, and seemed to be closest in context to the environment this classification targets i.e. primary care.

Conclusions
In some diagnostic categories we have sufficient evidence to suggest a CDR. In others, we have only preliminary evidence that needs testing in future studies. Single clinical tests appear to be less useful than clusters of tests, which are more closely in line with clinical decision making.
With respect to clinical diagnosis of symptomatic intervertebral disc, sacroiliac joint, spondylolisthesis, disc herniation with nerve root involvement, and spinal stenosis, we were able to construct promising CDRs (see Fig. 1). However, the accuracy of these findings in a primary care setting has yet to be confirmed.