The efficacy of duloxetine, non-steroidal anti-inflammatory drugs, and opioids in osteoarthritis: a systematic literature review and meta-analysis

Background This meta-analysis assessed the efficacy of duloxetine versus other oral treatments used after failure of acetaminophen for management of patients with osteoarthritis.
Methods A systematic literature review of English language articles was performed in PubMed, EMBASE, MEDLINE In-Process, Cochrane Library, and ClinicalTrials.gov between January 1985 and March 2013. Randomized controlled trials of duloxetine and all oral non-steroidal anti-inflammatory drugs and opioids were included if treatment was ≥12 weeks and the Western Ontario and McMaster Universities Index (WOMAC) total score was available. Studies were assessed for quality using the assessment tool from the National Institute for Health and Clinical Excellence guidelines for single technology appraisal submissions. WOMAC baseline and change from baseline total scores were extracted and standardized. A frequentist meta-analysis, meta-regression, and indirect comparison were performed using the DerSimonian-Laird and Bucher methods. Bayesian analyses with and without adjustment for study-level covariates were performed using noninformative priors.
Results Thirty-two publications reported 34 trials (2 publications each reported 2 trials) that met inclusion criteria. The analyses found all treatments except oxycodone (frequentist) and hydromorphone (frequentist and Bayesian) to be more effective than placebo. Indirect comparisons to duloxetine found no significant differences for most of the compounds. Some analyses showed evidence of a difference with duloxetine for etoricoxib (better), tramadol and oxycodone (worse), but without consistent results between analyses. Forest plots revealed a trend toward greater overall efficacy improvement with higher baseline scores. Adjusting for baseline, the probability that duloxetine is superior to other treatments ranges from 15% to 100%. Limitations of this study include the low number of studies included in the analyses, the inclusion of only English language publications, and possible ecological fallacy associated with patient level characteristics.
Conclusions This analysis suggests no difference between duloxetine and other post-first line oral treatments for osteoarthritis (OA) in total WOMAC score after approximately 12 weeks of treatment. Significant results for 3 compounds (1 better and 2 worse) were not consistent across performed analyses.


Background
Over 50 treatment modalities for osteoarthritis (OA) of the hip and knee have been evaluated by the Osteoarthritis Research Society International (OARSI) [1,2]. Oral pharmacologic modalities included acetaminophen, non-steroidal anti-inflammatory drugs (NSAIDs), and both strong and weak opioids. Guidelines have recommended acetaminophen for first-line use, with NSAIDs and opioids as second and third lines of treatment [1,3,4,5]. However, reservations have been expressed concerning the long-term safety and efficacy of NSAIDs and opioids [1,2,5,6]. Some reviews have gone further and recommended against their long-term use [7,8]. Recently published meta-analyses suggest that currently available oral treatments have only limited efficacy in the average patient with OA [6]. In addition, the efficacy seen in trials seems to be impacted by trial design and baseline factors and may be limited to the first few weeks of use [6].
Duloxetine is a selective serotonin and norepinephrine reuptake inhibitor (SNRI) that has demonstrated efficacy in OA in Phase III clinical trials as well as a favorable adverse event profile across indications [26][27][28]. Duloxetine is thought to inhibit pain through its enhancement of serotonergic and noradrenergic activity in the central nervous system. It is currently indicated in the US for the management of pain disorders, including diabetic peripheral neuropathic pain (DPNP), fibromyalgia, and chronic musculoskeletal pain due to OA and chronic low back pain [29].
We conducted a systematic literature review followed by a meta-analysis to assess the efficacy of duloxetine versus other commonly used post first-line OA treatments, including NSAIDs and opioids. Our study reflected the chronic nature of OA by including only trials of 12 or more weeks duration (the recommended duration for confirmatory trials) [30] and a more inclusive set of OA symptoms by using the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), which includes subscales for function and stiffness as well as pain [31]. We also sought to confirm the influence of design and baseline factors observed in a recent OA meta-analysis [6]. Both frequentist and Bayesian analyses were undertaken to assess the effect of duloxetine compared to the other available oral treatments.

Inclusion and exclusion criteria
Randomized controlled trials (RCTs) of OA treatment with duloxetine, NSAIDs, or opioids at dosages consistent with United Kingdom prescribing information were included [32]. All included studies were of at least 12 weeks duration and published in English. Articles were included if they evaluated clinical efficacy using WOMAC total scores. Studies were excluded if they did not report clinical efficacy in OA, or if they did not have at least 2 arms of a treatment of interest, or 1 arm of a treatment of interest and a placebo arm.
When it was unclear from the title or abstract whether a study met the criteria, the full paper was acquired and read. Determination of inclusion/exclusion was performed by 2 reviewers working independently. When their conclusions did not agree, the reviewers met and reached a consensus.

Literature search
The

Data extraction
Data extraction was performed by 1 reviewer and checked by a second reviewer using a predefined data extraction form. Discrepancies were resolved by discussion between reviewers. For each study, reviewers extracted data that were deemed to potentially impact efficacy outcomes, such as study population (percent women, mean age, mean duration of OA), study design (duration, washout period, flare requirement, concomitant analgesic use, enriched enrollment, missing imputation technique), and outcomes (WOMAC score at baseline, endpoint, and change from baseline with measures of variance). Studies were categorized as having a washout period if the publication mentioned a period of washout or no treatment before randomization. A study was classified as requiring flare if the publication stated that after the washout/no treatment period patients were required to exhibit a flare of symptoms to continue in the study. Studies were classified as allowing concomitant analgesic use if patients could use analgesic medications in addition to their assigned treatment throughout the study; rescue medication was not considered concomitant use.
For studies that did not report sufficient data to be included in the analysis, 3 attempts were made to contact authors by email to obtain missing information. Studies were assessed for quality using the assessment tool from the National Institute for Health and Clinical Excellence (NICE) guidelines for Single Technology Appraisal submissions [33]. This 7-item questionnaire evaluates each trial based on randomization, adequate concealment of treatment allocation, similarities between treatment groups, degree of blinding, balance of withdrawals and dropouts between treatment groups, reporting of all outcomes measured, and use of intention to treat analyses. Studies were assessed by one reviewer and independently checked by a second reviewer. Positive responses were tallied for a total possible score of 7, with higher scores representing better quality.
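The tallying of the 7-item quality questionnaire can be sketched as follows. This is a minimal illustration, not the reviewers' actual tool: the function name `quality_score` and the answer encoding are hypothetical, and the rule that "yes" is favourable for questions 1-4 and 7 while "no" is favourable for questions 5 and 6 is taken from the quality table note in this article.

```python
# Hypothetical sketch of the 7-item quality tally (names are illustrative).
# "yes" counts as positive for questions 1-4 and 7; "no" counts as positive
# for questions 5 and 6 (unexpected drop-out imbalance, unreported outcomes).
YES_IS_POSITIVE = {1, 2, 3, 4, 7}
NO_IS_POSITIVE = {5, 6}


def quality_score(answers):
    """Return the number of positive responses (0-7).

    answers: dict mapping question number (1-7) to "yes", "no", or "unclear".
    Any answer other than the favourable one scores zero for that question.
    """
    score = 0
    for question, answer in answers.items():
        answer = answer.strip().lower()
        if question in YES_IS_POSITIVE and answer == "yes":
            score += 1
        elif question in NO_IS_POSITIVE and answer == "no":
            score += 1
    return score


# Example: a trial with unclear allocation concealment (question 2).
example = {1: "yes", 2: "unclear", 3: "yes", 4: "yes", 5: "no", 6: "no", 7: "yes"}
print(quality_score(example))
```

Treating "unclear" as non-positive is an assumption of this sketch; the article does not state how such answers were scored.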

Outcome measure
The outcome measure for the meta-analysis was the change from baseline total WOMAC score as reported at 12 or more weeks. The WOMAC instrument consists of 24 questions answered on a 0-4 Likert or 0-100 visual analogue scale (VAS). The WOMAC has 3 subscales: function (17 questions), pain (5 questions), and stiffness (2 questions). A lower WOMAC score indicates fewer symptoms, thus improvement is shown as a negative value; negative values of larger magnitude are indicative of greater efficacy. WOMAC total and subscale scores are reported inconsistently, with publications reporting scores on different scales, some subscale scores and not others, different measures of variance, or no measures of variance. Scores are commonly reported as: a) a total of the Likert scores, b) a total of the VAS scores, or c) normalized units with total and subscale scores reported on 0-100 scales [34]. To overcome this issue, WOMAC total scores were converted to a 0-100 normalized scale using a direct ratio. If change from baseline was not reported, it was calculated as the difference between baseline and endpoint or, if not possible, as the difference between baseline and a weighted average of multiple observations during treatment [35]. When subscale scores were reported without the total score, the total score and variance were calculated from the subscales. Missing stiffness subscale scores were imputed by substituting the mean of those reported for that treatment. Studies reporting neither the total score nor the pain and function subscale scores were omitted from the analysis.
Note: a value imputed by estimating a stiffness subscore from other scores reported for that treatment; b study longer than 12 weeks duration; c included in Bayesian analysis only, no placebo arm; d washout is not considered as complete in studies with concomitant analgesic use; e denotes studies without a washout period; f denotes studies with enriched enrollment design; g indicates endpoint WOMAC score, change from baseline not available in these studies; h indicates difference from placebo in WOMAC score change from baseline.
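The direct-ratio conversion described in this section can be sketched as below. This is an illustration under stated assumptions rather than the authors' extraction procedure: the function names are hypothetical, the Likert total is assumed to range 0-96 (24 items scored 0-4), and the VAS total is assumed to range 0-2400 (24 items scored 0-100).

```python
# Hypothetical sketch: rescale WOMAC total scores to a 0-100 normalized scale.
# Maximum possible totals per reporting format (assumed, see lead-in).
MAX_TOTAL = {
    "likert": 24 * 4,     # 24 questions, each 0-4
    "vas": 24 * 100,      # 24 questions, each 0-100
    "normalized": 100,    # already on a 0-100 scale
}


def normalize_total(score, scale):
    """Convert a WOMAC total (or its SD) to 0-100 units by direct ratio."""
    return 100.0 * score / MAX_TOTAL[scale]


def change_from_baseline(baseline, endpoint):
    """Endpoint minus baseline, so improvement (fewer symptoms) is negative."""
    return endpoint - baseline


# Example: a Likert total falling from 48 to 36 out of 96.
baseline = normalize_total(48, "likert")   # 50.0
endpoint = normalize_total(36, "likert")   # 37.5
print(change_from_baseline(baseline, endpoint))  # -12.5
```

Because the conversion is a linear rescaling, a standard deviation reported on the original scale can be passed through the same ratio.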

Statistical analysis
Frequentist and Bayesian methods were used to assess the effect of including the direct and indirect data in the analysis. The frequentist meta-analysis using Bucher indirect comparisons was chosen because it reports traditional statistical measures, whereas the Bayesian network meta-analysis allows for inclusion of both direct and indirect information in a single step. In both frequentist and Bayesian methods, if multiple arms for a treatment were present in a study at different doses, the arms used were consistent with the United Kingdom prescribing information. For tramadol, the 400-mg daily dose was not included as it is associated with higher rates of adverse events and similar efficacy to the 300-mg dose [36]. The frequentist meta-analysis used the difference between treatment and placebo of the change from baseline WOMAC score for each active treatment. Random effects models using the DerSimonian-Laird method were employed regardless of heterogeneity due to study design and population dissimilarities [37]. Estimated treatment effects compared to placebo and compared to duloxetine were calculated with their 95% confidence intervals using the Bucher method of indirect comparison [38][39][40][41]. Frequentist analyses were performed with Comprehensive Meta-Analysis software (CMA; Biostat, Englewood, NJ) [42]. Publication bias was assessed by funnel plot with Duval and Tweedie's trim and fill [37].
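The two frequentist steps named above can be sketched in a few lines. This is a minimal illustration of the standard DerSimonian-Laird estimator and the Bucher adjusted indirect comparison, not a reproduction of the CMA software; the function names and all numbers in the example are hypothetical and are not taken from the included trials.

```python
import math


def dersimonian_laird(effects, variances):
    """Pool study effects (treatment minus placebo) with a DerSimonian-Laird
    random-effects model. Returns (pooled_effect, standard_error, tau2)."""
    k = len(effects)
    w = [1.0 / v for v in variances]                 # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Method-of-moments between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]   # random-effects weights
    sw_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sw_star
    return pooled, math.sqrt(1.0 / sw_star), tau2


def bucher(d_a, se_a, d_b, se_b, z=1.96):
    """Indirect comparison of A versus B through a common comparator
    (here placebo). Returns (difference, (ci_low, ci_high))."""
    diff = d_a - d_b
    se = math.sqrt(se_a ** 2 + se_b ** 2)            # variances add
    return diff, (diff - z * se, diff + z * se)


# Hypothetical pooled effects versus placebo for two treatments.
d_x, se_x, _ = dersimonian_laird([-6.0, -8.0, -7.0], [2.0, 3.0, 2.5])
d_y, se_y, _ = dersimonian_laird([-4.0, -9.0], [1.5, 2.0])
print(bucher(d_x, se_x, d_y, se_y))
```

An indirect confidence interval that includes zero corresponds to the "no statistically significant difference" reading used in the results.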
Random effects Bayesian network meta-analyses were performed using the change from baseline score for all available studies. Bayesian methods described in NICE Decision Support Unit documents were modified to accommodate continuous data analysis [43,44]. Each trial's specific relative treatment effect was assumed to be drawn from a random effects normal distribution with a common random effects variance for all treatment comparisons. The best model was selected based on the deviance information criterion (DIC), described in Cooper et al. [45] and Dias et al. [46], and the standard deviation (SD), which provide measures of model fit. The consistency between direct and indirect evidence was assessed using node splitting methods described by Dias et al. [46]. Estimated treatment effects compared to placebo and duloxetine were given with their associated 95% credible intervals as well as the probability of the treatment being superior to duloxetine. Sensitivity analyses were run on various scenarios, including adjustment for baseline scores, flare requirement, and analgesic use. The Bayesian analyses were conducted using WinBUGS version 1.4.3 (MRC Biostatistics Unit; Cambridge, UK) [47].
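Once posterior draws of the treatment effects are available, the credible interval and the probability of superiority are simple summaries of those draws. The sketch below illustrates that summarisation step only; it is not the WinBUGS model used in the study, the function names are hypothetical, and it assumes paired draws from the same MCMC iterations with more negative WOMAC change meaning greater efficacy.

```python
import statistics


def prob_superior(draws_a, draws_b):
    """Share of paired posterior draws in which treatment A has the more
    negative (better) change from baseline than treatment B."""
    wins = sum(1 for a, b in zip(draws_a, draws_b) if a < b)
    return wins / len(draws_a)


def credible_interval(draws):
    """Equal-tailed 95% credible interval from posterior draws.

    statistics.quantiles with n=40 returns cut points at 2.5% steps, so the
    first and last cut points are the 2.5th and 97.5th percentiles."""
    cuts = statistics.quantiles(draws, n=40, method="inclusive")
    return cuts[0], cuts[-1]


# Hypothetical draws (a real analysis would use thousands of iterations).
duloxetine = [-8.0, -7.0, -6.0, -5.0]
comparator = [-6.0, -6.0, -6.0, -6.0]
print(prob_superior(duloxetine, comparator))  # 0.5
differences = [a - b for a, b in zip(duloxetine, comparator)]
print(credible_interval(differences))
```

A credible interval for the difference that includes zero corresponds to the "no evidence of difference" reading used in the results.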
Heterogeneity was assessed by calculating the I² statistic. Twelve population and study characteristics were assessed as possible confounding factors by visually inspecting forest plots for the magnitude and variability of study WOMAC scores.
Note: "Were there any unexpected imbalances in drop-outs between groups? If so, were they explained or adjusted for?". f "Is there any evidence to suggest that the authors measured more outcomes than they reported?". g "Did the analysis include an intention-to-treat analysis? If so, was this appropriate and were appropriate methods used to account for missing data?". h Quality Score is calculated by summing the positive answers to each question ("yes" answers to questions 1-4 and 7, and "no" answers to questions 5 & 6).
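For reference, the heterogeneity statistic can be computed from study effects and their variances as sketched below. This uses the standard definition, I² = max(0, (Q - df) / Q) × 100, where Q is Cochran's Q and df is the number of studies minus one; the function names and example values are hypothetical.

```python
def cochran_q(effects, variances):
    """Cochran's Q: weighted squared deviations from the fixed-effect mean."""
    w = [1.0 / v for v in variances]
    mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    return sum(wi * (yi - mean) ** 2 for wi, yi in zip(w, effects))


def i_squared(effects, variances):
    """Percentage of total variability attributed to between-study
    heterogeneity rather than sampling error (0 when Q <= df)."""
    q = cochran_q(effects, variances)
    df = len(effects) - 1
    if q <= 0:
        return 0.0
    return max(0.0, 100.0 * (q - df) / q)


# Hypothetical treatment-minus-placebo effects and variances for three trials.
print(i_squared([-4.0, -9.0, -6.0], [1.5, 2.0, 1.0]))
```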

Literature search
incomplete reporting of WOMAC scores, especially the omission of a measure of variance. One full paper was unavailable [48]. Table 1 presents the studies included in the meta-analysis with 5 extracted study characteristics as well as baseline and change from baseline WOMAC scores. The duration of nearly all studies was 12 to 13 weeks, with a range of 12 to 26 weeks. The size of treatment arms ranged from 51 patients in a placebo arm to 481 in a celecoxib arm. Seven studies did not report baseline WOMAC scores. Three studies were identified in which complete WOMAC scores were not reported in the publication, but were available on clinicaltrials.gov. These studies are identified in the table with both the publication reference and the NCT number from clinicaltrials.gov. Table 2 presents descriptive statistics of the included studies grouped by treatment. In Table 3 the quality assessments of the included studies are presented. Of the 32 included articles, 26 (81%) had a quality score of 6 or 7 (maximum score 7) and the other 6 studies had a quality score of 5, indicating that the included studies were of sufficiently high quality. A funnel plot assessing publication bias, run on all studies because too few studies per compound were available, was roughly symmetrical, with slightly more studies on the left, indicating little effect of publication bias on the results of this analysis (Figure 2). Missing publications have been imputed using Duval and Tweedie's trim and fill and appear as solid points among the actual publications depicted as circles [37]. This method suggests that possible missing studies would trend to nonsignificant differences in means.

Statistical results
Results of both the frequentist and Bayesian analyses are shown in Table 4. The frequentist approach analyzed 32 of the 34 studies, excluding Sowers et al. [74] and Essex et al. [58] due to the lack of placebo arms. All active treatments, except hydromorphone and oxycodone, were found to statistically significantly improve the WOMAC total score compared to placebo. Indirect comparisons to duloxetine using the Bucher method found that all confidence intervals except that for etoricoxib encompassed zero, indicating the differences between duloxetine and all treatments except etoricoxib were not statistically significant. Two compounds, ibuprofen and etoricoxib, had an I² of zero, while naproxen, celecoxib, duloxetine, oxycodone, hydromorphone, and tramadol had I² values of 52%, 33%, 44%, 72%, 64%, and 58%, respectively, indicating substantial heterogeneity [78,79]. However, the direction of the treatment effect was the same for all but one study; the magnitude of the treatment effect in these studies was the source of heterogeneity.
The Bayesian network meta-analysis included all 34 studies. Figure 3 depicts the network of direct and indirect evidence. As shown in Table 4, the results lead to similar conclusions as the frequentist results, as all 95% credible intervals of the difference between duloxetine and active treatments included zero.
To explain heterogeneity/inconsistency, we graphically explored the association of relative effect of the active treatment versus placebo with study-level covariates. Forest plots were generated for each population and study characteristic showing the difference between placebo and treatment of the change from baseline, ordered by the value of the characteristic (see Additional files 1 to 11). Figure 4 is the forest plot for baseline WOMAC scores. A visual association was indicated between baseline and change from baseline scores, with a higher baseline score associated with a larger negative (improved) change from baseline. Figure 5 is a verifying scatter plot showing the trial-level baseline WOMAC scores between 45 and 70 and the relative treatment effect appearing to increase as the trial-level baseline increases. A frequentist meta-regression confirmed an association between the baseline and change from baseline scores (p < 0.0001) with an R² of 0.573, indicating much of the observed improvement in symptoms was associated with a higher baseline level of symptoms.
Bayesian meta-regression models including study-level covariates were used to evaluate the extent to which covariates accounted for heterogeneity of treatment effects. Three models including study-level covariates yielded lower, similar DICs (see Table 5). The model including the baseline score yielded both the lowest DIC and a substantially smaller SD of heterogeneity. Therefore, the model including the baseline score was preferred. Adjusted for baseline score, credible intervals of all treatments but tramadol and hydromorphone included zero, indicating no evidence of difference from duloxetine. In the cases of tramadol and hydromorphone, duloxetine demonstrated evidence of a clear advantage. When adjusted for baseline, the probability of duloxetine being superior increased for naproxen (19% to 57%), ibuprofen (28% to 82%), and etoricoxib (4% to 38%), but went down for oxycodone (41% to 15%).
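A trial-level regression of treatment effect on baseline score can be sketched with inverse-variance weighted least squares, as below. This is a simplified fixed-effect illustration, not the meta-regression routine of the software used in the study; the function name and data are hypothetical, and the weighted R² computed here is the ordinary least-squares quantity, which may differ from the R² analogue that meta-analysis packages report.

```python
def weighted_linear_fit(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x.

    x: trial-level baseline WOMAC scores
    y: treatment-minus-placebo change from baseline
    w: weights, e.g. the inverse of each effect's variance
    Returns (slope, intercept, r_squared)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_tot = sum(wi * (yi - y_bar) ** 2 for wi, yi in zip(w, y))
    ss_res = sum(
        wi * (yi - (intercept + slope * xi)) ** 2 for wi, xi, yi in zip(w, x, y)
    )
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical trials: higher baseline, larger (more negative) effect.
baseline = [45.0, 52.0, 58.0, 64.0, 70.0]
effect = [-3.5, -5.0, -5.5, -7.5, -8.0]
weights = [0.5, 0.8, 0.6, 0.7, 0.4]
print(weighted_linear_fit(baseline, effect, weights))
```

A negative slope in such a fit corresponds to the pattern described above: trials with higher baseline scores show larger improvements relative to placebo.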

Discussion
Our analysis employed the WOMAC, a common instrument in OA trials, with subscales for function, pain, and stiffness. It is, therefore, a broader measure of OA health than instruments that focus solely on pain. Randomized controlled trials and meta-analyses in OA commonly focus on the difference between the treatment and placebo arms of improvement from baseline to endpoint. Although a commonly reported measure in meta-analysis is the standardized mean difference (Cohen's d), we chose to report the unstandardized total WOMAC score, as it is a more meaningful outcome to clinicians. In the absence of consistent statistical significance, clinical relevance was not discussed. Because OA is a chronic condition, studies were included only with a treatment duration of at least 12 weeks, the current recommended minimum duration of confirmatory chronic pain trials [30]. This has not been universal practice in other meta-analyses of OA [8][9][10][11][15][16][17].
With our choice of the WOMAC composite score as the outcome of interest, we chose a continuous endpoint (mean and standard deviation) rather than a dichotomous variable. It is recognized that others recommend the use of dichotomous variables (e.g., 50% reduction in pain score) for evaluation of chronic pain trials. This recommendation is based on the benefits of treatment being frequently unequally distributed, typically presenting as a U-shaped distribution [81]. The WOMAC, however, is rarely reported in this manner, and our aim was to report the broader definition of health that the WOMAC encompasses, rather than pain alone.
Song et al. [41] suggest that judicious use of meta-analytical methodology can come to similar results as direct head-to-head evidence. It is frequently not possible, however, to fully account for differences in patient populations, the impact of different trial designs, and additional hidden confounders. For example, some of the trials applied flexible dose regimens (including 1 duloxetine trial) while others applied fixed dose regimens; this could impact comparative results. Enriched enrollment, a treatment run-in after screening to titrate patients up to optimal tolerability, is frequently used in opioid trials due to their well-known dosing requirements. NSAID trials, on the other hand, tend to exclude patients with a known bleeding risk or cardiovascular risk factors due to NSAIDs' known safety profile. In the case of duloxetine, and in contrast to most other trials, a washout of previous NSAIDs was not enforced. Patients in duloxetine trials were allowed to continue (but not increase) treatment with NSAIDs, with a higher proportion of patients receiving NSAIDs in placebo arms. Because this design feature applied only to duloxetine trials, it could not be accounted for overall. Such aspects can limit the interpretation and generalizability of meta-analytic results.
Statistical analyses were performed using both frequentist and Bayesian methods. Frequentist methods have the advantage of using more familiar concepts and terminology. Bayesian network meta-analysis methods have the advantage of using all the data available, such as arms from active treatment controlled trials. In this study both methods produced similar results.
Our results mirror findings from previous studies. A 1997 study could not recommend a choice of NSAID therapy [21]. A more recent meta-analysis commissioned by NICE did not find a statistically significant difference among NSAIDs [82]; guidelines treat NSAIDs as a class differentiated primarily by adverse events [2,3]. A meta-analysis of the short-term efficacy of treatments for OA of the knee found no statistical difference in pain relief between NSAIDs and opioids [6]. For duloxetine, our analysis repeats findings from previous studies in other pain indications. For both DPNP and fibromyalgia, duloxetine has been shown to be of similar efficacy to alternative treatment options [83,84]. Our study found a significant relationship between baseline symptoms and the magnitude of treatment effect. The related issue of the influence of flare design in trials of NSAIDs has previously been noted [7,85].
A limitation of this meta-analysis was the low number of studies available for analysis. Four or more studies were available for celecoxib, naproxen, tramadol, and etoricoxib. For all other treatments, 3 or fewer studies were found. Eight studies were omitted from the Bayesian analysis adjusted for baseline WOMAC score, due to the omission of baseline scores in study publications. These numbers were, however, similar to several other meta-analyses in OA [7,8,18,21]. Limiting the literature search to English language publications may have led to missed RCTs. However, a study examining the effect of an English-language restriction in systematic reviews and meta-analyses found no evidence of bias as a result of the restriction [86]. The funnel plot suggests that publication bias, if any, was towards the exclusion of statistically nonsignificant studies, further supporting our findings of no difference among comparators. Another limitation of this study is the potential for ecological fallacy associated with patient level characteristics. For example, the mean baseline WOMAC score used in the regression analysis could represent a wide variety of patient level baseline scores. A study by Lange et al. [13] points out that imputed data may bias results, showing benefit of treatment where no benefit is seen in the non-imputed data. Thus, the imputation methods used in several of the included studies could have introduced bias in the results. However, its reported effect size seems to be in the range of alternative opioid treatment options such as tramadol or oxycodone [50,87].

Conclusions
This meta-analysis found no consistent difference between duloxetine and other post-first line oral treatments for OA in the total WOMAC score after approximately 12 weeks of treatment. Etoricoxib was more effective than duloxetine in the frequentist analysis and had a 96% probability of being better than duloxetine in the nonadjusted Bayesian analysis. After adjustment for baseline WOMAC score, however, duloxetine showed evidence of superiority to both tramadol and hydromorphone, but not to the other treatments, including etoricoxib.