The reporting quality of studies investigating the diagnostic accuracy of anti-CCP antibody in rheumatoid arthritis and its impact on diagnostic estimates

Background Recently anti-CCP testing has become popular in the diagnosis of rheumatoid arthritis (RA). However, inadequate reporting of the relevant diagnostic studies may lead to overestimated and biased results, directing scientists into making false decisions. The aim of the present study was to evaluate the reporting quality of studies that used anti-CCP2 for the diagnosis of RA and to explore the impact of reporting quality on pooled estimates of diagnostic measures. Methods PubMed was searched for clinical studies that investigated the diagnostic accuracy of anti-CCP. The studies were evaluated for their reporting quality according to the STARD statement. The overall reporting quality and the differences between high and low quality studies were explored. The effect of reporting quality on pooled estimates of diagnostic accuracy was also examined. Results The overall reporting quality was relatively good, but some essential methodological aspects of the studies are seldom reported, making the assessment of study validity difficult. Comparing the quality of reporting in high versus low quality articles, significant differences were seen in a relatively large number of methodological items. Overall, the STARD score (high/low) had no effect on the pooled sensitivities and specificities. However, the reporting of specific STARD items (e.g. sufficient reporting of the methods used in calculating the measures of diagnostic accuracy, and reporting of demographic and clinical characteristics/features of the study population) had an effect on sensitivity and specificity. Conclusions The reporting quality of the diagnostic studies needs further improvement since the study quality may bias the estimates of diagnostic accuracy.


Background
Rheumatoid arthritis (RA) is a chronic, systemic inflammatory disorder that affects many tissues and organs, mainly synovial joints [1]. The disease leads progressively to the destruction of articular cartilage and ankylosis of the joints [2]. Although the cause of RA is unknown, autoimmunity plays a pivotal role in both its chronicity and progression [3]. RA affects females more frequently than males and is diagnosed mainly between the ages of 40 and 60 years [4].
The diagnosis of RA is based on clinical criteria and laboratory tests. Regarding the latter, the presence of the rheumatoid factor (RF), an autoantibody, constitutes one of the American College of Rheumatology (ACR) criteria for presence and severity of RA [5]. However, RF has limited specificity since it can be detected in other autoimmune or infectious diseases, and in the healthy elderly. Anti-cyclic citrullinated protein antibodies (anti-CCP) are other autoantibodies that may be detected in RA patients. Recently anti-CCP testing has become a substantial part of the ACR-EULAR classification criteria for RA [6]. There is evidence that CCP assays provide performance comparable with that of RF [7]. However, analysis of the association between anti-CCP antibody titre and RA activity produced contradictory results [8,9]. The anti-CCP2 assay is the most popular because of its high diagnostic specificity and its predictive and prognostic value in RA [10][11][12].
Currently, diagnostic studies on anti-CCP assays are being published at a high rate [13]. However, overestimated and biased results from poorly designed and reported studies may direct scientists into making false decisions [14][15][16]. Reporting information on the design and conduct of diagnostic studies is crucial, yet its absence has already been noted [17,18]. Appropriate reporting allows researchers to detect potential bias in a study's internal validity and to assess the generalizability and applicability of its results [19]. A survey of published studies of diagnostic accuracy showed that the methodological quality was not optimal. In addition, information on issues like study design, conduct and data analysis was often not reported [20,21].
Inadequate reporting of the published diagnostic accuracy studies may restrict the generalizability, applicability and credibility of studies' results. A number of guidelines and statements have been developed to improve the reporting quality of a variety of study designs [22], including diagnostic accuracy studies [19]. In particular, in order to improve the reporting of diagnostic accuracy studies, the Standards for Reporting of Diagnostic Accuracy (STARD) statement has been proposed (http://www.stard-statement.org/) [19]. The STARD statement is a checklist of 25 criteria that diagnostic accuracy studies should conform to in order to make their conclusions easier to assess, interpret and generalize, and as a result lead to better decisions in diagnosis. However, STARD does not assess the actual quality of the research study but the reporting quality, two issues which are not necessarily correlated. In addition to STARD, another tool, called QUADAS, has been proposed for assessing the methodological quality of diagnostic accuracy studies [23]. Recently, QUADAS was used to evaluate the quality of anti-CCP RA studies in a meta-analysis [13].
The aim of the present study was twofold: first, to evaluate the reporting quality of studies that used anti-CCP2 for the diagnosis of RA, according to the STARD statement, and second, to investigate whether quality of reporting is associated with the effect size of diagnostic metrics using meta-analytic techniques (data synthesis). The analysis focused on the reporting of the methods and results sections of the STARD statement. The effect of quality on diagnostic accuracy was examined for studies scored as "high quality" and "low quality", and for specific items of STARD.

Study identification
PubMed was searched for clinical studies, published from January 1987 (the date the revised ACR criteria were introduced [5]) to September 2010, that assessed the utility of the anti-CCP2 assay in the diagnosis of RA. The search used the following strategy: (("diagnosis" or "diagnostic" or "sensitivity" or "specificity") and ("rheumatoid arthritis" or "RA") and ("anti-cyclic citrullinated peptide antibodies" or "anti-CCP" or "antiCCP" or "anti-CCP2" or "antiCCP2")).
The authors independently reviewed the abstracts to determine whether each article identified by the search strategy was potentially eligible. The references of the retrieved articles were also searched. Only English-language articles published as full papers or short reports were considered in our study. Reviews, editorials, letters and comments were excluded. The agreement level was reported using Kappa statistics.
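The agreement statistic mentioned above can be illustrated with a minimal sketch. It assumes the unweighted Cohen's kappa for two reviewers making categorical decisions; the text does not state which variant of kappa was used, and the function name and the screening decisions below are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters judging the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed proportion of items on which the two raters agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical eligibility decisions by two reviewers (not data from the study)
reviewer_1 = ["include", "include", "exclude", "exclude"]
reviewer_2 = ["include", "exclude", "exclude", "exclude"]
print(round(cohen_kappa(reviewer_1, reviewer_2), 2))  # prints 0.5
```

In this toy example the reviewers agree on three of four abstracts (0.75), chance agreement is 0.5, and kappa is therefore 0.5.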

Study selection
We included studies that evaluated the utility of anti-CCP2 antibody for the diagnosis of RA, enrolled more than 10 participants and provided data sufficient to estimate both sensitivity and specificity. Controls were defined as participants free of RA (i.e. diseased with other conditions or healthy). Disagreements were resolved by discussing the full articles.

Data abstraction
The data were abstracted from each study by two authors (AP and DZ) independently. Data were extracted by using a standardized form that included study setting and technical details of the assay, demographic characteristics of the patients and 2×2 contingency tables (disease status and test outcome) needed to calculate at least the sensitivity and specificity.
When articles reported more than one set of 2×2 data (such as assay data from different manufacturers and/or different cut-offs), each data set was considered as a different study. Also, articles that reported data separately for multiple control groups (diseased, healthy) were considered as separate studies. In overlapping studies, the most recent and/or the largest study was recorded. The agreement level was also reported using Kappa statistics.

Study quality assessment with STARD
Although all items in the STARD statement are considered important for improving the quality of reporting of diagnostic accuracy studies, some are more subjective than others in assessing potential biases. Thus, in the present study we focused on methodologically related items, i.e. the items that correspond to the methods and results sections (eleven items in each category). In total, 22 items were considered (Table 1). In order to better determine whether an item is accurately reported in the articles, we took into account the guidance provided by the STARD Explanation and Elaboration document [21]. All items were investigated in terms of whether they were reported, not whether they were actually carried out during the study. Items were scored as "yes" if they were reported in enough detail to allow the reader to judge that the definition had been met. In particular, item 14 (information on participant recruitment) was coded as "yes" only when the flow diagram was given or explicitly described (i.e. the number of controls per case was specified and the matching variables were clearly stated). Alternative responses (apart from "yes" or "no") and unclear responses to each item were coded as negative responses.

Estimation of diagnostic accuracy
The estimation of the diagnostic accuracy was based on the sensitivity (Se) and specificity (Sp). Se and Sp were calculated from contingency tables abstracted from each study.
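As an illustration of this calculation, the following sketch computes Se and Sp from the four cells of a 2×2 contingency table. The function name and the counts are hypothetical and are not taken from any included study.

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Return (Se, Sp) from the cells of a 2x2 contingency table.

    tp: RA patients with a positive test   fn: RA patients with a negative test
    fp: controls with a positive test      tn: controls with a negative test
    """
    diseased = tp + fn
    non_diseased = fp + tn
    if diseased == 0 or non_diseased == 0:
        raise ValueError("both the RA group and the control group must be non-empty")
    return tp / diseased, tn / non_diseased


# Hypothetical counts: 100 RA patients and 100 controls
se, sp = sensitivity_specificity(tp=70, fp=5, fn=30, tn=95)
print(f"Se = {se:.2f}, Sp = {sp:.2f}")  # prints Se = 0.70, Sp = 0.95
```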

Data synthesis and analysis
For each study the diagnostic metrics (Se, Sp, positive and negative likelihood ratio) were calculated. A bivariate model [24,25] was used to estimate summary sensitivity and specificity, with 95% confidence and prediction regions around the summary points. Hierarchical SROC analysis, which allows for between-study heterogeneity, was also applied when four or more studies were available [25]. Heterogeneity was evaluated visually by using the SROC curve and numerically by using the variance of the logit-transformed sensitivity and specificity. A smaller variance indicates lower between-study heterogeneity. The statistical analysis was performed using Stata v.10 (metandi and metandiplot commands [26]) (StataCorp, College Station, Texas) and SPSS, version 13.0 (SPSS Inc., Chicago).
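The per-study metrics and the logit transformation can be sketched as follows. This is only a descriptive illustration: the sample variance of the logits computed here is a crude analogue and is not the random-effects variance estimated by the bivariate model in metandi. All function names and input values are hypothetical.

```python
import math
from statistics import variance


def likelihood_ratios(se, sp):
    """Positive and negative likelihood ratios from sensitivity and specificity."""
    if not (0 <= se <= 1) or not (0 < sp < 1):
        raise ValueError("need 0 <= se <= 1 and 0 < sp < 1")
    lr_pos = se / (1 - sp)
    lr_neg = (1 - se) / sp
    return lr_pos, lr_neg


def logit(p):
    """Logit transformation of a proportion strictly between 0 and 1."""
    return math.log(p / (1 - p))


def logit_variance(proportions):
    """Sample variance of logit-transformed proportions across studies.

    A crude descriptive summary of heterogeneity, not a model-based estimate.
    """
    return variance([logit(p) for p in proportions])


# Hypothetical study-level sensitivities (not data from the included studies)
sensitivities = [0.62, 0.70, 0.75, 0.81]
lr_pos, lr_neg = likelihood_ratios(0.70, 0.95)
print(lr_pos, lr_neg, logit_variance(sensitivities))
```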

Effect of study quality
In addition to the overall percentages of reporting of the STARD statement items, the quality of reporting in high versus low quality articles was explored. Studies were classified as high quality of reporting when the quality score was ≥ 9 and as lower quality when the quality score was < 9. The cut-off of 9 was chosen because it was the median of the overall quality scores of the studies. The overall quality score for each article was calculated by summing the weighted scores of the reported items. A unit weight was applied to each of items 2, 5, 7, 10, 13, 16 and 19 (considered subjectively more "important"), whereas a weight of 0.5 was applied to each of the other items. The effect of study quality on diagnostic accuracy was evaluated based on the level of quality (high/low) and on the reporting of the above "important" STARD items. The estimates of pooled sensitivities and specificities were then compared with a z-score test.
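The weighting scheme described above can be written down as a short sketch. It assumes that the 22 assessed items are numbered 1 to 22, which is an assumption made for illustration only; the function names are also hypothetical.

```python
# Items given a unit weight in the text; all other items receive 0.5
IMPORTANT_ITEMS = frozenset({2, 5, 7, 10, 13, 16, 19})
# Assumption for this sketch: the 22 assessed items are numbered 1..22
ALL_ITEMS = frozenset(range(1, 23))
CUTOFF = 9  # median of the overall quality scores, as stated in the text


def quality_score(reported_items):
    """Weighted reporting score: 1 per 'important' item, 0.5 per other item."""
    reported = set(reported_items)
    unknown = reported - ALL_ITEMS
    if unknown:
        raise ValueError(f"unknown item numbers: {sorted(unknown)}")
    return sum(1.0 if item in IMPORTANT_ITEMS else 0.5 for item in reported)


def quality_level(score):
    """Classify an article as 'high' (score >= 9) or 'low' (score < 9)."""
    return "high" if score >= CUTOFF else "low"


# Hypothetical article that reports the seven weighted items plus four others
score = quality_score([2, 5, 7, 10, 13, 16, 19, 1, 3, 4, 6])
print(score, quality_level(score))  # prints 9.0 high
```

Under these assumptions the maximum attainable score is 7 × 1 + 15 × 0.5 = 14.5.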

Eligible studies
The literature review identified 364 articles that met the search criteria in PubMed. These articles were retrieved and screened for eligibility. Overall, a total of 103 unique articles remained for analysis after full-text evaluation. Figure 1 presents a flow diagram of retrieved articles and articles excluded, with the reasons for exclusion. The agreement in evaluating articles for eligibility and in extracting the data were both relatively high (kappa = 0.74 (0.70-0.78) and kappa = 0.86 (0.82-0.90), respectively). A full list of the 103 articles that were retrieved as full text and included in the final analysis is available at the Web site http://biomath.med.uth.gr.

Study characteristics
The characteristics of studies included in the analysis are shown in Additional file 1: Table S1. A list of journals that endorsed the STARD statement is shown in Additional file 2: Table S2. The 103 eligible articles were published during the period 2003-2010. Consequently, all the eligible articles were published in or after the year the STARD statement was introduced (i.e. 2003). In total, 35 different populations (countries) were represented in the eligible articles. Most of the studies were conducted in Europe (51 articles, 49.5%), followed by Asia (31 articles, 30.1%), Africa (7 articles, 6.8%), North America (7 articles, 6.8%), South America (6 articles, 5.8%) and Oceania (1 article, 1.0%). Most of the articles referred to studies conducted in teaching hospitals (52 articles, 50.5%), and the second most frequent setting was rheumatology clinics (31 articles, 30.1%). In 13 out of 103 articles, the anti-CCP2 antibody was detected with more than one assay. The four most frequently used manufacturers' assays were Euro-Diagnostica (33 studies, 25.0%), Axis-Shield (32 studies, 24.2%), Inova Diagnostics (25 studies, 18.9%) and Euroimmun (19 studies, 14.4%). A variety of cut-offs were used to define a positive test result according to the different manufacturers, but in 9 articles/studies the threshold used was not explicitly given. The control group consisted of participants free of RA (i.e. diseased with other conditions or healthy). For the reasons above (multiple assays, cut-offs or control groups per article), the 103 articles yielded a total of 132 studies for the meta-analysis. The mean age of RA participants in the studies, where reported, ranged from 30 years to 70 years (missing information in 37 articles, 35.9%) and the proportion of women among RA participants, where reported, ranged from 23.2% to 100% (missing information in 29 articles, 28.2%).
Fifty-three articles (51.5%) were classified as high quality (STARD score ≥ 9) and 50 articles (48.5%) as lower quality (STARD score < 9) (Table 2). Table 1 shows the overall proportion of reporting of the 22 items in the methods and results sections of the STARD statement and the corresponding proportions for high and low quality articles. Overall, 10 items (six and four items in the methods and results sections, respectively) were reported by 85% or more of the studies (Table 1). In methods, the items include the reporting of 1) study population (inclusion/exclusion criteria, setting, location), 2) participant recruitment (e.g. based on symptoms, previous testing), 3) participant sampling, 4) data collection (prospective or retrospective study), 5) methods for calculating or comparing measures of diagnostic accuracy and statistical methods used to quantify uncertainty and 6) methods for calculating reproducibility, if done. In results, the items include the reporting of 1) clinical and demographic characteristics of the study population (age, sex, presenting symptoms, comorbidity, current treatment), 2) the cross tabulation or the distribution of the test results by the results of the reference standard, 3) estimates of variability of diagnostic accuracy between subgroups of participants or centers, if done and 4) estimates of test reproducibility, if done.

Main results
Furthermore, 13 items (including the ten items already mentioned above) were reported by 70% or more of the studies. The 3 additional items were the reporting of 1) the reference standard and its rationale, 2) the definition of and rationale for the units, cut-offs and/or categories of test results and 3) estimates of diagnostic accuracy and measures of statistical uncertainty.
In contrast, some items were reported only by a small fraction of articles. For example, 20% of articles provided the number, training and expertise of persons executing the tests, 18% reported the blinding status, 13% provided information on recruitment, 12% reported adverse events and finally, 8% provided details about handling of missing responses and outliers.
Table 1 notes: Lower quality articles, score < 9; higher quality articles, score ≥ 9. Items based on a smaller number of articles: n = 25 articles for items 11 and 22, n = 23 articles for item 21. P values were obtained from Fisher's exact test in order to express the association between proportions for reporting an item across the two groups of articles. STARD = Standards for Reporting of Diagnostic Accuracy.
Figure 1. Flow diagram of citations through the retrieval and the screening process.

Effect of study quality
In comparing the quality of reporting in high quality (quality score ≥ 9) versus lower quality (quality score < 9) articles, significant differences were seen in 11 items (P < 0.05) (6 items in methods: study population, data collection, reference standard, definition of units/cut-offs, number/training/expertise of persons executing the tests, methods for calculating diagnostic measures; and 5 in results: dates of recruitment, clinical/demographic characteristics, information on recruitment, time interval between tests, estimates of diagnostic accuracy). In all these items high quality articles showed better performance. An item-by-item comparison is presented in Table 1.
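The item-wise comparison between the two groups relies on Fisher's exact test for a 2×2 table (reported "yes"/"no" by high/lower quality). A minimal standard-library sketch of the two-sided test is given below; it sums the probabilities of all tables that are no more likely than the observed one, which is one common definition of the two-sided p value and may differ from the convention of the software the authors used. The counts in the example are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    total = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over tables at least as extreme (no more probable) than the observed one
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # tolerance for floating-point ties
    )
    return min(1.0, p_value)


# Hypothetical item: reported by 45 of 53 high quality and 20 of 50 lower quality articles
p = fisher_exact_two_sided(45, 8, 20, 30)
print(f"P = {p:.4g}")
```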
Impact of study quality on diagnostic estimates
Table 2 shows the meta-analysis' overall results (pooled sensitivities and specificities), the results according to STARD score (high/low quality) and the results for specific STARD items (comparison of outcome "yes" vs. "no").
In comparing specific items ("yes" vs. "no"), the estimates of pooled sensitivities differed for items 10 and 13 [p = 0.03 and p = 0.06 (marginal), respectively]. In addition, the estimates of pooled specificities differed significantly for items 13 and 16 (p = 0.01 and p = 0.01, respectively).
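The comparison of two pooled estimates with a z-score test can be sketched as follows. The text does not specify the scale on which the estimates were compared or how their standard errors were obtained, so this sketch simply assumes two independent estimates with known standard errors; the function name and the numbers are hypothetical.

```python
import math


def z_test(estimate_1, se_1, estimate_2, se_2):
    """Two-sided z test for the difference between two independent estimates.

    Returns the z statistic and the two-sided p value under a normal approximation.
    """
    if se_1 <= 0 or se_2 <= 0:
        raise ValueError("standard errors must be positive")
    z = (estimate_1 - estimate_2) / math.sqrt(se_1 ** 2 + se_2 ** 2)
    # Two-sided p value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical pooled sensitivities (item reported "yes" vs. "no") with standard errors
z, p = z_test(0.68, 0.02, 0.75, 0.03)
print(f"z = {z:.2f}, p = {p:.3f}")
```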

Discussion
The present study investigated the quality of reporting of studies using the anti-CCP2 assay in RA patients according to the STARD statement. The differences between high and low quality studies were explored. The effect of reporting quality on pooled estimates of diagnostic metrics was also examined. Our analysis focused on the reporting of methodological items (items in the methods and results sections). The search covered a publication period of 23 years and yielded 103 articles (corresponding to 132 studies). Almost all the articles used in our analysis were published after the introduction of the STARD statement (only 4 of them were published during 2003, the year STARD appeared). Although the overall reporting quality was relatively good (13 items were reported by 70% or more of the studies), there are some essential methodological aspects of the studies (such as number/training/expertise of persons executing the tests, readers' blinding to results, information on recruitment, adverse events from performing the tests, handling of missing responses and outliers) that are seldom reported, making it difficult for the reader to assess explicitly the validity of a study. Comparing the quality of reporting in high versus low quality articles, significant differences were seen in a relatively large number of methodological items (11 items referring to: study population, data collection, reference standard, definition of units/cut-offs, number/training/expertise of persons executing and reading the tests, methods for calculating diagnostic measures, dates of recruitment, clinical/demographic characteristics, information on recruitment, time interval between tests, estimates of diagnostic accuracy).
Overall, the STARD quality score (high/low) had no effect on pooled sensitivity and pooled specificity. However, the meta-analysis showed an effect for specific STARD items. Studies that did not sufficiently report the methods used in calculating the measures of diagnostic accuracy (item 10) may have overestimated the sensitivity. In addition, the reporting of demographic and clinical characteristics/features of the study population (items 13 and 16) affected the effect size of specificity, i.e. specificity was overestimated, which also indicates spectrum bias [19].
However, the findings of the present synthesis (sensitivity of anti-CCP2, 71% and specificity, 96%) are compatible with those of earlier reviews (Nishimura et al. [27]: sensitivity, 67% and specificity, 95%; Whiting et al. [13]: sensitivity, 67% and specificity, 96%). The somewhat higher overall sensitivity in our analysis might have resulted from the lack of stratification by study design or disease duration.
In a recent review, Whiting et al. [13] compared the accuracy of ACPA with that of RF in diagnosing RA in patients with early symptoms of the disease. They also assessed their studies for methodological quality by using a modification of the QUADAS criteria (items related to reporting quality were removed). However, the effect of quality on diagnostic accuracy was not evaluated further. In contrast, the primary aim of the present study was to evaluate the effect of quality of reporting (according to STARD) on diagnostic accuracy rather than the effect of methodological quality (according to QUADAS), although both tools can be useful for assessing the quality of diagnostic studies from different perspectives [28].
The STARD statement has been applied to assess the quality of reporting of diagnostic accuracy studies in various medical fields, such as diagnostic endoscopy [29], juvenile idiopathic arthritis in peripheral joints [30], diabetic retinopathy screening [31], glucose monitor studies [32], optical coherence tomography in glaucoma [33], ultrasonography for the diagnosis of developmental dysplasia of the hip [34] and screening ultrasonography for trauma [35].
A limitation of the present study is that the literature search was restricted to PubMed. In addition, some studies may have been missed since we included only studies that provided data to estimate both sensitivity and specificity. However, the number of articles used is relatively large, so an overview of the reporting quality of studies can be obtained, and the conclusions reached are unlikely to be affected by omitted studies. We would like to stress that lack of reporting of a STARD item does not necessarily imply that this item was not performed. Conversely, a badly performed but well reported study will necessarily receive full credit. Finally, the published studies had different design settings and involved different stages of rheumatoid arthritis (study design, disease duration), which may call into question the synthesis of information and, therefore, the generalizability of the results.
In conclusion, our attempt to assess the reporting quality of diagnostic accuracy studies in RA highlights the need for further improvement. Implementation of quality reporting statements (e.g. CONSORT) has already improved the quality of reporting in other fields of medical research [36]. Thus, guidelines on the reporting of diagnostic accuracy studies are expected to improve the quality of reports of diagnostic studies as well. Finally, the overall STARD quality score (high/low) had no effect on the pooled estimates of diagnostic accuracy, although the reporting of specific STARD items did.

Additional files
Additional file 1: Table S1. Characteristics of the studies.
Additional file 2: Table S2. Endorsement of STARD statement by journals.