Our findings show that, in the primary care setting, most administrative data algorithms for RA had high specificity. We found that incorporating specialist diagnosis codes increased PPV (51-83%), and that requiring multiple RA codes increased both specificity and PPV. However, increasing the duration of the observation window used to identify RA codes, or varying the time between RA codes, had little impact on algorithm performance. In addition, incorporating RA drugs only slightly improved PPV, hospitalization codes alone had poor sensitivity, and more complex case definitions involving exclusion criteria did not improve algorithm performance.
Overall, we have comprehensively evaluated the accuracy of administrative data algorithms measured against the reference standards of two independent validation samples, using the diagnoses documented within the medical charts of rheumatologists (previously reported) [4] and family physicians (reported here). After testing the administrative data algorithms and ranking them according to their performance characteristics, the optimal algorithm for identifying RA patients was identical in both samples. The algorithm of “[1 hospitalization RA code] OR [3 physician RA diagnosis codes with ≥1 RA code by a specialist in a 2 year period]” had a sensitivity of 78%, specificity of 100%, PPV of 78% and NPV of 100% when using our primary care reference standard. When we independently validated this algorithm among a random sample of 450 patients seen in rheumatology clinics [4], it demonstrated a sensitivity of 97%, specificity of 85%, PPV of 76% and NPV of 98%.
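To make the logic of this case definition concrete, the following sketch shows one way it could be applied to linked administrative records. The record layout, field names (`dx_code`, `date`, `specialist`) and the use of ICD-9 714 are illustrative assumptions only, not the actual data model or code lists used in our analysis.

```python
from datetime import timedelta

def has_ra_code(dx_code):
    """Illustrative check: ICD-9 714.x is a commonly used RA diagnosis code."""
    return dx_code.startswith("714")

def meets_case_definition(hospitalizations, claims):
    """Sketch of the preferred algorithm: [1 hospitalization RA code] OR
    [3 physician RA diagnosis codes, >=1 by a specialist, within 2 years].

    hospitalizations: iterable of dicts with a 'dx_code' key.
    claims: iterable of dicts with 'date' (datetime.date), 'dx_code', and
    'specialist' (bool) keys -- illustrative field names only.
    """
    # A single hospitalization carrying an RA diagnosis code qualifies on its own.
    if any(has_ra_code(h["dx_code"]) for h in hospitalizations):
        return True

    # Otherwise, look for any 2-year window containing >=3 physician RA claims,
    # at least one of which was billed by a musculoskeletal specialist.
    ra_claims = sorted((c for c in claims if has_ra_code(c["dx_code"])),
                       key=lambda c: c["date"])
    for i, anchor in enumerate(ra_claims):
        window = [c for c in ra_claims[i:]
                  if (c["date"] - anchor["date"]) <= timedelta(days=730)]
        if len(window) >= 3 and any(c["specialist"] for c in window):
            return True
    return False
```

In practice, the code lists (e.g., ICD-9 714 versus ICD-10 M05/M06), the handling of the 2-year window, and the definition of a qualifying specialist would need to match the specific databases and billing conventions under study.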
While we identified the same algorithm as optimal in both settings, we did not obtain identical results, which can be attributed to differences between the study samples (reference standards) with respect to disease prevalence, spectrum of disease, and type of comparator group. For example, 33% of patients had RA within the rheumatology sample [4] versus 0.9% within the primary care sample. The spectrum of disease (clinical characteristics) in our rheumatology sample comprised contemporary RA patients under active rheumatology care and treatment, compared with patients with a lifetime RA diagnosis who may currently be receiving only primary care. The comparator group (non-cases) in our rheumatology study consisted of patients with other rheumatologic diagnoses, in contrast to the comparators in our primary care sample, who were healthy or had diagnoses unrelated to RA.
In both samples, algorithm sensitivity was computed only among study subjects with RA, and specificity was computed only among those without RA. Sensitivity and specificity do not depend on the prevalence of RA in the study population, but they can vary across populations [16]. For example, the sensitivity of our administrative data algorithm was excellent (>97%) at identifying contemporary RA patients under active rheumatology care [4]. In contrast, sensitivity was moderately good (78%) at identifying patients with a lifetime diagnosis of RA (who include patients under active rheumatology care, but also patients whose symptoms may have resolved and who are no longer seeking RA care). When we varied our definition of RA based on the level of evidence in the primary care charts (i.e., varied the spectrum of disease in our cohort), stricter definitions of RA in the reference standard increased the sensitivity of the administrative data algorithms, whereas more liberal definitions (such as allowing any mention of RA with no supporting evidence, or a query RA diagnosis) decreased it. Thus, defining an a priori reference standard to classify individuals with RA has implications for validation study methodology. For instance, patients who fulfill strict classification criteria, such as the 1987 RA criteria [17], may have more advanced disease (longer disease duration, or more active disease requiring multiple physician visits) and therefore a greater chance of being detected by administrative data algorithms. This pattern was also observed in a review of studies testing administrative data algorithms: sensitivity was higher among RA patients who were required to meet strict classification criteria than among patients classified by more liberal criteria (such as an RA diagnosis documented in the medical record) [2].
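For clarity, both measures are conditioned on true disease status (these are standard definitions, restated here rather than drawn from our data):

\[
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad \text{specificity} = \frac{TN}{TN + FP},
\]

where true positives (TP) and false negatives (FN) are counted only among patients the reference standard classifies as RA, and true negatives (TN) and false positives (FP) only among those it classifies as non-RA; the proportion of RA patients in the sample does not enter either expression.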
On the other hand, as specificity is computed only among those without RA, the type of patients included in the non-RA comparator group (as defined by the reference standard) can influence the estimates of specificity. The lower specificity observed amongst non-RA patients in the rheumatology clinic study (85%, versus 100% in the primary care sample) reflects the fact that the patients without RA in the rheumatology clinic sample were individuals with other rheumatologic diagnoses, which may resemble RA, and/or who may at one point have been considered possible RA patients before their condition evolved into a clearer diagnosis, such as systemic lupus. In contrast, the comparator group of individuals without RA in our primary care sample included patients with other conditions unrelated to RA, which improved the specificity. This observation has implications for research that seeks to identify population-based algorithms with high specificity. Furthermore, it emphasizes the importance of reporting the characteristics of the patients in the comparator group to inform proper interpretation of specificity estimates.
While sensitivity and specificity depend on the characteristics of patients with and without the disease, respectively, predictive values depend on disease prevalence in addition to sensitivity and specificity. An important finding of our study is that our optimal algorithm had virtually the same PPV (78%) in both studies, even though our primary care sample had an RA prevalence reflective of the general population (0.9%) whereas the rheumatology sample had a study RA prevalence of 33% [4]. However, PPV estimates varied, ranging from 51-83% in our primary care sample and from 55-80% in the rheumatology sample. PPV estimates improved substantially when algorithms required a greater number of diagnosis codes and incorporated RA codes from musculoskeletal specialists. In addition to differences in prevalence between the two samples, specificity differed, with all algorithms tested within our primary care sample achieving very high (≥97%) specificity. In general, for conditions present in a minority of the study population (such as our primary care sample), specificity has a greater impact than sensitivity on PPV. However, specificity alone cannot be the sole factor driving PPV in the primary care sample: all of the algorithms tested had high specificity, yet many achieved only moderately good PPV. A more likely explanation is that the preferred algorithm identified fewer false positives in both settings, and fewer false positives increase PPV. This is likely owing to the nature of RA management, which often involves referral to a specialist, frequent physician visits and a non-curative, long disease course. These observations suggest that patients with prevalent (long-term) chronic conditions may have a higher probability of being identified by similar administrative data algorithms owing to the disease course, management practices, and frequency of physician visits. This interpretation is further supported by the concordance between our results and those from algorithms for case ascertainment of diabetes [17], hypertension [18], and other chronic diseases with substantially higher prevalence than RA.
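The dependence of PPV on prevalence, sensitivity and specificity can be made explicit using the standard relationship (a general identity, not a result specific to our data):

\[
\text{PPV} = \frac{\text{sensitivity} \times \text{prevalence}}{\text{sensitivity} \times \text{prevalence} + (1 - \text{specificity}) \times (1 - \text{prevalence})}.
\]

As an illustrative calculation (the unrounded specificities are not reported here, so the 99.8% figure below is an assumed value consistent with a rounded 100%): at a prevalence of 0.9% and sensitivity of 78%, a specificity of 99.8% yields a PPV of roughly 78%, whereas a specificity of 97% yields a PPV of only about 19%. At population-level prevalence, small absolute differences in specificity therefore dominate the PPV.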
This study has both strengths and limitations. Patients were randomly sampled from primary care physician records, we tested many more permutations of administrative data algorithms than other studies, and we conducted rigorous chart reviews. However, misclassification of RA is a potential risk if there is a lack of documentation in the medical record, such as a failure to capture all specialist consultation notes. Further, clinical characteristics of RA patients may be under-documented in primary care clinical records. Recognizing this challenge, we opted to include all physician-reported diagnoses to define our reference standard, as our retrospective study design can make it difficult to determine true disease status. Our present findings extend those of previous research performed using the General Practice Research Database, which also found a sensitivity of 80% for patients with 3 RA diagnosis codes [19]. Further, as the purpose of our study was to test the accuracy of algorithms for classifying RA patients within administrative data, we were unable to confirm the validity of the RA diagnosis itself (whether doctors were correctly diagnosing RA). However, the majority of RA patients had specialist notes confirming RA.
Another potential limitation is that our findings are derived from patients who have a regular source of primary care. Consequently, our results may not be generalizable to patients who do not have a regular primary care physician. Although almost ten million Ontarians (over 80% of the population) are now rostered to a primary care physician [20, 21], we acknowledge this limitation and opted to include inpatient RA diagnosis codes in our final preferred administrative algorithm, even though these codes alone had low sensitivity (22%) and offered little improvement over physician-billing algorithms. The addition of an inpatient RA code to “3 physician RA claims with ≥1 by a specialist in a 2 year period” may increase the sensitivity of our algorithm when it is applied to the entire population, since hospitalization data may be needed to identify RA cases who either have no regular physician or who are followed by the approximately 5% of Ontario physicians who are salaried (and who do not necessarily contribute to billing data) [22].
In addition, while our overall goal was to recommend the optimal algorithm for use in Ontario, we report the results of numerous algorithms so that researchers can choose the case definition best suited to their study population and purpose. Given the characteristics inherent in different administrative databases, it would be imprudent to suggest preferred algorithms for use outside of Canada, where other researchers are better informed about the characteristics of the databases under study. Rather, algorithms should be selected based on study purpose and feasibility, weighing the relative importance of the accuracy measures most relevant to a particular study. Choosing the wrong algorithm or prioritizing the wrong accuracy measure can lead to misclassification, which can reduce power, limit generalizability, increase bias, and possibly increase study costs [23].