
Reliability of medical record abstraction by non-physicians for orthopedic research



Abstract

Background

Medical record review (MRR) is one of the most commonly used research methods in clinical studies because it provides rich clinical detail. However, because MRR involves subjective interpretation of information found in the medical record, it is critically important to understand the reproducibility of data obtained from MRR. Furthermore, because medical record review is both technically demanding and time intensive, it is important to establish whether trained research staff with no clinical training can abstract medical records reliably.


Methods

We assessed the reliability of abstraction of medical record information in a sample of patients who underwent total knee replacement (TKR) at a referral center. An orthopedic surgeon instructed two research coordinators (RCs) in the abstraction of inpatient medical records and operative notes for patients undergoing primary TKR. The two RCs and the surgeon each independently reviewed 75 patients’ records and one RC reviewed the records twice. Agreement was assessed using the proportion of items on which reviewers agreed and the kappa statistic.


Results

The kappa for agreement between the surgeon and each RC ranged from 0.59 to 1 for one RC and 0.49 to 1 for the other; the percent agreement ranged from 82% to 100% for one RC and 70% to 100% for the other. The repeated abstractions by the same RC showed high intra-rater agreement, with kappas ranging from 0.66 to 1 and percent agreement ranging from 97% to 100%. Inter-rater agreement between the two RCs was moderate, with kappa ranging from 0.49 to 1 and percent agreement ranging from 76% to 100%.


Conclusions

The MRR method used in this study showed excellent reliability for abstraction of information that had low technical complexity and moderate to good reliability for information that had greater complexity. Overall, these findings support the use of non-surgeons to abstract surgical data from operative notes.



Background

Medical record review (MRR) is a commonly used method in clinical research to ascertain exposures (e.g. co-morbidities) or outcomes (e.g. complications) [1]. However, because medical records are meant to document care and are not designed as research tools, MRR poses several challenges in the research setting. Medical records may be incomplete, and the differential availability of information may result in misclassification and potential bias [1]. Medical information must be observed, recorded in the medical record, abstracted, coded, and analyzed; errors may occur at every step [2, 3].

Problems with the validity and reliability of MRR are generally recognized [4–7]. Inter-observer agreement can vary widely in the abstraction of medical records by physician reviewers [8]. Currently, there is no official standard for reporting the process used for MRR in clinical research, as there is for meta-analysis research, such as the QUOROM statement [9]. Various proposed strategies for improvement, such as abstraction monitoring and continuous abstractor training, appear to be successful [10–12]. Nevertheless, it is impossible to ensure perfect validity and reliability; therefore, these parameters should be reported in MRR studies to provide context for interpreting the results. Physician review is expensive, and consequently MRR studies are often carried out by researchers without medical training. The reliability of medical record abstraction by non-clinical personnel has received little study.

In this study, we evaluated the reproducibility of MRR in the context of studying risk factors for revision of total knee replacement (TKR). We assessed inter-rater reliability of MRR abstraction between a physician and two research coordinators (RCs) and between the two RCs; we also assessed the intra-rater (test-retest) reliability of a single RC.



Methods

This reliability study was conducted as part of a larger nested case–control study that examined risk factors for revision of TKR. The subjects of the study were 438 patients who received a primary TKR at a tertiary referral hospital between 1996 and 2009. Of these patients, 147 went on to have a revision TKR at the same institution or at a sister tertiary referral hospital. The remaining patients (N = 291) did not have revision TKR and served as controls. Controls were matched to the cases on surgery year and orthopedic surgeon. We developed an abstraction tool and used it to record medical record information on each subject. The tool included patient demographic information, medical history, social history, and prosthesis information. In particular, we abstracted details on the surgical procedures from the surgeons’ operative notes. The study was approved by the Partners HealthCare Human Subject Committee.

Research coordinator training

Because operative notes contain medical jargon and technical language, an orthopedic surgeon taught the abstraction method to two research coordinators who had no formal clinical training. RC1 is a college graduate with no advanced degree and two years of experience in clinical orthopedics research; RC2 attended college, holds no advanced degree, and has one year of experience in clinical orthopedics research. First, the RCs and the surgeon reviewed charts together to learn the language and approach. Subsequently, during the pilot phase, the RCs independently reviewed charts, and their results were reviewed by the surgeon. Training was complete when the surgeon deemed the reviews to be accurate.

Sub-study design

To test the reliability of this abstraction method and training process, we randomly selected 75 subjects from the larger study population. The sample size was chosen to ensure reasonable precision in the estimate of agreement statistics such as Cohen’s kappa. More specifically, given an a priori estimate of 75% agreement, a sample size of 75 provided a 95% confidence interval around that point estimate of 65% to 85% agreement. To ensure that the abstractors were blinded to the data and had no prior exposure to the medical records, this study was carried out prior to the full data abstraction for the nested case–control parent TKR study.
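The precision calculation above can be reproduced with a normal-approximation confidence interval for a proportion (a minimal sketch; the exact interval method the authors used is not stated):

```python
import math

# A priori estimate of 75% agreement among n = 75 charts.
p, n = 0.75, 75
se = math.sqrt(p * (1 - p) / n)          # standard error of a proportion
lo, hi = p - 1.96 * se, p + 1.96 * se    # normal-approximation 95% CI
print(f"95% CI: {lo:.2f} to {hi:.2f}")   # → 95% CI: 0.65 to 0.85
```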

Each patient’s operative note was reviewed four times: once by an orthopedic surgeon, once by one RC, and twice by another RC. This design permitted us to assess validity (agreement between the surgeon and each RC), inter-rater reliability (reproducibility between two RCs), and intra-rater reliability (reproducibility between the two abstractions by the same RC). Using the abstraction form developed for the parent TKR study, we created an abridged abstraction form to test validity and reliability (see Appendix). The form primarily addressed surgical techniques and bone deformities, as we were especially interested in the reliability of abstraction of the most technically sophisticated elements. Key words were appended to the form to guide the abstractor with the classification of data elements. The source of the information was Partners HealthCare’s Longitudinal Medical Record (LMR) system, which included radiological reports, pre-operative evaluation notes and operative notes.

Statistical analysis

We combined response categories to create a new variable for some questions in order to improve clinical interpretation. Notably, “Lateral Release Performed” was combined with “Lateral Release Type” into a single new category, which had the options of “No”; “Yes – Patellar Tracking”; “Yes – Tibial Femoral Alignment”; “Yes – Both”; and “Insufficient Information”, which incorporated “Not Documented” (see Appendix). In addition, the “Bone Deformity” section was also simplified. Rather than splitting the categories of “Alignment” and “Predominant Compartment” by 3 different sources of information, a single category of “Alignment” and a single category of “Predominant Compartment” were created by combining information from the various sources, i.e., “D1a Alignment”, “D2a Alignment” and “D3a Alignment” combined to form one “Alignment” category (see Appendix).
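The collapsing of the two lateral-release items into one variable can be expressed as a simple mapping (a hypothetical sketch; the function name and category strings are ours for illustration, not part of the study’s abstraction tool):

```python
def combine_lateral_release(performed, release_type):
    """Collapse 'Lateral Release Performed' and 'Lateral Release Type'
    into the single recoded variable described in the text."""
    if performed == "No":
        return "No"
    if performed == "Yes":
        return {
            "Patellar Tracking": "Yes - Patellar Tracking",
            "Tibial Femoral Alignment": "Yes - Tibial Femoral Alignment",
            "Both": "Yes - Both",
        }.get(release_type, "Insufficient Information")
    # "Not Documented" (and any other response) folds into "Insufficient Information"
    return "Insufficient Information"
```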

The raters were de-identified for the analysis to minimize bias. We created two-way tables for each pair of raters (six possible pairs) in each data category. To quantify intra- and inter-rater reliability, we calculated percent agreement and Cohen’s kappa coefficients with associated 95% confidence intervals based on the method described by Fleiss et al. [13]. Cohen’s kappa is a statistical measure of agreement calculated from observed vs. expected (chance) agreement frequencies [14]. The formula for kappa is as follows:

κ = (p_o − p_e) / (1 − p_e)

where p_o is the observed percent agreement and p_e is the expected (chance) percent agreement. Kappa typically falls between 0 and 1, with values closer to 0 indicating low agreement and values closer to 1 indicating high agreement (negative values are possible when observed agreement is worse than chance). While there is no standardized guideline for the kappa value that constitutes acceptable agreement, Landis and Koch recommend the categorization shown in Table 1 [15]:

Table 1 Categorization of different levels of Kappa by strength of agreement

Kappa is a useful statistical measure because it corrects for agreement that may arise by chance alone; however, the kappa statistic can be biased by the distribution of responses (see Discussion for further explanation). Therefore, we calculated both kappa and percent agreement for the intra-rater comparison (same RC twice) and the inter-rater comparisons (between each RC and the expert clinician, and between the two RCs). All statistical analyses were carried out using SAS v9.2 (SAS Institute Inc., Cary, NC) and R.
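The kappa computation itself can be sketched in a few lines (an illustration only; the study’s own analyses were run in SAS and R):

```python
import numpy as np

def cohens_kappa(table):
    """Cohen's kappa from a square agreement table, where table[i][j] is the
    number of subjects rated category i by rater A and category j by rater B."""
    t = np.asarray(table, dtype=float)
    n = t.sum()
    p_o = np.trace(t) / n                                 # observed agreement
    p_e = (t.sum(axis=1) * t.sum(axis=0)).sum() / n ** 2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Two raters classify 50 subjects into two categories:
print(cohens_kappa([[20, 5], [10, 15]]))  # ≈ 0.4 (moderate agreement)
```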


Results

To ensure that the random sample of patients for the reliability study was representative of the larger sample chosen for the parent TKR study, we compared the two samples. As shown in Table 2, the reliability sample (n = 75) was similar to the subjects from the rest of the parent population (n = 363) in terms of age at primary TKR surgery, gender, race, marital status, and the operating orthopedic surgeon. A higher proportion of patients in the reliability sample than in the rest of the parent population had a revision.

Table 2 Comparison of demographic information of patients selected for the reliability study versus that of all the patients from the risk factors for TKR revision study

In Table 3, we show the final categories and each reviewer’s tabulations. Inter-rater agreement between the RCs and the surgeon was very good overall, with kappa ranging from 0.49 to 1 and percent agreement from 70.4% to 100% (Table 4). In the cases of “Cement Fixation” for RC1 vs. RC2 and RC1 vs. RC1, the agreement was perfect, and “Yes” was selected for all patients; therefore, kappa was not calculable (Tables 3, 5 and 6). For RC1, there were moderate levels of agreement with the surgeon, with a kappa of 0.59 for arthroplasty approach type and a kappa of 0.66 for the predominant knee compartment (Table 4). The rest of the categories had substantial to perfect levels of agreement. RC2 had somewhat lower levels of kappa and percent agreement with the surgeon than RC1 (Table 4). The items for which RC2 had the highest levels of agreement with the surgeon’s evaluation were the same as those for which RC1 had high agreement with the surgeon: index knee, bilateral operation, lateral release type, and whether the posterior cruciate ligament (PCL) was recessed. RC2 had moderate agreement with the surgeon in the more technical categories of arthroplasty approach type, alignment of knee, and predominant compartment of disease, with kappas of 0.49, 0.53, and 0.53, respectively.

Table 3 MRR categories and reviewers’ tabulations
Table 4 Inter-rater agreement, surgeon vs. RC1 (1st abstraction) and RC2
Table 5 Inter-rater agreement, RC1 (1st abstraction) vs. RC2
Table 6 Intra-rater agreement, RC1 (1st abstraction) vs. RC1 (2nd abstraction)

We found that the inter-observer reliability between RC1 and RC2 was better than that between each of the RCs and the surgeon (Tables 4 and 5). The intra-rater agreement for RC1 was very good as demonstrated by kappas ranging from 0.66 to 1 and percent agreement from 97.3% to 100%. With the exception of index knee and arthroplasty approach type, there was perfect agreement between RC1’s first and second abstraction for all other variables (Table 6). Index knee had almost perfect agreement (98.6%). Arthroplasty approach type also had a high percent agreement (97.3%), and a substantial kappa of 0.66.


Discussion

We examined the validity and the intra- and inter-rater reliabilities of abstraction of operative notes in a study of patients who underwent TKR. The findings suggest that trained research staff without prior clinical knowledge and experience can abstract medical records reliably and accurately. Both the inter- and intra-rater reliability analyses showed almost perfect percent agreement, and kappa values ranged from moderate to almost perfect depending on the type of data category. Simple data elements—the knee on which the TKR was performed and whether both knees were operated on at the same time—had almost perfect agreement. On the other hand, complex categories that require interpretation of how the surgery was conducted, such as the type of arthroplasty approach or the knee deformity, had lower agreement. Our results were consistent with previous findings, which have shown that demographic data (e.g. gender, age) typically have higher kappa than narrative text data in which the abstractor looks for a key word (e.g. presence of a symptom), and that data requiring judgment have the lowest kappa [10, 16, 17]. Even for the most technical items, however, agreement between the RCs and the surgeon and between the two RCs was moderate to substantial.

One noteworthy aspect of the results is that certain categories had kappa values that seemed disproportionately low given the high percent agreement. This can be explained by the paradox of low Cohen’s kappa in the setting of high percent agreement—as can be seen for cement fixation and arthroplasty approach type. As seen in Table 3, nearly every patient was rated as having received a “Medial/Median Peripatellar” arthroplasty approach type. Consequently, the expected agreement is very high, and the formula for calculating kappa creates a large decrease in kappa for a relatively smaller decrease in percent agreement. As Kraemer wrote when she first reported this problem, a measurement method may have poor kappas simply because of the lack of variability in the population and not because of the intrinsic inaccuracy of the measurement method itself [18]. In essence, if the prevalence of a trait is very rare (or exceedingly common), then the expected agreement becomes so large that it is difficult to document reliability. Feinstein and Cicchetti further explored this issue and proposed that the kappa should be accompanied by additional information, such as percent agreement, to describe the degree to which a given kappa is biased [19, 20]. In this paper, we followed their recommendations and reported both kappa and percent agreement.
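The paradox is easy to reproduce numerically. In the hypothetical 2×2 table below (our numbers, not the study’s data), two raters agree on 91% of charts, yet kappa is only about 0.13 because almost every chart falls in one category:

```python
import numpy as np

# Hypothetical agreement table: both raters mark nearly every chart
# "Medial/Median Peripatellar", so the trait is exceedingly common.
t = np.array([[90.0, 5.0],
              [4.0, 1.0]])
n = t.sum()
p_o = np.trace(t) / n                        # 0.91 observed agreement
p_e = (t.sum(1) * t.sum(0)).sum() / n ** 2   # 0.896 expected by chance alone
kappa = (p_o - p_e) / (1 - p_e)
print(round(p_o, 2), round(kappa, 2))        # → 0.91 0.13
```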

In an analysis of the American College of Surgeons National Surgical Quality Improvement Program, Shiloach et al. comprehensively reported the inter-rater agreement for numerous chart abstraction categories, which provides a good basis of comparison for the inter-rater agreement documented in this study [12]. Shiloach et al. reported kappas for dichotomous variables ranging from fair (0.32) to almost perfect (0.93). The variables with the lowest kappas were do-not-resuscitate (DNR) status (0.32), history of angina (0.32), rest pain (0.38), and bleeding disorder (0.38). The percent agreement for these variables ranged from 94% to 99%, showing that, as in our study, low kappas may arise from high levels of chance agreement in studies of the reliability of medical record review [12].

This study had a few limitations in its design. First, we treated the surgeon’s MRR abstraction data as the “gold standard.” However, the abstractions of clinicians are not perfectly reliable [8]. Clinicians may introduce clinical judgment into the abstraction, potentially distorting results. On the other hand, research assistants are taught a standardized abstraction that is entirely objective and may be more reliable on that basis. Repeating this project with multiple surgeons and multiple research assistants would help clarify this issue. In addition, to more robustly measure reliability for all aspects of surgical information, more variables should be compared. Last, it is important to note that this study assumed that the information in the medical records was accurate and complete, which we could not assess.

The conclusions of any scientific study rely heavily on the assumption that the data collection process is both valid and reliable. In an effort to assess the quality of MRR studies, Gilbert et al. examined use of methodological features that may maximize validity and reliability. They identified eight possible strategies: proper training of abstractors, explicit case selection protocols, precisely defined variables, standardized abstraction forms, periodic review meetings to resolve problems, monitoring of abstractor performance, blinding chart reviewers to the hypothesis and group assignment, and testing inter-rater agreement [2]. Among 986 published studies reviewed, only 5% mentioned testing inter-rater reliability, and 0.4% reported the results of testing inter-rater reliability. Ten years later in a follow-up study, Worster et al. reported that inter-rater reliability was mentioned 22% of the time and tested 13% of the time [21]. Although these studies show some improvement in frequency of reported inter-rater reliability analysis, this remains an underreported (and perhaps underperformed) aspect of MRR research [22, 23]. We hope that our study will contribute to the increased reporting of the quality of data collected for clinical research. To the best of our knowledge, this is the first assessment of agreement between clinically trained and clinically-untrained medical record reviewers. We cannot be certain, however, whether the paucity of studies of this issue simply reflects failure of authors of reliability studies to report the clinical training of the reviewers, or whether the question has not been addressed. To date, research has mainly addressed the interrater reliability of clinicians vs. non-clinicians and researchers of various levels of clinical experience when evaluating patients prospectively [24, 25].


Conclusions

Obtaining research data via medical record review involves multiple steps, each of which can introduce errors. Therefore, research that involves MRR should provide reasonable assurance that the data are valid and reliable. In this study, we assessed the reliability of an MRR method for abstracting surgical information from TKR procedures. We found that the MRRs performed by research coordinators were reliable (inter- and intra-rater reliability) and valid (agreement with an orthopedic surgeon). Furthermore, our results were similar to those obtained from a nationwide MRR survey of patients undergoing surgery [12]. The findings of this study provide support for the reliability and validity of MRR in the setting of research on risk factors for revision of TKR.


Appendix

Reliability Study Primary TKR Chart Abstraction Tool

A. Administrative

A1. Chart Review Date: __________ A2. Chart Reviewer: __________

B. Patient Information

B1. MRN (last 4 digits): _______________________

B2. Index Knee: ☐1. Left ☐ 2. Right

B2b. Bilateral: ☐1. Yes ☐ 2. No

C. Surgery

C1. Arthroplasty Approach Type:

☐ 1. Medial/Median Peripatellar (> = 90% Primary)

☐ 2. Lateral Peripatellar (<1% Primary, even less Revision)

☐ 3. Subvastus/Midvastus (<5% Primary, 0% Revision)

☐ 4. Quadriceplasty (<1% Primary; <20% Revision)

☐ 5. Tibial Tubercle Osteotomy/TTO (<1% Primary; <5% Revision, if quadriceplasty fails)

☐ 6. Other (Lateral Peripatellar, Quadriceplasty, Tibial Tubercle Osteotomy/TTO)

☐ 9. Not Documented (if approach not stated, then Medial/Median is implied)

C2. Fixation

☐ 1. Cemented (cement sticker exists or mentioned in LMR/Big Board/OpNotes)

☐ 2. Cementless

☐ 9. Not Documented

C3a. Lateral Release Performed

☐ 0. No (if good/smooth patella traction, or good varus/valgus stability after trial components, extremely unusual in varus knee)

☐ 1. Yes (i.e. Release of: lateral retinaculum/capsule, iliotibial band; popliteus; lateral/collateral ligament (LCL); pie crust technique)

☐ 9. Not Documented

C3b. Lateral Release Type

☐ 1. Patellar Tracking (C3a = Yes: i.e. Release of: lateral retinaculum/capsule)

☐ 2. Tibial Femoral Alignment (C3a = Yes: Valgus, iliotibial band; popliteus; lateral/collateral ligament (LCL); pie crust technique)

☐ 3. Both (C3a = Yes)

☐ 7. N/A (C3a = No/Not Documented)

☐ 8. Insufficient Information

C4. Post-cruciate (PCL) Recession Performed (if performed, likely to be mentioned)

☐ 0. No (if stated that knee is balanced/stable in flexion, flexion & extension gaps are equal, no lift-off evidence, recessed back to the proposed tibial articular osteotomy)

☐ 1. Yes (tight flexion gap; positive lift-off test)

☐ 8. N/A (if Constraint is not CR)

☐ 9. Not Documented

D. Bone Deformity

D1. Pre-Operative Surgeon Visit

D1a. Alignment D1b. Predominant Compartment

☐ 1. Varus ☐ 1. Medial

☐ 2. Valgus ☐ 2. Lateral

☐ 3. Neutral ☐ 3. Even

☐ 8. Insufficient Information ☐ 8. Insufficient Information

☐ 9. Not Documented ☐ 9. Not Documented

D2. LMR Operative Note

D2a. Alignment

☐ 1. Varus (osteophytes on medial side)

☐ 2. Valgus (anticipated if Lateral Release Performed = Yes; lateral wear in general, i.e. deficiency in lateral femoral condyle; drilling holes in lateral tibial plateau)

☐ 3. Neutral

☐ 8. Insufficient Information

☐ 9. Not Documented

D2b. Predominant Compartment

☐ 1. Medial

☐ 2. Lateral

☐ 3. Even

☐ 8. Insufficient Information

☐ 9. Not Documented

D3. X-Ray

D3a. Alignment D3b. Predominant Compartment

☐ 1. Varus ☐ 1. Medial

☐ 2. Valgus ☐ 2. Lateral

☐ 3. Neutral ☐ 3. Even

☐ 8. Insufficient Information ☐ 8. Insufficient Information

☐ 9. Not Documented ☐ 9. Not Documented


References

1. Worster A, Haines T: Advanced statistics: understanding medical record review (MRR) studies. Acad Emerg Med. 2004, 11: 187-192.

2. Gilbert EH, Lowenstein SR, Koziol-McLain J, Barta DC, Steiner J: Chart reviews in emergency medicine research: where are the methods? Ann Emerg Med. 1996, 27: 305-308.

3. Eder C, Fullerton J, Benroth R, Lindsay SP: Pragmatic strategies that enhance the reliability of data abstracted from medical records. Appl Nurs Res. 2005, 18: 50-54.

4. Allison JJ, Wall TC, Spettell CM, Calhoun J, Fargason CA, Kobylinski RW, Farmer R, Kiefe C: The art and science of chart review. Jt Comm J Qual Improv. 2000, 26: 115-136.

5. Luck J, Peabody JW, Dresselhaus TR, Lee M, Glassman P: How well does chart abstraction measure quality? A prospective comparison of standardized patients with the medical record. Am J Med. 2000, 108: 642-649.

6. Kvale JN, Gillanders WR, Buss TF, Gemmel D, Crenesse A, Griffiths-Marnejon J: Agreement between telephone survey and medical record data for the elderly patient. Fam Pract Res J. 1994, 14: 29-39.

7. Stange KC, Zyzanski SJ, Smith TF, Kelly R, Langa DM, Flocke SA, Jaén CR: How valid are medical records and patient questionnaires for physician profiling and health services research? A comparison with direct observation of patient visits. Med Care. 1998, 36: 851-867.

8. Localio AR, Weaver SL, Landis JR, Lawthers AG, Brennan TA, Hebert L, Sharp TJ: Identifying adverse events caused by medical care: degree of physician agreement in a retrospective chart review. Ann Intern Med. 1996, 125: 457-464.

9. Moher D, Cook DJ, Eastwood S, Olkin I, Rennie D, Stroup DF: Improving the quality of reports of meta-analyses of randomised controlled trials: the QUOROM statement. Lancet. 1999, 354: 1896-1900.

10. Yawn BP, Wollan P: Interrater reliability: completing the methods description in medical records review studies. Am J Epidemiol. 2005, 161: 974-977.

11. Liddy C, Wiens M, Hogg W: Methods to achieve high interrater reliability in data collection from primary care medical records. Ann Fam Med. 2011, 9: 57-62.

12. Shiloach M, Frencher SK, Steeger JE, Rowell KS, Bartzokis K, Tomeh MG, Richards KE, Ko CY, Hall BL: Toward robust information: data quality and inter-rater reliability in the American College of Surgeons National Surgical Quality Improvement Program. J Am Coll Surg. 2010, 210: 6-16.

13. Fleiss JL, Cohen J, Everitt BS: Large sample standard errors of kappa and weighted kappa. Psychol Bull. 1969, 72: 323-327.

14. Cohen J: A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960, 20: 37-46.

15. Landis JR, Koch GG: The measurement of observer agreement for categorical data. Biometrics. 1977, 33: 159-174.

16. Beard CM, Yunginger JW, Reed CE, O'Connell EJ, Silverstein MD: Interobserver variability in medical record review: an epidemiological study of asthma. J Clin Epidemiol. 1992, 45: 1013-1020.

17. Engel L, Henderson C, Fergenbaum J, Colantonio A: Medical record review conduction model for improving interrater reliability of abstracting medical-related information. Eval Health Prof. 2009, 32: 281-298.

18. Kraemer HC: Ramifications of a population model for kappa as a coefficient of reliability. Psychometrika. 1979, 44: 461-472.

19. Feinstein AR, Cicchetti DV: High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol. 1990, 43: 543-549.

20. Cicchetti DV, Feinstein AR: High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol. 1990, 43: 551-558.

21. Worster A, Bledsoe RD, Cleve P, Fernandes CM, Upadhye S, Eva K: Reassessing the methods of medical record review studies in emergency medicine research. Ann Emerg Med. 2005, 45: 448-451.

22. Gow RM, Barrowman NJ, Lai L, Moher D: A review of five cardiology journals found that observer variability of measured variables was infrequently reported. J Clin Epidemiol. 2008, 61: 394-401.

23. Badcock D, Kelly AM, Kerr D, Reade T: The quality of medical record review studies in the international emergency medicine literature. Ann Emerg Med. 2005, 45: 444-447.

24. Cruz CO, Meshberg EB, Shofer FS, McCusker CM, Chang AM, Hollander JE: Interrater reliability and accuracy of clinicians and trained research assistants performing prospective data collection in emergency department patients with potential acute coronary syndrome. Ann Emerg Med. 2009, 54: 1-7.

25. Rowley G, Fielding K: Reliability and accuracy of the Glasgow Coma Scale with experienced and inexperienced users. Lancet. 1991, 337: 535-538.



Acknowledgements

None of the authors report any relevant financial conflict of interest. We thank Dr. William Reichmann for his help with the study design and review of the manuscript.

Funding sources

Department of Orthopedic Surgery, Brigham and Women’s Hospital; NIH/NIAMS T32 AR 055885, K24 AR 057827, P60 AR 47782.

Author information



Corresponding author

Correspondence to Jeffrey N Katz.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

MM drafted the manuscript, collected data, and performed the statistical analysis. JC helped with drafting the manuscript and statistical analysis. SL collected data for the project and participated in the design. EL participated in the conception of the project and its design. JK conceived the project and its design and helped draft the manuscript. All authors read and approved the manuscript.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.



Cite this article

Mi, M.Y., Collins, J.E., Lerner, V. et al. Reliability of medical record abstraction by non-physicians for orthopedic research. BMC Musculoskelet Disord 14, 181 (2013).



Keywords

  • Medical record review
  • Reliability
  • Kappa statistic
  • Total knee replacement