Binary Tönnis classification: simplified modification demonstrates better inter- and intra-observer reliability as well as agreement in surgical management of hip pathology
BMC Musculoskeletal Disorders volume 21, Article number: 502 (2020)
The traditional Tönnis Classification System has inherent drawbacks as it is vulnerable to the subjectivity of a four-grade system. A two-grade classification could potentially be more reliable. The purpose of this study is to (1) compare the inter-observer and intra-observer reliability of the traditional Tönnis Classification System and a simplified Binary Tönnis Classification System for hip osteoarthritis and to (2) evaluate the clinical applicability of both systems. Our hypothesis is that the proposed Binary Tönnis Classification System will have better reliability and agreement for surgical decision-making.
Forty consecutive patients were selected to participate in this study. Patients were included in this study if they were between 35 and 60 years old. Patients were excluded if they had prior hip surgeries or conditions. All radiographs were randomized and blinded by a non-observer. Five fellowship-trained hip surgeons from a single center, in a fully crossed design, analyzed and graded all the radiographs utilizing the traditional Tönnis Classification System and the proposed Binary Tönnis Classification System. Intra- and inter-observer reliability values for both the systems were calculated using the Cohen’s κ coefficient. A multi-rater κ was calculated using the weighted Fleiss method.
The study sample contained 40 anteroposterior hip radiographs. For the traditional Tönnis Classification System, the weighted κ showed a fair inter-observer reliability (κ = 0.474) and excellent intra-observer reliability (κ mean = 0.866). For the proposed Binary Tönnis Classification System, both inter-observer and intra-observer reliability demonstrated excellent values (κ = 0.858 and 0.928, respectively). On average, the Binary Tönnis Classification System correctly captured 87% of cases. When the traditional Tönnis Classification System was dichotomized, the capture rate was 84%.
A simplified binary Tönnis Classification System demonstrates better reliability and clinical implementation than the traditional Tönnis Classification System.
For hip joint pathologies, two major operative treatments exist: hip preservation and hip replacement. The presence of osteoarthritis is a critical factor in a surgeon’s decision between the two options. Efforts to preserve the hip joint are hindered by the presence of osteoarthritis. Therefore, a reliable evaluation of the degree of osteoarthritis is necessary for optimizing patient outcomes. Radiographic assessment provides essential information concerning the diagnosis and treatment of osteoarthritis. The traditional Tönnis Classification System is commonly used to classify the severity of osteoarthritis. The literature generally supports hip preservation for hips graded as Tönnis 0 and 1, and replacement for hips graded Tönnis 2 and 3 [2, 4]. However, despite its extensive use in clinical practice and medical literature, the traditional Tönnis Classification System has some drawbacks. First, several studies have reported questionable inter-observer and intra-observer reliability [3, 6, 7]. A cardinal drawback of the traditional Tönnis Classification System is its subjectivity. It has been criticized for being unclear and having overlapping parameters. Yet another difficulty may arise when parameters from different grades are found in a single radiograph, e.g. moderate loss of head sphericity and slight narrowing of the joint space, which pertain to grades 2 and 1, respectively. The pitfalls associated with the traditional Tönnis Classification System reach beyond the boundaries of orthopedics and may have multidisciplinary manifestations that impair the cross talk between radiologists, general practitioners, and rheumatologists. Similar to the traditional Tönnis Classification System, the Garden Classification for femoral neck fractures also demonstrated poor reliability, derived from the challenging radiographic distinctions between the grades.
Based on the clinical relevance of the Garden Classification, a simplified binary classification was developed that demonstrated higher reliability compared to the original classification [8, 9, 10, 11]. Given the binary nature of available surgical interventions (i.e. hip preservation versus reconstruction) derived from the traditional Tönnis Classification System, a two-level classification could be more reliable and reproducible without compromising the clinical relevance. Taking into consideration Occam’s Razor, which states that the simplest answer is typically the correct answer, a two-level classification for surgical treatment options seems most appropriate. The goal of this study is to validate a simplified Binary Tönnis Classification System to reduce excessive complexity and better capture the diagnostic essence of having a certain classification. Specifically, this study (1) compares the inter-observer and intra-observer reliability of the traditional Tönnis Classification System and a new simplified Binary Tönnis Classification System for hip osteoarthritis and (2) evaluates the clinical applicability of both systems, notably their agreement with the clinician’s decision for either preservation or replacement. Our hypothesis is that a binary system will have better reliability and agreement for surgical decision-making.
Patient selection and data acquisition
Forty consecutive patients who presented to the clinic for hip pain between February 2018 and March 2018 were selected to participate in this study. Patients were included in the study if they were between 35 and 60 years old. Patients were excluded if they had prior ipsilateral or contralateral surgeries or had prior hip conditions such as Legg-Calve-Perthes disease, slipped capital femoral epiphysis, pigmented villonodular synovitis, or ankylosing spondylitis. All patients underwent operative management due to radiographic FAI, osteoarthritis, and/or symptoms of hip pain that were unresponsive to conservative treatment and significantly limited activities. Demographic data, such as sex, laterality, and age at surgery, were collected for all patients.
All patients underwent routine radiographic imaging at their preoperative clinic visit. A standard anteroposterior supine radiograph was used for this study to grade the severity of osteoarthritis, the protocol for which is detailed by Clohisy et al. 
This study was approved by the Institutional Review Board and did not receive any funding. All patients participated in the American Hip Institute Hip Preservation Registry through written consent. While the present study represents a unique analysis, data on some patients in this study may have been reported in other studies.
The traditional Tönnis Classification (Table 1) and the simplified Binary Tönnis Classification systems (Table 2) were used in this study. The simplified Binary Tönnis Classification System was fashioned to reflect the primary indications that our institution uses with the traditional Tönnis Classification System: hip preservation or reconstruction.
Inter-rater reliability and agreement with surgical treatment
Five fellowship-trained hip surgeons from a single center were the observers for this study. Three observers were hip preservation and reconstruction fellows and two observers were attendings who had trained in both hip preservation and reconstruction. Radiographic grading of hip OA is part of the observers’ daily practice. However, to minimize inter-observer discrepancies, both the traditional Tönnis Classification System and the Binary Tönnis Classification System were provided on each individual Excel sheet that was utilized to grade the radiographs. This study was a fully crossed study in which all observers read the same set of radiographs. All images were uploaded to the digital imaging system and retrieved by a non-observer who randomized and blinded the films (Fig. 1).
The five observers independently assessed the series of radiographs. Observers classified the radiographs utilizing the traditional Tönnis Classification System and rated another set of randomized radiographs with the Binary Tönnis Classification System after at least a week had transpired. Images were randomized again, and observers repeated their respective assessment at least 3 weeks later.
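The randomization and blinding step lends itself to a short illustration. The sketch below is only an assumption about how a non-observer might prepare each reading session, not the procedure actually used in this study; the function name `blind_and_shuffle`, the `R001`-style codes, the `hip_01`-style image ids, and the seed values are all hypothetical.

```python
import random

def blind_and_shuffle(image_ids, seed):
    """Return (reading_order, key) for one reading session.

    reading_order: blinded codes in the order shown to the observers.
    key: maps each blinded code back to the original image id; in this
         sketch it would be kept by the non-observer only.
    """
    rng = random.Random(seed)  # seeded so the non-observer can reproduce the order
    shuffled = list(image_ids)
    rng.shuffle(shuffled)
    key = {f"R{i:03d}": original for i, original in enumerate(shuffled, start=1)}
    return list(key), key

# Hypothetical ids for 40 radiographs; a new seed per session re-randomizes the order.
images = [f"hip_{n:02d}" for n in range(1, 41)]
first_order, first_key = blind_and_shuffle(images, seed=1)
second_order, second_key = blind_and_shuffle(images, seed=2)
```

Keeping the key separate from the reading order is what keeps the observers blinded, and using a different seed for the repeat session mirrors the re-randomization described above.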
Statistical analysis was conducted in R (R Foundation for Statistical Computing, version 3.6.0) and Microsoft Excel (Redmond, WA). Demographic data were separated and analyzed for patients who underwent arthroscopy or THA. To analyze demographic data, the Chi-squared and Fisher’s Exact tests were utilized to evaluate differences in the proportions of categorical data between the arthroscopy and THA groups. For continuous variables, the F-test was performed to evaluate variance, and the Shapiro-Wilk test was utilized to evaluate distribution. A p > 0.05 indicated equal variance and normal distribution, respectively. The independent-samples t-test was performed for unpaired data comparisons between both groups. Significance was set to 0.05.
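As an illustration of the categorical comparison, the following is a minimal Python sketch of a two-sided Fisher’s Exact test for a 2×2 table. The study itself used R, so this is not the authors’ code. The cell counts in the example are hypothetical: only the margins (15 males, 25 females; 19 preservation, 21 replacement) come from this article, and the split within each cell is assumed.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum every table that is no more likely than the observed one
    # (small tolerance guards against floating-point ties).
    total = sum(p for p in map(prob, range(lowest, highest + 1))
                if p <= p_observed * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical split of sex by treatment group (cell counts are assumed):
#            preservation  replacement
# male             7            8
# female          12           13
p_value = fisher_exact_two_sided(7, 8, 12, 13)
print(f"two-sided p = {p_value:.3f}")
```

With the significance level of 0.05 stated above, a p-value above that threshold would be read as no detectable difference in proportions between the two groups.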
Intra-observer and inter-observer reliability were calculated using the Cohen’s κ coefficient for the traditional Tönnis Classification System and the simplified Binary Tönnis Classification System. Further, the traditional Tönnis Classification System was dichotomized (0 and 1 vs. 2 and 3) and the Cohen’s κ coefficient was calculated. The multi-rater κ was calculated using the weighted Fleiss method. The degree of agreement based on the κ coefficient was interpreted using the ranges recommended by Landis and Koch: a κ value of 0–0.2 indicated slight agreement, 0.2–0.4 fair, 0.4–0.6 moderate, 0.6–0.8 substantial, and greater than 0.8 near perfect.
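To make these calculations concrete, here is a minimal sketch of an unweighted Cohen’s κ for two readings, the dichotomization of the traditional grades, and the Landis and Koch labels. It is an illustration under stated assumptions rather than the analysis code of this study: the weighting scheme used by the authors is not reproduced, the grades in the example are invented, a value falling exactly on a boundary is assigned to the lower range, and the "poor" label for negative κ is taken from Landis and Koch rather than from the ranges quoted above.

```python
from collections import Counter

def cohen_kappa(first, second):
    """Unweighted Cohen's kappa for two equal-length lists of grades."""
    if len(first) != len(second) or not first:
        raise ValueError("both readings must be non-empty and of equal length")
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    counts_first, counts_second = Counter(first), Counter(second)
    # Chance agreement from each reading's marginal grade frequencies.
    expected = sum(counts_first[g] * counts_second[g] for g in counts_first) / (n * n)
    if expected == 1.0:
        return 1.0  # every hip received the same single grade in both readings
    return (observed - expected) / (1.0 - expected)

def dichotomize(grade):
    """Collapse a traditional Tönnis grade: 0 and 1 -> 0, 2 and 3 -> 1."""
    return 0 if grade <= 1 else 1

def landis_koch(kappa):
    """Label a kappa value; boundary values go to the lower range (assumption)."""
    if kappa < 0:
        return "poor"
    for upper, label in ((0.2, "slight"), (0.4, "fair"),
                         (0.6, "moderate"), (0.8, "substantial")):
        if kappa <= upper:
            return label
    return "near perfect"

# Hypothetical grades from two readers for ten hips (not study data).
reader_a = [0, 1, 1, 2, 2, 3, 0, 1, 2, 3]
reader_b = [0, 0, 1, 2, 3, 3, 0, 2, 2, 3]

kappa_four = cohen_kappa(reader_a, reader_b)
kappa_two = cohen_kappa([dichotomize(g) for g in reader_a],
                        [dichotomize(g) for g in reader_b])
print(round(kappa_four, 3), round(kappa_two, 3))
```

In this toy data, two of the three disagreements occur between adjacent grades on the same side of the cut, so they disappear after dichotomization and the two-grade κ is higher than the four-grade κ.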
The traditional Tönnis Classification System and the simplified Binary Tönnis Classification System were assessed for agreement with the surgical treatment received by the patient (either hip preservation or hip replacement).
The study sample contained 40 anteroposterior hip radiographs, 19 from patients who received hip preservation and 21 from patients who received hip replacement. There were 15 males and 25 females (age 35.05–59.25 years). The demographics of the overall group and subgroups are presented in Table 3.
As calculated by the weighted κ, the traditional Tönnis Classification System showed fair inter-observer reliability (κ = 0.474) and excellent intra-observer reliability (κ mean = 0.866, range = 0.780–0.907).
The inter-observer and intra-observer reliability improved with the simplified Binary Tönnis Classification System. The inter-observer reliability was κ = 0.858 and the intra-observer reliability was κ mean = 0.928 (range = 0.892–0.948). Both inter-observer and intra-observer reliability were deemed excellent (Table 4).
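Because the inter-observer values summarize agreement among five readers at once, a multi-rater statistic is needed. The sketch below shows a plain, unweighted Fleiss’ κ as an illustration of the idea. It is an assumption-laden simplification: the study reports a weighted Fleiss method, which is not reproduced here, and the ratings in the example are invented.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Unweighted Fleiss' kappa.

    ratings: one list per radiograph holding the category chosen by each reader.
    Every radiograph must be rated by the same number of readers (at least two).
    """
    if not ratings:
        raise ValueError("at least one rated radiograph is required")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("each radiograph needs the same number (>= 2) of readers")
    n_subjects = len(ratings)
    totals = Counter()
    agreement_sum = 0.0
    for row in ratings:
        counts = Counter(row)
        totals.update(counts)
        # Proportion of agreeing reader pairs for this radiograph.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))
    p_bar = agreement_sum / n_subjects
    # Chance agreement from the overall share of each category.
    p_e = sum((t / (n_subjects * n_raters)) ** 2 for t in totals.values())
    if p_e == 1.0:
        return 1.0  # all readers used one category for every radiograph
    return (p_bar - p_e) / (1.0 - p_e)

# Hypothetical binary grades from five readers for six hips (not study data).
binary_ratings = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
]
print(round(fleiss_kappa(binary_ratings), 3))
```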
The Tönnis grading based on both systems and their agreement with the ultimate surgical management were calculated and are represented in Table 5. On average, the simplified Binary Tönnis Classification System correctly captured 87% of cases. When the traditional Tönnis Classification System was dichotomized (0 and 1 as hip preservation and 2 and 3 as hip replacement), the capture rate was 84%. The confusion matrices for the capture rates are depicted in Tables 6 and 7.
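The capture rate is the share of hips for which the treatment implied by the grade matches the treatment actually received. The following sketch illustrates that calculation and the accompanying confusion counts for a single reader. The grades and treatments are invented, the function names are hypothetical, and the reported figures are averages across readers, which this example does not reproduce.

```python
from collections import Counter

def implied_treatment(tonnis_grade):
    """Dichotomize a traditional grade: 0 and 1 -> preservation, 2 and 3 -> replacement."""
    return "preservation" if tonnis_grade <= 1 else "replacement"

def capture_rate(implied, received):
    """Return (rate, confusion) for paired implied and received treatments.

    confusion is keyed by (implied, received) pairs.
    """
    if len(implied) != len(received) or not implied:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(implied, received))
    rate = sum(i == r for i, r in pairs) / len(pairs)
    return rate, Counter(pairs)

# Hypothetical single-reader grades and treatments for ten hips (not study data).
grades = [0, 1, 2, 3, 1, 2, 0, 3, 2, 1]
received = ["preservation", "preservation", "replacement", "replacement",
            "replacement", "replacement", "preservation", "replacement",
            "preservation", "preservation"]

rate, confusion = capture_rate([implied_treatment(g) for g in grades], received)
print(rate)
for (implied, actual), count in sorted(confusion.items()):
    print(f"implied {implied:12s} received {actual:12s} n={count}")
```

The off-diagonal counts are the hips for which the radiographic grade and the performed procedure disagree.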
The aim of this study was to validate a simplified Binary Tönnis Classification System. In this study, 40 radiographs of consecutive patients were analyzed by five fellowship-trained orthopedic surgeons. Overall, the Binary Tönnis Classification System showed better inter-observer and intra-observer reliability and demonstrated a higher agreement rate with the ultimate surgical treatment, as recommended by the treating surgeon, compared to the traditional Tönnis Classification System.
In their study, Clohisy et al. (Table 8) evaluated the ability of hip specialists to reliably indicate the correct diagnosis based on plain radiographs alone. Five hip specialists and one fellow performed a blinded radiographic review of 25 hips with developmental dysplasia, 27 hips with femoroacetabular impingement, and 25 control hips. The readers assessed a variety of radiographic parameters including osteoarthritis using the traditional Tönnis Classification System. The combined κ values for intra- and inter-observer reliability for the traditional Tönnis Classification System were 0.60 (95% CI: 0.54–0.66) and 0.59, respectively. Furthermore, Steppacher et al. had two readers assess the Tönnis grade for a set of 50 radiographs illustrating dysplastic hips. The reported κ for intra-observer reliability ranged from 0.73 to 0.74. The Fleiss κ for inter-observer reliability was 0.74. Clohisy et al. attributed the difference between their results and Steppacher’s to their inclusion of a non-dysplastic cohort, in contrast to a dysplastic-only cohort in Steppacher’s study. The higher radiographic variability may have contributed to a decrease in reliability, especially in cases with no or mild arthritis. Troelsen et al. aimed to investigate the variability of diagnostic assessment of the hip joint. In their study, four observers independently assessed the level of osteoarthritis in 25 radiographs. Troelsen et al. dichotomized the Tönnis grades and assessed the inter-observer reliability of both the four-grade classification and its dichotomized version, as well as their agreement with CT scan. The κ for inter-observer agreement was 0.54 for the Tönnis classification and 0.66 for the dichotomized version. Furthermore, the observed agreement with the CT scan was 70% for the traditional Tönnis Classification System and 88% for the dichotomized alternative. In the present study, the κ values for intra- and inter-observer reliability for the traditional Tönnis Classification System were 0.87 and 0.47, respectively.
In contrast to the evidence reported for the traditional Tönnis Classification System, the simplified Binary Tönnis Classification System demonstrated excellent inter- and intra-observer κ (0.86 and 0.93, respectively). Additionally, this study supports Troelsen’s findings from dichotomizing the traditional Tönnis Classification System. Adopting a true binary classification would better serve the clinician, as it eliminates the need for a preliminary four-level classification of low reliability that requires further dichotomization to determine treatment.
The second step in validating the simplified Binary Tönnis Classification System was to assess its reliability in indicating the surgeon-recommended treatment. Valera et al. evaluated the reliability of the traditional Tönnis Classification System as a reference for hip preservation. Three orthopedic surgeons examined 117 hip x-rays for hip joint osteoarthritis according to the traditional Tönnis Classification System. The κ values for inter-observer reliability were slight or fair (0.173–0.379) and the κ values for intra-observer reliability were fair (0.364–0.379). Variance in classifying low-grade osteoarthritis was the major cause of disagreement between observers. In contrast, experience did not play a significant role in grading reproducibility. The authors concluded that the traditional Tönnis Classification System is a poor method of assessing early hip osteoarthritis and that routine use in clinical decision-making for preservative surgery should be reconsidered. In this study, the traditional Tönnis Classification System correctly overlapped with actual surgical treatment in 85.2% of cases. The simplified Binary Tönnis Classification System had a higher overlap, correctly capturing 86.5% of the cases. While the binary classification did show a slightly better correlation with the indicated treatment, it should be emphasized that radiographic evaluation is only part of the overall patient assessment, and thus a discrepancy between both classification systems and the treatment actually performed should be expected.
In summary, the simplified Binary Tönnis Classification System addresses the drawbacks of the traditional Tönnis Classification System without compromising clinical relevance. Adoption of a binary system would allow for more consistent data collection, thus improving the quality of studies. Practically for the clinician, a two-grade classification is more appropriate for a two-way treatment.
The major limitation of this study stems from its retrospective nature. We minimized this effect by blinding the investigators to any identifier, including name and treatment. In addition, we excluded patients who were treated contralaterally, which could bias grading. Also, the readers in this study were all surgeons. Better generalizability might have been achieved with the inclusion of multidisciplinary readers (e.g. radiologists). Furthermore, whereas the actual procedure performed by the senior surgeon, either arthroscopy or arthroplasty, was indicated based on the overall patient assessment, the assigned procedures in this study were based exclusively on the radiographic classifications. This single-blinded design may have introduced a bias to the study. Despite the effort to minimize selection bias by choosing a consecutive series of patients, the resulting cohort was fairly homogeneous in terms of demographic characteristics, which by itself may limit the generalizability of the results. Last, patients without osteoarthritis, who traditionally were classified as Tönnis 0, no longer have a distinct grade according to the binary classification. This may potentially lead to a lower threshold for indicating surgery. However, since hip arthroscopy is generally indicated based on intra-articular mechanical impairments such as FAI and labral tears, osteoarthritis is normally considered a contraindication for such preservative measures.
A simplified Binary Tönnis Classification System demonstrates better reliability and clinical implementation than the Traditional Tönnis Classification System.
Availability of data and materials
The datasets generated and analyzed during the current study are not publicly available due to Health Insurance Portability and Accountability Act (HIPAA) regulations.
Valera M, Ibañez N, Sancho R, Tey M. Reliability of Tönnis classification in early hip arthritis: a useless reference for hip-preserving surgery. Arch Orthop Trauma Surg. 2016;136(1):27–33.
Domb BG, Gui C, Lodhia P. How much arthritis is too much for hip arthroscopy: a systematic review. Arthroscopy. 2015;31(3):520–9.
Clohisy JC, Carlisle JC, Trousdale R, Kim Y-J, Beaule PE, Morgan P, et al. Radiographic evaluation of the hip has limited reliability. Clin Orthop. 2009;467(3):666–75.
Troelsen A, Elmengaard B, Soballe K. Medium-term outcome of periacetabular osteotomy and predictors of conversion to total hip replacement. J Bone Joint Surg Am. 2009;91(9):2169–79.
Kovalenko B, Bremjit P, Fernando N. Classifications in brief: Tönnis classification of hip osteoarthritis. Clin Orthop. 2018;476(8):1680–4.
Steppacher SD, Tannast M, Ganz R, Siebenrock KA. Mean 20-year followup of Bernese periacetabular osteotomy. Clin Orthop. 2008;466(7):1633–44.
Troelsen A, Rømer L, Kring S, Elmengaard B, Søballe K. Assessment of hip dysplasia and osteoarthritis: variability of different methods. Acta Radiol. 2010;51(2):187–93.
Van Embden D, Rhemrev SJ, Genelin F, Meylaerts SAG, Roukema GR. The reliability of a simplified Garden classification for intracapsular hip fractures. Orthop Traumatol Surg Res. 2012;98(4):405–8.
Beimers L, Kreder HJ, Berry GK, Stephen DJG, Schemitsch EH, McKee MD, et al. Subcapital hip fractures: the Garden classification should be replaced, not collapsed. Can J Surg. 2002;45(6):411–4.
Parker MJ. Garden grading of intracapsular fractures: meaningful or misleading? Injury. 1993;24(4):241–2.
Parker MJ. Prediction of fracture union after internal fixation of intracapsular femoral neck fractures. Injury. 1994;25:SB3–6.
Foussias G, Remington G. Negative symptoms in schizophrenia: avolition and Occam’s razor. Schizophr Bull. 2008;36(2):359–69.
Clohisy JC, Carlisle JC, Beaulé PE, Kim Y-J, Trousdale RT, Sierra RJ, et al. A systematic approach to the plain radiographic evaluation of the young adult hip. J Bone Joint Surg Am. 2008;90(Suppl 4):47–66.
Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–74.
No funding was received for this study.
Ethics approval and consent to participate
This study was approved by the Institutional Review Board of Advocate Health Care. All patients participated in the American Hip Institute Hip Preservation Registry through written consent. While the present study represents a unique analysis, data on some patients in this study may have been reported in other studies.
Consent for publication
These authors report the following past and present conflicts of interest:
Dr. Domb has had ownership interests in Hinsdale Orthopaedics, the American Hip Institute, SCD#3, North Shore Surgical Suites, and Munster Specialty Surgery Center; has received research support from Arthrex, ATI, the Kauffman Foundation, Stryker, and Pacira Pharmaceuticals; has received consulting fees from Adventist Hinsdale Hospital, Arthrex, MAKO Surgical, Medacta, Pacira Pharmaceuticals, and Stryker; has received educational support from Arthrex, Breg, and Medwest; has received speaking fees from Arthrex and Pacira Pharmaceuticals; and receives royalties from Arthrex, DJO Global, MAKO Surgical, Stryker, and Orthomerica. Dr. Domb is the Medical Director of Hip Preservation at St. Alexius Medical Center and a board member for the American Hip Institute Research Foundation, AANA Learning Center Committee, the Journal of Hip Preservation Surgery, and the Journal of Arthroscopy. The American Hip Institute Research Foundation funds research and is where our study was performed.
Dr. Lall reports grants, personal fees and non-financial support from Arthrex, non-financial support from Iroko, non-financial support from Medwest, non-financial support from Smith & Nephew, grants and non-financial support from Stryker, non-financial support from Vericel, non-financial support from Zimmer Biomet, personal fees from Graymont Medical, outside the submitted work.
Dr. Shapira reports food/beverage and travel/lodging support from Arthrex, Stryker, and Smith & Nephew.
Jeffrey W. Chen has no disclosures to report.
Rishika Bheem has no disclosures to report.
Dr. Rosinsky reports food/beverage and travel/lodging support from Arthrex, Stryker, and Smith & Nephew.
Dr. Maldonado reports food/beverage and travel/lodging support from Arthrex, Stryker, and Smith & Nephew. Dr. Maldonado is also a board member of the Journal of Arthroscopy.
Cite this article
Shapira, J., Chen, J.W., Bheem, R. et al. Binary Tönnis classification: simplified modification demonstrates better inter- and intra-observer reliability as well as agreement in surgical management of hip pathology. BMC Musculoskelet Disord 21, 502 (2020). https://doi.org/10.1186/s12891-020-03520-x
- Tönnis classification
- Hip osteoarthritis
- Total hip arthroplasty
- Hip arthroscopy
- Hip arthroplasty