The primary aim of this study was to examine the accuracy of the AO/OTA classification of distal radial fractures as used in the SFR. Accuracy, defined as agreement between the classification in the SFR and the gold standard classification, was moderate both for the AO/OTA group including the A2 subgroups (kappa 0.41) and for the AO/OTA type (kappa 0.48).
The purpose of the study was to assess the accuracy of the classification of distal radial fractures in adults as carried out in clinical practice in the SFR and to examine whether the majority of disagreements were between related fracture groups. The present study differs from previous studies of the validity of classification of DRFs in that it compares a consensus classification (the gold standard classification) with the classification made in everyday clinical practice by clinicians of varying experience (the classification in the SFR). Despite this, the agreement between the classification in the SFR and the gold standard classification is in line with the inter-observer kappa values in previous studies. Weaver et al. reported kappa values of 0.45 and 0.24 (for types and groups respectively), Yinjie et al. a value of 0.47 for groups and Plant et al. values of 0.56 and 0.29 (for types and groups respectively), all in the same range as in the present study [9, 11, 24]. In the present study, the inter-observer reliability was moderate (kappa 0.48) to substantial (kappa 0.76) for the AO/OTA type, but it dropped to moderate (kappa 0.48) to fair (kappa 0.22) at the group/subgroup level. The classification of DRFs has been shown to be more difficult than that of other end-segment fractures in the AO/OTA classification [21]. The low inter- and intra-observer agreement in the classification of DRFs is not exclusive to the AO/OTA system; the classification systems of Frykman, Older, Fernandez and Melone have also shown low inter- and intra-observer agreement [8, 9, 11].
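The verbal labels used above ("fair", "moderate", "substantial") follow the conventional Landis and Koch interpretation bands for kappa. As a minimal sketch (the function name is ours, not from the study), the mapping can be written as:

```python
def agreement_label(kappa):
    """Conventional Landis & Koch (1977) interpretation bands for kappa."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"  # kappa cannot exceed 1.00
```

Applied to the values above: 0.22 maps to "fair", 0.41 and 0.48 to "moderate" and 0.76 to "substantial", matching the labels used in the text.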
CT examination is now widely used to assess fracture details and to facilitate fracture classification. A number of studies have evaluated its effect on the reliability of the AO/OTA classification. Flinkkilä et al., Kleinlugtenbelt et al. and Arealis et al. found that, although a CT scan improved the ability to detect intra-articular fracture lines, it added little to inter-observer reliability [6, 10, 27]. As a seemingly natural consequence, both Flinkkilä et al. and Kleinlugtenbelt et al. found that the proportion of fractures classified as intra-articular was higher when a CT scan was used [7, 27]. The lack of a clear effect of CT on reliability in these studies could be explained by the so-called coastline paradox: the measured length of a coastline increases as the measurement scale decreases, i.e. complexity increases at a finer scale. When a CT scan is used to determine the fracture classification, some questions are resolved (e.g. whether the fracture is intra-articular). However, as more details become visible, new questions may arise that confound the classification.
It is apparent that, for the 64 fractures in the present study where the classification in the SFR and the gold standard classification diverged, the majority of divergences were between related fracture classes. Such fracture classes are separated by only one defining question, which may concern the presence of a fracture line, which can be difficult to determine, or the degree of comminution, which lacks a clear definition. The inclusion of subgroups (A2.1–A2.3) did not significantly lower the kappa value, which may be because the defining questions for these subgroups are relatively clear-cut (volar, dorsal or no displacement). In the present study, when the classification was simplified from AO/OTA group/subgroup (4/5 signs) to AO/OTA type (3 signs), agreement did not improve substantially (kappa 0.41 and 0.48 respectively). An explanation might be that, even though disagreements are often between related fracture classes, related fracture classes are not always within the same fracture type. A previous study of the validity of humeral fracture classification showed that related fracture classes can be far apart on the pictorial chart of the classification scheme [17]. This is exemplified in the AO/OTA classification of DRFs, where the only feature separating a C2 fracture from an A3 fracture is an intra-articular fracture line; in spite of this, A3 and C2 are far apart on the pictorial chart. Furthermore, the kappa calculation does not take the degree of disagreement into account. Weighted kappa, on the other hand, is not suitable, as fracture classification is a nominal scale. A disagreement between two related categories close to their common border may not be of major significance or clinical relevance, but it still lowers the kappa value.
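The point that unweighted kappa ignores how far apart the disagreeing classes are can be illustrated numerically. The sketch below (with hypothetical counts chosen for illustration, not taken from the present study) computes Cohen's kappa from a 3×3 rater-versus-gold-standard confusion matrix over the AO/OTA types A, B and C. Moving the same ten misclassifications from an adjacent ("related") class to a distant ("unrelated") one leaves kappa essentially unchanged, because only exact matches on the diagonal count as agreement:

```python
def cohens_kappa(cm):
    """Unweighted Cohen's kappa from a square confusion matrix (list of lists)."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    p_o = sum(cm[i][i] for i in range(k)) / n          # observed agreement
    rows = [sum(cm[i]) for i in range(k)]              # rater marginals
    cols = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # gold marginals
    p_e = sum(rows[i] * cols[i] for i in range(k)) / n ** 2     # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Ten A-fractures misclassified as B (a near, "related" error) ...
near = [[40, 10,  0],
        [ 0, 40,  0],
        [ 0,  0, 38]]
# ... versus the same ten misclassified as C (a distant, "unrelated" error)
far  = [[40,  0, 10],
        [ 0, 40,  0],
        [ 0,  0, 38]]

print(round(cohens_kappa(near), 3))  # 0.883
print(round(cohens_kappa(far), 3))   # 0.883
```

Both matrices yield kappa of about 0.88: the statistic registers that ten cases disagree, but not whether the disagreement crossed a clinically trivial border or landed in an entirely unrelated class.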
However, with the concept of related fractures, the present study shows either full agreement or disagreement only within related fracture classes between the SFR and the gold standard in 80% (102/128) of cases. Full disagreement (disagreement between unrelated fracture classes) between the SFR and the gold standard was seen in only 20% (26/128) (Table 7). This model for interpreting the results may explain why the kappa values do not increase considerably when the AO/OTA classification is simplified from group/subgroup to type.
Strengths and limitations
The study design is in accordance with the quality criteria of Audige et al. and is similar to other validity studies performed in the SFR [16,17,18,19, 25, 26]. The study population of 128 fractures is extensive and, as the study period extends over six years, the study is not affected by seasonal variation. The long study period also means that the junior residents at the A&E were replaced several times, reducing possible bias from individual skills. The study had no specific exclusion criteria (except age above 16 years); all fractures were eligible regardless of treatment. No fractures were classified as A1 or B2 in the gold standard classification; however, all other fracture groups were represented in the study. One possible bias is that all the fractures came from the same hospital, Sahlgrenska University Hospital, where many of the co-workers understand the importance of correct registration. In future studies, it would be of interest to examine fractures treated at other departments affiliated with the SFR. This study reflects the classification made under real-life conditions at the A&E by the attending orthopaedic surgeons, some with limited experience. The radiological images were not standardised, nor were they excluded in the event of poor quality; the varying image quality reflects clinical practice. There is no such thing as a perfect classification system, nor "a perfect truth" in the interpretation of such a system, rather a "weighing of expert opinions". In the classification of fractures there will therefore always be some disagreement between observers, since classification relies on an interpretation of both the radiological images and the classification system used. It could be argued that the "gold standard" classification is arbitrary, but to our knowledge there is no better way to define the "correct" classification.
The question remains why the classification of DRFs universally shows such low kappa values. It is apparent that, to improve the accuracy of wrist fracture classification, the classification systems need to be modified and based on defining questions that are well defined and easy to assess. Although simplifying the systems, e.g. reducing the AO/OTA system to types only (A, B, C), improves agreement to some extent, it renders the classification meaningless for treatment selection, prognosis and scientific work. The current study shows kappa values similar to those in previous studies. However, the concept of related fracture classes presented in the current study offers some explanation for the poor kappa values.