The findings of this study indicate that the Single Leg Mini Squat (SLMS) test has moderate to excellent reproducibility for children aged 9–10 years (3rd grade) and 12–14 years (7th grade) when tested by 2 inexperienced physiotherapists. Linear weighted kappa ranged from 0.54 to 0.86, only 15% lower than the quadratic weighted values, which ranged from 0.76 to 0.95. The lowest kappa value (linear weighted kappa 0.54) was observed for the knee component in children in 7th grade. Furthermore, this study indicates that the results differ considerably depending on the type of kappa chosen as the statistical method.
Discussion of aim 1
The linear weighted kappa found in the current study varies from 0.54 to 0.86 (P 0.08–0.47, Po 0.86–0.97) depending on the component evaluated, with the knee component being the most challenging. A previous study evaluating the reproducibility of SLMS in adults found a higher kappa (kappa 0.92, Po 0.96) when using a nominal scale assessing postural orientation of the knee only, whereas this study evaluated postural orientation of multiple components on an ordinal scale. The number of possible scores (0–3) and the number of components (4) may have influenced the judgment of postural orientation and thereby the results, considering that an ordinal scale may be less accurate than a nominal scale because of the higher risk that the two testers disagree [9, 11]. The protocol for the Single Leg Mini Squat test for children followed the description of Ortqvist, but used the scoring system of Trulsson in order to differentiate the degree of displacement. The latter was originally developed for a test performed only 5 times and at a much slower pace, which is why it can be difficult in this study for the tester to observe the described components in four different regions (ankle, knee, hip, trunk) and score each on a four-point scale (0–3) at a faster pace during 30 seconds. Another example of a real-time multi-component scoring system for evaluating frontal plane postural orientation, in this case in a jump-landing task, is the Landing Error Scoring System - Real Time (LESS-RT). Five components with 2–3 scoring possibilities each are scored over 2 trials of the jump-landing task, with an additional trial to allow the tester to observe all 5 jump-landing characteristics. For athletes (18–23 years), the LESS-RT has high inter-rater reliability (ICC 0.79, 95% CI 0.64–0.88) when evaluated by experienced athletic trainers. However, this system has not been tested on children or adolescents and still lacks predictive evidence for identifying individuals at high risk of injury.
In a child population, moderate inter-tester reproducibility of SLMS (kappa 0.57) and an overall agreement of 0.79 have been observed. After dichotomizing the data for the right knee in the 7th grade to a nominal scale (0 = negative test, 1–3 = positive test), the current study obtained, at a prevalence of 0.37, a kappa value essentially identical to that presented by Ortqvist et al. (2010) (kappa 0.58, overall agreement 0.59 vs. 0.79). Ortqvist et al. conclude that the test is clinically useful in a pediatric population, based on the relatively high overall agreement. However, this conflicts with Landis & Koch, as overall agreement does not take into account agreement occurring by chance. To determine the reproducibility of a given test, the most comprehensive presentation is reported to include the prevalence, the overall agreement and the kappa value in combination [11, 12].
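To illustrate why overall agreement alone can be misleading, the following minimal sketch computes overall agreement, chance agreement, Cohen's kappa and prevalence from a dichotomized 2x2 cross table. The function name, the cell labels and both example tables are hypothetical and are not data from this study. Prevalence is assumed here to be the mean proportion of positive ratings across the two testers.

```python
def kappa_2x2(a, b, c, d):
    """Agreement statistics for two testers and a dichotomous score.

    a = both testers positive, d = both testers negative,
    b = tester 1 positive / tester 2 negative,
    c = tester 1 negative / tester 2 positive.
    """
    n = a + b + c + d
    po = (a + d) / n                      # observed (overall) agreement
    p1_pos = (a + b) / n                  # proportion scored positive by tester 1
    p2_pos = (a + c) / n                  # proportion scored positive by tester 2
    # Agreement expected by chance, from the marginal proportions
    pe = p1_pos * p2_pos + (1 - p1_pos) * (1 - p2_pos)
    kappa = (po - pe) / (1 - pe)
    prevalence = (p1_pos + p2_pos) / 2    # assumed definition, see lead-in
    return {"po": po, "pe": pe, "kappa": kappa, "prevalence": prevalence}


# Two hypothetical tables with the same overall agreement (0.80)
balanced = kappa_2x2(a=40, b=10, c=10, d=40)   # prevalence 0.50
skewed = kappa_2x2(a=5, b=10, c=10, d=75)      # prevalence 0.15

for name, result in (("balanced", balanced), ("skewed", skewed)):
    print(name, {key: round(value, 2) for key, value in result.items()})
```

Both hypothetical tables have the same overall agreement, yet kappa is considerably lower for the table with the skewed prevalence, which is why the three values are best reported together.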
Kappa for the knee component is generally lower for children in the 7th grade than for children in the 3rd grade. Possible explanations are a high or low prevalence index, or that the children in 7th grade perform the test faster and hence produce a higher number of knee bends (median 22 vs. 17), making it more challenging to determine the score.
The trunk was the component with the highest kappa values. A possible explanation is that trunk displacement is easier to observe visually than displacement of the knee. The current protocol may not be thorough enough regarding determination of the score, so future studies are advised to standardize the test in more detail, with emphasis on the knee component.
Prior to the study, the two physiotherapy students, under the supervision of an experienced physiotherapist, standardized the test in an extensive training phase performed on a large number of children in order to minimize bias and increase overall agreement. With a strict, standardized protocol and a thorough training phase, clinical experience ought to be less important, as some studies assessing movement quality indicate [9, 17].
Discussion of aim 2
The authors of this study have chosen to interpret their results using linear weighted kappa (Table 1). However, it would have been possible to choose quadratic weighted kappa, which would have yielded higher reproducibility values for SLMS (see Tables 4 and 5). Quadratic weighted kappa is often used for the purpose of comparing results with the Intraclass Correlation Coefficient (ICC). The concern, however, is that the quadratic weighted kappa may give an overly positive picture of the reproducibility of this screening test, which has an equal distance between scoring categories, as quadratically weighted values increase with the number of categories, whereas linearly weighted values vary much less with the number of categories. Like the kappa itself, the overall agreement increases depending on the type of kappa chosen as the statistical method, with a divergence of as much as 61% from un-weighted kappa to quadratic weighted kappa in the example of the right knee for the children in the 7th grade. Clinically, one must consider the consequences of applying the results of the quadratic weighted kappa, which demonstrate almost perfect reproducibility along with a very high overall agreement and might produce several false-positive outcomes.
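The effect of the weighting scheme can be sketched with a small calculation. The example below applies un-weighted, linear and quadratic agreement weights to one hypothetical 4x4 cross table for a 0–3 score. The table is invented for illustration and is not taken from Tables 3–5, and the function name and weighting labels are assumptions. Agreement weights are taken as 1 - |i - j|/(k - 1) for linear and 1 - (|i - j|/(k - 1))^2 for quadratic weighting, the usual textbook forms.

```python
def weighted_kappa(table, weighting="linear"):
    """Return (kappa, weighted overall agreement) for a square cross table.

    table[i][j] = number of children scored i by tester 1 and j by tester 2.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        distance = abs(i - j) / (k - 1)   # 0 on the diagonal, 1 at maximum disagreement
        if weighting == "unweighted":
            return 1.0 if i == j else 0.0
        if weighting == "linear":
            return 1.0 - distance
        if weighting == "quadratic":
            return 1.0 - distance ** 2
        raise ValueError(f"unknown weighting: {weighting}")

    cells = [(i, j) for i in range(k) for j in range(k)]
    po = sum(weight(i, j) * table[i][j] for i, j in cells) / n
    pe = sum(weight(i, j) * row_tot[i] * col_tot[j] for i, j in cells) / n ** 2
    return (po - pe) / (1 - pe), po


# Hypothetical cross table: all disagreements are one category apart
example_table = [
    [20, 5, 0, 0],
    [5, 20, 5, 0],
    [0, 5, 20, 5],
    [0, 0, 5, 10],
]

results = {w: weighted_kappa(example_table, w)
           for w in ("unweighted", "linear", "quadratic")}
for name, (kappa, po) in results.items():
    print(f"{name:>10}: kappa = {kappa:.2f}, overall agreement = {po:.2f}")
```

On this hypothetical table, kappa and the weighted overall agreement both rise from the un-weighted to the linear to the quadratic scheme, although the underlying ratings are unchanged.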
Taking the cross table (Table 3) as an example, it is evident that one or both of the testers have over- or underestimated the displacement of the right knee of the children in the 7th grade. This illustrates why kappa values can never be used on their own to evaluate a given test, but need to be interpreted together with the prevalence index and, optimally, the content of the cross tables, in addition to the overall agreement, in order to understand the full potential of a test's reproducibility. The number of misclassifications becomes clearer from the numbers in a cross table than from the kappa value alone.
Tables 4 and 5 illustrate how different methods of calculating kappa may influence the final result and therefore the reported reproducibility. The results vary from moderate to excellent strength of agreement within the same component, which is a serious inconsistency, considering that they are calculated from the same dataset. The concern here is that there are no formal guidelines as to which weighting values should be used when. Depending on what seems natural in the given context, it is even possible to develop one's own weighting scale [11, 14], which is a considerable limitation for the comparison of results unless the weighting values are described and explained.
Another concern regarding this study, and kappa statistics in general, is the benchmark for an acceptable kappa value. When interpreting the kappa values according to Landis & Koch, 60% of the quadratic and linear weighted single-component results would be substantial and 34% almost perfect. However, classifications for interpreting kappa values vary: according to Fleiss, the benchmark for an acceptable kappa value, classified as intermediate to good, is 0.40, and for an excellent kappa value 0.75. This means that 75% of the kappa values in this study would be intermediate to good and 25% excellent. The question is how high the inter-tester reproducibility coefficient should be for the extent of agreement to be considered good enough. Since the choice of classification scale and benchmark will inevitably be arbitrary, one should always interpret kappa in relation to the prevalence, the overall agreement and the bias.
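How much the chosen benchmark matters can be shown by classifying the same kappa values under both scales. The sketch below uses the commonly cited Landis & Koch bands (0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect) and the Fleiss cut-offs of 0.40 and 0.75 mentioned above. The function names and the handling of values that fall exactly on a boundary are assumptions. The four example values are the end points of the linear and quadratic weighted kappa ranges reported in this study.

```python
def landis_koch(kappa):
    """Strength-of-agreement label using the Landis & Koch bands."""
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


def fleiss(kappa):
    """Label using the Fleiss cut-offs (boundary handling is an assumption)."""
    if kappa < 0.40:
        return "poor"
    if kappa <= 0.75:
        return "intermediate to good"
    return "excellent"


# End points of the kappa ranges reported in this study
for value in (0.54, 0.76, 0.86, 0.95):
    print(f"kappa {value:.2f}: Landis & Koch = {landis_koch(value)}, Fleiss = {fleiss(value)}")
```

A kappa of 0.76, for example, is labeled substantial on the first scale and excellent on the second, so the label attached to a result depends on the scale as much as on the data.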
The limitations of this study are the number of components to be evaluated in a relatively short time interval, and the use of an overall score without determining exactly when, during the 30 seconds, each component should be evaluated, given the influence of muscle fatigue on performance. It may therefore be too demanding to evaluate postural orientation accurately on several components in a real-time test. A potential solution may be to use 2D video analysis, with the option to observe one body part at a time and to use video features such as slow motion and repetition, as in other similar studies [6, 9].
In this study, the contrast between children was minimal, as only non-injured children and a few children with minor pain participated. This may have affected the testers' likelihood of scoring '2' or '3', as injured children would most probably have had higher scores. A study population consisting of injured and non-injured children could have affected the prevalence and might have made it more obvious to the testers when to score a '2' or '3'. In clinical practice it is important to be able to screen children at potential risk of injury, which was one of the reasons for performing a reproducibility study in a population that is normally seen in clinical practice. Using PABAK, as presented in the current study, may solve one of the statistical problems with the small group contrast. Screening tests with differentiated scores on an ordinal scale may also be more similar to clinical practice than screening tests on a nominal scale, which underlines the need to evaluate the reliability as well as the predictive validity of screening tests.
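As a rough illustration of how PABAK relates to the small group contrast, the sketch below computes the prevalence index, the bias index and PABAK for a hypothetical dichotomized 2x2 table. It assumes the standard un-weighted definitions: PABAK = (k x Po - 1)/(k - 1) for k categories, which reduces to 2 x Po - 1 for two categories, prevalence index = |a - d|/n and bias index = |b - c|/n. The exact variant used in this study is not stated here, and the function names and example counts are hypothetical.

```python
def pabak(po, n_categories=2):
    """Prevalence- and bias-adjusted kappa from the overall agreement Po."""
    return (n_categories * po - 1) / (n_categories - 1)


def prevalence_and_bias_index(a, b, c, d):
    """Prevalence and bias index for a 2x2 table.

    a = both testers positive, d = both testers negative,
    b and c = the two kinds of disagreement.
    """
    n = a + b + c + d
    prevalence_index = abs(a - d) / n   # imbalance between positive and negative agreements
    bias_index = abs(b - c) / n         # imbalance between the two disagreement cells
    return prevalence_index, bias_index


# Hypothetical table with few positive scores (small group contrast)
a, b, c, d = 5, 10, 10, 75
po = (a + d) / (a + b + c + d)
pi, bi = prevalence_and_bias_index(a, b, c, d)

print(f"Po = {po:.2f}, prevalence index = {pi:.2f}, bias index = {bi:.2f}")
print(f"PABAK = {pabak(po):.2f}")
```

Because PABAK depends only on the overall agreement and the number of categories, it is unaffected by a skewed prevalence. It should therefore be reported alongside kappa rather than instead of it.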
The strengths of this study are the high number of children included and thus the amount of information collected. Previous studies have only examined one component (the knee) at a time, while this study indicates that SLMS has potential as a screening test evaluating postural orientation of several components. The method is fast and easy to administer for clinical use and requires no equipment. The test was standardized before use, with an extensive training period performed on 100 children.
If the SLMS test is a predictor of complaints and sport injuries in children and adolescents, it could be used as a screening tool, thus targeting interventions at those children and adolescents with displacement of ankle, knee, hip or trunk components during the test. However, it is necessary to test concurrent and predictive validity in children and adolescents in relation to PFP and ACL injuries before such interventions are relevant.