The Clubfoot Assessment Protocol (CAP): description and reliability of a structured multi-level instrument for follow-up

Background: In most clubfoot studies, the outcome instruments used are designed to evaluate classification or long-term cross-sectional results. Variables deal mainly with factors at the body function/structure level. Wide scoring intervals and total sum scores increase the risk that important changes and information go undetected. Studies of the reliability, validity and responsiveness of these instruments are sparse. The lack of an instrument for longitudinal follow-up led the investigators to develop the Clubfoot Assessment Protocol (CAP). The aim of this article is to introduce and describe the CAP and to evaluate the items' inter- and intra-rater reliability in relation to patient age.

Methods: The CAP was created from 22 items divided between the body function/structure (three subgroups) and activity (one subgroup) levels according to the International Classification of Function, Disability and Health (ICF). The focus is on item and subgroup development. Two experienced examiners assessed 69 clubfeet in 48 children with a median age of 2.1 years (range, 0 to 6.7 years). Both treated and untreated feet with different grades of severity were included. Three age groups were constructed to study the influence of age on reliability. The intra-rater study included 32 feet in 20 children with a median age of 2.5 years (range, 4 months to 6.8 years). Unweighted kappa statistics, percentage observed agreement, and the number of categories determined how reliability was interpreted.

Results: Inter-rater reliability was assessed as moderate to good for all but one item. Eighteen items had kappa values > 0.40; three items varied from 0.35 to 0.38. The mean percentage observed agreement was 82% (range, 62 to 95%). The different age groups showed sufficient agreement. Intra-rater: all items had kappa values > 0.40 (range, 0.54 to 1.00) and a mean percentage agreement of 89.5%. The number of categories varied from 3 to 5.
Conclusion: The CAP contains more detailed information than previous protocols. It is a multi-dimensional, observer-administered, standardized measurement instrument with the focus on item and subgroup level. It can be used with sufficient reliability, independent of age, during the first seven years of childhood by examiners with good clinical experience. A few items showed low reliability, partly dependent on the child's age and/or the examiners' varying professional backgrounds. These items should be interpreted with caution until further studies have confirmed the validity and sensitivity of the instrument.

The International Classification of Function, Disability and Health (ICF), developed by the World Health Organization (WHO), is a classification of health and health-related domains that describes body function and body structure, activity and participation [20,21]. For outcome studies, the ICF can be used as a tool to systematically describe measures according to these domains.
The lack of an instrument that is useful during the child's growth, and that follows the guidelines of the ICF, led to the development of the Clubfoot Assessment Protocol (CAP). The aims of this study were i) to describe this new instrument, ii) to investigate item inter-rater reliability between two experienced clinicians with different professional backgrounds, iii) to investigate item intra-rater reliability, and iv) to investigate the influence of age on reliability.

The Clubfoot Assessment Protocol (CAP)
The purpose of the CAP is to provide an overall profile of the clubfoot child's functional status within the domains of body function/structure and activity, both on single assessment occasions and over time. Furthermore, the CAP aims to provide structure and standardization for follow-up procedures from 0 to 11 years of age in daily clinical decision making. It is an observer-administered test. The selection of items to include in the protocol and scoring system was a balancing act between clinical utility and scientific interest. Literature studies, expert opinions and clinical experience of what patients/parents present as important factors formed the platform for the CAP prototype.
The CAP (shown in its entirety, as used in daily practice, on page 19) (Table 3) contains 22 items in four subgroups: mobility (8 items), muscle function (3 items), morphology (4 items), and motion quality I and II (7 items). The first three subgroups relate to body function/structure and the last to activity, according to ICF-2001 [8]. Questions about pain, stiffness and daily activity/sport participation are standard. These subjective items are not included in this reliability study.
Each item is described in a manual along with the criteria for scoring. The scoring is divided systematically in proportion to what is regarded as normal variation and its supposed impact on perceived physical function, ranging from 0 (severe reduction/no capacity) to 4 (normal). Score grading can vary between 3 and 5 levels. For each subgroup, the sum of the item scores is calculated and can be visualized as a profile (transformed to a 0-100 scale, with 0 = extremely deviant and 100 = within normal variance; subgroup transformation score = actual score / maximal possible score × 100). A missing item is handled by substituting the average score for that item. The CAP is not intended for total scores.
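The subgroup transformation and the missing-item rule described above can be sketched as follows (a minimal illustration; the function and variable names are ours, not part of the CAP):

```python
def subgroup_score(item_scores, max_item_score=4):
    """Transform a CAP subgroup's item scores to a 0-100 scale.

    A missing item (None) is replaced by the average of the observed
    items in the same subgroup, per the protocol's missing-value rule.
    0 = extremely deviant, 100 = within normal variance.
    """
    observed = [s for s in item_scores if s is not None]
    if not observed:
        raise ValueError("no items assessed in this subgroup")
    mean = sum(observed) / len(observed)
    filled = [mean if s is None else s for s in item_scores]
    max_possible = max_item_score * len(item_scores)
    return sum(filled) / max_possible * 100

# e.g. a mobility subgroup of 8 items, one of them missing:
score = subgroup_score([4, 3, 3, 2, 4, 3, None, 4])
```

In this hypothetical example the missing item is imputed with the mean of the seven observed scores (23/7 ≈ 3.29), giving a transformed subgroup score of about 82 out of 100.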
Administration time varies between 10 and 15 minutes, depending on the child's cooperation. Seven items assess motion quality and are age dependent: at the age of three years, all children are presumed able to perform Motion Quality part I; at the age of four, all children are also expected to be able to perform Motion Quality part II. Knowledge of and experience with normal child neuromotor development is a prerequisite for proper assessment of the subgroups muscle function and motion quality.

Procedure
The reliability study took place over a four month period at routine follow-ups at the clubfoot unit and in a normal clinical setting. The project was regarded as quality control in clinical work. The children were familiar with the examiners. Parents and older children were informed about the testing procedure of the instrument and its importance in increasing the quality of our follow-up program. They were also informed that they could withdraw whenever they wanted. They all gave their consent to participate.
Two examiners, one physical therapist (HA) and one pediatric orthopedic surgeon (GH), both well acquainted with clubfoot problems, assessed the children consecutively and independently of each other, in random order. Both had participated in developing the protocol. HA had clinical experience working with the CAP. GH carefully studied the manual and the protocol before the study. After the first eight patients, the two observers consulted with each other before continuing. To enhance the stability of the phenomenon tested and to prevent the children from getting bored and tired, the examiners took turns instructing the children while testing the items of the domain "motion quality".
The intra-rater reliability test was done by HA.

Patients
In the inter-rater study, 13 girls and 35 boys born with idiopathic clubfoot, with a median age of 2.1 years (range, 0 to 6.7 years), were assessed. Twenty-seven children had unilateral and twenty-one had bilateral clubfoot, giving a total of 69 assessed feet. The severity spectrum of the feet at birth ranged from very mild to very severe [10]. The feet were assessed in different phases of our treatment program. This includes intensive stretching and manipulation on a daily basis during the first 2 months after birth, supplemented with an adjustable splint worn 22 hours a day. At the age of 2 months, an Achilles tenotomy and posterior-medial release was needed in most cases, followed by a 5-week period of casting. By the age of 4.5 months these children's clubfeet were fully corrected, and treatment continued with a specially designed dynamic orthosis. In the beginning these orthoses were worn 18 hours a day and later only at night (minimum 8 hours) until four years of age.
The children were divided into three age groups: I. Newborn to walking debut (n = 22 feet, median age 3.2 months, range 0 to 1.1 years).
The intra-rater portion of this study consisted of 20 children, considered to be in a clinically stable phase, with a median age of 2.5 years (range, 4 months to 6.8 years). A total of 32 feet were assessed, distributed among the three age groups as follows: 8:14:10. The mean re-examination interval was 2.1 months (range, 0.5 to 3.0 months).
Most missing values were seen in age group II in the subgroup motion quality, especially for heel and toe walking (12 out of 25 assessments). This was caused by immaturity in motor development. In three cases, the child refused to cooperate with one or the other of the observers (Table 2).

Table 1: Inter- and intra-rater reliability. Unweighted kappa values, confidence intervals (CI) and overall agreement (Po) for the inter- and intra-rater reliability tests, age interval 0-7 years.
The distribution of assessment scores was more equally spread in age group I and for all ages together. Age groups II and III had assessments shifted more to the right of the scale for the first 15 items.

Statistics
Unweighted kappa (k) statistics for agreement were used [22-24], with 95% confidence intervals; kappa calculates agreement beyond chance. As kappa values can become unstable under certain conditions [24,25], the observed percentage agreement (Po) was also calculated. A Po > 75% was regarded as good. In cases with a limited distribution of cell frequencies, Po was preferred over kappa. The number of categories was also taken into account, as kappa values decrease when the number of categories increases [25]. Kappa has a maximum of 1 when agreement is perfect; a value of 0 indicates no agreement better than chance, and negative values show worse-than-chance agreement. According to Altman [22], kappa values are interpreted as follows: < 0.20 poor agreement, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 good, and > 0.80 very good agreement.
Reliability was considered good when the kappa value was high, or when a low kappa value was combined with a high Po. Reliability was considered sufficient in cases with fair to moderate kappa values and good percentage agreement.
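The two agreement statistics described above can be sketched in a few lines (a minimal illustration with invented ratings, not data from this study):

```python
from collections import Counter

def unweighted_kappa(r1, r2):
    """Cohen's unweighted kappa and observed agreement (Po)
    for two raters' ratings of the same subjects."""
    n = len(r1)
    # Po: proportion of exact agreements
    po = sum(a == b for a, b in zip(r1, r2)) / n
    # chance agreement from the raters' marginal distributions
    c1, c2 = Counter(r1), Counter(r2)
    pe = sum(c1[c] * c2[c] for c in set(r1) | set(r2)) / n ** 2
    return (po - pe) / (1 - pe), po

# hypothetical CAP item scores (0-4) from two examiners on eight feet
rater_a = [0, 0, 1, 1, 2, 2, 2, 3]
rater_b = [0, 1, 1, 1, 2, 2, 3, 3]
kappa, po = unweighted_kappa(rater_a, rater_b)
# po = 0.75; kappa ≈ 0.67, "good" on Altman's scale
```

Note how Po alone (75%) would look similar even if both raters simply favored one category; kappa corrects for that chance agreement via the marginal distributions.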
SPSS 12.00 and StatXact (version 3) were used for the statistical analyses.

The two examiners agreed totally in 82% of the assessments (range, 62 to 95%) (Table 2). A one-category disagreement was seen in 17% of the cases, and a two-category disagreement in 1%. We conclude that all but one item had moderate to good agreement. Taking into account the kappa values, the Po and the number of scale steps, no age group showed clearly poor reliability values for its items, except for item 20, running, in age group III.

Intra-rater reliability
A total of 587 assessments were done twice. All items had kappa values > 0.40 (range, 0.54 to 1.00) (Table 1). Total agreement was reached in 89%. A one-category disagreement was seen in 10% and a two-category disagreement in 0.3%.

Discussion
The CAP items had moderate to very good inter-rater reliability for all items in the age group 0-7 years and for most items within the specific age groups.
The intra-rater test showed good to excellent reliability and indicates a good standardization of the protocol.
Most items in our protocol had moderate to excellent inter-observer reliability, especially concerning the subgroups "passive mobility" and "morphology". This is a positive finding in the light of the fine-grained protocol, with up to five different categories, and the two observers' different professions and different experience with the protocol.

Methodological issues
Reliability studies in children are difficult to perform. The risk of errors is high, as children's cooperation and task understanding may vary from day to day and between examiners. A child-friendly environment and familiarity with the examiners are important factors in enhancing reliability. We also wanted a situation comparable with the normal clinical setting in which the instrument is intended to be used. These are the reasons why the investigation was unblinded and no more than two examiners were involved.
The fact that one of the examiners had extensive practical experience with the instrument while the other had only co-operated with the development of the protocol might have influenced the result.
In clinical practice, teamwork is often the norm, and we therefore chose two different professions. However, Flynn et al. [18] observed in their study that including a physical therapist decreased reliability; agreement should be expected to increase if assessment is kept within the same profession.
The children available for our study represented the clubfoot spectrum [6] and illustrated the clinical development. Gender distribution corresponded well with the 3:1 (male/female) ratio normally described [3].

Statistics
When working with ordered categorical data, as in the case of our protocol, kappa is said to be the correct way of analyzing agreement [22,24]. We chose the unweighted kappa because we wanted to know the exact agreement for our finely graded instrument. It is more common, though, to use weighted kappa statistics, which take into account the degree of disagreement [22]; these values are usually higher. We recalculated our kappas as weighted and found that the values increased by between 0.01 and 0.20. For example, our kappa value for the item "running" in age group III changed from 0.13 to 0.46 when using weighted kappa statistics. This indicates that we can increase reliability by combining categories. Within research, however, the finely graded protocol should be prioritized. Care should be taken when interpreting kappa statistics, as the value of kappa depends upon the proportion of subjects in each category [24,26]. Haas [24] emphasizes that kappa becomes unstable under certain conditions. The problem of limited variation occurs when there is a large proportion of agreement and most of the agreement is limited to only one possible rating choice. We saw this problem, for example, in item 7. When all children between 0 and 7 years were included, untreated, treated and relapsing feet were assessed, which meant that the whole scoring spectrum was used; problems with limited distribution were therefore reduced. The older children generally had scores lying more to the right on the protocol, which caused a certain ceiling effect. Thus the CAP detects differences in severity, which supports part of its construct validity.
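The gap between unweighted and weighted kappa discussed above can be illustrated as follows (a sketch with invented ratings, not data from this study; linear weighting gives partial credit for near-miss disagreements):

```python
from collections import Counter
from itertools import product

def kappa_pair(r1, r2, n_categories):
    """Return (unweighted, linearly weighted) Cohen's kappa."""
    n = len(r1)
    c1, c2 = Counter(r1), Counter(r2)
    # unweighted: only exact matches count as agreement
    po = sum(a == b for a, b in zip(r1, r2)) / n
    pe = sum(c1[i] * c2[i] for i in range(n_categories)) / n ** 2
    unweighted = (po - pe) / (1 - pe)
    # linearly weighted: disagreement penalty |i - j| / (k - 1)
    w = lambda i, j: abs(i - j) / (n_categories - 1)
    d_obs = sum(w(a, b) for a, b in zip(r1, r2)) / n
    d_exp = sum(w(i, j) * c1[i] * c2[j]
                for i, j in product(range(n_categories), repeat=2)) / n ** 2
    weighted = 1 - d_obs / d_exp
    return unweighted, weighted

# raters who disagree only by adjacent categories on a 5-level item:
u, w = kappa_pair([0, 1, 1, 2, 2, 3, 3, 4],
                  [0, 1, 2, 2, 3, 3, 4, 4], n_categories=5)
# u ≈ 0.53 (moderate) but w ≈ 0.74 (good)
```

Because every disagreement in this example is a one-category miss, the weighted statistic rates agreement considerably higher, mirroring the "running" item's jump from 0.13 to 0.46.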
Another possibility for assessing reliability would have been to calculate statistical differences between the total subscores for each observer, as Flynn et al. [18] did in their reliability study comparing the Pirani [13] and Dimeglio [11] scores. Another way might be to use the mean difference and calculate the 95% limits of agreement, as Altman describes [22]. This could give us information on how much we can expect every new assessment to differ between examiners and individuals, and on its clinical relevance.

Results
We have described the CAP, an alternative assessment tool for both short- and long-term follow-up of children treated for clubfoot. Our protocol differs from most others through scoring grades with smaller intervals, and it incorporates a broader assessment of movement quality. It is also intended to be used longitudinally during the child's growth. The focus is primarily on the item level and secondarily on the subgroup level. With sum scores and categorization/classification, important information can be lost; they should therefore be avoided [26,27]. Research profiles can be made from the CAP for each item score or subgroup score at a certain time, or over a time interval, at group or individual level. In daily clinical work, the CAP is a promising tool for increasing the quality of follow-up procedures and clinical decision making through standardization, and it offers the possibility of visual feedback. It will also give us the possibility to analyze factors influencing clubfoot development.
With outcome studies, a holistic approach is of importance. The CAP should be supplemented with a patient- and parent-based questionnaire with items specifically focusing on symptoms and limitations in daily life, such as the patient-based questionnaire developed by Roye et al [17]. The Laaveg-Ponseti [12] rating system also has a score distribution emphasizing the importance of patient satisfaction and participation. Recently, several outcome measures focusing on the child's physical functioning in her or his environment, such as the Pediatric Outcomes Data Collection Instrument (PODCI) [28] and the Activity Scales for Kids (ASK) [29,30], have been developed. The use of these kinds of outcome instruments in the future will increase our knowledge of factors that are probably of more importance for patient satisfaction than range of motion, strength and radiographic changes. In the future these factors will become more and more important when discussing outcome results [10,31,32].
Face validity (whether a test appears to measure what it is supposed to measure) and content validity (the extent to which the measure represents functions or items of relevance given the purpose and matter at issue) [34] are enhanced through the developmental procedure. This is based on literature studies, discussions, clinical experience and patient information. Through clinical use, the tool was adjusted several times over the years at the clubfoot clinic and might be further adjusted.
Reliability for the different age groups is, considering the difficulties met in assessing children, within acceptable limits. Items that demand maturity, cooperation and task comprehension, such as muscle function, are more vulnerable to differing assessment results, as test conditions can change between observers. This is clearly seen in the total group for item 10 (kappa value 0.36) and item 11 (kappa value 0.35).
Differences in running quality are not easy to assess, which is reflected in a low kappa value of 0.38 (fair). Running is a fast movement, and observing slight variations is difficult. In our study nearly all disagreements lay between slightly deviant and normal.
Wainwright et al. [21] assessed the reliability of four classification systems, from Catterall [9], Dimeglio et al. [11], Harrold and Walker [35], and Ponseti and Smoley [3]. These instruments are only comparable with the CAP mobility domain. Nine children (13 clubfeet) were assessed by four examiners at different stages in the first 6 months of life (= 180 examinations). The results showed kappa values varying between 0.14 and 0.77. It is not reported whether the kappa is weighted or unweighted. The kappa values for our CAP mobility items vary between 0.57 and 0.73 for ages 0-7 years, and between 0.32 and 1.00 for ages from birth to walking debut. We consider this to be positive in the light of the finely graded scales in our protocol.

Future research
Further studies on psychometric aspects are ongoing and are needed before the CAP can be used in a scientifically sound way. Changes in items used and item groupings are therefore expected.

Conclusion
The CAP contains more detailed information than previous protocols. It is a multidimensional observer-administered measurement instrument with the focus on item and subgroup level. It can be used with sufficient reliability independent of age during the first seven years of childhood by examiners with good clinical experience.
A few items showed low reliability, partly dependent on the child's age and/or the examiners' varying professional backgrounds. These items should be interpreted with caution until further studies have confirmed the validity and sensitivity of the instrument.