Clinical history taking and assessment are the cornerstones of diagnosis and management [7, 10]. Establishing the relevance and reliability of such information is important not only for epidemiological research but also for clinical practice. This study investigated the reliability of two trained observers using a set of standardised questions and assessments derived from a Delphi study and existing literature.
Generally, for clinical interview questions, agreement was high and reliability was good. Reliability for items assessed using measurement instruments and recorded on a numerical scale, for example, grip strength, was generally higher than for items requiring observers to make judgements and interpret participants' responses.
The majority of variables requiring observation and palpation (skin condition, global impression of upper limb, muscle wasting, swelling and pain on resisted movement) showed poor reliability for inter-observer ratings. Reliability was moderate to good for observation and palpation of joint bony change and palpation of joint tenderness, which is similar to findings from previous studies [29, 30]. In our study, poor reliability was observed for measurement of thumb opposition (intra-observer), sensory testing and questions relating to altered sensation. Poor reliability may be attributable to several factors.
Real change in symptoms might explain poor reliability, although in this study it is unlikely to explain inter-observer variability. It is more reasonable to expect an effect on intra-observer variability because some change in symptoms over a month (i.e. the period of time between the first and the second assessment) might have occurred. However, the majority of participants reported that their hand symptoms were unaltered, implying a reasonable degree of stability. It should be noted, however, that stability was assessed using a single global question with three response options, and as such conclusions about change in specific symptoms are difficult to draw. Agreement for dimensions likely to change over one month, such as pain, tenderness and swelling, was no poorer for intra- than inter-observer comparisons, suggesting that poor agreement, notably for swelling, was unlikely to be due to change in symptoms.
Order effects are a possible explanation for variability, particularly for inter-observer comparisons of variables that might reasonably improve or deteriorate over the course of the two assessments. The potential for order effects was reduced in the design of the study and no systematic differences were noted when comparing assessors' results for variables likely to change over the course of the assessment.
Poor reliability, particularly for inter-observer ratings, may be explained by systematic differences between the observers. Systematic differences were found between the observers for two of the interview questions relating to altered sensation. Possible explanations for this are that one of the observers influenced participants in the way the question was asked, or that the observers interpreted participants' responses differently from each other. Systematic differences were also found for the assessment of muscle wasting, nodes, deformity and swelling, with one observer consistently finding more positives than the other. For the assessment of bony enlargement, differences in the number of positive findings were related to the joint group, with one observer finding more enlargements at the proximal interphalangeal (PIP) joints and fewer at the distal interphalangeal (DIP) joints than the other observer. Observers' thresholds for making positive judgements may be affected by several factors. Comparative rather than independent judgements may be made within or between participants. Within participants, observers may be influenced in their judgement of the presence of a feature in one joint by what they see in surrounding joints. Similarly, an observer's threshold for judging enlargement or deformity in the joints of one participant may be raised or lowered by judgements made during assessment of previous participants. Despite training the observers using the manual of study protocols, judgements may have been influenced by professional training, post-qualification clinical experience, and prior expectation.
In the general population it may be more difficult to differentiate between 'normal' and 'abnormal'. Features in the hand are more likely to be milder and less pronounced than in a secondary care setting, making judgements about their presence more difficult, an observation which has been noted previously [30]. For example, in our study, inter- and intra-observer reliability for objective testing of sensation using the Semmes-Weinstein™ monofilaments was fair to poor. Our results were similar to those found using healthy volunteers [31, 32], but differed from those using nerve-injured patients [33, 34], where a high degree of reliability was established, suggesting that monofilaments are most reliable for those with definite nerve damage.
High levels of variability, in the face of high observed agreement, may be due to the effects of prevalence, that is, positives occurring either commonly, for example, normal skin condition, or rarely, for example, joint swelling. In these circumstances, a high or low prevalence tends to markedly reduce the magnitude of Kappa, despite high observed agreement. Where the prevalence of swelling was not extreme (notably at the index and middle finger metacarpophalangeal joints), reliability was generally better.
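This prevalence effect can be illustrated with a simple worked example. The sketch below uses invented counts, not data from this study, to compute Cohen's Kappa for two hypothetical 2x2 agreement tables with identical observed agreement (96%): when the positive finding is rare, Kappa falls to roughly 0.31, whereas with balanced prevalence the same observed agreement yields a Kappa of roughly 0.92.

```python
# Illustrative (hypothetical) counts only; not data from this study.
def cohens_kappa(both_pos, a_only, b_only, both_neg):
    """Cohen's kappa for a 2x2 agreement table between two observers."""
    n = both_pos + a_only + b_only + both_neg
    p_observed = (both_pos + both_neg) / n
    # Chance agreement derived from each observer's marginal proportions.
    a_pos = (both_pos + a_only) / n
    b_pos = (both_pos + b_only) / n
    p_chance = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)
    return (p_observed - p_chance) / (1 - p_chance)

# Rare finding (e.g. joint swelling): 96% observed agreement, kappa ~ 0.31
print(cohens_kappa(both_pos=1, a_only=2, b_only=2, both_neg=95))

# Balanced prevalence: the same 96% observed agreement, kappa ~ 0.92
print(cohens_kappa(both_pos=48, a_only=2, b_only=2, both_neg=48))
```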
Good agreement has previously been observed for the application of the ACR criteria for hand OA [35]. In our study, the observers demonstrated moderate reliability when applying the ACR criteria for hand OA. This difference may be due to variation between the two study populations.
Poor reliability is likely to be due to a combination of differences between the observers, features in the hand being indistinct in nature, and a high or low prevalence of features. The reliability of assessing items such as altered sensation may benefit from greater standardisation or alternative forms of data collection, for example, a self-report questionnaire. The reliability of assessment of individual features at single joints, for example, nodes, may benefit from being viewed in combination as composite variables, cut-offs, or classifications. These results suggest that the ACR criteria for hand OA are more reliable than their individual components.
In the absence of accepted gold standards for assessing specific patient populations [36], it is difficult to comment on the accuracy of the observers' judgements. Agreement between observers does not necessarily mean that the answer was correct [37]. Similarly, where there was systematic disagreement, it is difficult to say which of the observers was correct.
This reliability study has several strengths. The questions and assessments were derived from Health Care Professional consensus [11], supplemented by measures from the literature. Participants were sampled purposively from a primary care setting to ensure a broad spectrum of hand problem severity. Potential sources of variability were minimised through observer training and the use of standardised protocols and aide-memoires. The potential for order effects was reduced in the design of the study. The time interval between repeat assessments was chosen to balance the risk of participants remembering details of the first assessment against the risk of true change in symptoms occurring.
It has been acknowledged that there is no single design that would adequately address issues of external validity for method, measuring instruments, observers and participants [38]. Whereas this study focused on ensuring external validity in relation to participants, the results, being based on two observers, limit the extent to which the findings can be generalised to the wider population of clinicians [39].
Although this study was designed to limit potential sources of variability, it is inevitable that some bias occurred. Systematic differences between observers may account in part for the poor reliability observed, and could be addressed to an extent by further training, strengthening of study protocols, and routine quality control checks to ensure adherence to protocols. However, it is inevitable that when making judgements, particularly about the presence of mild features, some variation will occur [39].