This study shows substantial to excellent agreement between two ultrasound observers for the presence of osteophytes and for effusion size. Moderate to substantial agreement was also demonstrated for measurement of ultrasound-derived femoral cartilage thickness. Moderate to substantial validity was demonstrated when comparing osteophytes detected by ultrasound to those seen on radiographs.
There are some important methodological considerations to note in this study. The whole process of both acquisition and reading of ultrasound images was assessed, thereby including the main potential sources of variation. Some previous studies have only looked at the reliability of reading images between observers, but it is important to measure the differences in the acquisition of images, especially considering the dynamic nature of ultrasound imaging. The results of this study are therefore likely to be closer to the true value. Intra-rater reliability was not measured in this study but is likely to be as good as, if not better than, inter-rater reliability.
There was a time interval of up to six weeks between the two ultrasound observations, during which the size of effusions might have changed. However, the participants were recruited from the community and not from attendance at either primary or secondary care. Therefore, it is unlikely that they had any significant steroid or other specific therapy in hospital for the incidental effusions that were picked up on the first ultrasound. As effusion size within participants might still have changed during this period, this interval could only have served to decrease the agreement between the two sonographers. The inter-rater agreement for size of effusions found in this study is therefore also likely to be conservative. When effusion was considered as a binary variable (using a cut-off of ≥4 mm depth), the κ was 0.65 (right) and 0.77 (left), which remains very close to the ICC values obtained when effusion was used as a continuous variable.
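The dichotomised analysis described above can be illustrated with a minimal sketch of Cohen's κ for two observers, applying the ≥4 mm cut-off to each observer's depth measurements. The function name `cohen_kappa_binary` and the depth values are hypothetical and chosen purely for illustration; they are not data from this study, and the sketch is not necessarily the exact procedure the authors used.

```python
from typing import Sequence


def cohen_kappa_binary(
    depths_a: Sequence[float],
    depths_b: Sequence[float],
    cutoff_mm: float = 4.0,
) -> float:
    """Cohen's kappa for effusion present/absent, dichotomised at cutoff_mm."""
    if len(depths_a) != len(depths_b) or len(depths_a) == 0:
        raise ValueError("both observers must rate the same non-empty set of knees")

    # Effusion is counted as present when depth is at or above the cut-off.
    a = [d >= cutoff_mm for d in depths_a]
    b = [d >= cutoff_mm for d in depths_b]
    n = len(a)

    # Observed proportion of knees on which the observers agree.
    p_observed = sum(x == y for x, y in zip(a, b)) / n

    # Agreement expected by chance, from each observer's marginal rates.
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical effusion depths (mm) for eight knees; not study data.
observer_1 = [2.0, 5.1, 4.0, 3.2, 6.5, 1.0, 4.4, 3.9]
observer_2 = [2.5, 4.8, 3.7, 3.0, 7.0, 1.2, 4.1, 4.2]
kappa = cohen_kappa_binary(observer_1, observer_2)
```

In this made-up example the observers disagree only on the two knees measured close to 4 mm, which shows how measurements near the cut-off can lower κ even when the continuous depths are very similar.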
Power Doppler assessment of synovitis (PDS) was not conducted in this study because the machine used did not appear to have adequate sensitivity, based on images acquired prior to the study. PDS has been found to be a valid method of detection of synovitis in the knees, although its reliability is still to be established. A EULAR group that assessed ultrasound features of inflammation decided not to evaluate PDS because of their concern that it was highly machine dependent.
This study did not seek to confirm that the osteophytes seen on ultrasound were the same ones seen on the radiographs, as the presence of any bone response is likely to be clinically important. The kappa values for validity were comparable when either sonographer's osteophyte results were compared with radiographic osteophytes. Reassuringly, the confidence intervals of these values for the two sonographers overlap substantially. Previous methods of evaluating femoral condylar cartilage [9, 10] have used semi-quantitative scores to assess the clarity and sharpness of cartilage, but this has the disadvantage of losing precision due to its ordinal scale. In addition, assessments of sharpness and clarity are subjective and quite likely to differ between observers. These features are also susceptible to change as more advanced ultrasound machines with better resolution are developed.
The two sonographers reached a consensus on the acquisition and reading of images prior to the commencement of this study. This would have shortened the learning curve that otherwise might have been seen. However, the scanning protocols did not include restrictive methods such as the use of grid lines to assist the placement of the probe for cartilage thickness measurement, as has been seen in previous studies [29, 30]. It is important that sonographers refer to the guidelines suggested by Backhaus et al. so that consistency can be achieved in future studies using ultrasound as an outcome measure in OA.
The demonstration of reliability and validity is an important precursor to any epidemiological study of OA using ultrasound. Previous studies that have assessed inter-rater reliability of ultrasound features of knee OA have included only small numbers of patients or a small subset of the patients in the original study. Inter-rater reliability between multiple experts in Europe on six patients (two with rheumatoid arthritis, four with OA) showed an overall kappa of 0.60 for the knee, with agreement of 92% for effusion/synovitis and 85% for bony cortex abnormalities. Our results show a slight improvement in the agreement between two ultrasound observers when compared with previous studies of the knee and of the hip. This may in part be because the two observers in our study had the opportunity to reach a consensus prior to the commencement of the study. This is the largest study to date, involving 34 knees, to address the issue of inter-rater reliability of ultrasound for various features of knee OA.
Jonsson et al. studied six patients and four controls who had each of the imaging modalities (radiography, ultrasound and magnetic resonance imaging) repeated once within one to four weeks of the initial imaging procedure. Radiographs (although an indirect measure of cartilage thickness) were the most reproducible imaging modality for assessing cartilage in the knees, with a coefficient of variation of 6.5%; ultrasound performed next best with a coefficient of variation of 8.4%, and magnetic resonance imaging (MRI) fared worst at 12%. While this may suggest that ultrasound demonstrates better test-retest reliability than MRI, it should also be noted that significant improvements have been made in the quantification of cartilage measurements by MRI since that study took place. The kappa and ICC agreement values in our study using ultrasound are comparable to those of radiographic studies of inter-rater reliability [36, 37].
Ultrasound demonstrated excellent agreement with MRI in a validation exercise involving 14 observers from across Europe. Among the observers, who imaged the knees of four patients with inflammatory arthritis, agreement with MRI was 100% for effusion, 79% for synovial hypertrophy and 75% for osteophytes. Yoon et al. demonstrated the validity of the longitudinal sagittal ultrasound image for assessment of cartilage thickness in a study using MRI as the comparator in 51 patients with knee OA in South Korea. However, the longitudinal sagittal image has not been performed subsequently in other studies or advocated previously in the EULAR guidelines, and it was not performed in our study either. The transverse image for femoral cartilage thickness has been validated by comparison with histopathological specimens, which can be considered to be the gold standard and superior to MRI in measurement of cartilage thickness. Naredo et al. compared ultrasound with measures of pain and radiographs in 50 patients with knee OA. They showed that knee effusion, medial meniscal protrusion and displacement of the medial collateral ligament were associated with significantly higher knee pain. Medial meniscal protrusion was related to decreased medial joint space width on radiographs. A Danish study that compared ultrasound and MRI showed that ultrasound detected 100% of the effusions seen on MRI, with Spearman coefficients of 0.87 and 0.86 between the two modalities for effusion and synovial thickness measurements, respectively. Our study could not validate features of inflammation because the comparator was radiographs. However, the kappa values of 0.52 and 0.75 for comparison of osteophyte detection between ultrasound and radiographs in our study are similar to the results from the MRI study above.