The Valued Life Activities Scale (VLAs): linguistic validation, cultural adaptation and psychometric testing in people with rheumatic and musculoskeletal diseases in the UK

Background The Valued Life Activities Scale (VLAs) measures difficulty in daily activities and social participation. As various versions exist with differing numbers of items, we linguistically and culturally adapted the full VLAs (33 items) and psychometrically tested it in adults with rheumatic and musculoskeletal diseases in the United Kingdom. Methods Participants with Rheumatoid Arthritis, Ankylosing Spondylitis, Chronic Pain/Fibromyalgia, Chronic Hand/Upper Limb Conditions, Osteoarthritis, Systemic Lupus, Systemic Sclerosis and Primary Sjogren’s Syndrome were recruited from out-patient clinics in National Health Service Hospitals, General Practice and patient organisations in the UK. Phase 1 involved linguistic and cultural adaptation: forward translation to British English; synthesis; expert panel review and cognitive debriefing interviews. In Phase 2 participants completed postal questionnaires to assess internal construct validity using (i) Confirmatory Factor Analysis (CFA), (ii) Mokken scaling and (iii) the Rasch model. Results Responders (n = 1544) had a mean age of 59 years (SD 13.3) and 77.2% were women. A CFA failed to support a total score from the 33 items (Chi-Square 3552; df 464; p < 0.0001). Mokken scaling indicated a strong non-parametric association between items. Fit to the Rasch model indicated that the VLAs was characterised by multidimensionality and item misfit, which may have been influenced by clusters of residual item correlations. An item banking approach resolved a 25-item calibrated set whose application could accommodate the ‘does not apply to me’ response option. Conclusions The UK version of the VLAs failed to satisfy classical and modern psychometric standards for complete item sets. However, as the scale is not usually applied in complete format, an item bank approach calibrated 25 items with fit to the Rasch model. Suitable Computer Adaptive Testing (CAT) software could implement the item set, giving patients the choice of whether or not an item applies to them.


Background
Rheumatic and musculoskeletal diseases (RMDs), such as Osteoarthritis (OA), Rheumatoid Arthritis (RA), Chronic Pain (CP) and Fibromyalgia (FM), are common, and their prevalence is rising with the ageing population [1]. Many individuals with RMDs report moderate to high pain and fatigue, which can lead to activity limitation and participation restriction, which in turn affect Quality of Life (QoL) [2][3][4][5]. Therefore, European League Against Rheumatism (EULAR) recommendations for health professionals' approach to pain management in inflammatory arthritis and OA emphasise that pain is a complex and multifaceted experience. Treatment should be guided by patients' preferences and priorities, such as the impact on their activities and participation, in order to facilitate improved health outcomes [6]. Patient reported outcome measures (PROMs) can be used to identify such preferences and priorities. However, few include both activities and participation items.
Developed in the United States (USA), the Valued Life Activities scale (VLAs) is one such PROM, measuring both difficulty in daily activities and participation in society [7]. It was developed from the 75-item Activities Enumeration Index [8], which was derived from content analysis of diaries and telephone interviews with patients with RA or OA [7,9,10,11].
The VLAs is based on Verbrugge and Jette's disablement model [12]. This defines activity and participation in three domains: Obligatory (required for survival and self-sufficiency, such as eating, hygiene, walking and transport); Committed (related to one's principal social roles, such as paid work, child and family care and household responsibilities); and Discretionary (engaged in for relaxation and pleasure, such as socialising, exercise, leisure, hobbies, religious activities, travel, volunteer work, educational activities and gardening).
The VLAs developers have allocated items to these three domains based on the model's definitions [7] (Additional File 1).
The VLAs has been used in over 10 cohort studies with large numbers of people with RA and systemic lupus erythematosus (SLE), but the way in which it has been administered varies: different studies have used different numbers of items (i.e. 33, 29, 26, 21, or 14 items; see Additional Table 1), some items differ between versions (depending on diagnosis), and several different scoring methods have been used. These methods include: the average difficulty score for all items and for each of the three domains; the average score created by adjusting scores if the person reports changing how they perform the activity (e.g. use an assistive device, have help, take more time or limit time performing), with item scores being increased by one point if the score is < 2 [13]; or calculating (unadjusted) scores only for those items identified as important by the participant [7,14]. Accordingly, we requested the definitive version and scoring method from the lead scale developer (Dr P. Katz). This was identified as the 33-item version scored on a 4-point scale (0 = no difficulty to 3 = unable to do). People are asked to record for each item: whether it is not applicable to them (i.e. the person does not normally perform the activity for reasons unrelated to their condition); their degree of difficulty performing it; and whether the item is important to them [13]. The overall score is then calculated as the mean of only those items identified as both applicable and important. As a result, different respondents' scores are based on different numbers of items within the VLAs, as the intention is to score only those activities which are "valued" by participants.
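The definitive scoring rule described above can be illustrated with a minimal Python sketch. The function name, the use of `None` to represent 'not applicable', and the example respondent are assumptions made for illustration only; they are not part of the VLAs documentation.

```python
from typing import Optional, Sequence


def vlas_mean_score(difficulty: Sequence[Optional[int]],
                    important: Sequence[bool]) -> Optional[float]:
    """Mean difficulty (0-3) over items that are both applicable and important.

    difficulty[i] is None when the item does not apply to the respondent
    (an assumed encoding, chosen for this sketch).
    """
    if len(difficulty) != len(important):
        raise ValueError("difficulty and important must have the same length")
    # Keep only items that apply to the respondent and that they value
    scored = [d for d, imp in zip(difficulty, important)
              if d is not None and imp]
    for d in scored:
        if d not in (0, 1, 2, 3):
            raise ValueError("difficulty ratings must be 0, 1, 2 or 3")
    if not scored:
        return None  # no valued items, so no score can be formed
    return sum(scored) / len(scored)


# Hypothetical respondent answering five items
difficulty = [0, 2, None, 3, 1]
important = [True, True, True, False, True]
print(vlas_mean_score(difficulty, important))  # 1.0
```

In this example only three of the five items contribute to the score, which shows how two respondents with the same mean may have been scored on entirely different items.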
Some psychometric testing has been conducted with the 33-item and shorter versions, although with differing scoring methods, demonstrating internal consistency and test-retest reliability. The 14-item Short-VLAs was developed using Rasch analysis, and its unidimensionality, construct and concurrent validity have also been demonstrated [15,16]. However, the variability in how the tool has been administered (differing numbers of items) and scored means there is currently limited evidence for the reliability and validity of the 33-item VLAs.
The VLAs, and the way in which it is used, presents considerable challenges for a robust psychometric analysis. For example, in the full 33-item set, respondents may answer any item with 'it is not relevant to me', so that, in practice, a valid response may arise from any combination of the 33 items. As such, there is a vast number of possible combinations (2^33 − 1, approximately 8.6 billion non-empty subsets). The current practice is to average the responses to the chosen items, giving a total score in the range 0-3. There are two major problems with this approach. First, the responses are ordinal and do not support mathematical operations such as averaging, which requires at least interval scaling. Second, even if this is ignored, such averaging would only be interpretable if every item had the same level of difficulty. Neither of these conditions holds for items in ordinal scales [17,18].
How then can a scale such as the VLAs be shown to be psychometrically sound? To satisfy traditional psychometric standards, the various item sets need to be shown to be reliable, valid, unidimensional and invariant for key groups [32]. The items themselves need to be locally independent (conditional on the trait), although failure of this requirement often reflects a degree of item redundancy. The key issue here is that the item set, from which choices of relevant items are made, is robust from a psychometric perspective.
Nevertheless, even if the various versions of the scale are shown to be robust, there remains the challenge of the scoring associated with, potentially, a very large number of subsets as chosen by the user. It is here that a variation of Computer Adaptive Testing (CAT) can resolve the issue. With a calibrated set of items (e.g. indicating the level of difficulty associated with each of the 33 VLAs items), these can be administered to the respondent, as long as there is a 'not relevant for me' option, which will be treated as a missing value by the CAT, so moving on to the next item.
Consequently, the analytical strategy required is to first assess the traditional psychometric properties of the VLAs versions, and then proceed to determine if a calibrated item set suitable for CAT can be found, given any limitations observed in the traditional analysis.
Before a PROM can be used in another language, or in another country with the same language, it is necessary to adapt the PROM and psychometrically test it in the target group(s). Thus the aims of this study were to develop a British English version of the VLAs (using the full 33-item scale) following recommended linguistic and cultural adaptation guidelines [19,20], and to test its psychometric properties in adults with RMDs in the United Kingdom (UK). We also investigated the psychometric properties of two shorter versions of the VLAs (26 and 14 items), embedded within the 33-item definitive version. The 26-item version had split the 'physical activities' item into two; we included this as one item, as in the 33-item version, so making it, in practice, a 25-item version. Thus, the adaptation and subsequent psychometric analysis focused on the 33, 26(25) and 14-item versions.

Study setting
Recruitment of people with RA was conducted through rheumatology outpatient clinics in 17 National Health Service (NHS) Hospitals. Participants with RA from a previous PROM study were also contacted [21]. Recruitment of people with the other seven RMDs was from 19 rheumatology or orthopaedic out-patient hospital departments, four General Practitioner (GP) surgeries, and from 10 RMD patient organisations in the UK.

Eligibility criteria
Inclusion criteria were people: aged ≥18 years; diagnosed with Rheumatoid Arthritis (RA), Ankylosing Spondylitis (AS), Chronic Pain (CP) or Fibromyalgia (FM), Chronic Hand and Upper Limb Conditions (CHUL), Osteoarthritis (OA), Systemic Lupus (SLE), Systemic Sclerosis (SS), or Primary Sjogren's Syndrome (PSS) by either a rheumatology consultant, or an orthopaedic consultant, GP or extended-scope health professional (in the case of OA and CP/FM specifically); able to read, write and understand English; and able to provide written informed consent.

Procedures
Phase-1: cross-cultural adaptation
We followed recommendations for linguistic and cross-cultural adaptation [19,20]. As the 33-item VLAs is written in North American English, backward translation was not required [Additional File 1]. Two native British English speakers forward translated the VLAs; one was a rheumatology occupational therapist and the other was not involved in health care and was unfamiliar with health outcome measures. Following forward translation, the two translators resolved any discrepancies. A North American speaker, with an academic background, also helped to check that the forward translation accurately reflected the meaning of the items. An Expert Panel, consisting of three occupational therapists, a physiotherapist, a methodologist and a layperson with RA (all with English as their first language), discussed the translation to agree a prototype British English VLAs. This was then reviewed by the panel for semantic (i.e. do words mean the same thing), idiomatic (e.g. presence of colloquialisms or idioms), experiential and conceptual equivalence to the original 33-item North American English version of the VLAs.

Cognitive de-briefing interviews
Cognitive de-briefing interviews were conducted with a purposive sample of participants with RA identified from the participants of a previous study residing within the Midlands and North West of England [21]. The sample included a wide range of demographic characteristics and health status (i.e. range of age, gender, disease duration and work status). The questionnaire booklet was posted for completion at home one week before a cognitive de-briefing interview conducted face-to-face or by telephone by an occupational therapist, depending on the participant's preference.
These semi-structured interviews determined whether the VLAs items were relevant, understandable and comprehensive, and whether participants' understanding of the items matched the intended use [19]. Participants were asked to rate the relevance and comprehensibility of the VLAs using five-point Likert scales (1 = not relevant to 5 = very relevant; and 1 = very easy to understand to 5 = very difficult to understand). Interviews were audio-recorded and transcribed for content analysis. A preliminary report of the findings was reviewed by the Expert Panel to agree on recommended changes prior to finalisation. A final version of this report and the British English VLAs were submitted to the lead developer in the USA for review, and the lead developer approved the changes.

Phase-2: psychometric testing
Participants
Participants with one of the eight RMDs as their primary diagnosis were recruited by research nurses or therapists using an eligibility checklist to screen patients. Additionally, patient organisations, such as the National Rheumatoid Arthritis Society (NRAS), Arthritis Care, National Ankylosing Spondylitis Society (NASS) and Fibromyalgia Action UK (FMA UK), mailed out study invitation letters, information sheets and a reply form to random samples of their members to help recruit participants. The reply form included the eligibility checklist items. Both rural and urban populations and a wide mix of socio-demographic characteristics were included (Fig. 1).
Data collection
Data were collected using postal questionnaires. The questionnaire booklet included demographic and health data (e.g. age, gender, marital, educational and employment status, disease duration, medication regimen), the 33-item VLAs and two measures of physical function: the Health Assessment Questionnaire (HAQ) [22] and the SF36 v2.0 [23]; as well as a 0-10 Numeric Rating Scale (NRS) reporting disease activity.

Sample size
The sample size calculation for Rasch analysis suggested that a sample of at least 150 for each condition will give 99% confidence of the person estimate being within ±0.5 logits, irrespective of whether or not the scale is well targeted to the patients [24]. We chose to recruit a higher number of people with RA as we aimed to conduct secondary analysis with the RA data, if the VLAs demonstrated appropriate psychometric properties. We stopped recruitment once we had at least 150 sufficiently completed questionnaire booklets.

Statistical analysis
Confirmatory factor analysis
The VLAs has undergone revision over time, such that there are several versions with 33-items being the definitive version. The other versions are nested within the 33-item scale, but the 26-item version includes two items for physical recreational activities (moderate and vigorous), rather than one item, as in the 33-item VLAs. Accordingly, when testing two shorter versions of the VLAs, we derived a 25-item VLAs (rather than 26-item version) from the 33-item version, as well as testing the Short VLAs (SVLAs: 14 items).
Confirmation of the 33-item structure from a classical test perspective would follow from a Confirmatory Factor Analysis (CFA) where a priori there is evidence that the item set constitutes one, or a series of domains [25]. Following Kline, fit is determined by a non-significant chi square statistic [26]. Ancillary fit statistics include the RMSEA where a value less than 0.06 would be appropriate, the Comparative Fit Index (CFI), a comparison of final model and baseline model, and the Tucker Lewis Index (TLI), another incremental fit Index which adds penalties for increasing the parameters. Both indices would suggest good fit with values above 0.95. Thus, in the present study, the item set is fit to a CFA model in Mplus [27] and tested for the three domains (Obligatory, Committed and Discretionary) and the total score only for "important and applicable" items.
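The ancillary fit statistics named above can be computed from the model and baseline chi-square values using their textbook definitions, as in the following minimal sketch. The input values are hypothetical, and software such as Mplus may apply estimator-specific corrections (for example with categorical indicators), so this sketch should not be expected to reproduce the values reported by any particular package.

```python
import math


def rmsea(chi2: float, df: int, n: int) -> float:
    """Root Mean Square Error of Approximation (textbook form, N - 1 denominator)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_m: float, df_m: int, chi2_b: float, df_b: int) -> float:
    """Comparative Fit Index: final model (m) against the baseline model (b)."""
    d_m = max(chi2_m - df_m, 0.0)
    d_b = max(chi2_b - df_b, d_m)
    return 1.0 if d_b == 0 else 1.0 - d_m / d_b


def tli(chi2_m: float, df_m: int, chi2_b: float, df_b: int) -> float:
    """Tucker Lewis Index: penalises additional parameters via chi2/df ratios."""
    ratio_b = chi2_b / df_b
    ratio_m = chi2_m / df_m
    return (ratio_b - ratio_m) / (ratio_b - 1.0)


# Hypothetical values, not taken from the present study
chi2_model, df_model, n_obs = 120.0, 50, 500
chi2_base, df_base = 2000.0, 66

print(round(rmsea(chi2_model, df_model, n_obs), 3))          # value < 0.06 suggests good fit
print(round(cfi(chi2_model, df_model, chi2_base, df_base), 3))  # value > 0.95 suggests good fit
print(round(tli(chi2_model, df_model, chi2_base, df_base), 3))  # value > 0.95 suggests good fit
```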

Mokken scaling
The Mokken scale is a non-parametric probabilistic model that utilises Loevinger's H coefficient to determine the 'scalability' of a set of items. H appears to be a measure of the degree to which the score is able to discriminate between persons in the given sample [28]. It has been argued that Mokken scaling is a natural starting point for item analysis, and it is used here in that context, to identify whether any items from the VLAs display a level of discrimination inconsistent with the expectations of the Rasch model, as represented by low values (< 0.3) of H [29]. In the present study Mokken scaling is examined through the msp procedure in STATA 13 [30].
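As an illustration of the coefficient, the following minimal sketch computes the scale-level H as the sum of observed inter-item covariances divided by the sum of the maximum covariances attainable given each item's marginal distribution. It assumes complete data (no 'does not apply to me' responses) and is not the msp procedure used in this study; the function names and the toy data are hypothetical.

```python
def _cov(x, y):
    """Population covariance of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n


def loevinger_h(data):
    """Scale H for complete responses; data[p][i] is person p's score on item i."""
    n_items = len(data[0])
    cols = [[row[i] for row in data] for i in range(n_items)]
    observed = maximum = 0.0
    for i in range(n_items):
        for j in range(i + 1, n_items):
            observed += _cov(cols[i], cols[j])
            # Pairing the sorted scores gives the largest covariance
            # that the two marginal distributions allow
            maximum += _cov(sorted(cols[i]), sorted(cols[j]))
    if maximum == 0:
        raise ValueError("H is undefined when items have no variance")
    return observed / maximum


# Toy data forming a perfect Guttman pattern, so H equals 1
guttman = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
print(loevinger_h(guttman))
```

Values near 1 indicate that responses follow the item ordering almost without error, whereas values below 0.3 are conventionally taken to indicate weak scalability.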

Rasch model
Data from the 33 items were fitted to the Rasch model to ascertain if a quantitative structure was present within the domain(s) being measured [31]. Described in detail elsewhere [32], the process is used to test fit to the model expectations, unidimensionality, (conditional) local item independence and invariance (Differential Item Functioning) by contextual groups of age, gender, employment and marital status, duration of disease, and, where data are pooled, by condition [33,34]. Briefly, the RUMM2030 Rasch software [35] reports a summary Chi-Square Interaction statistic, the p-value of which should be above 0.05 if the data fit the model. It also reports residual item and person means and standard deviations, the latter of which need to be below 1.4 to ensure no individual item is beyond a ±2.5 range. Reliability of the item set was also reported in the form of a 'person separation index' which, should the data have a normal distribution, is equivalent to Cronbach's Alpha (internal consistency) [36]; otherwise the value will deviate from Alpha. A post hoc t-test is undertaken to determine unidimensionality, contrasting two estimates derived from item sub-sets loading positively and negatively on the first residual principal component [37]. The proportion of contrasts between estimates where the t-test is significant (p < 0.05) should not exceed 5% (or the lower confidence interval of that proportion should not exceed 5%) to be indicative of unidimensionality.
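The t-test procedure for unidimensionality can be sketched as follows. For each person, the two estimates from the positively and negatively loading item sub-sets are contrasted using their standard errors, and the proportion of significant contrasts is reported with the lower bound of its confidence interval. The normal-approximation interval and the critical value of 1.96 are assumptions of this sketch, and the example values are hypothetical.

```python
import math


def proportion_significant_ttests(est_a, se_a, est_b, se_b, crit=1.96):
    """Return (proportion of significant contrasts, lower 95% bound)."""
    n = len(est_a)
    significant = 0
    for a, sa, b, sb in zip(est_a, se_a, est_b, se_b):
        # Independent t-test contrasting the two person estimates
        t = (a - b) / math.sqrt(sa ** 2 + sb ** 2)
        if abs(t) > crit:
            significant += 1
    p = significant / n
    # Normal-approximation binomial interval (an assumption of this sketch)
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, max(p - half_width, 0.0)


# Hypothetical person estimates (logits) from the two item sub-sets
est_pos = [0.0, 1.0, -0.5, 2.0]
est_neg = [0.1, 1.2, -0.4, 0.0]
se = [0.5, 0.5, 0.5, 0.5]
prop, lower = proportion_significant_ttests(est_pos, se, est_neg, se)
print(prop, lower)  # 0.25 0.0
```

Unidimensionality would be supported when the lower bound does not exceed 0.05; in practice the calculation would be run over all respondents rather than four.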
Following this, informed by the above analysis of 33 items, a calibration of the item set was attempted to form the basis of a CAT. To avoid the potential bias caused by a breach of the local independence assumption, first a set of 'core' items that fit the model and were free of local dependency were identified [38]. In doing so surplus items were set aside into a series of secondary item sets, which were subsequently fit to the model, anchored to the core metric by items in the core set which were free of dependency. Fit of the core and subsequent item sets to the Rasch model were tested by repeated sampling of the total data set to ensure the Type 1 error rate of the fit is accurate [39]. In this way, a calibrated set of items became available that could be administered in an innovative fashion by appropriate CAT software. The efficacy of the CAT process was evaluated by simulation using the Firestar programme [40].
The analysis uses the RUMM2030 software utilising the partial credit parameterisation of the Rasch model [35,41].
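For readers unfamiliar with the partial credit parameterisation, the following minimal sketch computes the probability of each response category for one item, given a person location and the item's threshold locations (all in logits). The function name and the threshold values are illustrative assumptions and are not estimates from this study.

```python
import math


def pcm_probabilities(theta, thresholds):
    """Category probabilities (0..m) under the partial credit model.

    theta: person location in logits.
    thresholds: the m threshold locations of one item, in logits.
    """
    # Cumulative sums of (theta - threshold); category 0 has sum 0
    cumulative = [0.0]
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (theta - tau))
    # Subtract the maximum before exponentiating for numerical stability
    top = max(cumulative)
    weights = [math.exp(c - top) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


# A hypothetical 4-category item (0 = no difficulty to 3 = unable to do)
probs = pcm_probabilities(theta=0.0, thresholds=[-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
```

Because each item has its own set of thresholds, the distances between categories are allowed to differ across items, which is what distinguishes this parameterisation from the rating scale version.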

Phase 1: cross-cultural adaptation
Cognitive debriefing interviews were conducted with 31 participants with RA whose socio-demographic and health characteristics are detailed in Table 1. In general, all British English VLAs items were deemed important and relevant. In terms of comprehensibility, item 13 "going to social events, parties, or celebrations" and item 18 "taking part in leisure activities OUTSIDE your home, such as going to the pub, bingo, going to the cinema, club meetings, restaurants" led most participants (n = 21) to question whether these measure the same concept, as they required similar considerations in order to participate. For example, participants noted that participation depended on location and accessibility. Several participants (n = 8) queried whether item 21 [driving or getting around your community by public transport] should be divided into separate items, as they perceived "driving" and "using public transport" as different transport options. However, when it was explained that this item measures participation (i.e. at a societal level) rather than activity limitation (i.e. at a personal level), they did not think it needed to change. Two participants suggested that item 27 (taking care of social communication, such as writing letters, sending emails, making phone calls or texting) could be separated into verbal and written communication. However, as this was raised by only two out of 31 participants, the original item remained unchanged.
Participants also struggled with the question "Do you have to make changes to how you do this activity because of your arthritis?" They were unclear whether to tick 'no' or just leave it blank if 'unable to do' the activity. This issue was resolved by adding further instructions to the VLAs to aid responders' decision making. Item 33 "having intimate relations with your spouse/ partner" was perceived as too intrusive by some participants (n = 6). However, as the majority of the responders found this item to be relevant and appropriate, the item was retained.
Following the cognitive de-briefing interviews, no new items were added. Instead, some changes were made to the layout and wording of the items, so they are relevant and comprehensible to the British population (Additional File 2). The changes made were submitted to the lead developer who agreed to these, as these were acknowledged as differences in expression between North American and British English.

Phase-2: psychometric testing
In Phase-2, 1929 NHS patients were screened, and a further 3365 invitations were sent through Patient Organisations (Fig. 1). From both of these sources, 1946 were interested and eligible, of whom most (97%) consented; and 1546 (81%) returned the postal questionnaire. The participants' socio-demographic and health characteristics are detailed in Tables 1 and 2. The response options to all 33 items are shown in Table 3, including the percentage of those reporting that an item "Does not apply to me". Only 79 (5%) respondents completed all 33 items (including the "does not apply to me" option). CFI 0.987; TLI 0.986). Modification indices throughout these analyses indicated substantial cross loading, particularly between Obligatory and Committed items, and substantial local dependency among pairs of items, thus requiring correlated errors. Given that the ancillary fit statistics were more supportive, the results suggest that the disturbance of structure may be strongly influenced by clusters of locally dependent items. A Loevinger Coefficient from Mokken scaling of 0.87 for all 33 items indicated a strong non-parametric association between items and, despite the lack of evidence of unidimensionality (which is an assumption of Mokken scaling), provided sufficient evidence to move forward to a Rasch analysis of the data.

Rasch: diagnostics
Fit of the data from the VLAs to the Rasch model is shown in Table 4. An initial Likelihood Ratio test to determine whether a Rating scale or Partial Credit parameterisation was appropriate supported the latter (Chi-Square 1281.3 (df 63); p < 0.0001). For each of the eight conditions, fit is shown for the 33, 25 and 14-item versions. Only four analyses satisfied the stochastic ordering (fit) and unidimensionality assumptions (AS-25; AS-14; SS-25; PS-14). Even here, the local independence assumption was breached by clusters of residual item correlations, although of insufficient magnitude to affect the fit and unidimensionality tests. Elsewhere, the VLAs was characterised by multidimensionality and misfit, which again may have been influenced by extensive clusters of residual item correlations. While reliability was high in all cases, this could be expected to be inflated in the presence of local response dependency, as identified through the residual correlation patterns. Differential item functioning (DIF) was occasionally present for age, gender and marital status, but not for education or duration of condition. For example, 'Doing heavy housework' was more difficult for females at any level of ability. DIF was also present for condition in 15 of the 33 items. For example, for those with RA, 'travelling long distances' was more difficult than for those with other conditions at all levels of life activity. Likewise, 'Taking care of social communication' was more difficult for those with chronic hand/upper limb conditions, at any level of life activity. Overall, the easiest activities (difficulty rarely affirmed) were 'eating' and 'taking part in leisure activities in the home', while the hardest activities (difficulty common) were 'minor home repairs' and 'gardening'.
The clusters of locally dependent items did not necessarily conform to the Obligatory, Committed or Discretionary domains. For example, in people with RA, the items 'doing other work around the house' and 'gardening or outdoor property work', designated as Committed and Discretionary respectively, displayed a residual correlation of 0.506 in the 33-item version, and 0.447 in the 25-item version. Nevertheless, fit to the model constrained to within the domains showed some improvement, although the occasional misfit and multidimensionality remained (Table 5). This suggests that much of the disturbance of fit and dimensionality could be attributable to the local dependency issue.

An item bank approach
Consequently, the item bank approach was applied. A core set of 15 items was shown to fit the Rasch model across most indicators (Table 6, Analyses 1-3). However, the item 'Travelling long distances' showed DIF by condition with, for example, RA and OA showing distinct differences in expected response at any level of difficulty, the former having more difficulty than the latter (Fig. 2). Having set aside the surplus items from the local dependency analysis, a second item set was created with 10 items, which again showed fit to the model and no DIF by condition (Table 6, Analyses 4-6). Thus, 25 of the 33 items were available for CAT and, with the second set calibration anchored by three items from the core set, all items were calibrated onto the same unidimensional interval scale metric. The mean number of items chosen (i.e. excluding "does not apply to me" responses) from the VLA-CAT25 in the main data set was 17.4, and the maximum was 24 (3.5%) (Fig. 3). Simulation of the efficacy of the CAT identified that the average number of items required to achieve an alpha of ≥0.7 for group use was 4, and 11 to achieve an alpha of 0.85 for individual use. Consequently, it would appear that, given the patient choice of relevant items, the CAT can in most cases accommodate both individual and group estimates with the required reliability, should the distribution shown in Fig. 3 be replicated elsewhere.
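To illustrate how a calibrated item bank can yield an estimate from whichever items a respondent chooses, the following minimal sketch computes a maximum-likelihood person estimate under the partial credit model, skipping items answered 'does not apply to me' (encoded here as `None`). This is not the algorithm of any specific CAT package; the item thresholds, function names and the damped Newton-Raphson scheme are assumptions made for illustration.

```python
import math


def pcm_probabilities(theta, thresholds):
    """Partial credit model category probabilities for one item."""
    cumulative = [0.0]
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (theta - tau))
    top = max(cumulative)
    weights = [math.exp(c - top) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def _expected_and_information(theta, items):
    """Expected raw score and test information over the answered items."""
    expected = information = 0.0
    for thresholds in items:
        probs = pcm_probabilities(theta, thresholds)
        mean = sum(k * p for k, p in enumerate(probs))
        expected += mean
        information += sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return expected, information


def estimate_person(responses, item_thresholds, tol=1e-6, max_iter=100):
    """Return (theta, standard error); None responses are treated as missing."""
    answered = [(x, taus) for x, taus in zip(responses, item_thresholds)
                if x is not None]
    if not answered:
        raise ValueError("no applicable items were answered")
    raw = sum(x for x, _ in answered)
    items = [taus for _, taus in answered]
    max_raw = sum(len(taus) for taus in items)
    if raw == 0 or raw == max_raw:
        raise ValueError("extreme score: no finite maximum-likelihood estimate")
    theta = 0.0
    for _ in range(max_iter):
        expected, information = _expected_and_information(theta, items)
        # Newton-Raphson step, limited to one logit to avoid overshooting
        step = max(-1.0, min(1.0, (raw - expected) / information))
        theta += step
        if abs(step) < tol:
            break
    _, information = _expected_and_information(theta, items)
    return theta, 1.0 / math.sqrt(information)


# Hypothetical three-item bank; the third item does not apply to this person
bank = [[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0], [0.5, 1.0, 2.0]]
theta, se = estimate_person([1, 2, None], bank)
print(round(theta, 3), round(se, 3))
```

The standard error shrinks as more applicable items are answered, which is why the number of items chosen by each respondent determines whether the estimate is reliable enough for group or individual use.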

Summary of the results
The 33-item VLAs was linguistically validated and culturally adapted for British people aged ≥18 years with RMDs following recommended guidelines. The British English VLAs retained all of the original 33 items, with some changes to the wording, template and instructions to make it easily understandable by British people. Following this, the VLAs was tested in its 33, 26 and 14-item versions with British people across eight different RMDs to verify its psychometric validity and reliability (internal consistency). The latter two versions were nested within the 33-item version, with a minor change to the 26-item version, which had split an original item into two parts. The results of the statistical analysis show that the VLAs, in its various summated forms (i.e. adding together items in complete sets scored using those items identified as important to the person), was not a valid measure of valued life activities. Only 5% of the sample considered that all the items applied to them. When a calibration was made for use in a CAT, 25 of the 33 items were retained, and formed a valid unidimensional item set, largely invariant by condition. The CAT could provide sufficient reliability to accommodate both individual and group estimates. Using suitable CAT software, these items could be administered taking account of the varying difficulty of the items, the local dependency that exists, and the DIF on the 'travel' item, so giving an estimate of VLA on a 0-100 interval scale, irrespective of the number of items chosen.

Discussion
The Valued Life Activities scale was completed by a large number of people across eight RMDs. The VLAs was perceived as a relevant and understandable measure of activities and participation by British people with RMDs. However, robust psychometric testing of the British VLAs in the context of the current scoring method (i.e. summing items identified as important to the respondent only) of the 33, 25 and 14-item versions showed that, due to local item dependency, multidimensionality and misfit to Rasch model expectations, the VLAs had insufficient validity to enable a recommendation for its use as summated item sets in clinical evaluation or research. The usual strategy of scoring only those items that apply to the individual does not exempt the underlying item set from basic psychometric requirements, as the choices that are made deliver a very large number of possible subsets of items from the whole, each of which should satisfy those same requirements.
The 'Does not apply to me' response also raises substantial problems with how these items are scored, and how to deal with this response (in addition to any other type of missingness). The problem is similar to that observed for Goal Attainment Scaling, where patients are involved with the choice of goals for their rehabilitation [42]. While Rasch analysis can deal with both structural and ordinary missingness, and multiple imputation techniques can provide complete data sets, these are unlikely to be available in routine clinical practice [43]. Also, imputation techniques are not designed to deal with 'missing not at random' instances, which is likely to be the case with the 'does not apply to me' option. Furthermore, the usual strategy for scales to provide a transformation table from raw score to interval scaled Rasch metric would also not apply, as it is only valid in the presence of complete data, which is not attainable under the present scoring method. This also affected actions to remedy the effects of local dependency, that is, creating 'super items' (testlets) by adding together clusters of items, as the 'Does not apply to me' option resulted in case-wise deletion at the testlet level. Given these problems, it was not possible to test for DIF cancellation at the scale level due to the restriction upon creating testlets [44]. Furthermore, under the current scoring method, DIF would have to be assessed across all possible combinations of items to examine whether any DIF is observed, and whether it would cancel across the chosen items, given that the person estimate would be re-estimated for each unique combination of items. Some of the above problems were accommodated through a CAT design, identifying 25 items (in two sets of 15 and 10) which demonstrated fit to the Rasch model, including unidimensionality and invariance by most contextual groups. The DIF by condition for the 'travel' item needed a condition-specific item location estimate for those conditions affected. The calibrated item set, given suitable CAT software, could be administered to patients, offering the options of 'not important for me' and 'not applicable to me'.

Implications for clinical and research practice
The main implication for clinical and research practice is that the implementation of the above solution requires access to CAT software or some system to provide CAT-based estimates, and appropriate IT infrastructure at the clinic level, or at least that the patient has online facilities at home. One application, the smartCAT system, was designed to facilitate such an environment, but requires on-line interaction with its server, which will return an estimate in real time to the source, including an appropriate clinical setting, as required [45]. It can cope with clusters of locally dependent items, and different estimates to account for DIF where present. Another CAT solution can be found with the Concerto software, which is an open-source online adaptive testing platform (https://concertoplatform.com/about) [46]. The former has a small charge per assessment, while the latter is free, although psychometric and technical support can be provided as required for a fee. So, when suitable software is available, using the VLAs in this manner addresses the EULAR recommendations of assessing patients' preferences and priorities concerning the impact upon their activities and participation.

Limitations
We only conducted cognitive debriefing interviews with people with RA, predominantly from the North West and Midlands regions of England, due to budget and timeline constraints. However, we tested the psychometric properties of the VLAs amongst eight RMDs. Cognitive debriefing with people with other RMDs may have resulted in the removal of items from, or the addition of new items to, the British VLAs. We also intended to examine test-retest reliability. Four weeks after completing the first questionnaire booklet, participants were mailed a second booklet, which included the VLAs. However, as the Rasch analysis identified significant challenges in calculating scores, we did not proceed with this test. There is no reason why a 'stable' respondent should choose the same set of items, even within a short time frame. Given the potential number of item sets that could be chosen, the retest can only be done on those who have completed exactly the same set of items at both time points. The question then arises as to whether or not a failure to choose the same set of items constitutes a lack of test-retest reliability. Even where the same set of items is chosen, given the possible number of combinations available, each combination should have sufficient cases for the analysis, as though they were distinct scales. Further work is required to consider how test-retest reliability may be undertaken in such circumstances.
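The restriction described above can be sketched in code. The following minimal example, under assumed data structures, identifies respondents who answered exactly the same set of items at both time points and counts how many cases each distinct item set contributes. The function names, the `min_cases` parameter and the example data are hypothetical; what counts as 'sufficient cases' is not defined in this study.

```python
from collections import Counter

def item_set(responses):
    """Items a respondent answered (None stands for 'does not apply to me')."""
    return frozenset(k for k, v in responses.items() if v is not None)

def retest_eligible(time1, time2, min_cases=2):
    """Find respondents with identical item sets at both time points.

    time1, time2: dict of respondent id -> dict of item -> score or None.
    Returns (same, usable): 'same' maps each eligible respondent to their
    item set; 'usable' maps each item set with at least min_cases
    respondents to its number of cases.
    """
    same = {}
    for pid in time1.keys() & time2.keys():
        first, second = item_set(time1[pid]), item_set(time2[pid])
        if first and first == second:
            same[pid] = first
    counts = Counter(same.values())
    usable = {s: n for s, n in counts.items() if n >= min_cases}
    return same, usable

# Hypothetical data: respondent 3 changes item set between time points.
t1 = {
    1: {"dressing": 1, "travel": 2, "gardening": None},
    2: {"dressing": 0, "travel": 1, "gardening": None},
    3: {"dressing": 2, "travel": None, "gardening": 3},
}
t2 = {
    1: {"dressing": 1, "travel": 1, "gardening": None},
    2: {"dressing": 1, "travel": 1, "gardening": None},
    3: {"dressing": 2, "travel": 2, "gardening": 3},
}
same, usable = retest_eligible(t1, t2)
```

Each entry in `usable` would then be analysed as though it were a distinct scale, which shows how quickly the available cases can be exhausted as the number of possible item sets grows.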
The VLA-CAT25 item bank itself has only just been developed within this study and will require further psychometric testing, and testing in clinical settings, to ascertain how well it works in day-to-day use. The smartCAT software is at its Beta test stage but has been trialled on a fatigue item bank in a clinical setting in Sweden. It requires careful management of the CAT process: assigning unique patient identifiers, setting up CAT in clinic, or providing links and passwords for use at home. Data protection must be considered, as the estimate itself is created on the smartCAT server, located outside of the European Union, and delivered in real time back to the source or designated setting. Decisions need to be taken as to whether or not the estimate and its associated patient identifier are stored on the foreign server. Similar software programmes are likely to have the same requirements.
The simulation of fit to the Rasch model was not ideal. For example, the distribution across 100 random samples would give a more accurate picture of the fit of the item bank item sets than just three consecutive samples drawn with replacement. Unfortunately, this option is not available in the software used. Consequently, further work is needed to verify the fit of the 15-item and 10-item sets. It is not possible to test the fit of the 25 items together, as the second set holds items which were locally dependent with the core set, and which would generate multidimensionality and misfit.
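As a rough illustration of the resampling approach described above, the following sketch draws 100 random samples with replacement and summarises the distribution of a fit statistic across them. The `fit_statistic` argument is a hypothetical placeholder: in practice it would be a call into Rasch software that returns a fit value for a sample, which is not shown here. The percentile summary is approximate.

```python
import random
import statistics

def resample_fit(cases, fit_statistic, n_samples=100, sample_size=None, seed=0):
    """Distribution of a fit statistic over random samples with replacement.

    cases: list of respondent records.
    fit_statistic: callable taking a list of cases and returning a number
                   (hypothetical stand-in for a Rasch fit computation).
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    size = sample_size or len(cases)
    values = []
    for _ in range(n_samples):
        sample = rng.choices(cases, k=size)  # sampling with replacement
        values.append(fit_statistic(sample))
    values.sort()
    return {
        "mean": statistics.mean(values),
        "p05": values[int(0.05 * (n_samples - 1))],  # approximate 5th percentile
        "p95": values[int(0.95 * (n_samples - 1))],  # approximate 95th percentile
        "values": values,
    }

# Toy stand-in: each 'case' is a single residual and the statistic is their mean.
toy_cases = [-1.2, -0.4, 0.0, 0.3, 0.8, 1.1, -0.7, 0.5]
summary = resample_fit(toy_cases, statistics.mean, n_samples=100, sample_size=5)
```

Summarising 100 samples gives an interval for the fit statistic, whereas three samples provide only three point values and little indication of sampling variability.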

Conclusions
The British version of the VLAs, across various scales, failed to satisfy classical and modern psychometric standards as full item sets. A CAT solution was found that