Text-mining applied to autoimmune disease research: the Sjögren’s syndrome knowledge base

Background: Sjögren’s syndrome is a tissue-specific autoimmune disease that affects exocrine tissues, especially salivary glands and lacrimal glands. Despite a large body of evidence gathered over the past 60 years, significant gaps still exist in our understanding of Sjögren’s syndrome. The goal of this study was to develop a database that collects and organizes gene and protein expression data from the existing literature for comparative analysis with future gene expression and proteomic studies of Sjögren’s syndrome.
Description: To catalog the existing knowledge in the field, we used text mining of over 7,700 PubMed abstracts to generate the Sjögren’s Syndrome Knowledge Base (SSKB) of published gene/protein data, which initially listed approximately 500 potential genes/proteins. The raw data were manually evaluated to remove duplicates and false-positives and to assign gene names. The database was manually curated to 477 entries, including 377 potential functional genes, which were used for Gene Ontology enrichment analysis and KEGG pathway analysis.
Conclusions: The Sjögren’s syndrome knowledge base (http://sskb.umn.edu) can form the foundation for an informed search of existing knowledge in the field as new potential therapeutic targets are identified by conventional or high throughput experimental techniques.


Background
Sjögren's syndrome is a tissue-specific autoimmune disease that affects exocrine tissues, especially salivary glands and lacrimal glands. It is one of the most common autoimmune disorders in the U.S., with an estimated prevalence of 2-4 million people. The autoimmune-mediated damage of the salivary and lacrimal glands in Sjögren's syndrome leads to a decrease in the production of saliva and tears and to the development of dry mouth and dry eyes. Without the lubricating and protective functions of saliva and tears, the oral and ocular surfaces are subject to infections and discomfort, leading to a significantly reduced quality of life [1,2].
Development of Sjögren's syndrome requires a complex interplay between a number of genetic, hormonal and environmental factors, most of which have not been defined. Genetic linkages, especially involving major histocompatibility complex (MHC) genes, have been reported for Sjögren's syndrome but it is not clear if, or how, the associated genes are involved in the development of the disease [3]. Additional non-MHC genes have also been linked with the development of Sjögren's syndrome.
In addition to genetic predisposition, some studies suggest that infection of a genetically-susceptible individual by a virus or other pathogen might trigger the development of an autoimmune disease [4]. The proposed mechanisms include activation of the innate immune system, release of self antigens from damaged or apoptotic tissues, and molecular mimicry that results in activation of T cells and/or B cells that react with tissue antigens [4].
Both the innate and the adaptive immune systems are involved in the pathogenesis of Sjögren's syndrome. The type I interferon (IFN) pathway, which plays an important role in the innate immune response to viruses, is also thought to play an important role in the development of Sjögren's syndrome and other autoimmune disorders, including SLE [5,6]. Moreover, type I IFNs can activate the adaptive immune system directly, by binding to IFN receptors on antigen presenting cells, T cells and B cells, or indirectly, by inducing the production and release of cytokines and chemokines that bind to these cells.
Autoantibodies to intracellular antigens, notably the nuclear proteins SSA/Ro and SSB/La, are found in the sera of many patients with Sjögren's syndrome. These autoantibodies are thought to develop when intracellular antigens, some of which have undergone proteolytic cleavage that reveals new antigenic epitopes, become "visible" to the immune system in membrane blebs on the surface of apoptotic cells [7]. Alternatively, antigenic epitopes from bacteria and viruses, including Epstein-Barr virus (EBV) and coxsackie virus, may act as molecular mimics that trigger the development of antibodies that cross-react with similar epitopes on target tissue autoantigens [2,8,9]. Although autoantibodies to intracellular antigens are useful in the diagnosis of Sjögren's syndrome, it is not clear if they play a direct role in the development of salivary gland and lacrimal gland damage and hypofunction. In contrast, autoantibodies to the M3 muscarinic acetylcholine receptor (M3R) have been directly implicated in salivary gland hypofunction in the nonobese diabetic (NOD) mouse model of Sjögren's syndrome [10]. Importantly, function-inhibiting anti-M3R autoantibodies are found in the sera of many patients with Sjögren's syndrome [11].
Current therapy for Sjögren's syndrome usually consists of palliative treatment that relieves the symptoms of dry eye and dry mouth, but fails to modify the underlying disease. Novel disease-modifying treatment strategies, based on recent immunological insights in Sjögren's syndrome and other autoimmune diseases, have met with mixed results [12]. For example, in recent clinical trials, treatment of Sjögren's syndrome patients with a B cell-depleting anti-CD20 monoclonal antibody (rituximab) led to significant improvement of the stimulated whole saliva flow rate and a reduction in parotid gland inflammation [13]. In contrast, TNFα inhibitors have been ineffective in the treatment of Sjögren's syndrome. Detailed studies on the immune response in Sjögren's syndrome patients treated with one of the inhibitors (etanercept) revealed an increase in the circulating levels of TNFα [14]. These results suggest that TNFα may not play a pivotal role in the disease and that other therapeutic targets must be identified.
Despite a large body of evidence gathered over the past 60 years, significant gaps still exist in our understanding of Sjögren's syndrome. Recent gene expression and proteomic studies have identified many genes and pathways that may play a role in the pathogenesis of Sjögren's syndrome [15][16][17]. However, validation of these data will require significant additional effort. As an initial step in this validation, we have compiled the published data on Sjögren's syndrome that are not derived from gene expression or proteomic studies. No such unifying database currently exists. Through data curation, the existing data have been uniformly formatted to allow systematic retrieval and comparison with newly generated gene expression data. As an example of its functionality, the Sjögren's Syndrome Knowledge Base (SSKB) was analyzed for biological functions and pathways that are likely to play a role in the disease.

Data mining
To catalog the existing knowledge in the field, we used text mining to generate the Sjögren's Syndrome Knowledge Base (SSKB) of published gene/protein data (http://sskb.umn.edu/) [18]. The focus of this database is on individually identified genes and proteins; thus, microarray experiments were not included. The raw data for the SSKB were extracted from PubMed [19] using the text mining program EBIMed (http://www.ebi.ac.uk/Rebholz-srv/ebimed/) [20] with the search term "Sjogren's Syndrome" restricted to "MeshHeadingsList". The foundational search identified over 7,700 abstracts and approximately 500 potential genes/proteins. The SSKB is continually updated by regular automated searches of PubMed followed by manual curation.
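A MeSH-restricted PubMed search of this kind can be sketched as follows. This is a minimal illustration and not the EBIMed workflow used for the SSKB: it assumes the NCBI E-utilities ESearch endpoint as a stand-in, and the function name `build_mesh_query_url` and the `retmax` default are hypothetical. The sketch only builds the query URL; it does not contact the server.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint. Assumption: used here as a stand-in;
# the SSKB raw data were retrieved with EBIMed, not with this service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_mesh_query_url(mesh_term: str, retmax: int = 100) -> str:
    """Return an ESearch URL that restricts `mesh_term` to MeSH headings."""
    params = {
        "db": "pubmed",
        # The [MeSH Terms] field tag limits matching to indexed MeSH headings.
        "term": f'"{mesh_term}"[MeSH Terms]',
        "retmax": retmax,      # maximum number of PubMed IDs to return
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_mesh_query_url("Sjogren's Syndrome")
print(url)
```

Fetching the URL would return matching PubMed identifiers; extracting gene and protein mentions from the abstracts is a separate step that this sketch does not cover.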

Curation of raw data
The identified abstracts were manually evaluated to remove duplicates and false-positives. In older publications, where gene names were not readily identifiable, names were assigned based on in-depth evaluation of the protein name context and available gene data in public databases, including the National Center for Biotechnology Information's Entrez search engine [21] and UniProt [22,23]. The SSKB includes data from human studies and animal models. For the genes identified in animal models, the human homolog was identified by automated ortholog search, using WebGestalt 2.0 [24,25]. These steps reduced the database to 477 current entries. The online database contains the fully curated data, currently 413 entries, and can be accessed at http://sskb.umn.edu. Updates and newly curated data are continually added.
The 477 entries were sorted to set aside autoantigens and viral/bacterial antigens, leaving 377 potential functional genes, which were used for enrichment and pathway analysis.
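The sorting step can be sketched as a simple partition. This is an illustration under stated assumptions: the record layout with a `category` field and the function name `split_entries` are hypothetical, and the entries are placeholders that reproduce only the category counts reported in this paper (85 autoantigens, 14 viral and 1 bacterial antigen, 377 remaining entries), not actual SSKB records.

```python
# Hypothetical record layout: each curated entry carries a "category" label.
ANTIGEN_CATEGORIES = {"autoantigen", "viral antigen", "bacterial antigen"}


def split_entries(entries):
    """Separate antigen records from candidate functional genes."""
    antigens = [e for e in entries if e["category"] in ANTIGEN_CATEGORIES]
    functional = [e for e in entries if e["category"] not in ANTIGEN_CATEGORIES]
    return antigens, functional


# Placeholder entries that mirror only the reported category counts.
counts = {
    "autoantigen": 85,
    "viral antigen": 14,
    "bacterial antigen": 1,
    "functional": 377,
}
entries = [
    {"id": f"{category}-{i}", "category": category}
    for category, n in counts.items()
    for i in range(n)
]

antigens, functional = split_entries(entries)
print(len(entries), len(antigens), len(functional))  # prints "477 100 377"
```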

Enrichment analysis
The 377 human gene entries were used for subsequent enrichment analyses in WebGestalt [24,25]. Gene enrichment in the SSKB gene set was compared to the human genome using the hypergeometric test with multiple test adjustment [26] and a significance level of P <0.01.
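The statistical idea behind this step can be sketched in a few lines. This is a minimal illustration, not WebGestalt's implementation: the function names are hypothetical, and the Benjamini-Hochberg procedure is assumed here as the multiple test adjustment, since the text does not name the method. The upper-tail hypergeometric probability is the chance of seeing at least k annotated genes in a set of n genes drawn from a genome of N genes, K of which carry the annotation.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for n genes drawn from N, of which K carry the annotation."""
    total = comb(N, n)
    # comb() returns 0 for impossible draws, so no extra bounds check is needed.
    favorable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)
    )
    return favorable / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy numbers, not taken from the SSKB: 10 genes in total, 4 annotated,
# 3 genes in the test set, 2 of them annotated.
p = hypergeom_upper_tail(k=2, n=3, K=4, N=10)
print(round(p, 4))  # prints 0.3333
```

In practice each Gene Ontology term or pathway yields one such p-value, and the adjustment is applied across all tested terms before the significance cutoff is used.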
The Gene Ontology [27,28] was accessed with WebGestalt and analysis was restricted to processes and functions represented by two or more genes. Pathway analysis was performed with WebGestalt in the Kyoto Encyclopedia of Genes and Genomes (KEGG) [29,30]. The selection was restricted to pathways with 4 or more genes represented, resulting in identification of 72 KEGG pathways. The "salivary secretion" pathway (KO04970) was recently added to KEGG (11/9/10) and was not included in this analysis. This pathway contains 59 genes, seven of which are found in the SSKB gene set.
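The minimum-gene filter described above can be sketched as a set intersection per pathway. This is an illustration only: the function name `pathways_with_min_genes`, the pathway names, and the gene labels are all hypothetical placeholders, not KEGG or SSKB data.

```python
def pathways_with_min_genes(pathways, gene_set, min_genes=4):
    """Return {pathway: sorted overlapping genes} for pathways with enough hits."""
    selected = {}
    for name, members in pathways.items():
        overlap = gene_set & set(members)
        if len(overlap) >= min_genes:
            selected[name] = sorted(overlap)
    return selected


# Placeholder pathway memberships and gene set (hypothetical labels).
toy_pathways = {
    "pathway_A": ["G1", "G2", "G3", "G4", "G9"],
    "pathway_B": ["G1", "G5", "G6"],
    "pathway_C": ["G2", "G3", "G4", "G7", "G8"],
}
toy_gene_set = {"G1", "G2", "G3", "G4", "G7"}

selected = pathways_with_min_genes(toy_pathways, toy_gene_set)
print(sorted(selected))  # prints ['pathway_A', 'pathway_C']
```

Only the pathways that pass this filter would then be tested for enrichment.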

Utility and discussion
We constructed a database containing proteins and genes associated with Sjögren's syndrome in human disease or animal models, as identified by text mining of published data. The public SSKB currently contains 413 genes/proteins and can be viewed online (http://sskb.umn.edu/). All genes have been assigned gene symbols and UniProt IDs, which allows rapid retrieval of gene-specific data from external databases. The SSKB database can be used to determine whether a list of genes is enriched with known Sjögren's syndrome genes and to carry out a functional enrichment analysis (hypergeometric distribution). Individual genes and the corresponding gene products, synonyms and alternate names can be searched by using a web browser search function. Autoantigens, viral antigens and bacterial antigens are separately identified under "Antigens". The SSKB is continually maintained and updated and new genes are added as their analysis is completed.
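Searching by symbol, synonym or alternate name can be sketched as a case-insensitive alias index. This is an illustration, not the SSKB website's search mechanism: the record layout and function names are hypothetical, and the alias assignments shown are illustrative examples that should be checked against UniProt or another reference before use.

```python
def build_alias_index(records):
    """Map each lower-cased symbol or alias to the set of matching gene symbols."""
    index = {}
    for record in records:
        for name in [record["symbol"], *record["aliases"]]:
            index.setdefault(name.casefold(), set()).add(record["symbol"])
    return index


def lookup(index, query):
    """Return the sorted gene symbols whose symbol or alias equals `query`."""
    return sorted(index.get(query.strip().casefold(), set()))


# Illustrative records (assumed aliases, not an export of the SSKB).
records = [
    {"symbol": "TRIM21", "aliases": ["SSA/Ro", "Ro52"]},
    {"symbol": "RO60", "aliases": ["SSA/Ro", "Ro60", "TROVE2"]},
    {"symbol": "SSB", "aliases": ["SSB/La", "La"]},
    {"symbol": "CHRM3", "aliases": ["M3R"]},
]

index = build_alias_index(records)
print(lookup(index, "ssa/ro"))  # prints ['RO60', 'TRIM21']
print(lookup(index, "M3R"))     # prints ['CHRM3']
```

Storing a set of symbols per alias keeps ambiguous names, such as an antigen name shared by more than one gene product, from silently resolving to a single gene.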
Based on the abstracts used to retrieve the SSKB genes/proteins, 85 proteins were initially characterized as autoantigens and 15 proteins were characterized as viral (14) or bacterial (1) antigens. Not surprisingly, SSA/Ro and SSB/La were among the most frequently retrieved autoantigens. It has been proposed that viral or bacterial antigens act as autoimmune triggers by molecular mimicry of endogenous human proteins [2,8,9]. However, eight of the 14 putative viral antigens in SSKB were selected for BLAST analysis, which did not identify strong sequence similarity with human proteins (not shown).
The 377 proteins not identified as autoantigens or microbial antigens were considered candidates for functional genes that could play a role in the initiation and progression of Sjögren's syndrome. Since the gene list contains data from humans and animals, the corresponding human genes were identified, with the assumption that genes identified in animal models of Sjögren's syndrome may also be involved in the human disease.

Gene ontology
The Gene Ontology database [27] was queried to identify the biological processes, cellular components and molecular functions associated with genes in the SSKB (Table 1). The 40 most highly enriched entries were identified in each category.
The most highly enriched biological processes (19 of 40; 18 of the top 20) were associated with immune function, including leukocyte proliferation, leukocyte activation, and regulation of the immune response. Other prominent biological processes were associated with apoptosis and cell death. Thus, the SSKB data set is consistent with recent microarray data [16] and reflects current models for the biological processes involved in the pathogenesis of Sjögren's syndrome [5,31,32].
The most highly enriched cellular component was the calcineurin complex, which plays a major role in the activation of T cells. Interestingly, in placebo-controlled clinical trials, treatment of Sjögren's syndrome patients with eye drops that contain the calcineurin inhibitor cyclosporine led to significant improvement in several of the signs and symptoms of dry eye [33].
Other highly enriched cellular components include the following. 1) Platelet alpha granules: although platelet activation has been reported in the salivary glands of Sjögren's syndrome patients [34], a direct search of PubMed for "platelet alpha granules" with "sjogren's" did not retrieve any published studies. Thus, while the proteins identified were retrieved from the literature, their potential association with platelet alpha granules in Sjögren's syndrome has not previously been noted. 2) MHC protein complexes, which are presumably involved in the presentation of autoantigens [16]. 3) Protein-lipid complexes and lipoprotein particles: their association with Sjögren's syndrome may be consistent with changes in serum lipid levels in Sjögren's syndrome patients [35], although the prevalence of anti-phospholipid antibodies is low in Sjögren's syndrome [36]. 4) Nerve terminals and axons, which is consistent with the known neurological component of Sjögren's syndrome [37].
In the molecular function category, nitric oxide synthase (NOS) activity was the most highly enriched, although only three genes (NOS1-3) were identified. Nitric oxide (NO) signaling appears to be directly affected in salivary and lacrimal glands in Sjögren's syndrome [38]. Other highly enriched molecular functions include chemokine and cytokine activity/receptor binding (8 of the top 15) and peptidase activities.

Pathway analysis
The SSKB gene list was submitted to KEGG [29] to identify biological pathways potentially associated with Sjögren's syndrome. A total of 72 KEGG pathways showed highly significant enrichment (P <0.001) in this analysis (Table 2). The pathway analysis revealed dominant pathways associated with immune regulation. Indeed, the eight most highly enriched pathways were associated with antigen presenting cells and activation of T cells and B cells.
Several cancer associated pathways were identified. This is partly due to the overlap between cancer pathways. These pathways typically include cytokine or growth factor stimulation of cell cycle and cell death and were not further analyzed.
Pathways associated with apoptosis, cytokine signaling and inflammation were also highly enriched. To focus on the events associated with initiation of Sjögren's syndrome, we analyzed pathways with known triggers. Several of the highly enriched pathways are triggered by bacterial toxins, viral DNA, or viral RNA. These include signaling pathways for Toll-like receptor, NOD-like receptor, RIG-I-

Overlap with other autoimmune diseases
The KEGG pathways include several pathways for autoimmune diseases, including type I diabetes mellitus, autoimmune thyroid disease, and SLE. While about 50% of the genes associated with the first two pathways are also associated with Sjögren's syndrome, only 16 Sjögren's syndrome genes were identified in the 140-gene SLE pathway (KEGG ID: hsa05322). These findings suggest that significant differences exist in the pathogenesis of autoimmune diseases.
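The size of the SLE overlap can be made explicit with a short calculation. This is a worked arithmetic sketch using only the two counts given above (16 shared genes, 140 genes in the pathway); the function name `overlap_fraction` is hypothetical.

```python
def overlap_fraction(shared_genes, pathway_size):
    """Fraction of a pathway's genes that also appear in the gene set."""
    if pathway_size <= 0:
        raise ValueError("pathway_size must be positive")
    return shared_genes / pathway_size


# Counts reported in the text for the KEGG SLE pathway (hsa05322).
sle = overlap_fraction(16, 140)
print(f"SLE pathway overlap: {sle:.1%}")  # prints "SLE pathway overlap: 11.4%"
```

An overlap of roughly 11% is well below the approximately 50% reported for the type I diabetes mellitus and autoimmune thyroid disease pathways.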

Conclusions
The results of this analysis can serve as a background and comparison for the increasing number of gene expression data sets available for Sjögren's syndrome, e.g. [15][16][17]. Preliminary analysis of such data sets suggests that the biological pathways identified in the SSKB are very similar to those identified in human parotid tissue but quite different from those identified in human labial salivary glands [15]. Future analyses will further define these differences and focus on the comparison of biological pathways identified in human tissues and mouse models of Sjögren's syndrome. It is envisioned that the SSKB data can also serve as the starting point for literature reviews and literature-based validation of identified genes; for functional gene enrichment studies, protein-protein interaction networks and other bioinformatics analyses; for deriving gene sets for SNP set enrichment analysis (pathway-based GWAS); and for defining gene sets for gene set enrichment analysis (GSEA).