
Soft Comput (2006) 10: 374–381 DOI 10.1007/s00500-005-0497-5

FOCUS

Jonathan D. Wren

Using fuzzy set theory and scale-free network properties to relate MEDLINE terms

Published online: 27 April 2005 © Springer-Verlag 2005

Abstract Automated construction and annotation of biological networks is becoming increasingly important in bioinformatics as the amount of biological data increases. At the center of this are metrics required for relating biological entities such as genes, diseases, signaling molecules and chemical compounds. Co-occurrence of terms within abstracts is widely used to establish tentative relationships because it is easy to use, implement and understand, and is reasonably accurate. However, it is also very imprecise: the cutoffs for how many co-occurrences of terms are necessary to establish a relationship are arbitrary, and the nature of the relationship is generic. Since the frequency of co-occurrence for terms usually follows a scale-free distribution, this property can be exploited to define degree of membership in fuzzy sets. Beginning with a set of co-occurrences for any biomedical term (or its synonyms), relations are defined by the overlap of sets, normalizing by the area under the curve that the two sets share. The ability of this method to rank the relative specificity of biological relationships is tested by comparing cumulative term co-occurrences within 7.5 million MEDLINE abstracts with focused summaries of gene function and disease association within LocusLink. On average, the fuzzy set ranking (FSR) was in the top 0.6% of all potential associations, showing a good correlation between domain overlap and the biological association between two terms.

Keywords Fuzzy sets · Text mining · MEDLINE · Biomedical terminology · Inference

J. D. Wren Advanced Center for Genome Technology, Department of Botany and Microbiology, The University of Oklahoma, 101 David L. Boren Blvd., Room 2025, Norman, OK 73019, USA E-mail: [email protected] Tel.: +405-325-3415 Fax: +405-325-3442

1 Introduction

The amount of available text is increasing rapidly in most scientific fields, but perhaps even more so for biology and medicine. This is reflected by the growth in the number of scholarly journals published and number of total records indexed in biomedical literature reference databases such as MEDLINE. There are many different areas of research, a broad and non-comprehensive list of which includes genetics, medicine, chemistry, pharmacology and biology. In each of these fields, there are efforts underway for large-scale cataloging and analysis of the terms within them. For example, in biology and genetics, efforts are underway to define gene ontologies [9,23], construct networks of genetic interactions [10], and define protein-protein contacts [4,14,24]. In medicine, there is a continual need to catalog diseases in terms of associated phenotypes and identify “clusters” of diseases that share similar etiology, pathogenesis or clinical presentation. In pharmacology, there is a need to catalog phenotypic actions of drugs and chemical compounds. For each of these large-scale efforts, there is a need both to identify terms of interest within the biomedical literature and to assess whether or not a relationship exists between any two given terms (e.g. gene-disease, disease-phenotype, drug-phenotype, etc.). Complicating the matter is that some relations may not be explicitly specified within text. This last aspect of association, the concept that new information could be inferred from existing information, is both particularly valuable and particularly challenging. Most current methods to define the conceptual strength of biological relationships, explicit or not, have significant limitations.

2 Term co-occurrence

The general approach to associating terms by searching for their co-occurrence within text has been used in many fields as a simple, yet comprehensive way to identify potential associations. In biology and medicine, co-occurrence has been used to identify potential relationships between genes [10,18], proteins [5] and drugs [15]. The disadvantage of this approach is that associations are very general; that is, no specifics on how two terms are related or associated are obtained by this method. In addition, the more frequently two terms co-occur, the more general and uninformative the relationship tends to be (e.g., consider word frequency distributions). The advantages are that it is easy to implement and comprehensive. In general, it is easier for users to deal with false positives that might arise from an irrelevant association than it is for them to deal with false negatives, that is, associations that were missed because a more specialized method, such as natural language processing (NLP), was employed.
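As a minimal sketch of the co-occurrence approach described above, the following Python snippet counts how many abstracts mention each pair of terms. The function name `count_cooccurrences` and the toy corpus are hypothetical illustrations, not part of the original study, and the sketch assumes that terms have already been recognized in each abstract.

```python
from collections import Counter
from itertools import combinations

def count_cooccurrences(abstract_terms):
    """Count, for every unordered pair of terms, the abstracts mentioning both."""
    pair_counts = Counter()
    for terms in abstract_terms:
        # Deduplicate within an abstract and sort so each pair has one canonical order.
        for a, b in combinations(sorted(set(terms)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

# Hypothetical toy corpus: each inner list holds the terms recognized in one abstract.
toy_abstracts = [
    ["fibromyalgia", "fatigue", "insomnia"],
    ["fibromyalgia", "fatigue"],
    ["fatigue", "anxiety"],
]
pair_counts = count_cooccurrences(toy_abstracts)
```

The counts say only that two terms appear together, not how they are related, which is the generality the section points out.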

3 Natural language processing

Natural language processing (NLP) is a good way to associate terms in a specific manner (e.g. “A binds B”), but has some drawbacks as a means of defining a relationship. First, parsing and semantics-based NLP is computationally expensive, usually designed to analyze relationships between all or most terms within a sentence. Second, existing NLP methods are often difficult to incorporate and adapt for purposes other than their original one. Third, almost all NLP methods have difficulty in anaphora resolution (determining the correct object of a pronoun reference located in a preceding sentence) and detecting subtle or implicit relationships. Finally, and perhaps most importantly, semantics-based and parsing NLP is better at detecting direct grammatical relationships rather than conceptual relationships, which may not be explicitly stated. For example, older articles tend to refer to type 2 diabetes as “adult onset diabetes” or “mature onset diabetes”. Assuming a body of text does not explicitly state this synonymy, it would not only be useful to recognize, computationally, that these are synonymous concepts, but to identify related concepts such as transient neonatal diabetes mellitus or type 1 diabetes.


4 Mutual information measure

To correct for the abundance of terms within a body of literature, a mutual information measure (MIM) [16] can be taken to reflect the strength of a relationship, and has been successfully used to identify dependencies among terms within text [6]. Mutual information detects informational dependencies between variables and, when used for textual co-occurrence analysis, it determines whether the frequency of term co-occurrence is greater or less than the number of times that would be expected statistically, given the relative frequency of the two terms within the literature. This would perhaps be the best way of defining the strength of relationship, except it does not allow for terms to be related that have not appeared together within the literature. This will be especially problematic for new terms entering the literature, which brings up the problem of scaling: when a term has been mentioned only a few times within the literature, any association with another term will have an extremely high MIM. The converse is true for terms that are mentioned frequently in the literature: even if one highly abundant term is strongly relevant to the definition/activity of another term, its abundance in the literature will equate to a relatively low MIM.

5 Using fuzzy sets to define relatedness

The goal of this study was to test how well fuzzy set theory (FST) [11] performed in identifying conceptually related terms. Two related terms tend to both be mentioned with other terms related to each one individually. That is, these two terms will tend to be related to one another by some set of common functions, components, processes or associations. A term “A” will appear with another term “C” within the context of a set of n terms {B} that define the nature of their association. Given a set of defined terms, a catalog of all co-citations of these terms within abstracts can be constructed, developing a network of co-citations. This network can then be queried for a term A to identify all co-cited terms {B}. The same can be done for another term, C, to identify all terms co-cited with it. A list of terms co-cited with C as well as A (Fig. 1a) can then be compared to define how A and C overlap. The overlap of these related term sets would then define to what degree A is related to C and vice versa (Fig. 1b). FST replaces the two-valued set-membership function with a real-valued function, that is, membership of C in A is treated as a probability, or as a degree of relatedness. When asserting a relationship, a real value is assigned to assertions as an indication of their degree of relatedness. This value of relatedness would range from 0 (unrelated) to 1 (identity). In the case of an identity relationship, we would expect that the two terms are essentially synonymous. Most terms we would expect to be related to a lesser degree.

Fig. 1 a Two terms within MEDLINE, A and C, share a set of relationships (co-citations), with {B}. Here, terms are represented as nodes and relationships as lines. b This could equally be represented in terms of the relatedness of two sets, showing where relationships are shared and where there is no overlap between the sets
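The picture in Fig. 1 can be sketched with ordinary (two-valued) sets before any fuzzy weighting is applied. In the Python snippet below, the dictionary `network`, the function `shared_terms` and the term labels are hypothetical names chosen for illustration; the sketch only shows how the shared set {B} of two terms A and C would be read off a co-citation network.

```python
# Hypothetical co-citation network: term -> set of terms it has been co-cited with.
network = {
    "A": {"B1", "B2", "B3", "B4"},
    "C": {"B3", "B4", "B5"},
}

def shared_terms(network, a, c):
    """Return the co-cited terms common to both a and c (the overlap in Fig. 1b)."""
    return network.get(a, set()) & network.get(c, set())

overlap = shared_terms(network, "A", "C")
only_a = network["A"] - overlap  # co-cited with A but not with C
only_c = network["C"] - overlap  # co-cited with C but not with A
```

Here every relationship counts equally, present or absent. The fuzzy treatment developed in the paper replaces this binary membership with a real-valued strength for each shared term.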


6 Not all relationships are created equal

Figure 1 depicts each relationship as being binary in nature, either present or not. Yet, in a co-citation network there is a degree of uncertainty as to whether or not a co-citation really reflects a meaningful relationship. In general, the more co-citations between two terms, the higher the probability is that a relationship exists between the terms. Additionally, important relationships tend to be mentioned more frequently than secondary and tangential relationships. The number of co-citations for nouns within a corpus typically follows a scale-free or power-law distribution [3]. Figure 2, for example, shows the distribution in the number of other terms co-cited with the disorder Fibromyalgia within all electronically available MEDLINE abstracts, sorted in descending order by how many co-citations were found. Fibromyalgia is a disorder, predominantly affecting females, in which patients experience musculoskeletal pain, depression, anxiety, loss of sleep and chronic fatigue. All these terms are found within the top 20 most co-cited terms in this graph, reflecting their relative importance with relation to this disorder. Clearly, from a conceptual standpoint (e.g., function, definition, etiology), these terms could not be treated as being of equal importance as those co-cited only a few times.

Fig. 2 For each biomedical term, the number of co-occurrences also follows a scale-free distribution when sorted in descending order. Shown here is the number of co-occurrences of other terms with the term “fibromyalgia”

7 Defining overlap between fuzzy sets

The frequency of co-citation within MEDLINE abstracts is used to reflect a measure of the strength of the relationship between terms. When analyzing a term, A, in the context of other terms that overlap with its set of relationships, it is useful to visualize the sum of all related terms as the area under the curve (AUC), shown in Fig. 3. Here, a hypothetical term, A, is shown related to a set of concepts, {B}, sorted in descending order of the frequency of co-citation between the concepts A and each term in {B} (Fig. 3a). Another term, C, may also be related to some of these terms in the set {B}, and the overlap between C and A can be reflected as the intersection between these two sets or the AUC shared by these two terms (Fig. 3b). Complicating matters is that C may not be related to the terms in {B} with the same relative strength that A is. Taking a minimal overlap within this intersection should provide a more accurate way of reflecting the relative strength of these shared relationships. Because A and C will likely be mentioned within the literature at different frequencies, this will need to be taken into account prior to analyzing their domain overlap. Thus, when calculating the AUC, these values are normalized so that the summation of all relationships to a term always adds up to 1.

Fig. 3 a Any given term, A, will have a set of relationships with other terms, {B}. The total strength of all relationships can be calculated as the area under the curve (AUC). b Plotted here are only those terms in the set {B} that are shared between A and C (A intersect C). This defines the relative importance of the shared relationships. c The term C, however, may not be related to the same terms within the set {B} to the same degree. This could be reflected in several different ways, but shown here is the intersection of the lesser of the two strengths

8 Previous research

Defining the relatedness of two terms by the relative overlap of their sets permits an analysis of terms that have been associated with one another as well as an analysis of terms that have not been associated with one another. Terms that have not been associated with one another, yet share associations, can be defined as implicitly related to one another. One could use implicit relationships to infer the possible existence of a new relationship that has not yet been documented. This idea that novel relationships within text could be computationally identified based upon existing relationships has its roots in an approach developed by Swanson [19], who used software to identify words shared between article titles. Using their software, called Arrowsmith, Swanson and Smalheiser were able to identify common intermediates between Raynaud’s disease (a circulatory disorder restricting blood-flow to the extremities) and the dietary effects of fish oil, leading to the hypothesis and subsequent proof [7] that compounds within dietary fish oil could alleviate the symptoms of Raynaud’s disease [17,19]. To explain how reasonable hypotheses could go unnoticed by researchers in separate fields, Swanson coined the term “non-interactive literatures”. While pioneering, a keyword-based method such as Swanson and Smalheiser’s is both limiting and highly burdensome, especially where a large body of literature is concerned, because the number of unique keywords grows quickly per record analyzed. For the reasons specified earlier regarding the use of co-occurrence of terms, it is also limited.

9 An approach based upon fuzzy set theory (FST)

In summary, there is a strong and growing need to evaluate the conceptual relatedness of terms in biomedicine, whether it is to aid in constructing ontologies, annotate genetic databases, or infer new relationships. Methods of accomplishing these tasks by evaluating co-occurrence of terms and/or annotated fields (e.g. record annotations such as medical subheadings) are limited and generally yield vague rather than specific relations. To overcome these difficulties and evaluate the conceptual relatedness of two terms, whether the terms have been mentioned together in the literature or no connection has previously been made, an approach was tested based upon membership within a fuzzy set. The domain of a given term was defined by the relative frequencies of all the other terms it was co-mentioned with in the literature. The overlap that a term had with any other term was a function of the terms they were both co-mentioned with and the relative importance of this shared term to both domains. The result was a method able to identify highly similar biomedical concepts and properties.

Fig. 4 Locuslink summary information for the gene ANK2 (Neuronal Ankyrin 2, Locuslink ID# 287). Terms recognizable within the Object Relationship Database are underlined

10 Materials and methods

Code was written in Visual Basic 6.0 (SP5) using ODBC extensions to interface with an SQL-based database, with database queries written in SQL. Programs were executed on a Pentium 4 3.06 GHz machine with 1 GB of RAM and two ultra-fast SCSI hard drives. The National Library of Medicine graciously provided an electronic archive of MEDLINE records in XML format. Abstracts were chosen as the body of text for analysis because of their electronic availability and because their brief, focused nature makes them a good source of pertinent information: each presumably contains a summary of the most important findings in its report. An even more focused report of genetic associations can be found in the Locuslink template database under the field “SUMMARY:”. For each gene within Locuslink where a function is known, a brief description of the gene’s function is provided. This description is highly focused, usually several sentences long. Figure 4 shows an example of a Locuslink summary entry. Thus, it can be expected that many relationships found within abstracts will not be present in this focused description. However, the advantage of using this field for analysis is that the description of gene function is necessarily restricted to only the most pertinent associations because of this limited space.

11 Recognizing terms of interest within MEDLINE records

Terms of interest in scientific research were first defined by assimilating database entries from relevant databases into one central database. This permitted both words and phrases to be identified within text, and synonymous terms to be mapped to primary terms. The goal in defining this list of terms to be recognized was to provide a set of terms that are studied with reference to each other. Genes, for example, are studied in terms of the diseases their mutation or disruption might cause. Diseases are studied in terms of the phenotypes associated with them, the metabolites that change as a consequence of disease etiology, and the drugs that might be used to treat them. Thus, a broad list of genes, diseases, phenotypes, chemical compounds, ontological assignments and drugs was assimilated into one central database for analysis. All electronically available MEDLINE records were then analyzed for associations between terms of interest by searching for their co-occurrence within abstracts, summing the total number found. All available literature was processed (a total of 12,899,016 records), creating a network of associations for each term.

Finding terms within scientific text was conducted by first inputting names from different research areas into a common object recognition database (ORD). Each MEDLINE record (both title and abstract) was read in, split into sentences and then each sentence split into an array of words for analysis. Iteratively, words within each sentence were checked against the ORD database, beginning with the longer phrases (up to four consecutive words) and proceeding to single words. This was done so that, for example, “breast cancer” would be matched preferentially over the more generic term “cancer”. Terms co-mentioned within an abstract, sentence or both were then recorded in a table for co-mentions.

Terms classified as diseases, disorders, syndromes or phenotypes were obtained from Online Mendelian Inheritance in Man (OMIM) [8]; chemical compounds and small molecules were obtained from the Medical Subject Headings (MeSH) database [12]; approved drug names were obtained from the Food and Drug Administration; genes were obtained from Locuslink [13]; and ontological classifications for genes were obtained from the Gene Ontology Consortium [9]. Acronyms for database terms were resolved within text, when encountered, using an acronym resolving general heuristic [22].

A total of 12,899,016 MEDLINE records with dates ranging from 1967 to September 2003 were processed. These records contained a total of 7,456,642 abstracts (not all records contain abstracts, especially early records). The assimilated databases contained a total of 181,364 unique terms that could be recognized within text. When including synonyms, the total number of recognizable phrases for these unique terms was 292,227 (e.g. “Non-Insulin Dependent Diabetes Mellitus” is a synonym for “Type 2 Diabetes”, and the two are treated equivalently). Most of these synonyms were provided by the databases assimilated; the rest were inferred by matching phrases to acronym-definition pairs using the program mentioned earlier [22], so that “Non-Insulin Dependent Diabetes Mellitus” would be matched with NIDDM as its corresponding abbreviation. After processing, however, only 112,805 of these unique terms were found within MEDLINE.
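The longest-phrase-first lookup described in this section can be sketched as follows. This is a simplified Python illustration, not the original Visual Basic code: the function `recognize_terms`, the miniature dictionary `toy_ord` and the whitespace tokenization are all assumptions made for the example, and punctuation handling and acronym resolution are left out.

```python
def recognize_terms(sentence, ord_terms, max_words=4):
    """Greedy longest-match lookup of known terms in one sentence.

    ord_terms maps a lowercase phrase (or synonym) to its primary term.
    """
    words = sentence.lower().split()
    found = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then progressively shorter ones.
        for n in range(min(max_words, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in ord_terms:
                found.append(ord_terms[phrase])
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no known term starts at this word
    return found

# Hypothetical miniature ORD; synonyms map to their primary term.
toy_ord = {
    "cancer": "cancer",
    "breast cancer": "breast cancer",
    "niddm": "type 2 diabetes",
    "type 2 diabetes": "type 2 diabetes",
}
hits = recognize_terms("BRCA1 mutations raise breast cancer risk in NIDDM patients", toy_ord)
```

Because the two-word phrase is tried before the single word, "breast cancer" is recorded instead of the more generic "cancer", and the synonym "NIDDM" is mapped to its primary term.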


(Not all of the assimilated database terms are derived from literature-based references, so there was no expectation that all of them would necessarily be present within MEDLINE.)

12 Establishing a fuzzy measure of relatedness

All terms co-occurring within a MEDLINE abstract were presumed to have a potential association. Because a goal of the study was to define a “fuzzy domain” for each term, those co-occurring terms that do not have any biologically relevant association with one another should not constitute a large portion of the AUC. The strength of association between two co-occurring terms, A and C, was furthermore weighted by the type (abstract or sentence) and frequency of co-occurrence. For each term, the strength of an association, $S_{ac}$, between two terms A and C, was defined as:

$$S_{ac} = 0.5\,F_t + 0.8\,F_s, \tag{1}$$

where $F_t$ is the number of times (frequency) A and C co-occurred within abstracts and $F_s$ is the number of times A and C co-occurred within sentences. These weighting factors were obtained during an earlier study to estimate the probability that two co-occurring terms reflected a biologically meaningful relationship. It was found that approximately 50% of abstract co-occurrences were biologically meaningful and approximately 80% of sentence co-occurrences were [21]. After all literature was analyzed, each term had a set of $t$ associations. The sum strength of all associations was calculated from the area under the curve:

$$\mathrm{AUC} = \sum_{n=1}^{t} S_n. \tag{2}$$
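Equations (1) and (2) can be illustrated with a short Python sketch. The function names and the toy counts below are hypothetical; only the weights 0.5 and 0.8 come from the text.

```python
ABSTRACT_WEIGHT = 0.5  # weight on abstract co-occurrences, Eq. (1)
SENTENCE_WEIGHT = 0.8  # weight on sentence co-occurrences, Eq. (1)

def association_strength(f_abstract, f_sentence):
    """Eq. (1): S_ac = 0.5 * F_t + 0.8 * F_s."""
    return ABSTRACT_WEIGHT * f_abstract + SENTENCE_WEIGHT * f_sentence

def area_under_curve(strengths):
    """Eq. (2): the sum of all association strengths of one term."""
    return sum(strengths.values())

# Hypothetical counts for a term A: co-cited term -> (F_t, F_s).
toy_counts = {"B1": (10, 5), "B2": (4, 0), "B3": (2, 1)}
strengths = {term: association_strength(ft, fs) for term, (ft, fs) in toy_counts.items()}
auc = area_under_curve(strengths)
```

With these toy counts, B1 contributes 0.5 × 10 + 0.8 × 5 = 9.0 of a total AUC of 12.8, so a single frequently co-cited term dominates the domain of A.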

To compare the domains of two terms, the AUC must be normalized so that all values add up to a constant number. Each association was then normalized in its strength by dividing it by the AUC so that the sum of all associations adds to 1. Each strength score is thus converted to an integral strength normalized (ISN) score, so that for the $n$th relationship in a set, the strength would be

$$\mathrm{ISN}_n = S_n / \mathrm{AUC}. \tag{3}$$
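A minimal Python sketch of the normalization in Eq. (3) follows. The function name `normalize_strengths` and the toy values are hypothetical, and returning zeros for a term with no associations is an assumption of this sketch, since the text does not discuss that case.

```python
def normalize_strengths(strengths):
    """Eq. (3): divide each strength by the AUC so the scores sum to 1."""
    auc = sum(strengths.values())
    if auc == 0:
        # Assumption of this sketch: a term without associations gets all-zero scores.
        return {term: 0.0 for term in strengths}
    return {term: s / auc for term, s in strengths.items()}

# Hypothetical strengths for two terms that differ greatly in literature abundance.
isn_common = normalize_strengths({"B1": 900.0, "B2": 100.0})
isn_rare = normalize_strengths({"B1": 9.0, "B2": 1.0})
```

Both toy terms end up with the same normalized profile (0.9 and 0.1), which is the point of the normalization: domains become comparable regardless of how often each term is mentioned.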

The fuzzy association between two terms can be considered equivalent to measuring the approximate overlap between their domains. With a list of associations and the normalized strengths for all terms identified within MEDLINE, three measures of fuzzy relatedness can be calculated from the associations common to both domains, each of which is graphically illustrated in Fig. 3. First, given a set of $o$ terms in the set {B} (see Fig. 1) that overlap with both the A and C domains, the relative strength of the shared terms with reference to A can be calculated by summing the normalized strengths $S_{n(a)}$ of the relationships between A and the shared terms:

$$\mathrm{ISN}_A = \sum_{n=1}^{o} S_{n(a)}. \tag{4}$$

$\mathrm{ISN}_A$ provides a measure reflecting the importance of the shared terms with reference to the domain of A. Similarly, one can calculate the same for the domain of C as:

$$\mathrm{ISN}_C = \sum_{n=1}^{o} S_{n(c)}. \tag{5}$$

A measure can also be taken to define the minimum overlap between these two sets, taking the minimum ISN values for the terms shared by both domains:

$$\mathrm{ISN}_{\min} = \sum_{n=1}^{o} \min\left(S_{n(a)}, S_{n(c)}\right). \tag{6}$$

The overlap between two research domains can be defined discretely (i.e. a finite, fixed number of terms are shared by two domains), but ascertaining the strength or nature of the relationship is not as straightforward because co-occurrence of terms appears with differing frequencies for each term involved. Thus, the use of standard set operations to compare two domains is somewhat limiting because the boundaries of such domains are not well defined, that is – they are “fuzzy”.
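The three overlap measures of Eqs. (4) to (6) can be sketched in Python as below. The function `overlap_scores` and the two toy domains are hypothetical; each domain is assumed to be a mapping from co-cited term to its normalized (ISN) strength, as produced by Eq. (3).

```python
def overlap_scores(dom_a, dom_c):
    """Return (ISN_A, ISN_C, ISN_min) for two normalized domains, Eqs. (4)-(6)."""
    shared = dom_a.keys() & dom_c.keys()  # the o terms common to both domains
    isn_wrt_a = sum(dom_a[t] for t in shared)                   # Eq. (4)
    isn_wrt_c = sum(dom_c[t] for t in shared)                   # Eq. (5)
    isn_min = sum(min(dom_a[t], dom_c[t]) for t in shared)      # Eq. (6)
    return isn_wrt_a, isn_wrt_c, isn_min

# Hypothetical normalized domains (each sums to 1).
domain_a = {"B1": 0.5, "B2": 0.3, "B3": 0.2}
domain_c = {"B2": 0.1, "B3": 0.4, "B4": 0.5}
isn_wrt_a, isn_wrt_c, isn_min = overlap_scores(domain_a, domain_c)
self_scores = overlap_scores(domain_a, domain_a)
```

In this toy case the shared terms B2 and B3 account for half of each domain, but the minimum overlap is only 0.1 + 0.2 = 0.3 because the two terms weight B2 and B3 differently. Comparing a domain with itself gives scores of 1 up to floating-point rounding.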

13 Results

Identification of co-occurring database terms within MEDLINE records yielded a network of 14,346,559 associations. The distribution of terms found in MEDLINE ranged from more general categories (e.g. “blood”, “tumor”, “stress”, “lesions”) that are found in a higher percentage of abstracts to the more specific. The frequency of terms when plotted follows a power-law distribution and resembles that of a scale-free network, which is reasonable given that new terms are typically studied in terms of their relationship to known terms (law of preferential attachment).

Fig. 5 Top 1,000 terms related to “fibromyalgia”, sorted by their ISNmin score. Number of co-citations between fibromyalgia and each term is plotted on the y-axis. Where there are gaps, no documented relationship was found. Not all gaps will be visible due to x-axis compression

14 Comparing FSR with other means of association

Continuing with the term “Fibromyalgia” as an example, the ISNmin scores for terms associated with fibromyalgia were sorted in descending order (x-axis) and plotted against the number of co-citations (y-axis). Comparing this new graph (Fig. 5) with the original distribution for the same term (Fig. 2), it is apparent that the same general distribution is preserved. Terms co-mentioned frequently tend to receive higher scores, but in many cases the rank order of terms has changed. The key issue now is: does this rearrangement of the rank importance of terms correlate with their “relatedness” as terms? Table 1 examines the top 20 term associations obtained from the ISNmin score in comparison with two other means one might use as a strength of association: the co-citation frequency and the number of shared terms. Note that in this table a less used, but synonymous term, “fibromyalgia syndrome” received the highest score and that fibromyalgia was more commonly referred to in the older literature as “Chronic Fatigue Syndrome” because chronic fatigue is the most defining and informative clinical presentation in determining diagnosis. These terms were ranked 2 and 3, respectively. Musculoskeletal pain and sleep disturbances (terms ranked 4 and 5) typically accompanied chronic fatigue, but were thought to be secondary. Now they are considered defining symptoms of fibromyalgia as a syndrome [20]. Another example of the ability of the FSR to group terms by relatedness can be seen by the immediate proximity of the terms “Sleep disturbance” and its plural variant “Sleep disturbances” (note that because the terms used for analysis came from different databases, they are sometimes spelled or phrased in different ways).

15 Model evaluation

The first 112 Locuslink genes within the ORD having both summary information as well as more than 20 and fewer than 2000 relationships found within MEDLINE were chosen for analysis. Each of these 112 genes was analyzed against each of the 112,805 unique terms within MEDLINE for their fuzzy set overlap. These genes themselves served as positive controls that the FSR was being calculated correctly: in each case the FSR of each gene to itself was one or very close to it (rounding errors sometimes prevent a value of exactly one being obtained). The number of terms in the ORD that shared at least one relationship with the genes being analyzed ranged from 38,028 to 82,378, with an average of 66,859. That is to say, this was the range in the number of terms with a positive FSR value in each analysis. For each of these 112 genes, this FSR estimation of relationship strength is derived from the processing of MEDLINE abstracts. Each of the focused gene summary fields provided by Locuslink (e.g. see Fig. 4) was then processed, as the MEDLINE abstracts were, to identify biomedical terms. Because experts provide each focused summary field describing a gene’s function, purpose and associations, each term found within this focused summary is presumed to be highly related to the gene being analyzed.


Table 1 Biomedical terms related to Fibromyalgia, sorted by their minimum integral strength normalized (ISNmin) score. Contrast this list with how the same terms would be ranked via other metrics such as the number of shared terms or the number of times A and C were co-cited within MEDLINE. Notice also that the ISNmin score ranks synonymous terms such as “fibromyalgia syndrome” and “chronic fatigue syndrome” highly. The most defining clinical presentations of this disorder appear at the top of this list

| A | # Shared terms (B) | (rank) | # Co-citations (A–C) | (rank) | C | ISNmin |
|---|---|---|---|---|---|---|
| Fibromyalgia | 552 | 1 | – | – | Fibromyalgia | 1.00 |
| Fibromyalgia | 128 | 2568 | 157 | 6 | fibromyalgia syndrome | 0.56 |
| Fibromyalgia | 286 | 414 | 190 | 3 | chronic fatigue syndrome | 0.44 |
| Fibromyalgia | 174 | 1562 | 68 | 15 | Chronic fatigue | 0.43 |
| Fibromyalgia | 184 | 1412 | 107 | 9 | musculoskeletal pain | 0.42 |
| Fibromyalgia | 219 | 948 | 57 | 21 | Sleep disturbance | 0.38 |
| Fibromyalgia | 208 | 1076 | 39 | 31 | Sleep disturbances | 0.36 |
| Fibromyalgia | 258 | 609 | 19 | 67 | Joint pain | 0.33 |
| Fibromyalgia | 457 | 21 | 267 | 1 | Fatigue | 0.33 |
| Fibromyalgia | 352 | 163 | 119 | 8 | Musculoskeletal | 0.33 |
| Fibromyalgia | 278 | 460 | 13 | 98 | insomnia | 0.32 |
| Fibromyalgia | 62 | 5875 | 56 | 23 | primary fibromyalgia | 0.32 |
| Fibromyalgia | 189 | 1339 | 17 | 73 | Tiredness | 0.32 |
| Fibromyalgia | 294 | 376 | 71 | 13 | rheumatic diseases | 0.31 |
| Fibromyalgia | 282 | 439 | 57 | 22 | Low back pain | 0.31 |
| Fibromyalgia | 85 | 4267 | 16 | 79 | poor sleep | 0.31 |
| Fibromyalgia | 419 | 51 | 123 | 7 | Anxiety | 0.31 |
| Fibromyalgia | 330 | 230 | 36 | 32 | Psychiatric disorders | 0.31 |
| Fibromyalgia | 108 | 3194 | 21 | 62 | temporomandibular disorders | 0.31 |

Terms receiving a positive FSR score were compared to the list of terms found within the focused summaries. The FSR list was ranked in descending order of the integral strength normalized score with reference to A (ISNA, illustrated in Fig. 3b). If a summary term was found, it was reported and given a relative ranking score (RRS), expressed in terms of a percent rank. An RRS of 1% means that the focused summary term was found within the top 1% of all terms within the FSR list (which, again, was derived from MEDLINE). If a term was not found in the FSR list, it was automatically assigned a value of 100%. Terms that appeared within the summary multiple times were only reported and evaluated once. Table 2 summarizes the performance of the FSR in ranking the relevance of MEDLINE terms to these 112 genes. For these 112 genes, there were a total of 14,007 possible terms within their corresponding Locuslink summaries (e.g., Fig. 4) recognizable by the system and also present within MEDLINE records. More than half of the terms (53%) found within the Locuslink summary field were ranked in the top 1% of the list, suggesting that FSR analysis is able to evaluate and rank the conceptual relevance of terms appearing across many thousands of MEDLINE abstracts. That is, the higher the domain overlap of two terms, the more likely the two are strongly related to one another. Table 3 summarizes the FSR analysis in terms of the minimum, average, standard deviation and maximum values for the number of relationships of each of the genes analyzed, the number of summary terms found for each gene, the ISNA scores obtained during analysis, and the relative ranking of summary terms.
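The relative ranking score can be sketched in Python as follows. The text does not give the exact percent-rank formula or say how ties were handled, so computing the RRS as position divided by list length is an assumption of this sketch, as are the function name and the toy data. The input `fsr_scores` is assumed to hold only terms with a positive FSR score.

```python
def relative_ranking_scores(fsr_scores, summary_terms):
    """Percent rank of each summary term within the FSR list (lower is better)."""
    ranked = sorted(fsr_scores, key=fsr_scores.get, reverse=True)
    position = {term: i + 1 for i, term in enumerate(ranked)}
    total = len(ranked)
    rrs = {}
    for term in set(summary_terms):  # each summary term is evaluated once
        if term in position:
            rrs[term] = 100.0 * position[term] / total
        else:
            rrs[term] = 100.0  # not in the FSR list: assigned 100%
    return rrs

# Hypothetical FSR list of 200 terms with distinct, descending scores.
toy_fsr = {f"t{i}": 1.0 - i / 1000 for i in range(200)}
rrs = relative_ranking_scores(toy_fsr, ["t0", "t1", "t0", "missing"])
```

In this toy list of 200 terms, the top-ranked term receives an RRS of 0.5%, the second 1%, and a summary term absent from the list receives 100%.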

Table 2 Evaluation of fuzzy set rankings (FSR) obtained from analyzing gene-term associations within MEDLINE. A total of 14,007 terms were found within a summary description of 112 Locuslink genes. Terms found within MEDLINE were then ranked for their associations with these 112 genes using the methods described. The relative FSR of summary terms found within the literature ranked high within the list of all possible associations (average # of associations = 66,859). A full list of all genes, summary terms and their rankings can be found at http://faculty-staff.ou.edu/W/Jonathan.D.Wren-1/Papers/FuzzySet%20Analysis.xls

| Rank | # | % |
|---|---|---|
| Top 0.1% | 3,211 | 23 |
| Top 0.5% | 5,722 | 41 |
| Top 1% | 7,380 | 53 |
| Top 5% | 11,011 | 79 |
| Top 10% | 12,147 | 87 |
| Top 25% | 13,160 | 94 |
| Top 50% | 13,593 | 97 |
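As a quick arithmetic check, the percentage column of Table 2 can be recomputed from the counts and the total of 14,007 summary terms. The variable names in this Python sketch are hypothetical; the numbers are those reported in the table and text.

```python
TOTAL_SUMMARY_TERMS = 14007  # total summary terms reported in the text

# Cumulative counts from Table 2: rank cutoff -> number of summary terms.
table2_counts = {
    "Top 0.1%": 3211,
    "Top 0.5%": 5722,
    "Top 1%": 7380,
    "Top 5%": 11011,
    "Top 10%": 12147,
    "Top 25%": 13160,
    "Top 50%": 13593,
}

# Percentage of all summary terms falling within each cutoff, rounded to whole percent.
table2_percent = {
    cutoff: round(100 * n / TOTAL_SUMMARY_TERMS) for cutoff, n in table2_counts.items()
}
```

For example, 7,380 / 14,007 is about 52.7%, which rounds to the 53% quoted in the text for the top 1% cutoff.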

Table 3 Statistics for the 112 genes analyzed, including the number of unique MEDLINE relationships for each gene, the number of terms found for these genes within the Locuslink summary description field, the ISNA score for the FSR analysis conducted for each of these summary terms, and the relative rank of these summary terms compared to all other terms with a non-zero FSR value

| | Min. | Avg. | StDev | Max. |
|---|---|---|---|---|
| # of unique MEDLINE relationships | 20 | 304 | 391 | 1925 |
| # terms within Locuslink summary | 3 | 125 | 72 | 250 |
| ISNA score | 0.00 | 0.60 | 0.26 | 1.00 |
| Relative Rank | 0.0001% | 0.06% | 0.14 | 99.9% |

16 Discussion

Fuzzy set ranking provides an excellent way of ranking the relatedness of terms by analysis of a secondary set of terms common to both terms being analyzed. This set of secondary terms collectively represents the biological association “domain” that each term occupies. Several measures were presented here to compare the overlap of domains, which is less than straightforward because each association between terms has its own strength and there are several ways that domain overlap could potentially be visualized. Potentially, FSR could also be used to identify synonymous terms, which would presumably share overlap not only in the number of shared terms but also the strength of association with each term.

Although this approach to identifying the relatedness of terms is tested here using biomedical data, this method is easily extensible to other areas by first identifying a set of terms that are of core interest to researchers in the field and then obtaining a body of literature that reports on associations or relationships between these terms.

Ranking the importance of co-occurring terms by their frequency of co-occurrence would likely also yield favorable results, perhaps comparable to the values obtained in Table 1 or even better. However, direct association measures such as co-occurrence of terms or the MIM will not be able to identify associations not explicitly found within the body of literature analyzed. Any method that relies strictly upon two terms appearing together will be limited to querying or analyzing what is known, yet often the most valuable information is what is not directly known but rather what is implied. Here, abstracts were analyzed, and the bulk of each scientific report was unavailable. Thus, information therein may not always be reflected in the abstract. Similarly, bodies of information may be sparse in other ways, perhaps incomplete or contained within diverse sources. The ability to analyze the relatedness of two terms by the FSR enables the user to infer relationships that may not have been explicitly specified.

Term co-occurrence was used here to reflect the relative strength of a relationship between two terms, but MIM could equally be used as an alternative way of reflecting the same concept.
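The inference of unstated relationships from domain overlap, as discussed above, can be sketched as follows. This is not the ISNA measure used in this study. It substitutes a generic fuzzy Jaccard overlap (sum of minimum memberships divided by sum of maximum memberships), and it assumes that a term's degree of membership is its co-occurrence count scaled by the largest count in the set. The function names, example terms and counts are hypothetical.

```python
def memberships(cooccurrence_counts):
    """Scale raw co-occurrence counts to [0, 1] degrees of membership.

    Assumption: membership = count / largest count in the set.
    """
    if not cooccurrence_counts:
        return {}
    top = max(cooccurrence_counts.values())
    return {term: n / top for term, n in cooccurrence_counts.items()}


def domain_overlap(counts_a, counts_b):
    """Fuzzy Jaccard overlap between the domains of two terms."""
    mu_a, mu_b = memberships(counts_a), memberships(counts_b)
    terms = set(mu_a) | set(mu_b)
    # Fuzzy intersection uses the minimum membership, fuzzy union the maximum.
    intersection = sum(min(mu_a.get(t, 0.0), mu_b.get(t, 0.0)) for t in terms)
    union = sum(max(mu_a.get(t, 0.0), mu_b.get(t, 0.0)) for t in terms)
    return intersection / union if union else 0.0


# Hypothetical co-occurrence counts for two terms that never co-occur
# with each other directly but share secondary terms.
gene_x = {"inflammation": 40, "apoptosis": 20, "insulin": 10}
disease_y = {"inflammation": 10, "apoptosis": 20, "hypoxia": 5}
overlap_xy = domain_overlap(gene_x, disease_y)
```

Here neither term appears in the other's co-occurrence set, yet the shared secondary terms give a non-zero overlap (0.4 in this toy case), whereas a direct co-occurrence measure would report no relationship at all.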
MIM was not employed here because it tends to penalize broad, widely mentioned concepts (e.g., depression, sleep) with lower relative scores, and it is these more general concepts that would be expected to appear within a summary statement such as that provided by LocusLink. It would be interesting to test how the use of MIM affects fuzzy set analysis but, for the reason just mentioned, this would probably be better evaluated in a context other than the one chosen here. For example, using mutual information as opposed to frequency of co-mention to define domain overlap would be expected to pair terms together not upon conceptual similarity, but upon informational similarity.

Acknowledgements This work was funded by NSF-EPSCoR grant # EPS-0132534. The author would also like to thank the National Library of Medicine for graciously providing MEDLINE records in XML format for analysis.

