Concept Similarity in Publications Precedes Cross-disciplinary Collaboration

Andrew R. Post, MD, PhD1, James H. Harrison, Jr., MD, PhD1
1Clinical Informatics Division, Dept. of Public Health Sciences, University of Virginia, Charlottesville, VA

Abstract

Innovative science frequently occurs as a result of cross-disciplinary collaboration, the importance of which is reflected by recent NIH funding initiatives that promote communication and collaboration. If shared research interests between collaborators are important for the formation of collaborations, methods for identifying these shared interests across scientific domains could potentially reveal new and useful collaboration opportunities. MEDLINE represents a comprehensive database of collaborations and research interests, as reflected by article co-authors and concept content. We analyzed six years of citations using information retrieval-based methods to compute articles' conceptual similarity, and found that articles by basic and clinical scientists who later collaborated had significantly higher average similarity than articles by similar scientists who did not collaborate. Refinement of these methods and characterization of the conceptual overlaps found could allow automated discovery of collaboration opportunities that are currently missed.
Introduction

Innovation is the introduction of ideas, processes or technologies that transform knowledge and/or practice in a domain, and it may occur when ideas from one domain are introduced and adapted for use in another.1 A form of cross-domain collaboration termed translational research is especially important in biomedical science for extending discoveries occurring in basic science to create innovation in clinical research and ultimately in community health practices.2 To stimulate this type of innovation, the National Institutes of Health has made the promotion of cross-domain and cross-disciplinary communication and collaboration a funding priority,3 in particular with NCI-designated Cancer Centers and the Clinical and Translational Science Awards.4 Cross-domain innovation requires communication of knowledge and is enhanced by collaboration, but barriers that are not directly addressed in current initiatives often impede the formation of these collaborations.5 Typical methods of communicating scientific knowledge, such as the literature with standard searching techniques, various conferences, and word-of-mouth, are overwhelmed by the volume of data being produced, inefficient for identifying related concepts across studies, and often hampered by domain-specific terminology, even for ideas and concepts that are shared.1,6,7 Because of these problems, many collaboration opportunities likely go unrecognized.

Recently developed online scientist collaboration networks such as Nature Network (network.nature.com) and Community of Science (www.cos.com) show promise for enabling investigators to find and communicate with other scientists with similar research interests. However, they require users to maintain summary "profiles" of their work, a time-consuming and subjective process that does not provide a standardized representation of research concepts and is prone to going out of date. Alternatively, automated methods may be able to identify collaboration opportunities by processing electronic data on nascent, established and past research that are increasingly available to research institutions, including the biomedical literature, grant application summaries and institutional review board protocols.6,8 Biomedical concepts that are contained in these documents and databases may be represented in controlled vocabularies and may serve as "research descriptors" that are proxies for scientists' research interests. Such descriptors might be compared and contrasted to assign scientists to domain groups and to match scientists who are likely to be effective collaborators across domains. Since these concept matches could be accomplished without revealing specific details of research, competitive concerns should be decreased.

We hypothesize that investigators who collaborate successfully tend to share concepts in their pre-collaboration work that draw them together and stimulate their initial interest in collaboration.
These shared concepts may be detectable in research descriptions, and concepts shared between scientists who have not collaborated may serve as markers for potential collaboration opportunities. Methods for identifying these opportunities may allow individual scientists to discover new collaborations, and institutions to promote collaborations likely to be effective. If this hypothesis is true, research descriptions from collaborating scientists that were written prior to the collaboration should be enriched in shared concepts relative to the general population of scientists. We have tested this assertion using the MEDLINE database as an example of a research description repository, with the MeSH index9 serving as a model research descriptor vocabulary. We created a prototype collaboration data model that supports representation of scientists' research descriptor profiles as lists of MeSH concepts, and we incorporated it into a software tool that identifies first-time collaborations across two specified domain areas and characterizes the collaborators' prior research interests. We used the software to determine whether publications of basic and clinical scientists who later collaborate have more concepts in common than publications of scientists who do not, and to characterize concept sharing among publications from these two populations of scientists.

AMIA 2008 Symposium Proceedings Page - 606

Methods

Translational collaboration model

Our proposed collaboration model is illustrated in Figure 1. The model represents a network of researchers from clinical and basic science, their publications, the journals in which they have published, and concepts associated with the content of their publications over a specified time window. The data elements are represented in detail in Figure 2. Scientists have a name and a domain category. Publications have a title, abstract, author list, journal name, and publication date. Journals have one or more topic categories that represent the general content of their articles. Concepts have a unique identifier from a controlled vocabulary. Links from scientists to publications represent authorship, and links from multiple scientists to a specific publication represent collaboration. Links from publications to concepts represent those articles' content, and links from journals to publications represent publication source.
Since publications are time-stamped, this model allows patterns in scientific collaboration (as reflected by publications resulting from collaborations), and the concepts describing those publications and collaborations, to be examined over time within and across domains of interest. The model classifies journals and scientists into "basic" or "clinical" domain categories according to the Web of Science (Thomson Scientific, Inc.), which maintains a list of biomedical journals by category. Scientists are classified as basic or clinical according to whether the majority of their publications in the specified time window are in basic or clinical journals.
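As a minimal sketch in Python (the language used for the implementation described in this paper), the data elements and the majority-rule classification above might be represented as follows. The class, field and function names are our own, and the sketch simplifies journals to a single domain category:

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Journal:
    name: str
    category: str  # "basic" or "clinical" (simplified to one category)

@dataclass
class Publication:
    title: str
    journal: Journal
    year: int
    authors: list = field(default_factory=list)  # scientist names
    concepts: set = field(default_factory=set)   # controlled-vocabulary IDs

def classify_scientist(name, publications):
    """Assign 'basic' or 'clinical' by the majority category of the
    journals containing the scientist's publications in the window."""
    counts = Counter(p.journal.category for p in publications
                     if name in p.authors)
    return counts.most_common(1)[0][0] if counts else None
```

Links from scientists to publications (authorship) are captured here by membership in each publication's author list; co-membership of two scientists in one author list represents collaboration.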

Figure 1. Translational collaboration model, instantiated for two domains: basic (B) and clinical (C). Circles represent scientists, with light-colored circles from domain B (top) and dark-colored circles from domain C (bottom). Squares represent publications, and arrows identify each publication that a scientist co-authored. The journals containing these publications are specified in the model but are omitted from this figure for clarity (see Figure 2). Scientist 3 from domain B and scientist 4 from domain C were co-authors of publication P7, and previously published separately in publications P1, P2, P3 and P4.
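The example network in Figure 1 can be instantiated in a few lines. This is an illustrative sketch with identifiers of our own choosing, which assumes scientist 3 authored P1 and P2 and scientist 4 authored P3 and P4 (the figure does not specify the assignment):

```python
# Domain labels and authorship links for part of Figure 1's network.
domain = {"S3": "B", "S4": "C"}
authors_of = {
    "P1": {"S3"}, "P2": {"S3"},  # scientist 3's prior publications
    "P3": {"S4"}, "P4": {"S4"},  # scientist 4's prior publications
    "P7": {"S3", "S4"},          # their later joint publication
}

def is_cross_domain(pub_id):
    """True when a publication's author list spans domains B and C."""
    return {domain[a] for a in authors_of[pub_id]} == {"B", "C"}

cross = [p for p in authors_of if is_cross_domain(p)]  # only P7 qualifies
```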

Figure 2. Class diagram of the collaboration discovery data model.

First-time cross-disciplinary collaborations are defined as publications with an author list containing at least one pair of scientists from both domains who were never previously co-authors within the time window. We define the concepts that motivate such collaborations as a subset of the concepts that appear in those authors' pre-collaboration publications. For each collaboration, we define the similarity of the basic and clinical scientists' prior research interests, as reflected by the concepts associated with their prior independent publications, using the cosine similarity measure,10 defined as the inner product of the pre-collaboration basic concept frequencies and clinical concept frequencies, normalized by the total number of concept instances. The cosine measure is commonly used in information retrieval for multi-term comparisons, and it returns similarity scores between 0 and 1. We multiply each concept's number of occurrences, or term frequency (TF), by a measure of the uniqueness of the concept in the time window called the inverse document frequency (IDF),10 defined as the log of the quotient of the number of citations in the window and the number of citations containing the concept in the window.

Conceptual overlap in basic and clinical publications

We obtained a list of the top twenty basic science and top twenty clinical journals from the Web of Science, ranked by impact factor, a measure of the frequency with which articles in a journal are cited by other articles. Journals that primarily publish reviews rather than original research articles were excluded. We extracted all citations in MEDLINE from those journals during a six-year time window (years 2001 to 2006), and downloaded the citations' author lists, MeSH descriptor concepts, titles and abstracts into a local relational database. For all citations in the time window, the frequencies of occurrence of MeSH concepts in basic versus clinical publications were determined, and their correlation was calculated to determine the extent to which concepts in the selected basic and clinical journals overlap.

Conceptual similarity in translational collaborations

Citations were split into those from years 1 through 5 and those from year 6.
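The TF-IDF weighting and cosine similarity described above can be sketched as follows. This is an illustration with our own function names, not the authors' code; it represents concept profiles as sparse dictionaries and uses the standard Euclidean-norm form of the cosine measure:

```python
import math
from collections import Counter

def tfidf_profile(publication_concepts, window):
    """Build a TF-IDF-weighted concept profile from a scientist's
    publications. TF is a concept's number of occurrences across the
    publications; IDF is log(N / n_c), where N is the number of
    citations in the window and n_c the number containing the concept."""
    tf = Counter(c for concepts in publication_concepts for c in concepts)
    n = len(window)
    return {c: freq * math.log(n / sum(1 for cit in window if c in cit))
            for c, freq in tf.items()}

def cosine_similarity(u, v):
    """Cosine of two sparse concept vectors: inner product over the
    product of their norms; it ranges from 0 to 1 here because all
    weights are non-negative."""
    dot = sum(u[c] * v[c] for c in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0
```

For a candidate basic-clinical pair, one would compare the two scientists' pre-collaboration profiles, e.g. `cosine_similarity(tfidf_profile(basic_pubs, window), tfidf_profile(clinical_pubs, window))`.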
We identified all citations from year 6 with at least one basic author and at least one clinical author (as defined earlier) who had published during years 1 through 5 but had never been co-authors; the year 6 citations were labeled "translational," and the citations in which those authors had previously appeared in years 1 through 5 were labeled "pre-translational." MeSH concept headings that appeared in the pre-translational citations were identified, and the concept pair-wise similarities of the basic scientists' prior publications versus the clinical scientists' prior publications were calculated as above. A random sample of 1,000 non-pre-translational publications from years 1 through 5 was also taken, and the pair-wise similarities of their MeSH concepts were computed. The average similarities of the non-pre-translational and pre-translational publications were compared using an independent t-test.11 The concept pair-wise similarities of individual scientists' publications (scientist self-similarity) during years 1 through 5 were also computed in a random sample of 1,000 scientists, and the average scientist self-similarity was compared to the average non-pre-translational similarity using an independent t-test. P-values were two-tailed. A log transformation was applied to the similarity scores, and a density plot and receiver operating characteristic (ROC) curve were plotted to determine whether concept pair-wise similarity can distinguish between pre-translational and non-pre-translational publications.

Implementation

MEDLINE extraction and processing were performed with scripts written in the Python programming language (www.python.org), using the Biopython software library (www.biopython.org) for MEDLINE access and an object-relational mapping library from the Django project (www.djangoproject.com) for database access. Statistical calculations were performed using the SciPy library (www.scipy.org) and R (www.r-project.org). Scripts were run on an Apple (Cupertino, CA) Xserve 2 GHz dual-processor PowerPC G5 server with 1 GB of RAM and the Mac OS X 10.4 Server operating system. Downloaded MEDLINE data were stored in a MySQL (Cupertino, CA) relational database.

Results

Conceptual overlap in basic and clinical publications

The six-year MEDLINE time window contained 47,704 citations from the twenty clinical journals and 50,079 citations from the twenty basic science journals. There were 218,093 authors listed in the basic science journal citations and 181,313 authors listed in the clinical journal citations. There were 95,167 unique "basic" scientists and 85,036 unique "clinical" scientists. The citations contained a total of 16,767 unique MeSH headings, and 9,441 of those headings (56% of the total) were shared between basic and clinical citations. The correlation of the frequencies of occurrence of MeSH headings in basic versus clinical citations was 0.72 (see Figure 3).

Figure 3. Log-log scatter plot of co-occurrence of MeSH headings in basic versus clinical citations (r=0.72).

Conceptual similarity in translational collaborations

In the split dataset (year 1 through 5 and year 6 publications), there were 2,635 citations in year 6 in which the author list contained at least one basic author and at least one clinical author (translational publications), representing 16% of citations for that year. Those authors published 5,182 citations in clinical journals and 7,243 citations in basic science journals (as defined above) during years 1 through 5, for a total of 12,425 citations (pre-translational publications). The translational publications had 5,926 unique MeSH headings, and the pre-translational publications had 7,472 unique MeSH headings, of which 4,400 were shared. Concept similarity results for MeSH headings are shown in Figure 4. Overall concept similarity was significantly greater (p < 0.0001) among publications of basic and clinical scientists who ultimately collaborated (pre-translational; mean=0.038, SD=0.054) than among those who did not (non-pre-translational; mean=0.0055, SD=0.021). The similarity between future collaborators resembled the similarity of the scientists' own work (self-similarity; mean=0.056, SD=0.11), and self-similarity was significantly greater than non-pre-translational similarity (p < 0.0001). ROC analysis, shown in Figure 5, shows that our model can distinguish pre-translational versus non-pre-translational scientist pairs based on log-transformed concept similarity with a sensitivity of nearly 0.7 at a false positive rate of 0.2 (AUC=0.77).

Figure 4. Density plot of log-transformed concept pair-wise similarities of non-pre-translational basic versus clinical publications (Non-pretranslational), pre-translational basic versus clinical publications (Pretranslational), and publications by the same scientist (Self-similarity).

Figure 5. ROC curve of log-transformed non-pre-translational versus pre-translational concept pair-wise similarities (AUC=0.77).

Discussion

We have developed a model of collaboration between scientists from two domains, and a method for evaluating and comparing the conceptual similarity of descriptions of collaborators' prior research. These descriptions could ultimately derive from various historic and current sources; for our initial studies, the scientific literature as represented and indexed in MEDLINE provides a convenient research description repository. We chose our initial domains to be basic and clinical science, and defined publications that include both basic and clinical scientists in their author lists to be "translational." This is a trivial definition of translational research, but we believe that it is a reasonable starting point for this work. Collaborations between basic and clinical scientists that do not meet a rigorous definition of translational research may still have intrinsic value, and we are likely to be able to refine our definition based on additional analysis of research descriptors from these domains.

The degree of overlap of MeSH concepts in basic and clinical publications (Figure 3) indicates that there is a relatively even spread of concepts between basic and clinical journals, with a substantial number of concepts shared between the domains. Thus it is reasonable to expect that basic and clinical scientists publishing in their respective literatures who share interest in the same concepts will share the coded representations of those concepts. The distributions of similarity scores between pairs of basic and clinical authors show that scientists who ultimately collaborated tended to share a greater number of concepts than those who did not, and the similarity of scientists who ultimately collaborated approached self-similarity (Figure 4). These populations can be discriminated reasonably well (Figures 4 and 5), though the two distributions have overlapping tails. The wide variance in similarity scores caused by these tails likely indicates that distinct but complementary research interests, and factors other than research interests alone, may also contribute to effective collaboration.
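The discrimination analysis reduces to a rank comparison of the two similarity populations; because the log transformation is monotonic, it leaves the ROC curve and AUC unchanged. A minimal AUC computation, a sketch of our own rather than the authors' code, uses the rank (Mann-Whitney) formulation:

```python
def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve: the fraction of (positive, negative)
    score pairs in which the positive score is higher, counting ties
    as half. Positive = pre-translational, negative = non-pre-translational."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

With similarity scores in hand, `roc_auc(pretranslational_sims, nonpretranslational_sims)` estimates the probability that a randomly chosen future-collaborator pair outscores a randomly chosen non-collaborating pair; an AUC of 0.77 corresponds to the separation reported above.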

We believe that these encouraging results represent a lower bound on the performance of our model. MEDLINE has limited capability for distinguishing between authors with the same name, and such authors, if they have dissimilar research interests, would be expected to add noise to the similarity scores and reduce them. Variable time lag in publication could also add noise by distorting the order in which the results of translational and pre-translational work appear in the literature. We did not prune concepts, or include partial matches based on hierarchical relationships between concepts. Some classes of MeSH concepts are likely to be irrelevant with respect to collaboration, and their presence would also be expected to add noise to the similarity scores. Adjusting these parameters, and identifying more precisely in our subsequent work the relative association of specific types of concepts with successful collaboration, may allow us to optimize our similarity scores and create a greater apparent separation between the non-pre-translational and pre-translational populations (Figure 4).

If basic and clinical scientists who participate in translational collaborations do share concepts, as these data suggest, then it may be possible to identify classes of concepts that reflect research interests associated with translational collaboration in general. These concept classes may allow for the creation of tools that can identify potential translational collaborators who share such concepts in their current and past research projects, using research descriptions from a variety of sources (listed above). Such a tool could be a powerful way for scientists interested in translational collaboration to find each other, and for institutional initiatives to identify and encourage such collaborations.

Conclusion

Distinguishing concepts associated with basic and clinical scientists who do and do not collaborate, using MEDLINE and MeSH as sources of collaboration and topic information, is feasible. Further work in optimizing these methods to better separate pre-translational and translational concepts, and in identifying classes of concepts that are relevant to collaboration formation, may ultimately allow for the creation of tools for discovering collaboration opportunities between scientists whose research can be described by these concept classes.

References

1. Kostoff RN. Systematic acceleration of radical discovery and innovation in science and technology. Technol Forecast Soc Change 2006;73(8):923-36.
2. Lenfant C. Shattuck lecture--clinical research to clinical practice--lost in translation? N Engl J Med 2003 Aug 28;349(9):868-74.
3. Zerhouni E. Medicine. The NIH roadmap. Science 2003 Oct 3;302(5642):63-72.
4. Zerhouni EA, Alving B. Clinical and translational science awards: a framework for a national research agenda. Transl Res 2006 Jul;148(1):4-5.
5. Sung NS, Crowley WFJ, Genel M, Salber P, Sandy L, Sherwood LM, et al. Central challenges facing the national clinical research enterprise. JAMA 2003 Mar 12;289(10):1278-87.
6. Pietrobon R, Guller U, Martins H, Menezes AP, Higgins LD, Jacobs DO. A suite of web applications to streamline the interdisciplinary collaboration in secondary data analyses. BMC Med Res Methodol 2004 Dec 14;4(1):29.
7. Yetisgen-Yildiz M, Pratt W. Using statistical and knowledge-based approaches for literature-based discovery. J Biomed Inform 2006 Dec;39(6):600-11.
8. Hurdle JF, Botkin J, Rindflesch TC. Leveraging semantic knowledge in IRB databases to improve translation science. Proc AMIA Annu Fall Symp 2007:349-53.
9. Medical Subject Headings (MeSH) fact sheet. 30 October 2007. Available from: http://www.nlm.nih.gov/pubs/factsheets/mesh.html. Accessed 12 March 2008.
10. Korfhage RR. Information storage and retrieval. New York: Wiley; 1997.
11. Freedman D, Pisani R, Purves R. Statistics. New York: Norton; 1998.
