
Semantic Data Integration for Toxicogenomic Laboratory Experiment Management Systems

Hyung-Yong Kim1, Sang-Min Lee2, Ga-Hee Shin1, Seung-Hun Lee1, Jun-Hyung Park1, Young-Rok Seo2 & Byeong-Chul Kang1

1 Insilicogen Inc, #909, Venture valley, 958, Gosaek-dong, Gwonseon-gu, Suwon-si, Gyeonggi-do 441-813, Korea
2 Department of Life Science, Dongguk University-Seoul, 26 Pil-dong 3-ga, Jung-gu, Seoul 100-715, Korea

Correspondence and requests for materials should be addressed to B. C. Kang ([email protected])

Received 3 September 2011 / Received in revised form 19 September 2011
Accepted 20 September 2011
DOI 10.1007/s13530-011-0091-4
©The Korean Society of Environmental Risk Assessment and Health Science and Springer 2011

Abstract

Microarray technology enables rapid screening of differentially expressed genes (DEGs) following exposure to various chemicals. When toxicogenomics is used for risk assessment, diverse and heterogeneous data contribute to each step, including genome sequence, genotype, gene expression, phenotype, and disease information. Ontology-based knowledge representations can therefore provide the semantics for relating drugs to a wide body of target information, and support standardized annotation, integration, and exchange of data. To derive the actual roles of the DEGs, it is essential to construct the interactions among DEGs and to link them to known disease information. We describe the reconstruction of semantic relationships among chemicals, diseases, and DEGs using omics data and raw laboratory experiment data in a toxicogenomic meta database that we constructed. Omics and experimental data can be easily uploaded and connected to the existing data network. This semantic data integration can reveal chemical-specific markers and target diseases from integrated toxicogenomic data, including complex expression profiles and experimental raw data. We expect that this system shows early promise in helping bridge the gap between pathophysiological processes and their molecular determinants.

Keywords: Semantics, Data-mining, Microarray, Toxicogenomics, Laboratory Information Management Systems (LIMS)

Introduction

The current science landscape is evolving rapidly and is increasingly driven by computational tasks. Biology in particular has become a data-rich discipline whose information is stored in thousands of databases. Recent advances in the biological sciences have led to an explosion of new data sources about genes, proteins, genetic variations, chemical compounds, diseases, and drugs. Through data mining and visualization techniques, this information could provide important insights into the complex functions of biological systems and the actions of chemical compounds or drugs on these systems. For example, it is considered increasingly important to profile existing and potential new drugs for their effects across many protein targets, not just a single target of interest1. In many previous works, ontology-based knowledge representations have successfully provided the semantics for relating drugs to a wide body of target information, together with standardized annotation, integration, and exchange of data. Furthermore, it is helpful to think of systems approaches as integrating genome- and proteome-wide analyses with context-specific inquiry through hypothesis-driven experimentation. The framework proposed by our group provides semantic data integration of toxicogenomic data for the representation and interpretation of omics-based investigations and laboratory information management systems (LIMS). Such approaches help scientists answer sophisticated questions that combine genomics, proteomics, toxicogenomics, and even classical toxicology. Furthermore, toxicant-specific profiles support biomarkers for toxicological applications and can be determined across chemical or pharmacological classes. Gene expression changes induced by toxic compounds have been interpreted using comprehensive databases touching on experimental, computational, and regulatory aspects.
In previous work, we presented a novel approach that uses a semantic knowledge base concerning environmental risk and toxicogenomics2, as an attempt at biological integration in systems toxicology. That study proposed a semantic model for organizing heterogeneous data types and introduced techniques and concepts used to represent the relationships in complex biochemical networks. The semantic modeling tool can therefore serve as an example of how a domain such as risk assessment is represented and how heterogeneous data types from omics data and LIMS are organized. In this work, the proposed data model is a project-centered, distributed platform that facilitates communication and collaboration in a research environment. In addition, it provides a personalized work environment that supports users and project groups. This environment allows researchers to focus specifically on generating knowledge in a particular scientific field. This document describes the basic steps from getting started with an empty instance of the Knowledge Management Environment to creating a productive “gene index” knowledge base. The objective is twofold: to develop quantitative models that realistically describe or predict the flow of information and the control mechanisms that determine cellular function in biological systems under different physiological conditions, and to provide an open platform for managing a user’s knowledge base for risk assessment of environmental hazards, which supports building chemical-gene-disease relationships as in CTD3. We also present an example that demonstrates how to connect heterogeneous data types between omics data and LIMS. In particular, raw data from experiments (such as ELISA, protein-protein interaction, and Comet assay data) provide additional evidence for the biological responses of interest following exposure to environmental toxicants, and microarray technology is used to investigate the underlying mechanisms of toxicity following exposure to toxicants such as cadmium and nickel.

Toxicol. Environ. Health. Sci. Vol. 3(3), 135-143, 2011
Cadmium, a heavy metal, has been classified by the International Agency for Research on Cancer as a human carcinogen; it is an environmental pollutant and occupational hazard4 arising from activities including mining, smelting, fossil fuel combustion, and industrial use5. Cadmium is known to produce reactive oxygen species even though it is a non-redox metal unable to participate in Fenton-type reactions6. Nickel compounds are widely released into the environment as a result of industrial and commercial development. Nickel is regarded as an industrial hazard in nickel refining, electroplating, and welding6. This transition heavy metal is a well-documented carcinogen and is known to induce skin allergies, lung fibrosis, and cancer6. It has also been reported to induce DNA damage indirectly through the generation of reactive oxygen species and subsequent inhibition of DNA repair6-8. Nevertheless, the precise molecular mechanisms of nickel carcinogenesis have not yet been clarified. A mechanism-based classification by DNA microarray would be an efficient method for evaluating the toxicity of environmental samples. Furthermore, experimental raw data from exposure to these chemicals provide supporting evidence for the DNA microarray results. Our study proposes a method for choosing chemical-specific markers and target diseases based on literature evidence, complex expression profiles, and experimental raw data. We expect that this approach shows early promise in helping bridge the gap between pathophysiological processes and their molecular determinants.

Results and Discussion

Semantic Data Modeling

To design an extensible data model, we employed mixed modeling techniques combining object-oriented and generic methods. The first step in building a knowledge base is to design the domain-specific data model in the Knowledge Management Environment. In the example use case shown here, we create a “gene index” knowledge base using the following seed data: a table of gene names, synonyms, EntrezGene identifiers (IDs), and Gene Ontology (GO) classifications. The data model acts as an interactive whiteboard, so semantic objects and their associated objects can be easily added, deleted, edited, and visualized. The graph viewer (a graph being the mathematical representation of objects as nodes and edges) allows this editing and extension while enforcing consistency within the data model. The data model consists of various types of elements, and the properties of each element are defined by its element type. Before adding data, relationships, and annotations for the elements, we added the element types to the data model. Then, to create the relationship between genes and proteins, we added a relation class that defines the relationship between the two element types. To add basic information (including synonyms) to an element, we added a new annotation form called “Basic information” to the data model and associated it with each element. Each element was mapped to external database entries integrated through the BioRS databanks (Biomax Informatics AG, Munich, Germany). In particular, genes were mapped to the EntrezGene databank and proteins to the UniProt Swiss-Prot databank. The Comparative Toxicogenomics Database (CTD) is a public resource that promotes understanding of the interactions of environmental chemicals with gene products and their effects on human health. Chemical-gene and chemical-protein interactions, as well as chemical-disease and gene-disease relationships, are curated from the literature, and these core data are integrated to construct chemical-gene-disease networks. Other popular databases, such as EMBL, UniGene, MEDLINE, and UniProt9, were also integrated. Finally, an ontology, a central concept in knowledge management, was applied to the data model. An ontology relates the conceptualization of a domain to the data model. We associated the GO ontology with the “Gene” elements.
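The modeling sequence described above (declare element types, add a relation class, attach an annotation form, map to external databanks, associate GO terms) can be illustrated with a minimal Python sketch. This is not the BioXM or BioRS API: the class names `Element` and `DataModel`, the relation name "encodes", and all gene, protein, and identifier values are hypothetical placeholders chosen for illustration. Only the element types, the “Basic information” form, the databank names, and the GO identifier come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """One element of the data model, e.g. a gene or a protein."""
    element_type: str
    name: str
    annotations: dict = field(default_factory=dict)    # annotation form -> values
    external_ids: dict = field(default_factory=dict)   # databank name -> identifier
    ontology_terms: set = field(default_factory=set)   # e.g. GO identifiers


class DataModel:
    """Toy model: element types must exist before elements or relations."""

    def __init__(self):
        self.element_types = set()
        self.relation_classes = {}   # relation name -> (source type, target type)
        self.elements = {}           # (type, name) -> Element
        self.relations = []          # (relation name, source key, target key)

    def add_element_type(self, name):
        self.element_types.add(name)

    def add_relation_class(self, name, source_type, target_type):
        for element_type in (source_type, target_type):
            if element_type not in self.element_types:
                raise ValueError(f"unknown element type: {element_type}")
        self.relation_classes[name] = (source_type, target_type)

    def add_element(self, element):
        if element.element_type not in self.element_types:
            raise ValueError(f"unknown element type: {element.element_type}")
        self.elements[(element.element_type, element.name)] = element

    def relate(self, relation, source, target):
        # Consistency check: a relation may only link the types it was declared for.
        expected = self.relation_classes[relation]
        if (source.element_type, target.element_type) != expected:
            raise ValueError(f"{relation} must link {expected[0]} to {expected[1]}")
        self.relations.append(
            (relation, (source.element_type, source.name),
             (target.element_type, target.name))
        )


model = DataModel()
model.add_element_type("Gene")
model.add_element_type("Protein")
model.add_relation_class("encodes", "Gene", "Protein")   # hypothetical relation name

gene = Element("Gene", "GENE_A")                          # placeholder gene
gene.annotations["Basic information"] = {"synonyms": ["GENE_A_ALIAS"]}
gene.external_ids["EntrezGene"] = "PLACEHOLDER_ID"        # not a real identifier
gene.ontology_terms.add("GO:0007131")                     # GO entry named in the text

protein = Element("Protein", "PROT_A")                    # placeholder protein
protein.external_ids["UniProt"] = "PLACEHOLDER_ACC"       # not a real accession

model.add_element(gene)
model.add_element(protein)
model.relate("encodes", gene, protein)
```

The check inside `relate` stands in for the consistency enforcement that the text attributes to the graph viewer: a gene-to-protein relation cannot be used in the opposite direction.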

Toxicogenomics Meta Database Focused on Heavy Metals

The Toxicogenomics Meta DB is constructed from diverse data (chemicals, genes/proteins, human diseases, references, vertebrate and invertebrate organisms, and Gene Ontology) organized by environmental toxicant, and currently contains 2,666,723 entries. Figure 1 depicts a gene, which consists of five entities in this database: Gene Symbol, GenBank Accession Number, UniGene ID, EntrezGene ID, and Gene Name. In Figure 2, the entity-relation model is built from microarray information on heavy metal exposure, and the entities relate to one another as described in Table 1. As seen in Figure 3(a), DEGs common to cadmium exposure are identified and extracted, and their interactions are visualized. Furthermore, this model is designed to represent the relation between the experimental raw data and the DEG information for exposure to a specific environmental toxicant (Figure 3(b)). It is thus possible to extract interactions of interest efficiently and to show the detailed relations among entities of biological information.

Figure 1. Genomic components constructed in Toxicogenomics Meta Database.

Figure 2. Entity-Relation (ER) modeling for microarray (partial ER diagram).

Figure 3. An example retrieval using a chemical as keyword; (a) extracted common DEGs and (b) proteomics information (PPI) in Toxicogenomics Meta Database.
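The entity relations of Table 1 can be read as subject-relation-object triples, which makes it straightforward to trace how a gene connects to a treatment. The Python sketch below assumes that the entity and relation columns of Table 1 pair up row by row; the function name `find_path` and the use of breadth-first search are illustrative choices, not part of the described system.

```python
from collections import deque

# (subject entity, relation term, object entity), read row by row from Table 1.
RELATIONS = [
    ("Treated Sample", "is classified by", "Sample"),
    ("Treated Sample", "is classified by", "Treatment"),
    ("Comparison", "is grouped by", "Treated Sample"),
    ("Experiment", "is experimented with", "Comparison"),
    ("Probe", "is expressed on", "Experiment"),
    ("Gene", "is targeted by", "Probe"),
    ("GO", "is classified by", "Gene"),
]


def find_path(start, goal, relations=RELATIONS):
    """Breadth-first search from start to goal, following subject -> object."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        entity, path = queue.popleft()
        if entity == goal:
            return path
        for subject, relation, obj in relations:
            if subject == entity and obj not in seen:
                seen.add(obj)
                queue.append((obj, path + [(subject, relation, obj)]))
    return None  # no chain of relations connects the two entities


path = find_path("Gene", "Treatment")
# Each triple reads as a short natural-language sentence.
sentences = [" ".join(triple) for triple in path]
```

Here `sentences` lists the chain of relations from Gene through Probe, Experiment, Comparison, and Treated Sample to Treatment, which is the route a keyword query on a chemical would follow in reverse.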

Table 1. Components of data, which are modeled as entities and their relations.

| Entity | Relation Term | Entity |
| Treated Sample | is classified by | Sample |
| Treated Sample | is classified by | Treatment |
| Comparison | is grouped by | Treated Sample |
| Experiment | is experimented with | Comparison |
| Probe | is expressed on | Experiment |
| Gene | is targeted by | Probe |
| GO | is classified by | Gene |

Semantic Queries Combined with Omics and Conventional Experiment Data

Gene expression microarray data for cadmium and nickel exposure, obtained with SurePrint G3 Gene Expression Microarrays (Agilent Technologies, CA, USA), were provided by Prof. Young-Rok Seo of Dongguk University. DEGs were extracted with the Toxicogenomics Meta Database and the analysis was completed using Pathway StudioTM. Table 2 summarizes the query DEGs supported by GO entries in each toxicogenomic experiment. In this example, the number of DEGs per GO entry shows how the relations differ and why they matter when choosing genes of interest strategically. That is, the number of DEGs per biological process distinguishes one chemical from another. Here, however, we only show that the DEG information for the two chemicals has different characteristics. The numbers of down-regulated genes selected from the microarrays for cadmium and nickel exposure are 416 and 721, respectively, in the constructed toxicogenomic meta database. Next, we select the DEGs related to a specific meaning, such as “strand break” in a biological process; the numbers of query DEGs from the cadmium and nickel microarrays are shown in Table 3. For cadmium, eight DEGs are related to “strand break” in a biological process (yellow boxes in Figure 4). For nickel, narrowing the query to DEGs supported by the GO entry “strand break”, in the same way as for the cadmium microarray data, selects ten DEGs (yellow boxes in Figure 5). As seen in Figures 4 and 5, we can distinguish the common DEGs supported by GO entries; these DEGs participate in GO entries such as GO:0005958 DNA-dependent protein kinase complex and GO:0007131 meiotic recombination. The eight common genes map to four biological processes. Next, we select candidate genes by comparing the common query DEGs with the count of references relating them to disease. Table 4 shows the number of supporting publications for each relation, which directly reflects how much research has been done, and lists the candidate genes with their disease and process relations in detail. As shown in Table 4, the top-ranked cell process and disease relations share common target DEGs, so this approach offers a simple strategy for selecting genes of interest with

Table 2. Results of queries retrieving the total number of DEGs supported by GO entries on exposure to cadmium and nickel, respectively, from all experiment results in the current Toxicogenomics Meta Database.

| Chemical | Expression | Number of DEG | DEGs from GO entry: Nucleotide binding | DEGs from GO entry: Apoptosis | DEGs from GO entry: Transcription | DEGs from GO entry: Strand break |
| Cadmium | Up-regulated | 194 | 34 | 36 | 39 | 58 |
| Cadmium | Down-regulated | 416 | 86 | 47 | 76 | 60 |
| Nickel | Up-regulated | 673 | 192 | 187 | 247 | 59 |
| Nickel | Down-regulated | 721 | 146 | 86 | 132 | 60 |

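Table 3. Results of queries retrieving the subset of DEGs supported by GO entries on exposure to cadmium and nickel, respectively, from chosen microarrays of interest in Toxicogenomics Meta Database.

| Chemical | Expression | Number of DEG | DEGs from GO entry: Nucleotide binding | DEGs from GO entry: Apoptosis | DEGs from GO entry: Transcription | DEGs from GO entry: Strand break |
| Cadmium | Up-regulated | 137 | 29 | 36 | 29 | 0 |
| Cadmium | Down-regulated | 299 | 66 | 47 | 67 | 8 |
| Nickel | Up-regulated | 364 | 0 | 0 | 1 | 7 |
| Nickel | Down-regulated | 427 | 112 | 62 | 96 | 10 |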


Figure 4. Selected genes using keyword of Cadmium in Toxicogenomic Meta Database; yellow box: DEGs, pink box: probe, white box: GO entry.

Figure 5. Selected genes using keyword of Nickel in Toxicogenomic Meta Database; yellow box: DEGs, pink box: probe, white box: GO entry.



Figure 6. Scheme of semantic data integration model for toxicogenomic laboratory experiment management system; (a) experiment raw data (blue box: laboratory assay), (b) omics-data (yellow box: gene, pink box: probe).

a wide range of evidence across the levels of gene, cell process, and disease. As seen in Figure 6, we present an example that demonstrates how to connect heterogeneous data types between omics data and experimental raw data. In particular, raw data from experiments (such as ELISA, protein-protein interaction, and enzyme cleavage data) provide additional experimental results for organisms exposed to environmental toxicants. In addition, microarray technology is used to investigate the underlying mechanisms of toxicity following exposure to environmental toxicants such as cadmium and nickel.
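The query strategy described in this section (select DEGs for one chemical and expression direction, narrow them by a GO keyword such as "strand break", find the genes shared between chemicals, then rank by reference count) can be sketched in Python as follows. This is a hedged illustration, not the Toxicogenomics Meta Database query interface: the `Deg` record, the function names, and every gene symbol and reference count are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deg:
    symbol: str
    chemical: str
    direction: str      # "up" or "down"
    go_terms: tuple     # names of GO entries annotated to the gene
    disease_refs: int   # count of references linking the gene to a disease


def select_degs(degs, chemical, direction, keyword):
    """DEGs of one chemical and direction whose GO entries mention the keyword."""
    kw = keyword.lower()
    return [
        d for d in degs
        if d.chemical == chemical
        and d.direction == direction
        and any(kw in term.lower() for term in d.go_terms)
    ]


def common_symbols(first, second):
    """Gene symbols present in both query results."""
    return {d.symbol for d in first} & {d.symbol for d in second}


def rank_by_references(degs):
    """Order candidates by literature support, most referenced first."""
    return sorted(degs, key=lambda d: d.disease_refs, reverse=True)


# Invented records for illustration only; they are not taken from the database.
DEGS = [
    Deg("GENE_A", "Cadmium", "down", ("double-strand break repair",), 120),
    Deg("GENE_B", "Cadmium", "down", ("meiotic recombination",), 15),
    Deg("GENE_C", "Cadmium", "up", ("double-strand break repair",), 40),
    Deg("GENE_A", "Nickel", "down", ("double-strand break repair",), 120),
    Deg("GENE_D", "Nickel", "down", ("double-strand break repair",), 300),
]

cadmium = select_degs(DEGS, "Cadmium", "down", "strand break")
nickel = select_degs(DEGS, "Nickel", "down", "strand break")
shared = common_symbols(cadmium, nickel)
ranked = [d.symbol for d in rank_by_references(nickel)]
```

Substring matching on GO entry names is the simplest stand-in for the keyword query; a real system would more likely match GO identifiers or traverse the ontology hierarchy.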

Table 4. Candidate genes selected by semantic data integration, with their disease and process relations in detail.

| Toxicogenomic experiment | Number of candidate genes | Cell process: reference count | Cell process: top-ranked | Disease: reference count | Disease: top-ranked |
| Cadmium | 8 | 1631 | Cell cycle, Check point, Apoptosis, Mitosis, DNA recombination | 352 | Neoplasm, Lymphoma |
| Nickel | 10 | 1913 | Cell cycle, Check point, Apoptosis, Mitosis, Severe Combined Immunodeficiency | 463 | Neoplasm, Lymphoma, Wounds and Injuries, Chromosomal Instability |

Conclusions

We proposed a data model that can seamlessly integrate toxicogenomic and conventional toxicological experiment data. This approach offers consistent accumulation of laboratory information, reliable data management, and rapid discovery of biomarkers that serve as surrogates for pathways and/or toxicological effects. Advanced technologies such as next-generation sequencing can be easily adopted into the proposed model and will enable simultaneous probing of genetic, genomic, proteomic, and metabolomic events. Toxicogenomic biomarker data will be updated routinely from in vitro and in vivo test systems. We expect that this approach will eventually lead to predictive in silico models that can help reduce the use of animals and the cost of experiments conducted to assess hazard and risk.

Materials and Methods

Microarray Analysis

Gene expression microarray data for cadmium and nickel exposure, obtained with SurePrint G3 Gene Expression Microarrays (Agilent Technologies, CA), were provided by Prof. Young-Rok Seo of Dongguk University. A human cell line was exposed to cadmium or nickel for 24 h at concentrations of 50 and 20 μmol, respectively. The microarray data were used to analyze relationships, such as cell process, disease, and gene regulator, among the gene expression patterns induced by each chemical compound.

Data Integration Modeling with BioXM

BioXMTM is a customizable knowledge management solution for the life sciences. It provides a central inventory of information and knowledge integrated in a unified scientific model. It makes semantic data integration easy, without requiring knowledge of relational database schemas.


Semantic Data Modeling

Data modeling is the process of organizing and structuring data. Entities are extracted from the domain of interest and linked to each other according to specific considerations. Entities and their links can be expressed as natural-language sentences, which is the key to semantic data modeling. The process of semantic data modeling in this research is as follows.
1. Extract entities: Nouns in the domain, such as gene, chemical, and experiment, can become entities.
2. Define entity attributes: Concepts that describe an entity can become its attributes. For example, “PubChem id” and “IUPAC name” can be attributes of the entity “chemical”.
3. Make links between entities: If two entities are related, they can be linked, optionally with additional information. A link can be expressed as a verb.
4. Collect objects and relations: An object is a piece of real data. It is an instance of an entity, such as the object Ethanol of the entity chemical.
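The four steps above can be mirrored in a few lines of Python. The sketch is only an illustration of the idea that entities are nouns, links are verbs, and objects are instances. The class names `Entity`, `Link`, and `Instance`, the verb "changes expression of", and the gene label GENE_A are assumptions made for this example; the attribute names and the Ethanol object come from the text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """Steps 1 and 2: a noun of the domain and the attributes that describe it."""
    name: str
    attributes: tuple = ()


@dataclass(frozen=True)
class Link:
    """Step 3: a relation between two entities, expressed as a verb."""
    subject: Entity
    verb: str
    target: Entity


@dataclass
class Instance:
    """Step 4: an object, i.e. real data that instantiates an entity."""
    entity: Entity
    label: str
    values: dict = field(default_factory=dict)

    def __post_init__(self):
        # Only attributes declared on the entity may carry values.
        unknown = set(self.values) - set(self.entity.attributes)
        if unknown:
            raise ValueError(f"undeclared attributes: {sorted(unknown)}")


def sentence(link, subject, target):
    """Render a linked pair of objects as a natural-language sentence."""
    if subject.entity != link.subject or target.entity != link.target:
        raise ValueError("objects do not match the entities of the link")
    return f"{subject.label} {link.verb} {target.label}"


chemical = Entity("chemical", ("PubChem id", "IUPAC name"))
gene = Entity("gene", ("symbol",))
affects = Link(chemical, "changes expression of", gene)   # hypothetical verb

ethanol = Instance(chemical, "Ethanol", {"IUPAC name": "ethanol"})
gene_a = Instance(gene, "GENE_A", {"symbol": "GENE_A"})   # placeholder gene

statement = sentence(affects, ethanol, gene_a)
```

Because each link reads as subject, verb, object, a populated model can be inspected as a set of plain sentences, which is the property the text identifies as the key to semantic data modeling.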

Acknowledgements

The authors gratefully acknowledge the financial support from the Korea Ministry of Environment as “EcoTechnopia 21 Project” entitled “Study on the development of toxicity evaluation technology for the environmental mutagenic heavy metals via the discovery of novel molecular biomarkers using DNA repair network-focused toxicogenomics”.

References

1. Hopkins, A. L. Network Pharmacology: The Next Paradigm in Drug Discovery. Nat. Chem. Biol. 4, 682-690 (2008).
2. Shin, G.-H., Kim, H.-Y., Lee, T.-H., Park, J.-H. & Kang, B.-C. A novel semantic framework for toxicogenomics. Toxicol. Environ. Health Sci. 2, 1-3 (2010).
3. Davis, A. P. et al. The Comparative Toxicogenomics Database: update 2011. Nucleic Acids Res. 39, Database issue D1067-D1072 (2011).
4. World Health Organization. in Beryllium, Cadmium, Mercury, and Exposures in the Glass Manufacturing Industry. IARC Monographs on the evaluation of carcinogenic risks to humans 58, (WHO Publications Center, Albany, USA, 1993).
5. Nordberg, G. F., Herber, R. F. M. & Alessio, L. Cadmium in the human environment: toxicity and carcinogenicity. IARC Sci. Publ. 118, (1992).
6. Kasprzak, K. S., Sunderman, F. W. Jr. & Salnikow, K. Nickel carcinogenesis. Mutat. Res. 533, 67-97 (2003).
7. Kodipura, D., Balakrishna, S., Thimappa, R. & Muralidhara. Nickel-induced oxidative stress in testis of mice: evidence of DNA damage and genotoxic effects. J. Androl. 25, 996-1003 (2004).
8. Wozniak, K. & Blasiak, J. Nickel impairs the repair of UV- and MNNG-damaged DNA. Cell. Mol. Biol. Lett. 9, 83-94 (2004).
9. The UniProt Consortium. Ongoing and future developments at the Universal Protein Resource. Nucleic Acids Res. 39, D214-D219 (2011).