Software agents for extracting, aggregating and updating data from web pages of genomic databanks

*Andrea Stella, *Marco Masseroli, PhD, †Myriam Alcalay, MD, PhD, *Francesco Pinciroli, Professor
*Dipartimento di Bioingegneria, Politecnico di Milano, Milan, Italy
†Istituto Europeo di Oncologia, Milan, Italy

Background

Today a huge amount of genomic data is stored in distinct databanks accessible via the web. A large effort is being made to compile integrated databanks that allow different data to be extracted in batch mode. However, to date, complementary annotations of the same nucleotide or amino acid sequence are stored in distinct databanks, most of which can be queried for only a single gene or protein at a time, and query results are generally available only in exhaustive, unstructured form inside HTML pages. This format makes it difficult either to extract and aggregate the data of interest from the retrieved information or to run new, more specific queries on them. Nevertheless, performing complex queries on several genes or proteins at a time, and integrating the data extracted from different databanks to perform comparisons, are high-priority needs among research groups [1].

Results and Discussion

To create templates and use them with the defined software agent, a software application called GeneWebEx was implemented in the Java programming language. Its main characteristics are: 1) a graphical interface with intuitive windows suited to biologists and physicians; 2) parametric operation, which allows the behaviour of the software agent to be adapted; 3) a module for creating templates from any reference HTML page; 4) a module for automatically extracting data from distinct HTML pages of different databanks, using the defined software agent and the created templates; 5) a direct connection to a database for storing and aggregating the variety of data extracted from distinct genomic databanks, and for performing complex queries on the aggregated data for comprehensive comparisons; 6) a log file of extraction operations that allows extraction results to be evaluated quickly. These features make GeneWebEx especially suitable for small and medium research laboratories, which often lack the resources and informatics expertise to manage tools that are more sophisticated but also more complex to use.

A software agent was also created in Java to keep the database of extracted data up to date. This agent uses the defined templates as its knowledge base and, at predefined time intervals, autonomously retrieves the HTML pages containing the data of interest, applies the template extraction rules, extracts the available data of interest, compares them with those in the database and, if necessary, updates the latter. In this way, even retrieved information with high temporal variability can be kept up to date and synchronized with the original databanks.
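The update cycle described above (retrieve, extract, compare, update at fixed intervals) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the class `UpdateAgentSketch`, its methods, the use of an in-memory `Map` in place of the database, and the `extractor` supplier standing in for page retrieval plus template-based extraction are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical sketch of a periodic update agent; the Map stands in for the database. */
class UpdateAgentSketch {

    /** Copies new or changed values into the store and returns how many entries changed. */
    static int synchronize(Map<String, String> store, Map<String, String> extracted) {
        int changed = 0;
        for (Map.Entry<String, String> e : extracted.entrySet()) {
            String old = store.get(e.getKey());
            if (!e.getValue().equals(old)) {      // assumes extracted values are non-null
                store.put(e.getKey(), e.getValue());
                changed++;
            }
        }
        return changed;
    }

    /** Runs one retrieve-extract-compare-update cycle every periodSeconds. */
    static ScheduledExecutorService start(Map<String, String> store,
                                          Supplier<Map<String, String>> extractor,
                                          long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                int n = synchronize(store, extractor.get());
                System.out.println("updated entries: " + n);
            } catch (RuntimeException ex) {
                // An uncaught exception would cancel all later runs, so log and continue.
                System.err.println("update cycle failed: " + ex.getMessage());
            }
        }, 0, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, String> store = new HashMap<>();
        store.put("geneA.description", "old annotation");   // made-up record

        // Hypothetical extractor; a real one would fetch the page and apply a template.
        Supplier<Map<String, String>> extractor = () -> {
            Map<String, String> fresh = new HashMap<>();
            fresh.put("geneA.description", "new annotation");
            return fresh;
        };

        ScheduledExecutorService scheduler = start(store, extractor, 1);
        Thread.sleep(1500);                       // let a couple of cycles run
        scheduler.shutdown();
        scheduler.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(store);
    }
}
```

Keeping the comparison in a separate `synchronize` method means the compare-and-update logic can be checked without any scheduling or network access, which is one plausible way to structure such an agent.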

Materials and Methods

A software agent for the automatic extraction of data from HTML pages was developed. It works as follows:
- retrieves the HTML page, available on the Internet, containing the data of interest;
- creates a tree-structure representation of the retrieved HTML page by parsing it and separately identifying HTML tags and data, which become the nodes and leaves, respectively, of the tree structure;
- uses templates as a knowledge base to identify, inside the tree structure, the HTML structures containing the data to extract;
- aggregates and structures the extracted data in a database designed to accommodate the variety of extractions from different genomic databanks.

Creation of Data Extraction Template

Templates are created through a purpose-built algorithm and user interaction. The user must select three sequences of characters on a reference HTML page. The first sequence constitutes an anchor, i.e. a unique sequence of characters inside the page. The other two sequences are used by the algorithm to identify the HTML structure containing the data to extract. Using the three selected sequences, the algorithm automatically defines the relative path in the page tree structure from the node containing the anchor to the node of the HTML structure containing the data of interest, and creates the template.
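The idea of a template as a path relative to an anchor can be illustrated with a small Java sketch. It is not the GeneWebEx algorithm: the class `TemplateSketch`, its methods, the toy parser (which assumes well-formed HTML with no void or self-closing tags) and the sample pages are assumptions made for illustration. As a further simplification, it uses two selected sequences (the anchor and one sample of the data) rather than the three described above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: anchor-relative data extraction from a small HTML tree. */
class TemplateSketch {

    /** Tree node: an HTML tag (inner node) or a piece of text (leaf). */
    static class Node {
        final String tag;   // tag name, or null for a text leaf
        final String text;  // text content, or null for a tag node
        final Node parent;
        final List<Node> children = new ArrayList<>();
        Node(String tag, String text, Node parent) {
            this.tag = tag; this.text = text; this.parent = parent;
            if (parent != null) parent.children.add(this);
        }
    }

    /** Template: anchor text plus a relative path (steps up, then child indices down). */
    static class Template {
        final String anchor; final int up; final List<Integer> down;
        Template(String anchor, int up, List<Integer> down) {
            this.anchor = anchor; this.up = up; this.down = down;
        }
    }

    /** Parses well-formed HTML only (a deliberate simplification). */
    static Node parse(String html) {
        Node root = new Node("#root", null, null);
        Node current = root;
        int i = 0;
        while (i < html.length()) {
            if (html.charAt(i) == '<') {
                int end = html.indexOf('>', i);
                String inside = html.substring(i + 1, end).trim();
                if (inside.startsWith("/")) {
                    current = current.parent;                 // closing tag: move back up
                } else {
                    current = new Node(inside.split("\\s+")[0], null, current); // drop attributes
                }
                i = end + 1;
            } else {
                int end = html.indexOf('<', i);
                if (end < 0) end = html.length();
                String text = html.substring(i, end).trim();
                if (!text.isEmpty()) new Node(null, text, current); // data becomes a leaf
                i = end;
            }
        }
        return root;
    }

    /** Depth-first search for the first text leaf containing the given sequence. */
    static Node findLeaf(Node n, String s) {
        if (n.text != null) return n.text.contains(s) ? n : null;
        for (Node c : n.children) {
            Node hit = findLeaf(c, s);
            if (hit != null) return hit;
        }
        return null;
    }

    static List<Node> pathToRoot(Node n) {
        List<Node> path = new ArrayList<>();
        for (Node x = n; x != null; x = x.parent) path.add(x);
        return path;
    }

    /** Records the path from the anchor leaf to the data leaf on a reference page. */
    static Template createTemplate(Node root, String anchor, String sample) {
        List<Node> a = pathToRoot(findLeaf(root, anchor));
        List<Node> d = pathToRoot(findLeaf(root, sample));
        int up = 0;
        while (!d.contains(a.get(up))) up++;                  // climb to the common ancestor
        List<Integer> down = new ArrayList<>();
        for (int k = d.indexOf(a.get(up)) - 1; k >= 0; k--) {
            down.add(d.get(k + 1).children.indexOf(d.get(k)));
        }
        return new Template(anchor, up, down);
    }

    /** Applies the template to another page; returns null if the path does not fit. */
    static String extract(Node root, Template t) {
        Node n = findLeaf(root, t.anchor);
        for (int k = 0; n != null && k < t.up; k++) n = n.parent;
        if (n == null) return null;
        for (int idx : t.down) {
            if (idx >= n.children.size()) return null;
            n = n.children.get(idx);
        }
        return n.text;
    }

    public static void main(String[] args) {
        // Made-up pages with the same layout but different data.
        String reference = "<table><tr><td>Gene symbol</td><td>TP53</td></tr></table>";
        String other = "<table><tr><td>Gene symbol</td><td>BRCA1</td></tr></table>";
        Template t = createTemplate(parse(reference), "Gene symbol", "TP53");
        System.out.println(extract(parse(other), t));
    }
}
```

Because the path is stored relative to the anchor rather than to the page root, the same template keeps working when unrelated content is added above the anchored structure, which is presumably the motivation for anchoring in the first place.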

AMIA 2002 Annual Symposium Proceedings

References

1. Cheung KH, Nadkarni PM, Shin DG. A metadata approach to query interoperation between molecular biology databases. Bioinformatics 1998; 14(6): 486-97.
