SEKE-Rafael Andrade

An approach for retrieval and knowledge communication using medical documents

Rafael Andrade, M. A. R. Dantas

Fernando Costa Bertoldi, Aldo von Wangenheim

Post-Graduate Program in Knowledge Engineering and Management (EGC) Federal University of Santa Catarina, UFSC Florianopolis, Brazil [email protected], [email protected]

Department of Informatics and Statistics (INE) Federal University of Santa Catarina, UFSC Florianopolis, Brazil {bertoldi, awangenh}@inf.ufsc.br

Abstract: The large amount of information available in different data sources increasingly requires search engines to retrieve as many relevant documents as possible. Clinical medical records contain a large amount of information, normally written as free text and without a linguistic standard. Physicians do not follow a common writing style in patient reports, so retrieving knowledge from these data is not an easy task for a search engine. In this paper we present the development of a model that allows knowledge to be recovered from textual information in medical documents. The model uses query expansion techniques that draw on the DeCS ontology and language dictionaries. The goal is to expand the user's search and to create a knowledge base that allows its reuse. To improve the search results, semantic annotation and negation detection are used to process the medical texts. The case study presented at the end shows that the proposed model achieved a mean accuracy of 90% in its first ten results, while the Boolean model was limited to only 60%. We conclude that the user will not need to search different databases to find the necessary information.

Keywords: Information Retrieval; Query expansion; Negation detection; Medical ontologies

I. INTRODUCTION

Electronic medical information is increasingly present in hospitals and medical clinics. A large amount of data containing medical information is available to researchers, medical institutions, patients and anyone else interested in this kind of information. More and more people can receive and share their information without restrictions. However, the large amount of information available in different data sources requires more intelligent retrieval techniques that focus on the content and semantics of the information [1]. To illustrate this requirement, consider a typical medical expert scenario: common tasks in medical research and clinical care involve, for instance, checking exam results, comparing them with other reports, or performing statistical analysis of disease patterns in medical records. When a large volume of information is captured and analyzed, automating these tasks is crucial for efficient processing. Medical concepts have to be easily extractable from medical records so that they can be compared with other information, allowing more effective searches [2].

Although the information is available in different forms, medical texts need to be interpreted by computers so that the information can be processed and effectively shared. To enable this process, users need tools that increase accessibility and support data management [3]. Interpreting medical text is a difficult task, but it can be easier than interpreting narrative speech because the medical vocabulary is more restricted. Clinical medical records contain a large amount of information, normally written as free text and without a linguistic standard. Physicians do not follow style conventions when writing patient reports; a report can be written in many ways, and each physician has their own writing style [4]. Handling this amount of information is certainly one of the greatest challenges for modern health care search engines.

In this paper we describe the use of knowledge extraction from a medical ontology and of negation detection procedures to define semantic annotations, expand the user's query and improve the search process on medical reports. We describe how to connect named entities to semantic descriptions in the medical ontology DeCS [5] and how to expand the user's query with DeCS terms. Furthermore, we describe a method for automatic negation detection in medical reports using Natural Language Processing (NLP) techniques [6]. We argue that combining these three techniques (semantic annotation, query expansion using a medical ontology, and detection of negated findings and diseases in medical reports) allows semantically enriched Information Retrieval (IR), expands the user's queries with medical terms and improves the quality of the search. In this paper, search quality is measured with the precision at ten (P@10) metric.
Section II presents the problem of retrieving information from medical documents. Section III gives the background and related work on semantic search, medical ontologies, semantic annotation, semantic repositories and techniques for detecting negated phrases. Section IV presents a schema to define automatic annotation from the medical ontology DeCS, expand the user's query and detect negative expressions in medical text reports. Section V presents the results, and Section VI discusses the conclusions and future work.

II. PROBLEM

Because the information does not follow a specific standard, the information retrieval (IR) process over patients' medical records is in most cases inefficient. Traditional search engines based on Boolean techniques do not exploit the full potential of this domain-specific knowledge. Although the medical domain provides controlled vocabularies and tools that can be used to search and index documents according to a conceptual hierarchy (e.g., MeSH, DeCS, UMLS), these engines do not recover the hierarchical and semantic relations among the concepts, such as parent and child links. Sometimes the medical expert needs to retrieve more relevant information about a specific term or expression. Therefore, this approach does not necessarily improve on traditional text search methods [3]. For all related information to be retrievable, search engines must be able to understand the user's request and expand the search range without losing search quality, as described by the precision and recall metrics.

III. BACKGROUND

Before presenting our approach, we introduce the techniques most commonly used for effective search and retrieval of medical information in a semantic context. Subsection A explains in more detail the methods used for semantic search with query expansion and medical ontologies. Subsection B introduces semantic annotation techniques used to improve the retrieval performance of free-text search. Subsection C introduces the detection of negated phrases in medical reports.

A. Query expansion and ontologies

To address the problem of semantic search, effective techniques are being widely developed to search and retrieve medical knowledge semantically [1], [2], [3], [6], [7]. The most widely used resource is the ontology. An ontology is a formal specification of a shared conceptualization that represents concepts, relationships and the rules that govern those relationships. It is a way to represent semantic relationships, such as the objects and relationships in a particular domain [8], [9]. Ontologies and semantic approaches are used for integration and IR across different databases [10]. The structure of the ontology is an important criterion for organizing knowledge and determining semantic reusability and data interoperability [1].

An important use of ontologies is query expansion over free-text documents in order to improve IR systems [10]. To expand a query, the user is guided to reformulate the search by adding new meaningful terms to the initial query. The system extracts similar terms from the ontology and generates a new query to improve retrieval. One line of work combines two independent subsystems that retrieve textual and visual information; by integrating medical knowledge through term expansion, it improves the results compared with using the two techniques separately.

Other works address query expansion as a way to enrich information retrieval systems [1], [7], [11]. One way to expand a query is the Term Frequency-Inverse Document Frequency (tf-idf) technique. To expand the search, this method adds new terms to the request, using words or phrases whose meanings are similar or related to the original request [11]. To evaluate the importance of query expansion and manual indexing, Abdou and Savoy [11] developed a new query expansion model and evaluated the performance of ten different IR models, including probabilistic, language and vector-space models. The authors performed two IR tests (with and without the MeSH ontology) on the Medline collection to measure retrieval performance. The method presented by the authors is 170% more precise than the classical tf-idf vector-space model. Including MeSH when indexing scientific articles improves the performance of the best IR scheme by a further 8.4%.

Another way to expand a query is to use ontologies. Bhogal et al. [7] presented a range of definitions of information retrieval focused on the use of context for query expansion. They discussed the use of ontologies for a range of information retrieval tasks and, in particular, for query expansion. For more information see [7].

B. Semantic annotation

Important studies show that extracting information semantically, by providing semantic annotations that reference ontologies and knowledge databases, can improve the precision of IR systems [2], [12], [13]. In this research area, semantic text annotation is a text processing technique that maps the text to corresponding concepts in ontologies and annotates these concepts in the document [2]. Annotations generate and use a set of metadata that provides references to named entities from the ontology mentioned in the medical documents [12]. This metadata forms a knowledge database that supports semantic search in document repositories. The unstructured medical text thus gains important metadata supplied by the ontology.
Semantic annotation enables computers to understand the semantic meaning of a data set and increases the quality of the retrieved information and system interoperability [14]. Gschwandtner et al. [2] present an automatic semantic annotation system that maps concepts from a medical ontology to free medical text. They customize other annotation systems (designed for annotating web pages) to understand a specific medical domain. Their application generates a map of concepts from medical terminologies, and the medical concepts found in documents are semantically annotated with metadata. Medical experts can visualize, control and correct all kinds of annotated information when processing a medical document. Their work shows that mapping medical concepts from an ontology can provide semantically accurate information for text processing and helps to remove ambiguity between different meanings.

Lourenço et al. [13] identify relevant terms in documents with a lexicon-based Named Entity Recognition (NER) process. The goal is to annotate occurrences of biological classes in abstracts or full texts from the PubMed library and to implement semantic indexing of documents and terms. The technique extracts information from medical libraries, tokenizes it (document pre-processing) and applies a lexical dictionary to recognize named entities. According to their results, the system is able to significantly reduce the number of irrelevant documents without a significant loss of relevant documents. Although the techniques used are different, both works aim to improve indexing and semantic information retrieval from medical documents.

C. Negative expressions

Retrieving negative expressions is as important as retrieving any other information from the database. A negative expression can invalidate a search. Therefore, the search engine has to decide whether a negated expression contained in the database should be excluded from the search results [15]. Physicians use negative phrases and expressions when writing patient diagnoses and medical procedures, and these expressions help the medical expert to identify the symptoms and diseases that occur in medical documents [6]. For instance, medical documents commonly contain diagnoses such as “the patient does not have hypertension” or “the patient has hypertension”. These expressions create a problem for the search engine: if the engine treats the first expression as an affirmative finding, all documents that contain this kind of expression will be retrieved and the result will contain false information. For this reason, negation detection requires considerable knowledge about the language in order to correctly identify negated words or terms in an expression. Detecting negative statements in medical databases can reduce the search space, making the search process more agile.

Gindl et al. [6] developed a method to classify and detect negated information in Clinical Practice Guidelines (CPG) at the syntactic level, using grammatical information of the English language. Their studies show that grammatical elements can be used to decide whether a phrase is negated or not. Negation classification allows the medical expert to decide which therapies or treatment options should or should not be applied to patients. Their syntactic detection methods improve precision and recall. An earlier study identifies concepts from the medical ontology UMLS and uses a lexical scanner to recognize and classify a large set of negation patterns occurring in the text [15]. The authors developed a program based on existing technology for implementing parsers for context-free grammars. The results of their study show that the system has recall and precision between 91.8 and 95.7 percent in detecting negations in medical documents.

IV. METHODS AND MODEL DESCRIPTION

Although most of these works apply different techniques to retrieve information from medical documents, they all share one particular purpose: improving the search process over medical data. We understand that it is not easy to improve the current process, but by combining the three techniques presented above we refine the current approach.

Our work differs from previous works because we aim to provide the medical user with a much more extensive and effective range of documents. We want to present the most relevant documents so that the user does not need to spend too much time finding information or look in different databases for it. We also aim to provide a semantic list, created from a frequently updated knowledge database, that allows the user to find the information in an intuitive way.

The approach presented here is based on query expansion using ontologies, which define terms in a hierarchical tree, and on the use of synonyms. To improve the search, semantic annotations are applied to the medical texts. This technique includes the use of named entities and the detection of negative phrases in order to broaden the search universe and reduce the number of less relevant responses. We create a metadata repository that is used to generate concepts from the ontology, extract information from medical documents and pre-process the text. Figure 1 describes how terms are indexed and how users receive the results.

Figure 1. Semantic Information Retrieval from medical databases.

The most important component is the Query Engine module. It is responsible for accessing the semantic knowledge database and the text extraction modules (Semantic Annotation, Negation Detection and Query Expansion). The core of the query engine is composed of two main elements: Indexing and Search. Analysis is the conversion of text into terms, which are used to determine which documents match a query during the search. The Analyzer is the component that performs a series of operations to facilitate indexing: it converts tokens to lowercase, removes accents from characters, removes common words such as articles and pronouns (stop words) and extracts word roots (stemming).

The Analyzer also uses NLP to extract terms from the DeCS ontology, to classify and define the relationship between two terms, and to extract negative sentences. In our system, we assume that the knowledge database is built from and associated with information sources that use medical ontologies, lexical dictionaries and the Analyzer component to describe the concepts that appear in documents stored in the medical database.
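The Analyzer steps described in this section (lowercasing, accent removal, stop-word removal and stemming) can be illustrated with a minimal Python sketch. It is not the implementation used in our prototype: the stop-word list, the suffix list and the function names `strip_accents`, `naive_stem` and `analyze` are hypothetical and chosen only for illustration, and the stemmer is deliberately naive.

```python
import re
import unicodedata
from typing import List

# Hypothetical, tiny stop-word list; a real system would load a complete
# list for the language of the reports.
STOP_WORDS = {"the", "a", "an", "of", "in", "he", "she", "it", "and"}

# Hypothetical suffix list for a deliberately naive stemmer.
SUFFIXES = ("ations", "ation", "ing", "es", "s")


def strip_accents(text: str) -> str:
    """Remove diacritics, e.g. 'não' becomes 'nao'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def naive_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text: str) -> List[str]:
    """Lowercase, strip accents, tokenize, drop stop words, then stem."""
    normalized = strip_accents(text.lower())
    tokens = re.findall(r"[a-z0-9]+", normalized)
    return [naive_stem(tok) for tok in tokens if tok not in STOP_WORDS]
```

For example, `analyze("Presence of thyroid nodules")` yields the index terms `presence`, `thyroid` and `nodul`. A production analyzer would replace the suffix stripping with a language-specific stemmer.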

The Reformulate component is responsible for connecting the other text extraction modules, reformulating the query and storing the results in the Semantic Knowledge database. The Semantic Annotation, Query Expansion and Negation Detection modules are used by the Top-k Results Processor to retrieve and rank the information. The ranking method is an adaptation of the vector space model (Van Rijsbergen, 1975), which defines weights for the keywords that appear in a document. The weights are computed automatically from the frequency of instances in each document, where the number of occurrences of an instance is the number of times it appears in the text.

The Negation Detection module is an NLP technique that finds negated sentences in the medical text reports in our database. As described in [6], we need to detect five negation classes (adverbial negation, intra-phrase triggered negation, prepositional negation, adjective negation and verb negation) in order to improve the recall and precision of the user's query. The first step in the detection process is to create a list of negative terms (not, ever, never, none, neither, ...). Because a single word does not define a negative expression, we need to analyze the whole expression. Furthermore, because many medical findings are not stated with complete certainty as either affirmative or negative, we created a list of "hypothetical" terms such as "cannot assert", "cannot exclude" and "cannot completely rule out". We then map the biomedical text to terms found in the DeCS ontology in order to create a negative expression dictionary covering these five classes of negation. We detect these negated expressions in the medical reports in order to refine the user's query, which is then sent to the Query Engine module. As a result, we generate a list that is indexed with the terms found, in order to facilitate further searches. Figure 2 shows an excerpt of the negative expressions indexed by the algorithm, in Portuguese.

206876|Bloqueio divisional ântero-superior de ramo esquerdo, nao se pode excluir fibrose inferior
200330|Alterações da repolarização ventricular em parede inferior com onda T negativa em D3 e a Vf compatível com isquemia subepicárdica
295168| Não se pode descartar fibrose inferior
84079| taquicardia supraventricular não -sustentada
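The trigger-list idea behind the Negation Detection module can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of [6] and not our full module: it does not distinguish the five syntactic negation classes, and the names `HYPOTHETICAL_PHRASES`, `NEGATION_WORDS`, `normalize` and `classify` are hypothetical. The English triggers are taken from the examples in the text; the Portuguese ones are assumed from the excerpt above.

```python
import re
import unicodedata

# Multi-word "hypothetical" triggers (illustrative, not exhaustive).
HYPOTHETICAL_PHRASES = (
    "cannot assert",
    "cannot exclude",
    "cannot completely rule out",
    "nao se pode excluir",
    "nao se pode descartar",
)

# Single-word negation triggers (illustrative, not exhaustive).
NEGATION_WORDS = {"not", "never", "none", "neither", "nao"}


def normalize(text: str) -> str:
    """Lowercase, strip accents and keep only alphanumeric tokens."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    no_accents = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return " ".join(re.findall(r"[a-z0-9]+", no_accents))


def classify(sentence: str) -> str:
    """Label a sentence as 'hypothetical', 'negative' or 'affirmative'."""
    padded = f" {normalize(sentence)} "
    # Hypothetical phrases are checked first because they are more specific.
    if any(f" {phrase} " in padded for phrase in HYPOTHETICAL_PHRASES):
        return "hypothetical"
    if any(word in NEGATION_WORDS for word in padded.split()):
        return "negative"
    return "affirmative"
```

With these lists, “the patient does not have hypertension” is labeled `negative`, “the patient has hypertension” is labeled `affirmative`, and the excerpt "Não se pode descartar fibrose inferior" is labeled `hypothetical`. Checking whole phrases before single words reflects the observation above that a single word does not define a negative expression.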

To define the semantic annotations, we use a lexical dictionary, a list of stop words and a dictionary of negative sentences for the Portuguese language. Moreover, we use the DeCS ontology to create automatic semantic annotations on the text and store them in the semantic repository. The purpose of the semantic repository is to retrieve not only the exact terms contained in the queries, but also related entities, such as synonyms and related terms stored hierarchically in the medical ontology. Using structural document similarities, we identify named entities, associate these entities with concepts and define the semantic annotations. These annotations are stored in the semantic repository for future searches.

V. RESULTS

To validate a knowledge base, we need people who have a high degree of knowledge in a specific area and the ability to transfer this knowledge. These people are called domain specialists. The specialist helps to structure the knowledge and to minimize the indexing errors of computer systems. In this work, a specialist in the medical domain validated the creation and annotation of the knowledge base. This validation enables the user to represent and communicate the knowledge in future searches. To validate our database, the specialist randomly selected 172 computed tomography (CT) reports. Figure 3 shows the result of the specialist's evaluation of the CT findings. The specialist found 142 reports categorized as affirmative, 19 as negative and four as ambiguous, and seven reports were not found. In this case, we can see that, for these 172 reports alone, a traditional system would index those 19 reports (11%) as true, and traditional search engines would be unable to decide whether a report should be part of the response or not.

Figure 2. Excerpt of annotated medical findings produced by our system.

The query expansion process uses the DeCS ontology to add more medical information to the user's search. The DeCS ontology is used for indexing the medical report text in the database because it contains concepts, synonymy relations and related concepts. This facilitates the expansion and extraction of terms from the DeCS descriptors. If a set of terms is related to both the user's query and the DeCS descriptors, we generate a new query that contains all the terms present in the DeCS synonyms. Moreover, we use the lexical dictionary to find new synonymous expressions in the medical reports. For instance, when a physician searches for the disease “asthma”, the DeCS ontology returns four terms: “Asthma”, “Asthma, Exercise-Induced”, “Dyspnea, Paroxysmal” and “Asthma, Aspirin-Induced”. If we also use the lexical dictionary, the results are expanded with three new terms: “Bronchitis”, “Infectious bronchitis virus” and “Bronchitis, Chronic”. This happens because in our dictionary the term “Asthma” is related to the term “Bronchitis”, whereas in the DeCS ontology this connection does not exist within the same hierarchical tree.
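The “asthma” example can be reproduced with a small Python sketch. The two dictionaries below are toy stand-ins, not the real DeCS ontology or our lexical dictionary, and the names `DECS_TERMS`, `LEXICAL_RELATED` and `expand_query` are hypothetical. The sketch also assumes that a term related through the lexical dictionary is itself expanded through DeCS, which is one plausible reading of how the three bronchitis terms arise.

```python
from typing import Dict, List

# Toy stand-in for DeCS lookups (assumption): query term -> descriptor terms.
DECS_TERMS: Dict[str, List[str]] = {
    "asthma": [
        "Asthma",
        "Asthma, Exercise-Induced",
        "Dyspnea, Paroxysmal",
        "Asthma, Aspirin-Induced",
    ],
    "bronchitis": [
        "Bronchitis",
        "Infectious bronchitis virus",
        "Bronchitis, Chronic",
    ],
}

# Toy stand-in for the lexical dictionary (assumption): term -> related terms.
LEXICAL_RELATED: Dict[str, List[str]] = {"asthma": ["bronchitis"]}


def expand_query(term: str, use_dictionary: bool = True) -> List[str]:
    """Return the expanded term list for a single query term."""
    key = term.strip().lower()
    expanded = list(DECS_TERMS.get(key, []))
    if use_dictionary:
        for related in LEXICAL_RELATED.get(key, []):
            # Expand each related term through the ontology when possible.
            for candidate in DECS_TERMS.get(related, [related]):
                if candidate not in expanded:
                    expanded.append(candidate)
    # Unknown terms are passed through unchanged.
    return expanded or [term]
```

Calling `expand_query("asthma", use_dictionary=False)` returns the four DeCS terms, and `expand_query("asthma")` returns seven terms, adding the three bronchitis entries. A term unknown to both sources is returned as-is.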

Figure 3. Validation reports for CT examination by medical specialist.

Once the prototype had been developed, several experiments were conducted to verify the behavior of the search system in real situations and to check its compliance with the model described here. The tests were designed to validate the process of retrieving and communicating knowledge in the health domain.

In this scenario, the user searches for reports that contain terms used in the ontology. We also investigated reports that contain negative expressions and reports that contain expressions not known to the ontology. The aim of this study is to show the user the results of the requested queries together with similar terms. If the system does not find a result, the word or expression is recorded in our daily database and indexed in the next iteration. This information is extremely valuable because the system can "learn" new terms as users use the system, and users can learn more about the terms used by other professionals.

queries, whereas with the IR+QE+Neg method a single query was enough to retrieve the results. Figure 4 shows only the best results of the traditional IR method compared with the IR+QE+Neg model.

To illustrate the operation of our system, we developed a case study that considered four search sentences:

Q1: "Presence of thyroid nodules"
Q2: "Absence of Lithiasis"
Q3: "popcorn calcification in the brain"
Q4: "Right MCA Aneurysm"

Query Q1 was built with terms that are present in the domain ontology. In this case, the documents expected to answer Q1 are all documents that contain "thyroid nodules presence" + "Presence Thyroid Gland Nodule" + "presence Thyroid Neoplasms" + "presence Thyroid Disease" + "Gland Thyroid presence" (the result of the query expansion). Because the term "thyroid nodules" is known to the ontology, the system returned all documents related to the set of terms listed in the expanded query. The results should not contain negative expressions, because the user did not select the specific field for searching phrases with a negative sense. The query results are described in Table I.

TABLE I. CONSULTATION USING THE DEVELOPED METHODOLOGY

Query                                     | Results | Time in ms | P@10
Q1: “Presence of thyroid nodules”         | 221     | 557        | 0.7*
Q2: “Absence of Lithiasis”                | 432     | 612        | 1.0
Q3: “popcorn calcification in the brain”  | 109     | 736        | 0.9*
Q4: “Right MCA Aneurysm”                  | 79      | 589        | 1.0
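The P@10 values in Table I follow the usual definition of precision at k: the fraction of the first k ranked results that are judged relevant. The sketch below shows the computation and the averaging over the four queries; the function names `precision_at_k` and `mean_precision` are illustrative, and the relevance judgments are assumed to come from the specialist.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[bool], k: int = 10) -> float:
    """Fraction of the first k ranked results that are relevant.

    Dividing by k (not by the number of results returned) means that a
    query returning fewer than k results is penalized.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(relevance)[:k]
    return sum(1 for is_relevant in top_k if is_relevant) / k


def mean_precision(values: Sequence[float]) -> float:
    """Arithmetic mean of per-query precision values."""
    if not values:
        raise ValueError("at least one value is required")
    return sum(values) / len(values)


# P@10 values for Q1 to Q4 as reported in Table I.
table_i = [0.7, 1.0, 0.9, 1.0]
average = mean_precision(table_i)
```

Averaging the four Table I values gives (0.7 + 1.0 + 0.9 + 1.0) / 4 = 0.9.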

Figure 4. Comparison of traditional method with the proposed model.

With the exception of Q1, all queries obtained an accuracy equal to or above 90%. The accuracy is somewhat lower for Q1 because the responses contain many hypothetical expressions, for which we cannot confirm the accuracy of the reports. Even so, the accuracy for Q1 was still much higher than that of the traditional IR method (50%). Because the average precision of the two models can be computed, we can also compare our model with the most important works presented in Section III. Figure 5 shows a comparison between our proposed system and four other IR models. However, the average precision is only available for the expansion methods and for our method. Likewise, the overall precision is only available for the works on detecting negative expressions. Because this work involves the use of two different models, it was difficult to define a fair comparison against the models available in the literature. Still, the proposed model showed an accuracy well above what is reported in the literature.

* In this query we found hypothetical terms that cannot be considered valid.

To validate this case study and determine the system's accuracy, we used the P@10 metric. The result for each query defined at the beginning of this section is shown in Table I. For a better search, the queries were run after the validation by the specialist and after the daily-use terms had been connected to the ontology terms. The average precision over all queries was 0.9000. We evaluated the accuracy of the first ten results (P@10) for the two search models: the traditional method (referred to here as IR) and the new model (IR+QE+Neg). In all the submitted queries, the new search model had much better results than the traditional method. For the user to get a satisfactory result with the traditional method, we needed to perform several different

Figure 5. Results of average precision and overall precision compared with other important works.

The model presented by Díaz-Galiano et al. [1] uses the database of the Cross Language Evaluation Forum (CLEF) from 2005 and 2006. This database has 50,000 annotated images, which are available for testing accuracy. The other works used their own databases to measure the performance of their experiments. The traditional method, evaluated on 352 sentences, reached an accuracy of 0.30. Díaz-Galiano et al. [1], using 50,000 annotated reports, reached a mean accuracy of 0.23. Abdou and Savoy [11] reached a precision of 0.38 using a set of 1,000 reports. Chapman [16] and Gindl et al. [6] report only the overall precision of their experiments. Chapman used a database of 1,058 reports and reached a precision of 0.78. Gindl et al. [6] carried out their research on a base of 558 sentences and reached an overall accuracy of 0.68. Our model, which uses the three IR techniques presented, reached an accuracy of 0.88 and an overall accuracy of 0.96.

VI. CONCLUSIONS

In this paper we described a method that enables semantic search over medical text reports using domain-specific ontologies. We implemented a parser that extracts information from a database, normalizes it and stores it in an index database. The searches conducted made it possible to assess the functionality and applicability of the proposed architecture. We also used the DeCS ontology in the clinical toxicology area to classify the available information and establish well-defined relations within it. The use of this ontology will optimize the care process, increasing the reliability and efficiency of medical professionals.

In our approach, queries are generated from keywords, from a natural language query, or through a form-based interface where the user can explicitly select ontological items. When a user performs a search, the queries are executed against the semantic repository, which returns a list of the instances that satisfy the search. If the required information is not indexed in the semantic repository, the engine performs the search again directly on the medical database. The result is an index that is stored in the semantic repository and is available for further queries. Before sending an answer to the user, the system retrieves the information, ranks it and creates the top-k result list.

We also discussed three general data integration approaches that use the DeCS ontology to provide access to a medical database. Based on an evaluation of these three techniques, we believe that this form of semantic information retrieval can help users improve the results of their searches by expanding the query and improving precision and recall.

In the future we aim to develop a semantic index that interprets the information received from a medical database. We will develop basic NLP techniques in order to construct the semantic queries. An ontological process will be provided to map the words in the text to DeCS ontology concepts. To handle negative statements, we will develop a text-mining algorithm to extract medical terms from our medical diagnosis database. We then need to classify the text, e.g., “the patient does not have hypertension”, “the patient has hypertension”, or “the patient has high blood pressure”. Furthermore, we need to relate the terms in medical ontologies to the terms found in the medical report (e.g., symptoms and human body parts). The future goal is to define high weights for negative expressions, generate a dictionary that will be used for faster search, and further improve the precision and recall of user queries.

REFERENCES

[1] Díaz-Galiano, M.C., M.T. Martín-Valdivia, and L.A. Ureña-López, Query expansion with a medical ontology to improve a multimodal information retrieval system. Computers in Biology and Medicine, 2009. 39(4): p. 396-403.
[2] Gschwandtner, T., et al., Easing semantically enriched information retrieval--An interactive semi-automatic annotation system for medical documents. International Journal of Human-Computer Studies, 2010. 68(6): p. 370-385.
[3] Moskovitch, R. and Y. Shahar, Vaidurya: A multiple-ontology, concept-based, context-sensitive clinical-guideline search engine. Journal of Biomedical Informatics, 2009. 42(1): p. 11-21.
[4] Sager, N., C. Friedman, and M.S. Lyman, Medical Language Processing: Computer Management of Narrative Data. 1987: Addison-Wesley Longman Publishing Co., Inc. 320.
[5] BIREME. DeCS/VMX. 2010 [cited 15 Feb 2009]; Available from: http://decs.bvs.br/vmx.htm.
[6] Gindl, S., K. Kaiser, and S. Miksch, Syntactical negation detection in clinical practice guidelines. Studies in Health Technology and Informatics, 2008. 136: p. 187.
[7] Bhogal, J., A. Macfarlane, and P. Smith, A review of ontology based query expansion. Inf. Process. Manage., 2007. 43(4): p. 866-886.
[8] Berners-Lee, T., J. Hendler, and O. Lassila, The semantic web. Scientific American, 2001. 284(5): p. 34-43.
[9] Studer, R., V.R. Benjamins, and D. Fensel, Knowledge engineering: Principles and methods. Data & Knowledge Engineering, 1998. 25: p. 161-197. DOI: 10.1016/S0169-023X(97)00056-6.
[10] Munir, K., et al., Semantic Information Retrieval from Distributed Heterogeneous Data Sources. FIT Islamabad, special track on bioinformatics for academia and industry, 2006.
[11] Abdou, S. and J. Savoy, Searching in Medline: Query expansion and manual indexing evaluation. Inf. Process. Manage., 2008. 44(2): p. 781-789.
[12] Kiryakov, A., et al., Semantic Annotation, Indexing, and Retrieval, in The Semantic Web - ISWC 2003. 2003. p. 484-499.
[13] Lourenço, A., et al., BioDR: Semantic indexing networks for biomedical document retrieval. Expert Systems with Applications, 2010. 37(4): p. 3444-3453.
[14] Agosti, M., G. Bonfiglio-Dosio, and N. Ferro, A historical and contemporary study on annotations to derive key features for systems design. International Journal on Digital Libraries, 2007. 8(1): p. 1-19.
[15] Mutalik, P.G., A. Deshpande, and P.M. Nadkarni, Use of general-purpose negation detection to augment concept indexing of medical documents: a quantitative study using the UMLS. Journal of the American Medical Informatics Association: JAMIA, 2001. 8(6): p. 598-609.
[16] Chapman, W., A Simple Algorithm for Identifying Negated Findings and Diseases in Discharge Summaries. Journal of Biomedical Informatics, 2001. 34(5): p. 301-310.