
Processing Textual Information from Industrial Systems Using Semantic Networks

Abhinav Saxena, George Vachtsevanos
Georgia Institute of Technology, Atlanta GA 30332
{asaxena,gjv}@ece.gatech.edu

Abstract. A paradigm shift is emerging in system reliability and maintainability. The military and industrial sectors are moving away from traditional breakdown and scheduled maintenance and adopting concepts referred to as Condition Based Maintenance. In addition to signal processing and the subsequent diagnostic and prognostic algorithms, these new technologies require storage of large volumes of both quantitative and qualitative information, together with means to retrieve old cases from these case libraries and match them with a current problem. A semantic network based approach is presented for natural language processing of qualitative information available from industrial systems in the form of textual descriptions. Syntactic rules are used to extract relationships between the words, and the spatial arrangement is preserved using semantic networks. Unlike other current automated methods for manipulating text messages, which are computationally expensive, this technique takes advantage of the semi-structured nature of the text and the domain-limited vocabulary of industrial environments to create an architecture that processes textual information efficiently and effectively. Domain knowledge is taken into consideration while interpreting the text and creating the semantic networks. These semantic networks form a part of cases in a dynamic case based reasoning system, which constitutes an integral module of an integrated diagnosis-prognosis architecture. This approach assists in retrieving short text based cases by taking into account the semantic meaning of the sentence and not just conventional frequency based information.

1 Introduction

In the past decade a lot of effort has been put into condition based maintenance (CBM) and prognostic health management (PHM) of industrial systems. Most health maintenance systems rely solely on sensor measurements to diagnose a fault condition and make almost no use of descriptive textual knowledge. A lot of useful information is often available in the form of textual descriptions through maintenance logs and operator observations. The lack of adequate and computationally inexpensive natural language processing (NLP) methods has held CBM/PHM technology back from using such information. In practice an operator often does not even need to look explicitly at the sensor data to diagnose the problem and relies on observations and past experience instead. For complex systems, where classical physics based modeling and control techniques are very difficult, knowledge based systems offer simpler and more informed solutions. A hybrid reasoning architecture for integrated fault diagnosis and health maintenance of fleet vehicles is being developed at Georgia Tech [1].

[Figure 1: block diagram with the blocks System, Sensors, Feature Extraction, Diagnosis Module, Final Diagnosis, Repair Action + Explanation, Observations (Textual Data), Information Extraction, Initial Diagnosis, CBR Engine, Knowledge Base, Select Sensors, Select Features, Select Diagnostic Modules, Solution Evaluation, Solution Revision, Feedback Update Statistic.]

Fig. 1. Integrated reasoning architecture for fault diagnosis and prognosis in industrial environments. Diagnosis is carried out at two levels, first using the textual observations (initial diagnosis for fault localization) and second using relevant sensor data and analytical techniques (final diagnosis for fault detection and isolation).

The aim of this architecture is to research, develop and test advanced diagnostic and decision support tools for the maintenance of complex machinery (Fig. 1). A knowledge based AI approach to diagnostics has been proposed, with particular reference to Dynamic Case-Based Reasoning (DCBR). This system not only utilizes the sensor data for numerical analysis but also uses textual information from the system, which is usually available as fault symptoms in the form of activity logs or operator observations. These textual descriptions are short, semi-structured and often use a specialized domain vocabulary. Developing techniques for processing textual information is important because the use of standardized language in industrial environments is being promoted to meet the need for improved efficiency, accuracy and data interoperability [2]. With a structured syntax and a fixed domain vocabulary, the NLP effort required on such data can be greatly reduced. Whereas conventional NLP mostly uses only the syntactic relationships between the words, semantics based analysis can provide better performance and reasoning in unknown situations. Such semantic information can be preserved in the form of semantic networks, which can be used while querying the database for improved retrieval performance.

The next section highlights the efforts towards, and the importance of, using standardized natural language in industrial environments. A brief account of current NLP techniques is presented in the subsequent section, followed by the approach taken in this paper. The knowledge representation technique and analytical methods are discussed in later sections.

2 Natural Language Processing (NLP)

Natural language processing techniques enable more meaningful representation of texts. Conventional NLP techniques are computationally expensive, work only with small dictionaries, and are therefore not preferred. Further, they are not robust against poor grammar or missing and incorrect words. With the use of standardized languages and domain-limited vocabularies, not only can the size of texts be reduced but a formal reduced grammar can also be developed which is easy to use and to translate across different domains and languages.

2.1 Concept of Standardized Language

The concept of a standardized formal language offers several advantages over using non-standardized language in industrial environments. It not only helps reduce communication errors by avoiding ambiguities but also simplifies electronic textual data management and technology transfer between manufacturers, users and maintainers. Using a well defined documentation format and domain vocabulary makes the language globally interpretable and reduces multicultural language barriers. For instance, formal communication within the aviation maintenance domain is defined and regulated [3]. A hierarchy of written correspondence is defined in the Federal Aviation Regulations (FARs), which includes airworthiness directives (ADs), notices to airmen (NOTAMs), maintenance manuals, work cards, and other types of information that are routinely passed among manufacturers, regulators, and maintenance organizations. The international aviation maintenance community has adopted a restricted and highly structured subset of the English language to improve written communication, such as ATA-100 and AECMA Simplified English [2].

As in the aviation industry, standardized technical documentation is gaining importance in the manufacturing industry. Efforts are being made to enhance the ability to update support documents during the life cycle of a machine as it is maintained, modified or resold, so that they form a valuable archive of knowledge concerning safe and reliable operation of the machine. A lot of machine condition information is obtained as operator observations expressed as textual descriptions, which are rarely used in automating the health maintenance process. Yet these symptoms carry important information about the system that may not always be evident from sensory measurements. In order to eliminate or at least minimize potential ambiguities and other variances, protocols are established which consist of rules regarding which words, phrases, or other elements will be used for communication, their meaning, and the way they are connected with one another. The use of short and semi-structured sentences is promoted under the name of minimized language, where some further rules are imposed on the use of language. The next task is to develop methods that can process qualitative information effectively by taking advantage of this standardization, so that textual information can be used on a par with the sensory information.

2.2 NLP in Practice

Several Information Retrieval (IR) methods for textual data have been used which primarily depend on statistical counts of word occurrences and do not consider the semantic relationships between words. This results in several problems such as ambiguity of meaning and paraphrase expressions (different expressions for the same meaning). An alternative approach, n-gram matching, has been used to retrieve relevant documents, but this approach also does not permit integration of additional knowledge such as domain specific thesauri or glossaries. The use of Textual CBR (TCBR) has been proposed in [4] to explicitly allow the integration of semantic knowledge using some Natural Language Processing (NLP) and to establish an indexing vocabulary. Careful analysis of the domain is carried out to devise similarity measures that extend beyond statistical term weighting. NLP techniques like parts-of-speech tagging are used to tag the words in the texts and extract the basic linguistic structures. TCBR is typically built for specific domains to address the ambiguity problem. In another approach a feature vector is used to index text documents, and two approaches have been proposed to reduce its size: feature selection with boosting and feature generalization with association rules [5]. Feature selection helps with identifying discriminatory features while feature generalization captures semantic relationships. But this method still does not express semantic relationships explicitly. The use of graph based methods for TCBR has been described in [6]. Graphs offer several advantages over conventional feature vector based methods. They can create rich representations of cases. The structure and word order can be retained to capture relationships between two elements, and any number of elements can be added or deleted at will. So far TCBR has mainly been considered as a tool for the independent domain of books, web documents, reports, documentation, manuals and the like.

Elaborate methods have been developed which try to accommodate the complex structure and enormity of the language vocabulary and grammar [7]. The semi-structured nature of the textual description of symptoms in industrial environments offers an advantage in that an explicit language grammar is not required to establish their meaning. For example, "knocking sound from left panel" suffices to convey its meaning as an observation. The vocabulary is rather limited, and restricting it to a domain resolves the ambiguity problem to a large extent. The text is usually not very long and hence easy to organize in small and efficient data structures. Such descriptions mostly occur as records of experts' experience and are often concisely documented for future reference only, not for explicit use in problem solving. However, there is a fair amount of information embedded in such descriptions, which not only supports the diagnosis from sensor data but also suggests other possibilities that were perhaps not documented with the related incidents. They can contain the explanations and solutions to the problems in descriptive form as well. The hypothesis proposed here maintains that the simpler structure of the text can compensate for the computational complexity usually involved with NLP and search methods, and that the dense and concise nature of such information can improve the degree of belief established by data driven diagnostics alone. This will also provide a mechanism to organize the human experience accumulated over time and use it effectively, rather than just storing the sensory information.

2.3 NLP and Knowledge Representation for Industrial Systems

In order to process a sentence of a language, the tokens of the language must first be isolated and identified. For NLP, lexical processing operates at the single word level and involves identifying words and determining their grammatical classes or parts-of-speech before a higher level of language analysis can take place [8]. A shallow-NLP technique, which tags each word with its probable class (such as noun or verb) and identifies the corresponding word stem, has been suggested in [4]. This method is both efficient and robust compared to other complex NLP techniques. A PC based demo version of the TreeTagger tool [9], developed at the Institute of Natural Language Processing (IMS) at Stuttgart University, was used within the limits of its capabilities.

This version did not allow modifications to the associated dictionary, which could have been useful for adapting the dictionary to the industrial domain and reducing ambiguity. For instance, words like gear and bearing can be annotated with 'noun', referring to mechanical components only, thereby removing the annotation 'verb' altogether. Further, new words could not be added to include domain specific technical vocabulary. The complete version also provides the probabilities associated with a word if there is more than one possibility for its tag. This capability, which could have been helpful in ambiguity reduction, was also not available in the demo version. The output of the program consists of three columns: the first contains the original word as it appears in the sentence, the second contains the tag abbreviation (e.g. NNP for proper noun, VB for verb) and the last contains the stem of the word (e.g. 'run' for 'running' and 'be' for 'is'). Figure 2 shows a snapshot of the output file from TreeTagger.
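The three-column output described above is simple to load into a program. The following is a minimal sketch, assuming the columns are tab-separated in the order word, tag, stem; the helper names `TaggedToken` and `parse_tagged`, as well as the sample lines, are illustrative and hand-written, not actual TreeTagger output.

```python
from typing import List, NamedTuple


class TaggedToken(NamedTuple):
    word: str  # original word as it appears in the sentence
    tag: str   # part-of-speech tag abbreviation
    stem: str  # stem (lemma) of the word


def parse_tagged(text: str) -> List[TaggedToken]:
    """Parse tab-separated 'word<TAB>tag<TAB>stem' lines into tokens."""
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 3:
            raise ValueError(f"expected 3 columns, got {len(parts)}: {line!r}")
        tokens.append(TaggedToken(*parts))
    return tokens


# Hypothetical sample; the tags are hand-assigned for illustration only.
SAMPLE = "Transmission\tNN\ttransmission\nslips\tVBZ\tslip\n"

tokens = parse_tagged(SAMPLE)
stems = [t.stem for t in tokens]
```

Keeping the stem alongside the surface word is what later allows the semantic networks to be built from stems only.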

Fig. 2. TreeTagger output

After all the words have been tagged, a set of syntactic rules is employed to extract relationships between different words in the sentence. In all, three types of rules were employed.

The first set of rules links different words based on their POS category and word order. These linkages are transformed into small units called triads that form the basic units of a semantic network. Triads are described in detail in the following discussion. For example, consider the sentence "Transmission will not downshift with accelerator fully depressed". In this sentence downshift should be associated with transmission, and depressed with accelerator, at the first level. These two pairs should then be linked with each other at the next level: (transmission will not downshift) – when – (accelerator is fully depressed).

The second set of rules combines multiple words if they together define an object or a situation, e.g.

Engine does not start → Engine does not_start

This takes care of negation, as the word start would not be matched to not_start appearing in a different situation. Similarly, two nouns are combined if they appear together, as they most likely describe a single object, e.g.

Transmission fluid has burnt smell → Transmission_fluid has burnt smell

This distinguishes between terms like transmission, transmission fluid and brake fluid, which are different contexts involving the same words. However, it does not entirely disregard the similarity between such instances, as they do provide some information about the location of the fault and may be useful in providing an alternative hypothesis if the better matching hypothesis has been confirmed false.

The third set of rules involves sentences containing conjunctions (and, or, /). It creates multiple associations to an object to accommodate all descriptions connected via these conjunctions, e.g.

Horn inoperative or unsatisfactory in operation → (Horn is inoperative), (Horn is unsatisfactory in operation)

Here the same object horn is associated with two different conditions, and a match to either of them leads to a similar diagnosis. A rule base for a variety of symptoms has been created and applied to a test data set. The size of this rule base is limited by virtue of the standardized language.
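The second and third rule sets can be illustrated with a short sketch. This is not the authors' rule base: the function names, the coarse tag labels (NOUN, VERB, ...) and the hand-tagged inputs are assumptions made for illustration, and only one pairwise merge per position is attempted.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (stem, coarse POS tag); tags are hand-assigned here

CONJUNCTIONS = {"and", "or", "/"}


def merge_compounds(tokens: List[Token]) -> List[Token]:
    """Rule set 2 (sketch): fold 'not' into the next verb, join adjacent nouns."""
    merged: List[Token] = []
    i = 0
    while i < len(tokens):
        word, tag = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if word == "not" and nxt is not None and nxt[1] == "VERB":
            merged.append(("not_" + nxt[0], "VERB"))  # negation stays attached
            i += 2
        elif tag == "NOUN" and nxt is not None and nxt[1] == "NOUN":
            merged.append((word + "_" + nxt[0], "NOUN"))  # noun pair = one object
            i += 2
        else:
            merged.append((word, tag))
            i += 1
    return merged


def split_conjunctions(obj: str, description: List[str]) -> List[Tuple[str, str]]:
    """Rule set 3 (sketch): one (object, condition) pair per conjunct."""
    pairs: List[Tuple[str, str]] = []
    current: List[str] = []
    for word in description:
        if word in CONJUNCTIONS:
            if current:
                pairs.append((obj, " ".join(current)))
            current = []
        else:
            current.append(word)
    if current:
        pairs.append((obj, " ".join(current)))
    return pairs


negated = merge_compounds(
    [("engine", "NOUN"), ("do", "AUX"), ("not", "ADV"), ("start", "VERB")]
)
compound = merge_compounds(
    [("transmission", "NOUN"), ("fluid", "NOUN"), ("have", "VERB"),
     ("burn", "ADJ"), ("smell", "NOUN")]
)
horn = split_conjunctions(
    "horn", ["inoperative", "or", "unsatisfactory", "in", "operation"]
)
```

Because `not_start` becomes a single token, a later match on `start` cannot accidentally hit the negated case, which is the behaviour the text describes.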

The performance of this rule base is illustrated with various examples in later sections. It was found that most sentences could be broken into smaller segments, each of which independently defines a relevant concept and can be represented as a triad. These triads can be combined to form higher-level triads in order to create the corresponding semantic networks (Fig. 3). Three basic relations (link types) were defined that explain most of the relationships between words and triads. These relations are shown in Table 1.

[Figure 3: a relation node R1 linked to two phrase nodes P1 and P2.]

Fig. 3. A triad consists of two phrases (P1 and P2) and a relation (R1). These phrases can be words or other triads.

Table 1. Three relations capture most of the scenarios

A --IN-- B → A in_condition B, or A when B
    Here A is typically a noun or a triad and B is mostly a triad.
    e.g. "Transmission fluid has burnt smell" translates into (Transmission_fluid) --IN-- (smell --IS-- burn)

A --IS-- B → A has_property B, or A is_type B
    Here A is typically a noun or a triad and B is an adjective.
    e.g. "Transmission fluid has burnt smell" translates into (Transmission_fluid) --IN-- (smell --IS-- burn)

A --AT-- B → A in_state B, or A exhibits_state B
    Here A is typically a noun or a triad and B is a verb.
    e.g. "Transmission slips" translates into (Transmission) --AT-- (slip)

A triad τ is a three-tuple consisting of two phrases p1, p2 and a relationship r between them. A phrase can be a noun, adjective, verb or a triad itself. The set of relations, R, is a finite set of three relations as described above.

$$\tau = \langle p_1, p_2, r \rangle, \qquad p_1, p_2 \in W \cup T, \qquad r \in R = \{\mathrm{IS}, \mathrm{IN}, \mathrm{AT}\}$$

where W is the set of all words in the sentence and T is the set of all triads in the sentence.
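The three-tuple definition maps naturally onto a small recursive data type. The sketch below is one possible encoding, not the authors' implementation; the names `Relation`, `Triad` and `Phrase` are assumptions introduced here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class Relation(Enum):
    IS = "IS"  # A has_property B / A is_type B
    IN = "IN"  # A in_condition B / A when B
    AT = "AT"  # A in_state B / A exhibits_state B


@dataclass(frozen=True)
class Triad:
    p1: "Phrase"  # a word stem or another triad
    p2: "Phrase"
    r: Relation

    def __str__(self) -> str:
        return f"({self.p1} --{self.r.value}-- {self.p2})"


Phrase = Union[str, Triad]

# "Transmission fluid has burnt smell" from Table 1
burnt_smell = Triad("smell", "burn", Relation.IS)
fluid_net = Triad("transmission_fluid", burnt_smell, Relation.IN)

# "Transmission slips" from Table 1
slips = Triad("transmission", "slip", Relation.AT)

rendered = str(fluid_net)
```

Making the class frozen keeps triads hashable, so identical sub-triads can be compared or stored in sets when cases are indexed.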

Both phrases in a Type-I triad are single words and do not involve triads. Type-II triads, on the other hand, contain one or more triads. This makes the semantic network a binary tree in which words occur only at the leaves and all other nodes are relations. In order to improve uniformity across different usages of the same words, only the stems of the words are included in the triads. The following examples show various possibilities.

[Figure 4: two example networks. Type-II triad for "Transmission has no drive in reverse gear": (TRANSMISSION --AT-- NO DRIVE) --IN-- (GEAR --IS-- REVERSE). Type-I triad for "Transmission slips": (TRANSMISSION --AT-- SLIP).]

Fig. 4. A semantic network consists of triads. A Type-II triad consists of one or more triads. Only the stems of the words are included in the semantic networks
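The binary-tree view of a semantic network can be checked with a few lines of code. This sketch encodes each triad as a plain `(p1, p2, r)` tuple, which is an assumption made for brevity; the two networks are the Fig. 4 examples as read from the figure, and the helper names are hypothetical.

```python
from typing import List, Union

Phrase = Union[str, tuple]  # a word stem, or a nested (p1, p2, r) triad


def triad_type(triad: tuple) -> str:
    """Type-I if both phrases are words, Type-II if either is a triad."""
    p1, p2, _ = triad
    return "I" if isinstance(p1, str) and isinstance(p2, str) else "II"


def leaves(phrase: Phrase) -> List[str]:
    """Word stems at the leaves, in left-to-right order."""
    if isinstance(phrase, str):
        return [phrase]
    p1, p2, _ = phrase
    return leaves(p1) + leaves(p2)


def relations(phrase: Phrase) -> List[str]:
    """Relations at the internal nodes, in in-order traversal."""
    if isinstance(phrase, str):
        return []
    p1, p2, r = phrase
    return relations(p1) + [r] + relations(p2)


# "Transmission slips"
slips = ("transmission", "slip", "AT")

# "Transmission has no drive in reverse gear"
no_drive = (("transmission", "no_drive", "AT"), ("gear", "reverse", "IS"), "IN")

words = leaves(no_drive)
links = relations(no_drive)
```

In a full binary tree the number of leaves exceeds the number of internal nodes by one, which gives a cheap sanity check on any network built this way.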

Examples: To evaluate this technique a simple data set was acquired from an automotive troubleshooting website which lists several symptoms and their possible diagnosis and repair [10]. Very short descriptions are listed using common vehicle terminology as a car mechanic would use it. With slight effort these descriptions were engineered to fit the description of a standardized language. For example, all sentences were converted to active voice. The use of conjunctions and determiners was reduced as far as possible. Very long and complex sentences were broken into smaller ones. Since a domain-limited customized vocabulary could not be incorporated in TreeTagger, the use of ambiguous words was reduced. Since the data set was fairly small, these modifications were carried out manually. It is expected, however, that with the promotion of standardized language this step may not be required. Figure 5 shows some typical descriptions and the corresponding semantic networks as created by our programs.

Engine will not start in any gear
sNet(1) = ((engine -- not_start) -- any_gear)

Engine started in gear other than Park or Neutral
sNet(2) = (((engine -- start) -- (gear -- other)) -- park,neutral)

Transmission shifted roughly
sNet(3) = (transmission -- shift_roughly)

Problems in gear selection
sNet(4) = (problem -- gear_selection)

Transmission will not downshift (kickdown) with accelerator fully depressed
sNet(5) = ((transmission -- not_downshift) -- (accelerator -- fully_depress))

Transmission noisy in neutral with engine running
sNet(6) = (((transmission -- noisy) -- neutral) -- (engine -- run))

Fig. 5. Semantic networks for six different descriptions using the three relations described above. For clarity each triad is enclosed between a pair of parentheses.

Similarity Calculation: To retrieve matching cases, a similarity metric needs to be established between the target case and the cases in the case base. For these semantic networks, similarity assessment is based on matching the triads. First the lowest level of triads is matched. For example, the input semantic net (Transmission --IS-- Noisy) --IN-- (Gear --IS-- Neutral) consists of three triads:

τ1: (Transmission --IS-- Noisy)
τ2: (Gear --IS-- Neutral)
τ3: (τ1 --IN-- τ2)

All semantic nets that contain similar Type-I triads (e.g. τ1 or τ2 in this case) will be retrieved. The similarity of two triads is computed based on how closely their three constituents match. The query semantic net above will be matched with the semantic net (Transmission --AT-- No_Drive) --IN-- (Gear --IS-- Neutral) in the following manner. For triads with links AT or IS, the weights associated with p1, p2 and r are 0.5, 0.3 and 0.2 respectively, acknowledging the fact that matching the component (p1) is more important than matching its exact condition as far as localizing the fault is concerned. Fault identification may require more accurate condition matching, but here the emphasis is on fault localization, and identification is left to further diagnosis using dedicated diagnostic algorithms. For triads containing the link IN, the weights associated with p1, p2 and r are 0.4, 0.4 and 0.2 respectively, as both p1 and p2 convey equally important information here. For a Type-II triad the match value of a phrase that is itself a triad is the similarity already computed for that triad. These weights can be assigned through expert experience or learned from the data. Table 2 shows a step-by-step procedure for an example.

Table 2. Similarity calculation for triad based semantic networks

τ  | INPUT | OUTPUT | SimVal
τ1 | (Transmission --IS-- Noisy) | (Transmission --AT-- No_Drive) | (1*0.5 + 0*0.2 + 0*0.3) = 0.5
τ2 | (Gear --IS-- Neutral) | (Gear --IS-- Neutral) | (1*0.5 + 1*0.2 + 1*0.3) = 1
τ3 | (Transmission --IS-- Noisy) --IN-- (Gear --IS-- Neutral) | (Transmission --AT-- No_Drive) --IN-- (Gear --IS-- Neutral) | (0.5*0.4 + 1*0.2 + 1*0.4) = 0.8
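The weighted matching of Table 2 can be written as a short recursive function. The sketch below is an illustration under stated assumptions rather than the authors' code: triads are plain `(p1, p2, r)` tuples of lowercase stems, words match only on exact equality, a word never matches a triad, and the weights are chosen by the relation of the query triad.

```python
from typing import Union

Phrase = Union[str, tuple]  # a word stem, or a nested (p1, p2, r) triad

# (weight of p1, weight of p2, weight of r) per relation, as given in the text
WEIGHTS = {
    "IS": (0.5, 0.3, 0.2),
    "AT": (0.5, 0.3, 0.2),
    "IN": (0.4, 0.4, 0.2),
}


def phrase_similarity(a: Phrase, b: Phrase) -> float:
    """Exact match for words; recursive similarity for nested triads."""
    if isinstance(a, str) and isinstance(b, str):
        return 1.0 if a == b else 0.0
    if isinstance(a, tuple) and isinstance(b, tuple):
        return triad_similarity(a, b)
    return 0.0  # assumption: a word and a triad do not match


def triad_similarity(query: tuple, case: tuple) -> float:
    """Weighted sum of the p1, p2 and relation matches."""
    q1, q2, qr = query
    c1, c2, cr = case
    w1, w2, wr = WEIGHTS[qr]  # assumption: weights follow the query's relation
    return (
        w1 * phrase_similarity(q1, c1)
        + w2 * phrase_similarity(q2, c2)
        + wr * (1.0 if qr == cr else 0.0)
    )


# The query and stored networks from Table 2
query = (("transmission", "noisy", "IS"), ("gear", "neutral", "IS"), "IN")
stored = (("transmission", "no_drive", "AT"), ("gear", "neutral", "IS"), "IN")

sim_t1 = triad_similarity(query[0], stored[0])  # about 0.5
sim_t2 = triad_similarity(query[1], stored[1])  # about 1.0
sim_t3 = triad_similarity(query, stored)        # about 0.8
```

Exact string equality is the simplest possible word match; a domain thesaurus or a graded word similarity could replace `phrase_similarity` for words without changing the recursion.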

The next section describes briefly the application domain keeping the relevance and importance of the NLP approach described above in perspective.

4 Application

As mentioned in the introduction, a hybrid reasoning architecture for integrated fault diagnosis and health maintenance of fleet vehicles is being developed at Georgia Tech [1]. As shown in the schematic (Fig. 1), qualitative information is used as the initial query. Textual descriptions are converted into semantic networks in the manner described above. The case base is searched using these semantic networks, and the relevant hypotheses explaining the symptoms are generated. These hypotheses are ranked based on past experience, and the most probable hypothesis is tested first by automatically activating the relevant data acquisition and the corresponding diagnostic techniques. These techniques reside in a knowledge base very closely coupled to the case base of the CBR engine. If the hypothesis is confirmed by investigation of the relevant sensor data, its solution is suggested for the current situation and its success rate is positively updated. Otherwise the next most probable hypothesis is tested and the corresponding success rates are updated. The procedure is repeated until a useful solution is obtained or a new case is generated and stored in the case base.
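The test-and-update loop described above can be sketched as follows. The ranking criterion (past success rate first, network similarity as a tie-breaker), the `Hypothesis` fields, and the `confirm` callback standing in for the sensor based diagnostic step are all assumptions made for illustration; the hypothesis names are invented.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Hypothesis:
    name: str
    similarity: float  # score from semantic-network matching
    successes: int = 0
    trials: int = 0

    @property
    def success_rate(self) -> float:
        return self.successes / self.trials if self.trials else 0.0


def diagnose(
    hypotheses: List[Hypothesis],
    confirm: Callable[[Hypothesis], bool],
) -> Optional[Hypothesis]:
    """Test hypotheses in ranked order, updating success statistics."""
    ranked = sorted(
        hypotheses,
        key=lambda h: (h.success_rate, h.similarity),
        reverse=True,
    )
    for hypothesis in ranked:
        hypothesis.trials += 1
        if confirm(hypothesis):  # stand-in for sensor based confirmation
            hypothesis.successes += 1
            return hypothesis
    return None  # nothing confirmed: the caller would store a new case


# Invented hypotheses with invented statistics
candidates = [
    Hypothesis("low fluid level", similarity=0.8, successes=3, trials=4),
    Hypothesis("worn clutch", similarity=0.9, successes=1, trials=4),
]
result = diagnose(candidates, confirm=lambda h: h.name == "worn clutch")
```

Returning `None` leaves the decision to create a new case with the caller, which keeps the loop itself free of case-base details.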

5 Discussion

Since the semantic networks are created using the stems of the corresponding words, the retrieval step is expected to consider all cases that use the same word in any of its grammatical forms. Further, to resolve the ambiguity issue, a reduced domain vocabulary can be established to fix the meaning of ambiguous words. For example, in the mechanical domain bearing is a component, i.e. a noun, and not the present participle of the verb bear. Unlike most NLP techniques, this approach also takes negation into consideration, which is very important for a CBR system [11]. Even though similarity calculation is a subgraph isomorphism problem in graph theory, which is NP-complete [12], the small size of the semantic networks obtained from brief technical descriptions reduces this problem to a large extent.
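A reduced domain vocabulary of this kind could be applied as a post-processing step on the tagger output. The following is a minimal sketch under that assumption; the lexicon entries, the tag abbreviations and the function name `apply_domain_lexicon` are illustrative and not part of TreeTagger.

```python
from typing import Dict, List, Tuple

TaggedWord = Tuple[str, str, str]  # (word, tag, stem)

# Hypothetical domain lexicon: word -> (fixed tag, fixed stem)
DOMAIN_LEXICON: Dict[str, Tuple[str, str]] = {
    "bearing": ("NN", "bearing"),
    "gear": ("NN", "gear"),
}


def apply_domain_lexicon(
    tokens: List[TaggedWord],
    lexicon: Dict[str, Tuple[str, str]] = DOMAIN_LEXICON,
) -> List[TaggedWord]:
    """Override the tagger's tag and stem for words fixed by the domain."""
    fixed: List[TaggedWord] = []
    for word, tag, stem in tokens:
        override = lexicon.get(word.lower())
        fixed.append((word, *override) if override else (word, tag, stem))
    return fixed


# Hand-written tagger output in which 'Bearing' was read as a verb form
tagged = [("Bearing", "VBG", "bear"), ("noisy", "JJ", "noisy")]
corrected = apply_domain_lexicon(tagged)
```

With the override in place, the stem stored in the triad is `bearing` rather than `bear`, so cases about the component are not confused with unrelated uses of the verb.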

6 Conclusions

It has been shown that short and semi-structured technical textual descriptions can be abstracted using three simple relations and represented as semantic networks. These structures not only provide a means of representing textual descriptions in a structured manner but also preserve the semantic meaning of the sentence. They form a part of cases in a dynamic case based reasoning system, which constitutes an integrated diagnosis and prognosis architecture for industrial systems. This helps in retrieving short text based cases to generate an initial hypothesis, thereby reducing the search space considerably for further diagnosis.

References

1. Saxena, A., Wu, B., Vachtsevanos, G.: Integrated Diagnosis and Prognosis Architecture for Fleet Vehicles Using Dynamic Case Based Reasoning, appearing in IEEE Autotestcon 2005
2. Greenough, R.: e-Smart – Electronic Support of Manufacturing Technology Work Package Report, Cranfield University, UK
3. Drury, C.G., Ma, J.: Language Errors in Aviation Maintenance, Year 1 Interim Report for Federal Aviation Administration, State University of New York Buffalo, 2003
4. Lenz, M., Hubner, A., Kunze, M.: Textual CBR. In: Lenz, M., Bartsch-Spörl, B., Burkhard, H., Wess, S. (eds.): Case based Reasoning Technology: From Foundations to Applications, Lecture Notes in Artificial Intelligence 1400, Springer-Verlag, 1998
5. Nirmalie, W., Ivan, K.: "Feature Selection and Generalization for Retrieval of Textual Cases", Berlin Heidelberg, Springer-Verlag, 2004
6. Cunningham, C.M., Weber, R., Proctor, J.M., Fowler, C., Murphy, M.: Investigating Graphs in Textual Case-Based Reasoning. In: Funk, P., Gonzalez, C, Pedro, A. (eds.): Advances in Case-Based Reasoning (Lecture Notes in Artificial Intelligence), Vol. 3155, Springer-Verlag
7. Schenker, A., Last, M., Bunke, H., Kandel, A.: Clustering of Web Documents using a Graph Model. In: Antonacopoulos, A., Hu, J. (eds.): Web Document Analysis: Challenges and Opportunities, pp 1-16, 2003
8. Smeaton, A. F.: Progress in the Application of Natural Language Processing to Information Retrieval Tasks, The Computer Journal, issue 3, 1992
9. Schmid, H.: Probabilistic Part-of-Speech Tagging Using Decision Trees, in International Conference on New Methods in Language Processing, Manchester, UK, 1994
10. Vehicle Fault Finding: http://peugeot.mainspot.net/fault_find/index.shtml
11. Ashley, K.: Progress in Text-Based Case-Based Reasoning. Invited talk at the Third International Conference on Case-Based-Reasoning, Seeon, Germany, 1999
12. Johnson, D. S.: Computers and Intractability, W.H. Freeman and Company, New York, 1979