ARTICLE IN PRESS Available online at www.sciencedirect.com

Expert Systems with Applications xxx (2007) xxx–xxx www.elsevier.com/locate/eswa

An ontological website models-supported search agent for web services Sheng-Yuan Yang

* Department of Computer and Communication Engineering, St. John’s University, 499, Sec. 4, TamKing Road, Tamsui, Taipei County 25135, Taiwan, ROC

Abstract

In this paper, we advocate the use of ontology-supported website models to provide a semantic level solution for a search engine so that it can provide fast, precise and stable search results with a high degree of user satisfaction. A website model contains a website profile along with a set of webpage profiles. The former remembers the basic information of a website, while the latter contains the basic information, statistics information, and ontology information about each webpage stored in the website. Based on this concept, we have developed a Search Agent which manifests the following interesting features: (1) Ontology-supported construction of website models, by which we can attribute correct domain semantics to the Web resources collected in the website models. One important technique used here is ontology-supported classification (OntoClassifier). Our experiments show that OntoClassifier performs very well in producing accurate and stable webpage classification to support correct annotation of domain semantics. (2) Website models-supported website model expansion, by which we can collect Web resources based on both user interests and domain specificity. The core technique here is a Focused Crawler which employs progressive strategies to do user query-driven webpage expansion, autonomous website expansion, and query results exploitation to effectively expand the website models. (3) Website models-supported webpage retrieval, by which we can leverage the power of ontology features as a fast index structure to locate the webpages the user needs most.

© 2007 Elsevier Ltd. All rights reserved.

Keywords: Ontology; Website models; Search agents; Web services

1. Introduction

In this information-exploding era, the user expects to spend little time retrieving truly useful information, rather than spending plenty of time only to end up with lots of garbage. Current general search engines, however, produce so many entries that they often overwhelm the user. For example, Fig. 1 shows six of over 4000 entries returned by Google, a well-known robot-based search engine, for a query asking how to assemble a cheap and practical computer. The user usually gets frustrated after a series of visits to these entries, discovering that dead entries are everywhere, irrelevant entries are equally abundant, and what he gets is not exactly what he wants. It is increasingly clear that

* Tel.: +886 2 28013131x6394; fax: +886 28013131x6391. E-mail address: [email protected]

one has to properly narrow down the scope of search and cleverly sort or combine the returned sites in order to benefit much from the Internet. This process is hard for general users, and even for experts. Two major factors are behind this difficulty. First, the Web is huge: it is reported that six major public search engines (AltaVista, Excite, HotBot, Infoseek, Lycos, and Northern Light) collectively covered only about 60% of the Web, and the largest coverage of a single engine was about one third of the estimated total size of the Web (Lawrence & Giles, 2000). An empirical study also indicates that no single search engine could return more than 45% of the relevant results (Selberg & Etzioni, 1995). Second, most pre-defined index structures used by search engines are complex and inflexible. Lawrence and Giles (1999) report that indexing of new or modified pages in one of the major search engines could take months. Even if we follow the well-organized Web directory structure of a search engine, it is still easy to get lost in the complex category

0957-4174/$ - see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2007.09.024

Please cite this article in press as: Yang, S.-Y., An ontological website models-supported search agent for web services, Expert Systems with Applications (2007), doi:10.1016/j.eswa.2007.09.024


Fig. 1. A query example.

indexing jungle (see Fig. 2 for an example). Current domain-specific search engines do help users narrow down the search scope through the techniques of query expansion, automatic classification and focused crawling; their weakness, however, is that they almost completely ignore user interests (Wang, 2003). In general, current search engines face two fundamental problems. First, the index structures are usually very different from what the user conjectures about his problems. Second, the classification/clustering mechanisms for data hardly reflect the physical meanings of the domain concepts. These problems stem from a more fundamental

problem: lack of semantic understanding of Web documents. New standards for representing website documents, including XML (Henry, David, Murray, & Noah, 2001), RDF (Brickley & Guha, 2004), DOM (Arnaud et al., 2004), Dublin metatag (Weibel, 1999), and WOM (Manola, 1998), can help cross-referencing of Web documents; alone, however, they cannot help the user at any semantic level during the search for website information. OIL (2000), DAML (2003), DAML+OIL (2001), and the concept of ontology offer a possible remedy for attributing information semantics. In this paper, we advocate the use of ontology-supported website models (Yang,

Fig. 2. A Web directory example.



2006) to provide a semantic level solution for a search engine so that it can provide fast, precise and stable search results with a high degree of user satisfaction. Basically, a website model consists of a website profile for a website and a set of webpage profiles for the webpages contained in the website. Each webpage profile describes how the corresponding webpage is interpreted by the domain ontology, while a website profile describes how a website is interpreted by the semantics of the contained webpages. The website models are closely connected to the domain ontology, which supports the following functions used in website model construction and application: query expansion, webpage annotation, webpage/website classification (Yang, 2006), and focused collection of domain-related and user-interested Web resources (Yang, 2006). We have developed a Search Agent with website models as its core technique, which helps the agent successfully tackle the problems of search scope and user interests. Our experiments show that the Search Agent can locate, integrate and update both domain-related and user-interested Web resources in the website models for ready retrieval. The personal computer (PC) domain is chosen as the target application of our Search Agent and will be used for explanation in the remaining sections. The rest of the paper is organized as follows. Section 2 develops the domain ontology. Section 3 describes website models and how they are constructed. Section 4 illustrates how website models can be used to do better Web search. Section 5 describes the design of our search agent and reports how it performs. Section 6 discusses related work, while Section 7 concludes the work.

2. Domain ontology as the first principles

Ontology is a method of conceptualization on a specific domain (Noy & Hafner, 1997). It plays diverse roles in developing intelligent systems, for example, knowledge sharing and reuse (Decker et al., 2000, 1998) and semantic analysis of languages (Moldovan & Mihalcea, 2000). Development of an ontology for a specific domain is not yet an engineering process, but it is clear that an ontology must include descriptions of explicit concepts and their relationships in a specific domain (Ashish & Knoblock, 1997). We outlined a principled construction procedure in Yang and Ho (1999); following that procedure we have developed an ontology for the PC domain. Fig. 3 shows part of the PC ontology taxonomy. The taxonomy represents relevant PC concepts as classes and their parent–child relationships as isa links, which allow inheritance of features from parent classes to child classes. We then carefully selected those properties of each concept that are most related to our application and used them as attributes to define the corresponding class. Fig. 4 exemplifies the definition of the ontology class ‘‘CPU’’. In the figure, the uppermost node uses various fields to define the semantics of the CPU class, each field representing an attribute of ‘‘CPU’’, e.g., interface, provider, synonym, etc. The nodes at the bottom level represent various CPU instances that capture real-world data. The arrow labeled ‘‘io’’ denotes the instance-of relationship. Our ontology construction tool is Protégé 2000 (Noy & McGuinness, 2001), and the complete PC ontology can be referenced from the Protégé Ontology Library at the Stanford website (http://protege.stanford.edu/ontologies.html). Fig. 5 demonstrates how the ontology looks in Protégé 2000, where the left column represents the taxonomy hierarchically and the right column contains the attributes of the selected class node. The example shows that the CPU ontology contains synonyms, along with a bunch of attributes and constraints on their values. Although the domain ontology was developed in Chinese (shown in English here for easy explanation), the corresponding English names are treated as synonyms and can be processed by our system too.

In order to facilitate Web search, the domain ontology was carefully pre-analyzed with respect to how attributes are shared among different classes and then re-organized into Fig. 6. Each square node in the figure contains a set of representative ontology features for a specific class, while each oval node contains related ontology features between two classes. The latter represents a new node type called ‘‘related concept’’. We select representative ontology features for a specific class by first deriving a set of

[Fig. 3 (taxonomy diagram): Hardware has isa links to subclasses including Interface Card, Power Equipment, Memory, Storage Media, Case, and Network Chip; Interface Card subsumes Sound Card, Display Card, SCSI Card, and Network Card; Power Equipment subsumes Power Supply and UPS; Memory subsumes Main Memory and ROM; Storage Media subsumes ZIP and Optical, which covers CD, DVD, CDR, and CDR/W.]

Fig. 3. Part of PC ontology taxonomy.


[Fig. 4 (class diagram): the uppermost node defines the CPU class through fields such as Synonym (= Central Processing Unit), D-Frequency, Interface, L1 Cache, Abbr., CPU Slot, Volume Spec., and CPU Spec., with value types such as String and Instance. Bottom-level nodes, connected by io (instance-of) arrows, are CPU instances such as XEON, THUNDERBIRD 1.33G, DURON 1.2G, CELERON 1.0G, PENTIUM 4 1.8AGHZ, PENTIUM 4 2.0AGHZ, and PENTIUM 4 2.53AGHZ, each carrying attribute values, e.g., Factory = Intel or AMD, Interface = Socket A, Socket 370, or Socket 478, L1 Cache = 8KB, 32KB, 64KB, or 128KB, Abbr. = P4, Celeron, Athlon, or Duron.]

Fig. 4. Ontology class CPU.

candidate terms from pre-selected training webpages belonging to the class. We then compare them with the attributes of the corresponding ontology class; those candidate terms that also appear in the ontology are singled out and dubbed the representative ontology features of the class. The representative features are then removed from the set of candidate terms, and the remaining candidate terms are compared with the attributes of other ontology classes. For any other ontology class that contains some of these candidate terms, we add a related concept node to relate it to the class. Fig. 7 takes CPU and motherboard as two example classes and shows what their related concept node looks like. The figure shows a related concept node between two classes; in fact, we may have related concept nodes among three or more classes too. For instance, in Fig. 6 we have a related concept node that relates CPU, motherboard and SCSI Card together. Table 1 illustrates related concept nodes of different levels, where level n means the related concept node relates n classes together. Under this definition, level 1 reduces to the representative features of a specific class. Thus, the term ‘‘graphi’’ at level 3 means it appears in three classes: graphic card, monitor, and motherboard. This design clearly structures the semantics of ontology classes and their relationships and can facilitate the building of semantics-directed website models to support Web search.

Fig. 5. PC ontology shown on Protégé 2000.
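The two-step selection just described reduces to simple set operations. The following is a hypothetical sketch: the function names and the CPU/Motherboard term sets are ours, not the paper's data.

```python
def select_features(candidates, ontology_attrs):
    """Split a class's candidate terms (mined from training webpages) into
    representative features (those also found among the class's ontology
    attributes) and leftover terms to be checked against other classes."""
    representative = candidates & ontology_attrs
    return representative, candidates - representative

def related_concepts(leftovers, other_classes):
    """Add a related concept node for every other ontology class that
    shares some of the leftover candidate terms."""
    return {name: leftovers & attrs
            for name, attrs in other_classes.items()
            if leftovers & attrs}

# Illustrative data only:
cpu_candidates = {"socket", "478pin", "l2", "raid", "mmx"}
cpu_attrs = {"socket", "478pin", "l2", "mmx", "fsb"}
rep, rest = select_features(cpu_candidates, cpu_attrs)
related = related_concepts(rest, {"Motherboard": {"raid", "bios", "atx"}})
```

Here `rep` holds the representative CPU features, while `related` records a related concept node shared with Motherboard.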

[Fig. 6 (diagram): PC Hardware is the superclass of the reference classes Graphic Card, Optical Drive, Sound Card, Hard Drive, Network Card, Monitor, CPU, Modem, SCSI Card, and Motherboard; oval Related Concept nodes connect pairs (and larger groups) of these classes.]

Fig. 6. Part of re-organized PC ontology.


[Fig. 7 (diagram): the CPU and Motherboard classes with their representative feature sets and the related concept node between them. Terms shown include 3dnow, Sse, Sse2, Mmx2, L1, Ondie, Pipline, Superscalar, Fcpga, 0.13, 64k; CPU, Intel, AMD, Cyrix, Socket, Slot, L2, Secc, Secc2, Petium, Celeron, Athlon, Duron, Morgan, Northwood, Tualatin, Fsb, K7, 478pin, 423pin (representative); Motherboard, Onboard, Atx, Amr, Cnr, Bank, Com1, Raid, Bios, Northbridge, Southbridge (representative).]

Fig. 7. Re-organized PC ontology.

Table 1
Example of related concept nodes of different levels (after stemming)

Level 3:  ddr, dvi, graphi, inch, kbp, khz, raid
Level 4:  bandwidth, microphone, network, scsi
Level 9:  channel, connector, extern, mhz, plug, usb
Level 10: intern, memoir, output, pin
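The level notion of Table 1 (a term at level n occurs in the feature sets of n classes) can be sketched as follows; the feature sets below are illustrative, not the paper's data.

```python
from collections import defaultdict

def concept_levels(class_features):
    """Group terms by the number of ontology classes whose feature sets
    contain them; level 1 corresponds to representative features."""
    owners = defaultdict(set)
    for cls, feats in class_features.items():
        for f in feats:
            owners[f].add(cls)
    levels = defaultdict(list)
    for term, classes in owners.items():
        levels[len(classes)].append(term)
    return dict(levels)

# Illustrative feature sets; "graphi" lands at level 3 because it appears
# in three classes, matching the example discussed in the text.
demo = {
    "graphic card": {"graphi", "dvi", "agp"},
    "monitor":      {"graphi", "inch", "dvi"},
    "motherboard":  {"graphi", "agp", "raid"},
}
levels = concept_levels(demo)
```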

3. Website model and construction

A website model contains a website profile and a set of webpage profiles. A website profile contains statistics about a website. A webpage profile contains basic information, statistics information, and ontology information about a webpage. To produce the above information for website modeling, we use DocExtractor to extract primitive webpage information and compute statistics. It also transforms the original webpage into a tag-free document and passes it to OntoAnnotator to annotate ontology information. This section gives a detailed description of what a website model looks like and how it is constructed.

3.1. Website model

Fig. 8(a) illustrates the format of a website model. The webpage profile contains three sections, namely, basic information, statistics information, and ontology information. The first two sections profile a webpage and the last annotates domain semantics to the webpage. Each field in the basic information section is explained below. DocNo is automatically generated by the system for identifying a webpage in the structure index. Location remembers the path of the stored version of the webpage in the website model; we can use it to answer user queries. URL is the path of the webpage on the Internet, the same as the returned URL index in the user query result; it helps


hyperlink analysis. WebType identifies one of the following six Web types: com (1), net (2), edu (3), gov (4), org (5), and other (0), each encoded as the integer in parentheses. WebNo identifies the website that contains this webpage; it is set to zero if we cannot decide which website the webpage comes from. Update_Time/Date remembers when the webpage was last modified. The statistics information section stores statistics about HTML tag properties, e.g., #Frame for the number of frames, #Tag for the number of different tags, and various texts enclosed in tags. Specifically, we remember the texts associated with titles, anchors, and headings for webpage analysis; we also record Outbound_URLs for user-oriented webpage expansion. Finally, the ontology information section remembers how the webpage is interpreted by the domain ontology. It shows that a webpage can be classified into several classes with different scores of belief according to the ontology. It also remembers the ontology features of each class that appear in the webpage, along with their term frequencies (i.e., numbers of appearances in the webpage). Domain_Mark remembers whether the webpage belongs to a specific domain; it is set to ‘‘true’’ if the webpage belongs to the domain, and ‘‘false’’ otherwise. This section annotates how a webpage is related to the domain and can serve as its semantics, which greatly helps correct retrieval of webpages. Let us turn to the website profile. WebNo identifies a website, the same number as used in the webpage profile. Through this number, we can access the webpage profiles describing the webpages that belong to this website. Website_Title remembers the text between the <TITLE> tags of the homepage of the website. Start_URL stores the starting address of the website. It may be a domain name or a directory URL under the domain address. WebType identifies one of the six Web types, as in the webpage profile.
Tree_Level_Limit remembers how deeply the website is structured, which keeps the search agent from exploring too deeply; e.g., 5 means it explores at most 5 levels of the website structure. Update_Time/Date remembers when the website was last modified. Fig. 8(b) illustrates an example website model. This model structure helps interpret the semantics of a website through the gathered information; it also helps fast retrieval of webpage information and autonomous Web resource search. The last point will become clearer later. Fig. 8(c) illustrates how website profiles and webpage profiles are structured.

3.2. Website modeling

Website modeling involves three modules. We use DocExtractor to extract basic webpage information and compute statistics. We then use OntoAnnotator to annotate ontology information. Since the ontology information contains webpage classes, OntoAnnotator needs to call OntoClassifier to perform webpage classification. Fig. 9 illustrates the architecture of DocExtractor. DocExtractor receives a webpage from DocPool and produces informa-


Website Profile:
  WebNo::Integer
  Website_Title::String
  Start_URL::String
  WebType::Integer
  Tree_Level_Limit::Integer
  Update_Time/Date::Date/Time
  .....
Webpage Profile:
  Basic Information:
    DocNo::Integer
    Location::String
    URL::String
    WebType::Integer
    WebNo::Integer
    Update_Time/Date::Date/Time
    .....
  Statistics Information:
    #Tag
    #Frame
    .....
    Title Text
    Anchor Text
    Heading Text
    Outbound_URLs
    .....
  Ontology Information:
    Domain_Mark::Boolean
    class1: belief1; term11(frequency); ...
    class2: belief2; term21(frequency); ...
    .....

(a) Format of a website model

Website Profile:
  WebNo::920916001
  Website_Title::Advanced Micro Devices, AMD - Homepage
  Start_URL::http://www.amd.com/us-en/
  WebType::1
  Tree_Level_Limit::5
  Update_Time/Date::04:37:59/AUG-26-2003
  .....
Webpage Profile:
  Basic Information:
    DocNo::9209160011
    Location::H:\DocPool\920916001\1
    URL::http://www.amd.com/us-en/
    WebType::1
    WebNo::920916001
    Update_Time/Date::10:30:00/JAN-17-2003
    .....
  Statistics Information:
    #Tag::251
    #Frame::3
    .....
    Title::Advanced Micro Devices, AMD - Homepage
    Anchor::Home
    Heading::Processors
    Outbound_URLs::http://www.amd.com/home/prodinfo01; http://www.amd.com/home/compsol01; ...
    .....
  Ontology Information:
    Domain_Mark::True
    CPU: 0.8; L1(2); Ondie(2); AMD(5); ...
    Motherboard: 0.5; AGP(1); PCI(1); ...
    .....

(b) An example website model

[Fig. 8(c) (diagram): each website profile links to its webpage profiles, e.g., WebNo#1 to DocNo#11, DocNo#12, ..., DocNo#189, DocNo#190, and WebNo#2 to DocNo#21, DocNo#22, ..., DocNo#290, DocNo#291.]

(c) Conceptual structure of a website model

Fig. 8. Website model format, example and structure.
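The profile layout of Fig. 8 might be rendered as data structures roughly as follows; the field names follow the figure, while the Python types are our assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class WebpageProfile:
    doc_no: int                       # DocNo, system-generated identifier
    location: str                     # path of the stored copy of the page
    url: str                          # original URL on the Internet
    web_type: int                     # com=1, net=2, edu=3, gov=4, org=5, other=0
    web_no: int                       # owning website; 0 if unknown
    title: str = ""
    anchors: List[str] = field(default_factory=list)
    headings: List[str] = field(default_factory=list)
    outbound_urls: List[str] = field(default_factory=list)
    domain_mark: bool = False
    # class name -> (belief score, {ontology feature: term frequency})
    ontology: Dict[str, Tuple[float, Dict[str, int]]] = field(default_factory=dict)

@dataclass
class WebsiteProfile:
    web_no: int
    website_title: str
    start_url: str
    web_type: int
    tree_level_limit: int             # maximum exploration depth
    pages: List[WebpageProfile] = field(default_factory=list)

# Fragment of the example in Fig. 8(b):
page = WebpageProfile(doc_no=9209160011, location=r"H:\DocPool\920916001\1",
                      url="http://www.amd.com/us-en/", web_type=1,
                      web_no=920916001, domain_mark=True,
                      ontology={"CPU": (0.8, {"L1": 2, "Ondie": 2, "AMD": 5})})
site = WebsiteProfile(web_no=920916001,
                      website_title="Advanced Micro Devices, AMD - Homepage",
                      start_url="http://www.amd.com/us-en/", web_type=1,
                      tree_level_limit=5, pages=[page])
```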

tion for both the basic information and statistics information sections of a webpage profile. It also transforms the webpage into a list of words (pure text) for further processing by OntoAnnotator. Specifically, DocPool contains webpages retrieved from the Web. HTML Analyzer analyzes the HTML structure to extract the URL, title texts, anchor texts and heading texts, and to calculate tag-related statistics for the website models. HTML TAG Filter removes HTML tags from the webpage, deletes stop words using a list of 500 stop words we developed from McCallum (1996), and performs word stemming and standardization. Document Parser transforms the stemmed, tag-free webpage into a list of words for further processing by OntoAnnotator. Fig. 10 illustrates the architecture of OntoAnnotator. Inside the architecture, OntoClassifier uses the ontology to classify a webpage, and Annotator uses the ontology to annotate the ontology features of each class with their term frequencies according to how often they appear in the webpage. Domain Marker uses Table 2 to determine whether the webpage is relevant to the domain. The Condition column of the table gives the number of classes appearing in the webpage, and the Limit column specifies a minimum threshold on the average number of features of


the class which must appear in the webpage. For example, row 2 means that if a webpage contains only one class of the domain, at least three features of that class must appear in the webpage for it to be considered to belong to the domain.

Table 2
Domain-relevance threshold for webpages

Condition (class count)    Limit
0                          None
1                          Average ≥ 3
2–6                        Average ≥ 2
7–10                       Average ≥ 1

[Fig. 9 (diagram): a webpage from DocPool passes through HTML Analyzer, HTML TAG Filter, and Document Parser inside DocExtractor; basic and statistics information go to the website models, and the stemmed, tag-free webpage in plain text goes to OntoAnnotator.]

Fig. 9. Architecture of DocExtractor.

In addition to classifying webpages, OntoClassifier annotates each ontology class by generating a classification score for it. For instance, the annotation (CPU: 0.8, Motherboard: 0.5) means the webpage belongs to class CPU with score 0.8, while it belongs to class Motherboard with score 0.5. OntoClassifier is a two-step classifier based on the re-organized ontology structure shown in Figs. 6 and 7, where, for ease of explanation, we deliberately skipped the term-frequency information. Fig. 11 brings it back, showing that each ontology feature is associated with a term frequency, which counts the appearances of the feature in the classified webpages. OntoClassifier classifies a webpage in two stages. The first stage uses representative ontology features. We employ the level threshold THW to limit the number of ontology features involved in this stage. For example, Fig. 11 shows THW = 1, which means only the set of representative features at level 1 will be used. The basic idea of first-stage classification is defined by Eq. (1). In the equation, OntoMatch(d, C) is defined by Eq. (2), which calculates the number of ontology features of class C that appear in webpage d, where M(w, C) returns 1 if word w of d is contained in class C. Thus, Eq. (1) returns the class C that has the largest number of ontology features appearing in d. Note that not all classes have the same number of ontology features; we have added #w_{C'}, the number of words in each class C', for normalization. Also note that Eq. (1) only compares those classes with more than three ontology features appearing in d, i.e., it filters out less likely classes. As to why classes with fewer than three features appearing in d are filtered, we follow Joachims' observation that the classification process only has to consider terms with appearance frequency larger than three (Joachims, 1997).

\mathrm{HOntoMatch}(d) = \arg\max_{\substack{C' \in c \\ \mathrm{OntoMatch}(d,\,C') > 3}} \frac{\mathrm{OntoMatch}(d,\,C')}{\#w_{C'}} \quad (1)

\mathrm{OntoMatch}(d,\,C) = \sum_{w \in d} M(w,\,C) \quad (2)

[Fig. 11 (diagram): level-1 (representative, THW = 1) feature sets with term frequencies for CPU (3dnow (35), Sse2 (52), L1 (43), Pipeline (36), Ondie (28), ...) and Motherboard (BIOS (72), ATX (47), Onboard (41), Amr (23), Cnr (15), ...), plus level-2 related concept nodes (Socket (126), 478pin (78), 423pin (36), ...; Socket (154), 478pin (122), 423pin (98), ...).]

Fig. 11. Re-drawn ontology structure.
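A minimal sketch of the first classification stage of Eqs. (1) and (2), assuming a webpage is given as a list of stemmed words and each class as its set of representative features; the names and data below are illustrative.

```python
def onto_match(words, features):
    """Eq. (2): count the words of webpage d that are features of class C."""
    return sum(1 for w in words if w in features)

def h_onto_match(words, classes):
    """Eq. (1): among classes with more than three matching features,
    return the one with the largest normalized match; None signals a
    fall-through to the second classification stage."""
    scored = {c: onto_match(words, feats) / len(feats)
              for c, feats in classes.items()
              if onto_match(words, feats) > 3}
    return max(scored, key=scored.get) if scored else None

# Illustrative feature sets and webpage:
classes = {
    "CPU": {"socket", "slot", "celeron", "athlon", "duron", "fsb"},
    "Motherboard": {"atx", "bios", "raid", "onboard"},
}
page = ["socket", "slot", "celeron", "athlon", "duron", "intel"]
winner = h_onto_match(page, classes)
```

With five of the six CPU features present, the sketch assigns the page to CPU; a page matching three or fewer features of every class falls through to the second stage.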

If for any reason the first stage cannot return a class for a webpage, we move to the second stage of classification. The second stage no longer uses level thresholds but gives each ontology feature a weight according to the level it is associated with. That is, we modify traditional classifiers by including a level-related weighting mechanism over the ontology classes to form our ontology-based classifier.

[Fig. 10 (diagram): processed webpages from DocExtractor enter OntoAnnotator, where OntoClassifier and Annotator use the ontology to produce the ontology information and Domain Marker sets Domain_Mark; the results are stored in the website model.]

Fig. 10. Architecture of OntoAnnotator.
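The Domain Marker decision of Table 2 can be sketched as a small rule, assuming the table's four rows cover zero to ten matched classes; the function name and input format are ours.

```python
def domain_mark(class_feature_counts):
    """class_feature_counts maps each ontology class found on the page to
    the number of its features appearing there; returns the Domain_Mark."""
    n = len(class_feature_counts)
    if n == 0:
        return False                       # no class: never in the domain
    avg = sum(class_feature_counts.values()) / n
    if n == 1:
        return avg >= 3                    # row 2 of Table 2
    if n <= 6:
        return avg >= 2                    # 2-6 classes
    return avg >= 1                        # 7-10 classes
```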


This level-related weighting mechanism gives a higher weight to the representative features than to the related features. The second stage of classification is defined by Eq. (3). Inside the equation, OntoTFIDF(d, C), defined by Eq. (4), calculates a TFIDF score for each class C with respect to d according to the terms appearing in both d and C, where TF(x|y) denotes the number of appearances of word x in y. Eq. (3) creates a list of class:score pairs for d and finally selects the class with the highest TFIDF score as the class of webpage d.

\mathrm{HOntoTFIDF}(d) = \arg\max_{C' \in c} \mathrm{OntoTFIDF}(d,\,C') \quad (3)

\mathrm{OntoTFIDF}(d,\,C) = \sum_{w \in d} \frac{1}{L_w} \cdot \frac{TF(w|C)}{\sum_{w' \in F_C} TF(w'|C)} \cdot \frac{TF(w|d)}{\sum_{w' \in F_C} TF(w'|d)} \quad (4)

4. Website models application

The basic goal of the website models is to help Web search take into account both user interest and domain dependence. Section 4.1 explains how this is achieved. The second goal is to help fast retrieval of the webpages stored in the website models for the user. Section 4.2 explains how this is done.

4.1. Focused web crawling supported by website models

In order to effectively use the website models to narrow down the search scope, we propose a new focused crawler as shown in Fig. 12, which features a progressive crawling strategy for obtaining domain-relevant Web information. Inside the architecture, Web Crawler gathers data from the Web. DocPool was mentioned before; it stores all returned webpages from Web Crawler for DocExtractor during the construction of webpage profiles. It also stores query results from search engines, which usually contain a list of URLs. URLExtractor is responsible for extracting URLs from the query results and dispatching those URLs that are domain-dependent but not yet in the website models to Distiller. User-Oriented Webpage Expander pinpoints interesting URLs in the website models for further webpage expansion according to the user query. Autonomous Website Evolver autonomously discovers URLs in the website models that are domain-dependent for further webpage expansion. Since these two types of URLs are both derived from website models, we call them website model URLs in the figure. User Priority Queue stores the user search strings and the website model URLs from User-Oriented Webpage Expander. Website Priority Queue stores the website model URLs from Autonomous Website Evolver and the URLs extracted by URLExtractor. Distiller controls the Web search by associating a priority score with each URL (or search string) using Eq. (5) and placing it in the proper priority queue. Eq. (5) defines ULScore(U, F) as the priority score of each URL (or search string).

\mathrm{ULScore}(U,\,F) = W_F \times S_F(U) \quad (5)

where U represents a URL or search string, and F identifies the way U was obtained, as shown in Table 3, which also assigns to each F a weight W_F and a score S_F(U). Thus, if F = 1, i.e., U is a search string, then W_1 = 3 and S_1(U) = 100, which implies all search strings are treated as top-priority requests. For F = 3, if U is new to the website models, S_3(U) is set to 1 by URLExtractor; otherwise it is set to 0.5. Finally, for F = 2, the URLs may come from User-Oriented Webpage Expander or Autonomous Website Evolver. In the former case, we follow the algorithm in Fig. 13 (explained in Section 4.1.1) to calculate S_2(U) for each U. The assignment of

[Fig. 12 (diagram): Web Crawler fetches webpages and query results from the Web and from search engines into DocPool; DocExtractor and OntoAnnotator build the website models; URLExtractor, User-Oriented Webpage Expander, and Autonomous Website Evolver feed URLs and search strings to Distiller, which schedules them through the User Priority Queue and the Website Priority Queue.]

Fig. 12. Architecture of focused crawler.


Table 3
Basic weighting for URLs to be explored

Input type                        F    W_F    S_F(U)
Search strings                    1    3      S_1(U) = 100
Website model URLs                2    2      S_2(U)
URLs extracted by URLExtractor    3    1      S_3(U)

S2(U) in the latter case is more complicated; we will describe it in Section 4.1.2. Note that Distiller schedules all the URLs (and search strings) in the User Priority Queue according to their priority scores before it starts to schedule the URLs in the Website Priority Queue. In addition, whenever new URLs come into the User Priority Queue, Distiller stops scheduling the URLs in the Website Priority Queue and turns to the new URLs in the User Priority Queue. This design prefers user-oriented Web resource crawling over website maintenance, since user-oriented query or webpage expansion takes into account both user interest and domain constraints, which better meets our design goal than website maintenance does.

4.1.1. User-oriented web search supported by website models

Conventional information retrieval research has mainly been based on (computer-readable) text (Salton & Buckley, 1988, 1983), locating desired text documents using a query consisting of a number of keywords, very similar to keyword-based search engines. Retrieved documents are ranked by relevance and presented to the user for further exploration. The main issue of this query model lies in the difficulty of query formulation and the inherent word

Let D be a Domain, Q be a Query, and U be a URL.
Let URL_List be a List of URLs with Scores.
UserTL(Q) = User terms in Q.
AnTerm(U) = Terms in the Anchor Text of U.
S2(U) = Score of U (cf. Table 3).

Webpage_Expand_Strategy(Q, D) {
  For each Website S in the Website Model
    For each outbound link U in S {
      Score = AnE_Score(U, Q)
      If Score is not zero {
        S2(U) = RS,D x Score
        Expanding_URL(U, Score)
      }
    }
  Return URL_List
}


ambiguity in natural language. To overcome this problem, we propose a direct query expansion mechanism, which helps users implicitly formulate their queries. This mechanism uses the domain ontology to expand the user query. One straightforward expansion is to add synonyms of the terms contained in the user query into the same query. Synonyms can be easily retrieved from the ontology. Table 4 illustrates a simple example. More complicated expansion adds ontology classes according to their relationships with the query terms. The most-used relationships follow the inheritance structure. For instance, if more than half of the subconcepts of a concept appear in a user query, we add the concept to the query too. We also propose an implicit webpage expansion mechanism oriented to the user interest to better capture the user's intention. This user-oriented webpage expansion mechanism adds webpages related to the user interest into the website models for further retrieval. Here we exploit the outbound hyperlinks of the webpages stored in the website models. To be more precise, we use the Anchor Texts specified by the webpage designers for the hyperlinks, which contain the terms that a designer believes are most suitable to describe the hyperlinked webpages. We can compare these anchor texts against a given user query to determine whether the hyperlinked webpages contain the terms in the query. If so, the hyperlinked webpages may be of interest to the user and should be collected in the website models for further query processing. Fig. 13 formalizes this idea into a strategy to select those hyperlinks, or URLs, that the users are strongly interested in. The algorithm returns a URL list which contains hyperlinks along with their scores. The URL list is sent to the Focused Crawler (to be discussed later), which uses the scores to rank the URLs and accordingly fetches the hyperlinked webpages.
Note that the algorithm uses RS,D to modulate the scores, decreasing those of hyperlinks that are less related to the target domain. RS,D specifies the degree of domain correlation of website S with respect to domain D, as defined by Eq. (6). In the equation, NS,D refers to the number of webpages on website S belonging to domain D, and NS stands for the number of webpages on website S. Here we need the parameter Domain_Mark in the webpage profile to determine NS,D. In short, RS,D measures how strongly a website is related to a domain. Thus, the user-oriented webpage expansion mechanism is also domain-oriented in nature.
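The direct query expansion just described can be sketched as follows; the mini-ontology (the synonym and subconcept tables) is invented for illustration, since the real entries come from the domain ontology:

```python
# Hypothetical mini-ontology: synonym lists per term and subconcept
# lists per concept (in the system these come from the domain ontology).
SYNONYMS = {
    "cpu": ["central process unit", "processor"],
    "socket": ["slot", "connector"],
    "mainboard": ["motherboard"],
}
SUBCONCEPTS = {
    "interface card": ["graphic card", "sound card", "network card"],
}

def expand_query(terms):
    expanded = list(terms)
    # 1) add synonyms of each query term
    for t in terms:
        expanded += SYNONYMS.get(t.lower(), [])
    # 2) add a concept when more than half of its subconcepts
    #    already appear in the query
    present = {t.lower() for t in terms}
    for concept, subs in SUBCONCEPTS.items():
        if sum(s in present for s in subs) > len(subs) / 2:
            expanded.append(concept)
    return expanded

expanded = expand_query(["Mainboard", "CPU", "Socket", "KV133", "ABIT"])
```

The sample call reproduces the Table 4 example: terms without ontology entries (KV133, ABIT) pass through unchanged.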

AnE_Score(U, Q) {
  AnE_Score = 0
  For each Term T in AnTerm(U)
    If UserTL(Q) contains T
      AnE_Score = AnE_Score + 1
  Return AnE_Score
}

Expanding_URL(U, Score) {
  Add U and Score to URL_List
}

Fig. 13. User-oriented webpage expansion strategy supported by the website models.

Table 4
Example of direct query expansion

Original user query    Expanded user query
Mainboard              Mainboard, Motherboard
CPU                    CPU, Central process unit, processor
Socket                 Socket, Slot, connector
KV133                  KV133
ABIT                   ABIT
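The strategy of Fig. 13 can be rendered as runnable code under assumed website-model structures: each site carries its domain correlation R_{S,D} and a list of (URL, anchor-text terms) outbound links. For simplicity this sketch returns the modulated score S2(U) directly:

```python
def ane_score(anchor_terms, user_terms):
    """AnE_Score(U, Q): count the anchor-text terms shared with the query."""
    return sum(t in user_terms for t in anchor_terms)

def webpage_expand(user_terms, sites):
    """Fig. 13 strategy: score each outbound link by its anchor-text
    overlap with the query, modulated by the site's R_{S,D}."""
    url_list = []
    for site in sites:
        for url, anchor_terms in site["links"]:
            score = ane_score(anchor_terms, user_terms)
            if score:  # keep only links whose anchor text matches the query
                url_list.append((url, site["r_sd"] * score))
    return url_list

sites = [{
    "r_sd": 0.8,
    "links": [("http://a.test/cpu", ["cpu", "benchmark"]),
              ("http://a.test/press", ["news"])],
}]
scored = webpage_expand({"cpu", "socket"}, sites)
```

Only the link whose anchor text mentions a query term survives, with its score scaled by the site's domain correlation.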



R_{S,D} = N_{S,D} / N_S    (6)

4.1.2. Domain-oriented web search supported by website models

Now we can discuss how Autonomous Website Evolver does domain-dependent website expansion. Autonomous Website Evolver employs a four-phase progressive strategy to autonomously expand the website models. The first phase uses Eq. (7) to calculate S2(U) for each hyperlink U which is referred to by website S and recognized to be in S from its URL address, but whose hyperlinked webpage is not yet collected in S.

S_2(U) = R_{S,D} \sum_{C \in D} (1 - P_{S,D}(C)),  U \in S and P_{S,D}(C) \neq 0    (7)

where C is a concept of domain D, R_{S,D} was defined in Eq. (6), and P_{S,D} is defined by Eq. (8). P_{S,D}(C) measures the proportion of concept correlation of website S with respect to concept C of domain D. N_{S,C} refers to the number of webpages talking about domain concept C on website S. Fig. 14 shows the algorithm for calculating N_{S,C}. In short, P_{S,D}(C) measures how strongly a website is related to a specific domain concept.

P_{S,D}(C) = N_{S,C} / N_{S,D}    (8)
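Eqs. (6)-(8) and the N_{S,C} count of Fig. 14 reduce to a few lines over an assumed webpage-profile structure (the field names are ours):

```python
def n_s_c(pages, concept):
    """N_{S,C} (Fig. 14): in-domain pages whose top-THW concepts contain C."""
    return sum(concept in p["top_concepts"] for p in pages if p["in_domain"])

def r_s_d(pages):
    """R_{S,D} = N_{S,D} / N_S (Eq. 6)."""
    return sum(p["in_domain"] for p in pages) / len(pages) if pages else 0.0

def phase1_score(pages, domain_concepts):
    """S_2(U) of Eq. (7) for an uncollected in-site hyperlink U of site S."""
    n_s_d = sum(p["in_domain"] for p in pages)
    total = 0.0
    for c in domain_concepts:
        p_sd = n_s_c(pages, c) / n_s_d if n_s_d else 0.0   # Eq. (8)
        if p_sd != 0:            # Eq. (7) skips concepts with P_{S,D}(C) = 0
            total += 1.0 - p_sd
    return r_s_d(pages) * total

pages = [
    {"top_concepts": {"CPU"}, "in_domain": True},
    {"top_concepts": {"CPU", "Socket"}, "in_domain": True},
    {"top_concepts": set(), "in_domain": False},
]
score = phase1_score(pages, ["CPU", "Socket", "Chipset"])
```

In the example the site is well correlated with the domain (R = 2/3) but only partially covers the concept "Socket", so the uncollected link earns a positive score.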

Literally, Eq. (7) assigns a higher score to U if U belongs to a website S which has a higher degree of domain correlation R_{S,D} but covers fewer domain concepts, i.e., smaller P_{S,D}(C). Fig. 15 illustrates how this strategy works. It shows that webpages I and J will become the first choices for calculating their priority scores in the first phase. In the figure, the upper nodes represent the webpages stored in the website models and the lower nodes represent the webpages whose URLs are hyperlinked in the upper nodes. Nodes in dotted circles, e.g., nodes I and J, represent webpages not yet stored in the website models that need to be collected. The figure shows that webpage J, referred to by webpage A in website 1, belongs to website 1 but is not yet collected into its website model. Similarly, webpage I, referred to by webpage E in website 2, has to be collected into its website model. Note that webpage I is also referred to by webpage C in

Let C be a Concept, S be a Website, and D be a Domain.
Let THW be a threshold number for the website models.
OntoPages(S) = The webpages on website S belonging to D.
OntoCon(THW, P) = The top THW domain concepts of webpage P.

Calculating_NS,C(S, C) {
  NS,C = 0
  For each webpage P in OntoPages(S) {
    If OntoCon(THW, P) contains C
      NS,C = NS,C + 1
  }
  Return NS,C
}

Fig. 15. Basic operation of the first phase.

website 1. In summary, the first phase prefers to expand the websites that are well profiled in the website models but have less coverage of domain concepts. The first phase is good at collecting more webpages for well-profiled websites; it cannot help with unknown websites, however. Our second phase goes a step further by searching for webpages that can help define a new website profile. In this phase, we exploit URLs that are in the website models but belong to some unknown website profile. We use Eq. (9) to calculate S2(U) for each outbound hyperlink U of the webpages stored in an indefinite website profile.

S_2(U) = Anchor(U, D)    (9)

where function Anchor(U, D) gives outbound link U a weight according to how many terms in the anchor text of U belong to domain D. Fig. 16 illustrates how this strategy works. Phase 2 will choose hyperlink H for priority score calculation. The unknown website X represents a website profile which contains insufficient webpages to make clear its relationships to some domains. Thus, the second phase prefers to expand those webpages that can help bring in more information to complete the specification of indefinite website profiles. In the third phase, we relax one more constraint: the condition of unknown website profiles. We exploit any URLs as long as they are referred to by some webpages in the website models. We use Eq. (10) to calculate S2(U) for each outbound hyperlink U which is referred to by any webpage in the website models. This equation heavily relies on the anchor texts to determine which URLs should receive higher priority scores.
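The anchor-text weights of phases 2 and 3 (Eqs. 9 and 10) can be sketched as follows; the domain-term set and the per-concept ratios are illustrative assumptions:

```python
DOMAIN_TERMS = {"cpu", "motherboard", "socket", "chipset"}   # illustrative

def anchor(anchor_text, terms):
    """Anchor(U, .): how many words of U's anchor text fall in a term set."""
    return sum(w in terms for w in anchor_text.lower().split())

def phase2_score(anchor_text):
    """Eq. (9): weight links of indefinite website profiles by the number
    of domain terms in their anchor texts."""
    return anchor(anchor_text, DOMAIN_TERMS)

def phase3_score(anchor_text, r_sd, p_sd):
    """Eq. (10): sum over concepts C of Anchor(U, C) * R_{S,D} * P_{S,D}(C);
    p_sd maps each concept to its ratio from Eq. (8)."""
    return sum(anchor(anchor_text, {c}) * r_sd * p for c, p in p_sd.items())

s2 = phase2_score("Intel cpu socket guide")
s3 = phase3_score("cpu and chipset news", 0.5, {"cpu": 0.4, "chipset": 0.2})
```

Phase 2 needs only the anchor text; phase 3 additionally rewards links from sites already correlated with the concepts they mention.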

Fig. 14. Algorithm for calculating N_{S,C}.

Fig. 16. Basic operation of the second phase.



S_2(U) = \sum_{C \in D} Anchor(U, C) \cdot R_{S,D} \cdot P_{S,D}(C)    (10)

Fig. 17 illustrates how the strategy works. It will select all nodes in dotted circles for priority score calculation. In short, the third phase tends to collect every webpage that is referred to by the webpages in the website models. In the last phase, we resort to general website information to refresh and expand website profiles. This phase is periodically invoked according to the Update_Time/Date stored in the website profiles and webpage profiles. Specifically, we refer to the analysis of refresh cycles of different types of websites conducted in Cho and Garcia-Molina (2000) and define a weight for each web type as shown in Table 5. This phase then uses Eq. (11) to assign an S2(U) to each U which belongs to a specific website type T.

S_2(U) = R_{S,D} \cdot W_T,  U \in T    (11)
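The four-phase progression described above can be sketched as a small control loop. The interpretation that a later phase is tried only when earlier phases yield nothing, and that phase 4 runs on a timer (reduced to a flag here), is our reading of the strategy:

```python
def progressive_expand(phases, refresh_due=False):
    """Try phases 1-3 in order; stop relaxing constraints as soon as one
    phase yields URLs for the Priority Queue. Phase 4 is a periodic
    refresh pass, modeled here by a boolean flag."""
    scheduled = []
    for phase in phases:                 # phase 1 -> phase 2 -> phase 3
        urls = phase()
        if urls:
            scheduled.extend(urls)
            break
    if refresh_due:                      # phase 4: periodic refresh
        scheduled.append(("refresh-pass", 0.0))
    return scheduled

result = progressive_expand(
    [lambda: [],                               # phase 1 finds nothing
     lambda: [("http://x.test/b", 2.0)],       # phase 2 fills the queue
     lambda: [("http://x.test/c", 1.0)]],      # phase 3 never reached
    refresh_due=True,
)
```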

Fig. 18 summarizes how this four-phase progressive website expansion strategy works.

Fig. 17. Basic operation of the third phase.

Fig. 18. Progressive website expansion strategy. (From the initial state, phase 1 is invoked and Distiller adds URLs to the Priority Queue; whenever the Priority Queue empties, phase 2 and then phase 3 are invoked in turn; phase 4 is periodically invoked by the system.)

Table 5
Weight table for different types of websites

Web type, T    W_T
com            0.62
net/org        0.32
edu            0.08
gov            0.07

4.2. Webpage retrieval from website models

Webpage retrieval concerns how to provide the most-needed documents for users. Traditional ranking methods employ an inverted full-text index database along with a ranking algorithm to calculate the ranking sequence of relevant documents. The problems with this method are clear: too many entries in the returned results and too slow a response time. A simplified approach emerged, which employs various ad hoc mechanisms to reduce the query space (Salton & McGill, 1983, 1988). Two major problems lie behind these mechanisms: (1) they need a specific, labor-intensive and time-consuming pre-process; and (2) they cannot respond to changes in the real environment in time due to the off-line pre-process. Another method, PageRank (Page, Brin, Motwani, & Winograd, 1999), was employed in Google to rank webpages by their link information. Google spends lots of offline time pre-analyzing the link relationships among a huge number of webpages and calculating proper ranking scores for them before storing them in a special database for answering user queries. Google's high speed of response stems from a huge local webpage database along with a time-consuming, offline, detailed link structure analysis. Instead, our ranking method takes advantage of the semantics in the website models. The major index structure uses ontology features to index webpages in the website models. The ontology index contains terms that are stored in the webpage profiles. The second index structure is a partial full-text inverted index, since it contains no ontology features. Fig. 19 shows this two-layered index structure. Since we require each query to contain at least one ontology feature, we can always use the ontology index to locate a set of webpages. The partial full-text index is then used to further reduce them into a subset of webpages for the users.

Fig. 19. Index structures in website models. (A two-layered structure: the Ontology Index, an inverted index of ontology feature terms, maps to the document numbers of webpages in the website models; the Partial Full-Text Index, an inverted index of the remaining terms, supports further filtering of those documents.)

This design of separating ontology indices from a traditional full-text index is interesting, since we then know what ontology features are contained in a user query. Based on this information, we can apply OntoClassifier to analyze what ontology classes (equivalently, domain concepts) the user is really interested in, and use the information to quickly locate user-interested webpages. Let us explain how this is done in our system. First, we use the second stage of OntoClassifier along with a threshold, say THU, to limit the best classes a query is associated with. For example, if we set THU to three, we select the best three ontology classes from a query and use them as indices to quickly locate user-interested webpages. As a matter of fact, we can leverage the identified ontology features in a user query to properly rank the webpages for the user using the ranking method defined by Eq. (12). In the first term of the equation, M_QU(P) is the number of user terms appearing in webpage P, which can be obtained from Fig. 19, and W_QU is its weighting value. P_{S,D}(T) is defined by Eq. (13), which measures, for each term T in the user-term part of query Q (i.e., QU(Q)), the ratio of the number of webpages that contain T (i.e., N_{S,T}) to the total number of webpages related to D (i.e., N_{S,D}) on website S. Multiplying these factors together represents how strongly the to-be-retrieved webpages are oriented to the user terms. The second term of Eq. (12) does a similar analysis on the ontology features appearing in the user query. Basically, W_QO is a weighting value for the ontology-term part of query Q, and M_QO(P) is the number of ontology features appearing in webpage P, which can also be obtained from Fig. 19. As to the factor P_{S,D}(T), we have a slightly different treatment here. It is used to calculate the ratio of the number of webpages containing ontology feature T, but we restrict T to appear only in the top THU concepts, as we have set a threshold number of domain concepts for each user query.
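A minimal sketch combining the two-layered lookup of Fig. 19 with the ranking of Eqs. (12)-(15): the index contents, counts, and ratios are invented, and combining multiple ontology terms by union is our assumption, not the paper's specification.

```python
# Hypothetical index contents for illustration.
ONTO_INDEX = {"cpu": {1, 2, 4}, "motherboard": {2, 3}}
FULLTEXT_INDEX = {"overclock": {2, 5}, "benchmark": {1, 2}}

def locate(onto_terms, user_terms):
    """Two-layered lookup of Fig. 19: the ontology index produces the
    candidate set; the partial full-text index then narrows it."""
    docs = set()
    for t in onto_terms:
        docs |= ONTO_INDEX.get(t, set())      # union assumed
    for t in user_terms:
        hits = FULLTEXT_INDEX.get(t)
        if hits is not None:
            docs &= hits
    return docs

def p_th(query_top, page_top):
    """P_TH(Q,P) of Eq. (14): shared top concepts over query top concepts."""
    return len(set(query_top) & set(page_top)) / len(query_top)

def ra_score(w_qu, m_qu, p_sd_user, m_qo, pth, p_sd_onto):
    """RA(Q,P) of Eq. (12); Eq. (15) fixes W_QO = 1 - W_QU."""
    w_qo = 1.0 - w_qu
    return (w_qu * m_qu * sum(p_sd_user)
            + w_qo * m_qo * pth * sum(p_sd_onto))

docs = locate(["cpu"], ["overclock"])
score = ra_score(w_qu=0.4, m_qu=3, p_sd_user=[0.5, 0.25], m_qo=2,
                 pth=p_th(["cpu", "socket"], ["cpu", "chipset"]),
                 p_sd_onto=[0.5])
```

The `p_sd_user` and `p_sd_onto` lists stand for the precomputed P_{S,D}(T) ratios of Eq. (13) for the user-term and ontology-feature parts of the query, respectively.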
We thus need to add a factor P_TH(Q,P) to reflect the fact that we also apply a threshold number of domain concepts, THW, to each webpage (see Section 3.2). P_TH(Q,P) is defined by Eq. (14), measuring the ratio of the number of domain concepts that appear both in the top THU concepts of query Q and in the top THW concepts of webpage P (i.e., M_TH(Q,P)) to the number of domain concepts that appear in the top THU concepts of Q (i.e., M_TH(Q)). This second term thus represents how strongly the to-be-retrieved webpages are related to user-interested domain concepts.

RA(Q,P) = W_QU \cdot M_QU(P) \cdot \sum_{T \in QU(Q)} P_{S,D}(T) + W_QO \cdot M_QO(P) \cdot P_TH(Q,P) \cdot \sum_{T \in Onto(THU,Q)} P_{S,D}(T)    (12)

P_{S,D}(T) = N_{S,T} / N_{S,D}    (13)

P_TH(Q,P) = M_TH(Q,P) / M_TH(Q)    (14)

W_QU + W_QO = 1    (15)

Fig. 20. Adjustment of user terms vs. ontology features in webpage retrieval.

Note that the two weighting factors are correlated as defined by Eq. (15). The user is allowed to change the ratio between them, as illustrated in Fig. 20, to reflect his emphasis on either user terms or ontology features in retrieving webpages.

5. System evaluation

5.1. Architecture of the search agent

Fig. 21 shows the architecture of our Search Agent, which integrates the design concepts and techniques discussed in the previous sections. To recap, Focused Crawler is responsible for gathering webpages into DocPool according to user interests and website model weakness. Model Constructor extracts important information from a webpage stored in DocPool and annotates proper ontology information to make a webpage profile for it. It also constructs a website profile for each website in due time according to what webpages it contains. Webpage Retrieval uses the ontology features in a given user query to quickly locate and rank a set of most-needed webpages in the website models and displays the result to the user. Finally, the User Interface receives a user query, expands the query using the ontology, and sends it to Webpage Retrieval, which in turn returns a list of ranked webpages.

Fig. 21. Ontology-centric search agent architecture. (Focused Crawler fetches webpages from the WWW and search engines into DocPool; Model Constructor turns them into model information in the Website Models, supported by the Ontology; the User Interface passes search strings to Webpage Retrieval, which returns answers from the Website Models.)

Fig. 22 illustrates the various ways in which the user can enter a query. Fig. 22(a) shows the traditional keyword-based model along with the help of the ontology features shown in the left column. The user can directly choose ontology terms into




the input field. Fig. 22(b) shows that the user can also use natural language to input his query. The user interface employs a template-matching technique to select best-matched query templates (Chiu, 2003; Yang, 2007), as shown in Fig. 22(c), for the user to confirm. Note that either type of query is finally transformed into a list of keywords internally by the User Interface for subsequent query expansion. Fig. 23 illustrates a retrieval result for a user query returned from Webpage Retrieval, which contains a ranked list of webpages. This retrieval method emphasizes the "precision" criterion. As a matter of fact, Webpage Retrieval can structure another type of retrieval result to emphasize the "recall" factor by retrieving and ranking the websites that are most suited to answer the user query. Fig. 24 exemplifies such a returned result, composed of a ranked list of websites along with their contained webpages. Either type of ranked list is turned over to the User Interface for personalization before being displayed to the user.

Fig. 22. User query through user interface.

Fig. 23. Example of webpage-oriented retrieval from website models. (Sample entries: http://www.amd.com, "AMD - Advanced Micro Device, INC.", (CPU, 0.9), 0.95; http://www.intel.com, "Welcome to Intel", (CPU, 0.9), 0.9; http://www.idt.com, "Welcome to IDT", (CPU, 0.7), 0.8; http://www.cyrix.com, "VIA Technologies, INC.", (CPU, 0.7), 0.7.)

Fig. 24. Example of website-oriented retrieval from website models. (The AMD, Intel, IDT, and Cyrix website models are each expanded down to a CPU class entry, e.g., AMD website model => Motherboard cluster => ... => CPU class => ... => http://www.amd.com; (CPU, 0.9); 0.95; ...)

5.2. Experiment environment

Our Search Agent is developed using Borland JBuilder 7.0 on Windows XP. We collected in total ten classes with 100 webpages in each class from hardware-related websites, as shown in Table 6.

Table 6
Experimental data

Class          Webpage count    Website count
CPU            100              7
Motherboard    100              4
Graphic card   100              3
Sound card     100              13
Network card   100              5
SCSI card      100              7
Optical drive  100              5
Monitor        100              4
Hard drive     100              5
Modem          100              7

5.3. Performance evaluation of OntoClassifier

The first experiment is to learn how well OntoClassifier works. We applied the feature selection program, as described in ontology reorganization, to all collected webpages to select ontology features for each class. Table 7 shows the number of features for each class. To avoid unexpected delay, we limit the level of related concepts to 7 during the second-stage classification of OntoClassifier. Fig. 25 shows its performance for each class. Several interesting points deserve notice here. First, with a very small number of ontology features, OntoClassifier can produce very accurate classification results in virtually all classes. Even with 10 features, over 80% classification accuracy can be obtained in all classes. Second, the classification accuracy of OntoClassifier is very stable with respect to the number of ontology features. In Wang

Table 7
Number of features in each training class

Class          # of features
CPU            93
Motherboard    96
Graphic card   90
Sound card     92
Network card   67
SCSI card      55
Optical drive  88
Monitor        79
Hard drive     57
Modem          66



Fig. 25. Classification performance of OntoClassifier. (Accuracy vs. number of features, from 10 to 100, for the ten classes: CPU, Motherboard, Graphic Card, Sound Card, Network Card, SCSI Card, Optical Drive, Monitor, Hard Drive, and Modem.)

(2003), we have reported a performance comparison between OntoClassifier and three other similar classifiers, namely O-PrTFIDF (Joachims, 1997), T-PrTFIDF (Ting, 2000), and D-PrTFIDF (Wang, 2003). All three classifiers and their respective feature selection methods were reimplemented. We found that none of these three classifiers can match the performance of OntoClassifier with respect to either classification accuracy or classification stability. To verify that the superior performance of OntoClassifier is not due to overfitting, we used 1/3, 1/2, and 2/3 of the collected webpages, respectively, for training in each class to select the ontology features, and used all webpages of the class for testing. Table 8 shows how OntoClassifier behaves with respect to different ratios of training samples. The column "# of Features" gives the number of ontology features used in each class. It shows that the superior accuracy can be obtained even with 1/3 of the webpages used for training.

5.4. Do ontology features work for any classifiers?

The second experiment is to learn whether the superior performance of OntoClassifier is purely due to the ontology features. In other words, can the ontology features work for other classifiers too? For this purpose, this experiment uses the same set of ontology features derived for OntoClassifier to test the performance of O-PrTFIDF, D-PrTFIDF and T-PrTFIDF, the three classifiers mentioned before. This time we only used 1/3 of the collected Web pages for training the ontology features and used all

Web pages for testing. To make each classifier work best, we allow each classifier to arbitrarily choose the first 40 features that make it work best. We limit the number to 40 because the SCSI card class has only 38 features. Fig. 26 illustrates how the four classifiers work for the classes CPU, Motherboard, Graphic Card, and Sound Card. We note that the O-PrTFIDF and D-PrTFIDF classifiers are the most unstable among the four with respect to different numbers of features. The T-PrTFIDF classifier works rather well except for larger numbers of features, because T-PrTFIDF was designed to work based on an ontology (Ting, 2000). Its computational complexity is greater than OntoClassifier's, though. From this experiment, we learn that ontology features alone do not work for just any classifier; the ontology features work best for those classifiers that are crafted by taking into account how to leverage the power of the ontology. OntoClassifier is such a classification mechanism.

5.5. User-satisfaction evaluation of system prototype

Table 9 shows the comparison of user satisfaction of our system prototype against other search engines. In the table, ST, for satisfaction of testers, represents the average of the satisfaction responses from 10 ordinary users, while SE, for satisfaction of experts, represents that of the satisfaction responses from 10 experts. Basically, each search engine receives 100 queries and returns the first 100 webpages for evaluation of satisfaction by both experts and non-experts. The table shows that our system prototype with the techniques described above (the last row) enjoys the highest satisfaction in most classes. From the evaluation, we conclude that, unless the comparing search engines are specifically tailored to this specific domain, such as HotBot and Excite, our system prototype, in general, retrieves more correct webpages in almost all classes.

6. Related works

We notice that ontology is mostly used in systems that work on information gathering or classification to improve their gathering processes or the search results

Table 8
Classification performance of OntoClassifier under different ratios of training samples

               1/3 training data       1/2 training data       2/3 training data
Class          # Feat.  Acc. (%)       # Feat.  Acc. (%)       # Feat.  Acc. (%)
CPU            69       97             78       100             82       100
Motherboard    81       100            89       100             89       100
Graphic card   61       100            73       100             77       100
Sound card     73       98             73       99              89       99
Network card   53       94             60       98              64       100
SCSI card      38       93             48       98              50       98
Optical drive  73       90             82       94              87       94
Monitor        69       100            74       100             75       100
Hard drive     39       99             44       98              50       99
Modem          64       100            66       100             66       100


Fig. 26. Do ontology features work for any classifiers? (Accuracy vs. number of features, from 10 to 40, comparing D-PrTFIDF, O-PrTFIDF, T-PrTFIDF, and OntoClassifier on the (a) CPU, (b) Motherboard, (c) Sound Card, and (d) Graphic Card classes.)

from disparate resources (Eichmann, 1998). For instance, MELISA (Abasolo & Gómez, 2000) is an ontology-based information retrieval agent with three levels of abstraction, separate ontologies and query models, and definitions of some aggregation operators for combining results from different queries. WebSifter II is a semantic-taxonomy-based, personalizable meta-search agent (Kerschberg, Kim, & Scime, 2001) that tries to capture the semantics of a user's decision-oriented search intent, to transform the semantic query into target queries for existing search engines, and to rank the resulting page hits according to a user-specified weighted-rating scheme. Chen and Soo (2001) describe an ontology-based information gathering agent which utilizes the domain ontology and corresponding support (e.g., procedure attachments, parsers, wrappers and integration rules) to gather the information related to users' queries from disparate information resources in order to provide much more coherent results for the users. Park and Zhang (2003) describe a novel method for webpage classification that is based on sequential learning of classifiers which are trained on a small number of labeled data and then augmented by a large number of unlabeled data. Wang, Yu, and Nishino (2004) propose a new website information detection system based on webpage type classification for searching information in a particular domain. SALEM (Semantic Annotation for LEgal Management) (Bartolini, Lenci, Montemagni, Pirrelli, & Soria, 2004) is an incremental system developed for automated semantic annotation of (Italian) law texts to support effective indexing and retrieval of legal documents. Chan and Lam (2005) propose an approach for facilitating functional annotation with the Gene Ontology by focusing on a subtask of annotation, that is, determining which Gene Ontology terms a piece of literature is associated with.
Swoogle (Ding et al., 2004) is a crawler-based system that discovers, retrieves, analyzes and indexes knowledge encoded in semantic web documents on the Web; it can use either character N-grams or URIrefs as keywords to find relevant documents and to compute the similarity among a set of documents. Finally, Song, Lim, Park, Kang, and Lee (2005) suggest an automated method for document classification using an ontology, which expresses terminology information and vocabulary contained in Web documents by way of a hierarchical structure. In this paper, we not only proposed an ontology-directed classification mechanism, namely OntoClassifier, which can decide the class of a webpage or a website in the semantic decision process for Web services, but also advocated the use of ontology-supported website models to provide a semantic level solution for a search agent so that it can provide fast, precise and stable search results.

Table 9
User satisfaction evaluation

K_Word method   CPU (SE/ST)   Motherboard (SE/ST)   Memory (SE/ST)   Average (SE/ST)
Yahoo           67%/61%       77%/78%               38%/17%          61%/52%
Lycos           64%/67%       77%/76%               36%/20%          59%/54%
InfoSeek        69%/70%       71%/70%               49%/28%          63%/56%
HotBot          69%/63%       78%/76%               62%/31%          70%/57%
Google          66%/64%       81%/80%               38%/21%          62%/55%
Excite          66%/62%       81%/81%               50%/24%          66%/56%
Alta Vista      63%/61%       77%/78%               30%/21%          57%/53%
Our prototype   78%/69%       84%/78%               45%/32%          69%/60%

As to Web search, current general search engines use the concept of crawlers (spiders or soft-robots) to help users automatically retrieve useful Web information via ad hoc mechanisms. For example, Dominos (Hafri & Djeraba, 2004) can crawl several thousand pages every second, includes a high-performance fault manager, is platform independent, and is able to adapt transparently to a wide range of configurations without incurring additional hardware expenditure. Ganesh, Jayaraj, Kalyan, and Aghila (2004) propose an association metric to estimate the semantic content of a URL based on the domain-dependent ontology, which in turn strengthens the metric used for prioritizing the URL queue. UbiCrawler (Boldi, Codenotti, Santini, & Vigna, 2004), a scalable distributed web crawler, offers platform independence, linear scalability, graceful degradation in the presence of faults, a very effective assignment function for partitioning the domain to crawl, and, more generally, complete decentralization of every task. Chan (2008) proposes an intelligent spider that consists of a URL searching agent and an auction data agent to automatically collect related information by crawling over 1000 deals from Taiwan's eBay whenever users input the searched product. Finally, Google adopts a PageRank approach to rank a large amount of webpage link information and pre-record it for solving the problem (Brin & Page, 1998). A general Web crawler is, in general, a greedy tool that may make the URL list too large to handle. A focused crawler, instead, aims at locating domain knowledge and the necessary meta-information for assisting the system to find related Web targets.
The concept of a Distiller has been employed to rank URLs for Web search (Barfourosh, Nezhad, Anderson, & Perlis, 2002; Diligenti, Coetzee, Lawrence, Giles, & Gori, 2000; Rennie & McCallum, 1999). IBM's system is an example, which adopts the HITS algorithm, similar to PageRank, for controlling Web search (Kleinberg, 1999). These methods are ad hoc and need off-line, time-consuming pre-processing. In our system, we not only develop a focused crawler using website models as the core technique, which helps search agents successfully tackle the problems of search scope and user interests, but also introduce a four-phase progressive website expansion strategy for the focused crawler to control the Web search, taking into account both user interests and domain specificity.

7. Conclusions and discussion

We have described how ontology-supported website models can effectively support Web search. A website model contains webpage profiles, each recording the basic information, statistics information, and ontology information of a webpage. The ontology information is an annotation of how the webpage is interpreted by the domain ontology. The website model also contains a website profile that remembers how a website is related to its webpages and how it is interpreted by the domain ontology. We have developed a Search Agent, which employs domain ontology-supported website models as the core technology to search for Web resources that are both user-interested and domain-oriented. Our preliminary experimentation demonstrates that the system prototype can retrieve more correct webpages with higher user satisfaction. The Agent features the following interesting characteristics. (1) Ontology-supported construction of website models. By this, we can attribute domain semantics to the Web resources collected and stored in the local database. One important technique used here is the ontology-supported OntoClassifier; our experiments show that OntoClassifier performs very accurate and stable webpage classification, supporting more correct annotation of domain semantics. (2) Website models-supported website model expansion. By this, we can take into account both user interests and domain specificity. The core technique here is the Focused Crawler, which employs progressive strategies to do user query-driven webpage expansion, autonomous website expansion, and query results exploitation to effectively expand the website models. (3) Website models-supported webpage retrieval. We leverage the power of ontology features as a fast index structure to locate the most-wanted webpages for the user. (4) We mentioned that the User Interface works as a query expansion and answer personalization mechanism for the Search Agent. As a matter of fact, this module has been expanded into a User Interface Agent in our information integration system (Yang, 2006).
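Characteristic (3) treats ontology features as a fast index structure for webpage retrieval. A minimal sketch of that idea, under our own assumptions (the page identifiers and feature names below are invented), is an inverted index from ontology features to webpage profiles:

```python
from collections import defaultdict

# Hypothetical webpage profiles: page id -> ontology features annotated on the page.
page_features = {
    "p1": {"cpu", "clock"},
    "p2": {"memory", "dram"},
    "p3": {"cpu", "cache"},
}

# Build an inverted index from ontology feature -> pages, so retrieval becomes a
# set lookup rather than a scan over page text.
index = defaultdict(set)
for page, feats in page_features.items():
    for f in feats:
        index[f].add(page)

def retrieve(query_features):
    """Return pages annotated with any query feature, ranked by feature overlap."""
    hits = defaultdict(int)
    for f in query_features:
        for page in index.get(f, ()):
            hits[page] += 1
    return sorted(hits, key=lambda p: (-hits[p], p))
```

A query annotated with the features {"cpu", "cache"} then retrieves the most relevant profile first without touching pages that share no feature.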
The User Interface Agent can interact with the user in a more semantics-oriented way according to his/her proficiency degree in the domain (Yang, 2007). Most of our current experiments concern the performance of OntoClassifier. We are unable to do experiments on, or comparisons of, how good the Search Agent is at expanding useful Web resources. Our difficulties are summarized below. (1) To our knowledge, no current Web search system adopts an approach similar to ours, in the sense that none relies on ontology as heavily as our system does to support Web search. It is thus rather hard for us to make a fair and convincing comparison. (2) Our ontology construction is based on a set of pre-collected webpages in a specific domain; it is hard to evaluate how critical this pre-collection process is to the nature of different domains. We are planning to employ the technique of automatic ontology evolution, for example, combined with data mining technology for discovering useful information and generating desired knowledge to support ontology construction (Wang, Lu, & Zhang, 2007), to help study the robustness of our ontology. (3) In general, a domain ontology-based technique cannot work as a general-purpose search engine. We are planning to create a general-purpose search engine by employing multiple instances of our Search Agent, supported by a set of domain ontologies, through a multi-agent architecture.

Acknowledgements

The author would like to thank Jr-Chiang Liou, Yung Ting, Jr-Well Wu, Yu-Ming Chung, Zing-Tung Chou, Ying-Hao Chiu, Ben-Chin Liao, Yi-Ching Chu, Shu-Ting Chang, Yai-Hui Chang, Chung-Min Wang, and Fang-Chen Chuang for their assistance in system implementation. This work was supported by the National Science Council, ROC, under Grants NSC-89-2213-E-011-059, NSC-89-2218-E-011-014, and NSC-95-2221-E-129-019.

References

Abasolo, J. M., & Gómez, M. (2000). MELISA: An ontology-based agent for information retrieval in medicine. Available at http://citeseer.nj.nec.com/442210.html.
Al-Halami, R., & Berwick, R. (1998). WordNet: An electronic lexical database. ISBN 0-262-06197-X.
Arnaud, L. H., Philippe, L. H., Lauren, W., Gavin, N., Jonathan, R., Mike, C., & Steve, B. (2004). Document object model (DOM) level 3 core specification. Available at http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/.
Ashish, N., & Knoblock, C. A. (1997). Wrapper generation for semi-structured Internet sources. ACM SIGMOD Record, 26(4), 8–15.
Barfourosh, A. A., Nezhad, H. M., Anderson, M. L., & Perlis, D. (2002). Information retrieval on the world wide web and active logic: A survey and problem definition. Technical Report CS-TR-4291, Department of Computer Science, University of Maryland, Maryland, USA.
Bartolini, R., Lenci, A., Montemagni, S., Pirrelli, V., & Soria, C. (2004). Automatic classification and analysis of provisions in Italian legal texts: A case study. In Proceedings of the 2nd workshop on regulatory ontologies (pp. 593–604). Larnaca, Cyprus.
Boldi, P., Codenotti, B., Santini, M., & Vigna, S. (2004). UbiCrawler: A scalable fully distributed web crawler. Software: Practice and Experience, 34(8), 711–726.
Brickley, D., & Guha, R. V. (2004).
RDF vocabulary description language 1.0: RDF schema. Available at http://www.w3.org/TR/2004/REC-rdf-schema-20040210/.
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. In Proceedings of the 7th international world wide web conference (pp. 107–117). Brisbane, Australia.
Chan, C. C. (2008). Intelligent spider for information retrieval to support mining-based price prediction for online auctioning. Expert Systems with Applications: An International Journal, 34(1), 347–356.
Chan, K., & Lam, W. (2005). Gene ontology classification of biomedical literatures using context association. In Proceedings of the 2nd Asia information retrieval symposium (pp. 552–557). Jeju Island, Korea.
Chen, Y. J., & Soo, V. W. (2001). Ontology-based information gathering agents. In Proceedings of the 2001 international conference on web intelligence (pp. 423–427). Maebashi TERRSA, Japan.
Chiu, Y. H. (2003). An interface agent with ontology-supported user models. Master thesis, Department of Electronic Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan.
Cho, J., & Garcia-Molina, H. (2000). The evolution of the web and implications for an incremental crawler. In Proceedings of the 26th international conference on very large databases (pp. 200–209). Cairo, Egypt.
DAML. (2003). Available at http://www.daml.org/about.html.


DAML+OIL. (2001). Available at http://www.daml.org/2001/03/daml+oil-index/.
Decker, S., Melnik, S., van Harmelen, F., Fensel, D., Klein, M., Broekstra, J., et al. (2000). The semantic web: The roles of XML and RDF. IEEE Internet Computing, 4(5), 63–74.
Diligenti, M., Coetzee, F., Lawrence, S., Giles, C., & Gori, M. (2000). Focused crawling using context graphs. In Proceedings of the 26th international conference on very large databases (pp. 527–534). Cairo, Egypt.
Ding, L., Finin, T., Joshi, A., Pan, R., Cost, R. S., Peng, Y., Reddivari, P., Doshi, V., & Sachs, J. (2004). Swoogle: A search and metadata engine for the semantic web. In Proceedings of the 13th ACM international conference on information and knowledge management (pp. 652–659). Washington, DC, USA.
Eichmann, D. (1998). Automated categorization of web resources. Available at http://www.iastate.edu/~CYBERSTACKS/Aristotle.htm.
Ganesh, S., Jayaraj, M., Kalyan, V., & Aghila, G. (2004). Ontology-based web crawler. In Proceedings of the international conference on information technology: Coding and computing (pp. 337–341). Las Vegas, NV, USA.
Hafri, Y., & Djeraba, C. (2004). Dominos: A new web crawler's design. In Proceedings of the 4th international web archiving workshop. Bath, UK.
Henry, S. T., David, B., Murray, M., & Noah, M. (2001). XML Base. Available at http://www.w3.org/TR/2001/REC-xmlbase-20010627/.
Joachims, T. (1997). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of the 14th international conference on machine learning (pp. 143–151). Nashville, Tennessee, USA.
Kerschberg, L., Kim, W., & Scime, A. (2001). WebSifter II: A personalizable meta-search agent based on weighted semantic taxonomy tree. In Proceedings of the international conference on internet computing (pp. 14–20). Las Vegas, USA.
Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604–632.
Lawrence, S., & Giles, C. L. (1999). Accessibility and distribution of information on the web. Nature, 400, 107–109.
Lawrence, S., & Giles, C. L. (2000). Accessibility of information on the web. ACM Intelligence: New Visions of AI in Practice, 11(1), 32–39.
Manola, F. (1998). Towards a web object model. Available at http://op3.oceanpark.com/papers/wom.html.
McCallum, A. K. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. Available at http://www.cs.cmu.edu/~mccallum/bow.
Moldovan, D. I., & Mihalcea, R. (2000). Using WordNet and lexical operators to improve internet searches. IEEE Internet Computing, 4(1), 34–43.
Noy, N. F., & Hafner, C. D. (1997). The state of the art in ontology design. AI Magazine, 18(3), 53–74.
Noy, N. F., & McGuinness, D. L. (2001). Ontology development 101: A guide to creating your first ontology. Stanford Knowledge Systems Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics Tech. Rep. SMI-2001-0880.
OIL. (2000). Available at http://www.ontoknowledge.org/oil/downl/oil-whitepaper.pdf.
Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Stanford Digital Libraries Working Paper SIDL-WP-1999-0120, Department of Computer Science, Stanford University, CA, USA.
Park, S. B., & Zhang, B. T. (2003). Automatic webpage classification enhanced by unlabeled data. In Proceedings of the 4th international conference on intelligent data engineering and automated learning (pp. 821–825). Hong Kong, China.
Rennie, J., & McCallum, A. (1999). Using reinforcement learning to spider the web efficiently. In Proceedings of the 16th international conference on machine learning (pp. 335–343). Bled, Slovenia.
Salton, G., & Buckley, C. (1988). Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513–523.

Please cite this article in press as: Yang, S.-Y., An ontological website models-supported search agent for web services, Expert Systems with Applications (2007), doi:10.1016/j.eswa.2007.09.024


Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. McGraw-Hill.
Selberg, E., & Etzioni, O. (1995). Multi-service search and comparison using the MetaCrawler. In Proceedings of the 4th international world wide web conference (pp. 169–173). Boston, USA.
Song, M. H., Lim, S. Y., Park, S. B., Kang, D. J., & Lee, S. J. (2005). An automatic approach to classify web documents using a domain ontology. In Proceedings of the first international conference on pattern recognition and machine intelligence (pp. 666–671). Kolkata, India.
Ting, Y. (2000). A search agent with website models. Master thesis, Department of Electronic Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan.
Wang, C., Lu, J., & Zhang, G. Q. (2007). Mining key information of web pages: A method and its applications. Expert Systems with Applications: An International Journal, 33(2), 425–433.
Wang, C. M. (2003). Web search with ontology-supported technology. Master thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan.
Wang, Z. L., Yu, H., & Nishino, F. (2004). Automatic special type website detection based on webpage type classification. In Proceedings of the first international workshop on web engineering. Santa Cruz, USA.
Weibel, S. (1999). The state of the Dublin Core metadata initiative. D-Lib Magazine, 5(4).

Yang, S. Y., & Ho, C. S. (1999). Ontology-supported user models for interface agents. In Proceedings of the 4th conference on artificial intelligence and applications (pp. 248–253). Chang-Hua, Taiwan. Yang, S. Y. (2006). An ontology-directed webpage classifier for web services. In Proceedings of joint 3rd international conference on soft computing and intelligent systems and 7th international symposium on advanced intelligent systems (pp. 720–724). Tokyo, Japan. Yang, S. Y. (2006). A website model-supported focused crawler for search agents. In Proceedings of the 9th joint conference on information sciences (pp. 755–758). Kaohsiung, Taiwan. Yang, S. Y. (2006). An ontology-supported website model for web search agents. In Proceedings of the 2006 international computer symposium (pp. 874-879). Taipei, Taiwan. Yang, S. Y. (2006). How does ontology help information management processing. WSEAS Transactions on Computers, 5(9), 1843–1850. Yang, S. Y. (2006). An ontology-supported information management agent with solution integration and proxy. In Proceedings of the 10th WSEAS international conference on computers (pp. 974-979). Athens, Greece. Yang, S. Y. (2007). An ontology-supported user modeling technique with query templates for interface agents. In Proceedings of 2007 WSEAS international conference on computer engineering and applications (pp. 556–561). Gold Coast, Queensland, Australia. Yang, S. Y. (2007). How does ontology help user query processing for FAQ services. WSEAS Transactions on Information Science and Applications, 4(5), 1121–1128.


ARTICLE IN PRESS Expert Systems with Applications xxx (2008) xxx–xxx


Developing of an ontological interface agent with template-based linguistic processing technique for FAQ services Sheng-Yuan Yang * Department of Computer and Communication Engineering, St. John’s University, 499, Sec. 4, TamKing Road, Tamsui, Taipei County 251, Taiwan, ROC

Article info

Available online xxxx

Keywords: Ontological interface agents; Template-based query processing; FAQ services

Abstract

This paper proposes an ontological Interface Agent which works as an assistant between users and FAQ systems. We integrated several interesting techniques, including domain ontology, user modeling, and template-based linguistic processing, to effectively tackle the problems associated with traditional FAQ retrieval systems. Specifically, we address the following issues. First, how can an interface agent learn a user's specialty in order to build a proper user model for him/her? Second, how can domain ontology help in establishing user models, analyzing user queries, and assisting and guiding interface usage? Finally, how can the intention and focus of a user be correctly extracted? Our work features a template-based linguistic processing technique for developing ontological interface agents; a natural language query mode, along with an improved keyword-based query mode; and assistance and guidance for human–machine interaction. Our preliminary experimentation demonstrates that the intention and focus of up to eighty percent of user queries can be correctly understood by the system, which accordingly provides query solutions with higher user satisfaction. © 2008 Elsevier Ltd. All rights reserved.

1. Introduction

With the increasing popularity of the Internet, people depend more and more on the Web to obtain information; it is like a huge knowledge treasury waiting for exploration. People used to be puzzled by the problem of how to search for information in this Web treasury. As the techniques of information retrieval (Salton, Wong, & Yang, 1975; Salton & McGill, 1983) matured, a variety of information retrieval systems were developed, e.g., search engines and Web portals, to help search the Web. How to search is no longer a problem. The problem now is that the results from these information retrieval systems contain so much information that they overwhelm the users. Now, people want information retrieval systems to provide more help, for instance, by retrieving only the results that better meet the user's requirements. Other wanted capabilities include a better interface for the user to express his/her true intention, better-personalized services, and so on. In short, how to improve traditional information retrieval systems to provide search results that better meet user requirements, so as to reduce the user's cognitive load, is an important issue in current research (Chiu, 2003). Websites which provide Frequently Asked Questions (FAQs) organize user questions and expert answers about a specific product or discipline in terms of question–answer pairs on the Web. * Corresponding author. Tel.: +886 2 28013131x6394; fax: +886 28013131x6391. E-mail address: [email protected].

Each FAQ is represented by one question along with one answer, and is characterized as domain-dependent, short and explicit, and frequently asked (Lee, 2000; OuYang, 2000). People usually go through the list of FAQs and read those that are related to their questions. This way of answering users' questions saves experts the labor of answering similar questions repeatedly. The problem is that, with the fast accumulation of FAQs, it becomes harder for people to single out related FAQs. Traditional FAQ retrieval systems, however, provide only little help, because they fail to provide assistance and guidance for human–machine interaction, personalized information services, a flexible interaction interface, etc. (Chiu, 2003). In order to capture the user's true intention and accordingly provide high-quality FAQ answers that meet the user's requests, we have proposed an Interface Agent that acquires user intention through an adaptive human–machine interaction interface with the help of ontology-directed and template-based user models (Yang et al., 1999; Yang, Chiu, & Ho, 2004; Yang, 2006). It also handles user feedback on the suitability of proposed responses. The agent features ontology-based representation of domain knowledge, a flexible interaction interface, and personalized information filtering and display. Specifically, according to the user's behavior and mental state, we employ the technique of user modeling to construct a user model describing his/her characteristics, preferences, knowledge proficiency level, etc. We also use the technique of user stereotypes (Rich, 1979) to construct and initialize a new user model, which helps provide fast personalized services for new users.

0957-4174/$ - see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2008.03.011

Please cite this article in press as: Yang, S. -Y. , Developing of an ontological interface agent with template-based linguistic ..., Expert Systems with Applications (2008), doi:10.1016/j.eswa.2008.03.011


We built a domain ontology (Noy & McGuinness, 2001) to help define domain vocabulary and knowledge, and based on it constructed the user models and the Interface Agent. We extended the concept of pattern matching (Hovy, Hermjakob, & Ravichandran, 2002) to query templates in order to construct natural language-based query models. This idea leads to the extraction of the user's true intention and focus from his/her query posed in natural language, which helps the agent quickly find precise information for the user, much as Coyle and Smyth (2005) use interaction histories for enhancing search results. Our preliminary experimentation demonstrates that the intention and focus of up to eighty percent of user queries can be correctly understood, and accordingly the system provides query solutions with higher user satisfaction. The rest of the paper is organized as follows: Section 2 describes the fundamental techniques. Section 3 explains the Interface Agent architecture. Section 4 reports the system demonstrations and evaluations. Section 5 compares the work with related works, while Section 6 concludes the work. The Personal Computer (PC) domain is chosen as the target application of our Interface Agent and will be used for explanation in the remaining sections.

2. Fundamental techniques

2.1. Domain ontology

The concept of ontology in artificial intelligence refers to knowledge representation for domain-specific content (Chandrasekaran, Josephson, & Benjamins, 1999). It has been advocated as an important tool to support knowledge sharing and reuse in developing intelligent systems. Although the development of an ontology for a specific domain is not yet an engineering process, we have outlined a procedure for it in Yang et al. (1999), drawn from how the process was conducted in existing systems. By following the procedure, we developed an ontology for the PC domain using Protégé 2000 (Noy & McGuinness, 2001) as the fundamental background knowledge for the system; it was originally developed in Chinese (Yang, Chuang, & Ho, 2007) but is presented in English here for easy explanation. Fig. 1 shows part of the ontology taxonomy. The taxonomy represents relevant PC concepts as classes and their parent–child relationships as isa links, which allow inheritance of features from parent classes to child classes. We carefully selected those properties that are most related to our application from each concept, and defined them as the detailed ontology for the corresponding class. Fig. 2 exemplifies the detailed ontology of the concept of CPU. In the figure, the root node uses various fields to define the semantics of the CPU class, each field representing an attribute of "CPU", e.g., interface, provider, synonym, etc. The nodes at the lower level represent various CPU instances, which capture real-world data. The arrow line labeled "io" denotes the instance-of relationship. The complete PC ontology can be referenced from the Protégé Ontology Library at the Stanford website (). We also developed a problem ontology to deal with query questions. Fig. 3 illustrates part of the problem ontology, which contains query types and operation types. Together they imply the semantics of a question. Finally, we use Protégé's APIs to develop a set of ontology services, which provide primitive functions to support the application of the ontologies. The ontology services currently available include transforming query terms into canonical ontology terms, finding definitions of specific terms in the ontology, finding relationships among terms, finding compatible and/or conflicting terms against a specific term, etc.
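The first of these ontology services, transforming query terms into canonical ontology terms, can be sketched as a lookup over the synonym and abbreviation fields of the ontology frames. The mapping below is a toy illustration drawn from the CPU examples in Fig. 2, not the paper's actual Protégé knowledge base:

```python
# Illustrative synonym/abbreviation table: lowercase surface form -> canonical
# ontology term. In the real system this would be derived from the Synonym and
# Abbr. fields of the Protégé frames.
CANONICAL = {
    "central processing unit": "CPU",
    "p4 2.0ghz": "PENTIUM4 2.0AGHZ",
    "athlon 1.33g": "THUNDERBIRD 1.33G",
}

def canonicalize(term: str) -> str:
    """Map a user term to its canonical ontology term; unknown terms pass through."""
    return CANONICAL.get(term.strip().lower(), term)
```

Normalizing user vocabulary this way lets later stages (template matching, retrieval) work over a single canonical term set.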

[Fig. 1 here: part of the PC ontology taxonomy. The Hardware class has isa-children including Interface Card, Power Equipment, Memory, Storage Media, and Case; these in turn subsume classes such as Network Chip, Sound Card, Display Card, SCSI Card, Network Card, Power Supply, UPS, Main Memory, ROM, Optical, CD, ZIP, DVD, CDR/W, and CDR.]

Fig. 1. Part of PC ontology taxonomy.

[Fig. 2 here: the detailed ontology of the concept of CPU. The root frame defines fields such as Synonym (Central Processing Unit), D-Frequency, Interface, L1 Cache, Volume, Spec., and Abbr. Instance nodes linked by io include XEON (Factory = Intel), THUNDERBIRD 1.33G (Synonym = Athlon 1.33G, Interface = Socket A, L1 Cache = 128KB, Abbr. = Athlon, Factory = AMD), DURON 1.2G (Interface = Socket A, L1 Cache = 64KB, Abbr. = Duron, Factory = AMD, Clock = 1.2GHZ), PENTIUM4 2.0AGHZ, PENTIUM 4 1.8AGHZ, and PENTIUM 4 2.53AGHZ (all with Interface = Socket 478, L1 Cache = 8KB, Abbr. = P4), and CELERON 1.0G (Interface = Socket 370, L1 Cache = 32KB, Abbr. = Celeron, Factory = Intel, Clock = 1GHZ).]

Fig. 2. Ontology of the concept of CPU.
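The class-frame-plus-instances structure of Fig. 2 can be held in memory as a small record type; the attribute values below are copied from the figure, while the data structure itself is our illustrative choice, not the paper's Protégé representation:

```python
from dataclasses import dataclass, field

@dataclass
class OntologyInstance:
    """An instance node linked to its class by the io (instance-of) relation."""
    name: str
    attributes: dict = field(default_factory=dict)

@dataclass
class OntologyClass:
    """A class frame: attribute fields plus its instance nodes."""
    name: str
    fields: list
    instances: list = field(default_factory=list)

cpu = OntologyClass("CPU", ["Synonym", "D-Frequency", "Interface", "L1 Cache", "Abbr."])
cpu.instances.append(OntologyInstance("DURON 1.2G", {
    "Interface": "Socket A", "L1 Cache": "64KB", "Abbr.": "Duron",
    "Factory": "AMD", "Clock": "1.2GHZ"}))
cpu.instances.append(OntologyInstance("CELERON 1.0G", {
    "Interface": "Socket 370", "L1 Cache": "32KB", "Abbr.": "Celeron",
    "Factory": "Intel", "Clock": "1GHZ"}))

def instances_by(cls: OntologyClass, attr: str, value: str):
    """Ontology-service-style query: instances whose attribute matches a value."""
    return [i.name for i in cls.instances if i.attributes.get(attr) == value]
```

A query such as "which CPUs come from AMD" then reduces to a simple attribute filter over the instance nodes.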


[Fig. 3 here: part of the problem ontology taxonomy. The Query class has two isa-children, Operation Type and Query Type. Operation Type instances (io) include Adjust, Use, Setup, Close, Open, Support, and Provide; Query Type instances include How, What, Why, and Where.]

Fig. 3. Part of problem ontology taxonomy.
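As the text above notes, a query type and an operation type together imply the semantics of a question. A minimal sketch of that pairing, using the instance names shown in Fig. 3 (the combined tag format is our assumption):

```python
# Instance sets from the problem ontology of Fig. 3.
QUERY_TYPES = {"How", "What", "Why", "Where"}
OPERATION_TYPES = {"Adjust", "Use", "Setup", "Close", "Open", "Support", "Provide"}

def question_semantics(query_type: str, operation_type: str) -> str:
    """Combine a query type and an operation type into a semantic tag, e.g. HOW_SETUP."""
    if query_type not in QUERY_TYPES or operation_type not in OPERATION_TYPES:
        raise ValueError("unknown query or operation type")
    return f"{query_type.upper()}_{operation_type.upper()}"
```

Tags of this shape line up with the intention types (e.g., HOW_SET, WHY_SETUP) used by the query templates in Section 2.2.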

2.2. Query templates

To build the query templates, we collected in total 1215 FAQs from the FAQ websites of six famous motherboard manufacturers in Taiwan and used them as the reference materials for query template construction (Hovy et al., 2002; Soubbotin et al., 2001). Currently, we only handle user queries with one intention word and at most three sentences. These FAQs were analyzed and categorized into six types of questions, as shown in Table 1.

Table 1
Question types

Question type | Intention
(A-NOT-A) | Asks about can or cannot, should or should not, have or have not
(HOW) | Asks about solving methods
(WHAT) | Enumerates related information
(WHEN) | Asks about time, year, or date
(WHERE) | Asks place or position
(WHY) | Asks reasons

For each type of question, we further identified several intention types according to its operations. Table 2 illustrates some examples of intention types. Finally, we define a query pattern for each intention type. Table 3 illustrates the query patterns defined for the intention types of Table 2, and Table 4 explains the syntactical constructs of the query patterns.

Table 2
Examples of intention types

Intention type | Description
ANA_CAN_SUPPORT | Asks if some specifications or products are supported
HOW_SET | Asks the method of assignment
WHAT_IS | Asks the meaning of a terminology
WHEN_SUPPORT | Asks when something can be supported
WHERE_DOWNLOAD | Asks where something can be downloaded
WHY_SETUP | Asks reasons about setup

Table 3
Examples of query patterns
[Table content not recoverable in this copy.]

Table 4
Definition of pattern symbols and descriptions

Symbol | Description
⟨ ⟩ | A single sentence; considers the sequence of intention words and keywords
[ ] | At least one sentence; considers only the keywords that appear, regardless of sequence
Si | The variable part of a template, which is any string consisting of keywords
Intention Word | The fixed part of a template, which helps the system distinguish between user intentions
Keyword | A concept in the domain ontology, usually a domain terminology
Focus | The variable part of a template that is the key point of the user query

Now all the information for constructing a query template is ready, and we can formally define a query template. Table 5 defines what a query template is: it contains a template number, number of sentences, intention words, intention type, question type, operation type, query patterns, and focus. Table 6 illustrates an example query template for the ANA_CAN_SUPPORT intention type. Note here that we collect similar query patterns in the field of "Query_Patterns," which are used in detailed analysis of a given query.
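The matching step implied by these tables can be sketched as follows: each pattern fixes the intention words and leaves a variable slot whose filler becomes the focus. The single regex-based pattern below is our own toy rendering (the real system holds 319 templates with 1031 patterns):

```python
import re

# One illustrative template: intention words fixed ("can ... support"), variable
# slot captured as the focus. Pattern syntax is ours, not the paper's notation.
TEMPLATES = [
    # (intention type, question type, compiled pattern with a 'focus' group)
    ("ANA_CAN_SUPPORT", "A-NOT-A",
     re.compile(r"can .* support (?P<focus>.+?)\??$", re.IGNORECASE)),
]

def match_query(query: str):
    """Return the intention type, question type, and focus of the first matching template."""
    for intention_type, question_type, pattern in TEMPLATES:
        m = pattern.search(query)
        if m:
            return {"intention_type": intention_type,
                    "question_type": question_type,
                    "focus": m.group("focus")}
    return None  # no template matched
```

For "Can my motherboard support DDR memory?", this identifies the ANA_CAN_SUPPORT intention with "DDR memory" as the focus.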


Table 5
Query template specification

Field | Description
Template_Number | Template ID
#Sentence | The number of sentences: 1 for one sentence, 2 for two sentences, and 3 for three or more sentences
Intention_Word | The intention words that must appear
Intention_Type | Intention type
Question_Type | Question type
Operation_Type | Operation type
Query_Patterns | The semantic pattern, consisting of intention words and keywords
Focus | The focus of the user

Table 6
Query template for the ANA_CAN_SUPPORT intention type
[Table content not recoverable in this copy.]

According to the generalization relationships among intention types, we can form a hierarchy of intention types to organize all FAQs. Currently, the hierarchy contains two levels, as shown in Fig. 4. Now, the system can employ the intention type hierarchy to reduce the search scope during the retrieval of FAQs once the intention of a user query is identified. Table 7 shows the statistics of query templates in our system. Currently, we have in total 154 intention words, which form an intention word base.
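The Table 5 fields transcribe directly into a record type; the example values filled in below are invented for illustration, since Table 6's actual content is not recoverable from this copy:

```python
from dataclasses import dataclass

@dataclass
class QueryTemplate:
    """One query template, with fields taken from Table 5."""
    template_number: int
    n_sentences: int       # 1, 2, or 3 (3 means three or more sentences)
    intention_words: list  # intention words that must appear
    intention_type: str
    question_type: str
    operation_type: str
    query_patterns: list   # semantic patterns of intention words and keywords
    focus: str

# Hypothetical example values (not the paper's Table 6).
t = QueryTemplate(
    template_number=1, n_sentences=1,
    intention_words=["support"], intention_type="ANA_CAN_SUPPORT",
    question_type="A-NOT-A", operation_type="Support",
    query_patterns=["<can * support [focus]>"], focus="[focus]")
```

Grouping similar patterns in `query_patterns` mirrors the note above about collecting similar query patterns inside one template.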

3. Interface agent architecture

3.1. User modeling

In order to provide customized services, we observe, record, and learn the user's behavior and mental state in a user model. A user model contains interaction preference, solution presentation, domain proficiency, terminology table, query history, selection history, and user feedback, as shown in Fig. 5. The interaction preference is responsible for recording the user's preferred interface, e.g., favorite query mode, favorite recommendation mode, etc. When the user logs on to the system, the system can select a proper user interface according to this preference. We provide two query modes, through either keywords or natural language input, and three recommendation modes, according to hit rates, hot topics, or collaborative learning. We record the user's recent preferences in a time window, and accordingly determine the next interaction style. The solution presentation is responsible for recording the user's solution ranking preferences. We provide two types of ranking: according to the degree of similarity between the proposed solutions and the user query, or according to the user's proficiency with respect to the solutions. In addition, we use a Show_Rate parameter (described later) to control how many solution items are displayed each time, in order to reduce the information-overloading problem. The domain proficiency factor describes how familiar the user is with the domain. By associating a proficiency degree with each ontology concept, we can construct a table containing a set of ⟨concept, proficiency-degree⟩ pairs as his/her domain proficiency. Thus, when deciding on solution representation, we can calculate the user's proficiency degree on solutions using the table, and accordingly show only the most familiar part of the solutions, hiding the rest for advanced requests. To solve the problem of different terminologies being used by different users, we include a terminology table to record these terminology differences. We can use the table to replace the terms used in the proposed solutions with the user's favorite terms during solution representation, to help him/her better comprehend the solutions. Finally, we record the user's query history, FAQ selection history, and corresponding user feedback in each query session in the interaction history, in order to support collaborative recommendation. The user feedback is a complicated factor: we remember both explicit user feedback in the selection history and implicit user feedback, which includes query time, time of FAQ clicks, sequence of FAQ clicks, sequence of clicked hyperlinks, etc. In order to quickly build an initial user model for a new user, we pre-defined five stereotypes, namely expert, senior, junior, novice, and amateur (Yang et al., 1999, 2004), to represent different user groups' characteristics. This approach is based on the idea that users in the same group tend to exhibit the same behavior and require the same information. Fig. 6 illustrates an example user stereotype. When a new user enters the system, he/she is asked to complete a

Fig. 4. Intention type hierarchy.

Table 7
Statistics of query templates

Question type   #Intention type   #Template   #Pattern   #FAQ (%)
(if)                   19              53         113     385 (31.7)
(how)                  18              44         121     265 (21.8)
(what)                  6              18          19      91 (7)
(when)                  1               1           3       3 (0.2)
(where)                 3               4           4       4 (0.3)
(why)                  25             199         771     467 (38.4)
Total                  69             319        1031    1215

Fig. 5. Our user model (factors: interaction preference, solution presentation, domain proficiency, terminology table, interaction history, explicit feedback, and implicit feedback).

Please cite this article in press as: Yang, S.-Y., Developing of an ontological interface agent with template-based linguistic ..., Expert Systems with Applications (2008), doi:10.1016/j.eswa.2008.03.011

ARTICLE IN PRESS

S.-Y. Yang / Expert Systems with Applications xxx (2008) xxx–xxx

questionnaire, which is used by the system to determine his/her domain proficiency, and accordingly to select a user stereotype and generate an initial user model for him/her. However, the initial user model constructed from the stereotype may be too generic or imprecise. It will be refined to reflect the specific user's real intent after the system has gained experience with his/her query history, FAQ-selection history and feedback, and implicit feedback (Chiu, 2003).

Fig. 6. Example of expert stereotype.

3.2. Architecture overview


Fig. 7 illustrates the architecture of the Interface Agent and shows how it interacts with the other agents of the FAQ-master (Yang, 2006, 2007), which possesses intelligent retrieval, filtering, and integration capabilities and can provide high-quality FAQ answers from the Web. The Interaction Agent provides a personalized interaction, assistance, and recommendation interface for the user according to his/her user model, records interaction information and related feedback in the user model, and helps the User Model Manager and Proxy Agent (Yang, 2007) to update the user model. The Query Parser processes the user queries by first segmenting words, removing conflicting words, and standardizing terms, followed by recording the user's terminologies in the terminology table of the user model. It finally applies the template matching technique to select the best-matched query templates, and accordingly transforms the query into an internal query for the Proxy Agent to search for solutions and collect them into a list of FAQs, each of which contains a corresponding URL. The Web Page Processor pre-downloads FAQ-relevant webpages and performs some pre-processing tasks, including labeling keywords for subsequent processing. The Scorer calculates the user's proficiency degree for each FAQ in the FAQ list according to the domain proficiency table in his/her user model. The Personalizer then produces personalized query solutions according to the terminology table. The User Model Manager is responsible for quickly building an initial user model for a new user using the technique of user stereotyping, as well as for updating the user models and stereotypes to dynamically reflect changes in user behavior. The Recommender is responsible for recommending information to the user based on hit count, hot topics, or the group's interests when a similar interaction history is detected.

Fig. 7. Interface agent architecture. (Components: Interaction Agent, Query Parser, Recommender, Web Page Processor, User Model Manager, Personalizer, and Scorer; knowledge sources: homophone debug base, template base, user model, and ontology. The agent exchanges internal queries, solutions and webpage links, and user feedback with the Search, Answerer, and Proxy Agents over the Internet.)


3.2.1. Interaction agent
The Interaction Agent consists of the following three components: Adapter, Observer, and Assistant. First, the Adapter constructs the best interaction interface according to the user's favorite query and recommendation modes. It is also responsible for presenting to the user the list of FAQ solutions (from the Personalizer) or recommendation information (from the Recommender). During solution presentation, it arranges the solutions in terms of the user's preferred style (query similarity or solution proficiency) and displays the solutions according to the ‘‘Show_Rate.” Second, the Observer passes the user query to the Query Parser, and simultaneously collects the interaction information and related feedback from the user. The interaction information contains the user's preferred query mode, recommendation mode, solution presentation mode, and FAQ clicking behavior, while the related feedback contains the user's satisfaction degree and comprehension degree for each FAQ solution. The User Model Manager needs both the interaction information and the related feedback to properly update user models and stereotypes. The satisfaction degree in the related feedback can also be passed to the Proxy Agent for tuning the solution search mechanism (Yang, 2006). Finally, the Assistant provides proper assistance and guidance to support the user's query process. First, the ontology concepts are structured and presented as a tree so that users who are not familiar with the domain can consult the tree and learn proper terms to enter their queries. We also rank all ontology concepts by their probabilities and display them in a keyword list. When the user enters a query in the input area, the Assistant will automatically ‘‘scroll” the content of the keyword list to those terms related to the input keywords. Fig. 8 illustrates an example of this automatic keyword scrolling mechanism.
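The automatic keyword-scrolling mechanism can be approximated by filtering the ranked concept list with the current input prefix; the ranked list in this sketch is illustrative:

```python
# Ranked ontology concepts (highest probability first); illustrative data
RANKED_KEYWORDS = ["CPU", "motherboard", "memory", "ASUS", "AGP4X", "K7V"]

def scroll_keywords(prefix: str, keywords=RANKED_KEYWORDS, limit: int = 3):
    """Return the top-ranked keywords matching the partial input,
    preserving the original ranking order."""
    prefix = prefix.lower()
    matches = [k for k in keywords if k.lower().startswith(prefix)]
    return matches[:limit]

# As the user types "m", the list scrolls to the matching concepts.
suggestions = scroll_keywords("m")
```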
If the displayed terms of the list contain a concept that the user wants to enter, he can double-click the term to copy it into the input area, e.g., the Chinese term for ‘‘ASUS” at step 2 of Fig. 8. In addition to the keyword-oriented query mode, the Assistant also provides lists of question types and operation types to support question type-oriented or operation type-oriented search. The user can use one, two, or all three of these mechanisms to help form his/her query in order to convey his/her intention to the system.

3.2.2. Query Parser
The Query Parser pre-processes the user query by performing Chinese word segmentation, correction of word segmentation, fault correction of homophonous or multiple words, and term standardization. It then employs template-based pattern matching to analyze the user query and extract the user intention and focus. Finally, it transforms the user query into the internal query format and passes the query to the Proxy Agent for retrieving proper solutions (Yang, 2006). A detailed explanation follows. Fig. 9 illustrates two ways in which the user can enter a Chinese query through the Interface Agent. Fig. 9a shows the traditional keyword-based method, enhanced by the ontology features as illustrated in the left column. The user can directly click on the ontology terms to select them into the input field. Fig. 9b shows the user using natural language to input his/her query. In this case, the Interface Agent first employs the Query Parser to do Chinese word segmentation using MMSEG (Tsai, 2000), correction of word segmentation, fault correction of homophonous or multiple words, and

Fig. 8. Examples of automatic keyword scrolling mechanism.

Fig. 9. User query through our Interface Agent.

term standardization. It then applies the template-based pattern matching technique to analyze the user query, extract the user intention and focus, select the best-matched query templates as shown in Fig. 9c (Yang et al., 2004), and trim any irrelevant keywords in accordance with the templates. Finally, it transforms the user query into the internal query format and passes the query to the Proxy Agent for retrieving proper solutions. Fig. 10 shows the flow chart of user query processing. We decided to use template matching for query processing because it is an easy and efficient way to handle pseudo-natural language processing and the result is usually acceptable on a focused domain. Given a user query in Chinese, we segment the query using MMSEG. The results of segmentation were not good, because the predefined MMSEG word corpus contains insufficient terms from the PC domain. For example, it does not contain keywords such as ‘‘AGP4X” or certain Chinese PC terms, and returns wrong word segmentations like ‘‘AGP” and ‘‘4X”. The step of query pruning can easily fix this by using the ontology as a second word corpus to bring those mis-segmented words back. It also performs fault correction of homophonous or multiple words using the ontology and the homophone debug base (Chiu, 2003). The step of query standardization is responsible for replacing the terms used in the user query with the canonical terms in the ontology and intention word base. The original terms and the corresponding canonical terms will then be stored in the terminology table for solution presentation personalization. Finally, we label the recognized keywords with the symbol ‘‘K” and intention words with the symbol ‘‘I”. The rest are regarded as stop words and removed from the query. Now, if the user is using the keyword mode, we directly jump to the step of query formulation. Otherwise, we use template-based pattern matching to analyze the natural language input.
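The matching of a query against a template pattern can be sketched as follows, assuming that a slot (S1, S2) must absorb at least one keyword; this is an illustrative simplification of the algorithms in Figs. 11-13, not the paper's actual code:

```python
def match_pattern(pattern: str, words: list) -> bool:
    """Match a template pattern such as 'could S1 support S2' against the
    labeled query keywords.  Literal tokens must occur in order; every
    S-slot must absorb at least one query word."""
    i = 0                  # next position to match in the query
    pending_slot = False   # a slot still waiting to consume a word
    for tok in pattern.split():
        if tok.startswith("S"):
            pending_slot = True
            continue
        start = i + 1 if pending_slot else i   # leave room for the slot
        try:
            pos = words.index(tok, start)
        except ValueError:
            return False
        i = pos + 1
        pending_slot = False
    # a trailing slot still needs at least one remaining word
    return not pending_slot or i < len(words)

# Keywords from the worked example later in this section
query = ["could", "support", "1 GHz", "K7V", "motherboard", "Asus", "CPU"]
matched = [p for p in ["could S1 support S2", "could support S1"]
           if match_pattern(p, query)]
```

Under this reading, ⟨could support S1⟩ matches the example query while ⟨could S1 support S2⟩ does not, because no keyword sits between ‘‘could” and ‘‘support”.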
The step of pattern match is responsible for identifying the semantic pattern associated with the user query. Using the pre-constructed query templates in the template base, we can compare the user query with the query templates and select the best-matched one to identify the user intention and focus. Fig. 11 shows the algorithm for quickly selecting possibly matched templates, Fig. 12 describes the algorithm which finds all patterns matched with the user query, and Fig. 13 removes those matched patterns that are generalizations of some other matched patterns. Now we explain how template matching works with the following user query: ‘‘Could Asus K7V motherboard support a CPU


Fig. 10. Flow chart of the user query processing: (1) segmentation, (2) query pruning, (3) query standardization, (4) pattern match (NLP mode only), (5) user confirmation (a ‘‘No” lets the user slightly modify the query), and (6) query formulation into the internal query format for the Proxy Agent; the steps are supported by the homophone debug base, ontology, intention-word base, user model (user preference terms), and template base.

Fig. 11. Query template selection algorithm.

Fig. 12. Pattern match algorithm.

Fig. 13. Query pattern removal algorithm.

over 1 GHz?” The Interface Agent first applies MMSEG to obtain the following list of keywords from the user query: ⟨could, support, 1 GHz, K7V, motherboard, Asus, CPU⟩. The ‘‘could” query type in the intention type hierarchy is then followed to retrieve the corresponding query templates. Table 6 illustrates the single corresponding query template, which contains two patterns, namely, ⟨could S1 support S2⟩ and ⟨could support S1⟩. We find that the second pattern matches the user query and can be selected to transform the query into an internal form by query formulation (step 6), as shown in Table 8a. Note that there may be more than two patterns

Table 8
Internal user query and keyword trimming

(a) Internal query form before keyword trimming
Query type:     Could
Operation type: Support
Keywords:       1 GHz, K7V, motherboard, Asus, CPU

(b) Internal query form after keyword trimming
Query type:     Could
Operation type: Support
Keywords:       1 GHz, K7V, CPU

to become candidates for a given query. In this case, the Query Parser will prompt the user to confirm his/her intent (step 5), as illustrated in Fig. 9c. If the user says ‘‘No”, which means the pattern matching result is not the true intention of the user, he/she is allowed to modify the matched result or change to the keyword mode for placing the query. The purpose of keyword trimming is to remove irrelevant keywords from the user query; irrelevant keywords sometimes have adverse effects on FAQ retrieval. The Query Parser uses trimming rules, as shown in Table 9, to prune these keywords. For example, in Table 8a, ‘‘motherboard” is trimmed and replaced with ‘‘K7V”, since the latter is an instance of the former and can subsume the

Table 9
Examples of trimming rules

Rule no.  Rule description                                                Example
1         A super-class can be replaced with its sub-class                ‘‘Interface card” => ‘‘Sound card”
2         A class can be replaced with its instance                       ‘‘CPU” => ‘‘PIII”
3         A slot value referring to some instance of a class can be       ‘‘Microsoft” => ‘‘Windows 2000”
          replaced with the instance
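The rules of Table 9 can be approximated with a few ontology relations; everything in this sketch, including treating ‘‘Asus” as a slot value of the instance K7V, is illustrative:

```python
# Illustrative ontology fragments (assumed for this sketch)
INSTANCE_OF = {"K7V": "motherboard", "PIII": "CPU"}   # Rule 2: instance -> class
SUBCLASS_OF = {"Sound card": "Interface card"}        # Rule 1: subclass -> superclass
SLOT_VALUES = {"K7V": {"maker": "Asus"}}              # Rule 3: instance -> slot values

def trim_keywords(keywords):
    """Remove keywords subsumed by a more specific keyword in the query."""
    drop = set()
    for kw in keywords:
        # Rules 1 and 2: a class/super-class is subsumed by its instance/sub-class
        for specific, general in (*INSTANCE_OF.items(), *SUBCLASS_OF.items()):
            if kw == general and specific in keywords:
                drop.add(kw)
        # Rule 3: a slot value is subsumed by the instance it belongs to
        for inst, slots in SLOT_VALUES.items():
            if inst in keywords and kw in slots.values():
                drop.add(kw)
    return [kw for kw in keywords if kw not in drop]
```

On the worked example, this drops ‘‘motherboard” (subsumed by its instance K7V) and ‘‘Asus” (a slot value of K7V), leaving the three keywords of Table 8b.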


former according to Trimming Rule 2. Table 8b shows the result of the user query after keyword trimming, which now contains only three keywords, namely, 1 GHz, K7V, and CPU.

3.2.3. Web Page Processor
The Web Page Processor receives from the Proxy Agent a list of retrieved solutions, which contains one or more FAQs matched with the user query, each represented as in Table 10, and retrieves and caches the solution webpages according to the FAQ_URLs. It then pre-processes those webpages for the subsequent customization process, including URL transformation, keyword standardization, and keyword marking. The URL transformation changes all hyperlinks to point toward the cache server. The keyword standardization transforms all terms in the webpage content into ontology vocabularies. The keyword marking labels the keywords appearing in the webpages in boldface, i.e., <B>Keyword</B>, to facilitate subsequent keyword processing and improve webpage readability.

3.2.4. Scorer
Each FAQ is a short document; the concepts involved in FAQs are in general more focused. In other words, the topic (or concept) is much clearer and more professional. The question part of an FAQ is even more pointed about which concepts are involved. Knowing this property, we can use the keywords appearing in the question part of an FAQ to represent its topic. Basically, we use the table of domain proficiency to calculate a proficiency degree for each FAQ from the proficient concepts appearing in the question part of the FAQ, as detailed in Fig. 14.

3.2.5. Personalizer
The Personalizer replaces the terms used in the solution FAQs with the terms in the user's terminology table, collected by the Query Parser, to improve solution readability.

3.2.6. User model manager
The first task of the User Model Manager is to create an initial user model for a new user. To do this, we pre-defined several questions for each concept in the domain ontology, for example, ‘‘Do you know a CPU contains a floating co-processor?”, ‘‘Do you know the concept of 1 GB = 1000 MB in specifying the capacity of a hard disk?”, etc. The difficulty degrees of the questions are proportional to the hierarchy depth of the concepts in the ontology. When a new user logs on to the system, the Manager randomly selects questions from the ontology. The user answers either YES or NO to each question. The answers are collected and weighted according to the respective degrees and passed to the Manager, which then calculates a proficiency score for the user according to the percentage of correct responses to the questions and accordingly instantiates a proper user stereotype as the user model for the user. The second task is to update user models. Here we use the interaction information and user feedback collected by the Interaction Agent in each interaction session or query session. An interaction session is defined as the time period from when the user logs in up to when he logs out, while a query session is defined as the time period from when the user issues a query up to when he gets the answers and completes the feedback. An interaction session may contain several query sessions. After a query session is completed, we immediately update the interaction preference and solution presentation of the user model. Specifically, the user's query mode and solution presentation mode in this query session are remembered in the respective time windows, and the statistics of the preference change for each mode are calculated accordingly, which will be used to adapt the Interaction Agent in the next query session. Fig. 15 illustrates the algorithm to update the Show_Rate of the similarity mode. The algorithm uses the ratio of the number of user-selected FAQs to the number of displayed FAQs to update the show rate; the algorithm to update the Show_Rate of the proficiency mode is similar. In addition, each user is asked to evaluate each solution FAQ in terms of the following five levels of understanding: very familiar, familiar, average, not familiar, very unfamiliar. This provides explicit feedback which we can use to update his/her domain proficiency table. Fig. 16 shows the updating algorithm.
Finally, after each interaction session, we can update the user's recommendation mode in this session in the respective time

Table 10
Format of retrieved FAQ

Field           Description
FAQ_No.         FAQ's identification
FAQ_Question    Question part of the FAQ
FAQ_Answer      Answer part of the FAQ
FAQ_Similarity  Similarity degree of the FAQ matched with the user query
FAQ_URL         Source or related URL of the FAQ

Fig. 14. Proficiency degree calculation algorithm.
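A plausible reading of this calculation, assuming the FAQ's proficiency degree is the average of the user's degrees over the concepts in its question part (Fig. 14 gives the actual algorithm):

```python
def faq_proficiency(question_keywords, domain_proficiency):
    """Score an FAQ by the user's proficiency on the ontology concepts
    appearing in its question part; unknown concepts count as 0."""
    degrees = [domain_proficiency.get(k, 0.0) for k in question_keywords]
    return sum(degrees) / len(degrees) if degrees else 0.0

profile = {"CPU": 0.9, "motherboard": 0.5}
score = faq_proficiency(["CPU", "motherboard"], profile)  # averages to 0.7
```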

Fig. 15. Algorithm to update show rate in similarity mode.
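The Show_Rate update can be sketched as moving the rate toward the observed selection ratio; the smoothing factor alpha is an assumption of this sketch (Fig. 15 gives the actual algorithm):

```python
def update_show_rate(show_rate, num_selected, num_displayed, alpha=0.3):
    """Move the show rate toward the ratio of selected to displayed FAQs.
    alpha is an assumed smoothing factor, not taken from the paper."""
    if num_displayed == 0:
        return show_rate
    ratio = num_selected / num_displayed
    return (1 - alpha) * show_rate + alpha * ratio

# User selected 2 of the 10 displayed FAQs -> the show rate shrinks
rate = update_show_rate(0.5, 2, 10)
```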

Fig. 16. Algorithm to update the domain proficiency table.
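Similarly, the five-level comprehension feedback might update the domain proficiency table along these lines; the level-to-adjustment mapping and the 0.5 default are assumptions of this sketch (Fig. 16 gives the actual algorithm):

```python
# Assumed mapping from feedback level to a proficiency adjustment
ADJUST = {
    "very familiar": +0.10, "familiar": +0.05, "average": 0.0,
    "not familiar": -0.05, "very unfamiliar": -0.10,
}

def update_proficiency(table, faq_concepts, feedback_level):
    """Shift the user's proficiency on each concept of the rated FAQ,
    clamped to [0, 1]; unseen concepts start at an assumed 0.5."""
    delta = ADJUST[feedback_level]
    for c in faq_concepts:
        table[c] = min(1.0, max(0.0, table.get(c, 0.5) + delta))
    return table
```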


window. At the same time, we add the query and FAQ-selection records of the user into the query history and selection history of his/her user model. The third task of the User Model Manager is to update user stereotypes. This happens when a sufficient number of user models in a stereotype have undergone changes. First, we need to reflect these changes in the stereotypes by re-clustering all affected user models, as shown in Fig. 17, and then re-calculate all parameters in each stereotype; an example is shown in Fig. 18.

3.2.7. Recommender
The Recommender uses the following three policies to recommend information. (1) High-hit FAQs: it recommends the first N solution FAQs according to their selection counts from all users in the same group within a time window. (2) Hot-topic FAQs: it recommends the first N solution FAQs according to their popularity, calculated as statistics on keywords appearing in the query histories of the same-group users within a time window; the hot-degree calculation is shown in Fig. 19. (3) Collaborative recommendation: it refers to the selection histories of users in the same group to provide solution recommendations. The basic

Fig. 20. Algorithm to do the collaborative recommendation.
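The collaborative policy described in the text (find a same-group user whose first n sessions agree with the target user's, then recommend that user's highest-rated FAQs from session n + 1) can be sketched as follows; representing a session as a list of (FAQ, rating) pairs is an assumption:

```python
def collaborative_recommend(target_sessions, group_histories, top_n=3):
    """If another user in the group shares the target user's session
    history so far, recommend the top-rated FAQs from that user's next
    session.  Session equality is simplified to exact equality of
    (faq_id, rating) lists."""
    n = len(target_sessions)
    for history in group_histories:
        if len(history) > n and history[:n] == target_sessions:
            next_session = sorted(history[n], key=lambda fr: -fr[1])
            return [faq for faq, _ in next_session[:top_n]]
    return []  # no matching same-group user found

# User B repeated the first two sessions of user A, so user A's
# highest-rated FAQs from session 3 are recommended to user B.
user_b = [[("q1", 5)], [("q2", 4)]]
group = [[[("q1", 5)], [("q2", 4)], [("q3", 5), ("q4", 2)]]]
recs = collaborative_recommend(user_b, group)
```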

idea is this. If user A and user B are in the same group and the first n interaction sessions of user A are the same as those of user B, then we can recommend the highest-rated FAQs in the (n + 1)th session of user A to user B; the detailed algorithm is shown in Fig. 20.

4. Demonstrations and evaluations

Fig. 17. Algorithm to re-cluster all user groups.
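One plausible shape for this re-clustering step, assuming user models and stereotypes are proficiency vectors and re-clustering is a nearest-stereotype assignment followed by re-averaging of parameters (Figs. 17 and 18 give the actual procedures):

```python
def recluster(user_models, stereotypes):
    """Assign each user model (a proficiency vector) to the nearest
    stereotype, then recompute each stereotype as the mean of its members."""
    groups = {name: [] for name in stereotypes}
    for um in user_models:
        nearest = min(
            stereotypes,
            key=lambda s: sum((a - b) ** 2 for a, b in zip(um, stereotypes[s])),
        )
        groups[nearest].append(um)
    for name, members in groups.items():
        if members:  # empty groups keep their old parameters
            stereotypes[name] = [sum(col) / len(members) for col in zip(*members)]
    return stereotypes
```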

Our Interface Agent was developed using a Web-based client-server architecture. On the client side, we use JSP (JavaServer Pages) and Java Applets for easy interaction with users, as well as for observing and recording user behavior. On the server side, we use Java and Java Servlets, under the Apache Tomcat 4.0 Web Server and MS SQL 2000 Server running on Microsoft Windows XP. In this section, we first demonstrate the developed Interface Agent, and then report how well it performs.

4.1. System demonstrations

When a new user enters the system, the user is registered by the Agent as shown in Fig. 21. At the same time, a questionnaire is produced by the Agent for evaluating the user's domain proficiency. His/her answers are then collected and scored in order to help build an initial user model for the new user.

Fig. 18. Example to update the stereotype of expert.

Fig. 19. Algorithm to calculate the hot degree.
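The hot-degree statistic, counting keyword occurrences over the group's recent query histories, might look like this sketch (the recency cutoff stands in for the paper's time window):

```python
from collections import Counter

def hot_degrees(group_query_histories, window=50):
    """Count keyword occurrences over the group's most recent queries.
    Each history is a list of queries; each query is a list of keywords."""
    recent = [q for history in group_query_histories for q in history[-window:]]
    return Counter(kw for query in recent for kw in query)

# "CPU" appears in two recent queries of this group, so it is the hottest topic.
hot = hot_degrees([[["CPU", "fan"], ["CPU"]], [["memory"]]])
```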

Fig. 21. System register interface.


Now the user can get into the main tableau of our system (Fig. 22), which consists of the following three major tab-frames: query interface, solution presentation, and logout. The query interface tab is comprised of the following four frames: user interaction interface, automatic keyword-scrolling list, FAQ recommendation list, and PC ontology tree. The user interaction interface contains both the keyword and NLP query modes shown in Fig. 9. The keyword query mode provides the lists of question types and operation types, which allow the users to express their precise intentions. The automatic keyword-scrolling list provides ranked-keyword guidance for user queries. A user can browse the PC ontology tree to learn domain knowledge. The FAQ recommendation list provides personalized information recommendations from the system in three modes: hit, hot topic, and collaboration. When the user clicks a mode, the corresponding popup window is produced by the system. The solution presentation tab is illustrated in Fig. 23. It pre-selects the solution ranking method according to the user's preference and hides part of the solutions according to his/her Show_Rate to reduce the cognitive load on the user. The user can switch the solution ranking method between similarity ranking and proficiency ranking. The user can click the question part of an FAQ (Fig. 24) to display its content or give it feedback, which contains the satisfaction degree and comprehension degree. Fig. 25 illustrates the window before system logout, which asks the user to fill in a questionnaire whose statistics help further system improvement.

Fig. 24. FAQ-selection and feedback enticing.

Fig. 25. System logout.

Table 11
Effectiveness of constructed query patterns

#Testing   #Correct   #Error   Precision rate (%)
1215       1182       33       97.28

Fig. 22. Main tableau of our system.

4.2. System evaluations

Fig. 23. Solution presentation.

The evaluation of the overall performance of our system involves a lot of manpower and is time-consuming. Here, we focus on the performance evaluation of the most important module, i.e., the Query Parser. Our philosophy is that if it can precisely parse user queries and extract both the true query intention and focus from them, then we can effectively improve the quality of the retrieved solutions. First, we have done two experiments to evaluate how user query processing performs with the support of query templates and the domain ontology. Recall that this processing employs the template-based pattern matching mechanism to understand user queries and that the templates were manually constructed from 1215 FAQs. In the first experiment, we used these same FAQs as testing queries, in order to verify whether any conflicts exist within the query templates. Table 11 illustrates the experimental results, where only 33 queries match more than one query pattern and result in confusion of query intention, called ‘‘error” in the table. These errors may be corrected by the user. The experiment shows that the effectiveness rate of the constructed query templates


Table 12
User satisfaction evaluations

K_WORD METHOD   CPU (SE/ST) (%)   MOTHERBOARD (SE/ST) (%)   MEMORY (SE/ST) (%)   AVERAGE (SE/ST) (%)
Alta Vista           63/61              77/78                    30/21                57/53
Excite               66/62              81/81                    50/24                66/56
Google               66/64              81/80                    38/21                62/55
HotBot               69/63              78/76                    62/31                70/57
InfoSeek             69/70              71/70                    49/28                63/56
Lycos                64/67              77/76                    36/20                59/54
Yahoo                67/61              77/78                    38/17                61/52
Our approach         78/69              84/78                    45/32                69/60

reaches 97.28%, which implies that the template base can be used as an effective knowledge base for natural language query processing. Our second experiment examines how well this processing understands new queries. First, we collected in total 143 new FAQs, different from the FAQs collected for constructing the query templates, from four famous motherboard manufacturers in Taiwan, including ASUS, SIS, MSI, and GIGABYTE. We then used the question parts of those FAQs as testing queries. Our experiments show that we can precisely extract the true query intentions and focuses from 112 FAQs. The remaining 31 FAQs contain three or more sentences per query, which explains why we failed to understand them. In summary, 78.3% (112/143) of the new queries can be successfully understood. Finally, Table 12 shows the comparison of user satisfaction of our system prototype against other search engines. In the table, ST, for Satisfaction of testers, represents the average of satisfaction responses from 10 ordinary users, while SE, for Satisfaction of experts, represents that from 10 experts. Basically, each search engine receives 100 queries and returns the first 100 webpages for evaluation of satisfaction by both experts and non-experts. The table shows that our system prototype supported by ontological user modeling with query templates, the last row, enjoys the highest satisfaction in all classes. From the evaluation, we conclude that unless the compared search engines are specifically tailored to this specific domain, such as HotBot and Excite, our techniques can in general retrieve more correct webpages in almost all classes, because the intention and focus of a user can be correctly extracted.

5. Related works and comparisons

The work of Lee (2000) presents a user query system consisting of an intention part and a keywords part. With the help of syntax and part-of-speech (POS) analysis, he constructs a syntax grammar from collected FAQs, and accordingly offers the capability of intention extraction from queries. He also extracts the keywords from queries through a sifting process on POS and stop words. Finally, he employs the semantic comparison technique of downward recursion on a parse tree to calculate the similarity degree of the intention parts between the user query and FAQs, and then uses the vector space model (Salton et al., 1975) to calculate the vector similarity degree of the keyword parts between the user query and FAQs to find the best-matched FAQs. The work of OuYang (2000) classifies pre-collected FAQs according to the business types of ChungHwa Telecom. He employs the technique of TFIDF (Term Frequency and Inverse Document Frequency) to calculate the weights of individual keywords and intention words, and accordingly selects representative keywords and intention words to work as the index to each individual class. Given a user query, he first determines the class of the user query according to the keywords and intention words, and then calculates the similarity degree between the user query and the FAQs in the related classes. The natural language processing technique was not used in the work of Sneiders et al. (1999) for analyzing user queries. It was, instead, applied to analyze the FAQs stored in the database long before any user queries are submitted, where each FAQ is associated with required, optional, irrelevant, or forbidden keywords to help subsequent prioritized keyword matching. In this way, the work of FAQ retrieval can be reduced to keyword matching without inference. Razmerita, Angehrn, and Maedche (2003) present a generic ontology-based user modeling architecture (OntobUM), applied in the context of a Knowledge Management System (KMS). The proposed user modeling system relies on a user ontology, using Semantic Web technologies, based on the IMS LIP specifications, and is integrated in an ontology-based KMS called Ontologging. Degemmis, Licchelli, Lops, and Semeraro (2004) present the Profile Extractor, a personalization component based on machine learning techniques, which allows for the discovery of preferences and interests of users who access a website. Galassi, Giordana, Saitta, and Botta (2005) also present a method for automatically constructing a sophisticated user/process profile from traces of user/process behavior, which is encoded by means of a Hierarchical Hidden Markov Model (HHMM). Finally, Hsu and Ho (1999) propose an intelligent interface agent to acquire patient data with medicine-related common sense reasoning. In summary, the work of OuYang determines the user query intention according to the keywords and intention words appearing in the query, while the work of Sneiders uses similarity degrees on both intention words and keywords for solution searching and selection.
Both approaches only consider the comparison between words and skip the problem of word ambiguity; e.g., two sentences with the same intention words may not have the same intention. The work of Lee uses the analysis of syntax and POS to extract query intention, which is a hard job with Chinese queries, because resolving the ambiguity of the explicit or implicit meanings of Chinese words, especially in query analysis on long sentences or sentences with complex syntax, is not at all a trivial task. In this paper, we integrated several interesting techniques, including user modeling, domain ontology, and template-based linguistic processing, to effectively tackle the above annoying problems, in the spirit of Razmerita et al. (2003), which likewise combines ontologies with user modeling, and especially of Paraiso and Barthes (2006), which highlights the role of ontologies for semantic interpretation. In addition, both Degemmis et al. (2004) and Galassi et al. (2005) propose different learning techniques for processing usage patterns and user profiles. The automatic processing feature, supported by HHMM, unsupervised learning, and common sense reasoning techniques, respectively, provides another level of automation in the interaction mechanism and deserves more attention.

6. Discussions and future work

We have developed an Interface Agent to work as an assistant between users and FAQ systems, which differs in system architecture and implementation from our previous work (Yang et al., 2004). It is also used to retrieve FAQs in the PC domain. We integrated several interesting techniques, including domain ontology, user modeling, and template-based linguistic processing, to effectively tackle the problems associated with traditional FAQ retrieval systems. Specifically, we have solved the following issues. Firstly, our ontological interface agent can truly learn a user's specialty in order to build a proper user model for him/her. Secondly, the domain ontology can efficiently and effectively help in

Please cite this article in press as: Yang, S. -Y. , Developing of an ontological interface agent with template-based linguistic ..., Expert Systems with Applications (2008), doi:10.1016/j.eswa.2008.03.011


establishing user models, analyzing user queries, and assisting and guiding interface usage. Finally, the intention and focus of a user can be correctly extracted by the agent. In short, our work features a template-based linguistic processing technique for developing ontological interface agents; a natural language query mode, along with an improved keyword-based query mode; and assistance and guidance for human–machine interaction. Our preliminary experimentation demonstrates that the user intention and focus of up to eighty percent of user queries can be correctly understood by the system, which accordingly provides query solutions with higher user satisfaction. Most of our current experiments concern the performance of the Query Parser. We are unable to run experiments on, or comparisons of, how good the Interface Agent is at capturing all the interaction information and intentions of a user. Our difficulties are summarized below: (1) To our knowledge, no current interface system adopts an approach similar to ours, in the sense that none relies on ontology as heavily as our system does to support user interaction. It is thus rather hard for us to make a fair and convincing comparison. (2) Our ontology construction is based on a set of pre-collected webpages in a specific domain; it is hard to evaluate how critical this pre-collection process is for different domains. We plan to employ the technique of automatic ontology evolution, for example combined with data mining technology for discovering useful information and generating knowledge that supports ontology construction (Wang, Lu, & Zhang, 2007), to help study the robustness of our ontology.
Finally, in future work we will employ machine learning and data mining techniques to automate the construction of the template base; as for overall system evaluation, we also plan to apply the concept of usability evaluation from human factors engineering to evaluate the performance of the agent.

Acknowledgements

The author would like to thank Yai-Hui Chang and Ying-Hao Chiu for their assistance in system implementation. This work was supported by the National Science Council, ROC, under Grants NSC-89-2213-E-011-059, NSC-89-2218-E-011-014, and NSC-95-2221-E-129-019.

References

Chandrasekaran, B., Josephson, J. R., & Benjamins, V. R. (1999). What are ontologies, and why do we need them? IEEE Intelligent Systems, 14(1), 20–26.
Chiu, Y. H. (2003). An interface agent with ontology-supported user models. Master Thesis, Department of Electronic Engineering, National Taiwan University of Science and Technology, Taiwan, ROC.
Coyle, M., & Smyth, B. (2005). Enhancing web search result lists using interaction histories. In Proceedings of the 27th European conference on IR research on advances in information retrieval (pp. 543–545).
Degemmis, M., Licchelli, O., Lops, P., & Semeraro, G. (2004). Learning usage patterns for personalized information access in e-commerce. In Proceedings of the 8th ERCIM workshop on user interfaces for user-centered interaction paradigms for universal access in the information society (pp. 133–148).
Galassi, U., Giordana, A., Saitta, L., & Botta, M. (2005). Learning profile based on hierarchical hidden Markov model. In Proceedings of the 15th international symposium on foundations of intelligent systems (pp. 47–55).
Hovy, E., Hermjakob, U., & Ravichandran, D. (2002). A question/answer typology with surface text patterns. In Proceedings of the DARPA human language technology conference (pp. 247–250).
Hsu, C. C., & Ho, C. S. (1999). Acquiring patient data by an intelligent interface agent with medicine-related common sense reasoning. Expert Systems with Applications: An International Journal, 17(4), 257–274.
Lee, C. L. (2000). Intention extraction and semantic matching for internet FAQ retrieval. Master Thesis, Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan, ROC.
Noy, N. F., & McGuinness, D. L. (2001). Ontology development 101: A guide to creating your first ontology. Stanford Knowledge Systems Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics Technical Report SMI-2001-0880.
OuYang, Y. L. (2000). Study and implementation of a dialogue-based query system for telecommunication FAQ services. Master Thesis, Department of Computer and Information Science, National Chiao Tung University, Taiwan, ROC.
Paraiso, E. C., & Barthes, J. P. A. (2006). An intelligent speech interface for personal assistants in R&D projects. Expert Systems with Applications: An International Journal, 31(4), 673–683.
Razmerita, L., Angehrn, A., & Maedche, A. (2003). Ontology-based user modeling for knowledge management systems. In Proceedings of the 9th international conference on user modeling (pp. 213–217).
Rich, E. (1979). User modeling via stereotypes. Cognitive Science, 3, 329–354.
Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. New York, USA: McGraw-Hill Book Company.
Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620.
Sneiders, E. (1999). Automated FAQ answering: Continued experience with shallow language understanding. In AAAI Fall symposium on question answering systems (pp. 97–107). Technical Report FS-99-02. North Falmouth, Massachusetts, USA: AAAI Press.
Soubbotin, M. M., & Soubbotin, S. M. (2001). Patterns of potential answer expressions as clues to the right answer. In Proceedings of the TREC-10 conference (pp. 293–302).
Tsai, C. H. (2000). MMSEG: A word identification system for Mandarin Chinese text based on two variants of the maximum matching algorithm.
Wang, C., Lu, J., & Zhang, G. Q. (2007). Mining key information of web pages: A method and its applications. Expert Systems with Applications: An International Journal, 33(2), 425–433.
Yang, S. Y. (2006). An ontology-supported and query template-based user modeling technique for interface agents. In Symposium on application and development of management information system (pp. 168–173).
Yang, S. Y. (2006). How does ontology help information management processing. WSEAS Transactions on Computers, 5(9), 1843–1850.
Yang, S. Y. (2006). FAQ-master: A new intelligent web information aggregation system. In Proceedings of international academic conference 2006 special session on artificial intelligence theory and application (pp. 2–12).
Yang, S. Y. (2007). An ontological multi-agent system for web FAQ query. In Proceedings of the international conference on machine learning and cybernetics (pp. 2964–2969).
Yang, S. Y. (2007). An ontological proxy agent for web information processing. In Proceedings of the 10th international conference on computer science and informatics (pp. 671–677).
Yang, S. Y., Chuang, F. C., & Ho, C. S. (2007). Ontology-supported FAQ processing and ranking techniques. Journal of Intelligent Information Systems, 28(3), 233–251.
Yang, S. Y., & Ho, C. S. (1999). Ontology-supported user models for interface agents. In Proceedings of the 4th conference on artificial intelligence and applications (pp. 248–253).
Yang, S. Y., Chiu, Y. H., & Ho, C. S. (2004). Ontology-supported and query template-based user modeling techniques for interface agents. In The 12th national conference on fuzzy theory and its applications (pp. 181–186).


WSEAS TRANSACTIONS ON INFORMATION SCIENCE & APPLICATIONS, Issue 11, Vol. 4, November 2007, ISSN: 1709-0832

An Ontological Interface Agent for FAQ Query Processing

SHENG-YUAN YANG
Dept. of Computer and Communication Engineering, St. John's University
499, Sec. 4, TamKing Rd., Tamsui, Taipei County 251, TAIWAN
[email protected] http://mail.sju.edu.tw/~ysy

Abstract: - In this paper, we describe an Interface Agent which works as an assistant between users and FAQ systems to retrieve FAQs in the Personal Computer domain. It integrates several techniques, including domain ontology, user modeling, and template-based linguistic processing, to effectively tackle the problems associated with traditional FAQ retrieval systems. Specifically, we address how ontology helps interface agents provide better FAQ services, and describe the related algorithms in detail. Our work features an ontology-supported, template-based user modeling technique for developing interface agents. Our preliminary experimentation demonstrates that the user intention and focus of up to eighty percent of user queries can be correctly understood by the system, which accordingly provides query solutions with higher user satisfaction.

Key-words: - Ontology, Template-based Processing, User Modeling, Interface Agents.

1 Introduction

In this information-exploding era, the Internet affects people's life styles in terms of how people acquire, present, and exchange information. In particular, the use of the World Wide Web has led to a large increase in the number of people who access FAQ knowledge bases to find answers to their questions [10]. As the techniques of Information Retrieval [6,7] matured, a variety of information retrieval systems were developed, e.g., search engines, Web portals, etc., to help search the Web. How to search is no longer a problem. The problem now comes from the results of these information retrieval systems, which contain so much information that they overwhelm the users. Therefore, how to improve traditional information retrieval systems to provide search results that better meet user requirements, so as to reduce the user's cognitive load, is an important issue in current research [2].


Fig. 1 System architecture for FAQ-master

We have proposed FAQ-master as an intelligent Web information aggregation system, which provides intelligent information retrieval, filtering, and aggregation services [12,17]. Fig. 1 illustrates the system architecture of FAQ-master. The Interface Agent captures user intention through an adaptive human-machine interaction interface with the help of ontology-directed and template-based user models [18]. The Search Agent performs in-time, user-oriented, and domain-related Web information retrieval with the help of ontology-supported website models [19]. The Answerer Agent works as a back-end process to perform ontology-directed information aggregation on the webpages collected by the Search Agent [14,20]. Finally, the Proxy Agent works as an ontology-enhanced intelligent proxy mechanism to share most of the query load with the Answerer Agent [15,16,20]. This paper discusses the Interface Agent, focusing on how it captures the user's true intention and accordingly provides high-quality FAQ answers. The agent features ontology-based representation of domain knowledge, a flexible interaction interface, and personalized information filtering and display. Our preliminary experimentation demonstrates that the intention and focus of up to eighty percent of user queries can be correctly understood by the system, which accordingly provides query solutions with higher user satisfaction. The Personal Computer (PC) domain is chosen as the target application of our Interface Agent and will be used for explanation in the remaining sections.

2 Fundamental Techniques
2.1 Domain Ontology and Services

The concept of ontology in artificial intelligence refers to knowledge representation for domain-specific contents [1]. It has been advocated as an important tool to support knowledge sharing and reuse in developing intelligent systems. Although the development of an ontology for a specific domain is not yet an engineering process, we have outlined a procedure for it in [11], based on how the process was conducted in existing systems. By following the procedure, we developed an ontology for the PC domain in Chinese using Protégé 2000 [4] (rendered in English here for easy explanation) as the fundamental background knowledge for the system. Fig. 2 shows part of the ontology taxonomy. The taxonomy represents relevant PC concepts as classes and their parent-child relationships as isa links, which allow inheritance of features from parent classes to child classes. Fig. 3 exemplifies the detailed ontology of the concept of CPU. In the figure, the root node uses various fields to define the semantics of the CPU class, each field representing an attribute of "CPU", e.g., interface, provider, synonym, etc. The nodes at the lower level represent various CPU instances, which capture real-world data. The arrow lines with the term "io" denote instance-of relationships. The complete PC ontology can be referenced from the Protégé Ontology Library at the Stanford website (http://protege.stanford.edu/download/download.html).

Table 1 Examples of query patterns (question types 如何 (How), 什麼 (What), 何時 (When), 哪裡 (Where), 為什麼 (Why), and 是否 (If), each paired with operation words such as 安裝 (Setup), 是 (Is), 支援 (Support), 下載 (Download), and 列印 (Print))

Fig. 2 Part of PC ontology taxonomy (classes such as Hardware, Interface Card, Sound Card, Display Card, SCSI Card, Network Card, Memory, Main Memory, ROM, Storage Media, Optical, CD, DVD, ZIP, CDR/W, CDR, Power Equipment, Power Supply, UPS, and Case, connected by isa links)

We also developed a Problem ontology to deal with query questions. Fig. 4 illustrates part of the Problem ontology, which contains the query type and the operation type; together they imply the semantics of a question. Finally, we use Protégé's APIs to develop a set of ontology services, which provide primitive functions to support the application of the ontologies. The ontology services currently available include transforming query terms into canonical ontology terms, finding definitions of specific terms in the ontology, finding relationships among terms, finding compatible and/or conflicting terms against a specific term, etc.
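The ontology services just described can be sketched with a toy in-memory ontology. This is a minimal illustration only: the class name, data layout, and method names are our assumptions, not the paper's Protégé-based implementation.

```python
# Minimal sketch of the ontology services described above, using a toy
# in-memory ontology; class and method names are illustrative only.
class Ontology:
    def __init__(self):
        # canonical term -> set of synonyms (e.g., "CPU" has synonym
        # "Central Processing Unit", as in Fig. 3)
        self.synonyms = {"CPU": {"Central Processing Unit", "processor"}}
        # child -> parent ("isa" links of the taxonomy in Fig. 2)
        self.isa = {"Sound Card": "Interface Card", "Interface Card": "Hardware"}
        # term -> field definitions (illustrative values)
        self.fields = {"CPU": {"Interface": "CPU Slot", "L1 Cache": "Volume"}}

    def canonical(self, term):
        """Transform a query term into its canonical ontology term."""
        for canon, syns in self.synonyms.items():
            if term == canon or term in syns:
                return canon
        return term

    def definition(self, term):
        """Find the field definitions of a specific term."""
        return self.fields.get(self.canonical(term), {})

    def related(self, term):
        """Find the ancestors of a term through the isa hierarchy."""
        chain, t = [], self.canonical(term)
        while t in self.isa:
            t = self.isa[t]
            chain.append(t)
        return chain

onto = Ontology()
print(onto.canonical("Central Processing Unit"))  # -> CPU
print(onto.related("Sound Card"))                 # -> ['Interface Card', 'Hardware']
```

A real implementation would back these lookups with the Protégé knowledge base rather than hard-coded dictionaries.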

Fig. 5 Intention type hierarchy (question types 如何 (HOW_SOLVE), 什麼 (WHAT), 何時 (WHEN), 哪裡 (WHERE), 為什麼 (WHY_EXPLAIN), and 是否 (IF) at the top level, with intention types such as HOW_SET, HOW_FIX, WHAT_SUPPORT, WHAT_SETUP, WHEN_SUPPORT, WHERE_DOWNLOAD, WHERE_OBTAIN, WHY_USE, WHY_OFF, ANA_CAN_SUPPLY, ANA_CAN_SET, and ANA_CAN_SUPPORT below)

Table 2 Query template for the ANA_CAN_SUPPORT intention type (Template_Number: 304; #Sentence: 3; Intention_Word: 是否 (If), 支援 (Support); Intention_Type: ANA_CAN_SUPPORT; Question_Type: 是否 (If); Operation_Type: 支援 (Support); Query_Patterns: [S3] [S2] S1)

Fig. 3 Ontology of the concept of CPU (the CPU class defines fields such as Synonym, D-Frequency, Interface, L1 Cache, Volume, Spec., and Abbr.; instances linked by io include XEON, DURON 1.2G, THUNDERBIRD 1.33G, CELERON 1.0G, and several PENTIUM 4 models)

Fig. 4 Part of problem ontology taxonomy (Query Type with instances such as How, What, Why, Where, and If; Operation Type with instances such as Use, Setup, Close, Open, Adjust, Support, and Provide, connected by isa and io links)

2.2 Ontological Query Templates

To build the query templates, we collected in total 1215 FAQs from the FAQ websites of six well-known motherboard manufacturers in Taiwan and used them as the reference material for query template construction. Currently, we only handle user queries with one intention word and at most three sentences. These FAQs were analyzed and categorized into six types of questions. For each type of question, we further identified several intention types according to its operations. Finally, we defined a query pattern for each intention type. Table 1 illustrates the defined query patterns for the intention types. With all the information for constructing a query template ready, we can formally define a query template [3,8]. Table 2 illustrates an example query template for the ANA_CAN_SUPPORT intention type. Note that we collect similar query patterns in the "Query_Patterns" field, which are used in the detailed analysis of a given query. According to the generalization relationships among intention types, we can form a hierarchy of intention types to organize all FAQs. Currently, the hierarchy contains two levels, as shown in Fig. 5. The system can employ the intention type hierarchy to reduce the search scope during the retrieval of FAQs after the intention of a user query is identified. Example queries for each intention type include:
- ANA_CAN_SUPPORT: GA-7VRX 這塊主機板是否支援 KINGMAX DDR-400? (Could the GA-7VRX motherboard support the KINGMAX DDR-400 memory type?)
- HOW_SETUP: 如何在 Windows 98SE 下,安裝 8RDA 的音效驅動程式? (How to set up the 8RDA sound driver on a Windows 98SE platform?)
- WHAT_IS: AUX power connector 是什麼? (What is an AUX power connector?)
- WHEN_SUPPORT: P4T 何時才能支援 32-bit 512 MB RDRAM 記憶體規格? (When can the P4T support the 32-bit 512 MB RDRAM memory specification?)
- WHERE_DOWNLOAD: CUA 的 Driver CD 遺失,請問哪裡可以下載音效驅動程式? (Where can I download the sound driver of CUA whose Driver CD was lost?)
- WHY_PRINT: 為什麼在 Win ME 底下,從休眠狀態中回復後,印表機無法列印。 (Why can I not print after coming back from dormancy on a Win ME platform?)

3 System Architecture
3.1 User Modeling

Fig. 6 Our user model (interaction preference, solution presentation, domain proficiency, terminology table, and interaction history with explicit and implicit feedback)

A user model contains interaction preference, solution presentation, domain proficiency, a terminology table, query history, selection history, and user feedback, as shown in Fig. 6. The interaction preference is responsible for recording the user's preferred interface, e.g., favorite query mode, favorite recommendation mode, etc. When the user logs on to the system, the system can select a proper user interface according to this preference. We provide two query modes, through either keywords or natural language input. We provide three recommendation modes, according to hit rates, hot topics, or collaborative learning. We record the user's recent preferences in a time window, and accordingly determine the next interaction style. The solution presentation is responsible for recording the user's solution ranking preferences. We provide two types of ranking: according to the degree of similarity between the proposed solutions and the user query, or according to the user's proficiency with the solutions. In addition, we use a Show_Rate parameter to control how many solution items to display each time, in order to reduce the information overload problem. The domain proficiency factor describes how familiar the user is with the domain. By associating a proficiency degree with each ontology concept, we can construct a table, which contains a set of (concept, proficiency) pairs, as his domain proficiency. Thus, when deciding on solution presentation, we can calculate the user's proficiency degree on solutions using the table, and accordingly show only the part of the solutions he is most familiar with, hiding the rest for advanced requests. To solve the problem of different users using different terminologies, we include a terminology table to record these terminology differences. We can use the table to replace the terms used in the proposed solutions with the user's favorite terms during solution presentation to help him better comprehend the solutions. Finally, we record the user's query history, FAQ selection history, and corresponding user feedback in each query session in the interaction history, in order to support collaborative recommendation. The user feedback is a complicated factor. We remember both explicit user feedback in the selection history and implicit user feedback, which includes query time, time of FAQ clicks, sequence of FAQ clicks, sequence of clicked hyperlinks, etc. In order to quickly build an initial user model for a new user, we pre-defined five stereotypes [5], namely, expert, senior, junior, novice, and amateur [11], to represent the characteristics of different user groups. This approach is based on the idea that users in the same group tend to exhibit the same behavior and require the same information. Fig. 7 illustrates an example user stereotype.
When a new user enters the system, he is asked to complete a questionnaire, which the system uses to determine his domain proficiency and accordingly select a user stereotype to generate an initial user model for him. However, the initial user model constructed from the stereotype may be too generic or imprecise. It will be refined to reflect the specific user's real intent after the system has gained experience with his query history, FAQ-selection history and feedback, and implicit feedback [2].

Fig. 7 Example of expert stereotype (interaction preference counters for query and recommendation modes, Show_Rate values for the similarity and proficiency modes, a domain proficiency table of (concept, proficiency) pairs, and time-window sizes for the interaction history)
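The stereotype-based initialization above can be sketched as follows. The five stereotype names come from the paper; the numeric defaults, the questionnaire thresholds, and the function name are illustrative assumptions, not the paper's actual values.

```python
import copy

# Sketch of stereotype-based user model initialization (Sec. 3.1). The five
# stereotype names are from the paper; all numbers here are assumptions.
STEREOTYPES = {
    "expert":  {"proficiency": 0.9, "query_mode": "NLP",     "show_rate": 0.9},
    "senior":  {"proficiency": 0.7, "query_mode": "NLP",     "show_rate": 0.8},
    "junior":  {"proficiency": 0.5, "query_mode": "NLP",     "show_rate": 0.6},
    "novice":  {"proficiency": 0.3, "query_mode": "keyword", "show_rate": 0.5},
    "amateur": {"proficiency": 0.1, "query_mode": "keyword", "show_rate": 0.3},
}

def initial_user_model(questionnaire_score):
    """Map a questionnaire score in [0, 1] to a stereotype (hypothetical
    thresholds) and clone it as the new user's initial model, to be
    refined later from query history and feedback."""
    for name, threshold in [("expert", 0.8), ("senior", 0.6),
                            ("junior", 0.4), ("novice", 0.2)]:
        if questionnaire_score >= threshold:
            break
    else:
        name = "amateur"
    model = copy.deepcopy(STEREOTYPES[name])
    model["stereotype"] = name
    model["interaction_history"] = []   # filled in as the user interacts
    return model

print(initial_user_model(0.85)["stereotype"])  # -> expert
print(initial_user_model(0.1)["stereotype"])   # -> amateur
```

Cloning the stereotype (rather than referencing it) matters: each user's model drifts away from its stereotype as feedback accumulates, while the stereotype itself is updated separately by the User Model Manager.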

3.2 System Overview

Fig. 8 shows the architecture of the Interface Agent. The Interaction Agent provides a personalized interaction, assistance, and recommendation interface for the user according to his user model, records interaction information and related feedback in the user model, and helps the User Model Manager and Proxy Agent to update the user model. The Query Parser processes user queries by first segmenting words, removing conflicting words, and standardizing terms, followed by recording the user's terminologies in the terminology table of the user model. It finally applies the template matching technique to select the best-matched query templates, and accordingly transforms the query into an internal query for the Proxy Agent, which searches for solutions and collects them into a list of FAQs, each containing a corresponding URL. The Web Page Processor pre-downloads FAQ-relevant webpages and performs some pre-processing tasks, including labeling keywords for subsequent processing. The Scorer calculates the user's proficiency degree for each FAQ in the FAQ list according to the domain proficiency table in his user model. The Personalizer then produces personalized query solutions according to the terminology table. The User Model Manager is responsible for quickly building an initial user model for a new user using the technique of user stereotyping, as well as updating the user models and stereotypes to dynamically reflect changes in user behavior. The Recommender is responsible for recommending information to the user based on hit counts, hot topics, or group interests when a similar interaction history is detected.

Answerer Agent

User Feedback

Proxy Agent

Solution & Web Page Links

Internal Query

USER

Action & Query

Interaction Agent

Query Parser

Recommender

Web Page Processor

Internet

User Model Manager

Personalizer

Scorer

Data Flow Support Folw

Homophone Debug Base

Template Base

User Model

Ontology

Interface Agent

Fig. 8 Interface agent architecture

3.2.1 Interaction Agent

The Interaction Agent consists of three components: the Adapter, the Observer, and the Assistant. First, the Adapter constructs the best interaction interface according to the user's favorite query and recommendation modes. It is also responsible for presenting to the user the list of FAQ solutions (from the Personalizer) or recommendation information (from the Recommender). During solution presentation, it arranges the solutions in terms of the user's preferred style (query similarity or solution proficiency) and displays the solutions according to the Show_Rate. Second, the Observer passes the user query to the Query Parser, and simultaneously collects the interaction information and related feedback from the user. The interaction information contains the user's preferred query mode, recommendation mode, solution presentation mode, and FAQ clicking behavior, while the related feedback contains the user's satisfaction degree and comprehension degree for each FAQ solution. The User Model Manager needs both the interaction information and the related feedback to properly update user models and stereotypes. The satisfaction degree in the related feedback can also be passed to the Proxy Agent for tuning the solution search mechanism [20]. Finally, the Assistant provides proper assistance and guidance to help the user query process. First, the ontology concepts are structured and presented as a tree so that users who are not familiar with the domain can check the tree and learn proper terms to enter their queries. We also rank all ontology concepts by their probabilities and display them in a keyword list. When the user enters a query in the input area, the Assistant automatically "scrolls" the content of the keyword list to the terms related to the input keywords. Fig. 9 illustrates an example of this automatic keyword scrolling mechanism. If the displayed terms of the list contain a concept that the user wants to enter, he can double-click the term into the input area, e.g., "華碩" (ASUS) at step 2 of Fig. 9. In addition to the keyword-oriented query mode, the Assistant also provides lists of question types and operation types to help question type-oriented or operation type-oriented search. The user can use one, two, or all three of these mechanisms to help form his query in order to convey his intention to the system.

華邦

華碩

主機板

華碩 螢幕

螢幕 製程

二進制 交換器

Input Area

什麼華|

什麼華碩|

什麼華碩主|

Step

(1)

(2)

(3)

Fig. 9 Examples of automatic keyword scrolling mechanism

3.2.2 Query Parser

Fig. 10 Flow chart of the Query Parser (1. Segmentation → 2. Query Pruning → 3. Query Standardization → 4. Pattern Match → 5. User Confirmation → 6. Query Formulation, supported by the Ontology, Homophone Debug Base, Intention Word Base, Template Base, and User Model; the keyword mode jumps directly from step 3 to step 6, while the NLP mode goes through pattern matching)

The Query Parser pre-processes the user query by performing Chinese word segmentation, correction of word segmentation, fault correction on homophonous or multiple words, and term standardization. It then employs template-based pattern matching to analyze the user query and extract the user intention and focus. Finally, it transforms the user query into the internal query format and passes the query to the Proxy Agent for retrieving proper solutions [20]. Fig. 10 shows the flow chart of the Query Parser. A detailed explanation follows. Given a user query in Chinese, we segment the query using MMSEG [9]. The results of segmentation alone were not good, for the predefined MMSEG word corpus contains insufficient terms of the PC domain. For example, it does not contain the keywords "華碩" or "AGP4X", and returns wrong word segmentations like "華", "碩", "AGP", and "4X". The step of query pruning can easily fix this by using the ontology as a second word corpus to bring those mis-segmented words back. It also performs fault correction on homophonous or multiple words using the ontology and the homophone debug base [2]. The step of query standardization is responsible for replacing the terms used in the user query with the canonical terms in the ontology and the intention word base. The original terms and the corresponding canonical terms are then stored in the terminology table for solution presentation personalization. Finally, we label the recognized keywords with the symbol "K" and the intention words with the symbol "I". The rest are regarded as stop words and removed from the query. Now, if the user is using the keyword mode, we jump directly to the step of query formulation. Otherwise, we use template-based pattern matching to analyze the natural language input. The step of pattern match is responsible for identifying the semantic pattern associated with the user query. Using the pre-constructed query templates in the template base, we compare the user query with the query templates and select the best-matched one to identify the user intention and focus. Fig. 11 shows the algorithm for quickly selecting possibly matched templates, Fig. 12 describes the algorithm which finds all patterns matched with the user query, and Fig. 13 removes those matched patterns that are generalizations of other matched patterns.

Template Selection:
  Q : user query.
  Q.Intention_Word = {I1, I2, ..., IN}, the intention words in Q.
  Q.Sentence : number of sentences in Q.
  Template Base = {T1, T2, ..., TM}, M : number of templates.
  For each template Tj in Template Base {
    If Tj conforms to the following rules, then select Tj into C:
      1. Tj.Sentence = Q.Sentence.
      2. Tj.Intention_Word ⊆ Q.Intention_Word.
  }
  return C : candidate templates.

Fig. 11 Query template selection algorithm

Pattern Match:
  For each template Tj in C, the candidate templates {
    Tj.Pattern = {P1, P2, ...}, Pk : pattern k in template j.
    For each Pk in Tj.Pattern {
      If Pk matches Q, the user query, then
        Pk.Intention_Word = Tj.Intention_Word,
        Pk.Intention_Type = Tj.Intention_Type,
        Pk.Question_Type = Tj.Question_Type,
        Pk.Operation = Tj.Operation,
        Pk.Focus = Tj.Focus,
        and put Pk in M and break this inner loop.
    }
  }
  return M : patterns matching Q.

Fig. 12 Pattern match algorithm

Pattern Removal:
  For each pattern Pk in M, the matched patterns {
    If Pk conforms to the following rule, then remove Pk from M:
      { ∃Pi ∈ M, Pk.Intention_Type = Pi.Intention_Type and Pk.Intention_Word ⊂ Pi.Intention_Word }
  }
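The three algorithms of Figs. 11–13 can be combined into a short runnable sketch. The template/pattern representation below (dicts with regex patterns) is our assumption for illustration; the paper's actual pattern syntax is not reproduced.

```python
# Runnable sketch of the algorithms of Figs. 11-13. Templates and patterns
# are represented as dicts and regexes for illustration only.
import re

def select_templates(query, template_base):
    """Fig. 11: keep templates whose sentence count equals the query's and
    whose intention words are a subset of the query's intention words."""
    return [t for t in template_base
            if t["sentence"] == query["sentence"]
            and set(t["intention_words"]) <= set(query["intention_words"])]

def match_patterns(query, candidates):
    """Fig. 12: collect, per candidate template, the first pattern that
    matches the query, annotated with the template's fields."""
    matched = []
    for t in candidates:
        for pattern in t["patterns"]:
            if re.search(pattern, query["text"]):
                matched.append({"pattern": pattern,
                                "intention_words": t["intention_words"],
                                "intention_type": t["intention_type"]})
                break
    return matched

def remove_generalizations(matched):
    """Fig. 13: drop a matched pattern if another match of the same
    intention type has a strict superset of its intention words."""
    return [p for p in matched
            if not any(q["intention_type"] == p["intention_type"]
                       and set(p["intention_words"]) < set(q["intention_words"])
                       for q in matched)]

# Illustrative template base and the example query of Sec. 3.2.2:
templates = [
    {"sentence": 1, "intention_words": ["是否", "支援"],
     "intention_type": "ANA_CAN_SUPPORT", "patterns": [r"是否.*支援"]},
    {"sentence": 1, "intention_words": ["是否"],
     "intention_type": "ANA_CAN_SUPPORT", "patterns": [r"是否"]},
]
query = {"text": "華碩 K7V 主機板是否支援 1GHz 以上的中央處理器呢?",
         "sentence": 1, "intention_words": ["是否", "支援"]}

matches = remove_generalizations(match_patterns(query, select_templates(query, templates)))
print(matches[0]["intention_type"])  # -> ANA_CAN_SUPPORT
```

Both templates survive selection and matching, but the removal step discards the match whose intention words {是否} are a strict subset of {是否, 支援}, so only the more specific ANA_CAN_SUPPORT match remains.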

Fig. 13 Query pattern removal algorithm

Take the following query as an example: "華碩 K7V 主機板是否支援 1GHz 以上的中央處理器呢?", which means "Could the Asus K7V motherboard support a CPU over 1GHz?" Table 2 illustrates a query template example, which contains two patterns, namely, and . We find that the second pattern matches the user query and can be selected to transform the query into an internal form by query formulation (step 6), as shown in Table 3. Note that there may be more than two patterns that become candidates for


a given query. In this case, the Query Parser will prompt the user to confirm his intent (step 5). If the user says "No", meaning the pattern matching result is not the true intention of the user, he is allowed to modify the matched result or change to the keyword mode to place the query.

Table 3 Internal form (fields: User_Level, Query_Mode, Intention_Type, Question_Type, Operation, Keyword, Focus)

3.2.3

… … … … … … …

Web Page Processor

The Web Page Processor receives from the Proxy Agent a list of retrieved solutions, which contains one or more FAQs matched with the user query, each represented as in Table 4, and retrieves and caches the solution webpages according to their FAQ_URLs. It then pre-processes those webpages for the subsequent customization process, which includes URL transformation, keyword standardization, and keyword marking. URL transformation changes all hyperlinks to point toward the cache server. Keyword standardization transforms all terms in the webpage content into ontology vocabularies. Keyword marking renders the keywords appearing in the webpages in boldface, which facilitates subsequent keyword processing and improves webpage readability.

Table 4 Format of a retrieved FAQ
Field           Description
FAQ_No.         FAQ's identification
FAQ_Question    Question part of the FAQ
FAQ_Answer      Answer part of the FAQ
FAQ_Similarity  Degree of similarity between the FAQ and the user query
FAQ_URL         Source or related URL of the FAQ

3.2.4 Scorer

Each FAQ is a short document, so the concepts involved in an FAQ are in general more focused; its topic is much clearer and more specialized. The question part of an FAQ is even more pointed about which concepts are involved. Knowing this property, we can use the keywords appearing in the question part of an FAQ to represent its topic. Basically, we use the domain proficiency table to calculate a proficiency degree for each FAQ from the proficient concepts appearing in its question part, as detailed in Fig. 14.

A = {F1, F2, ..., Fn} ⊂ C, the FAQ collection : the FAQs found by the system that best match the user query.
Fi.Q : the question part of FAQ Fi.
For each Fi ∈ A, calculate Fi's proficiency score for user k:

    Score(Fi) = (1 / Number of concepts appearing in Fi.Q) × Σ_j Appearance(Concept_j) × Proficiency(Concept_j),

    where Concept_j ∈ Ontology,
    Appearance(Concept_j) = 1 if Concept_j appears in Fi.Q, and 0 otherwise,
    Proficiency(Concept_j) : the degree of user k's proficiency in Concept_j.

Fig. 14 Proficiency degree calculation algorithm
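A minimal Python sketch of this scoring, under the simplifying assumptions that concepts are plain strings and that concept extraction from Fi.Q has already been done; names here are illustrative.

```python
# Sketch of the Fig. 14 proficiency scoring: average the user's proficiency
# over the ontology concepts appearing in the question part of an FAQ.
# Concepts missing from the user's table are treated as proficiency 0.

def proficiency_score(question_concepts, user_proficiency):
    """question_concepts: ontology concepts (strings) appearing in Fi.Q
    user_proficiency:  dict mapping concept -> user's proficiency in [0, 1]"""
    if not question_concepts:
        return 0.0
    total = sum(user_proficiency.get(c, 0.0) for c in question_concepts)
    return total / len(question_concepts)

score = proficiency_score(["CPU", "Motherboard"], {"CPU": 0.9, "Motherboard": 0.5})
```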

3.2.5 Personalizer

The Personalizer replaces the terms used in the solution FAQs with the terms in the user's terminology table, which is collected by the Query Parser, to improve solution readability.
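This term substitution can be sketched as follows; the terminology table shown is hypothetical, and longer terms are replaced first so a short term never clobbers part of a longer one.

```python
# Sketch of the Personalizer's term substitution: replace standard ontology
# terms in a solution text with the user's own preferred terms.
# The terminology table here is hypothetical.

def personalize(text, terminology_table):
    """terminology_table: dict mapping a standard term -> the user's term."""
    # Replace longer terms first to avoid partial-overlap clobbering.
    for standard in sorted(terminology_table, key=len, reverse=True):
        text = text.replace(standard, terminology_table[standard])
    return text

print(personalize("中央處理器 frequency", {"中央處理器": "CPU"}))  # CPU frequency
```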

3.2.6 User Model Manager

Issue 11, Vol. 4, November 2007
ISSN: 1709-0832

The first task of the User Model Manager is to create an initial user model for a new user. To do this, we pre-defined several questions for each concept in the domain ontology, for example, “Do you know that a CPU contains a floating-point co-processor?”, “Do you know the convention 1GB = 1000MB in specifying the capacity of a hard disk?”, etc. The difficulty degrees of the questions are proportional to the depth of the corresponding concepts in the ontology hierarchy. When a new user logs on to the system, the Manager randomly selects questions from the ontology. The user answers either YES or NO to each question. The answers are collected, weighted according to the respective difficulty degrees, and passed to the Manager, which then calculates a proficiency score for the user according to the percentage of correct responses and accordingly instantiates a proper user stereotype as the user model for the user.

The second task is to update user models. Here we use the interaction information and user feedback collected by the Interaction Agent in each interaction session or query session. An interaction session is the time period from the point the user logs in until he logs out, while a query session is the time period from when the user issues a query until he gets the answers and completes the feedback. An interaction session may contain several query sessions. After a query session is completed, we immediately update the interaction preference and solution presentation parts of the user model. Specifically, the user's query mode and solution presentation mode in this query session are remembered in both time windows, and the statistics of the preference change for each mode are calculated accordingly; these are used to adapt the Interaction Agent in the next query session. Fig. 15 illustrates the algorithm that updates the Show_Rate of the similarity mode. The algorithm uses the ratio of the number of user-selected FAQs to the number of displayed FAQs to update the show rate; the algorithm for the Show_Rate of the proficiency mode is similar.

NS : the number of FAQs in the solution FAQ list.
N(Similarity Mode) : the number of FAQs shown to the user in similarity mode,
    N(Similarity Mode) = ⌈NS × Show_Rate(Similarity Mode)_old⌉.
NHide = NS − N(Similarity Mode) : the number of hidden FAQs.
NSelect : the number of FAQs selected by the user in the query session.

Show_Rate(Similarity Mode) = Show_Rate(Similarity Mode)_old + Variation, where

                ⎧ (NSelect/NS − 0.7) × (1 − exp(−N(Similarity Mode)/α)),  if NSelect/NS ≥ 0.7
    Variation = ⎨ 0,                                                      if 0.3 < NSelect/NS < 0.7
                ⎩ (NSelect/NS − 0.3) × (1 − exp(−N(Similarity Mode)/α)),  if NSelect/NS ≤ 0.3,

    α : the weight change rate, and

Show_Rate(Similarity Mode)_new = Max(Min(Show_Rate(Similarity Mode), 1), 0.01).
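The show-rate update can be sketched in Python as follows; α (the weight change rate) is left as a free parameter, and its default value here is an assumption, not the paper's setting.

```python
# Sketch of the Fig. 15 show-rate update (similarity mode).
import math

def update_show_rate(show_rate_old, ns, n_select, alpha=10.0):
    """ns: FAQs in the solution list; n_select: FAQs the user selected."""
    n_shown = math.ceil(ns * show_rate_old)        # FAQs actually displayed
    ratio = n_select / ns
    damping = 1.0 - math.exp(-n_shown / alpha)     # grows with how many were shown
    if ratio >= 0.7:
        variation = (ratio - 0.7) * damping        # user selected a lot: raise rate
    elif ratio <= 0.3:
        variation = (ratio - 0.3) * damping        # user selected little: lower rate
    else:
        variation = 0.0                            # middle band: no change
    return max(min(show_rate_old + variation, 1.0), 0.01)
```

The dead zone between 0.3 and 0.7 keeps the rate stable under moderate selection behavior, while the exponential damping makes updates conservative when few FAQs were shown.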

Fig. 15 Algorithm to update the show rate in similarity mode

F = {F1, F2, ...} ⊆ C, the FAQ collection. Fi ∈ F : an FAQ selected and rated by the user.
For each Fi, update the user's proficiency in each Concept_j:

    Proficiency(Concept_j) = Proficiency_old(Concept_j) + α × (T(Concept_j) / Number of concepts in Fi) × Understanding_Level,
    Proficiency_new(Concept_j) = Max(Min(Proficiency(Concept_j), 1), 0),

    where α : the learning rate, Concept_j ∈ Ontology,
    T(Concept_j) : the number of times Concept_j appears in Fi, and

                          ⎧ +2, if the user rates Fi “very familiar”
                          ⎪ +1, if the user rates Fi “familiar”
    Understanding_Level = ⎨  0, if the user rates Fi “average”
                          ⎪ −1, if the user rates Fi “not familiar”
                          ⎩ −2, if the user rates Fi “very unfamiliar”.
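The Fig. 16 proficiency update can be sketched as follows; the learning rate default is an assumption, and "number of concepts in Fi" is read here as total concept occurrences in the FAQ.

```python
# Sketch of the Fig. 16 domain-proficiency update after a user rates an FAQ.

RATING_LEVEL = {"very familiar": 2, "familiar": 1, "average": 0,
                "not familiar": -1, "very unfamiliar": -2}

def update_proficiency(old, concept_counts, rating, learning_rate=0.1):
    """old: dict concept -> proficiency in [0, 1];
    concept_counts: concept -> times it appears in the rated FAQ Fi;
    rating: the user's understanding level for Fi."""
    level = RATING_LEVEL[rating]
    total = sum(concept_counts.values())  # total concept occurrences in Fi
    new = dict(old)
    for concept, times in concept_counts.items():
        p = old.get(concept, 0.0) + learning_rate * (times / total) * level
        new[concept] = max(min(p, 1.0), 0.0)  # clamp to [0, 1]
    return new
```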

Fig. 16 Algorithm to update the domain proficiency table

In addition, each user is asked to evaluate each solution FAQ in terms of five levels of understanding, namely, very familiar, familiar, average, not familiar, and very unfamiliar. This provides explicit feedback, which we use to update his domain proficiency table; Fig. 16 shows the updating algorithm. Finally, after each interaction session, we update the user's recommendation mode for the session in the respective time window. At the same time, we add the query and FAQ-selection records of the user to the query history and selection history of his user model.

For each useri in each user proficiency group {

    Proficiency_avg(useri) = ( Σ_j Proficiency(Concept_j) ) / Number of concepts in the Domain Proficiency Table,
        where Concept_j : the jth concept in useri's Domain Proficiency Table;

    if (0.8 ≤ Proficiency_avg(useri) ≤ 1.0), useri is reassigned to the Expert group;
    if (0.6 ≤ Proficiency_avg(useri) < 0.8), useri is reassigned to the Senior group;
    if (0.4 ≤ Proficiency_avg(useri) < 0.6), useri is reassigned to the Junior group;
    if (0.2 ≤ Proficiency_avg(useri) < 0.4), useri is reassigned to the Novice group;
    if (0.0 ≤ Proficiency_avg(useri) < 0.2), useri is reassigned to the Amateur group.
}
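The re-clustering rule above reduces to averaging a user's concept proficiencies and mapping the average onto the five stereotype thresholds; a minimal sketch:

```python
# Sketch of the Fig. 17 re-clustering rule: a user's average concept
# proficiency maps to one of the five stereotype groups.

GROUPS = [(0.8, "Expert"), (0.6, "Senior"), (0.4, "Junior"),
          (0.2, "Novice"), (0.0, "Amateur")]

def reassign_group(domain_proficiency_table):
    """domain_proficiency_table: dict mapping concept -> proficiency in [0, 1]."""
    avg = sum(domain_proficiency_table.values()) / len(domain_proficiency_table)
    for threshold, group in GROUPS:
        if avg >= threshold:
            return group
    return "Amateur"

print(reassign_group({"CPU": 0.9, "RAM": 0.9}))  # Expert
```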

Fig. 17 Algorithm to re-cluster all user groups

SExpert : the stereotype of the Expert group. U = {U1, U2, ..., UN} : the users in the Expert group.

    SExpert.Show_Rate(Similarity Mode) = ( Σ_{i=1..N} Ui.Show_Rate(Similarity Mode) ) / N
    SExpert.Show_Rate(Proficiency Mode) = ( Σ_{i=1..N} Ui.Show_Rate(Proficiency Mode) ) / N
    For each Concept_j in the Domain Proficiency Table (DPT) {
        SExpert.DPT.Proficiency(Concept_j) = ( Σ_{i=1..N} Ui.DPT.Proficiency(Concept_j) ) / N
    }

Fig. 18 Example of updating the Expert stereotype

The third task of the User Model Manager is to update user stereotypes. This happens when a sufficient number of user models in a stereotype have undergone changes. First, we reflect these changes to the stereotypes by re-clustering all affected user models, as shown in Fig. 17, and then re-calculate all parameters in each stereotype; Fig. 18 shows an example.

3.2.7 Recommender

The Recommender uses the following three policies to recommend information. 1) High-hit FAQs: it recommends the first N solution FAQs according to their selection counts from all users in the same group within a time window. 2) Hot-topic FAQs: it recommends the first N solution FAQs according to their popularity, calculated as statistics on the keywords appearing in the query histories of the same-group users within a time window; Fig. 19 details the hot-degree calculation. 3) Collaborative recommendation: it refers to the selection histories of the same-group users to provide solution recommendations. The basic idea is this: if user A and user B are in the same group and the first n interaction sessions of user A are the same as those of user B, then we can recommend the highest-rated FAQs of user A's (n+1)th session to user B; the detailed algorithm is shown in Fig. 20.

For each faq_i ∈ C, the FAQ collection:

    HOT_Score(faq_i) = (1 / Number of concepts appearing in faq_i) × Σ_j Appearance(Concept_j) × Weight(Concept_j),

    where Concept_j ∈ Ontology,
    Appearance(Concept_j) = 1 if Concept_j appears in faq_i, and 0 otherwise,
    Weight(Concept_j) : the number of times Concept_j appears in the user queries of the same user group within a time window.

Fig. 19 Algorithm to calculate the hot degree

G = {U1, U2, ..., Um} : the users in the same group G. Y = {Uj ∈ G | Uj ≠ Ui}. FAQ collection = {F1, F2, ..., Fn}.
Ui = {Si,1, Si,2, ..., Si,x} : the query sessions of user i.
Si,x = {Ri,x,1, Ri,x,2, ..., Ri,x,n} : the FAQs rated and/or selected by user i in session x, where

             ⎧ 1,   if user i selects Fy in session x and rates it better than satisfying
             ⎪ 0.8, if user i selects Fy in session x and rates it satisfying
             ⎪ 0.6, if user i selects Fy in session x and rates it average
    Ri,x,y = ⎨ 0.6, if user i selects Fy in session x without rating it
             ⎪ 0.4, if user i selects Fy in session x and rates it dissatisfying
             ⎪ 0.2, if user i selects Fy in session x and rates it worse than dissatisfying
             ⎩ 0,   if user i does not select Fy in session x.

    Similarity(Si,x, Sj,k) = argmax over Sj,k of Σ_{Uj∈Y} Σ_{Sj,k∈Uj} ( 1 − Distance(Si,x, Sj,k) ),

    where Distance(Si,x, Sj,k) = ( Σ_{z=1..n} (Ri,x,z − Rj,k,z)² ) / n.

    Let Sj,k be the session most similar to Si,x, and let Si,x+1 be user i's current query session.
    Recommend the following FAQs for Si,x+1: {Fy | Rj,k,y > 0.5 and Ri,x,y = 0} ∪ {Fy | Rj,k+1,y > 0.5 and Ri,x,y = 0}.

Fig. 20 Algorithm for collaborative recommendation

4 System Demonstration and Experiments

4.1 System demonstration

Fig. 21 System register interface (callouts: basic user information, questionnaire)

Fig. 22 Main tableau of our system (callouts: user interaction interface, automatic keyword scrolling list, FAQ recommendation list, PC ontology tree)

When a new user enters the system, the user is registered by the Agent, as shown in Fig. 21. At the same time, a questionnaire is produced by the Agent for evaluating the user's domain proficiency; the answers are collected and scored to help build an initial user model for the new user. The user can then enter the main tableau of our system (Fig. 22), which consists of the following three major tab-frames: query interface, solution presentation, and logout. The query interface tab comprises four frames: the user interaction interface, the automatic keyword scrolling list, the FAQ recommendation list, and the PC ontology tree. The user interaction interface supports both keyword and NLP (natural language processing) query modes. The keyword query mode provides lists of question types and operation types, which allow the users to express their intentions precisely. The automatic keyword scrolling list provides ranked-keyword guidance for user queries. A user can browse the PC ontology tree to learn domain knowledge. The FAQ recommendation list provides personalized information recommendations from the system in three modes: hit, hot topic, and collaboration.
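The hot-topic mode ranks FAQs by the hot degree of Fig. 19; a minimal sketch follows, assuming the time-window bookkeeping that produces the concept weights is done elsewhere.

```python
# Sketch of the Fig. 19 hot-degree score: average, over the concepts in an
# FAQ, the number of times each concept appeared in the group's recent queries.

def hot_score(faq_concepts, concept_weights):
    """faq_concepts: ontology concepts appearing in the FAQ.
    concept_weights: concept -> occurrence count in same-group user queries
    within the current time window."""
    if not faq_concepts:
        return 0.0
    return sum(concept_weights.get(c, 0) for c in faq_concepts) / len(faq_concepts)

def hot_topic_faqs(faqs, concept_weights, n):
    """Recommend the first n FAQs by hot degree. faqs: faq_id -> concept list."""
    ranked = sorted(faqs, key=lambda f: hot_score(faqs[f], concept_weights),
                    reverse=True)
    return ranked[:n]
```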


When the user clicks a mode, the corresponding popup window is produced by the system. The solution presentation tab is illustrated in Fig. 23. It pre-selects the solution ranking method according to the user's preference and hides part of the solutions according to his Show_Rate, which reduces the user's cognitive load. The user can switch the solution ranking method between similarity ranking and proficiency ranking. The user can click the question part of an FAQ (Fig. 24) to display its content or to give feedback on it, covering both the satisfaction degree and the comprehension degree. Fig. 25 illustrates the window shown before system logout, which asks the user to fill in a questionnaire whose statistics help further system improvement.

Fig. 23 Solution presentation (callouts: ranking order, show all, return in e-mail)

Fig. 24 FAQ selection and feedback enticing (callout: user feedback)

Fig. 25 System logout (callout: a questionnaire for statistics)

4.2 System Evaluation

The evaluation of the overall performance of our system involves a lot of manpower and is time-consuming. Here, we focus on the performance evaluation of the most important module, i.e., the Query Parser. Our philosophy is that if it can precisely parse user queries and extract the true query intention and focus from them, then we can effectively improve the quality of the retrieved solutions. Recall that the Query Parser employs a template-based pattern matching mechanism to understand user queries and that the templates were manually constructed from 1215 FAQs. In the first experiment, we use these same FAQs as testing queries, in order to verify whether any conflicts exist among the query patterns. Table 5 illustrates the experimental results: only 33 queries match more than one query pattern and result in confusion of the query intention, called "error" in the table. These errors may be corrected by the user. The experiment shows that the effectiveness rate of the constructed query templates reaches 97.28%, which implies the template base can be used as an effective knowledge base for natural language query processing.

Table 5 Effectiveness of constructed query patterns
#Testing  #Correct  #Error  Precision Rate (%)
1215      1182      33      97.28

Our second experiment studies how well the Parser understands new queries. First, we collected in total 143 new FAQs, different from the FAQs used to construct the query templates, from four famous motherboard manufacturers in Taiwan, namely ASUS, GIGABYTE, MSI, and SIS. We then used the question parts of those FAQs as testing queries to see how well the Parser performs. Our experiments show that we can precisely extract the true query intentions and focuses from 112 FAQs. The remaining 31 FAQs contain three or more sentences per query, which explains why we failed to understand them. In summary, 78.3% (112/143) of the new queries can be successfully understood.

Table 6 User satisfaction evaluation
METHOD \ K_WORD   CPU (SE/ST)   MOTHERBOARD (SE/ST)   MEMORY (SE/ST)   AVERAGE (SE/ST)
Alta Vista        63% / 61%     77% / 78%             30% / 21%        57% / 53%
Excite            66% / 62%     81% / 81%             50% / 24%        66% / 56%
Google            66% / 64%     81% / 80%             38% / 21%        62% / 55%
HotBot            69% / 63%     78% / 76%             62% / 31%        70% / 57%
InfoSeek          69% / 70%     71% / 70%             49% / 28%        63% / 56%
Lycos             64% / 67%     77% / 76%             36% / 20%        59% / 54%
Yahoo             67% / 61%     77% / 78%             38% / 17%        61% / 52%
Our approach      78% / 69%     84% / 78%             45% / 32%        69% / 60%

Finally, Table 6 compares the user satisfaction of our system prototype against other search engines. In the table, ST, for satisfaction of testers, is the average of the satisfaction responses from 10 ordinary users, while SE, for satisfaction of experts, is the average of the satisfaction responses from 10 experts. Basically, each search engine receives 100 queries and returns its first 100 webpages for evaluation of satisfaction by both experts and non-experts. The table shows that our approach, in the last row, enjoys the highest satisfaction in almost all classes. From the evaluation, we conclude that, unless the competing search engines are specifically tailored to this specific domain, such as HotBot and Excite, our techniques in general retrieve more correct webpages in almost all classes.

5 Discussions and Future Work

We have developed an Interface Agent to work as an assistant between the users and the system; it differs in system architecture and implementation from our previous work [12]. The Agent is used to retrieve FAQs in the PC domain. We integrated several interesting techniques, including user modeling, domain ontology, and template-based linguistic processing, to effectively tackle the problems associated with traditional FAQ retrieval systems. In short, our work features an ontology-supported, template-based user modeling technique for developing interface agents; a natural language query mode, along with an improved keyword-based query mode; and assistance and guidance for human-machine interaction. Our preliminary experimentation demonstrates that the user intention and focus of up to eighty percent of the user queries can be correctly understood by the system, which accordingly provides query solutions with higher user satisfaction. In the future, we plan to employ machine learning and data mining techniques to automate the construction of the template base. As for the overall system evaluation, we plan to employ the concept of usability evaluation from the field of human factors engineering to evaluate the performance of the user interface.

Acknowledgements

The author would like to thank Yai-Hui Chang and Ying-Hao Chiu for their assistance in system implementation. This work was supported by the National Science Council, R.O.C., under Grant NSC-95-2221-E-129-019.

References:
[1] Chandrasekaran, B., Josephson, J.R., and Benjamins, V.R., What Are Ontologies, and Why Do We Need Them? IEEE Intelligent Systems, Vol. 14, No. 1, 1999, pp. 20-26.
[2] Chiu, Y.H., An Interface Agent with Ontology-Supported User Models, Master Thesis, Department of Electronic Engineering, National Taiwan University of Science and Technology, Taiwan, R.O.C., 2003.
[3] Hovy, E., Hermjakob, U., and Ravichandran, D., A Question/Answer Typology with Surface Text Patterns, Proc. of the DARPA Human Language Technology Conference, San Diego, CA, USA, 2002, pp. 247-250.
[4] Noy, N.F. and McGuinness, D.L., Ontology Development 101: A Guide to Creating Your First Ontology, Stanford Knowledge Systems Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics Technical Report SMI-2001-0880, 2001.
[5] Rich, E., User Modeling via Stereotypes, Cognitive Science, Vol. 3, 1979, pp. 329-354.
[6] Salton, G., Wong, A., and Yang, C.S., A Vector Space Model for Automatic Indexing, Communications of the ACM, Vol. 18, No. 11, 1975, pp. 613-620.
[7] Salton, G. and McGill, M.J., Introduction to Modern Information Retrieval, McGraw-Hill, New York, USA, 1983.
[8] Soubbotin, M.M. and Soubbotin, S.M., Patterns of Potential Answer Expressions as Clues to the Right Answer, Proc. of the TREC-10 Conference, NIST, Gaithersburg, MD, USA, 2001, pp. 293-302.
[9] Tsai, C.H., MMSEG: A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm, available at http://technology.chtsai.org/mmseg/, 2000.
[10] Winiwarter, W., Adaptive Natural Language Interfaces to FAQ Knowledge Bases, International Journal on Data and Knowledge Engineering, Vol. 35, 2000, pp. 181-199.
[11] Yang, S.Y. and Ho, C.S., Ontology-Supported User Models for Interface Agents, Proc. of the 4th Conference on Artificial Intelligence and Applications, Chang-Hwa, Taiwan, 1999, pp. 248-253.
[12] Yang, S.Y. and Ho, C.S., An Intelligent Web Information Aggregation System Based upon Intelligent Retrieval, Filtering and Integration, Proc. of the 2004 International Workshop on Distance Education Technologies, Hotel Sofitel, San Francisco Bay, CA, USA, 2004, pp. 451-456.
[13] Yang, S.Y., Chiu, Y.H., and Ho, C.S., Ontology-Supported and Query Template-Based User Modeling Techniques for Interface Agents, Proc. of the 12th National Conference on Fuzzy Theory and Its Applications, I-Lan, Taiwan, 2004, pp. 181-186.
[14] Yang, S.Y., Chuang, F.C., and Ho, C.S., Ontology-Supported FAQ Processing and Ranking Techniques, accepted for publication in International Journal of Intelligent Information Systems, 2005.
[15] Yang, S.Y., Liao, P.C., and Ho, C.S., A User-Oriented Query Prediction and Cache Technique for FAQ Proxy Service, Proc. of the 2005 International Workshop on Distance Education Technologies, Banff, Canada, 2005, pp. 411-416.
[16] Yang, S.Y., Liao, P.C., and Ho, C.S., An Ontology-Supported Case-Based Reasoning Technique for FAQ Proxy Service, Proc. of the 17th International Conference on Software Engineering and Knowledge Engineering, Taipei, Taiwan, 2005, pp. 639-644.
[17] Yang, S.Y., FAQ-master: A New Intelligent Web Information Aggregation System, International Academic Conference 2006 Special Session on Artificial Intelligence Theory and Application, Tao-Yuan, Taiwan, 2006, pp. 2-12.
[18] Yang, S.Y., An Ontology-Supported and Query Template-Based User Modeling Technique for Interface Agents, 2006 Symposium on Application and Development of Management Information Systems, Taipei, Taiwan, 2006, pp. 168-173.
[19] Yang, S.Y., An Ontology-Supported Website Model for Web Search Agents, accepted for presentation at the 2006 International Computer Symposium, Taipei, Taiwan, 2006.
[20] Yang, S.Y., How Does Ontology Help Information Management Processing, WSEAS Transactions on Computers, Vol. 5, No. 9, 2006, pp. 1843-1850.