Improving Performance in Constructing Specific Web Directory Using Focused Crawler: An Experiment on Botany Domain

Madjid Khalilian, Farsad Zamani Boroujeni, Norwati Mustapha
Universiti Putra Malaysia (UPM), Faculty of Computer Science and Information Technology (FSKTM)

Abstract: Nowadays the growth of the web causes difficulties in searching and browsing for useful information, especially in specific domains. At the same time, some portions of the web remain largely underdeveloped, as shown by a lack of high-quality content. An example is the botany-specific part of the web, where the lack of well-structured web directories has limited users' ability to browse for the information they need. In this research we propose an improved framework for constructing a specific web directory. In this framework we use an anchor directory as the foundation of the primary web directory; the directory is then completed with information gathered by an automatic component and filtered by experts. We conduct an experiment to evaluate effectiveness, efficiency and satisfaction.

Keywords: Botany, Ontology, Focused Crawler, Web Directory, Document Similarity

I. Introduction

One of the main challenges for a researcher is how to organize the vast amount of data on the web and provide facilities for accessing it. With the explosive growth of the web, the need for effective search and browsing is apparent: users are overwhelmed by a mass of unstructured or semi-structured information. One solution to this problem is creating and organizing data in a specific-domain directory. There are two main approaches to creating a web directory: 1) creating the directory manually, and 2) creating and organizing it automatically. Each has its own disadvantages. For a manual web directory we need many editors, whose knowledge differs from one person to the next, and the approach is not scalable, because the web keeps growing while our resources are limited. The automatic approach has its own difficulties: most experiments show that information gathered this way does not have sufficient quality. We therefore propose a semi-automatic framework for constructing a high-quality web directory and evaluate it experimentally.

We select the plant classification and identification domain because of its importance for agricultural science. Besides its other special properties, this domain uses a literature that includes specific terms in Latin, which are standard across all languages. In addition, the domain has a well-defined hierarchy. These two properties of plant classification shape our methods and framework.

Two main components are used in our framework: one draws on human knowledge, while the second is an automatic component. We choose a vertical search engine instead of a multipurpose search engine. A vertical search engine searches within a specific domain, and its main component is called a focused crawler. In contrast to a regular crawler, a focused crawler gathers only web pages from the specific domain, which helps us avoid low-quality pages that are not related to that domain. Several techniques have been used to increase precision and recall; these two parameters are illustrated by Fig. 1. Our interest is to increase the intersection of A and B, which can be done by increasing precision and recall.

Fig. 1. A: specific-domain web pages; B: web pages gathered by the crawler; B-A: gathered web pages which are not relevant; A-B: relevant web pages which were not gathered.

In the next section a survey of related work is given. Then in Section III our research model is explained. Section IV describes the methodology. The experiment, results and discussion are presented in Sections V and VI, and finally we give our conclusions and future work.
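To make the quantities of Fig. 1 concrete, here is a minimal sketch (ours, not part of the original paper) that computes precision and recall from the two sets, with invented placeholder URLs standing in for A and B:

```python
# Fig. 1 as sets: A = relevant domain pages, B = pages gathered by the crawler.
relevant = {"u1", "u2", "u3", "u4"}     # A (hypothetical URLs)
retrieved = {"u2", "u3", "u5"}          # B (hypothetical URLs)

hits = relevant & retrieved             # A ∩ B: relevant pages actually gathered
precision = len(hits) / len(retrieved)  # |A ∩ B| / |B|
recall = len(hits) / len(relevant)      # |A ∩ B| / |A|
print(precision, recall)                # 0.666..., 0.5
```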


II. Related Works

A very large number of studies over the years have been done to improve the searching and browsing of web pages. There is a general categorization of search systems: traditional text-based search environments [1] and web directory search environments. Organizing knowledge for domain-specific applications has challenged academics and practitioners because of the abundance of information and the difficulty of categorizing it. In this review of related research we keep our focus on web directory search environments and study three main topics: ontology and web directories, automatic and semi-automatic search environments, and plant web resources.

1-Ontology and web directory

First we review the ontology concept. In fact, knowledge is classification, and without classification there could be no thought, action or organization. Reference [2] mentions that one of the most important problems of the web is the huge amount of information without any structure. An ontology is an agreement between agents that exchange information: the agreement is a model for structuring and interpreting exchanged data, together with a vocabulary that constrains these exchanges. Using an ontology, agents can exchange vast quantities of data and interpret them consistently.

Many applications leverage ontology as an effective approach to constructing a web directory. Reference [3] states that ontology is used for web mining of home pages; in that work the employed ontology is hand-made. Some projects use an ontology for online classification of documents [4]: for the different meanings of the query terms, different classes are created, and the documents in the result list are put into the appropriate classes. The ontology used in that case is WordNet, which is a general ontology. Reference [5] defines an ontology for email and performs classification with it, using Protégé for the ontology definition and CLIPS for reasoning on it. In most of these works a ready-made ontology is used, which prevents the proposed methods from being extended to new domains. Reference [6] proposes an approach that creates the ontology automatically; although it might not be suitable for reasoning purposes, it is a good facility for extending the search to the whole web space.

2-Automatic and semiautomatic search environments

One approach to creating an effective ontology is using a vertical search engine as a tool. Vertical search engines use a focused crawler for a specific domain. The motivation for focused crawling comes from the poor performance of general-purpose search engines, which depend on the results of generic web crawlers. The focused crawler is a system that learns a specialization from examples and then explores the web, guided by a relevance and popularity rating mechanism.

Reference [7] suggests a framework using a combination of link structure and content similarity; the experiments in that paper show that considering link structure and content together can improve the quality of the retrieved information. Reference [8] proposes a new method based on a content-block algorithm to enhance the focused crawler's ability to traverse tunnels; the algorithm avoids the granularity becoming too coarse, as it does when whole pages are evaluated, while also avoiding granularity that is too fine, by relying on link context. An intelligent crawling technique has been used which can dynamically adapt the ontology to the particular structure of the relevant predicate. In reference [9] an adaptive focused crawler based on ontology learning is used which comprises two parts: a topic filter and a link forecaster. The topic filter removes fetched web pages that are not related to the topic, and the link forecaster predicts topical links for the next crawl, guiding the crawler to fetch as many topical pages as possible within a topic group by traversing unrelated links. The proposed method uses the ontology for crawling and is able to make the ontology concrete during this process. It also uses the ontology-learning concept, which could be useful for our project as future work.

Reference [10] proposed a framework with a semi-automatic approach. It combines the abilities of human and machine and avoids the disadvantages that each of the two methods has on its own. It seems that the automatic part of this framework needs improvement: a meta-search component is employed, and a multipurpose search engine is used to gather information and web pages [11]. Using a multipurpose search engine has its own disadvantages: 1) the enormous number of unusable web pages can reduce the speed of the process; 2) precision is not acceptable, because a huge number of web pages must be searched; 3) the multipurpose repository must be kept up to date, which consumes time. Moreover, only a limited portion of the web is covered by even the biggest search engine. Reference [12] mentions that traditional information retrieval methods can help to increase efficiency: using the vector space model, web pages related to the specific domain are selected and all links inside them are crawled. Reference [13] proposed a novel approach based on manifold ranking of document blocks to re-rank a small set of documents initially retrieved by some existing retrieval function. The approach makes full use of the intrinsic global manifold structure of the document blocks by propagating ranking scores between the blocks on a weighted graph, and the manifold ranking process can exploit the relationships among document blocks to improve retrieval performance; however, the link structure among web pages is not considered. Reference [14] used a domain-graph concept for web page ranking: for every page in the kernel, a graph is constructed from the pages that point to the selected page. A multipurpose search engine is leveraged in this method, because it can determine the list of web pages that point to a specific page. After this graph is extracted, the web pages are ranked, and based on this ranking the web pages are arranged.
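As a toy illustration (ours, not code from any of the cited papers) of the backlink idea behind that domain-graph description, the sketch below ranks pages by how many known pages point to them; the link lists are invented:

```python
# links: page -> pages it links to (an invented web fragment).
links = {"a": ["c"], "b": ["c", "d"], "d": ["c"]}

# Invert the link structure to get each page's in-links.
in_links = {}
for src, dsts in links.items():
    for dst in dsts:
        in_links.setdefault(dst, set()).add(src)

# Rank pages by how many pages point to them ("c" has 3, "d" has 1).
ranking = sorted(in_links, key=lambda p: len(in_links[p]), reverse=True)
print(ranking)  # ['c', 'd']
```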

3-Plants web resources

There are some relevant web directories. The USDA, the U.S. Department of Agriculture, is charged with protecting American agriculture (www.usda.gov), and APHIS (Animal and Plant Health Inspection Service) provides leadership in ensuring the health and care of animals and plants. Neither concentrates on the botany topic. The Internet Directory for Botany (IDB) is an index to botanical information available on the internet (www.botany.net); it was created manually by botany specialists, and it is a well-organized alphabetic web directory, which we choose as our benchmark.

III. Research Model

In this study we improve the framework proposed by [10], which includes three main steps. Instead of using meta-search, we use a vertical search engine with a focused crawler; for this we need a kernel for the crawler and a vector that represents the domain literature. We therefore change steps 1 and 2 of the framework, while step 3 remains unchanged, as shown in Fig. 2. In our framework, after constructing the anchor directory we initiate a new set of URLs related to the plant domain. In the third step we design a focused crawler that works based on cosine similarity; for this purpose we developed a software tool that crawls links from the kernel using the cosine-similarity method. Gathered web pages are put into a repository, and finally a heuristic filtering is performed. In this section we describe the proposed framework.

1. Anchor directory
Based on our review of web search engines and directories, we chose the DMOZ directory as the anchor directory (www.Dmoz.com). This directory is employed by most search engines, and experts refer to it for specialized searches in botany. Some refinement has to be done to turn it into the anchor web directory.
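As an illustration of the kind of refinement involved, the sketch below (our assumption of the procedure, not the authors' tool) filters DMOZ category paths down to plant classification and identification topics; the input file name and the keyword list are hypothetical:

```python
# Hypothetical refinement of DMOZ category paths to plant-related nodes.
# Assumed input format: one category path per line, e.g.
# "Science/Biology/Flora_and_Fauna/Plantae/Classification"
KEYWORDS = ("plantae", "botany", "classification", "identification")  # illustrative

def filter_plant_nodes(paths):
    """Keep only category paths that mention a plant-related keyword."""
    return [p for p in paths if any(k in p.lower() for k in KEYWORDS)]

with open("dmoz_categories.txt", encoding="utf-8") as f:  # hypothetical dump
    nodes = filter_plant_nodes(line.strip() for line in f)
print(len(nodes), "nodes kept")
```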

Fig. 2. An improved semi-automatic framework for constructing web directories: anchor directory -> kernel set of URLs and literature vector -> focused crawler -> heuristic filtering.

2. Kernel & literature vector creation
We need a kernel for gathering information from the web. For this purpose, a set of URLs has to be created; we also need a vector of terms specific to the botany domain. The set and the vector were created by experts: 110 web pages were selected, and a vector of 750 botany-domain terms was built.
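A minimal sketch of how such a literature vector might be built from the expert-selected seed pages (our reading of this step, with simplified tokenization; the seed texts are invented stand-ins for the 110 pages, and `size` mirrors the 750-term figure):

```python
import re
from collections import Counter

def terms(text):
    """Lowercased alphabetic tokens; real tokenization would be richer."""
    return re.findall(r"[a-z]+", text.lower())

def build_literature_vector(seed_texts, size=750):
    """Aggregate term frequencies over seed pages and keep the top `size` terms."""
    counts = Counter()
    for text in seed_texts:
        counts.update(terms(text))
    return dict(counts.most_common(size))

# Hypothetical seed contents standing in for the 110 expert-selected pages.
seeds = [
    "Plantae classification key for Rosa canina and related species ...",
    "Identification of Quercus robur: leaf morphology and taxonomy ...",
]
literature_vector = build_literature_vector(seeds)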

3. Vertical search engine
To fill the plant web directory we use a focused crawler that works by a similarity search between web pages and the vector describing the plant domain. In this content-based focused crawler, a vector of the terms on the current web page is created, and then the similarity between the current page and the domain vector is calculated. If the similarity exceeds a threshold value, the page is saved in the repository and crawling continues from it; otherwise the page is discarded and its links are not crawled.
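A condensed sketch of this crawling loop (ours, under stated assumptions): `fetch` and `extract_links` are placeholders for real HTTP and HTML handling, the 0.2 threshold is illustrative, and `domain_vector` would be the 750-term literature vector from step 2, with the kernel URLs as seeds:

```python
import math
import re
from collections import Counter, deque

def tf_vector(text):
    """Term-frequency vector of a page's text (simplified tokenization)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(v1, v2):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(c * v2.get(t, 0) for t, c in v1.items())
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def focused_crawl(kernel_urls, domain_vector, fetch, extract_links,
                  threshold=0.2, limit=1000):
    """Crawl from the kernel, keeping and expanding only on-topic pages."""
    queue = deque(kernel_urls)
    seen = set(kernel_urls)
    repository = []
    while queue and len(repository) < limit:
        url = queue.popleft()
        text = fetch(url)                 # placeholder: HTTP GET + text extraction
        if text is None:
            continue
        if cosine(tf_vector(text), domain_vector) >= threshold:
            repository.append(url)        # similar enough: keep the page ...
            for link in extract_links(text, url):  # ... and follow its links
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        # otherwise the page is discarded and its links are not followed
    return repository
```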

4. Heuristic filtering
Finally, we evaluate the output of steps 2 and 3 with a group of experts, who score the output of the crawling.

IV. Methodology

Using the framework above, we developed a web directory called the PCI (Plant Classification and Identification) web directory, which includes plant web sites. Based on our review of web search engines, we chose the DMOZ directory as the anchor directory because of its comprehensive plant directory and its wide acceptance, as seen in the fact that it is used by such major search engines as Google, AOL and Netscape search [10]. We use the English language in our research. On the first pass we aggregated 1180 nodes from DMOZ; after filtering we kept 331 nodes corresponding to plant classification and identification. Table 1 summarizes the statistics of the PCI web directory.

TABLE 1: SUMMARY STATISTICS OF THE PCI WEB DIRECTORY [table data not recoverable]

In order to compare our proposed approach, we selected as a benchmark a web directory called IDB (Internet Directory for Botany, www.botany.net/IDB/botany.html), an index to botanical information available on the web.

V. Experimental Procedure

A. Design and Participants
We measured the effectiveness, efficiency and satisfaction of our PCI web directory in comparison with IDBAL as a benchmark. IDBAL is one of the best-designed web sites for botany, but it is not scalable, user-friendly or efficient. Ten botany experts were invited, and through them we measured the criteria mentioned above. Two groups of experts, graduate and undergraduate, of different ages and levels of experience, were selected for testing and comparing PCI and IDBAL. We were interested in observing effectiveness, efficiency and satisfaction across the two directories via the experts' testing. We measure effectiveness by three values: precision, recall and f-value. Efficiency is the time taken by the users to answer a set of questions. For calculating efficiency, we organized an experiment in which the experts searched for a specific topic in both web directories (the benchmark and the proposed directory). Satisfaction is measured with further questions: at the end, using a questionnaire about different aspects of each web directory, i.e. its helpful points and its positive and negative points, we measure the satisfaction parameter (Table 2). Each question is ranked between 1 and 5, where 1 means completely satisfied and 5 means not satisfied.

B. Hypothesis testing
Our objective is a framework that combines experts' knowledge with automatic approaches to help in searching and browsing information in a specific domain like botany, and that improves efficiency and speed in this situation. The hypothesis is that the constructed botany web directory helps specialists search and browse more efficiently and effectively, and satisfies users better, than the benchmark.

C. Performance measures
We defined the following terms for evaluating the proposed framework:

• Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses):
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

• Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved:
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

• F-value = 2 * recall * precision / (recall + precision). This parameter balances recall and precision; with these three parameters we can define quality.

• Efficiency (speed-up): comparing the speed of users when they use the web directory for searching and browsing a specific topic with their speed on multipurpose search engines.

• Satisfaction: how useful the web directory is for its users.
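A direct transcription of the f-value definition (a sketch of ours, not code from the paper):

```python
def f_value(precision, recall):
    """Harmonic mean of precision and recall; balances the two measures."""
    if precision + recall == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

print(round(f_value(0.87, 0.29), 3))  # 0.435, matching the PCI row of Table 3
```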

VI. Results and Discussion

In this section we report and discuss the results of our experiment. For this experiment a web directory (PCI) was organized and compared with a benchmark (IDBAL); the statistical results are given in Table 3. As can be seen, the hypothesis is supported for most criteria. Efficiency, however, is not better in PCI than in IDBAL; this reflects the sorted and well-organized web pages in IDBAL, and suggests that there are difficulties in the structure and organization of the PCI web directory. The experts' comments on both directories are summarized in Table 2.

TABLE 2
COMMENTS BASED ON EXPERIMENTS

Comments on PCI web directory (number of subjects expressing the meaning):
- Easy and nice to use: 1
- Useful and with relevant results: 4
- Good way to organize information: 1
- Not enough information for browsing: 4
- No comments/unclear comments: 0

Comments on IDBAL web directory (number of subjects expressing the meaning):
- Not useful/not good: 2
- Not clear for browsing/gives irrelevant results: 2
- Easy to navigate or browse: 3
- No comments/unclear comments: 3

VII. Conclusions and Future Works

Constructing a specific-domain directory that contains high-quality web pages has challenged the developers of web portals. In this paper we proposed a semi-automatic framework that combines human knowledge with automatic techniques. Experts evaluated the resulting web directory against a benchmark, and most of the expected results were satisfied. For future work we will investigate new intelligent techniques for improving precision and recall in the focused crawler. It seems that using a dynamic vector instead of a vector with constant terms can improve effectiveness; this is feasible by adding and deleting terms in the reference vector. Another direction is refining the PCI web directory to make it more efficient. The framework can also be applied to other specific domains.

TABLE 3
STATISTICAL RESULTS OF HYPOTHESIS TESTING (S = supported)

Measure      | PCI mean | IDBAL mean | Result
Precision    | 0.87     | 0.75       | S
Recall       | 0.29     | 0.17       | S
f-value      | 0.435    | 0.255      | S
Efficiency   | 173      | 160        | NOT S
Satisfaction | 3.1      | 3.3        | S

References

[1] H. Topi and W. Lucas, "Mix and match: combining terms and operators for successful web searches," Information Processing and Management 41 (2005) 801-817.
[2] H. P. Alesso and C. F. Smith, Thinking on the Web, John Wiley & Sons, New Jersey, 2006.
[3] Ee-Peng Lim and Aixin Sun, "Web mining: the ontology approach," The International Advanced Digital Library Conference, Noyori Conference Hall, Nagoya University, Japan, August 25-26, 2005.
[4] E. W. De Luca and A. Nürnberger, "Improving ontology-based sense folder classification of document collections with clustering methods," Proc. of the 2nd Int. Workshop on Adaptive Multimedia Retrieval, 2004.
[5] K. Taghva, J. Borsack, J. Coombs, A. Condit, S. Lumos and T. Nartker, "Ontology-based classification of email," ITCC 2003: International Conference on Information Technology: Coding and Computing, 2003.
[6] M. Khalilian, K. Sheikh and H. Abolhassani, "Classification of web pages by automatically generated categories," in Innovations and Advanced Techniques in Systems, Computing Sciences and Software Engineering, Springer, 2008, ISBN 978-1-4020-8734-9.
[7] M. Jamali et al., "A framework using combination of link structure and content similarity."
[8] N. Luo, W. Zuo and F. Yuon, "A new method for focused crawler cross tunnel," RSKT 2006, pp. 632-637.
[9] C. Su and J. Yang, "An efficient adaptive focused crawler based on ontology learning," 5th Int. Conf. on Hybrid Intelligent Systems (HIS), IEEE, 2005.
[10] W. Chung, G. Lai, A. Bonillas, W. Xi and H. Chen, "Organizing domain-specific information on the web: an experiment on the Spanish business web directory," International Journal of Human-Computer Studies 66 (2008) 51-66.
[11] M. Khalilian, K. Sheikh and H. Abolhassani, "Controlling threshold limitation in focused crawler with decay concept," 13th National CSI Conference, Kish Island, Iran, 2008.
[12] F. Menczer, G. Pant and P. Srinivasan, "Topic-driven crawlers: machine learning issues," ACM TOIT, submitted 2002.
[13] X. Wan, J. Yang and J. Xiao, "Towards a unified approach to document similarity search using manifold ranking of blocks," Information Processing and Management 44 (2008) 1032-1048.
[14] M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles and M. Gori, "Focused crawling using context graphs," in Proceedings of the 26th International Conference on Very Large Data Bases (VLDB), Egypt, 2000.
[15] D. Cai, S. Yu, J.-R. Wen and W.-Y. Ma, "VIPS: a vision-based page segmentation algorithm," Microsoft Technical Report MSR-TR-2003-79, 2003.