Improving Performance in Constructing specific Web Directory using Focused Crawler: An Experiment on Botany Domain

Madjid Khalilian, Farsad Zamani Boroujeni, Norwati Mustapha
Universiti Putra Malaysia (UPM), Faculty of Computer Science and Information Technology (FSKTM)
Abstract—Nowadays the growth of the web causes difficulties in searching and browsing for useful information, especially in specific domains. Some portions of the web remain largely underdeveloped, as shown by their lack of high quality content. An example is the botany-specific part of the web, in which the lack of well-structured web directories has limited users' ability to browse for required information. In this research we propose an improved framework for constructing a specific web directory. In this framework we use an anchor directory as a foundation for the primary web directory. This web directory is completed by information which is gathered by an automatic component and filtered by experts. We conduct an experiment to evaluate effectiveness, efficiency and satisfaction.

Keywords: Botany, Ontology, Focused Crawler, Web Directory, Document Similarity

I. Introduction

One of the main challenges for a researcher is how to organize the vast amount of data, and the facilities for accessing it, through the web. With the explosive growth of the web, the need for effective search and browsing is apparent. Users are overwhelmed with a lot of unstructured or semi-structured information. One solution to this problem is creating and organizing data in a specific domain directory. There are two main approaches for creating a web directory: 1) creating the web directory manually, 2) creating and organizing it automatically. Each of them has its own disadvantages. For a manual web directory we need many editors, whose knowledge differs from one person to another. It is also not scalable, because the web keeps growing while our resources are limited. The automatic approach has its own difficulties: most experiments show that information gathered by this approach does not have enough quality. We propose a semi-automatic framework for constructing a high quality web directory and use it in an experiment.

We select the plant classification and identification domain because of the importance of this domain for agricultural science. Besides having its own special properties, this domain uses a literature that includes specific terms in Latin, which are standard across all languages. In addition, this domain has a perfect hierarchy. These two properties of plant classification affect our methods and framework.

Two main components have been used in our framework; one of them relies on human knowledge while the second one is an automatic component. We choose to use a vertical search engine instead of a multipurpose search engine. A vertical search engine searches within a specific domain. Its main component is called a focused crawler. In contrast with a regular crawler, this kind of crawler gathers only web pages from the specific domain. It helps us to avoid low quality web pages which are not related to the specific domain. Some techniques have been used for increasing precision and recall; these two parameters are illustrated by Fig. 1. Our interest is to increase the intersection of A and B; this can be done by increasing precision and recall. In the next section a survey of related works is given. Then in section 3 our research model is explained. Section 4 describes the methodology. The experiment, results and discussion are explained in sections 5 and 6, and finally we present our conclusion and future works.
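The quantities behind Fig. 1 can be illustrated with plain set arithmetic; the page IDs below are invented purely for illustration:

```python
# A: web pages that truly belong to the specific domain (relevant set).
# B: web pages gathered by the crawler (retrieved set).
A = {"p1", "p2", "p3", "p4", "p5"}
B = {"p2", "p3", "p4", "p9"}

hits = A & B                       # the intersection we want to grow
precision = len(hits) / len(B)     # fraction of gathered pages that are relevant
recall = len(hits) / len(A)        # fraction of relevant pages that were gathered

print(precision)  # 0.75
print(recall)     # 0.6
```

Growing the intersection A ∩ B relative to B raises precision; growing it relative to A raises recall.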
K. Elleithy (ed.), Advanced Techniques in Computing Sciences and Software Engineering, DOI 10.1007/978-90-481-3660-5_79, © Springer Science+Business Media B.V. 2010
Fig. 1. A: specific domain web pages; B: web pages gathered by the crawler; B−A: gathered web pages which are not relevant; A−B: relevant web pages which were not gathered.

II. Related Works

A very large number of studies over the years have been done to improve searching and browsing of web pages. There is a general categorization of search systems: traditional text-based search environments [1] and web directories. Organizing knowledge for domain-specific applications has challenged academics and practitioners due to the abundance of information and the difficulty of categorizing it. In this review of related research, we maintain our focus on research concerning web directory search environments. We study three main topics: ontology and web directories, automatic and semiautomatic search environments, and plants web resources.

1-Ontology and web directory

In the first section we review the ontology concept. In fact, knowledge is classification, and without classification there could be no thought, action or organization. Reference [2] mentioned that one of the most important problems for the web is the huge amount of information without any structure. Ontology is an agreement between agents that exchange information. The agreement is a model for structuring and interpreting exchanged data and a vocabulary that constrains these exchanges. Using ontology, agents can exchange vast quantities of data and consistently interpret them.

Many applications leverage ontology as an effective approach for constructing a web directory. Reference [3] states that ontology is used for web mining of home pages; in this work the employed ontology is handmade. Some projects use ontology to do online classification of documents [4]: for the different meanings of query terms, different classes are created and the documents in the result list are put in the appropriate classes. The ontology used in this case is WordNet, which is a general ontology. Reference [5] defined an ontology for email and performs classification with it; Protégé is used for the ontology definition and CLIPS is employed for reasoning on it. In most of these works a ready-made ontology is used, which prevents extending the proposed methods to new domains. Reference [6] proposed an approach that uses an automatic method for the creation of an ontology. Although it might not be suitable for reasoning purposes, it is a good facility for extending the search to the whole web space.

2-Automatic and semiautomatic search environments

One approach for creating an effective ontology is using a vertical search engine as a tool. Vertical search engines use a focused crawler for a specific domain. The motivation for focused crawling comes from the poor performance of general-purpose search engines, which depend on the results of generic web crawlers. The focused crawler is a system that learns a specialization from examples, and then explores the web, guided by a relevance and popularity rating mechanism.

Reference [7] suggested a framework using a combination of link structure and content similarity. Experiments in that paper show that considering link structure and content together can improve the quality of retrieved information. Reference [8] proposed a new method based on a content block algorithm to enhance the focused crawler's ability to traverse tunnels. The novel algorithm avoids granularity becoming too coarse when evaluating the whole page, and also avoids granularity becoming too fine, based on link context. An intelligent crawling technique has been used which can dynamically adapt to the particular structure of the relevance predicate. In reference [9] an adaptive focused crawler based on ontology is used, which includes two parts: a topic filter and a link forecaster. The topic filter can filter out the web pages
that have already been fetched but are not related to the topic; the link forecaster can predict topic links for the next crawl, which can guide the crawler to fetch as many topic pages as possible within a topic group by traversing unrelated links. The proposed method uses an ontology for crawling, and during this process it is able to make the ontology more concrete. It also uses the ontology learning concept, which can be useful for our project as future work.

Reference [10] proposed a framework with a semi-automatic approach. It combines the abilities of human and machine and avoids the disadvantages of using either of the two methods separately. It seems that the automatic part of this framework needs to be improved: a meta-search section has been employed, and in order to gather information and web pages for this purpose a multipurpose search engine is used [11]. Using a multipurpose search engine has its own disadvantages: 1) an enormous number of unusable web pages can reduce the speed of the process; 2) precision is not acceptable, because a huge number of web pages must be searched; 3) the multipurpose repository must be kept up to date, which consumes time. On the other hand, only a limited section of the web is covered by even the biggest search engine. Reference [12] mentioned that using traditional methods of information retrieval can help us to increase efficiency. Using the vector space model, web pages which are related to a specific domain are selected and all links inside them are crawled. Reference [13] proposed a novel approach based on manifold ranking of document blocks to re-rank a small set of documents initially retrieved by some existing retrieval function. The proposed approach can make full use of the intrinsic global manifold structure of the document blocks by propagating ranking scores between the blocks on a weighted graph. The manifold ranking process can make use of the relationships among document blocks to improve retrieval performance; however, link structure among web pages has not been considered. Reference [14] has used the domain graph concept for web page ranking. For every page in the kernel a graph is constructed from the pages that point to the selected page. A multipurpose search engine is leveraged in this method because it can determine the list of web pages which point to a specific page. After extraction of this graph the web pages are ranked, and based on this ranking the web pages are arranged.

3-Plants web resources

There are some existing web directories in this area. Protecting American agriculture is the basic charge of the U.S. Department of Agriculture, USDA (www.usda.gov). APHIS (Animal and Plant Health Inspection Service) provides leadership in ensuring the health and care of animals and plants. Neither of them has concentrated on the botany topic. The Internet Directory of Botany (IDB) is an index to botanical information available on the internet (www.botany.net). It is created manually by botany specialists, but it is a well-organized alphabetic web directory, which we choose as a benchmark.

III. Research Model

In this study we improve the framework proposed by [10], which includes three main steps. Instead of using meta-search, we use a vertical search engine with a focused crawler. For this we need a kernel for our crawler and a vector that represents the domain literature. So we change step 1 and step 2 of this framework, while step 3 remains unchanged, as shown in Fig. 2. In our framework, after constructing the anchor directory we initiate a new set of URLs which are related to the plant domain. In the third step we design a focused crawler that works based on cosine similarity. For this purpose, we developed a software tool that crawls links from the kernel using the cosine similarity method. Gathered web pages are put in a repository, and finally a heuristic filtering is done. In this section we describe the proposed framework.

1. Anchor directory

Based on our review of web search engines and directories, we choose the DMOZ directory as the anchor directory (www.dmoz.org). This directory is employed by most search engines. In addition, experts refer to this web directory for specialized searches in botany. Some refinement of it is needed to construct the anchor web directory.
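The three steps of this framework can be sketched as a skeleton. Every function name and data field below is an illustrative assumption — the paper does not describe its tool at code level:

```python
def refine_anchor_directory(dmoz_nodes):
    """Step 1: keep only the plant-related nodes of the anchor directory."""
    return [node for node in dmoz_nodes if "plant" in node["topic"].lower()]

def build_kernel(seed_urls, expert_terms):
    """Step 2: the crawl kernel is a URL set plus an expert-built literature vector."""
    return set(seed_urls), set(expert_terms)

def focused_crawl(kernel_urls, fetch_terms, is_relevant):
    """Step 3: fetch pages reachable from the kernel and keep only the
    domain-relevant ones; kept pages would later be filtered by experts."""
    repository = []
    for url in kernel_urls:
        if is_relevant(fetch_terms(url)):
            repository.append(url)
    return repository

# Tiny dry run with stub data:
nodes = [{"topic": "Plants/Classification"}, {"topic": "Sports/Football"}]
urls, vector = build_kernel(["http://example.org/botany"], ["botany", "species"])
kept = focused_crawl(urls,
                     fetch_terms=lambda u: {"botany", "taxonomy"},
                     is_relevant=lambda terms: bool(terms & vector))
print(len(refine_anchor_directory(nodes)))  # 1
print(kept)  # ['http://example.org/botany']
```

In the real system, `fetch_terms` would download and tokenize a page, and `is_relevant` would be the cosine-similarity threshold test described for the vertical search engine.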
[Fig. 2 pipeline: Anchor directory → Kernel set of URLs and literature vector → Focused crawler → Heuristic filtering]
Fig. 2. An improved semi-automatic framework for constructing web directories.

2. Kernel & literature vector creation

We need a kernel for gathering information through the web. For this purpose, a set of URLs should be created. We also require a vector of terms specific to the botany domain. The creation of the set and the vector has been done by experts: 110 web pages have been selected and a vector with 750 terms in the botany domain has been created.

3. Vertical search engine

To fill in the plant web directory we use a focused crawler that works based on a similarity search between web pages and a vector that describes the plant domain. In a content-based focused crawler, a vector of the terms in the current web page is created; after that, the similarity between the current page and the domain vector is calculated. If this similarity exceeds a threshold value, the page is saved in the repository and crawling continues; otherwise the page is discarded and crawling along that path stops.

4. Heuristic filtering

Finally, the output of step 2 and step 3 is evaluated by a group of experts, who can score the output of the crawling.

IV. Methodology

Using the framework described above, we developed a web directory called PCI (Plant Classification and Identification), which includes plant web sites. Based on our review of web search engines, we chose the DMOZ directory as the anchor directory because of its comprehensive plant directory and its wide acceptance, as seen in the fact that it is used by such major search engines as Google, AOL and Netscape Search [10]. We use the English language in our research. On the first attempt we aggregated 1180 nodes from DMOZ, and after filtering the nodes we obtained 331 nodes corresponding to plant classification and identification. Table 1 summarizes statistics of the PCI web directory.

TABLE 1
SUMMARY STATISTICS OF PCI WEB DIRECTORY

In order to compare our proposed approach, we selected a benchmark web directory called IDB (Internet Directory for Botany, www.botany.net/IDB/botany.html); it is an index to botanical information available on the internet.

V. Experimental Procedure

A. Design and Participants

We have measured the effectiveness, efficiency and satisfaction of our PCI web directory in comparison with IDBAL as a benchmark. It is one of the best designed web sites for botany, but it is not scalable, user friendly or efficient. Ten experts in botany were invited, and with them we measured the criteria mentioned above. Two groups of experts, graduate and undergraduate, with different ages and levels of experience, were selected for testing and comparing PCI and IDBAL. We are interested in observing effectiveness, efficiency and satisfaction between the two groups for PCI and IDBAL via the experts' testing. Basically, we measure effectiveness by three values: precision, recall and F-value. Efficiency is the time taken by the users to answer some questions. Based on some questions we are able to measure satisfaction. For calculating efficiency, we organized an experiment on the search of a specific topic by experts in both web directories (benchmark and proposed web directory).
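The page-relevance test used by the focused crawler (step 3 of the research model) can be sketched as cosine similarity over term-frequency vectors. The function names, the toy domain vector and the threshold of 0.3 are illustrative assumptions, not values from the experiments:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def keep_page(page_text, domain_vector, threshold=0.3):
    """Save the page (and keep crawling its links) only if it looks on-topic."""
    page_vector = Counter(page_text.lower().split())
    return cosine(page_vector, domain_vector) >= threshold

# Toy literature vector standing in for the experts' 750-term vector:
domain = Counter({"plant": 5, "botany": 4, "species": 3, "classification": 3})
print(keep_page("Botany notes on plant species classification", domain))  # True
print(keep_page("Latest football transfer news", domain))                 # False
```

Pages above the threshold go to the repository and their links are crawled further; pages below it are dropped and that crawl path ends.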
At the end, using a questionnaire about different aspects of the web directory (i.e., its helpful points and its positive and negative points), we are able to measure the satisfaction parameter (Table 2). We ranked each question between 1 and 5, where 1 is completely satisfied and 5 is not satisfied.

B. Hypothesis testing

Our objective focused on using a framework that combines experts' knowledge with automatic approaches to help with searching and browsing information in a specific domain such as botany, and on improving efficiency and speed in this situation. Our hypothesis is that a web directory constructed for botany in this way helps specialists to search and browse more efficiently and effectively, and satisfies users better than the benchmark.

C. Performance measure

We defined the following terms for evaluating the proposed framework:

Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses):

precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved:

recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

F-value = 2 * recall * precision / (recall + precision). This parameter is used to balance recall and precision. With these three parameters we can define quality.

Efficiency (speed up): comparing the speed of users when they use the web directory for searching and browsing a specific topic versus multipurpose search engines.

Satisfaction: how useful the web directory is for its users.

VI. Results and Discussion

In this section, we report and discuss the results of our experiment. For this experiment (Table 3) a web directory was organized (PCI) and compared with a benchmark (IDBAL). As can be seen, the hypothesis is supported for most criteria. Efficiency is not better in PCI than in IDBAL; this is due to the sorted and well-organized web pages in IDBAL, and it seems there is a difficulty in the structure and organization of the PCI web directory.
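As a quick arithmetic check, the PCI row of Table 3 is consistent with the F-value formula defined above:

```python
precision, recall = 0.87, 0.29          # PCI mean precision and recall from Table 3
f_value = 2 * recall * precision / (recall + precision)
print(round(f_value, 3))  # 0.435, matching the PCI F-value reported in Table 3
```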
TABLE 2
COMMENTS BASED ON EXPERIMENTS

Comments on PCI web directory                     Number of subjects expressing the meaning
Easy and nice to use                              1
Useful and with relevant results                  4
Good way to organize information                  1
Not enough information for browsing               4
No comments/unclear comments                      0

Comments on IDBAL web directory                   Number of subjects expressing the meaning
Not useful/not good                               2
Not clear for browsing/gives irrelevant results   2
Easy to navigate or browse                        3
No comments/unclear comments                      3

VII. Conclusions and Future Works

Constructing a specific domain directory which includes high quality web pages has challenged the developers of web portals. In this paper we proposed a semiautomatic framework that combines human knowledge with automatic techniques. Experts evaluated the resulting web directory against a benchmark, and most expected results were satisfied. For future work we will investigate new intelligent techniques for improving precision and recall in the focused crawler. It seems that using a dynamic vector instead of a vector with constant terms can improve effectiveness; this is feasible by adding and deleting terms in the reference vector. Another direction is refining the PCI web directory to make it more efficient. Other specific domains can also be addressed with this framework.
TABLE 3
STATISTICAL RESULTS OF HYPOTHESIS TESTING (S = supported)

MEASURE        PCI Mean    IDBAL Mean    Result
Precision      0.87        0.75          S
Recall         0.29        0.17          S
F-value        0.435       0.255         S
Efficiency     173         160           NOT S
Satisfaction   3.1         3.3           S
References

[1] H. Topi and W. Lucas, "Mix and match: combining terms and operators for successful web searches," Information Processing and Management 41 (2005) 801-817.
[2] H. P. Alesso and C. F. Smith, Thinking on the Web, John Wiley & Sons, New Jersey, 2006.
[3] Ee-Peng Lim and Aixin Sun, "Web mining – the ontology approach," International Advanced Digital Library Conference, Noyori Conference Hall, Nagoya University, Japan, August 25-26, 2005.
[4] E. W. De Luca and A. Nürnberger, "Improving ontology-based sense folder classification of document collections with clustering methods," Proc. 2nd Int. Workshop on Adaptive Multimedia Retrieval, 2004.
[5] K. Taghva, J. Borsack, J. Coombs, A. Condit, S. Lumos and T. Nartker, "Ontology-based classification of email," ITCC 2003: International Conference on Information Technology: Coding and Computing, 2003.
[6] M. Khalilian, K. Sheikh and H. Abolhassani, "Classification of web pages by automatically generated categories," in Innovations and Advanced Techniques in Systems, Computing Sciences and Software Engineering, Springer, 2008, ISBN 978-1-4020-8734-9.
[7] M. Jamali et al., "A framework using combination of link structure and content similarity."
[8] N. Luo, W. Zuo and F. Yuan, "A new method for focused crawler cross tunnel," RSKT 2006, pp. 632-637.
[9] C. Su and J. Yang, "An efficient adaptive focused crawler based on ontology learning," 5th International Conference on Hybrid Intelligent Systems (HIS), IEEE, 2005.
[10] W. Chung, G. Lai, A. Bonillas, W. Xi and H. Chen, "Organizing domain-specific information on the web: an experiment on the Spanish business web directory," International Journal of Human-Computer Studies 66 (2008) 51-66.
[11] M. Khalilian, K. Sheikh and H. Abolhassani, "Controlling threshold limitation in focused crawler with decay concept," 13th National CSI Conference, Kish Island, Iran, 2008.
[12] F. Menczer, G. Pant and P. Srinivasan, "Topic-driven crawlers: machine learning issues," ACM Transactions on Internet Technology, submitted 2002.
[13] X. Wan, J. Yang and J. Xiao, "Towards a unified approach to document similarity search using manifold ranking of blocks," Information Processing and Management 44 (2008) 1032-1048.
[14] M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles and M. Gori, "Focused crawling using context graphs," Proc. 26th International Conference on Very Large Data Bases (VLDB), Cairo, Egypt, 2000.
[15] D. Cai, S. Yu, J. Wen and W.-Y. Ma, "VIPS: a vision-based page segmentation algorithm," Microsoft Technical Report MSR-TR-2003-79, 2003.