ISSN: 2250–3676

SANDEEP JOSHI* et al. [IJESAT] INTERNATIONAL JOURNAL OF ENGINEERING SCIENCE & ADVANCED TECHNOLOGY

Volume-2, Issue-2, 227 – 232

A VERTICAL META SEARCH ENGINE WITH QUERY EXPANSION USING ARTIFICIAL RELEVANCE FEEDBACK MECHANISM

Sandeep Joshi1, Satpal Singh Kushwaha2

1 Associate Professor, Dept. of Computer Science, SEC, Rajasthan, India, [email protected]
2 Research Scholar, Dept. of Computer Science, SEC, Sikar, Rajasthan, India, [email protected]

Abstract

The World Wide Web grows day by day, and with this rapid growth information extraction[1] on the Internet is becoming increasingly important. At present there are millions of websites and billions of pages available on the Internet. A large amount of online information resides on the invisible web[2]: web pages generated dynamically from databases and other data sources hidden from the user. Such pages are not indexed under a static URL but are generated when queries are submitted through a search engine (we refer to these as specialized search engines or vertical search engines). OpenFind states that it indexes 3.5 billion Web pages; Google claims 2.4 billion, AlltheWeb 2.1 billion, Inktomi a little more than 2 billion, WiseNut 1.5 billion and AltaVista 1 billion. No search engine indexes more than one third of the total size[3] of the web. Information retrieval from this large collection of web pages is therefore a crucial task. The user query plays a vital role in the information retrieval process, so several methods have been devised to assist the user in expanding the query and thereby improve retrieval results. In this paper we present a Vertical Meta Search Engine with query expansion using an Artificial Relevance Feedback mechanism. The proposed system provides a simple way of expanding queries based on relevance feedback and reduces the user’s searching time, requiring fewer hits to reach accurate results.

Index Terms: Vertical search engines, invisible web, query expansion.

1. INTRODUCTION

A Meta search engine (also known as a multi-threaded engine) is a search tool that sends the user’s search query[4] in parallel to several search engines, Web directories and sometimes to the so-called Invisible (Deep) Web. The invisible web is a collection of online information that is not indexed by traditional search engines. After retrieving the results, the Meta search engine removes duplicate links and, according to its algorithm, combines and ranks the results into a single merged list, which is finally presented to the user. Unlike individual search engines and directories, Meta search engines do not have their own database of web pages and do not accept URL submissions.

Query expansion is one of the important parts of the whole searching process. In the proposed system, query expansion is performed with a relevance feedback technique. The system receives the query from the user and finds the unique terms (information terms) on the web; the high-frequency terms among these are then presented to the user, who can choose a suitable word to expand the query.

This paper also reports some real-time experiments, and the results show that the proposed system can shorten the overall searching process. Our experiments show that in 70 percent of cases the suggested terms for query expansion are correct and the system presents a suitable vertical search engine to the user.

2. MOTIVATION

1. Increased search coverage, because a Meta search engine returns results from multiple search engines.
2. Fewer hits are needed to reach the relevant document, using a single query submission.
3. The top results are taken from the ranked list of the vertical search engine.
4. Query expansion becomes less laborious, because the user does not have to store and maintain large dictionaries.

IJESAT | Mar-Apr 2012 Available online @ http://www.ijesat.org


3. OPERATIONS

The operations of the proposed Vertical[5] Meta search engine can be divided into the following two parts:

3.1 Offline Operation
3.2 Online Operation

3.1 Offline Operation

This operation consists of creating the database of vertical search engines. For each vertical search engine, the database stores the relevant terms that describe the search engine, the URL of the search engine, its connection parameters and its domain.
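One way to picture an entry of this database is the following minimal Python sketch. The class name, the field names, the `query_param` key and the example frequencies are assumptions made for illustration; the paper only lists the four kinds of information that are stored.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class VerticalEngineRecord:
    """One entry of the vertical search engine database (illustrative field names)."""
    url: str                                   # URL of the vertical search engine
    domain: str                                # subject area, e.g. "jobs"
    connection_params: Dict[str, str] = field(default_factory=dict)  # how to submit a query
    terms: Dict[str, float] = field(default_factory=dict)            # descriptive term -> frequency


# Example entry; the parameter name and the frequencies are invented for illustration.
record = VerticalEngineRecord(
    url="http://www.monsterindia.com",
    domain="jobs",
    connection_params={"query_param": "q"},
    terms={"job": 1.0, "career": 0.5, "resume": 0.25},
)
```

Keeping the descriptive terms as a mapping from term to frequency makes the later comparison against query expansion terms a simple dictionary lookup.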

For getting the relevant terms which describe a vertical search engine[6], a neighborhood based topic identification method is used. In this process the relevant terms are obtained from neighboring web pages, that is, pages which contain a link to that vertical search engine. The following neighborhood[7] based topic identification method is used to collect terms relevant to the vertical search engine.

Back link method

In this method the relevant terms are collected from those pages which link to the home page of the vertical search engine. We send the URL of the vertical search engine (for example http://www.monsterindia.com, a job search engine) to a general purpose search engine such as http://www.google.com; from the returned results we extract the unique words which best describe the vertical search engine. In the case of http://www.monsterindia.com the relevant terms will be job, career, resume, biodata, interview, etc. These unique words indicate that the search engine is a job search engine. The algorithmic steps are as follows.

Algorithm

1. Web pages that have links to the home page of the vertical search engine are identified. This is done simply by sending the vertical search engine’s URL to a general purpose search engine; here we use Google.
2. The identified web pages are downloaded.
3. Distinct terms are extracted from the web documents.
4. The extracted distinct terms are stored in the search engine selection index with their normalized frequencies.

The normalized frequency of a distinct word is calculated as follows. Let HW be the frequency of the word that occurs most often in the extracted word list. The normalized frequency of a word w is then

$$\text{normalized frequency}(w) = \frac{\text{occurring frequency of } w}{HW}$$

We limit the number of words stored in the search engine selection database to 15. The distinct words are stored with their frequencies in the search engine selection index.

3.2 Online Operation

Online operation can be divided into the following two parts:

3.2.1 Query expansion
3.2.2 Search engine ranking

3.2.1 Query expansion

In 1960, Maron and Kuhns first mentioned query modification, suggesting that terms closely related to the original query terms can be added to the query and thus retrieve more relevant documents.

There are three approaches to user query expansion[8]:

a. Manual Query Expansion
b. Interactive Query Expansion
c. Automatic Query Expansion

Manual and interactive query expansion require the user’s involvement in the query expansion process. The proposed system uses automatic query expansion, which is the process of adding terms or phrases to the original query to improve retrieval performance without the user’s intervention. Several methods for query expansion have been developed and used. Some methods obtain relevant terms from a thesaurus[9], but creating and maintaining the thesaurus is laborious work. Other methods use latent semantic analysis, where the user’s query terms are semantically analyzed and the relationships between them are used to expand the query, but that approach also relies on large dictionaries.


Recent methods for query expansion include mining user logs[10] and constructing user profiles[11]. Web based query expansion is the process in which information terms for the user query are retrieved directly from the web: the user query is sent to a general purpose search engine, and the unique or distinct words with high information content are extracted from the returned results. Because we use automatic[12] query expansion, the top three unique words with the highest frequency are automatically selected to expand the user query.
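The web based expansion described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the stop-word list, the exclusion of the query's own words, the alphabetical tie-breaking and the sample result texts are all assumptions. The `+`-joined form of the expanded query mirrors the Fire Query column of the result tables in Section 4.3.

```python
import re
from collections import Counter

# Assumed stop-word list; the paper does not say which words are filtered out.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "to", "on", "with", "is"}


def suggest_expansion_terms(query, result_texts, top_n=3):
    """Return the top_n most frequent distinct words found in the result texts."""
    query_words = set(query.lower().split())
    counts = Counter()
    for text in result_texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            # Assumption: stop words and the query's own words are not suggested.
            if word not in STOP_WORDS and word not in query_words:
                counts[word] += 1
    # Decreasing frequency; ties broken alphabetically so the output is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


def expand_query(query, term):
    """Join the query words and one suggested term with '+'."""
    return "+".join(query.lower().split() + [term])


# Stand-in for text returned by a general purpose search engine (invented sample).
sample_results = [
    "cheap hotels in london",
    "cheap london hostels and cheap hotels",
    "book cheap hotels and hostels",
    "flights",
]
terms = suggest_expansion_terms("Book London", sample_results)
expanded = [expand_query("Book London", term) for term, _ in terms]
```

With the invented sample above, `terms` holds the three most frequent words with their counts and `expanded` holds one expanded query per suggested term.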

The proposed algorithm proceeds as follows:

1. The user submits the query to the proposed system.
2. The user query is sent to a general purpose search engine.
3. The distinct words are extracted from the results returned by the search engine.
4. The frequency of each extracted distinct word is calculated.
5. The unique words are arranged in decreasing order of their frequencies.
6. Finally, the top three words are chosen for the query expansion.

3.2.2 Vertical Search Engine Ranking

In this process the best vertical search engines are picked out from the list of vertical search engines. The relevant terms retrieved in the query expansion step are compared with the representative terms of each vertical search engine stored in the search engine database. After the matching process is finished, the top two or three search engines are picked from the search engine database. Search engines are selected on the basis of the relevance factor. The relevance factor of a search engine e for a given set W = (w1, w2, …) of query expansion terms is as follows:

$$\text{Relevance Factor}(e, W) = \sum_{i} f_i \cdot c_i$$

where $c_i$ is the number of occurrences of $w_i$ counted in the query expansion process and $f_i$ is the frequency[13] of term $w_i$ in the search engine selection index for e. After the search engine ranking[14], the results are presented to the user. The results contain the most suitable vertical search engines for the expanded user query.

4. EXPERIMENTAL RESULTS

Real-time queries are supplied to the proposed system. We have divided the experimental results into three parts, as follows:

4.1 Results during creation of search engine selection index
4.2 Query Expansion Results
4.3 Search Engine Ranking Results

4.1 Results during creation of search engine selection index

This section presents some results obtained during creation of the search engine selection index. To create the index, a total of 69 vertical search engines were selected from different areas such as medical, news, blogs, people search, source code, etc. In the search engine selection index, the candidate (representative) terms related to a vertical search engine are stored with their frequencies, in decreasing order of frequency. The index is used at the time of search engine ranking. The following table lists the URLs of some vertical search engines with their representative terms.

| Website URL | Distinct words with high frequency |
| --- | --- |
| www.animalsearch.net | Animal, wallpaper, Pets, desktop, originals, wildlife, include, pet, website, insects, Dogs, cats, birds, categories, breed, Enter, charolais |
| www.musicgama.com | Gama, flash, media, music, song, lyrics, opacity, plain, volar, videos, haces, open, bomb, absolute, Indian, Pictures, license, mp3 |
| http://www.artcyclopedia.com | Art, artcyclopedia, online, fine, database, viewed, deals, john, Canadian, founded, site, malyon, museum, quality, artist, website, names, artists, museums |
| http://www.medchecker.com/ | Medical, information, checker, engine, drug, check, med, info, health, Lookup, photos, WebMD, seo, conditions, Analysis, online, medications, interaction, advice |

Table-1: Representative terms with URL of Vertical Search Engine
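Index entries like those behind Table-1 are described in Section 3.1 as distinct terms stored with normalized frequencies, with a limit of 15 words. The following Python sketch shows that step under stated assumptions: the words are taken to be already extracted from the back link pages, and the function name, the lower-casing and the alphabetical tie-breaking are illustrative choices not specified in the paper.

```python
from collections import Counter

MAX_TERMS = 15  # word limit stated in Section 3.1


def build_index_entry(extracted_words, max_terms=MAX_TERMS):
    """Map each kept word to (its frequency / HW), highest frequency first."""
    counts = Counter(word.lower() for word in extracted_words)
    if not counts:
        return {}
    highest = max(counts.values())  # HW: frequency of the most frequent word
    # Decreasing frequency; ties broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:max_terms]
    return {word: freq / highest for word, freq in ranked}


# Invented word list standing in for terms extracted from back link pages.
words = ["job"] * 4 + ["career"] * 2 + ["resume"]
entry = build_index_entry(words)
```

Because every frequency is divided by HW, the most frequent word always receives the value 1.0 and all other words fall between 0 and 1.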


4.2 Query Expansion Results

| User Query | Terms for query expansion with their frequencies |
| --- | --- |
| job | [employment] => 9, [Christian] => 4, [career] => 4, [au] => 3, [online] => 3, [thousands] => 2, [federal] => 2, [government] => 2, [force] => 2, [information] => 2, [defense] => 2, [official] => 2, [largest] => 2, [opportunities] => 2, [site] => 2, [seekers] => 2 |
| travel | [flights] => 25, [cheap] => 11, [Australia] => 9, [car] => 5, [book] => 5, [hotels] => 5, [world] => 4, [airlines] => 4, [airfares] => 4, [agents] => 4, [virgin] => 4, [Qantas] => 4, [hire] => 4, [insurance] => 4, [au] => 3, [international] => 3, [holiday] => 3, [tours] => 3, [deals] => 3, [destination] => 3, [jetstar] => 3, [compare] => 3, [online] => 3, [holidays] => 3, [cruises] => 3, [endless] => 2, [guides] => 2, [possibilities] => 2, [simple] => 2, [choosing] => 2, [domestic] => 2, [packages] => 2, [house] => 2, [home] => 2 |
| Book London | [cheap] => 24, [hotels] => 17, [hostels] => 10, [flights] => 8, [England] => 5, [hostel] => 5, [online] => 4, [flight] => 3, [youth] => 3, [luxury] => 3, [booking] => 2, [reviews] => 2, [guide] => 2, [accommodation] => 2, [easy] => 2 |
| java | [arraylist] => 6, [api] => 4, [Boolean] => 4, [techlead] => 3, [ebook] => 3, [team] => 3, [ee] => 3, [increasing] => 3, [docs] => 3, [timeout] => 3, [language] => 3, [programming] => 3, => 2, [shared] => 2, [swing] => 2, [platform] => 2, [editor] => 2, [blog] => 2, [document] => 2, [based] => 2, [trademarks] => 2, [examplejava] => 2, [dummies] => 2, [ejb] => 2, [mdb] => 2, [automation] => 2, [enhanced] => 2, [util] => 2, [format] => 2 |

Table-2: Terms for query expansion with their frequencies
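The frequencies in Table-2 are the counts ci that enter the relevance factor, Relevance Factor (e, W) = ∑ (fi * ci). A minimal Python sketch of how such counts could be combined with index frequencies to rank search engines is given below. The three counts for “Book London” come from Table-2; the index entries (URLs and frequencies) are hypothetical, and the function names are illustrative rather than the authors' own.

```python
def relevance_factor(index_terms, expansion_counts):
    """Sum f_i * c_i over the expansion terms; terms missing from the index contribute 0."""
    return sum(index_terms.get(term, 0.0) * count
               for term, count in expansion_counts.items())


def rank_engines(selection_index, expansion_counts, top_k=3):
    """Return the top_k (url, relevance factor) pairs, highest factor first."""
    scored = [(url, relevance_factor(terms, expansion_counts))
              for url, terms in selection_index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# c_i: top three counts for the query "Book London", taken from Table-2.
expansion_counts = {"cheap": 24, "hotels": 17, "hostels": 10}

# f_i: hypothetical index entries; URLs and frequencies are invented.
selection_index = {
    "http://hotels.example/": {"hotels": 1.0, "cheap": 0.5},
    "http://flights.example/": {"flights": 1.0, "cheap": 0.25},
    "http://music.example/": {"music": 1.0, "song": 0.5},
}

ranking = rank_engines(selection_index, expansion_counts)
```

For the first hypothetical engine the factor is 0.5 × 24 + 1.0 × 17 = 29; an engine that shares no term with the expansion set scores 0 and falls to the bottom of the ranking.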

4.3 Search Engine Ranking Results

The expanded user query terms are compared with the representative terms of the vertical search engines stored in the search engine selection index. On the basis of the relevance factor, the top three vertical search engines are selected and presented to the user together with the relevance factor and the expanded query.

4.3.1 User Query: “Book London” Related Results

| Suggested Term | Url | Relevance Factor | Fire Query |
| --- | --- | --- | --- |
| cheap, hotels, hostels | http://www.tripadvisor.in/ | 17 | book+london+cheap, book+london+hotels, book+london+hostels |
| cheap, hotels, hostels | http://www.aardvarktravel.net/ | 12 | book+london+cheap, book+london+hotels, book+london+hostels |
| cheap, hotels, hostels | http://www.travigator.com/ | 8 | book+london+cheap, book+london+hotels, book+london+hostels |

Table-3: Results for user query “Book London”

4.3.2 User Query: “Video” Related Results

| Suggested Term | Url | Relevance Factor | Fire Query |
| --- | --- | --- | --- |
| videos, funny, online | http://www.videosurf.com/ | 12 | video+videos, video+funny, video+online |
| videos, funny, online | http://www.youtube.com/ | 9 | video+videos, video+funny, video+online |
| videos, funny, online | http://video.google.com/ | 7 | video+videos, video+funny, video+online |

Table-4: Results for user query “Video”

4.3.3 User Query: “Art” Related Results

| Suggested Term | Url | Relevance Factor | Fire Query |
| --- | --- | --- | --- |
| gallery, cultural, artist | http://www.searchgov.com/ | 6 | art+gallery, art+cultural, art+artist |
| gallery, cultural, artist | http://search.nic.in/ | 4 | art+gallery, art+cultural, art+artist |
| gallery, cultural, artist | http://www.artcyclopedia.com/ | 3 | art+gallery, art+cultural, art+artist |

Table-5: Results for user query “art”

5. CONCLUSION

Query expansion is a very important task in the search process. We have presented the results of the web query expansion approach. Our approach removes the human effort of maintaining large dictionaries, which is very laborious work; the user need not store or maintain such dictionaries. We have applied this approach to a vertical Meta search engine, where the results show that the user query is expanded appropriately and a suitable search engine is selected every time. As future work, user feedback can be used to improve the search engine selection index: the frequency of the representative terms stored in the index can be increased on the basis of user feedback to improve the relevance of the vertical search engine.

ACKNOWLEDGEMENT

First and foremost, we thank Lord Almighty for the grace, strength and hope to make our endeavor a success.

We are especially thankful to Google Search Engine from the bottom of our heart for returning useful reference pages, articles and free eBooks. We would be failing in our duty if we did not express our indebtedness for the blessings of our parents.

REFERENCES

[1] Rahardjo, B. and Yap, R. Automatic Information Extraction from Web Pages, SIGIR, 2001, 430-431.
[2] Gravano, L., Ipeirotis, P. G. and Sahami, M. QProber: A System for Automatic Classification of Hidden-Web Databases. ACM Transactions on Information Systems (TOIS), Vol. 21, No. 1, 2003.
[3] Lu Y., Meng W., Shu L., Yu C., and Liu K. Evaluation of result merging strategies for metasearch engines. WISE Conference, New York, NY, 2005, pp. 53–66.
[4] E.N. Efthimiadis, Query expansion, Annu. Rev. Inform. Syst. Technol. 31 (1996) 121–187.
[5] Harter, Stephen P, “Online Information Retrieval: Concepts, Principles, and Techniques”, Orlando: Academic Press, 1986.
[6] Chau M., Spidering and filtering web pages for vertical search engine, in Proc. American Conf on Information System (AMCIS 2002 Doctoral Consortium, Dallas, Texas), 2002.
[7] Meng W., Wu Z., Yu C., and Li Z. A highly scalable and effective method for metasearch. ACM TOIS, 19(3):310–335, 2001.
[8] Atsushi Sugiura and Oren Etzioni, “Query routing for web search engines,” In the proceedings 9th International World Wide Web Conference, Amsterdam, Netherlands, May 2000.
[9] L. Sangoi Pizzato, and V. Strube de Lima, “Evaluation of a Thesaurus-Based Query Expansion Technique”, PROPOR’2003. Faro, Portugal, June 26-27, 2003.
[10] H. Cui, J.-R. Wen, W.-Y. Ma, Query expansion by mining user logs, IEEE Trans. Knowl. Data Eng. 15 (4) (2003) 829–839.


[11] S. Gauch, J.B. Smith, Search improvement via automatic query reformulation, ACM Trans. Inform. Syst. 9 (3) (1991) 249–280.
[12] M. Mitra, A. Singhal, C. Buckley, Improving automatic query expansion, in: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, 1998, pp. 206–214.
[13] C. Carpineto, G. Romano, and V. Giannini, “Improving retrieval feedback with multiple term-ranking function combination”. TOIS 20(3), 2002, pp. 259-290.
[14] F. Radlinski, T. Joachims, Active Exploration for Learning Rankings from Clickthrough Data, Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2007.

BIOGRAPHIES

Sandeep Joshi holds an M. Tech. in Computer Science & Engineering and has vast experience in teaching.

Satpal Singh Kushwaha is pursuing his M.Tech from Rajasthan Technical University, Kota (Rajasthan).
