Integration of Search Engine and Recommender ...

4 downloads 35376 Views 203KB Size Report
Integration of Search Engine and Recommender System For the Holy Qur'an. Personalization ... The user rating became the search engine ranking component because the ..... We are Best Aware of that which they allege.”). Fig. 7. Search ...

Integration of Search Engine and Recommender System For the Holy Qur’an Personalization Lukman Heryawan University of Gadjah Mada Indonesia Abstract. This paper proposed the idea of combining search engine technology and recommender system to find, rank, and recommend articles and websites which are related to topics in the Holy Qur’an. This idea can also be applied to the domain of problem such as programming domain, which refers to the syntax references of the related programming language. In order to find articles and websites scattered inside the hyperspace, an atomic keyword is used, which will be explained later in this paper. The search engine rank and user rating method is combined in order to rank the search result. And for the recommendation process, it uses the collaborative filtering technique. Hence, integration of

search engine and recommender system is expected to encourage people more enthusiastic in learning the Holy Qur’an and to actively contribute in maintaining the purity and correctness of the Holy Qur’an in an easy way, by giving a correct rating to the articles and websites found by the system. Keywords : Holy Qur’an, Search Engine, Recommender system, atomic keyword, expert user, collaborative filtering

I. INTRODUCTION The Holy Qur’an is divided into the discussed topics. The topics in the Holy Qur’an correlate with the verses of the Holy Qur’an. With the advances of technology, the Holy Qur’an can be accessed using the computers and internet in the form of Digital Qur’an and online Holy Qur’an. However, the current Digital Qur’an does not experience any changes in the contents, and the only change is in the accessing method (can be accessed from computers and has a word matching search function). The current Digital Qur’an structure is shown in Figure 1. The idea of this paper is to modify the content structure of digital Qur’an, thus became a dynamic content (along with an up to date articles that discuss the topics inside the Holy Quran, which is distributed in many sources) and more personalized (the topic articles matches user preference, thus the user are more enthusiastic in learning the Holy Qur’an). Those features can be achieved by using the verses in the Holy Qur’an, which is called atomic keyword in this paper, in order to find the articles scattered in various websites in the huge hyperspace. It is expected that by using atomic keywords of the Holy Qur’an verses, which correlates with particular topics in the Holy Qur’an,

the various up to date articles and websites that discuss about topics in the Holy Qur’an can be found. After the information related to the Holy Qur’an topics is found, the search result ranking process is started. The search result ranking is done by combining the search engine rank and the user rating.

Fig. 1. Content Structure of Digital Qur’an

The user rating became the search engine ranking component because the information sources are scattered. With the user rating, it is expected to give high quality contents/spiritual information to the user and to improve the correct understanding about topics in the Holy Qur’an. In other words, the search result articles that contain the Holy Qur’an verses but having an incorrect content (contains propaganda and provocation) will have a low ranking, even with a high search engine rank. Next, after the search results are found and ranked using atomic keyword and the user rating, the next step is to recommend the information from various sources with high quality content to the user. The method for recommending is the collaborative filtering method. Generally, it can be described that the system is the extension or expansion of the current digital Holly Qur’an. The illustration is as follows:

Fig. 2. The Illustration of proposed system

In addition to the expansion of digital Qur’an, this system is also expected to solve the problems in the system/websites which discuss about topics in the Holy Qur’an.

Generally, websites that discussed topics of Al-Qur’an is one way interaction (no personalization) and have a limited article. In this context, one way interaction means that the shown articles on the websites are static or stayed the same for every user and no feedback from the user in the form of rating towards the article being read.

filtering which is the current most successful information filtering method [3].

Even though there are some sites that gave the rating function, there is no explanation about the rating function, and the shown article stayed the same, even after the user rates the article.

Fig. 3. System stages proposed in the paper

Limited articles are the article source/database shown in the websites discussed about topics of the Holy Qur’an, which are usually centralized, made and managed by the experts on the related websites. Even if there is a search function to find another article outside the website, the user will have to fill in the atomic keyword, which is the verses of the Holy Qur’an, and not every user memorized the verses. And if the keyword is too general, for example the user want to find articles about patience, and the user input the keyword ”patient”, then all information in the hyperspace containing the word ”patient” will be shown, even those without any correlations with the Patience topics within the Holy Qur’an. II. RELATED WORKS The system proposed in this paper attempts to integrate the search engine technology with the currently available recommendation system. The search engine technology is used to find as much correct articles as possible and websites which correlates with the topics within the Holy Qur’an. The two prevalent search engine algorithms that can be used in this system to find articles and websites that correlates with the topics of the Holy Qur’an is the PageRank [1] Algorithm which is implemented in Google Search Engine and HITS (Hyperlink-Induced Topic Search) [2] algorithm. In this stage, the first ranking of the search result is also done by both algorithms. This stage is known as the data gathering stage. The search result ranking by the HITS or PageRank algorithm is then combined with the user rating. In this stage, the incorporation of ranking from the search engine with the user rating is expected to give a list of articles/websites which are easy to read and give the correct understanding about the topics in the Holy Qur’an. This stage is also known as the data re-ranking stage. The successfulness of this data re-ranking stage depends entirely from the user activeness in rating the articles. After the data re-ranking stage, the last stage is to recommend the related articles about topics in the Holy Qur’an according to user profiles and preferences. This stage is called as the data recommender stage. The method used in the data recommender stage is the collaborative

Many of recommender system nowadays use collaborative filtering, such as amazon.com (recommends books and music) [4]. Besides, there is also a system that utilizes user tags to find particular information, such as restaurant location, and then recommend the restaurant to the user. This is already done in the foodio54.com (recommends restaurant) [5]. In the current available system, there is no system that integrates the search engine technology and recommender system in the spiritual/religion domain. This is the background of this paper. III. SYSTEM OVERVIEW In the system proposed in this paper, the user can choose the topics by inputting a topic query or choose it directly from the list of topics. If the user inputs a query, the system will do a word matching beforehand, from the input query and the list of topics in the Holy Qur’an. After the topic is found or selected, the search engine will automatically change the user input into an atomic keyword in form of the Holy Qur’an verse that correlates with the topic. The search is done by using the already available search engines, such as Google. If Google is the first step to find the information in the Internet, then this system is expected to be the first step to find spiritual or Islamicrelated information in the Internet which correlates with the topics in the Holy Qur’an. After the search is done, the search result will be reranked and recommended to the user, this is like when the amazon.com recommends books/music/movies to the user. Users can give rating to the information/articles. The search result re-ranking came from the addition of search engine ranking and the user rating. The re-ranked search result is then recommended to the user by using the collaborative filtering method. Thus, there are two things being recommended here, the topic and information (articles and websites) ranking, which correlates with certain topics in the Holy Qur’an. For example, users that read about sharia banking topic should also read about Islamic law topic, and users accessing the sharia banking topic will be given articles/websites which have been ranked based on the Google PageRank and user rating. Rating is given by two kinds of user, expert user and common user. Expert user is a person with an expertise in

Islamic religion and the Holy Qur’an. Expert user can give rating / comments / explanation to the articles from the search result. In order to be considered as an expert user, there is some kind of test to determine the expertise of a user. By default, every user is considered a common user which is only able to give rating (not able to give comments/explanations). Rating from an expert user has a better weight than the rating from a common user [7]. If the expert user gives a positive/ recommended rating, then the ranking of an article will be increased more than when the positive rating is given by a common user, and vice versa for the negative rating.

the various sources on the Internet. Thus, the atomic keyword is used as the keyword to find articles and websites in the hyperspace which discussed the topics in the Holy Qur’an. This is possible because every topic discussion in the Holy Qur’an will always be accompanied with the Holy Qur’an verses/atomic keyword related to the discussion topic. Writing the Holy Qur’an verses on the article that discuss the topics of the Holy Qur’an will improve the legitimacy of the article, thus the content is more accountable. Legitimacy of an article is also determined by how many sources or references that cited to write the article [8]. User Query / User chosen Topic

2

Holy Qur’an 1

Verses

Articles with Holy Qur’an verses

Fig. 5. Process in the data gathering stage Fig. 4. System Overview

Explanation : Rating can be given by the subscribed user. And the unsubscribed user can only searches and reads about Islamic information and topics in the Holy Qur’an. The recommendation for an unsubscribed user is limited to another related topic that the user should read. For example, the user reading an article about patience is also recommended to read articles about sincerity and gracefulness (expected to match the result of collaborative filtering). While the recommendations given to the subscribed user are the same with the unsubscribed user, and also recommendation based on the user profile (profile based filtering) and collaborative filtering in order to display topics and information which is most relevant with the profile and user access history (the same recommender system used in online music subscription service) [6] which will be explained in the next part. IV. SYSTEM DESIGN This system is developed using the same technology with the currently available recommender system. The difference lies only in the information gathering technique and the search result ranking which involves expert user [7,8]. While its recommender system uses collaborative filtering method, which is also used in the current online music subscription service. The articles and websites as the collection/database of items that will be ranked and recommended are spreading in

1. User query / selected topic is searched within the topic database, if it is found, an atomic keyword (the Holy Qur’an verses) from the database that correlates with the topic will be chosen. 2. Selected Al-Qur’an verses / atomic keyword will be used as the search engine query to find articles/websites that discuss topics in the Holy Qur’an and contains the Holy Qur’an verses. Next stage is the data re-ranking stage [13]. The data re-ranking stage is the re-ranking of the search result. This re-ranking is expected to improve the ranking of high quality articles about topics in the Holy Qur’an, and to reduce the ranking of provocative articles (or articles containing propaganda) and false understanding about topics in the Holy Qur’an. The data re-ranking stage involves two kinds of subscribed user, the common user which only able to rate the article, and expert user, which is able to rate and give comment/explanation to an article [9]. Due to the information used in this system came from various sources, the expert user used by the system should also came from various sources. In other words, the system did not specifically provide the expert in the system to regulate the re-ranking of search result, the expert is just a common user which could be considered as an expert by solving some kind of test that is provided by the system in order to determine the expertise level of a user. If a common user passes the test given by the system, then the common user can be considered as an expert user that can give a higher

weighted rating than a common user, and give comments/explanations about the article (e.g. comments that explain about the legitimacy of an article). The comments or explanations are proposed to justify the higher weighted rating given by the expert user. Common User

3.

Old subscribed user.

The algorithm used for unsubscribed user is the Modelbased collaborative filtering algorithm [14], using the item similarities [15] approach, which in this case, the items are articles and topics read by the user. The algorithm to count topic/article similarities are:

Rating Rating



For every article within the collection, count another ‘k’ number of articles having similarities.



The similarities of article i and j is high if there are many users (either the subscribed or unsubscribed) reading both articles. For example, if the user reads article i, thus he/she would read article j, and vice versa.



Thus if the user reads article i, the system will recommend another k-number of articles with high similarities to the article i.

Comment Expertise Test

Expert User

Explanation

Fig. 6. Expert User Determination process

The TABLE 1 shows an illustration about re-ranking that combines search engine rank and user rating, which then became the expected final ranking (articles that are the most worth reading/gives a correct understanding about topics in the Holy Qur’an have the highest ranking, followed by another worth reading articles, and the least legitimate/correct articles in the lowest ranking).

In order to count the item similarities this following formula is used [15] :

TABLE 1. Data re-ranking illustration table

For example : TABLE 2. Illustration of article similarities determination Article

1

User A

read

2

User B

Note : Rating

User C 1 - 5 ≈ recommend - not recommend

After the data is gathered from many sources in the Internet and re-ranking has been done to guarantee the legitimacy and worthiness of the articles and websites shown to the user, the final stage is to recommend the information using the collaborative filtering method. This method is well-explained in [10]. Collaborative filtering is a method to recommend something to the user based on the assumption that similar user (e.g. demographic similarity) has similar preferences [11]. In this paper, the collaborative filtering method used by proposed system is the same with the ones used in the online music subscription service [6][12]. In addition to the categorization of common user and expert user in terms of user rating, there are also three kinds of user in terms of recommendations, they are: 1.

Unsubscribed user which only access the system to read articles about topics in the Holy Qur’an.

2.

Newly subscribed user.

User D

read

3

4

5

1

2

3

read

read

1

1

read

urge to user B

2

1

5

read

read

read

read

read

read

Reading history

3

1

4 1

1 1

2

1

3

4

5

1

1

2

3

1

2 2

Article similarity

User B is an unsubscribed user which recently accessed the system and reads article 3. According to the article similarities, the article 5 will be recommended to the user B because article 3 and article 5 has the highest article similarities. The algorithm used for the newly subscribed user is the demographic filtering algorithm combined with collaborative filtering algorithm. The demographic filtering algorithm is used to group new user into the user class with the same stereotype (such as grouping based on hobby, age, gender, and country). After the new user is inserted into the user class with the same stereotype, the collaborative filtering algorithms is applied. The user-based collaborative filtering [16] approach is used for collaborative filtering. This algorithm is based on the facts that new users are member of a higher user groups with the same stereotype.

The recommended articles and topics for the new user are: the articles with a positive rating and often accessed by users from the same user group. For example: User A is included in group i (based on stereotype). Within group i, such reading history was found:

For example : TABLE 4. Illustration of recommendation generation based on article reading timestamp Timestamp

Articles Read

Article Set Occurrences

support

1

1,2

{1}

50%

2

1,2,3

{2}

75%

3

2,3

{3}

50%

4

4

{2,3}

50%

TABLE 3. Reading history table Article

1

2

User A

recommended to user A

User B

5

User C User D

3

4

5 recommended user A

to

Confidence 2 => 3 = support ({2,3}) / support ({2}) = 66.6% Confidence 3 => 2 = support ({2,3}) / support ({3}) = 100%

2

3

6 1

From the TABLE 3 data, thus article 5 and article 2 will be recommended to user A because article 5 is read 6 times by user C and article 2 is read 5 times by user B and 1 times by user D. The algorithm that used to recommend articles and topics to old subscribed users with some reading history uses the model-based collaborative filtering [14] algorithm. The approach used is the Association Rules [17]. This approach is based on the reading history from the old subscribed user. The main idea is that if is a set of topics / articles that appears often, thus if user A reads article 1 and never read article 3, the system would recommend article 3 to the user A. Association rule is defined as follows: •

Assume I is a set / collection of article and D is a set of transaction / article reading.



Association rule is the implication of: X => Y [ c , s ] where :

-

And c is the probability of (Y|X), which is the confidence where c% transaction within D containing X also contains Y.

-

s is the probability

V. ANALYSIS This system finds articles and websites which discuss about topics in the Holy Qur’an by utilizing atomic keyword that is saved within the system database. The atomic keyword that used in this system is the Holy Qur’an verses which correlates with the topics of the Holy Qur’an. But, in some articles discussing the topics of the Holy Qur’an, the verses are translated into certain languages, and not in the original language, the Arabic language. Due to that condition, the translated version of the Holy Qur’an verses can also be used as the atomic keyword in order to find articles discussing the Holy Qur’an topics in particular language, such as Indonesian language. Here is the search result acquired using Google search engine with an atomic keyword of translated the Holy Qur’an verse, Surah AlMu’minuun verse 96 (QS 23 : 96) which correlates with “Moral & Culture” topic and “patience” subtopic. The translation of the verse is “Tolaklah perbuatan buruk mereka dengan yang lebih baik. Kami lebih mengetahui apa yang mereka sifatkan”. (“Repel evil with that which is better. We are Best Aware of that which they allege.”).

, which is the support

where s% transaction within D contains Association rule describes articles/topics within dataset.

Thus if there is an old user that have read article 3, and have never read article 2, the system would recommend article 2 to the user.

the

relation

. between

Fig. 7. Search result using atomic keyword of QS 23 : 96 Atomic keyword contained in an article

e.g. : read (user A, article 1) => read (user A, article 3) In order to produce Association rule, an Association rule mining is used. The basics is to produce every Association rule with a higher support and confidence than the specified minimum value.

Fig. 8. An article containing the atomic keyword

By utilizing the various sources of information spread in the Internet as an item collection or articles and websites database that discusses topics of the Holy Qur’an for this system, it is expected that the system can be developed at a lower cost, because the database is distributed. Besides, a distributed information source will make this system dynamically evolved. The re-ranking of the search result is expected to filter incorrect information (articles/websites) about topics in the Holy Qur’an, and places high quality information in the higher ranking. The purpose of this re-ranking can be achieved if there are many users (common and expert subscribed user) that contributes to rate the information shown by the system. Expert user has a higher contribution in the re-ranking stage, because expert user is a user considered to be an expert in particular topic, and is expected to give a more correct rating towards the information given by the system. Because this system resides in the field of religion/spiritual which could be pretty sensitive towards certain people, the system has the function to add comments and explanations by expert user, in order to maintain trust of the system [9] and the legitimacy and worthiness of the search result information. Collaborative filtering method that used in this system is to recommend articles and topics toward the user, because this method does not consider the content of the article, but is more focused to the user preference [11]. This method is suitable when applied in a system with many kinds of formats and contents (due to the information source characteristics that is distributed in the Internet) [18]. VI. CONCLUSION Search engine and recommender system technology which are rapidly developing nowadays can give a big contribution to the field of religion/spiritual. Integration of search engine and recommender system in the spiritual field enables the development of system that could find many information distributed in the Internet, filters it to display a high quality content, and finally recommends the filtered information to the user based on the user profiles and preferences. Integration of search engine and recommender system is expected to encourage people more enthusiastic in learning the Holy Qur’an and to actively contribute in maintaining the purity and correctness of the Holy Qur’an in an easy way, by giving a correct rating to the articles and websites found by the system.

[2] Ayman Farahat, Thomas LoFaro, Joel C. Miller, Gregory Rae, and Lesley A. Ward. 2006. Authority Rankings from HITS, PageRank, and SALSA: Existence, Uniqueness, and Effect of Initialization. SIAM Journal on Scientific Computing.

[3] Mukund Deshpande and George Karypis. 2004. Item-Based Top-N Recommendation Algorithms. ACM Transactions on Information Systems, Vol. 22, No. 1. ACM Press.

[4] G. Linden, B. Smith, J. York. 2003. Amazon. com recommendations : Item-to-item collaborative filtering. Internet Computing, IEEE.

[5]

H. Nguyen and S. Lin. 2007. Collaborative Dining: A Social Recommender System for Restaurants. UC Berkeley School of Information.

[6] Gawesh Jawaheer, Martin Szomszor, and Patty Kostkova. 2010. Comparison of implicit and explicit feedback from an online music recommendation service. Proceedings of the 1st International Workshop on Information Heterogeneity and Fusion in Recommender Systems. ACM Press.

[7] Anton Eli¨ens and Yiwen Wang. 2007. Expert Advice and Regret For Serial Recommenders. Paper. Faculty of Sciences, VU University Amsterdam and Dept. of Math. and Comp. Sc. Eindhoven University of Technology.

[8] Antonietta Grasso and Andre Bergholz. 2007. Method and system for expertise mapping based on user activity in recommender systems. US Patent. http://www.patentstorm.us

[9] U. Rohini, Vamshi Ambati. 2005. A collaborative filtering based reranking strategy for search in digital libraries. Proceedings of the 8th international conference on Asian Digital Libraries: implementing strategies and sharing experiences. Springer Berlin Heidelberg.

[10] Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John Riedl. 2004. Evaluating Collaborative Filtering Recommender Systems. ACM Transactions on Information Systems, Vol. 22, No. 1. ACM Press.

[11] B. M Marlin. 2003. Modeling user rating profiles for collaborative filtering. In Advances in neural information processing systems. Washington University.

[12] F. Reichlin and R. Wattenhofer. 2008. History-Based Collaborative Filtering for Music Recommendation. Swiss Federal Institute of Technology Zurich.

[13] T. Yamamoto, S. Nakamura, and K. Tanaka. 2007. Rerank-byexample: Efficient browsing of web search results. In Database and Expert Systems Applications. Springer Berlin Heidelberg.

[14] Y. Bergner, S. Droschler, G. Kortemeyer, S. Rayyan, D. Seaton, and D.E. Pritchard. 2012. Model-Based Collaborative Filtering Analysis of Student Response Data: Machine-Learning Item Response Theory. International Educational Data Mining Society.

[15] X. Su and T.M. Khoshgoftaar. 2009. A survey of collaborative filtering techniques. Advances in artificial intelligence. Hindawi Publishing Corp.

[16] Z.D. Zhao and M.S. Shang. 2010. User-based collaborative-filtering recommendation algorithms on hadoop. In Knowledge Discovery and Data Mining. IEEE.

[17] M.L. Shyu, C. Haruechaiyasak, S.C. Chen, and N. Zhao. 2005. Collaborative filtering by mining association rules from user access sequences. In Web Information Retrieval and Integration. IEEE.

[18] R. Suguna and D. Sharmila. 2013. An Efficient Web

REFERENCES [1] Alon Altman and Moshe Tennenholtz. 2005. Ranking Systems : The PageRank Axioms. Proceedings of the 6th ACM conference on Electronic commerce (EC-05). ACM Press.

Recommendation System using Collaborative Filtering and Pattern Discovery Algorithms. International Journal of Computer Applications.