Expert Systems with Applications 34 (2008) 347–356
www.elsevier.com/locate/eswa

Intelligent spider for information retrieval to support mining-based price prediction for online auctioning

C.-C. Henry Chan *

E-Business Research Lab, Department of Industrial Engineering and Management, Chaoyang University of Technology, 168 Jifong E. Road, Wufeng, Taichung 41349, Taiwan, ROC

Abstract

Since the emergence of online auctions in 1995, many individuals have joined auction markets. Bidders often make wrong decisions (e.g., the "Winner Curse") due to their limited knowledge and resources. Unfortunately, search engines only show a list of search results to users, and fail to provide further analysis that could improve users' decision-making. To solve this problem, this study proposes an intelligent spider for information retrieval, and applies data mining technology to differentiate between customers. Two software programs, a URL searching agent and an auction data agent, are developed to automatically collect related information whenever a user inputs the searched product. Two neural networks are used to perform data clustering and price prediction after this information is crawled and stored in a database. The first neural network adopts a self-organizing map (SOM) to cluster customer data into nine homogenous groups. A backpropagation network (BPN) is then used to predict the final price. This study develops a prototype of the proposed spider, and conducts an empirical study by crawling over 1000 deals from Taiwan's eBay. Finally, important information, such as the predicted price, the prediction error and historic records, is presented to the user. The user can thus easily target the right bidding policy for winning a bid based on the mining-based price prediction.

© 2006 Elsevier Ltd. All rights reserved.

Keywords: Spider program; Online auction; Data mining; Neural network; Price prediction

1. Introduction

Online auctions have become a vital market for electronic commerce since the emergence of eBay in 1995. Online auctions create a marketplace where buyers and sellers can conduct transactions anonymously. This rapidly growing business channel has been efficiently established on the basis of its low cost (Turban, Jae, Lee, & Chung, 2000). eBay has become the largest auction web site, with more than 60,000,000 individuals registered globally as eBay members and more than 600,000 product items registered daily for transactions worldwide. Owing to the accelerated growth of the online auction market, searching and

* Tel.: +886 4 23323000; fax: +886 4 23742327.
E-mail address: [email protected]

0957-4174/$ - see front matter © 2006 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2006.09.031

accumulating useful information for decision-making is increasingly difficult for users. Additionally, the rules of eBay mandate that the auction price is announced only when the transaction is finalized. All transaction records are then deleted automatically after two days, making the collection of online information extremely hard. If a user is looking for assistance from search tools, search engines can only display a ranked list of results without any further analysis. Users cannot obtain the whole picture of the searched information until they click on every page and manually retrieve data. Users can therefore make decisions based only on their personal experiences with restricted resources and knowledge. A bidder can submit an irrational price to win a bid, and then regret his decision. This phenomenon is called the winner curse (Oh, 2002). On eBay, each seller is requested to set a secret ‘‘reserve price’’ such that if the highest bid remains below the reserve price, then the seller does not carry out the transaction

with the highest bidder. Conversely, each buyer must also submit a bid or a maximum bidding price (Lucking-Reiley, Doug, Naghi, & Reeves, 2000). Regardless of how much effort users make to collect information, the bidding price is still a tricky number to determine rationally. To solve this problem, this study presents an intelligent spider for information retrieval, and applies data mining technology to data clustering to aid price prediction. The predicted price will eventually be a good source of reference for sellers setting reserve prices, and for buyers determining maximum bids.

2. Related works

Two issues must be addressed in this study. First, a spider and its applications must be defined. Second, the application of previous price prediction approaches to the online market must be considered.

2.1. Internet spider

Internet information retrieval has been studied under the terms search engine, spider, crawler, robot (bot) and agent. Search engines (e.g., Yahoo and Google) are the most popular search tools in cyberspace. Each tool uses different techniques for indexing, ranking and visualizing web documents. Additionally, meta-search engines such as MetaCrawler (www.metacrawler.com) and Dogpile (www.dogpile.com) connect to multiple search engines and integrate the results returned (Chen, Chau, & Zeng, 2002). Spiders have been used in the backend to support clients in collecting information (Chen et al., 2002). This investigation is concerned with the functions of spiders. Many previous studies have already defined and described spiders.

Table 1 summarizes these definitions (Ali & McRoy, 2000; Chen et al., 2002; Green & Pant, 1999; Kendall & Kendall, 1999; Miller & Bharat, 1998). Based on previous research, spiders are developed to navigate, retrieve and filter useful information on web sites for the user. Additionally, crawled information must be stored and organized into a database (Ali & McRoy, 2000). Moreover, users act on the online market by collecting information to satisfy their own preferences (Eriksson & Janson, 2002). Because of the accelerated growth of the Internet, two technologies have been presented to monitor changes and filter unneeded information. One of these is push technology. For instance, Ewatch (www.ewatch.com) can push information to users when they specify their areas of interest (Chen et al., 2002). On the other hand, an intelligent agent can be developed with pull technology. Kendall and Kendall (1999) define pull technologies on four levels: alpha-pull, beta-pull, gamma-pull and delta-pull (see Table 2). The purpose of users' information retrieval moves from what they "feel they want" to what they "really need". From this perspective, an ideal spider or agent can deliver information according to a user's personal needs, and provide analytical results to support decision making. In real applications, most spiders or agents are developed for information retrieval, recommendation and bidding (Kao, 2002). Since the World Wide Web is now very large, searching and retrieving valuable information has become the most important mission of spiders. Some works have developed personalized spiders for web searches, such as SPHINX (Miller & Bharat, 1998) and CI Spider (Chen et al., 2002). Cho, Kim, and Kia (2002) presented a personalized recommendation agent, which improves the effectiveness and quality of recommendations when applied to Internet shopping malls.

Table 1
Definitions of spiders

Definition | Article
Crawlers, also called robots and spiders, are programs that browse the World Wide Web autonomously | Miller and Bharat (1998)
Spiders are independent software agents that crawl the Web to gather information | Green and Pant (1999)
Spiders, bots (short for robots) or software agents are computer programs that have a variety of characteristics. They have been called on-line pseudo people, cited as popular test beds for multiple aspects of AI, and labeled as intelligent on-line assistants and group negotiators | Kendall and Kendall (1999)
Search engines are a simple example; typically they make use of a program (called a spider) that traverses the Web and creates databases of the keywords in a Web page (allowing fast, local retrieval of these resources) | Ali and McRoy (2000)
Internet spiders (a.k.a. crawlers) have been used as the main program in the backend of most search engines. These are programs that collect Internet pages and explore outgoing links in each page to continue the process | Chen et al. (2002)

Table 2
Pull technologies for information retrieval

Name | Description | Metaphor for the user | Accesses what users
Alpha-pull | Clicking on links | Surfing the net | Feel they want
Beta-pull | Using a search engine | Using an information services guide; asking a librarian | Think they want
Gamma-pull | Adopting a personal agent | Hiring a personal assistant (spider or bot) | Really want
Delta-pull | Creating an evolutionary agent | Creating a friendly bot that understands the user and changes over time | Really need

Source: Kendall and Kendall (1999).


2.2. Price prediction in the online market

Online auctioning is one of the most important online markets in cyberspace. Price is significant to both bidders and sellers. Researchers have recently begun to develop agents for negotiating prices automatically. For example, MoCAAS (Lee, Yun, & Jo, 2003), Nomad (Sandholm & Huai, 2000) and Magnet (Steinmetz et al., 1998) are auction agents with mobile agent mechanisms. Lucking-Reiley et al. (2000) presented an exploratory analysis of the determinants of prices in online auctions for collectible one-cent coins at the eBay Web site. ATTac-2001 (Stone, Schapire, Csirik, Littman, & McAlleste, 2001) applies boosting techniques to learn conditional distributions of auction clearing prices. The core of ATTac-2001 is to learn a model of the empirical price dynamics based on past data, and to utilize the model to compute, to the greatest extent possible, optimal bids analytically. In the 2002 Trading Agent Competition (TAC), agents (Eriksson & Janson, 2002) were developed with the aim of assembling travel packages from flight tickets, hotel rooms and travel events. In that competition, virtually all participating agents employed some approach for predicting hotel prices. Furthermore, Wellman, Reeves, Lochner, and Vorobeychik (2002) developed agents for a challenging market game in the domain of travel shopping. A pivotal issue concerning travel agents is uncertainty about hotel prices, which significantly affects the relative cost of alternative trip schedules.

3. System design

Although many spiders have already been developed, what does an ideal spider look like? Based on the definition of "spiders" by Kendall and Kendall (1999), a superior agent or spider should understand the user, and then adapt over time based on customer needs. Unfortunately, no existing spider can comprehend customer behavior.
To observe customer behavior for decision support, this study develops a prototype spider for information retrieval and applies a data mining technique for price prediction. When a user sends a request to the spider, the information is first retrieved from the Web. A data mining technique is then used to analyze the data and create a customer behavior model. Accurate decision support information can then be delivered to the user according to the customer behavior model (see Fig. 1).

[Fig. 1. The intelligent spider captures information for decision support: a user request triggers information retrieval from the Internet; crawled data are stored in a database and mined into a customer behavior model, which drives decision support for the customer.]

eBay is one of the best-known auction web sites, and contains very rich customer information for study (Lucking-Reiley et al., 2000; Standifird, 2001; Turban, 1997). Customer information from eBay is composed of three types of information: transaction data, product data and basic customer data (Rayals, 2002). This study uses the PHP language to develop an intelligent spider running on a Linux system to crawl transaction data from eBay. This spider performs the information retrieval task with a URL searching agent and an auction data agent (see Fig. 2). The components of this study's system are described as follows:

• URL searching agent: The URL searching agent looks for the URL address of each requested product. The URL server of eBay returns a corresponding URL address for each request. The identified URL address is then recorded in the auction URL database.
• Auction data agent: Whenever a URL address is found and identified, the auction data agent transmits a request to eBay's server. The agent then retrieves the transaction information from eBay, and stores it in the auction data database for further analysis.
• Data mining: The retrieved information is clustered into several homogenous groups, which are utilized to create a customer behavior model.
• Price prediction: Decision support for bidding prices is presented to the user based on this behavior model.
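The two retrieval agents above can be sketched as a small pipeline. This is a minimal illustration only: the paper's actual spider is written in PHP and parses live eBay pages, whereas here the function names, the injected `fetch` callable and the SQLite schema are all hypothetical stand-ins chosen so the sketch stays self-contained and testable offline.

```python
# Hypothetical sketch of the two-agent pipeline (the paper's spider is PHP;
# Python is used here purely for illustration).
import sqlite3

# Item-page URL pattern reported in the paper for Taiwan's eBay.
ITEM_URL = "http://cgi.tw.ebay.com/ws/eBayISAPI.dll?ViewItem&item={item_id}"

def url_searching_agent(auction_ids):
    """Map each closed-auction ID to its item-page URL (auction URL database)."""
    return [ITEM_URL.format(item_id=i) for i in auction_ids]

def auction_data_agent(url, fetch):
    """Fetch one auction page; `fetch` is injected so the sketch runs offline.
    A real agent would parse the last bid, times, seller rating, bid history."""
    return {"url": url, "raw": fetch(url)}

def store(records):
    """Persist crawled records into a stand-in for the auction data database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE auction_data (url TEXT, raw TEXT)")
    db.executemany("INSERT INTO auction_data VALUES (?, ?)",
                   [(r["url"], r["raw"]) for r in records])
    db.commit()
    return db
```

Injecting the page-fetching function keeps network access out of the data-handling logic, mirroring the paper's separation between the URL searching agent and the auction data agent.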

4. Decision support processes for online auction

To construct an intelligent decision support environment, this study adopts the three-phase process (data set collection, data analysis and decision support) depicted in Fig. 3 to produce decision-support information for users (Shaw, Subramaniam, Tan, & Welge, 2001). The details of each phase are described as follows:

• Data set collection – The original data, containing both transaction and product information, are obtained from eBay. In the data collection phase, consumer information is retrieved by first searching the URL address of a product, and then gaining the corresponding consumer information.
• Data analysis – The customer data set is transformed into a recognizable data format for neural networks. Customer behavior is then segmented into several

[Fig. 2. System design: the URL searching agent and the auction data agent crawl eBay into the auction URL and auction data databases; an SQL server holds the transaction, product and customer databases, which feed the search engine, data mining and price prediction components serving the customer.]

[Fig. 3. Spider-based decision support processes: data collection (URL searching, data acquisition from the web site) → data analysis (data transformation, data clustering, customer segmentation, customer behavior model) → decision support (price prediction, price suggestion to the buyer/customer, who submits a bid).]
homogenous groups based on the customer data using a self-organizing map (SOM).
• Decision support – After the customer data are segmented, a backpropagation network (BPN) is used to predict the final bid price. The predicted price is utilized to help users determine their own price. Finally, the price suggestion rules are delivered to the user.
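The prediction step above relies on a trained backpropagation network. As a rough, hedged sketch of what such a network involves, the following implements a one-hidden-layer BPN with sigmoid hidden units and a linear output, trained by per-sample gradient descent; the paper's actual architecture, inputs and training setup are not reproduced here, and all sizes and learning rates below are illustrative.

```python
# Minimal backpropagation network sketch (NOT the paper's exact BPN):
# one hidden layer of sigmoid units, linear output for price regression.
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyBPN:
    def __init__(self, n_in, n_hidden, lr=0.2, seed=0):
        rng = random.Random(seed)
        self.lr = lr
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]

    def forward(self, x):
        # Hidden activations, then a linear combination as the predicted price.
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in self.w1]
        return h, sum(w * hi for w, hi in zip(self.w2, h))

    def train_step(self, x, target):
        h, y = self.forward(x)
        err = y - target
        for j, hj in enumerate(h):
            # Backpropagate the error through w2[j] BEFORE updating it.
            delta_h = err * self.w2[j] * hj * (1.0 - hj)
            self.w2[j] -= self.lr * err * hj
            for i, xi in enumerate(x):
                self.w1[j][i] -= self.lr * delta_h * xi
        return err * err  # squared error for this sample
```

Inputs and the target price are assumed to be min-max normalized into [0, 1] first, matching the data-transformation step described in Section 4.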

4.1. Data collection

This research proposes five steps to collect data for this study (see Fig. 4). First, the user needs to input the searched product. In the second step, the URL searching agent is used to crawl the information with a keyword, such as "Digital Camera". The URL searching agent then displays

[Fig. 4. The steps of data collection with a spider program: input the searched product item → extract the online auction URL address → retrieve the auction data set from eBay → record the auction data into a database → output the data for further analysis.]

a listing of auction links related to the search keyword by traversing this link (Lucking-Reiley et al., 2000). The URL searching agent then gathers the IDs of the auctions that closed on the previous day. Each auction ID is used in a Web URL. For instance, an ID such as 3003607831 is used to generate a new address such as http://cgi.tw.ebay.com/ws/eBayISAPI.dll?ViewItem&item=3003607831. The URL is a query to eBay to retrieve information related to item number 3003607831. For each query, eBay responds with a web page containing the last bid, the opening and closing time and date, the seller's ID and rating, the minimum bid, the number of bids and a listing of the bid history (Lucking-Reiley et al., 2000). The bid history includes information on each bidder: the buyer's ID and rating, and the price, time and date of each bid. The spider program collects information about sellers and bidders. Data are retrieved and recorded in a database in the third and fourth steps. The final step is to output the data model for further analysis.

4.2. Data analysis

This work presents a six-step cycle (Chan, 2005) to discover customer behavior patterns. First, the transaction data of a searched product are retrieved from eBay. The information retrieval process is performed by a URL searching agent and an auction data agent, as described in the previous section. The second step is to select significant variables as the input of the customer segmentation algorithm. In this study, the selected variables include the last bid (if any), opening and closing time and date, seller's ID and rating, minimum bid, number of bids, and a listing of bid history (Lucking-Reiley et al., 2000). The third step is to transform the data into a normalized format. To convert the data into a recognizable format for neural networks, the following function is applied to normalize each value into the range between zero and one (Azoff, 1994):

X_i,normal = (X_i − X_min) / (X_max − X_min)    (1)

where X_min is the minimum value in the data source and X_max is the maximum value in the data source. The fourth step is to apply the neural network model to cluster the data into homogenous groups. The fifth step is to interpret the meanings of the clustered customer data. The sixth step is to choose and construct a customer behavior model based on the previous customer segmentation. These six steps form a cycle for constructing a customer behavior model, as shown in Fig. 5. The following sections discuss the details of the fourth through sixth steps.
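The min-max normalization of Eq. (1) can be written directly in code. This is a straightforward sketch; the constant-column guard is an added assumption, since Eq. (1) is undefined when X_max equals X_min.

```python
# Min-max normalization of Eq. (1): rescale each variable into [0, 1]
# before it is fed to the neural networks.
def normalize(values):
    x_min, x_max = min(values), max(values)
    if x_max == x_min:  # constant column: avoid division by zero (our assumption)
        return [0.0 for _ in values]
    return [(x - x_min) / (x_max - x_min) for x in values]
```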

[Fig. 5. The cycle of constructing a customer behavior model: 1. information retrieval → 2. variable selection → 3. data normalization → 4. cluster customer data → 5. customer data interpretation → 6. customer behavior model selection.]

4.2.1. Neural net for data clustering

To segment the behavior of customers, various researchers (Hsieh, 2004; Kohzadi, Boyd, Kermanshahi, & Kaastra, 1996) have shown that neural networks perform much better than statistical techniques such as k-means, linear discriminant analysis (LDA), multiple discriminant analysis (MDA) and logistic regression analysis (LRA). The application of neural networks to clustering consumer data is a challenging and promising research area. This work proposes the SOM as the data-mining tool, because the SOM not only groups the customer data, but also graphically represents the relationships among the clustered groups (Smith & Ng, 2003). These relationships are important knowledge for interpreting the hidden patterns of the clustered groups. Many recent investigations have adopted the SOM as a key data clustering technology, for example in the analysis of customer behaviors (Chan, 2005; Hsieh, 2004) and web navigation patterns (Smith & Ng, 2003). The SOM is an unsupervised learning neural network that applies neighborhood and topology relations to cluster associated data into one group. The SOM is described in detail as follows (Jang, Sun, & Mizutani, 1997):

Step 1. Select the winning output unit as the one with the largest similarity measure (or smallest dissimilarity measure) between all weight vectors w_i and the input vector x. If the Euclidean distance is chosen as the dissimilarity measure, then the winning unit c satisfies the following equation:


‖x − w_c‖ = min_i ‖x − w_i‖    (2)

where the index c refers to the winning unit.

Step 2. Let NB_c denote the set of indices corresponding to a neighborhood around the winner c. The weights of the winner and its neighboring units are then updated by

Δw_i = η c(i)(x − w_i),  i ∈ NB_c    (3)

where η is a small positive learning rate. Instead of defining the neighborhood of a winning unit explicitly, we can use a neighborhood function c(i) around the winning unit c. For example, the Gaussian function can be used as the neighborhood function:

c(i) = exp(−‖p_i − p_c‖² / (2σ²))    (4)

where p_i and p_c denote the positions of output units i and c.

4.2.2. Customer segmentation

After data clustering, a customer behavior model is built. Based on the customer behavior model defined by Turban et al. (2000), the decision-making process in electronic commerce is influenced by the characteristics of the sellers and buyers, the environment, the technology and the EC logistics (Turban et al., 2000). Buyers on the online auction market can find only the ID (identity) of the sellers instead of their real names, making users' behavior very hard to understand. To cope with this difficulty, this study describes customer behaviors using several important stimuli (price, promotion, product and quality) and personal characteristics (latest login time, total bid time and number of bids). Although customer behavior can be determined in several ways, this study employs the well-known RFM (recency, frequency and monetary value) model to represent the characteristics of each customer (Hsieh, 2004). This model clusters customer behavior according to three dimensions of the customer's transactional data, namely recency, frequency and monetary value (Lynette, 2002). Recency indicates the length of time since a bid began; frequency reveals how frequently a user submits bids; and monetary value measures the amount of money that the customer has spent in a bid (Jonker, Piersma, & Van den Poel, 2004). In this study, data values such as the average latest login day, the auction cycle day and the purchase monetary amount were derived from the databases (Chan, 2005). The RFM model assumes that future consumer trading patterns are similar to past and existing patterns. The computed RFM values are summarized to represent the behavior patterns of customers.
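The SOM competition and neighborhood update of Eqs. (2)-(4) in Section 4.2.1 can be sketched as a single training iteration. This is a generic SOM step, not the paper's exact configuration: the grid positions, learning rate η and neighborhood width σ below are illustrative values.

```python
# One SOM iteration following Eqs. (2)-(4): choose the winner by Euclidean
# distance, then pull it and its neighbors toward the input, weighted by a
# Gaussian neighborhood around the winner's grid position.
import math

def winner(weights, x):
    # Eq. (2): the unit with the smallest (squared) Euclidean distance to x.
    return min(range(len(weights)),
               key=lambda i: sum((xi - wi) ** 2
                                 for xi, wi in zip(x, weights[i])))

def som_step(weights, positions, x, eta=0.5, sigma=1.0):
    c = winner(weights, x)
    for i, w in enumerate(weights):
        # Eq. (4): Gaussian neighborhood function c(i) on grid positions.
        d2 = sum((pi - pc) ** 2
                 for pi, pc in zip(positions[i], positions[c]))
        nbh = math.exp(-d2 / (2 * sigma ** 2))
        # Eq. (3): delta_w = eta * c(i) * (x - w), applied in place.
        for k in range(len(w)):
            w[k] += eta * nbh * (x[k] - w[k])
    return c
```

Repeating this step over the normalized customer records, while shrinking η and σ over time, is what lets the map settle into the homogenous groups described above.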
This study uses the following RFM variables (Jonker et al., 2004). The behavior model of the online auction is described as follows: Behavior Model B_xy
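The three RFM values defined above can be computed directly from a bidder's transaction records. The sketch below uses illustrative field names and a toy day-number timescale; the paper derives its own values (average latest login day, auction cycle day, purchase monetary amount) from its databases.

```python
# RFM values for one bidder, following the definitions in Section 4.2.2:
# recency  = time elapsed since the most recent bid,
# frequency = number of bids submitted,
# monetary  = total amount of money spent in bids.
def rfm(bids, today):
    """bids: list of (day, amount) tuples for one customer; day/today are
    simple day numbers (an illustrative timescale, not the paper's)."""
    recency = today - max(day for day, _ in bids)
    frequency = len(bids)
    monetary = sum(amount for _, amount in bids)
    return recency, frequency, monetary
```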