Applying I-FGM to Image Retrieval and an I-FGM System Performance Analyses

Eugene Santos, Jr.1, Eunice E. Santos2, Hien Nguyen3, Long Pan2, John Korah2, Qunhua Zhao1 and Huadong Xia2

1 Thayer School of Engineering, Dartmouth College, Hanover, NH, {Eugene.Santos.Jr, Qunhua.Zhao}@dartmouth.edu
2 Department of Computer Science, Virginia Polytechnic Institute & State University, Blacksburg, VA, [email protected], {panl, jkorah, xhd}@vt.edu
3 Mathematical and Computer Sciences Department, University of Wisconsin, Whitewater, WI, [email protected]

ABSTRACT

Intelligent Foraging, Gathering and Matching (I-FGM) combines a unique multi-agent architecture with a novel partial processing paradigm to provide a solution for real-time information retrieval in large and dynamic databases. I-FGM provides a unified framework for combining the results from various heterogeneous databases and seeks to provide easily verifiable performance guarantees. In our previous work, I-FGM was implemented and validated with experiments on dynamic text data. However, the heterogeneity of search spaces requires that our system be able to handle various types of data effectively. Besides text, images are the most significant and fundamental data for information retrieval. In this paper, we extend the I-FGM system to incorporate images into its search spaces using a region-based wavelet image retrieval algorithm called WALRUS. Similar to what we did for text retrieval, we modified the WALRUS algorithm to partially and incrementally extract the regions from an image and measure the similarity value of this image. Based on the partial results obtained, we reallocate our computational resources by updating the priority values of image documents. Experiments have been conducted on the I-FGM system with image retrieval, and the results show that I-FGM outperforms its control systems. We also present a theoretical analysis of the systems with a focus on performance. Based on probability theory, we provide models and predictions of the average performance of the I-FGM system and its two control systems, as well as of the systems without partial processing.
Keywords: Geospatial information retrieval, dynamic information space, image retrieval, distributed information retrieval, parallel information retrieval, multi-agent systems, performance, evaluation, theoretical analysis

1. INTRODUCTION

Information retrieval is the science of searching for and presenting to users, such as homeland security agencies, information relevant to their queries. Due to the broad spectrum of information digitization and data-sharing techniques, there continues to be an enormous proliferation of information. This rapid growth is demonstrated not only by the increasing number of shared/private databases and the huge amount of data they contain, but also by the many different types of data. The vast amount and diverse types of data enable intelligence agencies to build more comprehensive and detailed pictures of their interests, but they also introduce new challenges. Firstly, in these large databases, usually only a small amount of data is relevant to a specific query/task. This useful information, which we call relevant documents in this paper, is immersed in a sea of currently unimportant data that takes a great amount of time and effort to search through. Thus, it is not surprising that a large delay is introduced between data acquisition and its complete processing. For time-critical cases, however, such as disaster relief and criminal/terrorist detection, these large delays can fatally degrade the analyses and decisions recommended by the intelligence agencies. Secondly, the search space is dynamic1,2. At any given moment, new information becomes available while old information is invalidated. For world or regional "hot spots", there is only a small window of time during which information remains valid. Lastly, the search spaces of current information retrieval are highly heterogeneous. This heterogeneity is found mainly in the sources and formats of data. Different kinds of information are contained in diverse types of sources which require different formats of data, such as plain text, structured text documents, images, maps, audio, video, etc. Each kind of data may contain important and indispensable information for the intelligence agencies. Therefore, to assist intelligence agencies in addressing these three challenges, there is a clear and urgent need for a general framework that effectively and efficiently handles massive, dynamic, heterogeneous data.

I-FGM1,2,3 (Intelligent Foraging, Gathering, and Matching) is designed to improve system efficiency by applying an intelligent computational resource allocation strategy based on the analysis of partial processing results obtained from the search space. In our previous work2,3, I-FGM was experimentally shown to be an effective general tool for text retrieval on large and dynamic search spaces. However, as mentioned earlier, search spaces are highly heterogeneous. To demonstrate that I-FGM is a general framework for information retrieval, it is necessary to study the system's ability to retrieve information from different types of data and to solve the problems introduced by data heterogeneity. Among the various types of data encountered in information retrieval, images, along with text, are the most prevalent and fundamental. Images are common and highly useful sources for intelligence agencies; great numbers of images are stored and used in numerous tasks/repositories such as the Internet, medical databases, newspapers, area reconnaissance, etc. Often, a single image can contain more information and express ideas or scenarios more clearly than several hundreds or thousands of words. Additionally, images are the basic elements of videos. Although there is as yet no effective technique for retrieving video, we believe that success in image retrieval will lead to significant progress in video retrieval. Therefore, this paper focuses on image retrieval in I-FGM.
The immediate objective is to verify the effectiveness and efficiency of I-FGM in retrieving information from images. The long-term goal is to use images together with text as our case study to learn and analyze I-FGM's capability for handling heterogeneous data. In this paper, we will discuss how I-FGM retrieves information from image data, thus laying the foundations for future research on how to deal with the challenges caused by data heterogeneity. The image retrieval approach adopted in I-FGM is based on a currently available region-based wavelet image retrieval algorithm called WALRUS6. It is an effective algorithm for content-based image retrieval, which returns images that are visually similar to a given query image. It first divides each image into regions; a similarity value between a pair of images is then computed as the ratio of the area of the matched regions to that of the whole images. The regions of an image are extracted by clustering sliding windows, which move all over the image, according to the proximity of their visual features6. Two features of WALRUS facilitate its integration into our system. First, the WALRUS algorithm is naturally decomposable into several sub-processes with well-defined functions, such as clustering of sliding windows and matching of two images' regions. Such modularity can be easily mapped to the different components of I-FGM. Second, WALRUS can process an image one sliding window at a time, so incremental processing can be easily implemented in our system. To the best of our knowledge, WALRUS has not previously been used to incrementally process images in a distributed environment. Finally, I-FGM is a general information retrieval framework which combines a unique multi-agent architecture with a novel partial processing paradigm to provide a solution for real-time information retrieval.
The I-FGM system is based on the assumption that if we can partially process data, we may be able to improve system efficiency by intelligently allocating and reallocating computational resources according to the analysis of partial results obtained during processing. While this may sound plausible, without rigorous analysis we cannot guarantee the performance of the system. In this paper, we therefore also apply probability theory to compare the performance of a partial processing system with that of a full processing system. We define a full processing system as a system whose results can be obtained only after fully processing all data. From our theoretical analyses, we will show what factors affect the system's performance, and under what conditions and to what extent the partial processing system can outperform the full processing system. We will also provide a performance comparison between I-FGM and its two control systems. In the following sections, we give a brief review of current image retrieval systems. Next, we present the design of the I-FGM system for image retrieval. This is followed by a description of our system evaluation and experimental setup. Based on the experimental results, we analyze our system's performance for image retrieval. Finally, we provide our theoretical analysis of the performance of I-FGM, the control systems, and the systems without partial processing. In the last section, we provide our conclusions and discuss future work.

2. RELATED WORK

In this section, we discuss related technologies concerning image retrieval approaches and region-based matching methods. Roughly, image retrieval can be classified into two types of approaches: Annotation-Based Image Retrieval (ABIR) and Content-Based Image Retrieval (CBIR). ABIR requires experts to annotate images and is essentially based on text retrieval. It is a manually intensive process, difficult to automate, and not suitable for the large image databases that I-FGM deals with. Another popular technique, used in internet image search engines, is to use the text around an image, as well as text in the image tags and section headings, to describe images. Although this works well for images in web pages, it cannot be adapted to stand-alone pictures. Our focus in this paper is CBIR, which deals with interpreting images based on their visual contents. We decided to use a CBIR method because the performance of these systems has improved over the past decade. This is not an easy task, due to the constraints placed by the sensory and semantic gap10 phenomenon. A particularly significant development is the introduction of region-based CBIR methods, which have the potential to improve semantically sensitive image retrieval. In addition, CBIR systems typically require minimal human intervention. Due to space constraints, we only focus on the methods and techniques that have influenced our image retrieval work. Details of CBIR can be found in various surveys11, 12, 13. A typical CBIR method involves converting the visual content into a representation or feature signature. Needless to say, performance will depend on the low-level features of the image that we wish to use. Some examples of low-level features are color, texture, and contrast. They can be converted into a signature, typically a histogram or a wavelet transform. A wavelet is a mathematical structure that can be used to store image shape and texture information.
Wavelets provide a multiresolution analysis of the image, making the representation tolerant to scaling. Haar and Daubechies14 are commonly used wavelet transforms in image retrieval. Another area of interest to us is region-based matching methods15, 16, 17. These methods adopt a fine-grained approach towards feature extraction by dividing the images into a set of regions. The feature vectors of each of these regions are used to represent the image. One major goal of the CBIR community is to extract objects or concepts from images and regions. In many cases, objects or simple concepts will have similar pixel characteristics in images. This means that the regions extracted from an image could correspond to the objects contained in that image. If concepts or objects can be extracted from images efficiently, then the information in the images can be represented as a graphical structure of concepts and relations. This is similar to the common knowledge representation used in I-FGM and will help us in dealing with heterogeneous data types. WALRUS6 and SIMPLIcity18 use wavelets to represent the feature signatures of regions. We direct special attention to WALRUS and SIMPLIcity because they form the basis of the image retrieval method used in this paper. WALRUS slides a window over an image and generates wavelet coefficients for each window position. The wavelet coefficients are then clustered into regions. It also defines a similarity measure based on the areas of similar regions, and hence is a simple and effective method for finding images similar to a given query image. SIMPLIcity is similar to WALRUS in that it uses sliding windows and wavelet coefficients to extract the regions of images. However, it is concerned with the categorization of images into general semantic classes such as "indoor", "outdoor", "textured", etc.
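To make the wavelet signature concrete, the sketch below applies a single level of the 2x2 Haar transform to a 2x2 block of pixel values, yielding one average and three detail coefficients. This is our own illustration of the standard transform, not code from WALRUS or SIMPLIcity.

```python
def haar_2x2(block):
    """Single-level 2x2 Haar transform of a 2x2 block of pixel values.

    Returns (average, horizontal, vertical, diagonal) coefficients: the
    average captures local color, the three detail bands capture texture.
    """
    (a, b), (c, d) = block
    average = (a + b + c + d) / 4.0      # low-frequency band
    horizontal = (a + b - c - d) / 4.0   # top rows vs. bottom rows
    vertical = (a - b + c - d) / 4.0     # left columns vs. right columns
    diagonal = (a - b - c + d) / 4.0     # diagonal detail
    return average, horizontal, vertical, diagonal

# A flat block has no detail energy; a horizontal edge excites one band.
print(haar_2x2([[5, 5], [5, 5]]))   # (5.0, 0.0, 0.0, 0.0)
print(haar_2x2([[9, 9], [1, 1]]))   # (5.0, 4.0, 0.0, 0.0)
```

The three detail coefficients are exactly the high-frequency bands that the feature vectors in Section 3 draw on.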

3. IMAGE RETRIEVAL ALGORITHM

In order to process image data, a content-based image retrieval technique based on WALRUS is adopted in I-FGM. WALRUS (Wavelet-based Retrieval of User-specified Scenes) is a region-based algorithm for retrieving images that are similar to a given query image according to their visual features6. The architecture of the retrieval algorithm is shown in Figure 1. It consists of three phases: image segmentation, image region retrieval, and image matching. In the following paragraphs, we briefly introduce these phases. The details of the WALRUS algorithm can be found in6.

Image segmentation: When an image first enters the system, it is segmented into thousands of sub-images by sliding windows over it. The sliding windows are small squares with variable sizes of 8x8, 16x16, or 32x32 pixels. They move all over the image and overlap each other. For each sub-image covered by a sliding window, a 6-dimensional feature vector is extracted to represent this sub-image. The LUV color space is used to calculate the feature vector. The first three values in the feature vector are the average color values of the pixels in the window. The other three values are the high-frequency bands of the 2x2 Haar wavelet transform calculated for the window. This feature vector is similar to the one used in SIMPLIcity. Each image is thus represented by a number of feature vectors capturing its color and texture features. The file containing these feature vectors is called the image's feature file.

Image region retrieval: Once the feature vectors of an image are extracted, similar windows are clustered together if their feature-vector distances are smaller than a designated threshold. We use the BIRCH clustering algorithm7 to carry out this

task. Each cluster naturally represents a region of similar color and texture in the image. Each region is denoted by the centroid of the corresponding cluster's feature vectors, which is also a 6-dimensional feature vector. The centroids of all regions in an image are stored in a file called the image's region file.

Image matching: In this phase, a test image is compared with the query image by matching the regions in their region files. Two regions are matched if they are similar, that is, if the distance between their centroids is lower than a predefined threshold. The similarity value of the two images is then calculated as the fraction of matched area, which lies between 0 and 1.
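The matching phase just described can be sketched as follows. Each region is represented by its cluster centroid and area, and regions whose centroids fall within a distance threshold are paired. The greedy one-to-one pairing and the 2-dimensional centroids are our simplifications for illustration; WALRUS uses 6-dimensional centroids and a more careful matching step.

```python
import math

def similarity(regions_p, regions_q, threshold=1.0):
    """Similarity of two images given their regions as (centroid, area) pairs.

    Regions match when their centroids are within `threshold`; the score is
    the matched area divided by the total area, so it lies between 0 and 1.
    Greedy one-to-one matching simplifies WALRUS's actual matching step.
    """
    matched_area = 0.0
    used = set()
    for centroid_p, area_p in regions_p:
        for j, (centroid_q, area_q) in enumerate(regions_q):
            if j not in used and math.dist(centroid_p, centroid_q) <= threshold:
                used.add(j)
                matched_area += area_p + area_q
                break
    total_area = sum(a for _, a in regions_p) + sum(a for _, a in regions_q)
    return matched_area / total_area if total_area else 0.0

# One of two regions in each image matches, covering 90 of 200 area units:
p = [((0.0, 0.0), 50.0), ((10.0, 10.0), 50.0)]
q = [((0.5, 0.0), 40.0), ((99.0, 99.0), 60.0)]
print(similarity(p, q))   # 0.45
```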

Fig 1: The architecture of the WALRUS algorithm. Both the query image and a test image pass through the same pipeline: wavelet signatures are calculated for sliding windows, the windows are clustered into regions, and the two images' regions are compared to produce a similarity value.

4. SYSTEM ARCHITECTURE

4.1 Overview
I-FGM provides a general framework for large and dynamic information spaces. Information retrieval techniques for specific domains can be easily adapted to this platform and take advantage of the quick prototyping and intelligent resource allocation strategy of the system. To deploy computation tasks over a distributed environment, our system adopts a multi-agent architecture which includes the following components:
1. I-Forager
2. gIG-Soup
3. gIG-Builder
4. I-Matcher
5. Blackboard
A more detailed description of these components can be found in our previous papers1, 2, 3. In this section, we focus on the adaptation of image retrieval techniques into our system. To address the problems introduced by data heterogeneity, a preliminary study on semantically combining the information contained in texts and images using document graphs is also discussed.

4.2 Components
I-Forager: This component receives a query from a user, gathers relevant information from various sources such as the Internet, and then puts it in a repository called the gIG-Soup. It uses currently available technologies, such as web search engines, to provide a first level of filtering. These rough results need further refinement by the content-based image retrieval technique implemented in our system. The different search technologies used by the various image searches have different levels of effectiveness, which is quantified using a parameter called reliability. The reliability is used to calculate the first-order similarity of the documents in the gIG-Soup, which in turn determines the initial ranking or priority of the newly downloaded images. In our prototype of I-FGM with image retrieval, we employed three I-Foragers to retrieve related images from the Internet. Each I-Forager uses a different image search engine: Google, Yahoo, and MSN.
Downloaded images are placed in the gIG-Soup, waiting to be processed by the gIG-Builders. We set the same reliability value for all the image search engines; the results of the experimental runs in this paper will be used to refine the reliability of the image I-Foragers for future runs.
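As an illustration of how reliability might combine with a search engine's ranking to produce a first-order similarity, consider the hypothetical scoring below. The exact formula and weights used by I-FGM are not reproduced here, so this form is an assumption for exposition only.

```python
def first_order_similarity(rank, num_results, reliability):
    """Hypothetical initial relevancy estimate for a newly foraged image.

    Earlier ranks from a search engine score higher, scaled by that
    engine's reliability in [0, 1].  The exact form used by I-FGM
    differs; this is an illustrative assumption.
    """
    rank_score = 1.0 - (rank - 1) / num_results   # rank is 1-based
    return reliability * rank_score

# With equal reliabilities (as in our experiments), only rank matters:
print(first_order_similarity(1, 50, 0.8))    # 0.8
```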

gIG-Soup: This component is a repository implemented on a Network File System (NFS). It contains the images received from the I-Foragers and the processed forms of images produced by the gIG-Builders. In our implementation, the processed form of an image document includes intermediate data such as the wavelet feature file and the clustered region file. Additionally, the gIG-Soup has a database that stores important information about the images in the gIG-Soup, such as file names, URLs, the identification of a query, image sizes, the number of sliding windows completed, and so forth. These tables also store the partial and full relevancy measurements of the images in the gIG-Soup.

gIG-Builder: The gIG-Builder is where specific information retrieval techniques are implemented. In previous papers1, 2, we implemented and discussed gIG-Builders for text retrieval. In this prototype we added modules to process information from images; the new gIG-Builders were integrated into the previous prototype as plug-in modules without any difficulty. The first two phases of the image retrieval method, image segmentation and image region retrieval, are implemented in this component. Since both phases process one window at a time, the processing time of each partial processing step can be easily controlled by adjusting the number of sliding windows processed each time. We use a window size of 8 by 8 pixels and a sliding distance of 4 pixels. These values were decided upon after initial testing, to maintain a balance between processing time and performance. Each image in the gIG-Soup has a priority value, which quantifies the extent to which the image is potentially relevant to the given query. The priority function used in the previous prototype is also used here, with changes made to the constants and the half-life period. Each gIG-Builder uses a selection criterion (the same as in the previous prototype) based on the priority value to choose an image for processing.
It then processes some number of sliding windows, determined by the priority value; thus image processing is done in an incremental manner. In our current implementation, 12 nodes are used as gIG-Builders. The gIG-Builders continuously query the gIG-Soup, specifically the MySQL database, for images that are not fully processed. Information about the selected images, such as the priority and the number of sliding windows to be processed, is obtained from the MySQL database. After a gIG-Builder picks an image, it proceeds to analyze it for the given number of sliding windows. When this is done, the image's feature file and region file are updated and placed back into the gIG-Soup, and information such as the number of sliding windows processed is updated in the MySQL database. Since multiple gIG-Builders work simultaneously, appropriate flags are used to ensure synchronization among the components while accessing the database and the documents contained in the gIG-Soup.

I-Matcher: In our previous papers1, 2, we developed I-Matchers for text retrieval. Here, we discuss the I-Matcher for image retrieval. Each I-Matcher generates a query region file to compare with the region files from the gIG-Soup. The query region file is generated from a query image that is manually selected to conform to the given text query; in future work, we will use natural language queries directly for image retrieval. The I-Matcher continuously searches the MySQL database for image region files that have been processed by the gIG-Builders. The images with the highest priorities are selected by the I-Matcher, and their similarity values are calculated by comparison with the query image.
The similarity value is calculated using the following formula:

    Similarity(P, Q) = ( area(⋃_{i=1}^{n} P_i) + area(⋃_{i=1}^{n} Q_i) ) / ( area(P) + area(Q) )

Here P and Q represent the retrieved image and the query image, respectively. The set of ordered pairs {(P1, Q1), ..., (Pn, Qn)} forms a similarity region pair set for P and Q, where Pi is similar to Qi and, for i ≠ j, Pi ≠ Pj and Qi ≠ Qj. The priority of the document is then updated based on this similarity measure.

Blackboard: The Blackboard continuously queries the MySQL database and displays the ten most relevant documents to the user. It reflects the newest results in real time as the information is being processed by the system.

4.3 Heterogeneous issues

In the discussion above, we focused only on the implementation of image retrieval. However, the real challenge is heterogeneous search spaces, which include information from both texts and images. How do you compare an image file with a similarity value of 0.7 and a text document with a similarity value of 0.7? We cannot simply say that one is better than the other based on the similarity values, since they use totally different metrics. Another challenging problem involves compound documents containing several paragraphs of text and several images; it is even harder to define an impartial metric to evaluate overall similarity in such situations. Here, we sketch a general idea for dealing with this problem. As we have shown in1, 2, text information can be converted into a document graph by extracting related concepts from the text. Our approach seeks to understand the semantic meaning of an image and construct a similar document graph. We can use the region retrieval concepts of the WALRUS algorithm to recognize objects in an image and denote them as predefined concepts8. Each concept is associated with a text description9. The descriptions of the concepts recognized in an image are combined to form the image's annotation. After that, a document graph is constructed for the image based on the generated annotation. Thus, we provide a unified model for understanding information from text and images: all kinds of data, whether images, texts, or compound documents, can be converted into a document graph. In our current prototype, different gIG-Builders and I-Matchers are implemented for image retrieval and text retrieval. Once the above unified model is implemented in a future prototype, we will still use separate image gIG-Builders to process image data, but we will need only one kind of I-Matcher for both images and text.
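The unification idea can be sketched with a toy document graph built from recognized concepts. The concept labels and relations below are hypothetical inputs, standing in for the output of the future region-to-concept recognizer, and the adjacency-list representation is our simplification.

```python
def build_document_graph(concepts, relations):
    """Tiny undirected adjacency-list document graph.

    `concepts` are text labels (e.g. from a hypothetical region-to-concept
    recognizer); `relations` are (concept, concept) pairs linking them.
    """
    graph = {c: set() for c in concepts}
    for a, b in relations:
        graph[a].add(b)
        graph[b].add(a)
    return graph

# Hypothetical concepts recognized in an image for the tornado query:
g = build_document_graph(["house", "tornado", "damage"],
                         [("tornado", "damage"), ("damage", "house")])
print(sorted(g["damage"]))   # ['house', 'tornado']
```

Once text and images are both reduced to such graphs, a single graph-based I-Matcher can score either kind of document.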

5. CONTROL SYSTEMS

To verify the effectiveness of I-FGM, we compare it to current methodologies for information retrieval. Two control systems are constructed to represent the current state of the art: one is called the baseline system, and the other the partially intelligent system. The control systems have the same components as the I-FGM system, but they use different resource allocation strategies. Specifically, the baseline system selects images for processing with a uniformly distributed random generator, and each partial step in the baseline system is of equal duration. The partially intelligent system uses the first-order similarity to pick the next image for processing, which means that processing is based on the relevancy measure provided by the third-party tools used in the I-Foragers. By comparing I-FGM to the baseline system, we show that allocating computational resources according to images' relevancy during processing is necessary. By comparing I-FGM to the partially intelligent system, we show that resource allocation based only on the first-order similarity of images cannot achieve optimal performance. I-FGM and its control systems are each implemented on a set of 17 machines. Each machine has a 2 GHz Pentium 4 processor with 512 MB RAM, and the machines are interconnected by a 1-Gigabit Ethernet network. Three machines are used as I-Foragers, 12 as gIG-Builders, one machine as an I-Matcher, and the remaining machine as the Blackboard and gIG-Soup.
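The three resource allocation strategies can be contrasted in a few lines. The dictionary fields below (`first_order`, `priority`) are our naming for illustration, not the actual database schema.

```python
import random

def select_baseline(images, rng=random):
    """Baseline control: uniformly random choice among unfinished images."""
    return rng.choice(images)

def select_partial(images):
    """Partially intelligent control: highest first-order (I-Forager) similarity."""
    return max(images, key=lambda im: im["first_order"])

def select_ifgm(images):
    """I-FGM: highest current priority, refined from partial processing results."""
    return max(images, key=lambda im: im["priority"])

# A document can rank high initially yet lose priority once partial
# processing reveals it is less relevant than a lower-ranked rival:
images = [{"name": "a", "first_order": 0.9, "priority": 0.3},
          {"name": "b", "first_order": 0.4, "priority": 0.7}]
print(select_partial(images)["name"])  # a
print(select_ifgm(images)["name"])     # b
```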

6. EVALUATION

I-FGM is evaluated by comparing its performance with the control systems mentioned above. A set of five queries is given to each of the three systems, and the results are analyzed using two performance metrics: recall and the waiting time of relevant images. Recall is a standard metric in information retrieval and is defined as the ratio of the relevant images retrieved to the total number of relevant images. The waiting time of a relevant image is defined as the length of time between its arrival in the gIG-Soup and its appearance on the Blackboard. Our objective is to achieve full recall quickly while minimizing, for all relevant images, the time each one waits in the gIG-Soup before it is retrieved. In order to compare the performance of I-FGM and the control systems, a testbed of images has to be created; the procedure is described below.

6.1 Testbed
For the experiments, a set of five queries related to natural disasters is chosen:
1. Building damage by hurricane Katrina
2. Firefighters fight wildfires
3. Heavy snow storms in the winter
4. Rains cause mudslides
5. Houses damaged by a tornado

The testbed is created for each query by storing the top 50 results from each of the three search engines used in the I-Foragers. In order to use recall as a performance metric, we designate a small subset of the documents in the testbed with the highest similarity values as the set of relevant documents, or target set.

6.2 Procedure for evaluating retrieval performance
For each query, run all three systems until all images have been completely processed, and record the following data for each run:
- Blackboard content: we define an interval of time ti (ti = 3 seconds in this experiment) and record the contents of the Blackboard after each interval ti.
- The similarity values at the end of each processing step, stored for later analysis.
From the recorded Blackboard content, we calculate the recall for a query and the waiting time for each image. A graph depicting the recall over the period of the simulation is drawn for each query, and the waiting times of the relevant documents are depicted as a bar graph for each query.
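The two metrics can be computed directly from the recorded Blackboard snapshots. The sketch below assumes images are identified by name and times are in seconds.

```python
def recall_at(blackboard, target_set):
    """Fraction of the relevant (target) images currently on the Blackboard."""
    return len(set(blackboard) & set(target_set)) / len(target_set)

def waiting_time(arrival_s, appearance_s):
    """Seconds between arrival in the gIG-Soup and appearance on the Blackboard."""
    return appearance_s - arrival_s

# Two of the four target images are on the Blackboard at this snapshot:
print(recall_at(["img1", "img3", "img9"], ["img1", "img2", "img3", "img4"]))  # 0.5
```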

7. RESULTS AND ANALYSIS

In order to compare the performance of I-FGM with the baseline and partial control systems, we use two performance metrics: recall over time and the waiting time of the relevant images. On the whole, I-FGM outperforms the control systems in 3 out of the 5 queries, i.e. queries 1, 2, and 3. We will now analyze its performance in detail for each query.

For query 1, I-FGM has the lowest waiting times (Fig 2(a)) among the systems for 4 of the documents, while partial and baseline each have the lowest for 3 documents. In the recall graph (Fig. 2(b)), we see that after reaching a recall of 0.2, I-FGM maintains the highest recall for the rest of the simulation period. One document for which I-FGM performed badly is MSN.27. It is a mid-ranked document (as ranked by the I-Forager). Analyzing its similarity values over the course of many partial steps, we see that it initially has a small similarity value of around 0.3, which remains there for a few partial steps before increasing with a low gradient. This situation is not ideal for the current priority function in I-FGM, as it penalizes documents with low similarity values in the initial steps. However, we also see that the priority value of the document increases once its similarity starts to increase. The priority function will be refined in later work to better deal with these kinds of documents.

For query 2, I-FGM is a clear winner, getting 9 of the 12 relevant documents (Fig. 3(a)) faster than the other systems. Its recall values (Fig. 3(b)) are also consistently better than those of the control systems over the simulation period. For query 3, I-FGM gets 4 documents within the shortest time, and partial and baseline tie with 3 documents each. The recall graph shows that after a recall value of 0.4, I-FGM dominates the other systems, achieving full recall the earliest. One document that I-FGM loses to the partial system is Google.21. Analyzing the trend of its similarity value over time, we see that it has a high initial similarity (around 0.7). Despite this, I-FGM lost to the other control systems because it processed other documents that had higher priority values at the time. This demonstrates the difficulty of modeling the priority function: there are situations in which I-FGM will make the wrong choice of documents for processing. Nevertheless, the recall graph shows that I-FGM achieves full recall the earliest and has a consistently higher recall value over the range [0.4-1].

Fig 2: Query 1. (a) Time to appear on the Blackboard for each relevant image; (b) recall vs. time (Baseline, Partial, I-FGM).

Fig 3: Query 2. (a) Time to appear on the Blackboard for each relevant image; (b) recall vs. time (Baseline, Partial, I-FGM).

Fig 4: Query 3. (a) Time to appear on the Blackboard for each relevant image; (b) recall vs. time (Baseline, Partial, I-FGM).

Fig 5: Query 4. (a) Time to appear on the Blackboard for each relevant image; (b) recall vs. time (Baseline, Partial, I-FGM).

Fig 6: Query 5. (a) Time to appear on the Blackboard for each relevant image; (b) recall vs. time (Baseline, Partial, I-FGM).

For query 4 (Fig. 5(a), (b)), I-FGM loses to the partial system. I-FGM has the lowest waiting time for 4 documents while partial wins on 6 documents; however, partial wins by small margins on 4 of these. I-FGM loses to partial by larger margins for Google.40 and MSN.4. Google.40 is a large document which, although it has a large initial similarity value, fluctuates around that value; I-FGM lowers the priority value in such a case, which is why it fails here. For MSN.4 it is easy to see why partial wins: the document has a large first-order similarity and is a small image, and both conditions are ideal for the partial system.

For query 5, all the control systems perform equally, finding 4 documents each in the shortest time. But when we look at the recall graph (Fig 6(b)), we see that I-FGM dominates the other systems, reaching every recall level quicker than the other systems and reaching full recall in the shortest time.

In summary, we see that I-FGM performs better than the other systems. There are conditions under which I-FGM does not perform well; these can be remedied by refining the priority function implemented in I-FGM.

8. SYSTEM PERFORMANCE ANALYSIS

The experimental results in this paper, combined with our previous papers1,2, show that I-FGM is a general framework for effective and efficient information retrieval in large, dynamic, and heterogeneous search spaces. For both text and images, I-FGM outperforms its control systems. However, purely experimental demonstrations alone are not sufficient; without rigorous theoretical analysis, there is no guarantee of system performance. Here, we provide a theoretical analysis focused on the performance of the Baseline, Partial-Intelligent, and I-FGM systems, as well as of systems without partial processing mechanisms.

With the proliferation of electronic data, numerous information retrieval techniques and systems have been developed. They are built on different models, such as the vector space model5, the probabilistic model5, and the language model4. However, most of them, if not all, share a common feature: their similarity results can be obtained only after documents are fully processed. We call these FP (Full-Processing) systems. An FP system has the disadvantage that it can introduce a long delay before the relevant documents are obtained, due to the great amount of time spent fully processing irrelevant or unimportant documents. Therefore, in the I-FGM system, we partially and incrementally process data and refine the computational resource allocation based on the analysis of the obtained partial results and the system condition. We believe that with an intelligent resource allocation strategy we can improve the system's efficiency, so that I-FGM finds relevant documents in less time than FP systems. In the following subsections, we theoretically analyze the average performance of FP systems, the I-FGM system, and the two control systems (the Baseline system and the Partial-Intelligent system) on static databases.
Usually, the essential goal of an information retrieval system is to find and present, as soon as possible, the complete set of documents that are relevant to the query. Therefore, we use the probability that, at time t, a relevant document d is stably ranked as relevant to measure system performance. Here, we assume the documents ranked in the top b on the Blackboard are taken as relevant to the user's query. For I-FGM and its two control systems, the similarity results of documents are built partially and incrementally. The rank of a document changes as time evolves, so a document may jump in and out of the set of relevant documents. Thus, we measure these three systems based on their stable results. A document is called stably ranked as relevant when it is contained in the set of relevant documents and never falls back out of the Blackboard.

Information retrieval is a complex task whose performance is very hard to analyze. Thus, we make two simplifications:
1. All documents in the search space have the same length. They are evenly partitioned into K parts in the Baseline and Partial-Intelligent systems.
2. We omit the partial processing overhead, such as the cost of communication and synchronization and the cost of loading documents or document graphs into gIG-Builders. Therefore, there is a perfect speedup for our distributed systems: the performance analysis results for a system with one processor can be applied to a system with P processors by scaling the performance results by a factor of P. The system performance analysis in the following paragraphs is for systems with only one processor.

Some parameters are used frequently in our system performance analysis; we list them in the following table.

Table 1: System parameters

Parameter   Description
n           The total number of documents contained in the database (gIG-Soup)
m           The number of documents the system has processed
s           The number of relevant documents contained in the database (gIG-Soup)
K           The number of slices that a document is partitioned into in I-FGM and its two control systems
Td          The time for a system to fully process a document
Ts          The time for the system to process one slice of a document; in the simplified model, Td = K*Ts
ai          The minimum number of processed slices for a relevant document di to be stably ranked as relevant; usually ai <= K
PX          The probability for a relevant document to be stably ranked as relevant in system X
pX          The probability for a relevant document to be processed at any time in system X
c           The speedup of the I-FGM system over the B system: I-FGM can stably rank a document as relevant c times faster than the B system
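The stable-ranking condition can be made concrete: given a document's rank on the Blackboard at each partial step, the quantity of interest is the earliest step from which the document stays in the top b. A small illustrative sketch (the names are ours, not part of the I-FGM implementation):

```python
def stably_relevant_from(ranks, b):
    """Earliest step index from which the document's rank stays within the
    top b for the rest of the run; None if it never stabilizes."""
    stable_from = None
    for step, rank in enumerate(ranks):
        if rank <= b:
            if stable_from is None:
                stable_from = step  # candidate start of a stable stretch
        else:
            stable_from = None      # fell out of the top b: reset
    return stable_from
```

For example, with rank history [5, 2, 4, 1, 1] and b = 3, the document entered the top 3 at step 1 but fell back out at step 2, so it is only stably ranked as relevant from step 3 onward.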

8.1 FP system

Typically, FP systems treat every document uniformly. They randomly pick a document and fully process it to obtain its similarity value. In a static search space, once a relevant document is fully processed, it is stably ranked as relevant. At time t = mTd, m documents have been fully processed. The probability that a relevant document is contained in these m processed documents is:

$$P_{FP}(mT_d \mid d_i) = m/n \qquad (1)$$

8.2 B (Baseline) system

From the design of the B system, we can see that it treats each document slice uniformly. It randomly picks and processes one "untouched" document slice, and then updates the similarity value of the document that owns this slice. The probability that a relevant document di is stably ranked as relevant at time t = mTd can be formed as:

$$P_B(mT_d \mid d_i) = P_B(mKT_s \mid d_i) = \frac{\binom{K}{a_i}\binom{nK-a_i}{mK-a_i}}{\binom{nK}{mK}} \qquad (2)$$

where nK is the total number of document slices contained in the gIG-Soup. By time t = mTd, the B system has processed mK document slices. The denominator therefore represents the total number of possible candidate selections when the B system has processed mK document slices. In order to stably rank a relevant document di as relevant on the Blackboard, the B system must process at least ai slices of this document; thus, the numerator represents the total number of possible candidate selections in which at least ai slices of document di have been processed. From this formula we can see that ai is an important factor which significantly affects the probability PB for document di. Intuitively, a large ai means that the system must put more effort into distinguishing this relevant document from irrelevant documents, and vice versa. Since our goal is to analyze average system performance, we define aB as the average of the ai:

$$a_B = \frac{1}{s}\sum_{i=1}^{s} a_i \qquad (3)$$

The average performance of the B system can then be written as:

$$P_B(mT_d \mid d_i) = P_B(mKT_s \mid d_i) = \frac{\binom{K}{a_B}\binom{nK-a_B}{mK-a_B}}{\binom{nK}{mK}} \qquad (4)$$

However, the absolute value of aB alone is not enough to determine system performance; we must also consider K, the total number of slices per document. In fact, the system performance is determined by the ratio of aB to K:

$$r_B = a_B / K \qquad (5)$$

In order to compare the performance of the B system with the FP system, we analyze the following inequality:

$$P_B > P_{FP} \qquad (6)$$

It is hard to solve for the system performance parameters mathematically from this inequality. Thus, we take n, K, and aB as the system configuration, draw the two curves described by (1) and (4), and find the intervals of m that satisfy (6). Based on the size of these intervals, we can measure the percentage of the time range during which the B system outperforms the FP system. We call this the percentage of better range of the B system versus the FP system, and denote it pc. We simulated these two curves on a computer with several sets of system configurations, shown in the following table.

Table 2: System configuration parameters

n        K                            rB
1000     5, 10, 20, 40, 60, 80, 100   0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1
10000    5, 10, 20, 40, 60, 80, 100   0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1
100000   5, 10, 20, 40, 60, 80, 100   0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1
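The two curves and the resulting pc can be reproduced with a short script; the sketch below evaluates (1) and (4) over m = 1..n and counts the fraction of the range where the B system wins (a smaller n than in the table is used purely for illustration):

```python
from math import comb

def p_fp(m, n):
    # Eq. (1): probability the relevant document is among the m fully
    # processed documents in an FP system
    return m / n

def p_b(m, n, K, a):
    # Eq. (4): B-system probability after mK processed slices, with a = a_B
    return comb(K, a) * comb(n * K - a, m * K - a) / comb(n * K, m * K)

def better_range_pct(n, K, r):
    # pc: fraction of the range m = 1..n where the B system wins, Eq. (6)
    a = round(r * K)
    wins = sum(1 for m in range(1, n + 1) if p_b(m, n, K, a) > p_fp(m, n))
    return wins / n
```

Running this for a fixed K while varying r reproduces the trend described next: pc grows as r shrinks, and n has almost no effect.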

From the simulation results we find that the value of n has almost no effect on pc. pc varies with rB such that it increases as rB decreases. Different values of K generate different curves of pc versus rB; these curves are shown in the following figure.

[Fig 6: Percentage of better range pc vs. the ratio r, for K = 5, 10, 20, 40, 60, 80, 100; the pc = 0.5 level is marked]

From this figure we can see that these curves are contained in a narrow band. Based on the real conditions of the search spaces employed in I-FGM and its two control systems, a reasonable value of K for a real search space is between 10 and 20. In the figure, when rB is larger than 0.3, the curves for K=10 and K=20 are almost superimposed. Thus, we believe the curve for K=10 is a good approximation for our system. From the figure we can also see that, no matter what the value of K is, the B system always outperforms FP systems when rB is less than 0.7. rB is determined by the query and by the nature of the real search spaces employed; this parameter can be measured from our experimental results.

8.3 PI (Partial-Intelligent) system

In the PI system, we rank documents by their FOS (First-Order Similarity). The PI system processes documents in decreasing order of their FOS. Therefore, the performance of the PI system is entirely determined by how well the FOS is formulated. In the PI system, the probability that a relevant document is contained in the m processed documents can be formed as:

$$P_{PI}(mT_d \mid d_i) = y(m)/Y \qquad (7)$$

where y(m) is the number of relevant documents that have been handled when the PI system has fully processed m documents, and Y is the total number of relevant documents contained in the search space. Y and y(m) can be measured via the FOS.

8.4 I-FGM system

In the I-FGM system, we assign priorities to documents and refine them based on the analysis of partial results obtained during partial processing. Compared with the PI system, the advantage is that when the FOS is not good, the refined priority can still identify the promising documents and intelligently reallocate resources accordingly. Compared with the B system, the advantage is that the probability of picking and processing a relevant document is larger than the corresponding probability for an irrelevant document.
Assuming pI-FGM = c·pB, the I-FGM system can stably rank a relevant document c times faster than the B system. The I-FGM system is quite complicated due to the nature of the data and the complexity of the priority function. Here, we simplify the I-FGM system in the following ways:
1. In our original design, for each partial processing step, the amount of data to be processed changes with the priority of the document. In our simplified model, we fix this amount. We believe that, in order to stably rank relevant documents as relevant, our original system, with a good priority function, would require fewer document slices to be processed than the simplified model.
2. In our simplified model, instead of considering other factors, such as the history of the document's similarity value and the increment of the document's similarity value, we set the document priority to the document similarity value obtained at the current partial step. Therefore, pdi(t) = simdi(t), where p represents the priority and sim the document similarity value.
3. In previous papers2,3, the I-FGM system uses a hybrid mechanism to select the document to be processed at each partial step. Processors are partitioned into two sets. One set of processors selects documents strictly according to the rank of document priorities. The other set selects documents with probability proportional to the normalized document priorities:

$$\gamma_{d_i}(t) = p_{d_i}(t) \Big/ \sum_{j=1}^{n} p_{d_j}(t) = sim_{d_i}(t) \Big/ \sum_{j=1}^{n} sim_{d_j}(t)$$

where γdi(t) is the probability that the system selects document di at time t. To simplify comparison with the B system, in our simplified model all processors employ the second mechanism. We believe that, with a good priority function, our original design is better than the simplified model.

Even after these simplifications, the I-FGM system performance is determined by the nature of the search space, which mainly depends on two distributions: the distribution of "nuggets" (elements which are relevant to the user's query) within a document, and the distribution of documents in the search space according to their similarity values. The nugget distribution within a document can be arbitrary; studying it is a research topic which requires a great amount of effort. However, if we assume this distribution is uniform, then, on average, the probability that the I-FGM system selects document di to process at a partial step can be formed as

$$\gamma_{d_i} = sim_{d_i} \Big/ \sum_{j=1}^{n} sim_{d_j} \qquad (8)$$
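Selection according to (8) is ordinary roulette-wheel sampling over the current similarity values. A minimal sketch (assuming similarities are kept in a dict; this is illustrative, not the actual I-FGM code):

```python
import random

def pick_document(similarities, rng=random):
    """Pick a document with probability proportional to its current
    similarity value, as in Eq. (8)."""
    docs = list(similarities)
    weights = [similarities[d] for d in docs]
    return rng.choices(docs, weights=weights, k=1)[0]
```

Over many draws, a document with similarity 0.9 is selected about nine times as often as one with similarity 0.1.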

Also, the distribution of documents according to their similarity values in the search space can be arbitrary; for simplicity, we assume this distribution is uniform too. If we take the top 10% of documents as relevant, the average similarity value for relevant documents is 0.95. The average probability for the I-FGM system to choose a relevant document at a partial step is

$$\gamma_{\text{relevant document}} = 0.95 \Big/ \sum_{j=1}^{n} sim_{d_j} \qquad (9)$$

The average similarity value for irrelevant documents is 0.45. The average probability for the I-FGM system to choose an irrelevant document at a partial step is

$$\gamma_{\text{irrelevant document}} = 0.45 \Big/ \sum_{j=1}^{n} sim_{d_j} \qquad (10)$$
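The two averages, and hence the roughly two-to-one ratio derived next, follow directly from the uniformity assumptions and can be checked on a discretized uniform population of similarity values:

```python
# Discretized uniform population of similarity values on (0, 1); the top
# 10% (sim >= 0.9) are treated as relevant, per the assumption above.
sims = [(i + 0.5) / 1000 for i in range(1000)]
relevant = [s for s in sims if s >= 0.9]
irrelevant = [s for s in sims if s < 0.9]

mean_rel = sum(relevant) / len(relevant)      # average relevant similarity
mean_irr = sum(irrelevant) / len(irrelevant)  # average irrelevant similarity
ratio = mean_rel / mean_irr                   # roughly 2.1
```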

From (9) and (10) we can see that the average probability for a relevant document to be processed at a partial step is slightly more than twice the corresponding probability for an irrelevant document:

$$p_{I\text{-}FGM} \approx 2 p_B, \quad c \approx 2.0 \qquad (11)$$

This means that, on average, at any given time the number of processed slices of relevant documents is about twice the corresponding number for irrelevant documents. Therefore, we obtain

$$a_{I\text{-}FGM} = a_B / c \approx a_B / 2 \qquad (12)$$

8.5 Experimental results and analysis

In this subsection, a set of 4 queries is run on our systems, and the system performance results are collected and analyzed. The PI system performance is completely determined by the First-Order Similarity, which depends on the design of the mechanism for measuring the reliability of search engines. Thus, we do not include performance results for the PI system here; we will provide them, with analysis, in our future work on reliability design. The queries used here are chosen from our previous experiments2:
1. Asian Tsunami disaster in South East Asia
2. Tsunami survivors in Indonesian islands
3. Damages caused by tsunami in Phuket beach, Thailand
4. Tsunami victims in Phi Phi Island

The measurement of aB (comparison between the B system and FP systems). The values of K and aB for all four queries are measured and shown in the following table. From the analysis in Subsection 8.2, we can see that the B system outperforms FP systems, and the percentage of better range of the B system versus the FP system is about 70%.

Table 3: The measurement of aB and r

Query     K        aB      r
Query1    6.450    4.091   0.634
Query2    11.537   4.545   0.394
Query3    20.754   11.2    0.540
Query4    5.870    4.3     0.733
Average   11.15    6.034   0.575
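Each r entry is simply the ratio (5) applied to the measured pair (K, aB); a quick consistency check:

```python
# (K, a_B) per query, taken from Table 3
measured = {
    "Query1": (6.450, 4.091),
    "Query2": (11.537, 4.545),
    "Query3": (20.754, 11.2),
    "Query4": (5.870, 4.3),
}
r = {q: a_b / K for q, (K, a_b) in measured.items()}
# rounding each value to three decimals reproduces the r column
```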

The measurement of c (comparison between the I-FGM system and the B system). To estimate c, we first measure the number of slices zi of document di that have been processed when di becomes stably ranked as relevant. The average probability for a system to process this document can be formed as:

$$p(d_i) = \frac{z_i}{m_i} = \frac{z_i}{t_i / T_s} = \frac{z_i T_s}{t_i}$$

where mi is the number of slices processed by the system at the current time ti. The average probability to choose a relevant document is

$$\bar{p}(d) = \frac{1}{s}\sum_{i=1}^{s} p(d_i) = \frac{1}{s}\sum_{i=1}^{s} \frac{z_i T_s}{t_i}$$

Therefore, c can be estimated as

$$c = \frac{\bar{p}_{I\text{-}FGM}(d)}{\bar{p}_B(d)} = \frac{\sum_{i=1}^{s} z_{I\text{-}FGM,i}\, T_s / (t_{I\text{-}FGM,i}\, s)}{\sum_{i=1}^{s} z_{B,i}\, T_s / (t_{B,i}\, s)} = \frac{\sum_{i=1}^{s} z_{I\text{-}FGM,i} / t_{I\text{-}FGM,i}}{\sum_{i=1}^{s} z_{B,i} / t_{B,i}} \qquad (13)$$

The values of z and t for every relevant document in each query are shown in the following tables.

Table 4: The measurement of z and t for queries 1 and 2

Query 1:
Relevant document   Baseline z / t(s)   I-FGM z / t(s)
Teoma.38            1 / 11              1 / 32
Teoma.35            1 / 5               1 / 4
Looksmart.0         11 / 91             11 / 91
Looksmart.9         3 / 36              3 / 22
Teoma.3             1 / 37              1 / 2
Yahoo.25            11 / 91             11 / 91
Msn.21              1 / 14              1 / 4
Teoma.29            1 / 15              1 / 54
Yahoo.1             1 / 11              1 / 5
Looksmart.39        7 / 74              7 / 91
Looksmart.40        7 / 52              7 / 94

Query 2:
Relevant document   Baseline z / t(s)   I-FGM z / t(s)
Looksmart.16        5 / 77              4 / 29
Google.24           2 / 27              2 / 89
Google.15           3 / 47              3 / 50
Yahoo.11            3 / 55              3 / 68
Looksmart.12        2 / 42              2 / 69
Looksmart.1         3 / 34              1 / 2
Google.21           4 / 48              4 / 50
Looksmart.33        10 / 100            11 / 100
Msn.18              10 / 118            10 / 89
Google.19           3 / 30              3 / 81
Yahoo.25            5 / 62              2 / 20

Table 5: The measurement of z and t for queries 3 and 4

Query 3:
Relevant document   Baseline z / t(s)   I-FGM z / t(s)
Looksmart.18        7 / 68              12 / 110
Teoma.34            23 / 255            22 / 161
Looksmart.7         22 / 184            12 / 114
Looksmart.33        2 / 29              2 / 15
Yahoo.35            2 / 18              2 / 19

Query 4:
Relevant document   Baseline z / t(s)   I-FGM z / t(s)
Looksmart.10        6 / 69              6 / 50
Looksmart.40        3 / 65              3 / 54
Google.27           8 / 95              8 / 99
Msn.7               7 / 101             6 / 44
Looksmart.9         2 / 31              2 / 23
Google.12           4 / 75              4 / 31
Google.5            7 / 70              6 / 76
Google.26           2 / 27              2 / 18
Msn.23              1 / 59              1 / 14
Teoma.26            3 / 29              3 / 50

Based on formula (13), we obtain c:

Query     Query1   Query2   Query3   Query4   Average
c         1.615    1.465    1.196    1.33     1.40
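As a worked example of formula (13), the query 3 entries from Table 5 reproduce the value above; note that the Ts and 1/s factors cancel between numerator and denominator:

```python
# (z_baseline, t_baseline, z_ifgm, t_ifgm) per relevant document, query 3
query3 = {
    "Looksmart.18": (7, 68, 12, 110),
    "Teoma.34":     (23, 255, 22, 161),
    "Looksmart.7":  (22, 184, 12, 114),
    "Looksmart.33": (2, 29, 2, 15),
    "Yahoo.35":     (2, 18, 2, 19),
}

def estimate_c(rows):
    """Eq. (13): ratio of the summed per-document processing rates."""
    num = sum(z_i / t_i for (_, _, z_i, t_i) in rows.values())
    den = sum(z_b / t_b for (z_b, t_b, _, _) in rows.values())
    return num / den

c3 = estimate_c(query3)  # about 1.196
```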

The value of c measured from the experimental results differs from the value estimated in our theoretical analysis. This is because, in our real search space, the distribution of "nuggets" within a document is not uniform. In our experiments we found that many documents reach their final similarity value within the first (or second) partial processing step, meaning that in many documents the nuggets are concentrated near the beginning of the document. This reduces the advantage of refining document priorities in I-FGM. In the extreme case, if all documents obtained their final similarity values at their first partial processing step, there would be no advantage to implementing I-FGM.

From the analysis of our experimental results, we can see that both the B system and the I-FGM system outperform the FP system. The percentage of better range of the B system versus the FP system is about 70%. The I-FGM system outperforms the B system, although the distribution of nuggets within documents affects the performance comparison between the two. Theoretically, with a uniform distribution of nuggets, the I-FGM system can find relevant documents about 2 times faster than the B system; with the real distribution of nuggets in our experimental data, the I-FGM system is about 1.40 times faster than the B system.

9. CONCLUSION AND FUTURE WORK

In this paper, we presented the initial application of the I-FGM framework to image retrieval. We adopted a region-based wavelet image retrieval algorithm called WALRUS for retrieving information from images. In our I-FGM system, we partially and incrementally extract information from images and refine our computational resource allocation based on the analysis of partial results. The I-FGM system was tested against its two control systems; the results, collected and analyzed above, show that I-FGM is effective in using computational resources efficiently and, consequently, in retrieving relevant images quickly. We also provided theoretical analysis and experimental validation of the performance of the different systems: a comparison of partial-processing systems with full-processing systems, as well as a comparison between the partial-processing systems themselves, i.e., the I-FGM system and the B system. Based on measurements of real data in our system, we demonstrated that partial-processing systems outperform full-processing systems, and that among partial-processing systems, the I-FGM system can find relevant information about 1.4 times faster than the B system.

In this paper, we discussed and validated the I-FGM framework for image retrieval, which is only one of the fundamental problems in dealing with heterogeneous data in information retrieval. We plan to continue working on handling heterogeneity in I-FGM. We intend to implement our unified framework using the concept graph idea for determining the relevance of images and texts. As a starting point, we will study work on the automatic annotation of images, such as CAMEL8 and ALIPR19. We will then design, implement, and validate an intelligent computational resource allocation strategy for heterogeneous data. We will also provide a theoretical analysis of system performance on dynamic and heterogeneous data.

ACKNOWLEDGEMENTS The work presented in this paper was supported in part by the National Geospatial Intelligence Agency Grant Nos. HM1582-04-1-2027 and HM1582-05-1-2042.

REFERENCES

[1] E. Santos, Jr., E. E. Santos, H. Nguyen, Q. Zhao, L. Pan, and J. Korah, "Large-scale Distributed Foraging, Gathering, and Matching for Information Retrieval: Assisting the Geospatial Intelligent Analyst," Proceedings of SPIE Defense and Security Symposium, Vol. 5803, pp. 66-77, 2005.
[2] E. Santos, Jr., E. E. Santos, H. Nguyen, L. Pan, J. Korah, Q. Zhao, and M. Pittkin, "I-FGM Information Retrieval in Highly Dynamic Search Spaces," Proceedings of SPIE Defense and Security Symposium, Vol. 6229, pp. 1-12, 2006.
[3] E. Santos, Jr., E. E. Santos, and E. S. Santos, "Distributed Real-Time Large-Scale Information Processing and Analysis," Proceedings of SPIE Defense and Security Symposium, Vol. 5421, pp. 161-171, 2004.
[4] F. Song and W. B. Croft, "A General Language Model for Information Retrieval," Proceedings of the Eighth International Conference on Information and Knowledge Management, pp. 279-280, 1999.
[5] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.
[6] A. Natsev, R. Rastogi, and K. Shim, "WALRUS: A Similarity Retrieval Algorithm for Image Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 16, pp. 301-316, 2004.
[7] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases," Proceedings of the ACM SIGMOD International Conference on Management of Data, Montreal, Canada, 1996.
[8] A. Natsev, "Multimedia Retrieval by Regions, Concepts, and Constraints," PhD thesis, Department of Computer Science, Duke University, 2001.
[9] J. Li and J. Z. Wang, "Real-time Computerized Annotation of Pictures," Proceedings of the ACM Multimedia Conference, pp. 911-920, 2006.
[10] P. Enser and C. Sandom, "Towards a Comprehensive Survey of the Semantic Gap in Visual Image Retrieval," Lecture Notes in Computer Science, vol. 2728, pp. 279-287, 2003.
[11] A. W. M. Smeulders, M. M. Worring, S. Santini, A. Gupta, and R. Jain, "Content-based Image Retrieval at the End of the Early Years," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 1349-1380, 2000.
[12] R. C. Veltkamp and M. Tanase, "Content-Based Image Retrieval Systems: A Survey," Technical Report UU-CS-2000-34, Department of Computing Science, Utrecht University, 2000.
[13] Y. Rui, T. Huang, and S. Chang, "Image Retrieval: Current Techniques, Promising Directions and Open Issues," Journal of Visual Communication and Image Representation, vol. 10, pp. 39-62, 1999.
[14] I. Daubechies, Ten Lectures on Wavelets, SIAM, Philadelphia, 1992.
[15] C. Carson, S. Belongie, H. Greenspan, and J. Malik, "Blobworld: Image Segmentation Using Expectation-Maximization and Its Application to Image Querying," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 1026-1038, 2002.
[16] J. Li, J. Z. Wang, and G. Wiederhold, "IRM: Integrated Region Matching for Image Retrieval," Proceedings of the ACM Multimedia Conference, Los Angeles, CA, 2000.
[17] W. Y. Ma and B. Manjunath, "NaTra: A Toolbox for Navigating Large Image Databases," Proceedings of the IEEE International Conference on Image Processing, 1997.
[18] J. Z. Wang, J. Li, and G. Wiederhold, "SIMPLIcity: Semantics-Sensitive Integrated Matching for Picture Libraries," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, pp. 947-963, 2001.
[19] J. Li and J. Z. Wang, "Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1075-1088, 2003.