CommIT (Communication & Information Technology) 10(1), 15–26, 2016

IMPROVING SCALABILITY OF JAVA ARCHIVE SEARCH ENGINE THROUGH RECURSION CONVERSION AND MULTITHREADING

Oscar Karnalim
Program of Information Technology, Faculty of Information Technology, Maranatha Christian University, Bandung 40164, Indonesia
Email: [email protected]

Abstract—Based on the fact that bytecode always exists in a Java archive, a bytecode-based Java archive search engine had been developed [1, 2]. Although the system is quite effective, it still lacks scalability because many modules apply recursive calls and the system only utilizes one core (single thread). In this research, the Java archive search engine architecture is redesigned in order to improve its scalability. All recursions are converted into iterative forms, although most of these modules are logically recursive and quite difficult to convert (e.g., Tarjan's strongly connected component algorithm). Recursion conversion can be conducted by following each module's recursive pattern. Each recursion is broken down into four parts (the before and after actions of the current process and of its children) and converted to iteration with the help of a caller reference. This conversion mechanism improves scalability by avoiding the stack overflow error caused by method calls. System scalability is also improved by applying a multithreading mechanism, which successfully cuts off processing time. Shorter processing time may enable the system to handle larger data. Multithreading is applied to the major parts, which are the indexer, the vector space model (VSM) retriever, the low-rank vector space model (LRVSM) retriever, and the semantic relatedness calculator (the semantic relatedness calculator also involves multiprocessing). The correctness of both the recursion conversion and the multithread design is proved by the fact that all implementations yield similar results.

Keywords: Scalability; Recursion Conversion; Multithreading; Java Archive Search Engine; Multiprocess.

I. INTRODUCTION

Scalability is a prominent factor in search engine development since the size of the search engine's index data may grow rapidly [3]. Many techniques are applied to handle data growth, such as algorithm optimization, preprocessing, multithreading, and parallelism.

Received: March 22, 2016; received in revised form: March 28, 2016; accepted: March 30, 2016; available online: April 4, 2016.

Based on the fact that bytecode always exists in a Java archive, a bytecode-based Java archive search engine had been developed [1]. The system utilizes bytecode in Java archives as its primary information source and extracts textual information from class files as document terms through a reverse engineering mechanism (e.g., class names, field names, method names, control-flow-weighted string literals in method contents, and method calls). Its retrieval model is also embedded with relatedness in order to improve its recall [2]. Although the system works well, it still lacks scalability, since many modules apply recursive calls and the system is single-threaded. Since recursion generates many function calls and each function call is pushed on the stack to keep track of the program flow [4], recursive calls may yield a stack overflow error in the Java environment. A stack overflow error occurs when the memory stored in the JVM stack exceeds its size [5]. This error can be avoided by converting all recursive algorithms to their iterative forms. However, some algorithms are logically recursive and inconvenient to design in an iterative manner (e.g., Tarjan's algorithm for detecting strongly connected components in a graph [6]). Although logically recursive algorithms are exceptional cases where the recursive approach is more beneficial than the iterative approach in software development [7], these algorithms still need to be converted in order to avoid stack overflow errors. As hardware technology advances, many regular computers are backed by multi-core processors, which enable the operating system to complete many tasks at a time using multithreading [8]. Liu & Wang [9] applied multithreading to their ensemble learning in order to filter spam in Short Message Service (SMS).

Cite this article as: O. Karnalim, "Improving Scalability of Java Archive Search Engine Through Recursion Conversion and Multithreading", CommIT (Communication & Information Technology) Journal 10(1), 15–26, 2016.

Their multithreaded design runs faster than the single-threaded design, which strengthens the proof that multithreading may reduce processing time. This mechanism may also be utilized to reduce time latency in many search engine tasks such as indexing and retrieving documents [10–12].

In this research, a bytecode-based Java archive search engine is redesigned in a more scalable way involving recursion conversion and multithreading. The three main recursive algorithms converted to iterative form are loop encapsulation, recursive method elimination, and method expansion. These algorithms are needed at the indexing phase in order to extract method contents. The search engine architecture is also redesigned as a multithreaded search engine to diminish its processing time.

II. METHODS

A. Recursion Conversion

It is known that all algorithms can be implemented either recursively or iteratively [13]. Some algorithms are more convenient to implement with an iterative approach, whereas others are more naturally designed recursively. Many naïve and straightforward algorithms, such as sequential searching, sorting, and string processing, are easier to implement iteratively. However, some permutation and combinatorial problems are easier to solve recursively (e.g., map coloring, insertion into a binary tree, and traversing all nodes in a graph).

The recursive algorithms converted in this research are logically recursive. Loop encapsulation is designed similarly to the Depth First Search (DFS) pattern and utilizes Tarjan's Strongly Connected Component (SCC) algorithm for loop detection. Recursive method elimination also utilizes Tarjan's algorithm to detect recursive methods, although the two modules do not share the same data type. Therefore, the object-oriented technique called Generics is applied to enable Tarjan's SCC algorithm to take various data types without explicit typecasting. Method expansion is designed in a recursive dynamic-programming manner which never re-expands an already-expanded method, cutting off its processing time.

Since stack overflow error is the biggest issue in recursive implementation, especially in large-scale recursion [4, 5], all recursive implementations in this research are converted to iterative ones to improve program scalability. For clarity at the conversion phase, the recursive parts of each algorithm in both implementations are marked with different colors, whose details can be seen in Table I. Abbreviations are also given for each description to simplify the illustration.

TABLE I
RECURSIVE PARTS COLOR DETAILS.

Color   Description
Blue    Current process before processing its children (B)
Red     Current process after processing its children (A)
Green   Current children's process before recursion (CB)
Brown   Current children's process after recursion (CA)

B. Loop Encapsulation

Loop encapsulation encapsulates loops based on Miecznikowski's algorithm [1, 14]; an example can be seen in Fig. 1. This module takes a control flow graph as its argument and utilizes Miecznikowski's algorithm to detect loop candidates. All nodes which are members of a certain loop are wrapped and merged into one loop node. A loop node is a wrapper which consists of the loop member nodes and inherits their successors/predecessors. This loop replacement mechanism is conducted repeatedly from the inner-most loops until no more loops exist.

The recursive implementation of the loop encapsulation module can be seen in Alg. 1. The Control Flow Graph (CFG) is represented as an array of Control Flow (CF) nodes which is broken down into subgraphs.
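The four-part decomposition above (B, A, CB, CA) can be illustrated on a toy example. The sketch below is a hypothetical illustration, not the paper's code: it converts a recursive tree walk (which has both B and A parts) into an iterative one using an explicit stack of frames, the same idea the paper applies to its search engine modules. All class and method names here are invented for the demonstration.

```java
import java.util.*;

public class RecursionConversionDemo {
    // A toy tree node (hypothetical; the paper works on control flow graphs).
    static class Node {
        int value;
        List<Node> children = new ArrayList<>();
        Node(int v) { value = v; }
    }

    // Recursive walk: "B" is the before-children part, "A" the after part.
    static void walkRecursive(Node n, List<String> log) {
        log.add("B" + n.value);                  // B part
        for (Node c : n.children) walkRecursive(c, log);
        log.add("A" + n.value);                  // A part
    }

    // Iterative walk: an explicit stack of frames replaces the call stack,
    // so no StackOverflowError can occur on deep structures.
    static void walkIterative(Node root, List<String> log) {
        Deque<Object[]> stack = new ArrayDeque<>(); // frame = {node, next child index}
        log.add("B" + root.value);
        stack.push(new Object[]{root, 0});
        while (!stack.isEmpty()) {
            Object[] frame = stack.peek();
            Node n = (Node) frame[0];
            int i = (Integer) frame[1];
            if (i < n.children.size()) {
                frame[1] = i + 1;                // advance before "descending"
                Node c = n.children.get(i);
                log.add("B" + c.value);          // child's B part
                stack.push(new Object[]{c, 0});
            } else {
                log.add("A" + n.value);          // A part, after all children
                stack.pop();
            }
        }
    }

    // Builds a small tree and checks that both walks produce the same order.
    static boolean ordersMatch() {
        Node root = new Node(1), a = new Node(2), b = new Node(3);
        root.children.add(a); root.children.add(b);
        a.children.add(new Node(4));
        List<String> rec = new ArrayList<>(), it = new ArrayList<>();
        walkRecursive(root, rec);
        walkIterative(root, it);
        return rec.equals(it);
    }

    public static void main(String[] args) {
        System.out.println(ordersMatch()); // true
    }
}
```

The frame's child index plays the role the paper assigns to the caller reference: it records where each "call" should resume, so the A part fires only after all children finish.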

[Figure 1 shows a worked example in four panels: the initial control flow graph, the loop detector result, the inner loop encapsulation, and the outer loop encapsulation.]

Fig. 1. Loop encapsulation example.

Each subgraph produced by Tarjan's algorithm (the getSCC function) which consists of more than one node is considered a loop candidate, and its respective conditional node is removed. Removing conditional nodes is required to detect inner loops on the remaining nodes by executing a similar mechanism recursively.

Algorithm 1 Recursive Loop Encapsulation.

procedure encapsLoopR(nodeList : CF[])
| CF[][] sccList = getSCC(nodeList)
| for each scc in sccList
| | if(scc.length > 1)
| | | encapsulate scc as loop node
| | | detectLoopType(scc)
| | | removeConditionalNodes(scc)
| | | encapsLoopR(scc)
| | end if
| end for
end procedure

Converting loop encapsulation to iterative form is quite simple since this algorithm follows a simple DFS pattern (without the A and CA parts). A simple DFS pattern can be remodeled into iterative form with the help of a stack and a while loop, which are used to keep track of the recursive calls. Iterative loop encapsulation can be seen in Alg. 2, where each recursive call is replaced by pushing its recursive argument on a stack. B and CB are translated into loop actions, and each element on the stack is popped and processed until the stack is empty.

Algorithm 2 Iterative Loop Encapsulation.

procedure encapsLoopI(nodeList : CF[])
| Stack s
| s.push(nodeList)
| while(s is not empty)
| | CF[] tmp = s.pop()
| | CF[][] sccList = getSCC(tmp)
| | for each scc in sccList
| | | if(scc.length > 1)
| | | | encapsulate scc as loop node
| | | | detectLoopType(scc)
| | | | removeConditionalNodes(scc)
| | | | s.push(scc)
| | | end if
| | end for
| end while
end procedure

C. Recursive Method Elimination

Recursive method elimination removes all recursive methods from the method expansion candidates, since recursion may yield endless method expansion (each expansion adds another recursive call). This module takes a method list as its argument, augments it into a method graph, detects recursion candidates, and removes them from the method expansion candidates. The method graph is built from method calls: each method is converted to a node, and each method call is converted to an edge from the caller to the called method. Since this module only applies recursion in its use of Tarjan's SCC algorithm, it will not be discussed further.

D. Tarjan's SCC Algorithm

Since loop encapsulation and recursive method elimination both involve SCC detection on their respective graphs, Tarjan's SCC algorithm is also converted to iterative form. The implementation applies Generics, which enables the module to take various data types without explicit typecasting. Generics is a Java feature which enables developers to use the same class for many kinds of data types without explicit typecasting [15].

Tarjan's SCC algorithm is logically recursive and quite difficult to convert to an iterative algorithm. It also exhibits all recursive algorithm parts declared in Table I (B, A, CB, and CA). Tarjan's recursive implementation can be seen in Alg. 3. The algorithm is conducted by calling getSCC. The index and lowlink of each node are used to detect SCCs, wherein −1 stands for unprocessed nodes. The index variable in this algorithm is considered a global variable.

Despite the difficulty of its iterative conversion, this algorithm can still be converted by imitating its recursive patterns with the help of a caller reference. Tarjan's iterative implementation can be seen in Alg. 4. The implementation follows these rules:
• Each data tuple is encapsulated as an IterTuple, which has an additional field called the caller reference. The caller reference is used to keep track of the recursive pattern and to perform the "after recursion" parts (A and CA).
• The procedure call of constructSCC is replaced by a loop which performs a pattern similar to constructSCC.
• The current process is stored in cur, and each recursive call is replaced by assigning cur its recursive process (which is similar to letting the process visit its child first).
• The A and CA parts are conducted after all successors have been "recursively" processed. The A part is conducted before the CA part, since the CA part conducted in this phase belongs to the parent process (the parent process is accessed using the caller reference).

E. Method Expansion

Method expansion unwraps all method encapsulation by replacing every method call with its respective method contents until no method call remains (which is quite similar to method inlining [16]).



Algorithm 3 Recursive Tarjan’s SCC Algorithm.

function getSCC(nodeList : T[]) : T[][]
| T[][] sccList; Stack stack; int index = 0
| for each node in nodeList
| | if(node.index == -1)
| | | constructSCC(node, sccList, stack, index, nodeList)
| | end if
| end for
| return sccList
end function

procedure constructSCC(node : T, sccList : T[][], stack : Stack, index : int, nodeList : T[])
| node.index = index; node.lowlink = index
| stack.push(node); index = index + 1
| for each successor suc of node in nodeList
| | if(suc.index == -1)
| | | constructSCC(suc, sccList, stack, index, nodeList)
| | | node.lowlink = min(suc.lowlink, node.lowlink)
| | else if(suc.onStack)
| | | node.lowlink = min(suc.index, node.lowlink)
| | end if
| end for
| if(node.index == node.lowlink)
| | T[] scc; T n
| | do{
| | | n = stack.pop()
| | | scc.add(n)
| | }while(node != n)
| | sccList.add(scc)
| end if
end procedure


This mechanism is applied to reweight terms, since terms used in frequently called methods should have greater weight based on their occurrences. Each inserted method content is also weighted by its prior method call weight in order to keep its relevancy to its caller method. Since recursive calls may yield an unlimited loop during the expansion phase, recursive methods are marked and only expanded N times (N is defined as a parameter at the indexing phase).

Method expansion applies dynamic programming to speed up its process. Each method is only expanded once, and any unexpanded method called in a method content is expanded first. This module is initially implemented with a recursive approach because of its naturally recursive logic. The recursive implementation of method expansion can be seen in Alg. 5 and its iterative form in Alg. 6. The conversion of this module imitates the Tarjan's SCC algorithm conversion, where each tuple is encapsulated as an IterTuple with an additional caller reference field. Although copying all terms of the method with nID to mtt serves as both the A and CA parts, this instruction can still be converted by duplicating the instruction (the first copy for the A part and the second for the CA part).

F. Multithread Design

It is obvious that a single-threaded program on a multi-core processor is not efficient, since it only utilizes one core. To utilize all cores, a program must apply a multithreading mechanism, which lets it run tasks on many different cores at the same time. Therefore, the program developed in this research is redesigned in a multithreaded manner in order to cut off its processing time. Two major parts are redesigned: the indexer and the retriever. The indexer conversion is quite simple since its jobs can be split based on documents (Java archives). The retriever part involves two retrieval mechanisms: VSM and LRVSM. Unlike the indexer part, this module requires some global calculation after multithreading to yield a result similar to the sequential one. In addition, the semantic relatedness calculator between term pairs is also redesigned in a multithreaded manner, since semantic relatedness is required by the LRVSM retriever.

G. Multithread Indexer

Since document-partitioned indexes are more frequently used than term-partitioned indexes in most search engines [17], the multithread indexing developed in this research is based on document-partitioned indexes. Documents (Java archives) are split and indexed separately using many threads at the same time. Because document-partitioned indexes assure that indexes are not tightly coupled to each other, erroneous indexes only affect their indexed documents, and the re-index phase does not affect the remaining indexes. Therefore, it may yield faster index error correction.
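The document-partitioned indexing idea can be sketched with a thread pool: each document partition is one job, each job runs on its own thread, and each job produces its own separate index. This is a hypothetical minimal sketch (term-frequency maps stand in for the paper's bytecode-derived indexes; all names are invented).

```java
import java.util.*;
import java.util.concurrent.*;

public class PartitionedIndexer {
    // Hypothetical per-job index: term -> frequency. The paper's real indexes
    // hold richer bytecode-derived data; this is only a sketch.
    static Map<String, Integer> indexPartition(List<String> docs) {
        Map<String, Integer> index = new HashMap<>();
        for (String doc : docs)
            for (String term : doc.split("\\s+"))
                index.merge(term, 1, Integer::sum);
        return index;
    }

    // One partition = one job = one thread = one separate index (no merging),
    // mirroring the document-partitioned design described above.
    static List<Map<String, Integer>> indexAll(List<List<String>> partitions)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
        List<Future<Map<String, Integer>>> futures = new ArrayList<>();
        for (List<String> p : partitions)
            futures.add(pool.submit(() -> indexPartition(p)));
        List<Map<String, Integer>> indexes = new ArrayList<>();
        for (Future<Map<String, Integer>> f : futures)
            indexes.add(f.get());            // wait for every job to finish
        pool.shutdown();
        return indexes;
    }

    // Indexes two partitions and reports how many separate indexes result.
    static int demoCount() {
        try {
            return indexAll(List.of(
                List.of("class Foo", "class Bar"),
                List.of("interface Baz"))).size();
        } catch (Exception e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(demoCount()); // 2
    }
}
```

Because each job writes only to its own index, no locking is needed, and a failed job invalidates only its own partition.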


Algorithm 4 Iterative Tarjan’s SCC Algorithm.

function getSCC(nodeList : T[]) : T[][]
| T[][] sccList
| int index = 0
| Stack stack
| for each node in nodeList
| | if(node.index == -1)
| | | IterTuple cur
| | | cur.node = node; cur.caller = null
| | | cur.node.index = index; cur.node.lowlink = index
| | | stack.push(cur); index = index + 1
| | | while(true)
| | | | if(cur.node has unprocessed successors which are nodeList members)
| | | | | IterTuple next
| | | | | next.node = cur.node.getNextUnprocessedNode(); next.caller = cur
| | | | | if(next.node.index == -1)
| | | | | | next.node.index = index; next.node.lowlink = index
| | | | | | stack.push(next); index = index + 1
| | | | | | cur = next
| | | | | else if(next.node.onStack)
| | | | | | cur.node.lowlink = min(cur.node.lowlink, next.node.index)
| | | | | end if
| | | | else
| | | | | if(cur.node.index == cur.node.lowlink)
| | | | | | T[] scc; IterTuple top
| | | | | | do{
| | | | | | | top = stack.pop()
| | | | | | | scc.add(top.node)
| | | | | | }while(top.node.index != cur.node.index)
| | | | | | sccList.add(scc)
| | | | | end if
| | | | | IterTuple caller = cur.caller
| | | | | if(caller != null)
| | | | | | caller.node.lowlink = min(caller.node.lowlink, cur.node.lowlink)
| | | | | | cur = caller
| | | | | else break
| | | | | end if
| | | | end if
| | | end while
| | end if
| end for
| return sccList
end function
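The caller-reference conversion can be made concrete in Java. The sketch below is an illustrative iterative Tarjan's SCC over an integer adjacency list, not the paper's Generics-based implementation: a frame stack replaces the call stack, and peeking at the previous frame plays the role of the caller reference when propagating lowlink values (the CA part).

```java
import java.util.*;

public class IterativeTarjan {
    // Iterative Tarjan's SCC: graph is an adjacency list over nodes 0..n-1.
    public static List<List<Integer>> scc(List<List<Integer>> adj) {
        int n = adj.size();
        int[] index = new int[n], low = new int[n];
        Arrays.fill(index, -1);                 // -1 marks unprocessed nodes
        boolean[] onStack = new boolean[n];
        Deque<Integer> tarjanStack = new ArrayDeque<>();
        List<List<Integer>> sccs = new ArrayList<>();
        int counter = 0;

        for (int root = 0; root < n; root++) {
            if (index[root] != -1) continue;
            Deque<int[]> frames = new ArrayDeque<>(); // {node, next successor}
            index[root] = low[root] = counter++;
            tarjanStack.push(root); onStack[root] = true;
            frames.push(new int[]{root, 0});
            while (!frames.isEmpty()) {
                int[] f = frames.peek();
                int v = f[0];
                if (f[1] < adj.get(v).size()) {
                    int w = adj.get(v).get(f[1]++);
                    if (index[w] == -1) {       // the "recursive call" on w
                        index[w] = low[w] = counter++;
                        tarjanStack.push(w); onStack[w] = true;
                        frames.push(new int[]{w, 0});
                    } else if (onStack[w]) {
                        low[v] = Math.min(low[v], index[w]);
                    }
                } else {
                    frames.pop();               // "return" from v
                    if (low[v] == index[v]) {   // v is an SCC root (A part)
                        List<Integer> comp = new ArrayList<>();
                        int w;
                        do { w = tarjanStack.pop(); onStack[w] = false;
                             comp.add(w); } while (w != v);
                        sccs.add(comp);
                    }
                    if (!frames.isEmpty()) {    // caller-reference update (CA)
                        int caller = frames.peek()[0];
                        low[caller] = Math.min(low[caller], low[v]);
                    }
                }
            }
        }
        return sccs;
    }

    // Cycle 0 -> 1 -> 2 -> 0 is one SCC; node 3 (pointing into it) is alone.
    static boolean demoOk() {
        List<List<Integer>> adj = List.of(
            List.of(1), List.of(2), List.of(0), List.of(0));
        List<List<Integer>> out = scc(adj);
        return out.size() == 2
            && out.stream().anyMatch(c -> new HashSet<>(c).equals(Set.of(0, 1, 2)));
    }

    public static void main(String[] args) {
        System.out.println(demoOk()); // true
    }
}
```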


Documents are partitioned based on a greedy load-balance mechanism, which can be seen in Alg. 7. The number of initialized stacks equals the number of expected jobs, and each document is placed on the stack with the lowest total document size at that time. This mechanism intends to distribute documents evenly among all jobs with a greedy approach. Although the greedy approach does not always yield the best result, it may yield a fairly good distribution among all jobs in linear time; it is extremely faster than the brute-force approach, whose complexity is O(N!).

The multithread indexer design can be seen in Fig. 2. All documents are listed and distributed based on the greedy load-balance algorithm. Since each job yields an index based on its given documents, this mechanism results in many indexes rather than one. These indexes need not be merged into one, since split indexes may yield faster index error correction with the help of the distribution list. The distribution list is a comma-separated


Fig. 2. Multithread Indexer Design.

values file that contains the document list for each job. This file is also generated at the indexing phase. Any weighting schemes conducted in the previous sequential design are delayed to the retriever part for these reasons:


• Storing raw indexes is more scalable than storing processed ones, since various weighting schemes can be applied without re-indexing.


Algorithm 5 Recursive Method Expansion Algorithm.

function expand(beforeList : Data) : Data
| Data afterList
| for each method with id curID in beforeList
| | methodExpansion(curID, beforeList, afterList)
| return afterList
end function

procedure methodExpansion(curID : MethodID, beforeList : Data, afterList : Data)
| if(afterList does not contain method with curID)
| | ExpandedMethod mtt
| | ScoreTuple[] sList = get score tuple list of method with curID from beforeList
| | for each ScoreTuple st in sList
| | | if(st is string literal)
| | | | mtt.add(st)
| | | else // st is method call
| | | | MethodID nID = st.getMethodID()
| | | | if(beforeList contains method with nID) // need to be expanded
| | | | | if(afterList does not contain method with nID) // not expanded yet
| | | | | | methodExpansion(nID, beforeList, afterList)
| | | | | end if
| | | | | copy all term in method with nID to mtt // also included as CA part
| | | | end if
| | | end if
| | end for
| | afterList.add(mtt)
| end if
end procedure


Algorithm 6 Iterative Method Expansion Algorithm.

function expand(beforeList : Data) : Data
| Data afterList
| for each method with id curID in beforeList
| | methodExpansion(curID, beforeList, afterList)
| return afterList
end function

procedure methodExpansion(curID : MethodID, beforeList : Data, afterList : Data)
| if(afterList does not contain method with curID)
| | IterTuple cur
| | cur.id = curID; cur.caller = null
| | cur.sList = get score tuple list of method with cur.id from beforeList
| | while(true)
| | | if(cur.sList has unprocessed ScoreTuple)
| | | | ScoreTuple st = cur.sList.getNextUnprocessedScoreTuple()
| | | | if(st is string literal)
| | | | | cur.mtt.add(st)
| | | | else // st is method call
| | | | | MethodID nID = st.getMethodID()
| | | | | if(beforeList contains method with nID) // need to be expanded
| | | | | | if(afterList does not contain method with nID) // not expanded yet
| | | | | | | IterTuple next
| | | | | | | next.id = nID; next.caller = cur
| | | | | | | next.sList = get score tuple list of method with next.id from beforeList
| | | | | | | cur = next
| | | | | | else
| | | | | | | copy all term in method with nID to cur.mtt
| | | | | | end if
| | | | | end if
| | | | end if
| | | else
| | | | afterList.add(cur.mtt)
| | | | if(cur.caller != null)
| | | | | insert all term in method with cur.id to cur.caller.mtt at its respective pos
| | | | | cur = cur.caller
| | | | else break
| | | | end if
| | | end if
| | end while
| end if
end procedure
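The expand-once dynamic-programming idea behind Algs. 5 and 6 can be sketched on toy data. This is a hypothetical illustration (method bodies as token lists, a `call:` prefix marking calls — all invented here); it memoizes each expansion so no method is expanded twice, and it omits the paper's N-times handling of recursive methods.

```java
import java.util.*;

public class MethodExpansionDemo {
    // Hypothetical method bodies: a token starting with "call:" is a method
    // call; anything else is a literal term.
    static Map<String, List<String>> methods = new HashMap<>();
    static Map<String, List<String>> expanded = new HashMap<>(); // memo table

    static List<String> expand(String id) {
        List<String> memo = expanded.get(id);
        if (memo != null) return memo;       // dynamic programming: expand once
        List<String> out = new ArrayList<>();
        expanded.put(id, out);               // memoize up front, which also
                                             // stops runaway re-expansion
        for (String tok : methods.get(id)) {
            if (tok.startsWith("call:")) {
                String callee = tok.substring(5);
                if (methods.containsKey(callee))
                    out.addAll(expand(callee)); // inline the callee's terms
            } else {
                out.add(tok);                // keep the literal term
            }
        }
        return out;
    }

    // "main" contains a literal and a call to "util"; expansion inlines it.
    static List<String> demo() {
        methods.clear(); expanded.clear();
        methods.put("main", List.of("hello", "call:util"));
        methods.put("util", List.of("world"));
        return expand("main");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [hello, world]
    }
}
```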


Algorithm 7 Greedy Load-Balance Algorithm.

function splitJob(path : String, n : int) : Stack[]
| Stack[] jobs = new Stack[n]
| for each file f in path
| | if(f is Java archive)
| | | minIdx = get index of jobs stack with lowest total size
| | | jobs[minIdx].add(f)
| | end if
| end for
| return jobs
end function
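The greedy single-pass placement can be sketched directly in Java. This is a hypothetical sketch (document names and sizes are invented): each document goes to the job with the lowest accumulated size so far, matching the linear-time behavior described above.

```java
import java.util.*;

public class GreedyLoadBalance {
    // Greedy load balancing: each document (name -> size) is appended to the
    // job with the lowest total size so far, in a single linear pass.
    public static List<List<String>> split(Map<String, Long> docs, int jobs) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < jobs; i++) partitions.add(new ArrayList<>());
        long[] load = new long[jobs];
        for (Map.Entry<String, Long> e : docs.entrySet()) {
            int min = 0;                       // job with the lowest load
            for (int i = 1; i < jobs; i++)
                if (load[i] < load[min]) min = i;
            partitions.get(min).add(e.getKey());
            load[min] += e.getValue();
        }
        return partitions;
    }

    // Four archives with made-up sizes split into two jobs:
    // a (100) -> job 0, b (80) -> job 1, c (60) -> job 1, d (50) -> job 0.
    static boolean demoBalanced() {
        Map<String, Long> docs = new LinkedHashMap<>();
        docs.put("a.jar", 100L); docs.put("b.jar", 80L);
        docs.put("c.jar", 60L);  docs.put("d.jar", 50L);
        List<List<String>> parts = split(docs, 2);
        return parts.get(0).equals(List.of("a.jar", "d.jar"))
            && parts.get(1).equals(List.of("b.jar", "c.jar"));
    }

    public static void main(String[] args) {
        System.out.println(demoBalanced()); // true
    }
}
```

The resulting totals (150 vs 140) show the "fairly good but not optimal" distribution the paper mentions: an exhaustive search could do no better here, but in general greedy placement only approximates the optimum.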

• Index error correction may be simplified, since it focuses only on the erroneous indexes and their respective documents.
• Many additional indexes can be embedded during the retrieval phase without modifying pre-existing indexes.
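The distribution list can be sketched as a small CSV round-trip. This is a hypothetical sketch of the idea (the paper states only that the file is comma-separated and lists each job's documents; the exact format and the `jobOf` lookup are assumptions for illustration).

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class DistributionList {
    // Writes one CSV line per job: the documents that job indexed.
    public static void write(Path file, List<List<String>> jobs) throws IOException {
        List<String> lines = new ArrayList<>();
        for (List<String> docs : jobs) lines.add(String.join(",", docs));
        Files.write(file, lines);
    }

    // Reads the list back, e.g. to find which job's index to rebuild when
    // one document's index turns out to be erroneous. Returns -1 if absent.
    public static int jobOf(Path file, String doc) throws IOException {
        List<String> lines = Files.readAllLines(file);
        for (int i = 0; i < lines.size(); i++)
            if (Arrays.asList(lines.get(i).split(",")).contains(doc)) return i;
        return -1;
    }

    // Round-trips a two-job list through a temporary file.
    static boolean demoOk() {
        try {
            Path f = Files.createTempFile("distlist", ".csv");
            write(f, List.of(List.of("a.jar", "b.jar"), List.of("c.jar")));
            boolean ok = jobOf(f, "c.jar") == 1 && jobOf(f, "a.jar") == 0
                      && jobOf(f, "missing.jar") == -1;
            Files.delete(f);
            return ok;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(demoOk()); // true
    }
}
```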



K. Multithread and Multiprocess Semantic Relatedness Calculator

Since semantic relatedness between terms is pre-computed and takes a long time, this module is also redesigned by involving multithreading and multiprocessing, which can be seen in Fig. 14. This module takes all indexes as its input, generates distinct terms, and calculates semantic relatedness on separate processes.


Separate processes are implemented by executing many standalone executable files. Processes are chosen instead of threads because: (a) many third-party semantic relatedness libraries involve synchronized and static methods, which may yield bottlenecks if implemented only in separate threads.


J. Multithread LRVSM

Index Reader N

Retriever B

on its retrieval mechanism. Multithread LRVSM index reader can be seenIndexinandFig. 5. Enlistment This module enlists all Relatedness indexes and relatedness file in order to load it on memory. Following VSM index reader design, each ... ... file is considered as one job and all index terms are weighted usingTF-IDF tf-idf scoring before stored in memory. Scoring Multithread LRVSM document retriever can be seen in And Relatedness Store Fig. 6. Relatedness Index data are stored in array and shared among all retrievers. Retriever This Builder mechanism is thread-safe since only read action permitted onreader shared data. Fig 12: Multithread LRVSM index

Each index is listed and read separately in different jobs. After all indexes are read, their respective terms are weighted using tf-idf scoring and stored in separate retrievers. Multithread VSM document retriever can be seen in Fig. 4. Since each retriever is responsible for an index, the number of jobs is equivalent to the number of indexes. Each retriever retrieves its relevant documents based on query input and merges it to global result.

Retriever 2

Retriever 1

Index And Relate

Fig. 4. Multithread VSM Document Retriever. ...

...

TF-IDF Scoring

Result Merger

...
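A minimal Java sketch of the greedy load-balance idea follows; the class and method names are illustrative, not taken from the paper's implementation. Each Java archive is simply assigned to the job whose accumulated file size is currently the smallest.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of Algorithm 7: every Java archive goes to the job
// with the currently smallest accumulated file size.
public class GreedyLoadBalancer {

    public static List<List<File>> splitJobs(File dir, int n) {
        List<List<File>> jobs = new ArrayList<>();
        long[] totalSize = new long[n];               // accumulated bytes per job
        for (int i = 0; i < n; i++) jobs.add(new ArrayList<>());

        File[] files = dir.listFiles();
        if (files == null) return jobs;
        for (File f : files) {
            if (f.getName().endsWith(".jar")) {       // only Java archives
                int minIdx = 0;                       // job with lowest total size
                for (int i = 1; i < n; i++)
                    if (totalSize[i] < totalSize[minIdx]) minIdx = i;
                jobs.get(minIdx).add(f);
                totalSize[minIdx] += f.length();
            }
        }
        return jobs;
    }
}
```

Because assignment is greedy on accumulated size rather than file count, a few very large archives do not end up in the same job.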

H. Multithread Retriever

In this research, two retrieval models are redesigned in a multithread manner. These retrieval models are the standard and the low-rank vector space model. Low-rank VSM (LRVSM) is an extended VSM that utilizes semantic relatedness in order to improve its recall. VSM is selected since this model is a benchmark of many other retrieval models whereas LRVSM is the most effective extended VSM found in previous research [2, 3]. Both retrievers take a directory path which consists of all indexed files as input and build in-memory retrieval models.

I. Multithread VSM

On the VSM retriever, multithreading is conducted at reading indexes and retrieving documents. Both tasks assume each index as one job and each job is handled by one retriever (N indexes = N jobs = N retrievers). The multithread VSM index reader can be seen in Fig. 3. Each index is listed and read separately in different jobs. After all indexes are read, their respective terms are weighted using tf-idf scoring and stored in separate retrievers. The multithread VSM document retriever can be seen in Fig. 4. Since each retriever is responsible for an index, the number of jobs is equivalent to the number of indexes. Each retriever retrieves its relevant documents based on the query input and merges them into the global result.

Fig. 3. Multithread VSM Index Reader.

Fig. 4. Multithread VSM Document Retriever.
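The one-job-per-index retrieval scheme above can be sketched with a thread pool; `Index` and `ScoredDoc` here are simplified stand-ins for the paper's structures, not its actual API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of per-index retrieval: every index partition is one job, partial
// results are merged into one globally ranked list.
public class MultithreadVsmRetriever {

    record ScoredDoc(String doc, double score) {}

    interface Index {                       // read-only shared index partition
        List<ScoredDoc> retrieve(String query);
    }

    public static List<ScoredDoc> retrieve(List<Index> indexes, String query)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(
                Math.max(1, Runtime.getRuntime().availableProcessors() - 1));
        try {
            List<Future<List<ScoredDoc>>> futures = new ArrayList<>();
            for (Index idx : indexes)                     // N indexes = N jobs
                futures.add(pool.submit(() -> idx.retrieve(query)));

            List<ScoredDoc> merged = new ArrayList<>();   // global result
            for (Future<List<ScoredDoc>> f : futures)
                merged.addAll(f.get());
            merged.sort(Comparator.comparingDouble(ScoredDoc::score).reversed());
            return merged;
        } finally {
            pool.shutdown();
        }
    }
}
```

Sharing the indexes across retrievers is safe here for the same reason the paper gives: the tasks only read the shared data.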
J. Multithread LRVSM

Multithread LRVSM works quite similarly to multithread VSM except that LRVSM involves pre-calculated semantic relatedness. Since semantic relatedness between terms is stored in a binary file, the index reader should load that file and the document retriever should involve it in its retrieval mechanism. The multithread LRVSM index reader can be seen in Fig. 5. This module enlists all indexes and the relatedness file in order to load them into memory. Following the VSM index reader design, each file is considered as one job and all index terms are weighted using tf-idf scoring before being stored in memory. The multithread LRVSM document retriever can be seen in Fig. 6. Relatedness data are stored in an array and shared among all retrievers. This mechanism is thread-safe since only read actions are permitted on the shared data.

Fig. 5. Multithread LRVSM Index Reader.

K. Multithread and Multiprocess Semantic Relatedness Calculator

Since semantic relatedness between terms is pre-computed and takes a long time, this module is also redesigned to involve multithreading and multiprocessing, which can be seen in Fig. 7. This module takes all indexes as its input, generates distinct terms, and calculates semantic relatedness on separate processes. Separate processes are implemented by executing many standalone executable files. Processes are chosen instead of threads because:
a. Many third-party semantic relatedness libraries involve synchronized and static methods which may yield bottlenecks if implemented only in separate threads (e.g., Ws4J: WordNet Similarity for Java [18]). A multithread design with bottlenecks may yield longer processing time than the naive (single-thread) one since multithreading needs additional time to split and merge jobs.
b. Standalone executable files may be built in many programming languages other than Java as long as they follow the input and output template.
c. The semantic relatedness calculator may be freely designed by its developer. An internal saving mechanism may also be added as an additional feature since calculating semantic relatedness takes a long time (calculating semantic relatedness of 40,978 distinct terms in previous research using a single thread took about five days).
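Launching such standalone executables as separate processes might be sketched as follows. The jar name `relatedness.jar` is a placeholder, and the six-argument layout (source file, lower and upper bounds of two jobs, target file) mirrors the input template described for this module.

```java
import java.io.File;
import java.util.List;

// Sketch of running a standalone relatedness worker as a separate process.
// "relatedness.jar" is a hypothetical placeholder name.
public class RelatednessProcess {

    // Build the command line for one worker process (pure, testable helper).
    public static List<String> buildCommand(File source, int lo1, int hi1,
                                            int lo2, int hi2, File target) {
        return List.of("java", "-jar", "relatedness.jar",
                source.getPath(),
                String.valueOf(lo1), String.valueOf(hi1),
                String.valueOf(lo2), String.valueOf(hi2),
                target.getPath());
    }

    // Launch the worker process and wait for it to finish.
    public static int run(File source, int lo1, int hi1,
                          int lo2, int hi2, File target) throws Exception {
        Process p = new ProcessBuilder(
                buildCommand(source, lo1, hi1, lo2, hi2, target))
                .inheritIO()
                .start();
        return p.waitFor();
    }
}
```

Because each worker is a full JVM (or any other runtime), synchronized or static bottlenecks inside third-party libraries are isolated per process instead of being shared across threads.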

Cite this article as: O. Karnalim, “Improving Scalability of Java Archive Search Engine Through Recursion Conversion and Multithreading”, CommIT (Communication & Information Technology) Journal 10(1), 15–26, 2016.

The standalone executable file built for this module must follow an input and output template. It takes six input arguments: the source file (a CSV file which consists of all distinct terms), the lower and upper bounds of the first and second jobs, and the target file. The program should take all distinct terms from the source file, calculate semantic relatedness between terms in the given jobs, and store the result in the target file as a hash map in binary format. The target hash map uses double as its value and string as its key. The key represents a concatenated term pair separated by a vertical bar ("|") whereas its value represents the term pair relatedness. Since distinct terms are stored in an array and each term i is only paired with the remaining terms whose index is larger than i, semantic relatedness calculation at the end of the array should be faster than at the beginning. To distribute term calculation tasks evenly, each process (executable file) is given two jobs: the first is taken from the beginning of the array and the second from the end of the array. Distinct terms are split into 2×N jobs and each process i is assigned to job i and job N−i, where N represents the number of processes.
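The hash-map output format described above can be sketched as follows. Plain Java object serialization is assumed here for the binary encoding; the paper specifies the key/value scheme ("term1|term2" → relatedness) but not the exact byte layout.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;
import java.util.Map;

// Sketch of the relatedness file: a hash map keyed by "term1|term2" with a
// double relatedness value, written in a binary form (object serialization
// is an assumption, not the paper's specified encoding).
public class RelatednessStore {

    public static void save(Map<String, Double> relatedness, File target)
            throws IOException {
        try (ObjectOutputStream out =
                     new ObjectOutputStream(new FileOutputStream(target))) {
            out.writeObject(new HashMap<>(relatedness));
        }
    }

    @SuppressWarnings("unchecked")
    public static Map<String, Double> load(File source)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                     new ObjectInputStream(new FileInputStream(source))) {
            return (Map<String, Double>) in.readObject();
        }
    }

    public static double lookup(Map<String, Double> rel, String t1, String t2) {
        // keys are concatenated term pairs separated by a vertical bar
        return rel.getOrDefault(t1 + "|" + t2, 0.0);
    }
}
```

Note that because each term i is only paired with terms of larger index, a lookup must use the same pair ordering that was used at write time.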


Fig. 6. Multithread LRVSM Document Retriever.

Fig. 7. Multithread Semantic Relatedness Calculator.


Although its processing time is less efficient, the iterative approach is more scalable than the recursive one. Stack overflow errors rarely appear in the iterative approach since the number of method calls is greatly reduced. The RRR scheme on the default dataset generates a stack overflow error when run on JRE 1.8 whereas it runs well on JRE 1.7. This is caused by several updates in JRE 1.8 which detect some processes in this program as endless recursion (although they are not). The iterative form is also a safer approach than the recursive form since the number of recursive calls conducted on a dataset is uncertain. The III scheme takes longer indexing time than RRR since all modules are implemented iteratively.

Recursion conversion correctness is measured by comparing the results of the iterative and recursive approaches. Correctness is proved when both implementations yield similar results on various datasets. In this evaluation, 11 datasets are tested: the first is the default dataset from Ref. [1] and the rest are sub datasets split from it. The sub datasets are produced by splitting the default dataset into 10 parts evenly. Since these modules are part of the indexer, result comparison is conducted based on the indexes generated by both implementations. As a result, both implementations yield similar results on all schemes.
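The conversion pattern evaluated here (before/after actions plus a caller reference on an explicit stack) can be illustrated with a small self-contained example; `Node` and the subtree-size computation are illustrative stand-ins, not the paper's actual modules.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch of the recursion-to-iteration pattern: each stack frame keeps a
// caller reference so the "after" action can hand its result upward
// without a method call.
public class RecursionConversion {

    static class Node {
        final List<Node> children = new ArrayList<>();
        int subtreeSize;                      // filled by the "after" action
    }

    // Original recursive form.
    static int sizeRecursive(Node n) {
        int size = 1;                         // before-action of current node
        for (Node c : n.children) size += sizeRecursive(c);
        n.subtreeSize = size;                 // after-action of current node
        return size;
    }

    // Iterative form with an explicit stack and caller references.
    static class Frame {
        final Node node; final Frame caller; int acc = 1; int nextChild = 0;
        Frame(Node node, Frame caller) { this.node = node; this.caller = caller; }
    }

    static int sizeIterative(Node root) {
        Deque<Frame> stack = new ArrayDeque<>();
        stack.push(new Frame(root, null));
        int result = 0;
        while (!stack.isEmpty()) {
            Frame cur = stack.peek();
            if (cur.nextChild < cur.node.children.size()) {
                // before-action of a child: push a frame pointing back to us
                stack.push(new Frame(cur.node.children.get(cur.nextChild++), cur));
            } else {
                // after-action: record the result and hand it to the caller
                stack.pop();
                cur.node.subtreeSize = cur.acc;
                if (cur.caller != null) cur.caller.acc += cur.acc;
                else result = cur.acc;
            }
        }
        return result;
    }
}
```

The iterative form trades method-call stack frames for heap-allocated `Frame` objects, which is exactly why it avoids stack overflow at the cost of some extra processing time.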

III. RESULTS AND DISCUSSION

Efficiency and effectiveness are two major measurements which are commonly used to determine the feasibility of a research work. Since this research focuses on module conversion in order to improve scalability, efficiency is measured based on processing time and scalable design whereas effectiveness is measured based on the correctness of the converted modules. Evaluation conducted in this research uses the default dataset from Ref. [1] as a benchmark. For clarity, in each table the blue mark represents the best result whereas the red mark represents the worst result for each factor.

A. Evaluating Recursion Conversion

To evaluate recursion conversion, several schemes are defined, which are shown in Table II. Each type consists of three symbols which represent the module implementation (I = Iteration and R = Recursion). The symbol order is equivalent to the module order shown in the Table II columns. The RRR and III schemes are the benchmarks of this evaluation since RRR represents the recursive approach in all modules and III represents the iterative approach in all modules. Types IRR, RIR, and RRI are used to measure the conversion impact of a certain module. The indexing time of each scheme on the default dataset can be seen in Table III. Since processing time is operating system dependent, each scheme is measured five times using the same dataset and its average is assigned as its result to reduce dependency bias. As seen in Table III, each scheme that involves an iterative implementation takes more processing time than the RRR scheme. Following recursive logic in an iterative approach may yield longer processing time since it requires many additional objects in its process (e.g., IterTuple in the iterative Tarjan's SCC algorithm).

B. Evaluating Multithread Design

The multithread indexer, VSM, LRVSM, and semantic relatedness calculator are measured in terms of efficiency and effectiveness. These evaluations are conducted on Windows 7 Ultimate 32-bit with 4 GB RAM and an Intel(R) Core(TM) i7-3770 CPU @ 3.40 GHz (up to 3.90 GHz) as its processor. Each evaluated module except the semantic relatedness calculator is tested with various numbers of jobs: 1, N, N × 2, N × 5, N × 10, N × 20, N × 40, and 552. N is the number of physical cores minus 1, which is 3 in this evaluation environment, whereas 552 is the number of Java archives indexed in our dataset. To reduce dependency bias, each evaluation scheme is measured five times using the same dataset and its average is assigned as its result.

TABLE II
RECURSION CONVERSION EVALUATION SCHEMES

Type   Loop Encapsulation   Recursive Method Elimination   Method Expander
RRR    Rec                  Rec                            Rec
IRR    Iter                 Rec                            Rec
RIR    Rec                  Iter                           Rec
RRI    Rec                  Rec                            Iter
III    Iter                 Iter                           Iter

TABLE III
INDEXING TIME OF RECURSION CONVERSION SCHEMES

Type   Indexing Time (s)
RRR    417.297
IRR    423.964
RIR    436.958
RRI    429.307
III    433.995


C. Evaluating Multithread Indexer

The quantitative efficiency measurement of the multithread indexer can be seen in Table IV. The best indexing time is gained at N = 3 since all physical cores are utilized (4 cores: 1 core is used for the main thread and the rest are used for side threads). N = 1 takes the longest indexing time since it only utilizes one side thread (which is quite similar to a sequential process). Index size increases proportionally with the number of jobs since each job generates an index file and each index has its own header file. Since the contents of the generated indexes for each scheme are equivalent to the contents of the index generated in previous research, the multithread design of this module is proved correct. The time reduction in the multithread indexer also proves that multithreading may improve scalability since it enables the system to process larger datasets.

TABLE IV
TIME AND MEMORY EFFICIENCY OF MULTITHREAD INDEXING

N     Indexing Time (s)   Index Size (MB)
1     326.851             4,894
3     153.913             5,213
6     164.822             5,559
15    169.354             5,860
30    172.387             6,525
60    174.642             6,973
120   180.711             7,623
552   240.053             9,142

D. Evaluating Multithread VSM and LRVSM

Multithread VSM and LRVSM are measured with similar factors since both of them are retriever modules. The measured retriever efficiency factors are index load time and average query latency whereas effectiveness is only measured based on correctness. The time efficiency measurement of multithread VSM and LRVSM can be seen in Tables V and VI. Since the LRVSM evaluation is assumed to take the relatedness file as a single file, both results yield similar conclusions, which are:
• Index load time runs fastest at N = 3 since all physical cores are utilized. The number of jobs is proportional to index load time since job transfer between threads takes time. N = 1 takes longer than N = 3 since it only utilizes one thread instead of three.
• Average query latency is affected by the degree of index partitioning. Retrieving a term from one big chunk of index takes longer than retrieving it from many small chunks since a large index may yield a more complicated structure than a small one. Small indexes are also easier to cache in memory. However, too many small indexes may yield longer processing time since the system needs to iterate through all of them. As seen in Table VI, average query latency at N = 552 takes longer than at N = 120 although it has more indexes with smaller size.

Conversion correctness is proven by the fact that both retrievers yield results similar to their respective sequential forms for all queries from the default dataset (1860 queries). Multithread VSM and LRVSM are also more scalable since shorter processing time may enable the system to handle larger data.

TABLE V
TIME EFFICIENCY OF MULTITHREAD VSM

N     Index Load Time (s)   Ave. Query Latency (s)
1     1.749                 0.462
3     0.831                 0.099
6     0.926                 0.045
15    0.961                 0.024
30    1.118                 0.011
60    1.310                 0.007
120   1.572                 0.007
552   2.558                 0.024

TABLE VI
TIME EFFICIENCY OF MULTITHREAD LRVSM

N     Index Load Time (s)   Ave. Query Latency (s)
1     56.041                0.091
3     55.563                0.047
6     55.759                0.046
15    56.603                0.043
30    57.128                0.042
60    57.287                0.041
120   57.884                0.039
552   59.208                0.152


E. Evaluating Multithread Semantic Relatedness Calculator

Since the semantic relatedness calculator requires standalone executable files to complete its task, a Java-based standalone executable file (JAR) is built for this research. This program follows the algorithm template given by the multithread semantic relatedness calculator. Since calculating semantic relatedness is time consuming, the multithread design of this module is only tested on three Java archives instead of the entire default dataset. The characteristics of these Java archives can be seen in Table VII. The processing times of the multithread and single-thread designs of the semantic relatedness calculator can be seen in Table VIII. Each scheme is evaluated five times and its average is considered as that scheme's result. Evaluation is conducted by splitting distinct terms into three jobs and utilizing Lin's semantic relatedness algorithm described in Ref. [19]. As seen in Table VIII, the multithread design is more scalable since it takes less time than the single-thread design by utilizing all physical cores. This design also yields the same result as the single-thread design, which proves its correctness.

TABLE VII
SEMANTIC RELATEDNESS EVALUATION DATASET

Java Archive          Distinct Terms   Related Term Pairs
javassist-2.5.1.jar   646              22,770
lucene-1.2.jar        553              15,434
jxl-2.4.2.jar         1,102            63,415

TABLE VIII
PROCESSING TIME OF SEMANTIC RELATEDNESS CALCULATOR (s)

Java Archive          Single Thread Design   Multithread Design
javassist-2.5.1.jar   2,237.249              1,355.221
lucene-1.2.jar        1,457.311              828.732
jxl-2.4.2.jar         5,249.131              3,372.989

IV. CONCLUSIONS

Based on the evaluation in this research, the scalability of a Java archive search engine can be improved through recursion conversion and multithreading. Recursion conversion improves scalability by avoiding stack overflow errors whereas multithreading improves scalability by reducing execution time (which enables the system to process larger datasets).

Recursion conversion conducted in this research involves three main recursive modules: loop encapsulation, recursive method elimination, and method expansion. Although these modules are inconvenient to redesign iteratively, they can still be converted by following their recursive pattern with the help of a caller reference. The recursive pattern in iterative implementation takes more time than its recursive form since these modules are logically recursive and require many additional objects during execution. The correctness of the described recursion conversion is also proved by the fact that both implementations yield similar results.

Multithread design has been successfully implemented in the Java archive search engine, involving the indexer, VSM retriever, LRVSM retriever, and semantic relatedness calculator. This mechanism cuts off the respective processing time since it utilizes all physical cores. Index partitioning may yield a faster retrieval model although too many small indexes may also yield longer processing time. All multithread modules are also proved correct by black-box testing their results.

REFERENCES
[1] O. Karnalim and R. Mandala, "Java archives search engine using byte code as information source," in Data and Software Engineering (ICODSE), 2014 International Conference on. IEEE, 2014, pp. 1–6.
[2] O. Karnalim, "Extended vector space model with semantic relatedness on java archive search engine," Jurnal Teknik Informatika dan Sistem Informasi, vol. 1, no. 2, 2015.
[3] W. B. Croft, D. Metzler, and T. Strohman, Search Engines: Information Retrieval in Practice. Addison-Wesley, 2010.
[4] D. Grune, K. Van Reeuwijk, H. E. Bal, C. J. Jacobs, and K. Langendoen, Modern Compiler Design. Springer Science & Business Media, 2012.
[5] T. Lindholm, F. Yellin, G. Bracha, and A. Buckley, The Java Virtual Machine Specification. Pearson Education, 2014.
[6] R. Tarjan, "Depth-first search and linear graph algorithms," SIAM Journal on Computing, vol. 1, no. 2, pp. 146–160, 1972.
[7] M. Carrano and T. Henry, Data Abstraction & Problem Solving with C++: Walls and Mirrors, 6th ed. Prentice Hall, 2012.
[8] A. S. Tanenbaum, Modern Operating Systems, 4th ed. Prentice Hall, 2014.


[9] W. Liu and T. Wang, "Index-based online text classification for sms spam filtering," Journal of Computers, vol. 5, no. 6, pp. 844–851, 2010.
[10] W. Premchaiswadi and A. Tungkatsathan, "Online content-based image retrieval system using joint querying and relevance feedback scheme," WSEAS Transactions on Computers, vol. 9, no. 5, pp. 465–474, 2010.
[11] C. Bonacic, C. Garcia, M. Marin, M. Prieto, F. Tirado, and C. Vicente, "Improving search engines performance on multithreading processors," in High Performance Computing for Computational Science - VECPAR 2008. Springer, 2008, pp. 201–213.
[12] C. Bonacic and M. Marin, "Simulation study of multi-threading in web search engine processors," in String Processing and Information Retrieval. Springer, 2013, pp. 37–48.
[13] V. Skylarov, I. Skilarova, and B. Pimentel, "FPGA-based implementation and comparison of recursive and iterative algorithms," in Field Programmable Logic and Applications, 2005. International Conference on. IEEE, 2005, pp. 235–240.
[14] J. Miecznikowski and L. Hendren, "Decompiling java using staged encapsulation," in Reverse Engineering, 2001. Proceedings. Eighth Working Conference on. IEEE, 2001, pp. 368–374.
[15] M. Naftalin and P. Wadler, Java Generics and Collections. O'Reilly Media, Inc., 2006.
[16] C. Kustanto and I. Liem, "Automatic source code plagiarism detection," in Software Engineering, Artificial Intelligences, Networking and Parallel/Distributed Computing, 2009. SNPD '09. 10th ACIS International Conference on. IEEE, 2009, pp. 481–486.
[17] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
[18] H. Shima. (2015) WS4J: WordNet Similarity for Java. Accessed on November 24, 2015. [Online]. Available: https://code.google.com/p/ws4j/
[19] D. Lin, "An information-theoretic definition of similarity," in ICML, vol. 98, 1998, pp. 296–304.
