Generating Test Cases of Desired Hardness

Xi Li, Andrew Lim and Zhou Xu
Department of IELM, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
{xili,iealim,xuzhou}@ust.hk

Test case generation plays an important role in solving optimization problems. However, the current approaches to test case generation in the literature are inefficient at tuning the parameters of the generators to produce desired instances. In this paper, we propose a new framework for test case generation based on the Design and Analysis of Engineering Experiment method, which can analyze the effects of parameters by conducting a small number of experiments. Under this framework, test cases of desired hardness can be generated efficiently and accurately. We examine the new framework by studying test case generation for the Multiple Unit Combinatorial Auction problem.

Key words: optimization; benchmark generation; parameter tuning; design of experiment; combinatorial auctions

1. Introduction

Optimization is one of the most studied topics in research. Many optimization problems are NP-hard, and it is widely believed that for NP-hard problems there is no efficient algorithm that can solve all instances, unless P = NP. Therefore, the evaluation of such algorithms is usually based on experimental results. In the experiments, several instances are generated and solved by the algorithms, and the evaluation is based on the quality of the solutions, the solving time, the working memory consumption, etc. This approach is widely applied to all kinds of optimization problems such as the Bin Packing Problem (BPP) [Burke et al., 2004, Hopper and Turton, 1999, Valenzuela and Wang, 2001], the Vehicle Routing Problem (VRP) [Cordeau et al., 1997, Lim and Wang, 2005b, Augerat et al., 1998], the Transshipment Problem (TP) [Li et al., 2004], the Satisfiability Problem (SAT) [Selman et al., 1995], the Machine Scheduling Problem (MSP) [Sivrikaya-Serifoglu and Ulusoy, 1999, Lim et al., accepted], the Graph Coloring Problem (GCP) [Lim and Wang, 2005a, Yanez and Ramirez, 2003], and the Winner Determination Problem (WDP) [Sandholm et al., 2005]. In the above examples, heuristics or branch-and-bound algorithms were proposed to solve the corresponding problems, and were then compared with other algorithms based on the experimental results from running the test cases.

A three-step flow for studying such problems can be summarized as follows:

• Step 1. Algorithm design and implementation: When the problem is identified, an algorithm or method is proposed to solve it, whether exactly, non-exactly, or approximately. For example, to solve the WDP, exact algorithms [Sandholm, 2002, Sandholm et al., 2005, Fujishima et al., 1999] and heuristics [Hoos and Boutilier, 2000, Guo et al., 2005, Sakurai et al., 2000, Harvey and Ginsberg, 1995, Lim and Tang, 2005] were proposed.

• Step 2. Test case generation: Test data is generated to examine the proposed algorithm. For classical problems, there are usually many test data available, such as real data, standard benchmarks, or instances generated previously by other researchers; for newly identified problems, researchers have to generate test data by introducing new generators, or collect real data themselves. For example, for the WDP, the legacy generator [Sandholm, 2002], the CATS generator [Leyton-Brown et al., 2000a], and the PBP generator [Tang, 2005] were introduced to generate test instances.

• Step 3. Performance evaluation: To evaluate the performance of the proposed algorithm, the most popular method is to run the same data obtained in Step 2 with other algorithms or commercial solvers such as CPLEX, then compare the experimental results and draw conclusions about the performance of the new algorithm. For example, in solving the WDP, algorithms are evaluated by comparison with previous heuristics or with CPLEX [Hoos and Boutilier, 2000, Guo et al., 2005, Sakurai et al., 2000, Harvey and Ginsberg, 1995, Lim and Tang, 2005, Sandholm et al., 2005, Fujishima et al., 1999].

As shown in Figure 1, the three steps involve three elements: the proposed solver, the test cases, and the previous solvers to be compared with. The test cases play an important role in evaluating the performance of the proposed solver. To make the experimental results comparable, benchmarks are set up for each problem. The OR-Library is a collection of benchmarks for more than fifty classical optimization problems such as the Bin Packing Problem, the Location Problem, etc. Every optimization problem can have many different packages of test cases.

Figure 1: Elements in optimization problem

For example, there are at least nine packages of benchmarks available for the Capacitated VRP [VRP-Web]. None of them has been studied for its hardness. However, knowing the hardness is very important: if the test cases are so easy that every algorithm can solve them, the computational results cannot be used to compare the performance of the algorithms in terms of solution quality. For example, for the WDP, the experimental results of running the test cases of the scheduling distribution [Leyton-Brown et al., 2000a] are seldom used to compare the performance of algorithms, because almost all algorithms can solve them optimally [Tang, 2005]. The hardness of test instances is often measured by the following three criteria:

• Time criteria: the longer it takes to solve a test case optimally, the harder the case is. This is usually used in evaluating easy cases [Leyton-Brown et al., 2000a].

• Optimality criteria: the cases that can be solved optimally in an acceptable time are called easy cases, and the rest are hard cases. This differentiates the hard cases from the easy cases [Selman et al., 1996].

• Gap criteria: the gap between the current solution and the optimal solution can be used to evaluate the quality of the current solution, i.e., the higher the quality of the solution, the easier the case is. This idea is widely used in approximation algorithms [Vazirani, 2001].

To have a quantitative measurement of hardness, the time criteria and the gap criteria are commonly used. However, as mentioned before, for NP-hard problems it is very difficult or even impossible to get the optimal solution in reasonable time. Therefore, to use the gap criteria, some values approximating the optimal solutions are needed. For example, in a minimization problem, the best known lower bound can be used to approximate the optimal solution, so the gap between the current best solution (an upper bound) and the lower bound can be used to evaluate the quality of the solution. Based on the literature, different optimization problems work with different criteria to evaluate the quality of the solutions of the algorithms. A summary of the applicability of the hardness criteria for different optimization problems is shown in Table 1. The time criteria and optimality criteria can be used for any problem; however, in some problems, such as the VRP, it is very difficult to get the optimal solution in reasonable time [Augerat et al., 1998], so using the time or optimality criteria is not practical. The same situation happens in solving the TP [Li et al., 2004], the MSP [Lim et al., 2005], and the GCP [Lim and Wang, 2005a]. The gap criteria is applicable for problems that have feasible solutions of different qualities; for problems such as the Satisfiability Problem, the gap criteria is not applicable.

Table 1: Applicability of the hardness criteria

Problem               Time  Optimality  Gap
Bin Packing           ✓     ✓           ✓
Vehicle Routing                         ✓
Transshipment               ✓           ✓
Machine Scheduling          ✓           ✓
Graph Coloring              ✓           ✓
Winner Determination  ✓     ✓           ✓
Satisfiability        ✓     ✓           ✗

2. Review of the Drawbacks in Test Case Generation

As described in Section 1, the framework shown in Figure 1 has become a well-accepted evaluation methodology in optimization research. On the other hand, it is widely believed that particular instances of NP-hard problems can be solved efficiently [Monasson et al., 1999]. Therefore, the conclusions on performance drawn from this framework are not accurate if the hardness of the test cases is not well studied; e.g., one cannot conclude that two algorithms have equal performance just because both solve easy cases optimally. Consequently, overlooking the elements that affect the hardness can also lead to misleading conclusions.

It is well known that test case generators often contain several parameters which can affect the hardness of the instances generated. However, most researchers have not adopted any systematic method for tuning those parameters. For the VRP, it is well known that the parameters of the generator can affect the characteristics and hardness of the instances generated [Cordeau et al., 1997], and the same holds for the BPP, TP, MSP, and GCP. In the above problems, several combinations of parameters were set by best-guess, i.e., guessing the values of the parameters based on experience [Cordeau et al., 1997, Burke et al., 2004, Li et al., 2004, Lim et al., accepted, Lim and Wang, 2005a]. However, best-guess cannot guarantee the hardness of the test cases generated. Only a few researchers have tried to study how the parameters affect the hardness of test cases. In the Satisfiability Problem, the phase transition was discovered based on many experiments and much experience; knowing the phase transition, parameter settings for generating hard cases were introduced [Selman et al., 1996, Achlioptas et al., 2000]. For the WDP, Leyton-Brown et al. [2005] studied the hardness of test cases. However, they tuned the parameters manually without any systematic method, so their experiment was very time consuming: it took three years of CPU time but covered only three combinations of the item number and bid number, without tuning the distribution-related parameters. Table 2 gives a brief survey of the current situation of test case generation in optimization problems.

Table 2: Test case generation in several famous problems

Problem  Parameter effects  Tuning method  CPU time
BPP      Yes                No             N/A
VRP      Yes                No             N/A
TP       Yes                No             N/A
MSP      Yes                No             N/A
GCP      Yes                No             N/A
WDP      Yes                Manually       3 years

Figure 2 illustrates the currently common method of test case generation. First, the configuration of the parameters is determined; then instances are generated under this setting. Only after the instances are solved by the solver is the hardness of the instances known. If the hardness is not desirable, i.e., too easy or too hard, the parameters have to be adjusted manually. There are two main drawbacks in this process of data generation.

Figure 2: The legacy framework of test case generation (Configuration → Test Case Generator → Test Case → Optimization Solver → Output Hardness)

• Low efficiency: The parameters may have to be adjusted again and again to avoid undesirable instances. However, when there are many parameters, e.g., five or more, it is impossible to set the parameters manually while properly considering the effects of all of them. Furthermore, the parameters may interact with each other, so a satisfactory configuration of the parameters is almost intractable to find by hand. As shown above, for the WDP [Leyton-Brown et al., 2000a], the manual adjustment took three years of CPU time but covered only three combinations of the item number and bid number.

• Bias towards one's own algorithms: As different parameters may have different effects on different algorithms, manually adjusted parameters invite a situation in which anyone could declare that his or her algorithm is better than the others, because one could set the parameters to values such that the generated instances are solved better by one algorithm than by the others. For example, for the WDP, the binomial distribution has three parameters: the item number, the bid number, and the density. In Tang [2005], two settings were tested, binomial 2 (item number 150, bid number 1500, and density 0.205) and binomial 4 (item number 30, bid number 3000, and density 0.228). The experiment running the binomial 2 test cases shows that SAGII [Guo et al., 2005] is faster than LAHA [Tang, 2005] with the same solution quality; however, the experiment running the binomial 4 test cases shows the opposite result, with SAGII slower than LAHA and obtaining worse solution quality. Since it seldom happens that one algorithm dominates another, there is always the chance that some instances are solved better by one particular algorithm.

This study aims to improve test case generation, so that test cases can be generated with different hardness based on the effects of the different parameters. To overcome the drawbacks, an efficient and systematic way of analyzing the effects of the parameters is required. The effect of parameters is not an issue that exists only in studying optimization problems; it has been studied in the area of quality control for many years. The Design and Analysis of Engineering Experiment (DOE) methodology is a mature tool for studying the effects of different factors in a system. In this study, we apply DOE method to discover the effects of the parameters.

The rest of the paper is organized as follows. In Section 3, we introduce our new framework of test case generation; to deploy the framework, some basic tools of DOE method are introduced. Section 4 demonstrates how to use DOE method on a particular problem, the Multiple Unit Combinatorial Auction (MUCA) problem: we apply our new framework to this problem to generate test cases satisfying our goal, and define the DOE models. In Section 5, we illustrate the effects of the parameters by utilizing DOE method, and present the desired test cases as well. The paper is summarized in Section 6. All the experiments are conducted on an HP ProLiant DL360 server with dual 3.0 GHz Xeon CPUs and 2 GB of RAM.

3. DOE Mechanism

3.1. The New Framework of Test Case Generation

To achieve our goal of generating test cases of different hardness based on the parameters, we propose a new framework, as shown in Figure 3. The framework contains two parts: the actual running flow and the hardness teller. The actual running flow is the same as in the old framework of Figure 2: given the configuration of the parameters, the test case is generated by the generator, then run by the solver, which outputs the hardness of the test case. While the actual flow is running, a hardness teller is built by monitoring the configurations and the output hardness. The configuration corresponding to a desired hardness is then provided directly by the hardness teller, without any actual running. This can save much effort in adjusting the parameters.

Figure 3: The proposed complete framework for test case generation (the running flow Configuration → Test Case Generator → Test Case → Optimization Solver → Output Hardness, plus a Hardness Teller built by DOE mechanisms, which maps an expected hardness to a configuration and predicts the resulting hardness)


To conduct test case generation in our newly proposed framework, some basic tools of DOE method are introduced in Section 3.2.

3.2. Introduction to DOE Method

DOE method is widely used in production systems in agriculture and industry. In such a system, there are inputs, outputs, and controllable and uncontrollable factors that can affect the output. The purpose of the activities is to obtain the desired output by configuring the controllable factors, as shown in Figure 4. DOE method can tell which factors are significant in the system, and how to control the output by configuring the significant factors. We adopt DOE method in our study of hardness because test case generation can be regarded as a production system: the hardness is the desired output, and the parameters are the factors that can affect the hardness. Our purpose is to control the hardness of the generated instances by configuring the parameters.

Figure 4: General model of a production system (inputs and controllable factors x1, ..., xp enter the system; uncontrollable factors z1, ..., zq disturb it; the output is y)

In Figure 4, the output y is determined by both controllable and uncontrollable factors:

y = f(X, Z)

where X = (x1, x2, ..., xp) is the vector of controllable factors, and Z = (z1, z2, ..., zq) is the vector of uncontrollable factors. In DOE, the output y is called the response. DOE method tells how to control or predict the response y approximately by configuring the controllable factors X appropriately:

y ≈ f(X)    (1)

In the rest of this section, we introduce some basic tools of DOE method, including the analysis of variance (ANOVA), the two-sample T-test, factorial design, and response surface methods (RSM).

3.2.1. ANOVA and Two-Sample T-test

To study whether the effect of a factor is significant, the commonly used technique is to set the factor to different levels, then use ANOVA or a T-test to check whether the response values differ under the different levels of the factor. If different levels of the factor lead to different means of the response values, the factor is regarded as having a significant effect.

ANOVA

Given r sets of data, each set is called a treatment. Each treatment contains n samples, and the mean of treatment i is µi (i = 1, 2, ..., r). To test whether the r treatments of samples have equal mean values, Fisher developed a statistical tool named ANOVA. ANOVA tells which of the following two hypotheses is true:

H0: µ1 = µ2 = ... = µr
H1: µi ≠ µj for at least one pair (i, j)

where H0 indicates that the treatments have the same mean, so that the effect of the corresponding factor is insignificant, and H1 indicates that the factor has a significant effect. The idea of ANOVA is to compare the variance within each treatment with the variance between the treatments. Let yij be the jth sample in treatment i, ȳ.. be the average of the samples over all the treatments, and ȳi. be the average of treatment i. Assume that the yij follow a normal distribution with mean µ and variance σ². The sum of squares due to treatments (i.e., between treatments) is SS_Treatments, and the sum of squares due to error (i.e., within treatments) is SS_E, as shown in the following equations:

SS_Treatments = n Σ_{i=1}^{r} (ȳi. − ȳ..)²

SS_E = Σ_{i=1}^{r} Σ_{j=1}^{n} (yij − ȳi.)²

The variance of all the data, σ², is estimated by the mean square MS_E = SS_E/(N − r), where N = rn is the total number of samples, and the variance between the treatments is estimated by MS_Treatments = SS_Treatments/(r − 1). The quantity (N − r) is the number of degrees of freedom of SS_E, and the quantity (r − 1) is the number of degrees of freedom of SS_Treatments. The number of degrees of freedom of a sum of squares equals the number of independent elements in that sum of squares [Montgomery, 2001]. For example, the right-hand side of the equation SS_Treatments = n Σ_{i=1}^{r} (ȳi. − ȳ..)² consists of the sum of squares of the r elements ȳ1. − ȳ.., ȳ2. − ȳ.., ..., ȳr. − ȳ... These elements are not independent because they sum to zero; actually r − 1 of them are independent, implying that SS_Treatments has r − 1 degrees of freedom.

If the hypothesis H0 is true, i.e., all the treatment means are equal, the ratio

F0 = MS_Treatments / MS_E = [SS_Treatments/(r − 1)] / [SS_E/(N − r)]

is distributed as F with degrees of freedom r − 1 and N − r. Therefore, H0 is rejected if

F0 > F_{α, r−1, N−r}    (2)

where α is the probability of the error that the factor is regarded as significant but is actually insignificant: α = P(reject H0 | H0 is true). Therefore, the smaller the α, the lower the probability of the error of taking an insignificant factor as significant. The commonly used value of α is 5%, or 0.05; this is the value we use in this paper. The critical values of the F distribution can be looked up in an F-table or calculated by statistical software such as MINITAB. F0 is called the F-value, and the P-value denotes the corresponding probability of the error of rejecting H0 when H0 is actually true. In other words, (2) holds if and only if the P-value is less than α. Finally, the conclusion has to be justified by the normal probability plot of residuals, to validate whether the normality assumption is satisfied [Chambers et al., 1983], i.e., that the yij are independently and identically normally distributed. The residual of observation j in treatment i is defined as eij = yij − ȳi.. If the points on the normal probability plot form a linear pattern, i.e., the points lie almost on a straight line, the normal distribution is a good approximation of the samples. For example, in Figure 5(a), the residuals satisfy the normal assumption: the points lie almost on a straight line. In Figure 5(b), the points do not fit a line well, so the normal assumption is not a good approximation of the samples, and models based on normality assumptions may not be suitable for analyzing them. From the plots of residuals versus fitted values, the stability of the variance of the samples can also be checked.

When there is no obvious pattern in the plot of residuals versus fitted values, it can be accepted that the variance of the samples is stable, as shown in Figure 5(c). If there is a pattern, e.g., the variance grows as the fitted value increases, as in Figure 5(d), the normality assumption is not satisfied because the variance is not stable. To fit the model well in such cases, the Box-Cox transformation method [Box and Cox, 1964] is commonly used. In our study, we use residual plots to verify the normality assumption of the models utilized in Section 5.
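As an aside, a minimal sketch of this Box-Cox step in Python (scipy assumed available; the skewed sample data are hypothetical):

```python
# Box-Cox transformation sketch: scipy picks the power lambda that
# makes the positive samples y most nearly normal, which often also
# stabilizes their variance.
import numpy as np
from scipy import stats

y = np.random.default_rng(0).lognormal(size=50)   # hypothetical skewed data
y_transformed, lam = stats.boxcox(y)
print(f"chosen lambda = {lam:.3f}")
```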

Figure 5: Residual plots: (a), (b) normal probability plots of residuals; (c), (d) residuals versus fitted values
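As a computational aside, the one-way ANOVA above takes only a few lines in Python. The sketch below uses the d = 0.1 gaps of Table 6 grouped by the item number m as the treatments; note that the screening in Section 5.3 pools its error term differently, so the Fm value there will not coincide with the standard F computed here.

```python
# One-way ANOVA sketch: manual F0 following the formulas above,
# cross-checked against scipy's built-in test.
import numpy as np
from scipy import stats

treatments = [
    np.array([1.70, 1.00, 0.78, 0.53]),      # m = 50  (d = 0.1 gaps, Table 6)
    np.array([11.65, 9.99, 8.88, 8.29]),     # m = 100
    np.array([63.81, 68.19, 69.84, 69.27]),  # m = 500
]
r = len(treatments)
N = sum(len(t) for t in treatments)

grand_mean = np.concatenate(treatments).mean()
ss_treat = sum(len(t) * (t.mean() - grand_mean) ** 2 for t in treatments)
ss_error = sum(((t - t.mean()) ** 2).sum() for t in treatments)

f0 = (ss_treat / (r - 1)) / (ss_error / (N - r))
p_value = stats.f.sf(f0, r - 1, N - r)          # P(F > f0)
print(f"F0 = {f0:.2f}, P-value = {p_value:.4g}")

f_scipy, p_scipy = stats.f_oneway(*treatments)  # agrees with f0, p_value
```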

Two-sample T-test

The two-sample T-test is a special case of ANOVA: it tests whether two groups of samples have the same mean value. The two-sample T-test tests the following two hypotheses:

H0: µ1 = µ2
H1: µ1 ≠ µ2

where H0 indicates that the two groups of samples have equal mean values, so that the effect of the corresponding factor is insignificant, while H1 indicates that the factor has a significant effect. As there are only two groups, the idea of the T-test is straightforward: directly compare the means. The statistic of the T-test is based on the t-distribution, which was first developed by William Gosset [Pearson, 1990]. The T-value is defined as

t0 = (ȳ1 − ȳ2) / (Sp √(1/n1 + 1/n2))

with

Sp² = [(n1 − 1)S1² + (n2 − 1)S2²] / (n1 + n2 − 2),    Si² = Σ_{j=1}^{ni} (yij − ȳi)² / (ni − 1)

where ni is the number of samples in group i, and yij is the value of the jth sample in group i. Si² is the variance of group i, and Sp² is the estimated common variance of all the samples. If |t0| > t_{α, n1+n2−2}, H0 is rejected. The critical value t_{α, n1+n2−2} can be looked up in the t-distribution table or calculated by statistical software such as MINITAB. The test statistic of the two-sample T-test is equivalent to ANOVA. However, since there are only two groups, by checking the critical values of the t-distribution, a (1 − α) confidence interval can be developed, i.e., ȳ1 − ȳ2 falls outside the confidence interval if and only if |t0| > t_{α, n1+n2−2}, if and only if the P-value is less than α. In this paper, the experiments dealing with two groups are tested with the two-sample T-test, and those dealing with more groups are tested with ANOVA.
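For illustration, the same test in Python, using the d = 0.3 gaps of Table 6 for m = 50 and m = 100 as the two groups (scipy assumed available):

```python
# Pooled two-sample T-test sketch, following the formulas above and
# cross-checked against scipy's equal-variance t-test.
import numpy as np
from scipy import stats

group1 = np.array([13.91, 13.52, 13.42, 13.33])  # m = 50  (d = 0.3, Table 6)
group2 = np.array([28.87, 31.96, 33.19, 34.01])  # m = 100

n1, n2 = len(group1), len(group2)
s1, s2 = group1.var(ddof=1), group2.var(ddof=1)
sp2 = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)   # pooled variance
t0 = (group1.mean() - group2.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
p_value = 2 * stats.t.sf(abs(t0), n1 + n2 - 2)          # two-sided P-value
print(f"t0 = {t0:.2f}, P-value = {p_value:.4g}")

t_scipy, p_scipy = stats.ttest_ind(group1, group2, equal_var=True)  # agrees
```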

3.2.2. Factorial Design and Response Surface Methods

Factorial designs are widely used in experiments involving several factors where it is necessary to study the joint effect of the factors on a response. They were first introduced by Fisher and Eden [1929]. A regression model like (1) can be built based on a factorial design. However, such a regression model drawn from a factorial design is usually not accurate; RSM is a necessary and efficient tool to improve the regression model.

2^k factorial design

The most important case of factorial design is that of k factors, each at only two levels, denoted by low and high, or − and +. The complete running of such an experiment requires 2 × 2 × ... × 2 = 2^k runs, so it is called a 2^k factorial design.

Table 3: The 2² factorial design sample data

A   B   AB   replication I   replication II
−   −   +    ···             ···
+   −   −    ···             ···
−   +   −    ···             ···
+   +   +    ···             ···

To have enough degrees of freedom, several runs are conducted for each combination of the factors; each such run is called a replication. For example, in Table 3, there are two factors, A and B; the − and + values are assigned to columns A and B to form the full combination of 2² = 4 runs. The values in column AB are the products of the values in columns A and B: in row one, the product of column A (−) and column B (−) is +, so column AB gets the value + in row one. The values in the other rows of column AB are assigned in the same way. A T-test or ANOVA is conducted to see the effects of the factors and the interaction effects of multiple factors, i.e., AB (the interaction between A and B). For factor A, the samples in the rows where column A has a − form one group, and those in the rows where column A has a + form another group. If the T-test or ANOVA shows that the two groups have the same mean, the effect of factor A is insignificant; otherwise A has a significant effect. The significance of the effect of B, or of the interaction AB, can be checked in the same way. When all the significant factors and interactions are identified, a regression model for the experiment can be built like (1):

y = f(x1, x2) = β0 + β1 x1 + β2 x2 + β12 x1 x2 + ε    (3)

where x1 is the value of factor A, and x2 is the value of factor B; ε denotes the error. To make the model more robust, the model only uses the factors with significant effects to estimate the response y, and the coefficients of the insignificant factors are set to zero. With the regression model, the response for any setting of the factor values can be predicted.
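To make the construction concrete, here is a minimal sketch of a replicated 2² design and a least-squares fit of a model of the form (3); the response values are hypothetical, not taken from the paper's experiments.

```python
# 2^2 factorial design sketch: build the sign table of Table 3 in
# coded units (- -> -1, + -> +1) and fit y = b0 + bA*A + bB*B + bAB*A*B.
import numpy as np

A = np.array([-1, +1, -1, +1] * 2, dtype=float)   # two replications
B = np.array([-1, -1, +1, +1] * 2, dtype=float)
y = np.array([8.2, 30.1, 14.0, 52.3,              # replication I  (hypothetical)
              7.9, 29.4, 14.6, 51.7])             # replication II (hypothetical)

# Design matrix: intercept, A, B, and the interaction column A*B.
X = np.column_stack([np.ones_like(A), A, B, A * B])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["b0", "bA", "bB", "bAB"], beta.round(3))))
```

In a balanced two-level design the columns of X are orthogonal, so each coefficient can also be read off directly as half the difference between the + and − group means.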

RSM

RSM is a collection of mathematical and statistical techniques that are useful for the modeling and analysis of problems in which a response of interest is influenced by several variables and the objective is to optimize this response.

A commonly used model in RSM is the second-order model:

y = β0 + Σ_{i=1}^{k} βi xi + Σ_{i=1}^{k} βii xi² + Σ_{i<j} βij xi xj + ε    (4)

[…]

• aij: set aij > 0 with probability d. If aij > 0, randomly generate its value from a uniform distribution on [1, v];

• si: randomly generate its value from a uniform distribution on [v, v + u]. This guarantees the feasibility of every single bid, because si > aij for all j;

• pj: to generate the bid price pj, first produce an item price vi for each i ∈ M by uniform sampling in the range [0.9 Σ_{j∈N} aij, 1.1 Σ_{j∈N} aij], and let pj deviate around the total value of the items in bid j, by uniform sampling in the range [0.9 Σ_{i∈M} aij vi, 1.1 Σ_{i∈M} aij vi]. Since the item price vi is proportional to the quantity requested, the generator is called PBP (Proportional Bid Price).
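For concreteness, the following is a sketch of the PBP generator as we read the description above, using the usual MUCA notation (aij the units of item i requested by bid j, si the supply of item i, pj the bid price); any details elided from this excerpt, such as rounding conventions, are assumptions on our part.

```python
# PBP (Proportional Bid Price) generator sketch for parameters
# m (items), n (bids), d (density), u, v.
import numpy as np

def generate_pbp(m, n, d, u, v, seed=0):
    rng = np.random.default_rng(seed)
    # a_ij > 0 with probability d; positive entries uniform on [1, v].
    mask = rng.random((m, n)) < d
    a = np.where(mask, rng.uniform(1, v, (m, n)), 0.0)
    # Supply of each item, uniform on [v, v + u]; ensures s_i > a_ij,
    # so every single bid is feasible on its own.
    s = rng.uniform(v, v + u, m)
    # Item price v_i deviates +/-10% around the total quantity of item i
    # requested over all bids; bid price p_j then deviates +/-10% around
    # the total value of the items requested in bid j.
    row_sums = a.sum(axis=1)
    item_price = rng.uniform(0.9 * row_sums, 1.1 * row_sums)
    bid_value = item_price @ a
    p = rng.uniform(0.9 * bid_value, 1.1 * bid_value)
    return a, s, p

a, s, p = generate_pbp(m=100, n=4000, d=0.3, u=40, v=10)
```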

4.3. Define the DOE Model

In order to study the effects of the parameters of the PBP generator, we have to formulate the DOE model, including the response and the factors. In the MUCA problem, we are interested in the hardness of test cases, so we define the response value as the hardness, and the factors as the parameters of the PBP generator.

4.3.1. Response

Based on the previous description, we adopt the gap criteria of Section 1 to evaluate the hardness of test cases for the MUCA problem. Therefore, the response value is defined as the gap provided by CPLEX within time T. To be consistent with the experiments in the literature, we adopt the average gap, over ten instances with the same configuration of factors, as the response value for our experiments. Let lbi and ubi denote the lower and upper bounds output by CPLEX 9.0 for instance i. The response value rv can be represented as

rv = Σ_{i=1}^{10} (ubi − lbi) / (10 ubi) × 100%.
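As a small worked example, the response value is just the average relative gap, in percent, over the ten instances:

```python
# Response value sketch: average relative gap, in percent, over the
# (lb, ub) bound pairs of the ten instances.
def response_value(bounds):
    return 100.0 * sum((ub - lb) / ub for lb, ub in bounds) / len(bounds)

# e.g. two near-closed and eight hypothetical wide-gap instances:
rv = response_value([(95, 100)] * 2 + [(60, 100)] * 8)   # -> 33.0
```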

We choose the running time T = 1800 seconds, i.e., half an hour, in our experiments. The reason for using half an hour as the time limit is the anytime performance of CPLEX. The anytime performance profile describes how the quality of the output of an algorithm gradually increases as a function of the computation time. Generally, the gap and lower bound output by CPLEX stop improving after running for thirty minutes. An example anytime performance profile of CPLEX in Figure 7 shows that when CPLEX solves a PBP instance with m = 100, n = 4000, d = 0.3, u = 40 and v = 10, the lower bound and the gap improve little after twelve minutes, and converge to stable values after thirty minutes.

Figure 7: Illustration of the anytime performance of CPLEX solver

4.3.2. Basic Factors

We take the five parameters m (the item number), n (the bid number), d (the density of the bid matrix), u (the average total quantity of each item), and v (the average quantity of each item requested in a bid) as our basic factors:

rv = f(m, n, d, u, v)    (8)

Our objective is to model the response of the gap as a function of the parameters m, n, d, u and v, so that we can generate test instances of desired hardness by configuring the parameters, as shown in model (8) and Figure 8.

Figure 8: Factors and response value (the controllable factors m, n, d, u, v feed the MUCA test case, which CPLEX solves to produce the gap; randomness in computing acts as an uncontrollable factor)

5. Tuning the Parameters in DOE Method

5.1. Experiment Framework

In this section, we first identify the range of values for each factor, then carry out a preliminary screening for the factors that have significant effects on the response value. After identifying the major factors, we apply a factorial design to examine their effects further and remove the insignificant factors. To make our model efficient and robust, we expect the model to contain only the factors with significant effects. The process is conducted iteratively, as shown in Figure 9: in each iteration, we aim to pick the appropriate model so as to reduce the factors with insignificant effects. The whole process finishes when a reasonable model is reached. With that model, we can generate instances of desired hardness accurately.

5.2. Range of Factors

The ranges of the factors are summarized in Table 5. The minimum and maximum values of the number of items m are 50 and 500 respectively, while the minimum and maximum values of the number of bids n are 2000 and 8000. The density d varies from 0.1 to 0.3. The values of u and v, i.e., the expected units supplied of each item and the expected units of an item demanded in each bid, vary from 40 to 80 and from 10 to 20 respectively. These values are not rigorous; the aim of setting the ranges is to generate test cases over a wide enough range of hardness. Based on our experimental results, when all the parameters are set to their minimum values, the test cases have an average gap of about 1.7%, while in the opposite extreme, when all the parameters are set to their maximum values, the test cases have an average gap of about 125.9%. Therefore we can believe that the current ranges of the factors provide enough width of hardness, since they cover at least the interval [1.7%, 125.9%], which is flexible enough.

Figure 9: Framework of experiments (identify range of factors → preliminary screening of major factors (ANOVA) → testing effects of major factors (factorial design) → remove insignificant factors and iterate)

Table 5: Range of Factors

Factor  m    n     d    u   v
Min     50   2000  0.1  40  10
Max     500  8000  0.3  80  20

5.3. Preliminary Screening on m and n

The item number m and the bid number n are the two factors that determine the size of the bid matrix A = {aij}_{m×n}, so the first interesting target is to identify how important these two factors are. As shown in Figure 10, with fixed m = 100, d = 0.3, u = 40 and v = 10, when n is increased from 2000 to 16000, the CPLEX gap increases from 29% to 36%; with m = 500, when n is increased from 2000 to 8000, the CPLEX gap increases from 93% to 126%. This observation indicates that the bid number n has a limited effect on the response value. This is consistent with a natural characteristic of the Set Packing Problem (SPP): for a given m, the total number of distinct bids n is no more than 2^m, so n matters little. To test this hypothesis and screen the potential major factors, we conduct a preliminary experiment on the effects of the instance scale only, choosing the item number m = 50, 100, 500 and the bid number n = 2000, 4000, 6000, 8000 while fixing u = 40 and v = 10. To get a wide picture of the effects of m and n, we conduct the experiment in two sessions; in each session we take a different value of the density d, i.e., d = 0.1 and d = 0.3 respectively. Therefore, for each session, a total of 4 × 3 = 12 configurations is examined. The response values are summarized in Table 6.

Figure 10: The CPLEX gap varies with the bid number n. (a) fixed m = 100, d = 0.3, u = 40, v = 10; (b) fixed m = 500, d = 0.3, u = 40, v = 10

Table 6 shows that the response value varies little when the bid number changes, while a change in the item number causes a large difference in the response value. To validate the observation quantitatively, we conduct an ANOVA on the factors m and n. The F-value for factor m is

Fm = [SSm/(3 − 1)] / [SSE/(12 − 1)] = [4 Σ_{i=1}^{3} (ȳi. − ȳ..)² / 2] / [Σ_{i=1}^{3} Σ_{j=1}^{4} (yij − ȳ..)² / 11] = 145.08 > F_{0.05,2,11} = 3.98.

Since Fm > F_{0.05,2,11}, the hypothesis H0 (all the means of the response value under different m are equal) is rejected: the factor m has a significant effect on the response. Similarly, the F-values for factor n and for the interaction between m and n are

Fn = [SSn/(4 − 1)] / [SSE/(12 − 1)] = 0.41 < F_{0.05,3,11} = 3.59,
Fm,n = [SSm,n/((3 − 1)(4 − 1))] / [SSE/(12 − 1)] = 0.41 < F_{0.05,6,11} = 3.09.

According to these F-values, neither the factor n nor the interaction between m and n has a significant effect on the response value.

Table 6: Summary of response values in preliminary experiments

              response value (%)
m     n       d = 0.1   d = 0.3
50    2000    1.70      13.91
50    4000    1.00      13.52
50    6000    0.78      13.42
50    8000    0.53      13.33
100   2000    11.65     28.87
100   4000    9.99      31.96
100   6000    8.88      33.19
100   8000    8.29      34.01
500   2000    63.81     92.90
500   4000    68.19     109.81
500   6000    69.84     113.89
500   8000    69.27     125.91
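The screening can also be reproduced approximately in a few lines. The sketch below fits an additive model rv ~ C(m) + C(n) to the d = 0.1 column of Table 6 with statsmodels; the paper's own computation above pools degrees of freedom differently, so the exact F-values will not coincide.

```python
# Additive two-way ANOVA sketch on the d = 0.1 response values of Table 6.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    "m":  [50] * 4 + [100] * 4 + [500] * 4,
    "n":  [2000, 4000, 6000, 8000] * 3,
    "rv": [1.70, 1.00, 0.78, 0.53,
           11.65, 9.99, 8.88, 8.29,
           63.81, 68.19, 69.84, 69.27],
})
model = ols("rv ~ C(m) + C(n)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))   # F- and P-values for m and n
```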

As presented in Section 3, to justify the above observation we need to verify the ANOVA assumption that the residuals are independently and identically normally distributed. We present the normal probability plot and the plot of residuals against fitted values in Figure 11.

Figure 11: Residual plots for the ANOVA. (a) normal probability plot; (b) residuals versus the fitted values

The plots imply that the residuals conform to an independent identical normal distribution: the points in Figure 11(a) lie almost on a straight line, and the points in Figure 11(b) show no obvious pattern. Therefore, we can conclude that the bid number n does not have a significant effect on the response value, and it can be ignored in further experiments.


5.4. Reduce Insignificant Factors by Factorial Design

According to the preliminary screening, the effect of the number of bids, n, can be ignored. We thus fix n = 4000 and conduct an experiment focusing on the effects of m, d, u, v. Table 7 shows our testing schema, where each factor has two levels: − denotes the low level and + denotes the high level. The item number m has low level 100 and high level 500; the density d has low level 0.1 and high level 0.3; and the low and high levels of u and v are 40 and 80, and 10 and 20, respectively. Therefore, a 2⁴ factorial design, Dm,d,u,v, is conducted. To obtain enough degrees of freedom, two replicates are run.

Table 7: Testing schema of m, d, u, v

Level  m    d    u   v
−      100  0.1  40  10
+      500  0.3  80  20

Table 8: The T-values and P-values of design Dm,d,u,v

Factor  T-value  P-value
m       52.27    < 0.001
d       24.12    < 0.001
u       −23.16   < 0.001
v       22.77    < 0.001
m·d     7.70     < 0.001
m·u     −15.49   < 0.001
m·v     14.84    < 0.001
d·u     −6.39    < 0.001
d·v     6.43     < 0.001
u·v     −4.41    < 0.001

To evaluate the effects of the factors, the T-value and P-value for each factor and for the interactions between factors are presented in Table 8. All the P-values are less than α = 0.05, and the residual plots in Figure 12 show that the normality assumption is satisfied. We can now conclude that all the factors and interactions in the design Dm,d,u,v are significant. However, the model can be further simplified based on the following observations. Figure 13(a) shows the response values for different combinations of u and v. In this experiment, the item number m is fixed to 100, the bid number n is fixed to 2000, and the density d varies from 0.1 to 0.5.

Figure 12: Residual plots for design Dm,d,u,v. (a) normal probability plot; (b) residuals versus the fitted values

The points form three clusters. For convenience, we define r as the ratio of u to v. Points with the same ratio r are very close to each other and separated from points with other r values. Figure 13(b) shows the same pattern, where the item number m and the bid number n are 500 and 4000 respectively.

Figure 13: The pattern of the CPLEX gap against u and v. (a) fixed m = 100, n = 2000; (b) fixed m = 500, n = 4000

To validate the above observation, we conduct the design Dm,d,u,r by replacing v with r in Dm,d,u,v. Since the factors u, v, and r are dependent, any two of them provide the same information, so the new design Dm,d,u,r does not lose any information compared with Dm,d,u,v. The testing schema is shown in Table 9; it is also a 2⁴ factorial design with two replicates. The low and high levels for r are 2 and 4, while the other three factors, m, d and u, have the same low and high levels as in design Dm,d,u,v. We examine the main effects and all the higher-order interactions.

The statistical results are summarized in Table 10, which shows that almost all the interactions related with u are insignificant, having P-values greater than 0.05.

Table 9: Testing schema of m, d, u, r

Level  m    d    u   r
−      100  0.1  40  2
+      500  0.3  80  4

Table 10: The T-values and P-values of design Dm,d,u,r

Factor   T-value   P-value
m        206.98    < 0.001
d        93.19     < 0.001
u        0.92      0.373
r        −85.97    < 0.001
m·d      27.22     < 0.001
m·u      −2.11     0.050
m·r      −56.45    < 0.001
d·u      −0.96     0.351
d·r      −21.75    < 0.001
u·r      −2.11     0.050
m·d·u    −2.10     0.051
m·d·r    −4.70     < 0.001
m·u·r    0.10      0.920
d·u·r    1.11      0.283

Since Dm,d,u,r leads to fewer significant factors than Dm,d,u,v does, we adopt Dm,d,u,r and remove the insignificant factor u, so as to focus our examination on the effects of the factors m, d, and r only. The 2⁴ factorial design with two replicates thereby becomes a 2³ factorial design with four replicates. The statistical results of the new experiment Dm,d,r are shown in Table 11. The P-values are all smaller than 0.05, confirming the significance of the three factors. Finally, Figure 14(a) shows that the residuals are normally distributed, and Figure 14(b) confirms their independence from the fitted values; therefore, the assumption that the residuals are independently and identically normally distributed is satisfied. Based on Dm,d,r, we can obtain a linear regression model to estimate the response value, as shown in (9):

y = β0 + β1 x1 + β2 x2 + β3 x3 + β1,2 x1 x2 + β1,3 x1 x3 + β2,3 x2 x3 + β1,2,3 x1 x2 x3 + ε    (9)

Table 11: The T-values and P-values of design Dm,d,r

Factor   T-value  P-value
m        163.59   < 0.001
d        73.65    < 0.001
r        −67.95   < 0.001
m·d      21.52    < 0.001
m·r      −44.61   < 0.001
d·r      −17.19   < 0.001
m·d·r    −3.71    0.001

Figure 14: Residual plots for design Dm,d,r. (a) normal probability plot; (b) residuals versus the fitted values

where x1, x2 and x3 are the values of the factors m, d, and r. Fitting the coefficients, the linear regression model for the response value rv is

rv = (−37.5 + 0.327m + 215d − 6.27r + 0.461md − 0.0525mr − 32.81dr − 0.0524mdr + ε)%.    (10)

5.5. Examine the Performance of the Models

To understand whether (10) accurately captures the relationship between the response value and the major factors, we examine the accuracy of the values predicted by (10) against experimental results. Three points that were not observed in the previous experiments are selected, as described in Table 12. Two observations are collected at each point; their response values are shown in Table 13.

In Table 13, to compare with the experimental results, we also report the fitted values and the 95% confidence intervals predicted by the regression model (10) for each configuration. The comparison reveals that the regression model (10) does not predict the response values accurately enough, since no actual response value falls into its 95% confidence interval.

Table 12: Points for prediction

Observation  r     v   u   d     m    n
1, 2         2.4   25  60  0.13  200  2000
3, 4         4.67  15  70  0.22  370  5500
5, 6         2.5   20  50  0.15  600  3500

Table 13: Comparison of response values and predicted values

Observation  rv (%)  Fit (%)  95% CI
1            54.82   44.302   (43.377, 45.227)
2            54.76   44.302   (43.377, 45.227)
3            70.48   53.51    (52.288, 54.732)
4            67.46   53.51    (52.288, 54.732)
5            139.40  145.596  (144.286, 146.905)
6            138.04  145.596  (144.286, 146.905)

To improve the accuracy of the regression model, the second-order model of RSM is adopted by introducing the quadratic terms. We need to add center points to the original data set by CCD, as presented in Section 3. The testing schema is shown in Table 14: one cube center point and six face center points are added for the response surface analysis. However, the point (−36, 0.200, 3.0) suggested by CCD is not applicable, because a negative item number is meaningless in the MUCA problem; we adjust this point to (50, 0.200, 3.0). In Table 15, all the P-values are smaller than 0.05, so it can be concluded that the effects of the pure quadratic terms are significant. Figure 15 shows that the normality assumption is also satisfied. The following second-order regression model can thus be drawn:

rv = (−6.592 + 0.428m + 493.830d − 37.126r − 0.000122m² − 555.281d² + 7.464r² + 0.327md − 0.063mr − 50.969dr + ε)%    (11)
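For convenience, model (11) can be wrapped in a small prediction function (coefficients exactly as reported above, error term dropped); evaluating it at the contour point used in Figure 16 gives a value close to the 100% target:

```python
# Prediction sketch for the second-order model (11); response in percent.
def rv_hat(m, d, r):
    return (-6.592 + 0.428 * m + 493.830 * d - 37.126 * r
            - 0.000122 * m**2 - 555.281 * d**2 + 7.464 * r**2
            + 0.327 * m * d - 0.063 * m * r - 50.969 * d * r)

print(round(rv_hat(300, 0.300, 2.69), 1))   # ~100.6, close to trv = 100%
```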

Table 14: The center points based on CCD

Point Type   m    d      r    Replicates
Cube Center  300  0.200  3.0  6
Face Center  50   0.200  3.0  2
Face Center  636  0.200  3.0  2
Face Center  300  0.030  3.0  2
Face Center  300  0.368  3.0  2
Face Center  300  0.200  1.3  2
Face Center  300  0.200  4.6  2

Table 15: The T-values and P-values of design Dm,d,r with pure quadratic terms

Factor  T-value  P-value
m       16.196   < 0.001
d       9.565    < 0.001
r       −5.989   < 0.001
m·m     −4.508   < 0.001
d·d     −6.234   < 0.001
r·r     8.064    < 0.001
m·d     6.005    < 0.001
m·r     −11.581  < 0.001
d·r     −4.681   < 0.001

Now we can compare the actual response values of the six points of Table 12 with the values predicted by the new regression model (11). The results are shown in Table 16. The accuracy of the predictions from the second-order model (11) has improved a lot: for four out of six points, the actual response value falls into the confidence interval predicted by the model. The probable reason for the failed predictions at points 1 and 2 is as follows: because we had to modify the negative m in the point suggested by the CCD, the prediction of the quadratic regression model is less accurate for instances with a small value of m, such as points 1 and 2 in Table 16. To make the model more accurate, more center points with small values of m would be needed. However, since the generation of easy cases is outside our interest, the inaccuracy for cases with small m is acceptable in our study.


Figure 15: Residual plots for design Dm,d,r. (a) normal probability plot; (b) residuals versus the fitted values

Table 16: Comparison of response values and predicted values

Point  rv (%)  Fit (%)  95% CI
1      54.82   45.189   (41.955, 48.424)
2      54.76   45.189   (41.955, 48.424)
3      70.48   71.646   (66.231, 77.061)
4      67.46   71.646   (66.231, 77.061)
5      139.41  137.725  (132.779, 142.671)
6      138.04  137.725  (132.779, 142.671)

5.6. Generate Test Instances with Expected Gaps

Since the second-order regression model (11) has been shown to predict accurately, it can be used to generate test cases of desired hardness for the MUCA problem. We generate sixteen points with target response values trv from 50% to 200%. The values of the insignificant factors n and u are fixed at 4000 and 40 respectively. Since there are three factors in model (11), one of them has to be fixed in order to use the contour plot from RSM to find the points with a given response value. For example, in Figure 16, we fix m = 300, then obtain the point with target value trv = 100% at d = 0.300 and r = u/v = 2.69, and the point with trv = 110% at d = 0.171 and r = u/v = 1.62.

Figure 16: Using contour plots to obtain d and u/v for the expected gaps
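Reading a point off the contour plot can also be mimicked numerically: with m and r fixed, the density d at which model (11) attains a target gap trv is the root of a one-dimensional equation. A sketch with scipy's brentq, where the bracket [0.05, 0.4] is our assumption:

```python
# Invert model (11) for d at fixed m and r with a root finder.
from scipy.optimize import brentq

def rv_hat(m, d, r):   # model (11), as in Section 5.5
    return (-6.592 + 0.428 * m + 493.830 * d - 37.126 * r
            - 0.000122 * m**2 - 555.281 * d**2 + 7.464 * r**2
            + 0.327 * m * d - 0.063 * m * r - 50.969 * d * r)

m, r, trv = 300, 2.69, 100.0
d = brentq(lambda d: rv_hat(m, d, r) - trv, 0.05, 0.4)
print(round(d, 3))   # ~0.295, close to the d = 0.300 read from Figure 16
```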

cases (10 out of 16), the deviation is less than 5%, while the maximum deviation is 8.9%. In a word, the contour plot we obtained from RSM is able to generate test cases with desired hardness accurately.

Figure 17: Deviation of generated response value from the target response value

Now we have obtained the parameters for generating test cases with desired hardness, from gap 50% to 200%. As a comparison, we generate one hundred test cases by simply picking the parameters from uniform distribution within the ranges defined in Table 5.We plot the histogram of the response values, the gaps provided by CPLEX, of the instances. As shown in Figure 18, about eighty percent of the instances have gaps smaller than 65%; the instances with gaps greater than 100% are quite rare, i.e. only two out of one hundred have 30

Table 17: Summary of performance for instances generated

Point  m      d        v     trv    rv      |ε|%
1      (250)  0.085    19    50%    52.9%   2.9%
2      (250)  0.132    17    60%    68.2%   8.2%
3      (250)  0.182    16    70%    76.1%   6.1%
4      236    (0.250)  16    80%    80.9%   0.9%
5      330    (0.250)  13    90%    87.7%   2.3%
6      (300)  0.300    15    100%   96.8%   3.2%
7      (300)  0.170    25    110%   118.9%  8.9%
8      474    (0.250)  13    120%   115.3%  4.7%
9      390    0.200    (20)  130%   125.1%  4.9%
10     528    (0.250)  14    140%   131.7%  8.3%
11     400    0.270    (20)  150%   143.8%  6.2%
12     435    0.199    (25)  160%   156.2%  3.8%
13     507    0.230    (20)  170%   168.8%  1.2%
14     528    0.183    (25)  180%   182.7%  2.7%
15     523    0.213    (25)  190%   184.7%  5.3%
16     560    (0.200)  26    200%   203.5%  3.5%

Therefore, without properly tuning the parameters, it is difficult to generate test cases of desired hardness. Moreover, DOE method can generate hard cases even at small instance scales: among the randomly generated cases, the largest gap is only 130.7%, whereas DOE method can produce test cases with gap 200% accurately.

5.7. Summary of Efficiency

To illustrate the efficiency of DOE method, we summarize the CPU time for each step in Table 18, following the experiment framework of Figure 9. The preliminary screening using ANOVA takes 120 hours, running 240 instances of 30 minutes each, as described in Section 4.3. The first iteration of factor reduction, the design Dm,d,u,v, takes 160 hours running 320 instances. The second iteration of factor reduction, involving the factors m, d, u, and r, can reuse most instances of the design Dm,d,u,v; therefore the designs Dm,d,u,r and Dm,d,r take only 40 more hours, running 80 additional instances that do not appear in Dm,d,u,v. To improve the prediction performance, the RSM needs another 180 instances, taking 90 hours. Finally, another 80 hours are needed to run the 160 instances verifying whether the generated instances provide the desired hardness. The total running time for the whole experiment is only 490 hours, i.e., less than three weeks of CPU time.

Figure 18: Histogram of hardness of randomly generated test cases

Compared with the work of Leyton-Brown et al. [2005], in which three years of CPU time were spent testing the hardness of test cases with only three combinations of m and n, our DOE-based framework is much more efficient.

Table 18: Summary of CPU running time

Experiment steps                       #instances  CPU time (h)
Preliminary screening (ANOVA)          240         120
Reducing 1 (Dm,d,u,v)                  320         160
Reducing 2 (Dm,d,u,r and Dm,d,r)       80          40
Performance improvement (RSM)          180         90
Generate and verify desired hardness   160         80
Total                                  980         490

6. Conclusions

In this paper we propose a new framework for generating and analyzing test cases for optimization problems. With the help of DOE method, test cases of desired empirical hardness are generated systematically, and the hardness of instances is successfully predicted by our model. To demonstrate our methodology, we take the test case generator PBP for the MUCA problem as an example, and iteratively examine the five parameters of the PBP generator.

To reduce the insignificant factors, we utilize ANOVA and factorial design; to improve the prediction of the regression model, we adopt RSM. In the end, our experimental results show that the second-order regression model satisfies the needs of test case generation well. The experiments are very efficient, costing no more than three weeks of CPU time.

For optimization problems such as combinatorial auctions, real test data, as well as instances produced by generators, can be examined to see whether they fall into an area that is easy or difficult to solve, and which factors or parameters affect their hardness significantly. This helps researchers to judge whether the algorithms they propose are efficient enough to solve problems in real situations. Furthermore, our method is general and helps to enhance the testing methodology, providing a more scientific way to conduct empirical experiments. Any generator with several parameters can be analyzed with this method, and the solver can be any algorithm or software package other than CPLEX, as long as it offers some output, such as the gap or the running time, that can be used to evaluate hardness. To our knowledge, this is the first study to apply DOE method to analyzing test case generation in optimization problems.

DOE method is well known in quality control, and software packages such as MINITAB and Design-Expert can be used to do the calculations in ANOVA, factorial design, etc. In this paper, we have only utilized some basic tools of DOE, namely ANOVA, factorial design, and RSM; there are many advanced methods in DOE which could be used to improve our methodology for studying optimization problems. In our case, the price schema in PBP can also be considered a factor affecting the hardness of the test cases, as can the parameters of CPLEX, such as the branch direction, the clique adding strategy, etc. Some of these factors have more than two levels; e.g., there are three clique adding strategies provided in CPLEX 9.0. To handle such parameters, we need advanced techniques in DOE, such as mixed-level fractional factorial design [Montgomery, 2001]. As we know, the random numbers generated by a computer program can also affect the hardness of the test cases, or the solution quality of randomized algorithms such as simulated annealing, genetic algorithms, etc. Although the random numbers are usually uncontrollable, we can avoid undesirable cases by setting the controllable parameters; a robust design [Taguchi, 1986, Taguchi and Konishi, 1987, Bisgaard and Steinberg, 1997] can help us in this situation.


References

D. Achlioptas, C. Gomes, H. Kautz, and B. Selman. Generating satisfiable problem instances. In AAAI: 17th National Conference on Artificial Intelligence. AAAI / MIT Press, 2000.

P. Augerat, J. Belenguer, E. Benavent, A. Corberan, and D. Naddef. Separating capacity constraints in the CVRP using tabu search. European Journal of Operational Research, 106:546–557, 1998.

S. Bisgaard and D. M. Steinberg. The design and analysis of 2^(k−p) prototype experiments. Technometrics, 39(1):52–62, 1997.

G. Box and K. Wilson. On the experimental attainment of optimum conditions. Journal of the Royal Statistical Society, 13:1–45, 1951.

G. E. P. Box and D. R. Cox. An analysis of transformations. Journal of the Royal Statistical Society, 13:211–243, discussion 244–252, 1964.

G. E. P. Box and N. R. Draper. Empirical Model Building and Response Surfaces. John Wiley & Sons, 1987.

E. Burke, G. Kendall, and G. Whitwell. A new placement heuristic for the orthogonal stock-cutting problem. Operations Research, 52(4):655–671, 2004.

J. Chambers, W. Cleveland, B. Kleiner, and P. Tukey. Graphical Methods for Data Analysis. Wadsworth, 1983.

J. Cordeau, M. Gendreau, and G. Laporte. A tabu search heuristic for periodic and multi-depot vehicle routing problems. Networks: An International Journal, 30, 1997.

Design-Expert. http://www.statease.com/.

R. Fisher. The Design of Experiments. Oliver and Boyd, 1935.

R. Fisher. Statistical Methods for Research Workers. Oliver and Boyd, 1958.

R. Fisher and T. Eden. Studies in crop variation. VI. Experiments on the response of the potato to potash and nitrogen. Journal of Agricultural Science, 19:201–213, 1929.

Y. Fujishima, K. Leyton-Brown, and Y. Shoham. Taming the computational complexity of combinatorial auctions: optimal and approximate approaches. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pages 548–553, Stockholm, Sweden, 1999.

Y. Guo, A. Lim, B. Rodrigues, and Y. Zhu. A non-exact approach and experiment studies on the combinatorial auction problem. In HICSS, 2005.

W. D. Harvey and M. L. Ginsberg. Limited discrepancy search. In IJCAI (1), pages 607–615, 1995.

H. H. Hoos and C. Boutilier. Solving combinatorial auctions using stochastic local search. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pages 22–29. AAAI Press / The MIT Press, 2000. ISBN 0-262-51112-6.

E. Hopper and B. Turton. A genetic algorithm for a 2D industrial packing problem. Computers and Industrial Engineering, 37(1-2):375–378, 1999.

R. M. Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations. Plenum Press, 1972.

K. Leyton-Brown, M. Pearson, and Y. Shoham. Towards a universal test suite for combinatorial auction algorithms. In ACM Conference on Electronic Commerce, pages 66–76, 2000a.

K. Leyton-Brown, Y. Shoham, and M. Tennenholtz. An algorithm for multi-unit combinatorial auctions. In the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, 2000b.

K. Leyton-Brown, E. Nudelman, and Y. Shoham. Empirical hardness models for combinatorial auctions. In Combinatorial Auctions. MIT Press, 2005.

Y. Li, A. Lim, and B. Rodrigues. Crossdocking: JIT scheduling with time windows. Journal of the Operational Research Society, 55:1342–1351, 2004.

A. Lim and J. Tang. A Lagrangian heuristic for the winner determination problem in combinatorial auctions. In Australian Conference on Artificial Intelligence, pages 736–745, 2005.

A. Lim and F. Wang. Robust graph coloring for uncertain supply chain management. In HICSS. IEEE Computer Society, 2005a. ISBN 0-7695-2268-8.

A. Lim and F. Wang. Multi-depot vehicle routing problem: a one-stage approach. IEEE Transactions on Automation Science and Engineering, 2(4), 2005b.

A. Lim, B. Rodrigues, and Y. Zhu. Airport gate scheduling with time windows. Artificial Intelligence Review, 24(1):5–31, 2005.

A. Lim, B. Rodrigues, and X. Zhang. Scheduling sports competitions at multiple venues, revisited. European Journal of Operational Research, accepted.

MINITAB. http://www.minitab.com/.

R. Monasson, R. Zecchina, S. Kirkpatrick, B. Selman, and L. Troyansky. Determining computational complexity from characteristic 'phase transitions'. Nature, 400, 1999.

D. C. Montgomery. Design and Analysis of Experiments. Wiley, 5th edition, 2001.

OR-Library. http://people.brunel.ac.uk/~mastjjb/jeb/info.html.

E. S. Pearson. Student, A Statistical Biography of William Sealy Gosset. Oxford University Press, 1990.

Y. Sakurai, M. Yokoo, and K. Kamei. An efficient approximate algorithm for winner determination in combinatorial auctions. In CECOMM: ACM Conference on Electronic Commerce, 2000.

T. Sandholm. Algorithm for optimal winner determination in combinatorial auctions. Artificial Intelligence, 135(1-2):1–54, 2002.

T. Sandholm, S. Suri, A. Gilpin, and D. Levine. CABOB: A fast optimal algorithm for combinatorial auctions. In IJCAI, pages 1102–1108, 2001a.

T. Sandholm, S. Suri, A. Gilpin, and D. Levine. Winner determination in combinatorial auction generalizations, 2001b.

T. Sandholm, S. Suri, A. Gilpin, and D. Levine. CABOB: A fast optimal algorithm for winner determination in combinatorial auctions. Management Science, 51(3):374–390, 2005.

B. Selman, H. Kautz, and B. Cohen. Local search strategies for satisfiability testing. In AAAI-92: Proceedings 10th National Conference on AI, DIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society, 1995.

B. Selman, D. Mitchell, and H. J. Levesque. Generating hard satisfiability problems. AIJ: Artificial Intelligence, 81, 1996.

F. Sivrikaya-Serifoglu and G. Ulusoy. Parallel machine scheduling with earliness and tardiness penalties. Computers & Operations Research, 26(8):773–787, 1999.

G. Taguchi. Introduction to Quality Engineering. Asian Productivity Organisation, 1986.

G. Taguchi and S. Konishi. Orthogonal Arrays and Linear Graphs. ASI Press, 1987.

J. Tang. A Lagrangian Heuristic for Winner Determination Problem in Combinatorial Auctions. PhD thesis, Hong Kong University of Science and Technology, 2005.

C. L. Valenzuela and P. Y. Wang. Heuristics for large strip packing problems with guillotine patterns: an empirical study. In Proceedings of the 4th Metaheuristics International Conference, pages 417–421, 2001.

V. V. Vazirani. Approximation Algorithms. Springer, 2001.

VRP-Web. http://neo.lcc.uma.es/radi-aeb/WebVRP/.

J. Yanez and J. Ramirez. The robust coloring problem. European Journal of Operational Research, 148:546–558, 2003.