Logarithmic Regret Bounds for Bandits with Knapsacks

Arthur Flajolet

FLAJOLET@MIT.EDU

Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139

arXiv:1510.01800v2 [cs.DS] 22 Feb 2016

Patrick Jaillet

JAILLET@MIT.EDU

Department of Electrical Engineering and Computer Science, Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139

Abstract Optimal regret bounds for Multi-Armed Bandit problems are now well documented. They can be classified into two categories based on the growth rate with respect to the time horizon T: (i) small, distribution-dependent, bounds of order of magnitude ln(T) and (ii) robust, distribution-free, bounds of order of magnitude √T. The Bandits with Knapsacks model, an extension of the framework that allows for modeling resource consumption, lacks this clear-cut distinction. While several algorithms have been shown to achieve asymptotically optimal distribution-free bounds on regret, there has been little progress toward the development of small distribution-dependent regret bounds. We partially bridge the gap by designing a general-purpose algorithm with distribution-dependent regret bounds that are optimal in several important cases that cover many practical applications, including dynamic pricing with limited supply, online bid optimization for sponsored search auctions, and dynamic procurement. Keywords: Multi-Armed Bandits; Knapsack Constraints

1. Introduction

Multi-Armed Bandit (MAB) is a benchmark model for repeated decision making in stochastic environments with very limited feedback on the outcomes of alternatives. In these circumstances, a decision maker must strive to find an overall optimal sequence of decisions while making as few suboptimal ones as possible when exploring the decision space in order to generate as much revenue as possible, a trade-off coined exploration-exploitation. The original problem, first formulated in its predominant version in Robbins (1952), has spurred a new line of research that aims at introducing additional constraints that reflect more accurately the reality of the decision making process. Bandits with Knapsacks (BwK), a model formulated in its most general form in Badanidiyuru et al. (2013), fits into this framework and is characterized by the consumption of a limited supply of resources (e.g. time, money, and natural resources) that comes with every decision. This extension is motivated by a number of applications in electronic markets such as dynamic pricing with limited supply, see Besbes and Zeevi (2012), Besbes and Zeevi (2009), and Babaioff et al. (2012), online advertising Slivkins (2013), online bid optimization for sponsored search auctions Tran-Thanh et al. (2014a), and crowdsourcing Tran-Thanh et al. (2014b). A unifying paradigm of online learning is to evaluate algorithms based on their regret performance. In the BwK theory, this performance criterion is expressed as the gap between the total payoff of an optimal oracle algorithm aware of how the rewards and the amounts of resource consumption are generated and the total payoff of the algorithm. Many approaches have been proposed to tackle the original MAB problem, where



time is the only limited resource with a prescribed time horizon T, and the optimal regret bounds are now well documented. They can be classified into two categories with qualitatively different asymptotic growth rates. Many algorithms, such as UCB1 Auer et al. (2002a), Thompson sampling Agrawal and Goyal (2012), and ǫ-greedy Auer et al. (2002a), achieve distribution-dependent, i.e. with constant factors that depend on the underlying unobserved distributions, asymptotic bounds on regret of order Θ(ln(T)), which is shown to be optimal in Lai and Robbins (1985). While these results prove very satisfying in many settings, the downside is that the bounds can get arbitrarily large if a malicious opponent were to select the underlying distributions in an adversarial fashion. In contrast, algorithms such as Exp3, designed in Auer et al. (2002b), achieve distribution-free bounds that can be computed in an online fashion, at the price of a less attractive growth rate Θ(√T). The BwK theory lacks this clear-cut distinction. While provably optimal distribution-free bounds have recently been established, see Agrawal and Devanur (2014) and Badanidiyuru et al. (2013), there has been little progress toward the development of asymptotically optimal distribution-dependent regret bounds. To bridge the gap, in this paper we introduce algorithms with proven regret bounds which are asymptotically logarithmic in the initial supply of each resource, in three important cases that cover a wide range of applications:

• Case 1, where there is a single limited resource and the amount of resource consumed as a result of making a decision is stochastic. Applications in online advertising Tran-Thanh et al. (2010) and wireless sensor networks Tran-Thanh et al. (2012) fit in this framework;

• Case 2, where there are two limited resources, one of which is assumed to be time while the consumption of the other is stochastic.
Typical applications include online bid optimization for sponsored search auctions Tran-Thanh et al. (2014a), dynamic pricing with limited supply Babaioff et al. (2012), and dynamic procurement Badanidiyuru et al. (2013);

• Case 3, where there are arbitrarily many resources and the amounts of resources consumed as a result of making a decision are deterministic. Some applications in dynamic pricing also fit in this framework, see for instance Besbes and Zeevi (2012).

2. Problem statement and literature review

At each time period t ∈ N, a decision needs to be made among a predefined finite set of actions, represented by arms and labeled k = 1, · · · , K. We denote by a_t the arm pulled at time t. Pulling arm k at time t yields a random reward r_{k,t} ∈ [0, 1] and incurs the consumption of C ∈ N different resource types by random amounts (c_{k,t}(1), · · · , c_{k,t}(C)) ∈ [0, 1]^C. At any time t and for any arm k, the vector (r_{k,t}, c_{k,t}(1), · · · , c_{k,t}(C)) is jointly drawn from a fixed probability distribution ν_k independently from the past. The rewards and the amounts of resource consumption can be arbitrarily correlated across arms. The consumption of any resource i ∈ {1, · · · , C} is constrained by an initial budget B(i) ∈ R_+. As a result, the decision maker can keep pulling arms only so long as he does not run out of any of the C resources, and the game ends at time period τ*, defined as:

τ* = min{ t ∈ N | ∃i ∈ {1, · · · , C}, Σ_{τ=1}^{t} c_{a_τ,τ}(i) > B(i) }.    (1)
Note that time itself may or may not be a limited resource. The difficulty for the decision maker lies in the fact that none of the underlying distributions, i.e. (ν_k)_{k=1,··· ,K}, are initially known.
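The setting above is easy to simulate. The sketch below (a toy instance with made-up budgets and truncated Gaussians standing in for the unknown distributions ν_k; all numbers are illustrative, not from the paper) plays an arbitrary policy until some budget is exceeded and returns the payoff together with the stopping time τ* of (1):

```python
import random

# Hypothetical toy instance: K arms, C resources, budgets B(i).
K, C = 3, 2
budgets = [50.0, 80.0]                              # B(1), B(2)
mean_reward = [0.9, 0.5, 0.7]                       # mu_k^r (unknown to the learner)
mean_cost = [[0.8, 0.3], [0.2, 0.9], [0.5, 0.5]]    # mu_k^c(i)

def pull(k, rng):
    """Draw (reward, consumption vector) for arm k, clipped to [0, 1]."""
    r = min(1.0, max(0.0, rng.gauss(mean_reward[k], 0.1)))
    c = [min(1.0, max(0.0, rng.gauss(mean_cost[k][i], 0.1))) for i in range(C)]
    return r, c

def play_until_exhaustion(policy, rng):
    """Run a policy until some budget is exceeded; return the total
    reward collected before tau* and tau* itself, matching (1)."""
    spent = [0.0] * C
    total, t = 0.0, 0
    while True:
        t += 1
        r, c = pull(policy(t), rng)
        for i in range(C):
            spent[i] += c[i]
        if any(spent[i] > budgets[i] for i in range(C)):
            return total, t   # the reward at tau* itself is not counted
        total += r

rng = random.Random(0)
reward, tau = play_until_exhaustion(lambda t: t % K, rng)
```

Note that the decision maker only observes the realized (reward, consumption) pair of the arm actually pulled, which is exactly the bandit feedback described next.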


Furthermore, the only feedback provided to the decision maker upon pulling arm a_t (but prior to selecting a_{t+1}) is (r_{a_t,t}, c_{a_t,t}(1), · · · , c_{a_t,t}(C)), i.e. the decision maker does not observe the rewards that would have been obtained and the amounts of resources that would have been consumed as a result of pulling a different arm. The goal is to design a non-anticipating algorithm that, at any time t, selects a_t based on the information acquired in the past so as to keep the pseudo-regret, defined as:

R_{B(1),··· ,B(C)} = EROPT(B(1), · · · , B(C)) − E[ Σ_{t=1}^{τ*−1} r_{a_t,t} ],    (2)
as small as possible, where EROPT(B(1), · · · , B(C)) is the maximum expected sum of rewards that can be obtained by a non-anticipating oracle algorithm that has knowledge of the underlying distributions. Here, an algorithm is said to be non-anticipating if the decision to pull a given arm does not depend on the future observations. We develop algorithms with distribution-dependent bounds on R_{B(1),··· ,B(C)} that hold for any choice of the unobserved underlying distributions (ν_k)_{k=1,··· ,K}. Let us denote by (F_t)_{t∈N} the natural filtration generated by the rewards and the amounts of resource consumption revealed to the decision maker, i.e. ((r_{a_t,t}, c_{a_t,t}(1), · · · , c_{a_t,t}(C)))_{t∈N}. Note that τ* is a stopping time with respect to (F_t)_{t≥1}. As the mean turns out to be an important statistic, we denote the mean reward and mean amounts of resource consumption by µ^r_k, µ^c_k(1), · · · , µ^c_k(C) and their respective empirical estimates by r̄_{k,t}, c̄_{k,t}(1), · · · , c̄_{k,t}(C). These estimates depend on the number of times each arm has been pulled by the decision maker up to, but not including, time t, which we write n_{k,t}. We end with a general assumption that we make throughout the paper.

Assumption 1 For any resource i ∈ {1, · · · , C} and for any arm k ∈ {1, · · · , K}, µ^c_k(i) > 0.

Assumption 1 is meant to have the game end in finite time almost surely. Strictly speaking, we only need the weaker assumption that pulling any arm incurs the consumption of at least one resource. We use the stronger Assumption 1 to simplify the presentation, but all the proofs can be easily adapted to accommodate the weaker version. Moreover, all the proofs are deferred to the Appendix.

Literature Review. The Bandits with Knapsacks framework was first introduced in its full generality in Badanidiyuru et al. (2013), but special cases had been studied before, see for example Tran-Thanh et al. (2010), Tran-Thanh et al. (2012), and Babaioff et al. (2012).
Since the standard MAB problem fits in the BwK framework, with time being the only scarce resource, the results listed in the introduction tend to suggest that regret bounds with logarithmic growth with respect to the budgets may be possible for BwK problems, but very few such results are documented. When there are arbitrarily many resources and a time horizon, Badanidiyuru et al. (2013) and Agrawal and Devanur (2014) obtain

Õ( √EROPT(B(1), · · · , B(C)) + EROPT(B(1), · · · , B(C)) / √(min_i B(i)) )

distribution-free bounds on regret that hold on average as well as with high probability, where the Õ notation hides logarithmic factors. These bounds simplify to Õ(√T) for the standard MAB problem. Johnson et al. (2015) extend Thompson sampling to tackle the general BwK problem and obtain distribution-dependent bounds on regret of order Õ(√T) when one of the limited resources is time. Combes et al. (2015) consider a more general framework that allows for modeling any history-dependent constraint on the number of times any arm can be pulled along with a time horizon T, and obtain O(ln(T)) regret bounds. However, the benchmark oracle algorithm they use to define


regret only has knowledge of the distributions of the rewards, as opposed to the joint distributions of the rewards and the amounts of resource consumption. As a result, the difference between the total payoff of their benchmark oracle algorithm and EROPT(B(1), · · · , B(C)) is of order Θ(T). Babaioff et al. (2012) establish an Ω(√T) distribution-dependent lower bound on regret for a particular BwK problem with a time horizon, a stochastic resource, and a continuum of arms. This lower bound does not apply here, as we are considering finitely many arms, and it is well known that there is an exponential separation between the best possible expected regret when we move from finitely many arms to uncountably many arms for the standard MAB problem, see Kleinberg and Leighton (2003). Tran-Thanh et al. (2012) tackle BwK problems with a single limited resource whose consumption is deterministic and constrained by a global budget B. Although the algorithm proposed in Tran-Thanh et al. (2012) can be shown to incur O(ln(B)) regret, as we show in Section 7, the analysis is incorrect because the stopping time and the rewards obtained are correlated in a non-trivial way through the budget constraint and the decision rule used to select arms. Hence, conditioning on the stopping time, the sequence (r_{k,t})_{t∈N} may not be i.i.d. with distribution ν_k, for any k. Ding et al. (2013) extend the model of Tran-Thanh et al. (2012) by considering a stochastic resource. However, the analysis is incorrect and their algorithm incurs Θ(B) regret, as we show in Section 4.

Contributions.

We design a general-purpose algorithm with asymptotic regret bounds of:

(a) O(ln(B(1))) when: (i) C = 1 and (ii) a lower bound on min_{k=1,··· ,K} µ^c_k(1) is available to the decision maker. If a lower bound is not available, we obtain a O(ln(B(1))^{1+γ}) bound on regret for any γ > 0;

(b) O(ln(min(B(1), B(2)))) when: (i) C = 2, (ii) resource i = 1 is time, (iii) the ratio B(2)/B(1) is fixed, (iv) there is no arm k such that µ^c_k(2) = B(2)/B(1), and (v) a lower bound on min_{k=1,··· ,K} |µ^c_k(2) − B(2)/B(1)| is available to the decision maker. If (v) is not satisfied and regardless of whether the decision maker knows if (iv) holds or not, we obtain a O(ln(min(B(1), B(2)))^{1+γ}) bound on regret for any γ > 0 when (iv) holds and the regret bound is O(√min(B(1), B(2))) otherwise;

(c) O(ln(min_{i=1,··· ,C} B(i))) when: (i) c_{k,t}(1), · · · , c_{k,t}(C) are deterministic for any time t and any arm k and (ii) the ratios (B(i)/min_{j=1,··· ,C} B(j))_{i=1,··· ,C} are fixed. If (ii) is not satisfied, we get a regret bound of O(√( min_{i=1,··· ,C} B(i) · ln(min_{i=1,··· ,C} B(i)) )).

Assumptions (iii) of case (b) and (ii) of case (c) are widely used in the dynamic pricing literature, where the inventory scales linearly with the time horizon, see Besbes and Zeevi (2012). We present the algorithmic ideas underlying our approach in Section 3 and apply these ideas to Cases (a), (b), and (c) in Sections 4, 6, and 5, respectively.

3. Algorithmic ideas

To handle the exploration-exploitation trade-off, an approach that has proved to be particularly successful hinges on the optimism in the face of uncertainty paradigm. The idea is to consider all plausible scenarios consistent with the information collected so far and to select the decision that



yields the most revenue among all the scenarios identified. Concentration inequalities are intrinsic to the paradigm as they enable the development of systematic closed-form confidence intervals on the quantities of interest, which together define a set of plausible scenarios. We make repeated use of the following result.

Lemma 1 (Hoeffding's inequality, Hoeffding (1963)) Consider n random variables X_1, · · · , X_n with support in [0, 1].
If ∀t ≤ n, E[X_t | X_1, · · · , X_{t−1}] ≤ µ, then P[X_1 + · · · + X_n ≥ nµ + a] ≤ exp(−2a²/n) ∀a ≥ 0.
If ∀t ≤ n, E[X_t | X_1, · · · , X_{t−1}] ≥ µ, then P[X_1 + · · · + X_n ≤ nµ − a] ≤ exp(−2a²/n) ∀a ≥ 0.

Auer et al. (2002a) follow the optimism in the face of uncertainty paradigm to develop the Upper Confidence Bound algorithm (UCB1). UCB1 is based on the following observations: (i) the optimal strategy always consists in pulling the arm with the highest mean reward when time is the only limited resource, (ii) informally, Lemma 1 shows that µ^r_k ∈ [r̄_{k,t} − ǫ_{k,t}, r̄_{k,t} + ǫ_{k,t}] at time t with probability at least 1 − 2/t³ for ǫ_{k,t} = √(2·ln(t)/n_{k,t}), irrespective of the number of times arm k has been pulled. Based on these observations, UCB1 always selects the arm with the highest UCB index, i.e. a_t ∈ argmax_{k=1,··· ,K} I_{k,t}, where the UCB index of arm k at time t is defined as I_{k,t} = r̄_{k,t} + ǫ_{k,t}. The first term can be interpreted as an exploitation term, the ultimate goal being to maximize revenue, while the second term is an exploration term: the smaller n_{k,t}, the bigger it is. This fruitful paradigm goes well beyond this special case, and many extensions of UCB1 have been designed to tackle variants of the MAB problem, see for example Slivkins (2013). Agrawal and Devanur (2014) embrace the same ideas to tackle BwK problems. The situation is more complex in this all-encompassing framework as the optimal oracle algorithm involves pulling several arms. In fact, finding the optimal pulling strategy given the knowledge of the underlying distributions is already a challenge in its own right, see Papadimitriou and Tsitsiklis (1999) for a study of the computational complexity of similar problems. This raises the question of how to evaluate EROPT(B(1), · · · , B(C)) in (2). To overcome this issue, Badanidiyuru et al. (2013) upper bound the total expected payoff of any non-anticipating algorithm by the optimal value of a linear program, which is easier to compute.
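For reference, the UCB1 decision rule recalled above fits in a few lines. This is a generic sketch (not code from the paper), with the usual convention that each arm is pulled once before the index applies:

```python
import math

def ucb1_index(mean_estimate, n_pulls, t):
    """UCB index I_{k,t} = r_bar_{k,t} + sqrt(2 ln(t) / n_{k,t})."""
    return mean_estimate + math.sqrt(2.0 * math.log(t) / n_pulls)

def ucb1_choose(mean_estimates, pull_counts, t):
    """Pick the arm maximizing the UCB index; untried arms come first."""
    for k, n in enumerate(pull_counts):
        if n == 0:
            return k
    return max(range(len(pull_counts)),
               key=lambda k: ucb1_index(mean_estimates[k], pull_counts[k], t))
```

The exploitation term is the empirical mean; the exploration term shrinks as n_{k,t} grows, so a rarely pulled arm with the same empirical mean is preferred.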

Lemma 2 (Adapted from Badanidiyuru et al. (2013)) The total expected payoff of any non-anticipating algorithm is no greater than the optimal value of the linear program:

sup_{(ξ_k)_{k=1,··· ,K}}  Σ_{k=1}^{K} µ^r_k · ξ_k    (3)
subject to  Σ_{k=1}^{K} µ^c_k(i) · ξ_k ≤ B(i),  i = 1, · · · , C
            ξ_k ≥ 0,  k = 1, · · · , K

plus the constant term max_{k=1,··· ,K; i=1,··· ,C} µ^r_k / µ^c_k(i).
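For concreteness, the linear relaxation (3) can be handed directly to an off-the-shelf LP solver. The sketch below uses scipy's linprog (a convenience assumption; any LP solver would do) on a hypothetical 3-arm, 2-resource instance; the objective is negated because linprog minimizes:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical instance; the mu's would be unknown to the learner.
mu_r = np.array([0.9, 0.5, 0.7])            # mean rewards mu_k^r
mu_c = np.array([[0.8, 0.2, 0.5],           # mu_k^c(1), one row per resource
                 [0.3, 0.9, 0.5]])          # mu_k^c(2)
budgets = np.array([50.0, 80.0])            # B(1), B(2)

# Maximize sum_k mu_k^r * xi_k subject to mu_c @ xi <= budgets, xi >= 0.
res = linprog(c=-mu_r, A_ub=mu_c, b_ub=budgets,
              bounds=[(0, None)] * 3, method="highs")
upper_bound = -res.fun   # optimal value of (3): the payoff bound of Lemma 2
```

The optimal ξ_k can be read as (expected) numbers of pulls of each arm under the fluid approximation, which is what makes (3) a natural benchmark.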

In this paper, we use standard linear programming notions such as the concept of a basis or of a basic feasible solution. We refer to Bertsimas and Tsitsiklis (1997) for an introduction to linear


programming. For x a feasible basis for (3), we denote by (ξ^x_k)_{k=1,··· ,K} the corresponding basic feasible solution and by obj_x = Σ_{k=1}^{K} ξ^x_k · µ^r_k its objective function. From Lemma 2, we derive:

R_{B(1),··· ,B(C)} ≤ obj_{x*} − E[ Σ_{t=1}^{τ*−1} r_{a_t,t} ] + max_{k=1,··· ,K; i=1,··· ,C} µ^r_k / µ^c_k(i),    (4)

where x* is an optimal basis for (3). Lemma 2 also provides insight into designing non-oracle algorithms. The idea is to incorporate confidence intervals on the mean rewards and the mean amounts of resource consumption into the offline optimization problem (3) and to base the decision upon the resulting optimal solution. There are several ways to carry out this task, each leading to a different algorithm. Agrawal and Devanur (2014) use high-probability lower (resp. upper) bounds on the mean amounts of resource consumption (resp. rewards) in place of the unknown mean values in (3) and pull an arm at random based on the resulting optimal solution. They obtain

Õ( √EROPT(B(1), · · · , B(C)) + EROPT(B(1), · · · , B(C)) / √(min_i B(i)) )

regret bounds that hold on average as well as with high probability. If we relate this approach to UCB1, the intuition is clear: the idea is to be optimistic about both the rewards and the amounts of resource consumption. We propose the following algorithm, also based on the linear relaxation (3).

UCB-Simplex Take β ≥ 1 (β will need to be carefully chosen). At each time period t, proceed as follows.
Step 1: Find an optimal basis x_t to the linear program (e.g. by using the simplex algorithm):

sup_{(ξ_k)_{k=1,··· ,K}}  Σ_{k=1}^{K} (r̄_{k,t} + β · ǫ_{k,t}) · ξ_k    (5)
subject to  Σ_{k=1}^{K} c̄_{k,t}(i) · ξ_k ≤ B(i),  i = 1, · · · , C
            ξ_k ≥ 0,  k = 1, · · · , K

We denote the corresponding basic feasible solution by (ξ^{x_t,t}_k)_{k=1,··· ,K}.
Step 2: Identify the arms involved in the optimal basis, i.e. supp(x_t) = {k ∈ {1, · · · , K} | ξ^{x_t,t}_k > 0}. There are at most min(K, C) such arms. Use a load balancing algorithm A_{x_t}, to be specified, to determine which of these arms to pull.

The details of Step 2 are purposefully left out and will be specified for each of the cases treated in this paper. Compared to the approach developed in Agrawal and Devanur (2014), the idea remains to be overly optimistic, but only about the rewards, thus transferring the burden of exploration from the constraints to the objective function through the scaling factor β. For each of the situations considered in this paper, we develop tailored load balancing algorithms that are deterministic in the sense that, for any time period t, a_t ∈ F_{t−1}. The algorithm we propose is intrinsically tied to the existence of a basic feasible optimal solution to (3) and (5). We denote by B (resp. B_t) the set of feasible bases for (3) (resp. (5)). Step 1 of UCB-Simplex can be interpreted as an extension of the index-based decision rule of UCB1. Indeed, Step 1 consists in assigning an index I_{x,t} to each basis x ∈ B_t and selecting x_t ∈ argmax_{x∈B_t} I_{x,t}, where I_{x,t} = obj_{x,t} + E_{x,t} with a clear separation between the exploitation term, obj_{x,t} = Σ_{k=1}^{K} ξ^{x,t}_k · r̄_{k,t}, and the exploration term, E_{x,t} = β · Σ_{k=1}^{K} ξ^{x,t}_k · ǫ_{k,t}. Observe that for x ∈ B_t that is also feasible for (3), (ξ^{x,t}_k)_{k=1,··· ,K} and obj_{x,t} are plug-in estimates of (ξ^x_k)_{k=1,··· ,K} and obj_x. Also note that when β = 1 and when time is the only limited resource, UCB-Simplex is identical to UCB1, as Step 2 is unambiguous in this special case, each basis involving a single arm. For any x ∈ B, we define ∆_x = obj_{x*} − obj_x ≥ 0 as the optimality gap. A feasible basis x is said to be suboptimal if ∆_x > 0. At any time t, n_{x,t} denotes the number of times basis x has been selected at Step 1 up to time t, while n^x_{k,t} denotes the number of times arm k has been pulled up to time t when selecting x at Step 1. For all the cases treated in this paper, we will show that Step 1 of UCB-Simplex guarantees that a suboptimal basis cannot be selected more than O(ln(min_i B(i))) times on average, a result reminiscent of the regret analysis of UCB1 carried out in Auer et al. (2002a). However, in stark contrast with the situation of a single limited resource, this is merely a prerequisite to establish a O(ln(min_i B(i))) bound on regret. Indeed, a low-regret algorithm must also balance the load between the arms as closely as possible to optimality. Hence, the choice of the load balancing algorithms A_x is crucial to obtain O(ln(min_i B(i))) regret bounds. For mathematical convenience, we consider that the game carries on even if one of the resources is already exhausted, so that a_t is well-defined for any t ∈ N. Of course, the rewards obtained for t ≥ τ* are not taken into account in the decision maker's payoff when establishing regret bounds.
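Step 1 amounts to re-solving an optimistic LP each round with inflated reward estimates. A minimal sketch, leaning on scipy's dual-simplex solver as the LP oracle (an assumption of convenience; simplex-type methods return basic solutions) and a hypothetical 1e-9 tolerance for reading off the support:

```python
import numpy as np
from scipy.optimize import linprog

def ucb_simplex_step(r_bar, n_pulls, c_bar, budgets, t, beta=1.0):
    """Step 1 of UCB-Simplex: solve the optimistic LP (5) and return the
    basic feasible solution together with its support, i.e. the arms
    among which Step 2 (the load balancing rule, left abstract here)
    must choose. This is a sketch, not the paper's implementation."""
    eps = np.sqrt(2.0 * np.log(t) / n_pulls)            # epsilon_{k,t}
    res = linprog(c=-(r_bar + beta * eps),              # optimistic rewards
                  A_ub=c_bar, b_ub=budgets,
                  bounds=[(0, None)] * len(r_bar), method="highs-ds")
    xi = res.x
    support = [k for k in range(len(xi)) if xi[k] > 1e-9]
    return xi, support
```

Note that only the objective is optimistic; the constraints use the raw empirical estimates c̄_{k,t}(i), which is the key difference from Agrawal and Devanur (2014).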

4. A single limited resource

In this section, we tackle the case of a single resource whose consumption is limited by a global budget B. To simplify the notations, we omit the indices identifying the resources as there is only one, i.e. we write µ^c_k, c_{k,t}, c̄_{k,t}, and B as opposed to µ^c_k(1), c_{k,t}(1), c̄_{k,t}(1), and B(1). We start by strengthening Assumption 1.

Assumption 2 There exists λ > 0, known to the decision maker, such that min_{k=1,··· ,K} µ^c_k ≥ λ.

Assumption 2 serves two purposes. First, the analysis conducted under this additional assumption can be extended to the general setting at the price of more technicalities, as detailed in Section 7. Second, this leads to better bounds on regret if the decision maker happens to have access to such information. Ding et al. (2013) propose a UCB-based algorithm to solve the problem under Assumption 2, but the analysis is incorrect and the algorithm has Θ(B) regret, as we next show.

UCB-BV1 from Ding et al. (2013) Pull each arm once. At time t, pull:

a_t ∈ argmax_{k=1,··· ,K} [ r̄_{k,t}/c̄_{k,t} + ( (1 + 1/λ) · √(ln(t−1)/n_{k,t}) ) / ( λ − √(ln(t−1)/n_{k,t}) ) ].

Lemma 3 (Performance of UCB-BV1) There exists a distribution (ν_k)_{k=1,··· ,K} such that R_B = Θ(B).

UCB-BV1 fails in some cases because the exploration term may be negative when n_{k,t} is small, which in turn implies that there is a negative incentive to explore arms that have rarely been pulled.
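The failure mode behind Lemma 3 is easy to check numerically. With the hypothetical values λ = 0.5 and t = 100 below, √(ln(t−1)/n_{k,t}) exceeds λ for a once-pulled arm, so the denominator λ − √(ln(t−1)/n_{k,t}) is negative and the "exploration bonus" of UCB-BV1 penalizes exactly the arms that most need exploring:

```python
import math

def ucb_bv1_bonus(n, t, lam):
    """Exploration term of the UCB-BV1 index displayed above."""
    e = math.sqrt(math.log(t - 1) / n)
    return (1.0 + 1.0 / lam) * e / (lam - e)

# One pull: sqrt(ln 99) is about 2.14 > lambda, so the bonus is negative.
bonus_small_n = ucb_bv1_bonus(n=1, t=100, lam=0.5)
# Many pulls: the bonus is positive and small, as intended.
bonus_large_n = ucb_bv1_bonus(n=1000, t=100, lam=0.5)
```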


We implement UCB-Simplex with β = 1 + 1/λ. The algorithm is preceded by an initialization step, which consists in pulling each arm until the amount of resource consumed as a result of pulling that arm is non-zero. The purpose of this step is to have c̄_{k,t} > 0 for all periods to come and for all arms. We omit this step in the theoretical analysis because the amount of resource consumed is O(1) and the reward obtained is non-negative and not taken into account in the decision maker's total payoff. Moreover, the initialization step ends in finite time almost surely as a result of Assumption 2. Step 2 of UCB-Simplex is unambiguous here, as basic feasible solutions involve a single arm. Hence, we identify a basis x = {k} with the corresponding arm and write x = k to simplify the notations. In particular, k* ∈ {1, · · · , K} identifies an optimal arm in the sense defined in Section 3. The exploitation and exploration terms defined in Section 3 specialize to:

obj_{k,t} = B · r̄_{k,t}/c̄_{k,t} and E_{k,t} = B · (1 + 1/λ) · ǫ_{k,t}/c̄_{k,t},

so that k* ∈ argmax_{k=1,··· ,K} µ^r_k/µ^c_k and a_t ∈ argmax_{k=1,··· ,K} [ r̄_{k,t}/c̄_{k,t} + (1 + 1/λ) · ǫ_{k,t}/c̄_{k,t} ]. We point out that, for the particular setting considered in this section, UCB-Simplex is almost identical to the fractional KUBE algorithm proposed in Tran-Thanh et al. (2012) to tackle the case of a single resource whose consumption is deterministic. It only differs by the presence of the scaling factor β to favor exploration over exploitation, which becomes unnecessary when the amounts of resource consumed are deterministic, see Section 7.

Regret analysis of UCB-Simplex. First observe that (4) specializes to:

R_B ≤ (B + 1) · max_{k=1,··· ,K} µ^r_k/µ^c_k − E[ Σ_{t=1}^{τ*−1} r_{a_t,t} ].    (6)

To bound the right-hand side, we start by bounding the expected time horizon.

Lemma 4 For any non-anticipating algorithm, E[τ*] ≤ (B + 1) / min_{k=1,··· ,K} µ^c_k.

The next result is crucial. Used in combination with Lemma 4, it shows that any suboptimal arm is pulled at most O(ln(B)) times in expectation, a well-known result for UCB1, see Auer et al. (2002a). The proof is along the same lines as the proof for UCB1, with the additional difficulty of having to deal with the random stopping time and the fact that the amount of resource consumed at each step is stochastic.

Lemma 5 For any suboptimal arm k, E[n_{k,τ*}] ≤ 2β_k · E[ln(τ*)] + C_k with:

β_k = max( 1/(µ^c_k − λ/2)², 32 · [ (B/∆_k) · (1/λ) · (1 + 1/λ) ]² ) and C_k ≤ (4 + π²/6) · 1/(1 − exp(−2(µ^c_k − λ/2)²)).

Building on the last two results, we are now ready to establish a finite-time regret bound.

Proposition 6

R_B ≤ ( 2 · Σ_{k | ∆_k>0} µ^c_k · (∆_k/B) · β_k ) · ln( (B + 1)/min_{k=1,··· ,K} µ^c_k ) + Σ_{k | ∆_k>0} µ^c_k · (∆_k/B) · C_k + 1 + µ^r_{k*}/µ^c_{k*},

where β_k and C_k are defined in Lemma 5.

Observe that the set of optimal arms, namely argmax_k µ^r_k/µ^c_k, does not depend on B. Since ∆_k/B = µ^r_{k*}/µ^c_{k*} − µ^r_k/µ^c_k = O(1) for any suboptimal arm, we conclude that R_B = O(ln(B)). Interestingly, the algorithm we propose does not rely on B to achieve this regret bound, much like what happens for UCB1 with the time horizon, see Auer et al. (2002a). This result is optimal up to constant factors as the standard MAB problem is a special case of the framework considered in this section, see Lai and Robbins (1985) for a proof of a lower bound in this context.

5. Arbitrarily many limited resources whose consumption is deterministic

In this section, we study the case of multiple limited resources when the amounts of resources consumed as a result of pulling an arm are deterministic and globally constrained by prescribed budgets (B(i))_{i=1,··· ,C}, where C is the number of resources. Because the amounts of resources consumed are deterministic, the exploitation (resp. exploration) terms defined in Section 3 specialize to obj_{x,t} = Σ_{k=1}^{K} ξ^x_k · r̄_{k,t} (resp. E_{x,t} = Σ_{k=1}^{K} ξ^x_k · ǫ_{k,t}) and we can substitute the notation c_k(i) for µ^c_k(i). We point out that the stopping time need not be deterministic, as the decision to select an arm is based on the past realizations of the rewards. We define r_{1,··· ,C} ≤ min(C, K) as the rank of the matrix composed of the first C inequalities in (3). Just like in Section 4, UCB-Simplex is preceded by an initialization step, which now consists in pulling each arm r_{1,··· ,C} − 1 times. The motivation behind this step is mainly technical and is simply meant to have:

n_{k,t} ≥ (r_{1,··· ,C} − 1) + Σ_{x | k∈supp(x), x≠{k}} n^x_{k,t}  ∀t, ∀k ∈ {1, · · · , K}.    (7)

We discard the initialization step in the theoretical study because the amounts of resources consumed are O(1) and the total reward obtained is non-negative and not taken into account in the decision maker's total payoff. Compared to Section 4, we choose to take β = 1 and we are now required to specify the load balancing algorithms involved in Step 2 of UCB-Simplex, as the feasible basis selected at Step 1 may involve several arms. Although Step 2 will also need to be specified in Section 6, designing good load balancing algorithms is arguably easier when the amounts of resources consumed as a result of pulling arms are deterministic because the optimal load balance is known for each basis from the start. Nonetheless, one challenge remains: we can never identify the (possibly many) optimal bases of (3) with absolute certainty. As a result, every basis selected at Step 1 should be treated as potentially optimal when balancing the load between the arms involved in this basis, but this inevitably causes some interference issues as an arm may be involved in several bases, and worse, possibly several optimal bases. Therefore, one point that will appear to be of particular importance in the analysis is the use of load balancing algorithms that are decoupled from one another. We use the following class of load balancing algorithms.

Load balancing algorithm A_x for a feasible basis x ∈ B If basis x is selected at time t, pull any arm k ∈ supp(x) such that n^x_{k,t} ≤ n_{x,t} · ξ^x_k / Σ_{l=1}^{K} ξ^x_l.

The load balancing algorithms A_x thus defined are decoupled because, for each basis, the number of times an arm has been pulled when selecting another basis is not taken into account. The following lemma shows that A_x is always well-defined and establishes a few properties.


Lemma 7 A_x is always well-defined and moreover, at any time t, for any basis x ∈ B, and for any arm k ∈ supp(x):

n_{x,t} · ξ^x_k / Σ_{l=1}^{K} ξ^x_l − (r_{1,··· ,C} − 1) ≤ n^x_{k,t} ≤ n_{x,t} · ξ^x_k / Σ_{l=1}^{K} ξ^x_l + 1 almost surely,

while n^x_{k,t} = 0 for any arm k ∉ supp(x).

Observe that implementing the load balancing algorithms A_x may require a memory storage capacity exponential in C and polynomial in the number of arms, although always bounded by O(min_i B(i)) (because we do not need to keep track of n^x_{k,t} if x has never been selected). In practice, only a few bases will be selected at Step 1, so that a hash table is an appropriate data structure to store the sequences (n^x_{k,t})_{k∈supp(x)}.

Regret analysis of UCB-Simplex. To begin the analysis, we start by bounding the number of times any suboptimal basis can be selected at Step 1, in the same spirit as in Section 4.

Lemma 8 For any suboptimal basis x ∈ B, E[n_{x,τ*}] ≤ 2β_x · E[ln(τ*)] + r_{1,··· ,C} · π²/3 with β_x = 8 · r_{1,··· ,C}/(min_{i,k} c_k(i))² · (min_i B(i)/∆_x)².

Lemma 8, used in combination with a variant of Lemma 4, shows that a suboptimal basis is selected at most O(ln(min_i B(i))) times. To establish the regret bound, it remains to lower bound the expected total payoff derived when selecting any of the optimal bases. This is more involved than in the case of a single limited resource because the load balancing step comes into play at this stage.

Proposition 9

R_{B(1),··· ,B(C)} ≤ 16 · r_{1,··· ,C}/(min_{i,k} c_k(i))² · ( Σ_{x∈O^c∩B} min_i B(i)/∆_x ) · ln( min_i B(i)/min_{i,k} c_k(i) + 1 )
+ ( Σ_{x∈O^c∩B} ∆_x/min_i B(i) ) · r_{1,··· ,C} · π²/3
+ ( 1 + (K + r_{1,··· ,C})/K ) · r_{1,··· ,C} · ( 1/min_{i,k} c_k(i) + r_{1,··· ,C} ) + max_{k,i} µ^r_k/c_k(i) + 1,

where O denotes the set of optimal feasible bases for (3).

Proposition 9 establishes a finite-time regret bound. We now move on to analyze the asymptotic behavior of R_{B(1),··· ,B(C)} when min_i B(i) → ∞. Without loss of generality, we may assume that, for every two resources i, j ∈ {1, · · · , C}, max_k c_k(j) · (B(i) + 1) ≥ min_k c_k(i) · B(j). Indeed, resource j would otherwise never be limiting and could be discarded. Hence, we may assume that the ratios (B(i)/B(j))_{i,j∈{1,··· ,C}} are bounded. As it turns out, the regret bound established in Proposition 9 is not always of order O(ln(min_i B(i))), even though this bound appears to be very similar to the one derived in Proposition 6 for a single limited resource. The fundamental difference is that the set of optimal bases may not converge, while it is always invariant in the case of a single limited resource. Typically, the ratios (B(i)/min_j B(j))_{i∈{1,··· ,C}} may oscillate around a finite limit such that there exist two optimal bases for (3) when this limit is taken as the right-hand side of the constraints while


there is a unique optimal basis for (3) whenever the right-hand side is slightly perturbed around this limit. This alternately causes one of these two bases to be slightly suboptimal, a situation that is difficult for the decision maker to identify and cope with. Nevertheless, this difficulty does not arise in several situations of interest.

Proposition 10 Suppose that the ratios (B(i)/min_j B(j))_{i∈{1,···,C}} all converge to finite values and that either:

• there exists a unique optimal basis for (3) when the right-hand side is taken as lim( B(1)/min_i B(i), ···, B(C)/min_i B(i) ),

• or B(j)/min_i B(i) − lim B(j)/min_i B(i) = O( ln(min_i B(i)) / min_i B(i) ) for all j ∈ {1,···,C};

then R_{B(1),···,B(C)} = O(ln(min_i B(i))). In particular, if the ratios (B(i)/min_j B(j))_{i∈{1,···,C}} remain constant, Proposition 10 applies. This assumption is widely used in the dynamic pricing literature, where the inventory scales linearly with the time horizon, see Besbes and Zeevi (2012).

6. A time horizon and another limited resource

In this section, we investigate the case of two limited resources, one of which is assumed to be time, with a time horizon T, while the consumption of the other is stochastic and constrained by a global budget B. To simplify the notations, we omit the indices identifying the resources: since the first limited resource is time, we write µ^c_k, c_{k,t}, c̄_{k,t}, B, and T as opposed to µ^c_k(2), c_{k,t}(2), c̄_{k,t}(2), B(2), and B(1). Moreover, we refer to resource i = 2 as "the" resource. Observe that, in the particular setting considered in this section, τ* = min(τ(B), T + 1) with τ(B) = min{ t ∈ N | Σ_{τ=1}^t c_{a_τ,τ} > B }.

Simplifying assumptions. To focus on the main ideas, we strengthen Assumption 1.

Assumption 3 There exists ε > 0, known to the decision maker, such that:

• c_{k,t} ≥ ε almost surely, for all arms k ∈ {1,···,K} and for all time periods t,

• |µ^c_k − B/T| ≥ ε, for all arms k ∈ {1,···,K}.

These assumptions can be relaxed to a great extent at the price of more technicalities and the loss of finite-time regret bounds, which turn into asymptotic ones. Essentially, if Assumption 3 does not hold, ε should be taken as a vanishing function of min(B, T), as detailed in Section 7. Note that the budget constraint is not limiting if B ≥ T and, conversely, the time constraint is not limiting if B ≤ εT. In both cases, there is a single limited resource and we are back in the situation of Section 4. Therefore, we assume, without loss of generality, that εT ≤ B ≤ T.

Technical difficulties. We implement UCB-Simplex with β = 1 + 1/ε. Because the amount of resource consumed at each time step is a random variable, a feasible basis for (5) may not be feasible for (3) and conversely. This is in contrast to the situation studied in Section 5. As a consequence, x* may not be feasible for (5), thus effectively preventing it from being selected at Step 1 of UCB-Simplex, and a basis infeasible for (3) may be selected instead. To guarantee that these events, while still possible, occur with low probability, UCB-Simplex is preceded by an initialization phase which consists in pulling each arm ⌈(1/ε²) ln(B/ε + 1)⌉ times. Hence, UCB-Simplex starts at round t_i = K · ⌈(1/ε²) ln(B/ε + 1)⌉ + 1. Observe that τ* ≤ B/ε + 1 almost surely as a consequence of the first part of Assumption 3. This implies that, at any time t ≥ t_i, any arm has been pulled at least (1/ε²) · ln(t) times, i.e.:

n_{k,t} ≥ (1/ε²) · ln(t), ∀ t_i ≤ t ≤ τ*, ∀ k ∈ {1,···,K}. (8)

Load balancing algorithms. We need to specify Step 2 of UCB-Simplex because the basis selected at Step 1 may involve two arms. When x_t involves two arms, say x_t = {k, l}, we use a load balancing algorithm specific to this basis, which we recall is denoted by A_{k,l}, to determine which of these two arms to pull. As in Section 5, using load balancing algorithms that are decoupled from one another is crucial: the decision maker can never identify the optimal basis with absolute certainty, which implies that each basis should be treated as potentially optimal when balancing the load between the arms, but this inevitably causes interference issues as an arm may be involved in several bases. To decouple the load balancing algorithms, we impose that they only rely on the historical samples observed when selecting the corresponding basis. For the same reason, the data obtained at the initialization step is not used by any of the load balancing algorithms. Compared to Section 5, we face an additional challenge when designing the load balancing algorithms: the optimal load balances are unknown to the decision maker. It turns out that we can still approximately achieve the unknown optimal load balances by enforcing that the average amount of resource consumed at each step be B/T.

Load balancing algorithm A_{k,l} for any two arms (k, l). If this is the first time that this basis is selected, pull any arm, say k, ⌈(1/ε²) ln(B/ε + 1)⌉ times. Define y_{k,l} = k if the average per-step amount of resource consumed during this initialization phase is larger than B/T, and y_{k,l} = l otherwise. For any subsequent time period t, define b_t^{k,l} as the total amount of resource consumed when selecting basis {k, l} after the initialization phase. If basis {k, l} is selected at time t, pull arm y_{k,l} if b_t^{k,l} ≤ n_{{k,l},t} · B/T, and pull the other arm otherwise.
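A minimal sketch of the rule A_{k,l} (omitting its own initialization phase): pull the "heavy" arm y only while the resource consumed under this basis stays below n_{x,t} · B/T. The class name, the deterministic consumption values, and the tolerance below are hypothetical.

```python
class LoadBalancerKL:
    """Sketch of A_{k,l}: track b_t (resource spent while this basis was
    selected) and keep the realized per-step rate near the target B/T."""

    def __init__(self, y, other, target_rate):
        self.y, self.other = y, other  # y: arm consuming more than B/T on average
        self.rate = target_rate        # B / T
        self.b = 0.0                   # b_t^{k,l}
        self.n = 0                     # n_{{k,l},t}

    def next_arm(self):
        return self.y if self.b <= self.n * self.rate else self.other

    def record(self, consumed):
        self.b += consumed
        self.n += 1

bal = LoadBalancerKL(y=0, other=1, target_rate=0.5)
consumption = {0: 0.9, 1: 0.1}  # hypothetical mean consumptions straddling B/T
for _ in range(1000):
    bal.record(consumption[bal.next_arm()])
assert abs(bal.b / bal.n - 0.5) < 0.05  # realized rate tracks B/T
```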

Regret Analysis of UCB-Simplex. We use the shorthand P = {(k, l) | µ^c_k > B/T > µ^c_l} to identify feasible pairs of arms, and we denote by P_t = {(k, l) | c̄_{k,t} > B/T > c̄_{l,t}} its empirical counterpart at time t. We stress that P is not initially known to the decision maker. Observe that the exploration and exploitation terms defined in Section 3 specialize to:

obj_{k,t} = T · r̄_{k,t} if c̄_{k,t} ≤ B/T, and obj_{k,t} = (B/c̄_{k,t}) · r̄_{k,t} if c̄_{k,t} > B/T,
E_{k,t} = β · T · ε_{k,t} if c̄_{k,t} ≤ B/T, and E_{k,t} = β · (B/c̄_{k,t}) · ε_{k,t} if c̄_{k,t} > B/T,

and:

obj_{{k,l},t} = T · ( (c̄_{k,t} − B/T)/(c̄_{k,t} − c̄_{l,t}) · r̄_{l,t} + (B/T − c̄_{l,t})/(c̄_{k,t} − c̄_{l,t}) · r̄_{k,t} ),
E_{{k,l},t} = β · T · ( (c̄_{k,t} − B/T)/(c̄_{k,t} − c̄_{l,t}) · ε_{l,t} + (B/T − c̄_{l,t})/(c̄_{k,t} − c̄_{l,t}) · ε_{k,t} ).
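The specialized index computations can be sketched numerically. All inputs below are hypothetical empirical quantities; the helper names are illustrative, not the paper's code.

```python
def obj_single(r_bar, c_bar, B, T):
    """obj_{k,t}: pull arm k until time or budget runs out, whichever binds."""
    return T * r_bar if c_bar <= B / T else (B / c_bar) * r_bar

def obj_pair(r_k, c_k, r_l, c_l, B, T):
    """obj_{{k,l},t}: convex combination matching the budget rate B/T exactly."""
    w_k = (B / T - c_l) / (c_k - c_l)  # fraction of pulls allocated to arm k
    return T * ((1 - w_k) * r_l + w_k * r_k)

B, T = 500.0, 1000.0
assert abs(obj_single(0.9, 0.2, B, T) - 900.0) < 1e-9   # time binds: T * r_bar
assert abs(obj_single(0.9, 0.8, B, T) - 562.5) < 1e-9   # budget binds: (B/c) * r_bar
assert abs(obj_pair(0.9, 0.8, 0.5, 0.2, B, T) - 700.0) < 1e-6  # w_k = 0.5 here
```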

As stressed at the beginning of this section, UCB-Simplex may sometimes select an infeasible basis since P_t may differ from P. We start by proving that this does not happen very often.

Lemma 11 For any basis x ∉ B, E[n_{x,τ*}] ≤ π² / (3 · (1 − exp(−2ε²))).


As in Sections 4 and 5, a crucial property is that any suboptimal feasible basis is selected at most O(ln(min(B, T))) times on average.

Lemma 12 For any suboptimal basis x ∈ B, E[n_{x,τ*}] ≤ 2β_x · E[ln(τ*)] + C_x with:

β_x = 8 · ( β·T / Δ_x )², C_x = (1/ε²) ⌈ln(B/ε + 1)⌉ + 2π² + π² / (1 − exp(−2ε²)),

if x involves a single arm, and

β_x = 1/ε² + ( (ε + 4)/ε ) · 8 · ( β·T / Δ_{{k,l}} )²,

C_x = (1/ε²) ⌈ln(B/ε + 1)⌉ + 2/ε + π² / ( 3 · (1 − exp(−2ε²)) · (exp(2ε(ε + 4)) − 1) ) + 2π² + π² / (1 − exp(−2ε²)),

if x = {k, l}.

It remains to lower bound the expected total payoff derived when selecting any of the optimal bases. The major difficulty lies in the fact that the amounts of resource consumed, the rewards obtained, and the stopping time are correlated in a non-trivial way through the budget constraint and the decisions made in the past. This makes it difficult to study the expected total payoff derived when selecting optimal bases independently from the amounts of resource consumed and the rewards obtained when selecting suboptimal ones. However, a key point is that, by design, the decision made at Step 2 of UCB-Simplex is based solely on the past history associated with the basis selected at Step 1, because the load balancing algorithms are decoupled. For this reason, the analysis proceeds in two steps, irrespective of the nature of the possibly many optimal bases. In a first step, we study the expected total payoff obtained by a decision maker that systematically selects the same basis. In a second step, we use a coupling argument and Lemma 12 to establish the regret bound. We start with the case of a single-armed basis:

Lemma 13 Suppose the decision maker only selects the basis x = {k}. We have:

E[n_{k,τ*−1}] ≥ T − (1 + 1/ε²) if µ^c_k < B/T, and E[n_{k,τ*−1}] ≥ B/µ^c_k − (1 + 1/(1 − exp(−2ε²))) if µ^c_k > B/T.

We move on to study bases involving two arms:

Lemma 14 Suppose that the decision maker only selects the basis x = {k, l} with (k, l) ∈ P. We have:

E[n_{k,τ*−1}] ≥ T · (B/T − µ^c_l)/(µ^c_k − µ^c_l) − (1/(2ε⁴)) · ln(T) − 3/ε² − (1/(1 − exp(−2ε²))) · (2 + 1/ε), (9)

E[n_{l,τ*−1}] ≥ T · (µ^c_k − B/T)/(µ^c_k − µ^c_l) − (1/(2ε⁴)) · ln(T) − 3/ε² − (1/(1 − exp(−2ε²))) · (2 + 1/ε).

We are now ready to establish a finite-time bound on regret.


Proposition 15

R_{B,T} ≤ 16 · (1 + 4/ε)³ · [ Σ_{x∈O^c∩B} (T/Δ_x)² ] · ln(B/ε + 1) + O(ln(B)),

where O denotes the set of optimal solutions to (3) and where the constant factors hidden in the O notation do not depend on the unobserved distributions (ν_k)_{k=1,···,K}. These factors are explicitly computed in the proof but are not reproduced here to simplify the presentation. We finally proceed to analyze the asymptotic behavior of R_{B,T} when min(B, T) → ∞, in the same fashion as in Section 5.

Proposition 16 Suppose that B/T converges to a finite value and that either:

• there exists a unique optimal basis for (3) when the right-hand side is taken as (1, lim B/T),

• or B/T − lim B/T = O( ln(min(B, T)) / min(B, T) );

then R_{B,T} = O(ln(min(B, T))). In particular, if the ratio B/T remains fixed, we get R_{B,T} = O(ln(min(B, T))). This assumption is widely used in the dynamic pricing literature, where the inventory scales linearly with the time horizon, see Besbes and Zeevi (2012).

7. Extensions

7.1. A single limited resource

Deterministic amounts of resource consumption. Because the amounts of resource consumed are deterministic, we can substitute the notation c_k for µ^c_k. Moreover, we can take β = 1 and, going through the analysis of Lemma 5, we can slightly refine the bound as follows. Observe that, for any arm k such that Δ_k > 0, E[n_{k,τ*}] ≤ 2β_k · E[ln(τ*)] + C_k with β_k = 8 · ( B/(Δ_k · c_k) )² and C_k = π²/3. As a result, the regret bound derived in Proposition 6 turns into:

R_B ≤ ( 16 Σ_{k | Δ_k>0} B/(Δ_k · c_k) ) · ln( (B + 1)/min_k c_k ) + (π²/3) Σ_{k | Δ_k>0} c_k · Δ_k/B + 1 + µ^r_{k*}/c_{k*}.

Relaxing Assumption 2. Achieving a O(ln(B)^{1+γ}) bound on regret for any γ > 0 is still possible by systematically adding an offset of 1/ln(B)^{γ/4} to the observed amounts of resource consumed and by taking λ = 1/ln(B)^{γ/4}. Proceeding this way, the analysis carried out in Lemma 5 is simplified because the amount of resource consumed at each step becomes almost surely no smaller than λ, thus making the disjunction on c̄_{k,t} ≤ λ/2 unnecessary. Furthermore, observe that, for B large enough, we have:

argmax_{k=1,···,K} µ^r_k / ( µ^c_k + 1/ln(B)^{γ/4} ) ⊂ argmax_{k=1,···,K} µ^r_k / µ^c_k

and

µ^r_m / ( µ^c_m + 1/ln(B)^{γ/4} ) > µ^r_l / ( µ^c_l + 1/ln(B)^{γ/4} ) + Δ_l/(2B),

for any m ∈ argmax_{k=1,···,K} µ^r_k/µ^c_k and l ∉ argmax_{k=1,···,K} µ^r_k/µ^c_k. As a consequence, the bound derived in Lemma 5 turns into:

E[n_{k,τ*}] ≤ 2β_k · E[ln(τ*)] + C_k,

for any k such that Δ_k > 0, with β_k = 32 · [ (2B/(Δ_k · λ)) · (1 + 1/λ) ]² and C_k = 2π²/3. Going through the proof of Proposition 6, we obtain the asymptotic regret bound R_B = O( max_{k | Δ_k>0} β_k · ln(B) ) = O( [ln(B)^{γ/4} · (1 + ln(B)^{γ/4})]² · ln(B) ) = O(ln(B)^{1+γ}).

Distribution-free regret bound. Starting from the last inequality derived in the proof of Proposition 6, and using the fact that E[n_{k,τ*}] ≤ E[τ*] ≤ (B + 1)/min_k µ^c_k, we have:

R_B ≤ Σ_{k | Δ_k>0} [ µ^c_k · min( (Δ_k/B) · (B + 1)/min_k µ^c_k , (2Δ_k/B) · β_k · ln( (B + 1)/min_k µ^c_k ) + (Δ_k/B) · C_k ) ] + 1 + µ^r_{k*}/µ^c_{k*}.

Maximizing on Δ_k/B ∈ [0, 1/λ], we obtain the distribution-free bound R_B = O(√(B · ln(B))).

7.2. Arbitrarily many limited resources whose consumption is deterministic

Distribution-free regret bound. Along the same lines as for the case of a single limited resource, we can obtain a distribution-free bound on regret of O(√( min_i B(i) · ln(min_i B(i)) )). We start with the following inequality, obtained in the course of proving Proposition 9:

R_{B(1),···,B(C)} ≤ Σ_{x∈O^c∩B} ( Δ_x / min_i B(i) ) · E[n_{x,τ*}] + |B| · ( r_{1,···,C} / min_{i,k} c_k(i) + (r_{1,···,C})² ) + max_{k,i} µ^r_k / c_k(i) + 1.

Since n_{x,τ*} ≤ τ* ≤ ( min_i B(i) + 1 ) / min_{i,k} c_k(i), assuming min_i B(i) ≥ 1, we get:

( Δ_x / min_i B(i) ) · E[n_{x,τ*}] ≤ 2 Δ_x / min_{i,k} c_k(i).

Combining this last inequality with the bounds obtained in Lemma 8, we derive:

R_{B(1),···,B(C)} ≤ Σ_{x∈O^c∩B} min[ 2 Δ_x / min_{i,k} c_k(i) , 16 · ( r_{1,···,C} / (min_{i,k} c_k(i))² ) · ( min_i B(i) / Δ_x ) · ln( min_i B(i) / min_{i,k} c_k(i) + 1 ) + (π²/3) · ( Δ_x / min_i B(i) ) · r_{1,···,C} ] + O(1).

Maximizing each term on Δ_x ∈ [0, min_i B(i)/min_{i,k} c_k(i)] yields the claim, since:

Δ_x ≤ Σ_{k=1}^K µ^r_k · ξ^{x*}_k ≤ Σ_{k=1}^K ξ^{x*}_k ≤ min_i B(i) / min_{i,k} c_k(i),

where the rightmost inequality is derived in the proof of Proposition 9.


7.3. A time horizon and another limited resource

Relaxing Assumption 3. In the same fashion as in Section 7.1, we add an offset of 1/ln(B)^γ to the observed amounts of resource consumed and take ε = 1/ln(B)^γ. There are two cases:

• Case 1: there is no arm k with µ^c_k = B/T such that the basis {k} is optimal for (3). In this case, the analysis can be adapted to obtain an asymptotic regret bound of O(ln(min(B, T))^{1+4γ}). The derivation is more technical because we need to account for the possibility of selecting a suboptimal basis {k, l} with µ^c_k = B/T at Step 1, but it otherwise follows the same ideas as in Sections 6 and 7.1.

• Case 2: there is an arm k with µ^c_k = B/T such that the basis {k} is optimal for (3). In this case, the analysis leads to a regret bound of O(√min(B, T)). The fundamental reason behind this phase transition is that the upper bound on EROPT(B, T) derived in Lemma 2 is too loose. Indeed, it is given by T · µ^r_k + O(1), which is the payoff one would obtain by pulling arm k roughly T times. Now observe that, even if the decision maker were only pulling arm k, he would only be able to pull this arm E[min(T, τ(B))] times on average. Hence, if the random variables (c_{k,t})_{t∈N} are not deterministic, we have E[min(T, τ(B))] ∼ T − Θ(√T).
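The Θ(√T) shortfall in Case 2 can be illustrated numerically. The Bernoulli consumption model, horizon, and seed below are hypothetical choices for the sketch.

```python
import random

random.seed(0)
T, B = 10_000, 5_000        # budget rate B/T = 1/2, matched by the arm's mean
runs, total_pulls = 400, 0
for _ in range(runs):
    pulls, spent = 0, 0
    # pull the single arm until time runs out or the budget is exceeded,
    # i.e. for min(T, tau(B)) rounds
    while pulls < T and spent <= B:
        pulls += 1
        spent += random.random() < 0.5  # Bernoulli(1/2) consumption
    total_pulls += pulls
deficit = T - total_pulls / runs
# even with mu^c = B/T exactly, the stochastic budget cuts off ~sqrt(T) pulls
assert 1 < deficit < 5 * (T ** 0.5)
```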

8. Concluding remark

The existence of an algorithm with an expected bound on regret of order O(ln(min_i B(i))) in the case of multiple stochastic resources remains an open question. While Step 1 of UCB-Simplex can still be shown to guarantee that a suboptimal basis cannot be selected more than O(ln(min_i B(i))) times on average, the development of a general-purpose load balancing algorithm is more challenging.

References

S. Agrawal and N. R. Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of the 15th ACM Conference on Economics and Computation, pages 989–1006, 2014.

S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Annual Conference on Learning Theory, volume 23, 2012.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002a.

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002b.

M. Babaioff, S. Dughmi, R. Kleinberg, and A. Slivkins. Dynamic pricing with limited supply. In Proceedings of the 13th ACM Conference on Electronic Commerce, pages 74–91, 2012.

A. Badanidiyuru, R. Kleinberg, and A. Slivkins. Bandits with knapsacks. In 54th IEEE Annual Symposium on Foundations of Computer Science, pages 207–216, 2013.

D. Bertsimas and J. N. Tsitsiklis. Introduction to Linear Optimization, volume 6. Athena Scientific, 1997.


O. Besbes and A. Zeevi. Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operations Research, 57(6):1407–1420, 2009.

O. Besbes and A. Zeevi. Blind network revenue management. Operations Research, 60(6):1537–1550, 2012.

R. Combes, C. Jian, and R. Srikant. Bandits with budgets: Regret lower bounds and optimal algorithms. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 245–257, 2015.

W. Ding, T. Qin, X. Zhang, and T. Liu. Multi-armed bandit with budget constraint and variable costs. In 27th AAAI Conference on Artificial Intelligence, pages 232–238, 2013.

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

K. Johnson, D. Simchi-Levi, and H. Wang. Online network revenue management using Thompson sampling. Working Paper, 2015.

R. Kleinberg and T. Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In 44th IEEE Annual Symposium on Foundations of Computer Science, pages 594–605, 2003.

T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of optimal queuing network control. Mathematics of Operations Research, 24(2):293–305, 1999.

H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

A. Slivkins. Dynamic ad allocation: Bandits with budgets. arXiv preprint arXiv:1306.0155, 2013.

L. Tran-Thanh, A. Chapman, A. Rogers, and N. R. Jennings. ε-first policies for budget-limited multi-armed bandits. In 24th AAAI Conference on Artificial Intelligence, pages 1211–1216, 2010.

L. Tran-Thanh, A. Chapman, A. Rogers, and N. R. Jennings. Knapsack based optimal policies for budget-limited multi-armed bandits. In 26th AAAI Conference on Artificial Intelligence, pages 1134–1140, 2012.

L. Tran-Thanh, C. Stavrogiannis, V. Naroditskiy, V. Robu, N. R. Jennings, and P. Key. Efficient regret bounds for online bid optimisation in budget-limited sponsored search auctions. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, pages 809–818, 2014a.

L. Tran-Thanh, S. Stein, A. Rogers, and N. R. Jennings. Efficient crowdsourcing of unknown experts using multi-armed bandits. Artificial Intelligence, 214:89–111, 2014b.


Appendix A. Proof of Lemma 2

The proof can be found in Badanidiyuru et al. (2013). For the sake of completeness, we sketch it here. The optimization problem (3) is a linear program whose dual reads:

inf_{(η_i)_{i=1,···,C}} Σ_{i=1}^C B(i) · η_i
subject to Σ_{i=1}^C µ^c_k(i) · η_i ≥ µ^r_k, k = 1,···,K (10)
η_i ≥ 0, i = 1,···,C.

Observe that (3) is feasible, therefore (3) and (10) have the same optimal value. Note that (3) is bounded, as ξ_k ∈ [0, B(1)/µ^c_k(1)] for any feasible point. Hence, (10) has an optimal solution (η*_1,···,η*_C). Consider any non-anticipating algorithm. Let Z_t be the sum of the total payoff accumulated in rounds 1 to t plus the "cost" of the remaining resources, i.e., Z_t = Σ_{τ=1}^t r_{a_τ,τ} + Σ_{i=1}^C η*_i · ( B(i) − Σ_{τ=1}^t c_{a_τ,τ}(i) ). Observe that (Z_t)_t is a supermartingale with respect to the filtration (F_t)_t, as E[Z_t | F_{t−1}] = Σ_{k=1}^K p^t_k · ( µ^r_k − Σ_{i=1}^C η*_i · µ^c_k(i) ) + Z_{t−1} ≤ Z_{t−1}, where p^t_k ∈ F_{t−1} is determined by the algorithm and corresponds to the probability of pulling arm k at time t given the past. Moreover, (Z_t)_t has bounded increments, since E[ |Z_t − Z_{t−1}| | F_{t−1} ] = Σ_{k=1}^K p^t_k · E[ |r_{k,t} − Σ_{i=1}^C η*_i · c_{k,t}(i)| ] ≤ Σ_{k=1}^K p^t_k · ( 1 + Σ_{i=1}^C η*_i ) = ( 1 + Σ_{i=1}^C η*_i ) < ∞. We also have E[τ*] < ∞ as:

E[τ*] = Σ_{t=1}^∞ P[τ* ≥ t]
= Σ_{t=1}^∞ P[ Σ_{τ=1}^{t−1} c_{a_τ,τ}(i) ≤ B(i), i = 1,···,C ]
≤ 1 + Σ_{t=1}^∞ P[ Σ_{τ=1}^t c_{a_τ,τ}(1) ≤ t · min_k µ^c_k(1) − ( t · min_k µ^c_k(1) − B(1) ) ]
≤ ( B(1)/min_k µ^c_k(1) + 2 ) + Σ_{t ≥ B(1)/min_k µ^c_k(1)} exp( −2 ( t · min_k µ^c_k(1) − B(1) )² / t )
< ∞,

where the third inequality results from an application of Lemma 1. By Doob's optional stopping theorem, E[Z_{τ*}] ≤ E[Z_0] = Σ_{i=1}^C η*_i · B(i). Observe that:

E[Z_{τ*}] = E[ Σ_{k=1}^K p^{τ*}_k · ( µ^r_k − Σ_{i=1}^C η*_i · µ^c_k(i) ) + Z_{τ*−1} ]
≥ E[ −Σ_{i=1}^C η*_i + Σ_{t=1}^{τ*−1} r_{a_t,t} ].


Consider l ∈ argmin_{i=1,···,C} B(i) and define η_l = max_k µ^r_k/µ^c_k(l) and η_i = 0 for i ≠ l. Since η is a feasible solution for (10), we have:

Σ_{i=1}^C η*_i ≤ ( Σ_{i=1}^C B(i) · η*_i ) / B(l) ≤ ( Σ_{i=1}^C B(i) · η_i ) / B(l) ≤ max_k µ^r_k/µ^c_k(l).

We get E[Z_{τ*}] ≥ E[ Σ_{t=1}^{τ*−1} r_{a_t,t} ] − max_{k,i} µ^r_k/µ^c_k(i) and finally:

E[ Σ_{t=1}^{τ*−1} r_{a_t,t} ] ≤ Σ_{i=1}^C η*_i · B(i) + max_{k,i} µ^r_k/µ^c_k(i).

By strong duality, Σ_{i=1}^C η*_i · B(i) is also the optimal value of (3).

Appendix B. Proof of Lemma 3

We show that UCB-BV1 does not achieve sublinear regret for the standard MAB problem (which is a particular case of the setting considered in Section 4). We consider K = 3 arms with c_{k,t} = 1 for all arms k = 1, 2, 3 and for all time periods t; hence λ = 1 is a valid choice. The reward distribution of arm 1 is given by ν^r_1 = (2/3)δ_0 + (1/3)δ_1, with δ_y defined as the Dirac distribution supported at y. The rewards of the last two arms are almost surely equal, independent from the reward of arm 1, and distributed according to a uniform distribution on [0, 1]. Observe that at time period t = K + 1, i.e., immediately after the initialization step, the exploration terms are identical across arms and r̄_{1,K+1} > r̄_{2,K+1} = r̄_{3,K+1} with probability 1/3. If this event occurs, UCB-BV1 pulls arm 1. At the next time step, i.e., t = K + 2, observe that the exploration terms of arms 2 and 3 are negative and strictly smaller than −1, while the exploration term of arm 1 is positive (as n_{1,t} = 2); hence:

r̄_{1,t}/c̄_{1,t} + ( (1 + 1/λ) · √(ln(t−1)/n_{1,t}) ) / ( λ − √(ln(t−1)/n_{1,t}) ) > 0 > max_{k=2,3} [ r̄_{k,t}/c̄_{k,t} + ( (1 + 1/λ) · √(ln(t−1)/n_{k,t}) ) / ( λ − √(ln(t−1)/n_{k,t}) ) ],

so that UCB-BV1 pulls arm 1 again.
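The comparison above can be checked numerically. The sketch below assumes the UCB-BV1 index of Ding et al. (2013), r̄_k/c̄_k + (1 + 1/λ)·√(ln(t−1)/n_k)/(λ − √(ln(t−1)/n_k)); the realized rewards are one hypothetical draw consistent with the event described.

```python
import math

def ucb_bv1_index(r_bar, c_bar, n, t, lam=1.0):
    # the exploration term flips sign as soon as sqrt(ln(t-1)/n) exceeds lambda
    e = math.sqrt(math.log(t - 1) / n)
    return r_bar / c_bar + (1 + 1 / lam) * e / (lam - e)

t = 5                                       # t = K + 2 with K = 3 arms
arm1 = ucb_bv1_index(1.0, 1.0, n=2, t=t)    # arm 1 pulled twice, rewards 1 and 1
arm2 = ucb_bv1_index(0.5, 1.0, n=1, t=t)    # arms 2, 3 pulled once, reward 0.5
assert arm2 < -1      # negative exploration term dominates for arms 2 and 3
assert arm1 > arm2    # so UCB-BV1 keeps pulling arm 1
```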

Fact 1

E[n_{k,τ*}] ≤ 2β_k · E[ln(τ*)] + E[ Σ_{t=1}^{τ*} I{a_t = k} · I{n_{k,t} ≥ β_k ln(t)} ].

Proof. Define the random variable T_k = β_k · ln(τ*). We have:

E[n_{k,τ*}] = E[ n_{k,τ*} · I{n_{k,τ*} < 2T_k} ] + E[ n_{k,τ*} · I{n_{k,τ*} ≥ 2T_k} ] ≤ 2β_k · E[ln(τ*)] + E[ Σ_{t=1}^{τ*} I{a_t = k} · I{n_{k,t} ≥ β_k ln(t)} ],

since arm k can be pulled at time t with n_{k,t} ≥ β_k ln(t) only if obj_{k,t} + E_{k,t} ≥ obj_{k*,t} + E_{k*,t}. This last term can be broken down as follows:

E[ Σ_{t=1}^{τ*} I{a_t = k} · I{n_{k,t} ≥ β_k ln(t)} ] ≤ E[ Σ_{t=1}^{τ*} I{obj_{k,t} + E_{k,t} ≥ obj_{k*,t} + E_{k*,t}} · I{n_{k,t} ≥ β_k ln(t)} ]
≤ E[ Σ_{t=1}^{τ*} I{obj_{k,t} ≥ obj_k + E_{k,t}} ] (12)
+ E[ Σ_{t=1}^{τ*} I{obj_{k*,t} ≤ obj_{k*} − E_{k*,t}} ] (13)
+ E[ Σ_{t=1}^{τ*} I{obj_{k*} < obj_k + 2E_{k,t}} · I{n_{k,t} ≥ β_k ln(t)} ], (14)

because, if obj_{k,t} < obj_k + E_{k,t} and obj_{k*,t} > obj_{k*} − E_{k*,t} while obj_{k,t} + E_{k,t} ≥ obj_{k*,t} + E_{k*,t}, it must be that obj_{k*} < obj_k + 2E_{k,t}. Let us study (12), (13), and (14) separately.

Fact 2

R_B ≤ Σ_{k | Δ_k > 0} µ^c_k · (Δ_k/B) · E[n_{k,τ*}] + 1 + µ^r_{k*}/µ^c_{k*}.

Using the upper bound of Lemma 4, the concavity of ln, and Lemma 5, we finally derive the result.

Appendix F. Proof of Lemma 7

Consider a basis x ∈ B and a time period t. For A_x to be well-defined, we need to show that there always exists an arm k ∈ supp(x) such that n^x_{k,t} ≤ n_{x,t} · ξ^x_k / Σ_{l=1}^K ξ^x_l. Suppose there is none; we have:

n_{x,t} = Σ_{k∈supp(x)} n^x_{k,t} > Σ_{k∈supp(x)} n_{x,t} · ξ^x_k / Σ_{l=1}^K ξ^x_l = n_{x,t} · Σ_{k∈supp(x)} ξ^x_k / Σ_{l=1}^K ξ^x_l = n_{x,t},

a contradiction. Moreover, we have, at any time t and for any arm k ∈ supp(x):

n^x_{k,t} ≤ n_{x,t} · ξ^x_k / Σ_{l=1}^K ξ^x_l + 1.

Indeed, suppose otherwise and define t* ≤ t as the last time arm k was pulled. Since (n_{x,τ})_{τ=1,···,t} is a non-decreasing sequence, we have:

n^x_{k,t*} = n^x_{k,t} − 1 > n_{x,t} · ξ^x_k / Σ_{l=1}^K ξ^x_l ≥ n_{x,t*} · ξ^x_k / Σ_{l=1}^K ξ^x_l,

which shows by definition that arm k could not have been pulled at time t*. We also derive as a byproduct that, at any time t and for any arm k ∈ supp(x):

n_{x,t} · ξ^x_k / Σ_{l=1}^K ξ^x_l − (r_{1,···,C} − 1) ≤ n^x_{k,t},

since n_{x,t} = Σ_{k∈supp(x)} n^x_{k,t} and since a basis involves at most r_{1,···,C} arms.
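The bookkeeping property just proved can be checked by simulation: the rule "pull some arm whose counter sits at or below its target share" keeps every counter within a constant of n_{x,t} · ξ^x_k / Σ_l ξ^x_l. The target frequencies ξ below are hypothetical.

```python
xi = [3.0, 1.0, 2.0]                      # hypothetical target pull frequencies
share = [x / sum(xi) for x in xi]
counts = [0, 0, 0]                        # n^x_{k,t} for the three arms
for _ in range(1000):
    n = sum(counts)
    # A_x: pick any arm currently at or below its target share
    # (such an arm always exists, by the pigeonhole argument above)
    k = next(i for i in range(3) if counts[i] <= n * share[i])
    counts[k] += 1
    n += 1
    for i in range(3):
        # upper bound: n * share + 1; lower bound: n * share - (r - 1), r = 3 arms
        assert n * share[i] - 2 <= counts[i] <= n * share[i] + 1
```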

Appendix G. Proof of Lemma 8

Consider any suboptimal basis x ∈ B. The proof is along the same lines as for Lemma 5 and follows the exact same steps.

Fact 5

E[n_{x,τ*}] ≤ 2β_x · E[ln(τ*)] + E[ Σ_{t=1}^{τ*} I{x_t = x} · I{n_{x,t} ≥ β_x ln(t)} ]. (15)

We omit the proof as it is analogous to the proof of Fact 1. As in Lemma 5, we break down the second term on the right-hand side into three terms and bound each of them by a constant:

E[ Σ_{t=1}^{τ*} I{x_t = x} · I{n_{x,t} ≥ β_x ln(t)} ] ≤ E[ Σ_{t=1}^{τ*} I{obj_{x,t} + E_{x,t} ≥ obj_{x*,t} + E_{x*,t}} · I{n_{x,t} ≥ β_x ln(t)} ]
≤ E[ Σ_{t=1}^{τ*} I{obj_{x,t} ≥ obj_x + E_{x,t}} ] (16)
+ E[ Σ_{t=1}^{τ*} I{obj_{x*,t} ≤ obj_{x*} − E_{x*,t}} ] (17)
+ E[ Σ_{t=1}^{τ*} I{obj_{x*} < obj_x + 2E_{x,t}} · I{n_{x,t} ≥ β_x ln(t)} ]. (18)

For x ∈ O^c ∩ B ∩ O^c_∞, we have x ∈ B_∞ and Δ_x/min_i B(i) > 0 for min_i B(i) large enough, as shown for the first claim. For x ∈ O^c ∩ B ∩ O_∞, we have Δ_x/min_i B(i) = O( ln(min_i B(i)) / min_i B(i) ) as min_i B(i) → ∞, since Δ_x/min_i B(i) is linear in ( B(1)/min_i B(i) − lim B(1)/min_i B(i), ···, B(C)/min_i B(i) − lim B(C)/min_i B(i) ). Bringing everything together, we conclude:

R_{B(1),···,B(C)} = O(ln(min_i B(i))).

Appendix J. Proof of Lemma 11

Consider x ∉ B. Note that single-armed bases are always feasible, so x must involve two arms. Without loss of generality, we can assume that x = {k, l} and µ^c_k, µ^c_l > B/T (the situation is symmetric if the reverse inequality holds). If x is selected at time t, either c̄_{k,t} ≤ B/T or c̄_{l,t} ≤ B/T, otherwise x would have been infeasible for (5). Thus, using (8):

E[n_{x,τ*}] ≤ E[ Σ_{t=t_i}^{τ*} I{a_t = x} · I{n_{k,t} ≥ (1/ε²) ln(t)} · I{n_{l,t} ≥ (1/ε²) ln(t)} ]
≤ Σ_{t=t_i}^∞ P[ c̄_{k,t} ≤ B/T, n_{k,t} ≥ (1/ε²) ln(t) ] + P[ c̄_{l,t} ≤ B/T, n_{l,t} ≥ (1/ε²) ln(t) ].

Following the same recipe as in the proof of Fact 2, we conclude:

E[n_{x,τ*}] ≤ π² / (3 · (1 − exp(−2ε²))).

Appendix K. Proof of Lemma 12

The proof is along the same lines as in Lemmas 5 and 8. We break down the analysis into a series of facts, where we emphasize the main differences. We start off with an inequality analogous to Fact 1. The only difference lies in the initialization step, which essentially guarantees that x* ∈ B_t with high probability.

Fact 9

E[n_{x,τ*}] ≤ 2β_x · E[ln(τ*)] + π² / (3 · (1 − exp(−2ε²))) + (1/ε²) ⌈ln(B/ε + 1)⌉ + E[ Σ_{t=t_i}^{τ*} I{x_t = x} · I{n_{x,t} ≥ β_x ln(t)} · I{x* ∈ B_t} ]. (19)

Proof. For any suboptimal feasible basis x, define T_x = β_x · ln(τ*). If x* involves a single arm, the proof is exactly the same as the one given in Fact 1, setting aside the initialization step, because x* is always feasible at any time t. If x* involves two arms, x* = {k*, l*} with µ^c_{k*} > B/T > µ^c_{l*}, we start along the same lines as in Fact 1 of Lemma 5, substituting x for k and using (8), to get:

E[n_{x,τ*}] ≤ 2β_x · E[ln(τ*)] + (1/ε²) ⌈ln(B/ε + 1)⌉ + E[ Σ_{t=t_i}^{τ*} I{x_t = x} · I{n_{x,t} ≥ β_x ln(t)} · I{n_{k*,t} ≥ (1/ε²) ln(t)} · I{n_{l*,t} ≥ (1/ε²) ln(t)} ],

where the second term results from the initialization phase associated with basis {k, l}. This further yields:

E[n_{x,τ*}] ≤ 2β_x · E[ln(τ*)] + (1/ε²) ⌈ln(B/ε + 1)⌉ + E[ Σ_{t=t_i}^{τ*} I{(k*, l*) ∉ P_t} · I{n_{k*,t} ≥ (1/ε²) ln(t)} · I{n_{l*,t} ≥ (1/ε²) ln(t)} ] + E[ Σ_{t=t_i}^{τ*} I{x_t = x} · I{n_{x,t} ≥ β_x ln(t)} · I{(k*, l*) ∈ P_t} ].

We bound the third term on the right-hand side along the same lines as in Lemma 11 to obtain the desired inequality.


The remainder of this proof is dedicated to showing that the last term in (19) can be bounded by a constant. This term can be broken down into three terms, similarly as in Lemmas 5 and 8:

E[ Σ_{t=t_i}^{τ*} I{x_t = x} · I{n_{x,t} ≥ β_x ln(t)} · I{x* ∈ B_t} ] ≤ E[ Σ_{t=t_i}^{τ*} I{obj_{x,t} + E_{x,t} ≥ obj_{x*,t} + E_{x*,t}} · I{n_{x,t} ≥ β_x ln(t)} · I{x ∈ B_t, x* ∈ B_t} ]
≤ E[ Σ_{t=t_i}^{τ*} I{obj_{x,t} ≥ obj_x + E_{x,t}} · I{x ∈ B_t} ] (20)
+ E[ Σ_{t=t_i}^{τ*} I{obj_{x*,t} ≤ obj_{x*} − E_{x*,t}} · I{x* ∈ B_t} ] (21)
+ E[ Σ_{t=t_i}^{τ*} I{obj_{x*} < obj_x + 2E_{x,t}} · I{n_{x,t} ≥ β_x ln(t)} · I{x ∈ B_t} ]. (22)

Without loss of generality, we may assume that µ^c_k > B/T > µ^c_l (this only leads to an additive correction of −1/ε). Additionally, the initialization step can only decrease E[n_{k,τ*−1}] and E[n_{l,τ*−1}] by at most (1/ε³) ln(B/ε + 1) ≤ (1/ε⁴) ln(T) compared to the situation where the decision maker does not implement it, since pulling an arm incurs an amount of resource consumption of at least ε and at most 1. In the sequel, we discard the initialization step and add these additive corrective terms to the final lower bounds. We start by bounding the probability that the global amount of resource consumed deviates from the target, i.e.:

P[ |b_t^{k,l} − t·B/T| ≥ x ] for x ≥ 0.

If b_t^{k,l} − t·B/T ≥ x, we must have been pulling arm l for the last s ≥ ⌊x⌋ rounds (because the amounts of resource consumption are almost surely bounded by 1) and we must have:

Σ_{τ=t−s}^{t−1} c^τ_l ≥ s·B/T + x − 1.

Otherwise, since a switch of arms occurred at time τ = t − s, we must have:

b_{t−s}^{k,l} − (t − s)·B/T ≤ 1,

which, in combination with:

Σ_{τ=t−s}^{t−1} c^τ_l < s·B/T + x − 1,

yields:

b_t^{k,l} − t·B/T < x.

Hence, if x ≥ 1:

P[ b_t^{k,l} − t·B/T ≥ x ] ≤ Σ_{s=⌊x⌋}^t P[ Σ_{τ=t−s}^{t−1} c^τ_l ≥ s·B/T + x − 1 ]
≤ Σ_{s=⌊x⌋}^t P[ Σ_{τ=t−s}^{t−1} c^τ_l ≥ s·µ^c_l + ( s·(B/T − µ^c_l) + x − 1 ) ]
≤ Σ_{s=⌊x⌋}^t P[ Σ_{τ=t−s}^{t−1} c^τ_l ≥ s·µ^c_l + s·ε ]
≤ Σ_{s=⌊x⌋}^t exp(−2s·ε²)
≤ exp(−2⌊x⌋·ε²) / (1 − exp(−2ε²)).

With a similar argument, we conclude:

P[ |b_t^{k,l} − t·B/T| ≥ x ] ≤ 2 · exp(−2⌊x⌋·ε²) / (1 − exp(−2ε²)).
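The last step of the chain is a plain geometric series, which can be verified numerically (the values of ε and x below are arbitrary):

```python
import math

# sum_{s >= floor(x)} exp(-2 s eps^2) = exp(-2 floor(x) eps^2) / (1 - exp(-2 eps^2))
eps, x = 0.3, 7.4
lhs = sum(math.exp(-2 * s * eps * eps) for s in range(math.floor(x), 10_000))
rhs = math.exp(-2 * math.floor(x) * eps * eps) / (1 - math.exp(-2 * eps * eps))
assert abs(lhs - rhs) < 1e-9  # equal up to a negligible truncation of the tail
```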

Defining t* = T − (1/(2ε³)) ln(T) − T/B and x = (T − t*)·B/T, we conclude that, with probability at least 1 − (2/(1 − exp(−2ε²)))·(1/T), we have:

B − (1/ε³) ln(T) − 2/ε ≤ b_{t*}^{k,l} ≤ B.

In particular, this implies that t* ≤ τ*. Let us denote this last event by A. We get:

B − (1/ε³) ln(T) − 2/ε − 2/(1 − exp(−2ε²)) ≤ E[b_{t*}^{k,l}] ≤ B + 2/(1 − exp(−2ε²)),

which further implies that:

B − (1/ε³) ln(T) − 2/ε − 2/(1 − exp(−2ε²)) ≤ µ^c_k · E[n_{k,t*}] + µ^c_l · E[n_{l,t*}] ≤ B + 2/(1 − exp(−2ε²)).

Moreover: E[n_{k,t*}] + E[n_{l,t*}] = t*. Combining this equality with the last set of inequalities yields:

E[n_{k,t*}] ≥ T · (B/T − µ^c_l)/(µ^c_k − µ^c_l) − ( 3/(2ε³·(µ^c_k − µ^c_l)) ) · ln(T) − 2/(ε·(µ^c_k − µ^c_l)) − 3/( (µ^c_k − µ^c_l)·(1 − exp(−2ε²)) )

and

E[n_{l,t*}] ≥ T · (µ^c_k − B/T)/(µ^c_k − µ^c_l) − ( 3/(2ε³·(µ^c_k − µ^c_l)) ) · ln(T) − 2/(ε·(µ^c_k − µ^c_l)) − 3/( (µ^c_k − µ^c_l)·(1 − exp(−2ε²)) ).

We finally derive:

E[n_{k,τ*−1}] ≥ E[n_{k,t*} · I_A] ≥ T · (B/T − µ^c_l)/(µ^c_k − µ^c_l) − (1/(2ε⁴)) · ln(T) − 3/ε² − (1/(1 − exp(−2ε²))) · (2 + 1/ε)

and

E[n_{l,τ*−1}] ≥ E[n_{l,t*} · I_A] ≥ T · (µ^c_k − B/T)/(µ^c_k − µ^c_l) − (1/(2ε⁴)) · ln(T) − 3/ε² − (1/(1 − exp(−2ε²))) · (2 + 1/ε).

µck − B 1 3 1 1 T · (2 + ). − 4 · ln(T ) − 2 − c c 2 µk − µl ǫ 2ǫ 1 − exp(−2ǫ ) ǫ

Appendix N. Proof of Proposition 15

We first notice that the result of Lemma 2 specializes to:

EROPT(B, T) ≤ max_k µ^r_k/µ^c_k + T · max[ max_{k | µ^c_k ≤ B/T} µ^r_k , max_{k | µ^c_k ≥ B/T} (B/(T·µ^c_k)) · µ^r_k , max_{(k,l)∈P} ( (µ^c_k − B/T)/(µ^c_k − µ^c_l) · µ^r_l + (B/T − µ^c_l)/(µ^c_k − µ^c_l) · µ^r_k ) ]. (23)
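The enumeration of single-armed and two-armed bases behind (23) can be sanity-checked against a brute-force solution of the LP (3) on a toy instance. The instance values and helper below are hypothetical.

```python
def bases_value(mu_r, mu_c, B, T):
    """Best payoff over single-armed and two-armed bases (two resources:
    time T and budget B), mirroring the terms appearing in (23)."""
    K, best = len(mu_r), 0.0
    for k in range(K):
        if mu_c[k] <= B / T:                 # time binds first
            best = max(best, T * mu_r[k])
        if mu_c[k] >= B / T:                 # budget binds first
            best = max(best, (B / mu_c[k]) * mu_r[k])
    for k in range(K):                       # pairs with mu_c[k] > B/T > mu_c[l]
        for l in range(K):
            if mu_c[k] > B / T > mu_c[l]:
                w = (B / T - mu_c[l]) / (mu_c[k] - mu_c[l])
                best = max(best, T * (w * mu_r[k] + (1 - w) * mu_r[l]))
    return best

mu_r, mu_c, B, T = [0.9, 0.5], [0.8, 0.2], 500.0, 1000.0
v = bases_value(mu_r, mu_c, B, T)

# brute-force max{ xi . mu_r : sum(xi) <= T, xi . mu_c <= B, xi >= 0 } on a grid
brute = 0.0
for i in range(2001):
    x0 = i * T / 2000
    if x0 * mu_c[0] <= B:
        x1 = min(T - x0, (B - x0 * mu_c[0]) / mu_c[1])
        brute = max(brute, x0 * mu_r[0] + x1 * mu_r[1])
assert abs(v - brute) < 1e-2 * v  # both find the two-armed basis, value ~700
```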

A unique optimal basis. To simplify the proof, we assume that there is a unique optimal basis to (3), which we denote by x∗ . The case of multiple optimal basis is treated in the next paragraph. The proof starts by refining (4), similarly as in the proof of Proposition 9: RB(1),··· ,B(C) ≤ ≤ ≤ ≤ ≤ ≤

K X

k=1 K X k=1

K X k=1

K X

k=1

K X k=1

∗ · ξkx

t=1

µrk · ξkx − ∗

µrk · ξkx − ∗

µrk

k=1

K X

τ X 1 rat ,t ] + + 1 − E[ ǫ ∗

µrk

µrk

∗ · ξkx

∗ · ξkx

− −

µrk · ξkx − ∗

∞ X t=1

∞ X t=1

∞ X t=1

E[Iτ ∗ ≥t · rat ,t ] + E[Iτ ∗ ≥t · E[Iτ ∗ ≥t ·

K XX

x∈B k=1

K XX

x∈B k=1

K X X k=1 x∈B

K X X k=1 x∈B

1 +1 ǫ rk,t · Ixt =x,at =k ] +

1 +1 ǫ

Ixt =x,at =k · E[rk,t | Ft−1 ]] +

1 +1 ǫ

τ X 1 Ixt =x,at =k ] + + 1 · E[ ǫ ∗

µrk

t=1

µrk · E[nxk,τ ∗ ] +

1 + 1. ǫ

Note that, once a basis has been selected at Step 1, the decision to pull an arm in supp(x) is independent from the past rewards and amounts of resource consumption observed when selecting 42

L OGARITHMIC R EGRET B OUNDS

FOR

BWK

any other basis because the load balancing algorithms are decoupled. This means that the process generating the rewards and the amounts of resource consumption when we select x∗ is the same whether only x∗ can be selected or whether all basis are being considered. The two processes only differ by their respective stopping times. For an arm k∗ involved in x∗ , denote by n ˜ k∗ ,˜τ −1 the ∗ number of times this arm is pulled when x is selected throughout the game, where τ˜ is defined as the stopping time of this process. Because we are exploring suboptimal basis, n ˜ k∗ ,˜τ −1 will typically ∗ be larger than nxk∗ ,τ ∗ −1 . Nevertheless, selecting a suboptimal basis may only decrease the number of times x∗ is selected by 1ǫ compared to the situation where we systematically select x∗ . This is because pulling any arm in a suboptimal basis consumes at most 1 unit of resource while pulling any arm in x∗ consumes at least ǫ unit of resource. From these observations, we derive: K X

µrk

k=1

· E[n

k ∗ ,τ ∗ −1

]≥

K X k=1

nk∗ ,˜τ −1 ] − µrk · E[˜

X

x∈O c ∩B

E[nx,τ ∗ −1 ] X E[nx,τ ∗ −1 ] − . ǫ ǫ x∈B /
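The 1/ε displacement claim can be spelled out with a budget-accounting sketch (our rephrasing of the argument above, not an additional result): every unit of budget spent on a suboptimal basis is a unit no longer available to x∗, and since each pull of an arm in x∗ consumes at least ε units, one unit of budget supports at most 1/ε such pulls.

```latex
% Budget accounting: each suboptimal pull consumes at most 1 unit,
% each pull inside supp(x*) consumes at least eps units, hence
n^{x^*}_{k,\tau^*-1} \;\geq\; \tilde{n}_{k,\tilde{\tau}-1}
  \;-\; \frac{1}{\epsilon} \sum_{x \neq x^*} n_{x,\tau^*-1},
% and splitting the suboptimal bases into O^c \cap B and the rest
% yields the two (1/eps)-weighted sums.
```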

This yields:

$$
\begin{aligned}
R_{B(1),\cdots,B(C)}
&\leq \sum_{k=1}^{K} \mu^r_k \cdot \xi^{x^*}_k - \sum_{k \in \mathrm{supp}(x^*)} \mu^r_k \cdot \mathbb{E}[\tilde{n}_{k,\tilde{\tau}-1}] + \sum_{x \in \mathcal{O}^c \cap \mathcal{B}} \frac{\mathbb{E}[n_{x,\tau^*-1}]}{\epsilon} + \sum_{x \notin \mathcal{B}} \frac{\mathbb{E}[n_{x,\tau^*-1}]}{\epsilon} + O(1) \\
&\leq \sum_{k=1}^{K} \mu^r_k \cdot \xi^{x^*}_k - \sum_{k \in \mathrm{supp}(x^*)} \mu^r_k \cdot \mathbb{E}[\tilde{n}_{k,\tilde{\tau}-1}] + \frac{1}{\epsilon} \sum_{x \in \mathcal{O}^c \cap \mathcal{B}} \big(2\beta_x \cdot \mathbb{E}[\ln(\tau^*)] + C_x\big) \\
&\quad + \frac{1}{\epsilon} \cdot \Big( \frac{\pi^2}{3 \cdot (1 - \exp(-2\epsilon^2))} + 2 \ln\big(\tfrac{B}{\epsilon} + 1\big) \Big) + O(1) \\
&\leq \sum_{k=1}^{K} \mu^r_k \cdot \xi^{x^*}_k - \sum_{k \in \mathrm{supp}(x^*)} \mu^r_k \cdot \mathbb{E}[\tilde{n}_{k,\tilde{\tau}-1}] \\
&\quad + 16 \cdot \big(1 + \tfrac{4}{\epsilon}\big)^3 \cdot \Big( \sum_{x \in \mathcal{O}^c \cap \mathcal{B}} \big(\tfrac{T}{\Delta_x}\big)^2 \Big) \cdot \ln\big(\tfrac{B}{\epsilon} + 1\big) + \frac{K^2}{\epsilon^3} \cdot \ln\big(\tfrac{B}{\epsilon} + 1\big) + O(1),
\end{aligned}
$$

where we use Lemma 11, Lemma 12, and the fact that τ∗ ≤ B/ε + 1. Using Lemma 13, Lemma 14, and (23), we obtain the following inequalities. If x∗ involves two arms, i.e. x∗ = {k∗, l∗}:

$$
R_{B,T} \leq 16 \cdot \big(1 + \tfrac{4}{\epsilon}\big)^3 \cdot \Big[ \sum_{x \in \mathcal{O}^c \cap \mathcal{B}} \big(\tfrac{T}{\Delta_x}\big)^2 \Big] \cdot \ln\big(\tfrac{B}{\epsilon} + 1\big) + \big(\tfrac{K^2}{\epsilon^3} + \tfrac{4}{\epsilon^4}\big) \cdot \ln\big(\tfrac{B}{\epsilon} + 1\big) + O(1).
$$

If x∗ involves a single arm:

$$
R_{B,T} \leq 16 \cdot \big(1 + \tfrac{1}{\epsilon}\big)^4 \cdot \Big[ \sum_{x \in \mathcal{O}^c \cap \mathcal{B}} \big(\tfrac{T}{\Delta_x}\big)^2 \Big] \cdot \ln\big(\tfrac{B}{\epsilon} + 1\big) + \tfrac{K^2}{\epsilon^3} \cdot \ln\big(\tfrac{B}{\epsilon} + 1\big) + O(1).
$$


Multiple optimal bases. In the general case of multiple optimal bases, we use the same coupling argument but we look at the expected total payoff derived from pulling arms when selecting any of the optimal bases x ∈ O:

$$
\sum_{x \in \mathcal{O}} \sum_{k \in \mathrm{supp}(x)} \mu^r_k \cdot \mathbb{E}[n^x_{k,\tau^*-1}] \;\geq\; \sum_{x \in \mathcal{O}} \sum_{k \in \mathrm{supp}(x)} \mu^r_k \cdot \mathbb{E}[\tilde{n}^x_{k,\tilde{\tau}-1}] \;-\; \sum_{x \in \mathcal{O}^c \cap \mathcal{B}} \frac{\mathbb{E}[n_{x,\tau^*-1}]}{\epsilon} \;-\; \sum_{x \notin \mathcal{B}} \frac{\mathbb{E}[n_{x,\tau^*-1}]}{\epsilon}.
$$

Since we are now pulling arms from possibly several optimal bases, Lemmas 13 and 14 need to be adapted. We give the proof when there are two bases involving two arms, say x∗₁ = {k∗, l∗} and x∗₂ = {i∗, j∗}, that happen to be optimal, but the proof can be easily adapted when there are more and/or if there are optimal single-armed bases. Using a result derived in the course of proving Lemma 14, we have, for all times t:

$$
\mathbb{P}\Big[\big|b^{\{k^*,l^*\}}_t - n^{\{k^*,l^*\}}_t \cdot \tfrac{B}{T}\big| \geq x\Big] \;\leq\; 2 \cdot \frac{\exp(-2\lfloor x \rfloor \cdot \epsilon^2)}{1 - \exp(-2\epsilon^2)}
$$

and

$$
\mathbb{P}\Big[\big|b^{\{i^*,j^*\}}_t - n^{\{i^*,j^*\}}_t \cdot \tfrac{B}{T}\big| \geq x\Big] \;\leq\; 2 \cdot \frac{\exp(-2\lfloor x \rfloor \cdot \epsilon^2)}{1 - \exp(-2\epsilon^2)}.
$$
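These displays are Hoeffding-type deviation bounds for the budget consumed over the pulls of a basis. A minimal numerical sketch of the underlying two-sided Hoeffding inequality for bounded i.i.d. variables (generic parameters, not the paper's summed ⌊x⌋·ε² form):

```python
import math
import random

def hoeffding_bound(n, x):
    """Two-sided Hoeffding bound for a sum S_n of n i.i.d. [0, 1]-valued
    variables: P[|S_n - E[S_n]| >= x] <= 2 * exp(-2 * x**2 / n)."""
    return 2.0 * math.exp(-2.0 * x ** 2 / n)

def empirical_deviation(n, p, x, trials=20000, seed=0):
    """Empirical frequency of |S_n - n*p| >= x for Bernoulli(p) draws."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(rng.random() < p for _ in range(n))
        if abs(s - n * p) >= x:
            hits += 1
    return hits / trials
```

For instance, with n = 100 Bernoulli(1/2) draws and deviation x = 15, the empirical deviation frequency stays below the Hoeffding bound.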

T and x = 12 (T − t∗ ) · for all times t. Defining t∗ = T − ǫ13 ln(T ) − 2 B 1 4 probability at least 1 − 1−exp(−2ǫ 2) · T : {k ∗ ,l∗ }

nt∗

·

B T,

we conclude that with

B {k ∗ ,l∗ } B {k ∗ ,l∗ } · + x, ≤ nt∗ − x ≤ bt∗ T T

and:

B {i∗ ,j ∗ } B {i∗ ,j ∗ } · + x. ≤ nt∗ − x ≤ bt∗ T T We denote this last even by A. This last fact implies in particular that τ˜ ≥ t∗ as we get: {i∗ ,j ∗ }

nt∗

·

bt∗ ≤ B, by summing up the last two inequalities and where bt∗ is the total budget consumed at time t∗ . We also get: {k ∗ ,l∗ }
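To make the budget claim explicit: in the coupled process only the two optimal bases are selected, so n^{{k∗,l∗}}_{t∗} + n^{{i∗,j∗}}_{t∗} = t∗, and summing the two upper inequalities with 2x = (T − t∗)·B/T gives (a verification sketch of the step above):

```latex
b_{t^*} \;=\; b^{\{k^*,l^*\}}_{t^*} + b^{\{i^*,j^*\}}_{t^*}
  \;\leq\; \big(n^{\{k^*,l^*\}}_{t^*} + n^{\{i^*,j^*\}}_{t^*}\big)\cdot\frac{B}{T} + 2x
  \;=\; t^* \cdot \frac{B}{T} + (T - t^*)\cdot\frac{B}{T}
  \;=\; B.
```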

$$
\mathbb{E}[n^{\{k^*,l^*\}}_{t^*}] \cdot \frac{B}{T} - x - \frac{4}{1 - \exp(-2\epsilon^2)} \;\leq\; \mu^c_{k^*} \cdot \mathbb{E}[n^{\{k^*,l^*\}}_{k^*,t^*}] + \mu^c_{l^*} \cdot \mathbb{E}[n^{\{k^*,l^*\}}_{l^*,t^*}],
$$
$$
\mu^c_{k^*} \cdot \mathbb{E}[n^{\{k^*,l^*\}}_{k^*,t^*}] + \mu^c_{l^*} \cdot \mathbb{E}[n^{\{k^*,l^*\}}_{l^*,t^*}] \;\leq\; \mathbb{E}[n^{\{k^*,l^*\}}_{t^*}] \cdot \frac{B}{T} + x + \frac{4}{1 - \exp(-2\epsilon^2)},
$$
$$
\mathbb{E}[n^{\{k^*,l^*\}}_{k^*,t^*}] + \mathbb{E}[n^{\{k^*,l^*\}}_{l^*,t^*}] = \mathbb{E}[n^{\{k^*,l^*\}}_{t^*}],
$$
$$
\mathbb{E}[n^{\{i^*,j^*\}}_{t^*}] \cdot \frac{B}{T} - x - \frac{4}{1 - \exp(-2\epsilon^2)} \;\leq\; \mu^c_{i^*} \cdot \mathbb{E}[n^{\{i^*,j^*\}}_{i^*,t^*}] + \mu^c_{j^*} \cdot \mathbb{E}[n^{\{i^*,j^*\}}_{j^*,t^*}],
$$
$$
\mu^c_{i^*} \cdot \mathbb{E}[n^{\{i^*,j^*\}}_{i^*,t^*}] + \mu^c_{j^*} \cdot \mathbb{E}[n^{\{i^*,j^*\}}_{j^*,t^*}] \;\leq\; \mathbb{E}[n^{\{i^*,j^*\}}_{t^*}] \cdot \frac{B}{T} + x + \frac{4}{1 - \exp(-2\epsilon^2)},
$$
and
$$
\mathbb{E}[n^{\{i^*,j^*\}}_{i^*,t^*}] + \mathbb{E}[n^{\{i^*,j^*\}}_{j^*,t^*}] = \mathbb{E}[n^{\{i^*,j^*\}}_{t^*}].
$$
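The lower bounds that follow can be recovered by solving the 2×2 linear system formed by the sandwich inequality and the counting identity; a sketch for the pair (k∗, l∗), writing a = E[n^{{k∗,l∗}}_{k∗,t∗}], b = E[n^{{k∗,l∗}}_{l∗,t∗}], N = E[n^{{k∗,l∗}}_{t∗}], and assuming (for this sketch) µ^c_{l∗} < B/T < µ^c_{k∗}:

```latex
a + b = N, \qquad
\mu^c_{k^*}\, a + \mu^c_{l^*}\, b \;=\; N \cdot \frac{B}{T} + O(x + 1)
\;\Longrightarrow\;
a \;=\; N \cdot \frac{\frac{B}{T} - \mu^c_{l^*}}{\mu^c_{k^*} - \mu^c_{l^*}} + O(x).
% Since x = (1/2)(T - t^*) B/T and T - t^* = O(ln(T)/eps^3),
% the error term is O(ln T); the pair (i^*, j^*) is symmetric.
```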

Based on these inequalities, we obtain:

$$
\mathbb{E}[\tilde{n}^{\{k^*,l^*\}}_{k^*,\tilde{\tau}-1}] \;\geq\; \mathbb{E}[n^{\{k^*,l^*\}}_{t^*}] \cdot \frac{\frac{B}{T} - \mu^c_{l^*}}{\mu^c_{k^*} - \mu^c_{l^*}} - O(\ln(T)),
$$
$$
\mathbb{E}[\tilde{n}^{\{k^*,l^*\}}_{l^*,\tilde{\tau}-1}] \;\geq\; \mathbb{E}[n^{\{k^*,l^*\}}_{t^*}] \cdot \frac{\mu^c_{k^*} - \frac{B}{T}}{\mu^c_{k^*} - \mu^c_{l^*}} - O(\ln(T)),
$$
$$
\mathbb{E}[\tilde{n}^{\{i^*,j^*\}}_{i^*,\tilde{\tau}-1}] \;\geq\; \mathbb{E}[n^{\{i^*,j^*\}}_{t^*}] \cdot \frac{\frac{B}{T} - \mu^c_{j^*}}{\mu^c_{i^*} - \mu^c_{j^*}} - O(\ln(T)),
$$
and
$$
\mathbb{E}[\tilde{n}^{\{i^*,j^*\}}_{j^*,\tilde{\tau}-1}] \;\geq\; \mathbb{E}[n^{\{i^*,j^*\}}_{t^*}] \cdot \frac{\mu^c_{i^*} - \frac{B}{T}}{\mu^c_{i^*} - \mu^c_{j^*}} - O(\ln(T)).
$$

We conclude the proof using:

$$
\mathbb{E}[n^{\{k^*,l^*\}}_{t^*}] + \mathbb{E}[n^{\{i^*,j^*\}}_{t^*}] = t^* = T - O(\ln(T)).
$$

Appendix O. Proof of Proposition 16

Building on the result of Proposition 15, it remains to prove that:

$$
\sum_{x \in \mathcal{O}^c \cap \mathcal{B}} \Big(\frac{T}{\Delta_x}\Big)^2 = O(1).
$$

The proof follows the exact same steps as the proof of Proposition 10, observing that, by assumption, we have ε·T ≤ B ≤ T.
