MEMORY BANDITS: A BAYESIAN APPROACH FOR THE SWITCHING BANDIT PROBLEM

Réda Alami¹,³, Odalric-Ambrym Maillard², Raphaël Féraud³
¹ Paris-Saclay University, ² Inria Lille, ³ Orange Labs

BayesOpt 2017

MULTI-ARMED BANDIT
For each step t = 1, ..., T:
• The player chooses an arm k_t ∈ K.
• The reward x_{k_t} ∈ [0, 1] is revealed.
• Bernoulli rewards: x_{k_t} ∼ B(μ_{k_t,t}).

Minimize the pseudo-regret R(T) = Σ_{t=1}^{T} μ*_t − E[ Σ_{t=1}^{T} x_{k_t} ], where μ*_t = max_k μ_{k,t}.

GLOBAL SWITCHING TS WITH BAYESIAN AGGREGATION

Growing number of Thompson Samplings f_{i,t}, where i denotes the starting time and t the current time. Let P(f_{i,t}) be the probability at time t of the Thompson Sampling started at time i.

Initialization: P(f_{1,1}) = 1, t = 1, and ∀k ∈ K: α_{k,f_{1,1}} ← α_0, β_{k,f_{1,1}} ← β_0.

1. Decision process, at each time t:
• ∀ i < t, ∀k ∈ K: θ_{k,f_{i,t}} ∼ Beta(α_{k,f_{i,t}}, β_{k,f_{i,t}})
• Play (Bayesian Aggregation): k_t = arg max_k Σ_i P(f_{i,t}) θ_{k,f_{i,t}}

2. Instantaneous gain update, ∀ i < t:
P(x_t | f_{i,t}) = α_{k_t,f_{i,t}} / (α_{k_t,f_{i,t}} + β_{k_t,f_{i,t}}) if x_t = 1,
P(x_t | f_{i,t}) = β_{k_t,f_{i,t}} / (α_{k_t,f_{i,t}} + β_{k_t,f_{i,t}}) if x_t = 0.
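The pseudo-regret defined above can be computed directly whenever the arm means μ_{k,t} are known. A minimal sketch, assuming a two-armed switching environment with a single change point at t = 500 and a uniformly random player (horizon, means, change point, and player policy are all illustrative, not taken from the poster):

```python
import numpy as np

rng = np.random.default_rng(1)

T = 1000
# Illustrative piecewise-constant means: a single switch at t = 500.
# mu has shape (T, K): row t holds the arm means mu_{k,t}.
mu = np.where(np.arange(T)[:, None] < 500,
              np.array([0.3, 0.7]),
              np.array([0.7, 0.3]))

# A uniformly random player, for illustration only.
arms = rng.integers(0, 2, size=T)

# Bernoulli rewards x_{k_t} ~ B(mu_{k_t,t}) for the played arms.
rewards = (rng.random(T) < mu[np.arange(T), arms]).astype(float)

# Pseudo-regret R(T) = sum_t mu*_t - E[sum_t x_{k_t}]; the expectation of
# the noisy rewards is replaced by the means mu_{k_t,t} of the played arms.
pseudo_regret = mu.max(axis=1).sum() - mu[np.arange(T), arms].sum()
```

Because the pseudo-regret uses the means of the played arms rather than the realized rewards, it isolates the quality of the arm choices from the Bernoulli noise.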
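The decision and update steps above can be sketched as follows. The arm count, horizon, change point, prior parameters, and in particular the 1/(t+1) prior weight assigned to each newly started Thompson Sampling are assumptions made for illustration; the poster's actual prior over new experts is not specified in this section:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3                      # number of arms (illustrative)
T = 200                    # horizon (illustrative)
alpha0, beta0 = 1.0, 1.0   # Beta prior parameters (assumed uniform prior)

# One Thompson Sampling expert f_{i,t} per starting time i:
# per-arm Beta posteriors plus a weight P(f_{i,t}) per expert.
alphas = [np.full(K, alpha0)]   # alphas[i][k] = alpha_{k, f_{i,t}}
betas = [np.full(K, beta0)]
weights = np.array([1.0])       # P(f_{1,1}) = 1

# Illustrative switching environment: the arm means flip at t = 100.
means_before = np.array([0.2, 0.5, 0.8])
means_after = np.array([0.8, 0.5, 0.2])

for t in range(1, T + 1):
    mu_t = means_before if t <= 100 else means_after

    # 1) Decision: sample theta_{k,f_{i,t}} ~ Beta(alpha, beta) for every
    #    expert, then play the arm maximizing the weighted aggregate
    #    sum_i P(f_{i,t}) * theta_{k,f_{i,t}}.
    thetas = np.array([rng.beta(a, b) for a, b in zip(alphas, betas)])
    k_t = int(np.argmax(weights @ thetas))

    # Bernoulli reward for the chosen arm.
    x_t = float(rng.random() < mu_t[k_t])

    # 2) Instantaneous gain: posterior predictive of x_t under each expert,
    #    alpha/(alpha+beta) if x_t = 1, beta/(alpha+beta) if x_t = 0.
    a_k = np.array([a[k_t] for a in alphas])
    b_k = np.array([b[k_t] for b in betas])
    pred = np.where(x_t == 1.0, a_k / (a_k + b_k), b_k / (a_k + b_k))

    # Bayesian weight update (renormalized), then conjugate posterior
    # update of the played arm for every expert.
    weights = weights * pred
    weights = weights / weights.sum()
    for a, b in zip(alphas, betas):
        a[k_t] += x_t
        b[k_t] += 1.0 - x_t

    # A new expert starting at time t+1 enters with the prior; its initial
    # weight 1/(t+1) is an assumption, not from the poster.
    alphas.append(np.full(K, alpha0))
    betas.append(np.full(K, beta0))
    weights = np.append(weights * (1.0 - 1.0 / (t + 1)), 1.0 / (t + 1))
```

Each expert's weight P(f_{i,t}) is multiplied by its posterior predictive probability of the observed reward, so after a switch the aggregate is quickly dominated by experts whose posteriors were started near the change point.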