MEMORY BANDITS: A BAYESIAN APPROACH FOR THE SWITCHING BANDIT PROBLEM

Réda Alami (1,3), Odalric-Ambrym Maillard (2), Raphael Feraud (3)
(1) Paris-Saclay University, (2) Inria Lille, (3) Orange Labs

BayesOpt 2017

MULTI-ARMED BANDIT

For each step t = 1, ..., T:
• The player chooses an arm k_t ∈ K.
• The reward of the chosen arm is revealed: x_{k_t} ∈ [0, 1].
• Bernoulli rewards: x_{k_t} ∼ B(µ_{k_t,t}).

Goal: minimize the pseudo-regret

R(T) = Σ_{t=1}^{T} µ*_t − E[ Σ_{t=1}^{T} x_{k_t} ],   where µ*_t = max_k µ_{k,t}.
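As a point of reference, this protocol and its pseudo-regret can be simulated in a few lines of Python; the horizon, the switching Bernoulli means and the uniform placeholder policy below are illustrative choices of this sketch, not values taken from the poster.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative switching environment: K = 3 Bernoulli arms whose means change at t = 500.
T, K = 1000, 3
means = np.where(np.arange(T)[:, None] < 500,
                 np.array([0.2, 0.5, 0.8]),   # mu_{k,t} before the switch
                 np.array([0.8, 0.3, 0.4]))   # mu_{k,t} after the switch

pseudo_regret = 0.0
for t in range(T):
    k_t = rng.integers(K)                    # placeholder policy: uniform random play
    x_t = rng.binomial(1, means[t, k_t])     # Bernoulli reward x_{k_t} ~ B(mu_{k_t,t})
    pseudo_regret += means[t].max() - means[t, k_t]   # per-step gap mu*_t - mu_{k_t,t}

print(f"pseudo-regret of the uniform policy over T={T} steps: {pseudo_regret:.1f}")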

GLOBAL SWITCHING TS WITH BAYESIAN AGGREGATION

A growing number of Thompson Sampling instances f_{i,t}: i denotes the starting time and t the current time. Let P(f_{i,t}) be the probability at time t of the Thompson Sampling instance started at time i.

Initialization: P(f_{1,1}) = 1, t = 1, and ∀ k ∈ K: α_{k,f_{1,1}} ← α_0, β_{k,f_{1,1}} ← β_0.

1. Decision process, at each time t:
• ∀ i < t, ∀ k ∈ K: sample θ_{k,f_{i,t}} ∼ Beta(α_{k,f_{i,t}}, β_{k,f_{i,t}}).
• Play (Bayesian Aggregation): k_t = arg max_k Σ_i P(f_{i,t}) θ_{k,f_{i,t}}.

2. Instantaneous gain update, ∀ i < t:
P(x_t | f_{i,t}) = α_{k_t,f_{i,t}} / (α_{k_t,f_{i,t}} + β_{k_t,f_{i,t}}) if x_t = 1,
P(x_t | f_{i,t}) = β_{k_t,f_{i,t}} / (α_{k_t,f_{i,t}} + β_{k_t,f_{i,t}}) if x_t = 0.
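A minimal Python sketch of this decision loop follows. The section above specifies the sampling, the aggregated play and the instantaneous gain; how the weights P(f_{i,t}) are updated and how new experts are started is not spelled out here, so the reweighting rule, the switch_rate parameter and the function name below are assumptions of the sketch, not the poster's algorithm.

import numpy as np

rng = np.random.default_rng(1)

def switching_ts_bayes_agg(rewards, alpha0=1.0, beta0=1.0, switch_rate=0.01):
    """Sketch of the decision loop above. rewards[t, k] is the Bernoulli reward
    arm k would give at time t; the update of the weights P(f_{i,t}) at the end
    of the loop is an assumption of this sketch."""
    T, K = rewards.shape
    alphas = [np.full(K, alpha0)]   # alpha_{k,f_{i,t}} for each expert i
    betas = [np.full(K, beta0)]     # beta_{k,f_{i,t}}  for each expert i
    weights = np.array([1.0])       # P(f_{i,t}), initialised with P(f_{1,1}) = 1
    picks = np.empty(T, dtype=int)

    for t in range(T):
        # 1. Decision: sample theta_{k,f_{i,t}} ~ Beta(alpha, beta) for every expert,
        #    then play the arm maximising the Bayesian aggregation of the samples.
        theta = np.array([rng.beta(a, b) for a, b in zip(alphas, betas)])  # (experts, K)
        k_t = int(np.argmax(weights @ theta))
        picks[t] = k_t
        x_t = rewards[t, k_t]

        # 2. Instantaneous gain of each expert for the observed reward:
        #    P(x_t | f_{i,t}) = alpha/(alpha+beta) if x_t = 1, beta/(alpha+beta) if x_t = 0.
        a_k = np.array([a[k_t] for a in alphas])
        b_k = np.array([b[k_t] for b in betas])
        gain = a_k / (a_k + b_k) if x_t == 1 else b_k / (a_k + b_k)

        # Standard Beta posterior update of every expert on the played arm.
        for a, b in zip(alphas, betas):
            a[k_t] += x_t
            b[k_t] += 1 - x_t

        # Assumed aggregation step: reweight existing experts by their gain,
        # start a fresh expert with prior mass switch_rate, then renormalise.
        weights = weights * gain
        weights = np.append((1 - switch_rate) * weights, switch_rate * weights.sum())
        alphas.append(np.full(K, alpha0))
        betas.append(np.full(K, beta0))
        weights /= weights.sum()

    return picks

With the means matrix from the previous snippet, switching_ts_bayes_agg(rng.binomial(1, means)) runs the loop on that switching environment; note that the number of Thompson Sampling experts grows linearly with t in this sketch.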