Supplementary Material to A PAC-Bayesian Approach for Domain Adaptation with Specialization to Linear Classifiers

Pascal Germain ([email protected])
Département d'informatique et de génie logiciel, Université Laval, Québec, Canada

Amaury Habrard ([email protected])
Laboratoire Hubert Curien UMR CNRS 5516, Université Jean Monnet, 42000 St-Etienne, France

François Laviolette ([email protected])
Département d'informatique et de génie logiciel, Université Laval, Québec, Canada

Emilie Morvant ([email protected])
Aix-Marseille Univ., LIF-QARMA, CNRS, UMR 7279, 13013 Marseille, France

In this document, Section 1 collects the lemmas used in the subsequent proofs, Section 2 presents an extended proof of the bound on the domain disagreement dis_ρ(D_S, D_T) (Theorem 3 of the main paper), Section 3 introduces other PAC-Bayesian bounds for dis_ρ(D_S, D_T) and R_{P_T}(G_ρ), and Section 4 gives the equations and implementation details of PBDA, our proposed learning algorithm for PAC-Bayesian DA tasks.
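As a concrete reading aid, the empirical domain disagreement dis_ρ(S, T) that appears throughout can be estimated by Monte Carlo, sampling pairs of classifiers from the posterior ρ. Below is a minimal sketch for linear classifiers under an isotropic Gaussian posterior; the setup and names are ours for illustration, not the paper's released code:

```python
import numpy as np

rng = np.random.default_rng(0)

def pairwise_disagreement(h, hp, X):
    """R_X(h, h'): rate at which two linear classifiers h, h' differ on sample X."""
    return np.mean(np.sign(X @ h) != np.sign(X @ hp))

def dis_rho_hat(XS, XT, sample_h, n_pairs=1000):
    """Monte-Carlo estimate of dis_rho(S, T) =
    |E_{(h,h')~rho^2}[R_S(h,h') - R_T(h,h')]|."""
    diffs = []
    for _ in range(n_pairs):
        h, hp = sample_h(), sample_h()
        diffs.append(pairwise_disagreement(h, hp, XS)
                     - pairwise_disagreement(h, hp, XT))
    return abs(np.mean(diffs))

# Toy sanity check: identical source/target marginals and an isotropic
# Gaussian posterior over weight vectors, so the disagreement gap is near 0.
d = 5
XS = rng.normal(size=(200, d))
XT = rng.normal(size=(200, d))
est = dis_rho_hat(XS, XT, lambda: rng.normal(size=d))
print(est)  # close to 0
```

When the two marginals differ, the same estimator returns a strictly positive value, which is the quantity the bounds below control.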

1. Some tools

Lemma 1 (Markov's inequality). Let Z be a random variable and t ≥ 0. Then,

    P(|Z| ≥ t) ≤ E(|Z|) / t.

Lemma 2 (Jensen's inequality). Let Z be an integrable real-valued random variable and g(·) any function. If g(·) is convex, then g(E[Z]) ≤ E[g(Z)]. If g(·) is concave, then g(E[Z]) ≥ E[g(Z)].

Lemma 3 (Maurer (2004)). Let X = (X_1, ..., X_m) be a vector of i.i.d. random variables, 0 ≤ X_i ≤ 1, with E X_i = µ. Denote X' = (X'_1, ..., X'_m), where X'_i is the unique Bernoulli ({0,1}-valued) random variable with E X'_i = µ. If f : [0,1]^m → R is convex, then,

    E[f(X)] ≤ E[f(X')].

Lemma 4 (from Inequalities (1) and (2) of Maurer (2004)). Let m ≥ 8, and X = (X_1, ..., X_m) be a vector of i.i.d. random variables, 0 ≤ X_i ≤ 1. Then,

    E exp( m · kl( (1/m) Σ_{i=1}^m X_i ‖ E[X_i] ) ) ≤ 2√m,

where,

    kl(a ‖ b) ≝ a ln(a/b) + (1 − a) ln((1 − a)/(1 − b)).   (7)

2. Detailed Proof of Theorem 3

We recall Theorem 3 of the main paper.

Theorem 3. For any distributions D_S and D_T over X, any set of hypotheses H, any prior distribution π over H, any δ ∈ (0,1], and any real number α > 0, with a probability at least 1 − δ over the choice of S × T ∼ (D_S × D_T)^m, for every ρ on H, we have,

    dis_ρ(D_S, D_T) ≤ (2α / (1 − e^{−2α})) [ dis_ρ(S, T) + (2 KL(ρ‖π) + ln(2/δ)) / (m × α) + 1 ] − 1,

where dis_ρ(S, T) is the empirical domain disagreement.

Proof. Firstly, we propose to upper-bound,

    d^{(1)} ≝ E_{(h,h')∼ρ²} [ R_{D_S}(h,h') − R_{D_T}(h,h') ],

by its empirical counterpart,

    d^{(1)}_{S×T} ≝ E_{(h,h')∼ρ²} [ R_S(h,h') − R_T(h,h') ],

and some extra terms related to the Kullback-Leibler divergence between the posterior and the prior.


To do that, we consider an "abstract" classifier ĥ ≝ (h, h') ∈ H², chosen according to a distribution ρ̂ with ρ̂(ĥ) = ρ(h)ρ(h'). Notice that, with π̂(ĥ) = π(h)π(h'), we obtain KL(ρ̂‖π̂) = 2 KL(ρ‖π), since,

    KL(ρ̂‖π̂) = E_{(h,h')∼ρ²} ln [ ρ(h)ρ(h') / (π(h)π(h')) ]
             = E_{h∼ρ} ln( ρ(h)/π(h) ) + E_{h'∼ρ} ln( ρ(h')/π(h') )
             = 2 E_{h∼ρ} ln( ρ(h)/π(h) ) = 2 KL(ρ‖π).   (8)

Let us define the "abstract" loss of ĥ on a pair of examples (x^s, x^t) ∼ D_{S×T} = D_S × D_T by,

    L_{d^{(1)}}(ĥ, x^s, x^t) ≝ [ 1 + L_{0-1}(h(x^s), h'(x^s)) − L_{0-1}(h(x^t), h'(x^t)) ] / 2.

Therefore, the "abstract" risk of ĥ on the joint distribution is defined as,

    R^{(1)}_{D_{S×T}}(ĥ) = E_{x^s∼D_S} E_{x^t∼D_T} L_{d^{(1)}}(ĥ, x^s, x^t),

and the error of the related Gibbs classifier associated with this loss is,

    R^{(1)}_{D_{S×T}}(G_ρ̂) = E_{ĥ∼ρ̂} R^{(1)}_{D_{S×T}}(ĥ).

The empirical counterparts of these two quantities are,

    R^{(1)}_{S×T}(ĥ) = E_{(x^s,x^t)∼S×T} L_{d^{(1)}}(ĥ, x^s, x^t),  and,  R^{(1)}_{S×T}(G_ρ̂) = E_{ĥ∼ρ̂} R^{(1)}_{S×T}(ĥ).

It is easy to show that,

    d^{(1)} = 2 R^{(1)}_{D_{S×T}}(G_ρ̂) − 1,   (9)

    d^{(1)}_{S×T} = 2 R^{(1)}_{S×T}(G_ρ̂) − 1.   (10)

As L_{d^{(1)}} lies in [0,1], we can bound the true R^{(1)}_{D_{S×T}}(G_ρ̂) following the proof process of Th. 2 of the main paper (with c = 2α). To do so, we define the convex function,

    F(p) ≝ − ln[ 1 − (1 − e^{−2α}) p ],   (11)

and consider the non-negative random variable,

    E_{ĥ∼π̂} e^{ m ( F(R^{(1)}_{D_{S×T}}(ĥ)) − 2α R^{(1)}_{S×T}(ĥ) ) }.

We apply Markov's inequality (Lemma 1 of this Supplementary Material). For every δ ∈ (0,1], with a probability at least 1 − δ over the choice of S × T ∼ (D_{S×T})^m, we have,

    E_{ĥ∼π̂} e^{ m ( F(R^{(1)}_{D_{S×T}}(ĥ)) − 2α R^{(1)}_{S×T}(ĥ) ) }
    ≤ (1/δ) E_{S×T∼(D_{S×T})^m} E_{ĥ∼π̂} e^{ m ( F(R^{(1)}_{D_{S×T}}(ĥ)) − 2α R^{(1)}_{S×T}(ĥ) ) }.

By taking the logarithm on each side of the previous inequality, and transforming the expectation over π̂ into an expectation over ρ̂, we obtain that,

    ln E_{ĥ∼ρ̂} [ (π̂(ĥ)/ρ̂(ĥ)) e^{ m ( F(R^{(1)}_{D_{S×T}}(ĥ)) − 2α R^{(1)}_{S×T}(ĥ) ) } ]
    ≤ ln [ (1/δ) E_{S×T∼(D_{S×T})^m} E_{ĥ∼π̂} e^{ m ( F(R^{(1)}_{D_{S×T}}(ĥ)) − 2α R^{(1)}_{S×T}(ĥ) ) } ]
    = ln [ (1/δ) E_{ĥ∼π̂} e^{ m F(R^{(1)}_{D_{S×T}}(ĥ)) } E_{S×T∼(D_{S×T})^m} e^{ −2mα R^{(1)}_{S×T}(ĥ) } ].   (12)

For a classifier ĥ, let us define a random variable X_ĥ that follows a binomial distribution of m trials with a probability of success R^{(1)}_{D_{S×T}}(ĥ), denoted B(m, R^{(1)}_{D_{S×T}}(ĥ)). Lemma 3 gives,

    E_{S×T∼(D_{S×T})^m} e^{ −2mα R^{(1)}_{S×T}(ĥ) }
    ≤ E_{X_ĥ∼B(m, R^{(1)}_{D_{S×T}}(ĥ))} e^{ −2α X_ĥ }
    = Σ_{k=0}^{m} Pr( X_ĥ = k ) e^{−2αk}
    = Σ_{k=0}^{m} (m choose k) ( R^{(1)}_{D_{S×T}}(ĥ) )^k ( 1 − R^{(1)}_{D_{S×T}}(ĥ) )^{m−k} e^{−2αk}
    = [ R^{(1)}_{D_{S×T}}(ĥ) e^{−2α} + 1 − R^{(1)}_{D_{S×T}}(ĥ) ]^m.

The last line, together with the choice of F (Eq. (11)), leads to,

    E_{ĥ∼π̂} e^{ m F(R^{(1)}_{D_{S×T}}(ĥ)) } E_{S×T∼(D_{S×T})^m} e^{ −2mα R^{(1)}_{S×T}(ĥ) }
    ≤ E_{ĥ∼π̂} e^{ m F(R^{(1)}_{D_{S×T}}(ĥ)) } [ R^{(1)}_{D_{S×T}}(ĥ) e^{−2α} + 1 − R^{(1)}_{D_{S×T}}(ĥ) ]^m
    = E_{ĥ∼π̂} 1 = 1.

We can now upper bound Eq. (12) simply by,

    ln E_{ĥ∼ρ̂} [ (π̂(ĥ)/ρ̂(ĥ)) e^{ m ( F(R^{(1)}_{D_{S×T}}(ĥ)) − 2α R^{(1)}_{S×T}(ĥ) ) } ] ≤ ln(1/δ).

Let us insert the term KL(ρ‖π) in the left-hand side of the last inequality and find a lower bound by using Jensen's inequality (Lemma 2) twice, first on the concave logarithm function and then on the convex function F, recalling Eq. (8),

    ln E_{ĥ∼ρ̂} [ (π̂(ĥ)/ρ̂(ĥ)) e^{ m ( F(R^{(1)}_{D_{S×T}}(ĥ)) − 2α R^{(1)}_{S×T}(ĥ) ) } ]
    ≥ E_{ĥ∼ρ̂} [ m ( F(R^{(1)}_{D_{S×T}}(ĥ)) − 2α R^{(1)}_{S×T}(ĥ) ) ] − 2 KL(ρ‖π)
    ≥ m F( E_{ĥ∼ρ̂} R^{(1)}_{D_{S×T}}(ĥ) ) − 2mα E_{ĥ∼ρ̂} R^{(1)}_{S×T}(ĥ) − 2 KL(ρ‖π)
    = m F( R^{(1)}_{D_{S×T}}(G_ρ̂) ) − 2mα R^{(1)}_{S×T}(G_ρ̂) − 2 KL(ρ‖π).

We then have,

    m F( R^{(1)}_{D_{S×T}}(G_ρ̂) ) − 2mα R^{(1)}_{S×T}(G_ρ̂) − 2 KL(ρ‖π) ≤ ln(1/δ).

This, in turn, implies that,

    F( R^{(1)}_{D_{S×T}}(G_ρ̂) ) ≤ 2α R^{(1)}_{S×T}(G_ρ̂) + ( 2 KL(ρ‖π) + ln(1/δ) ) / m.

Now, by isolating R^{(1)}_{D_{S×T}}(G_ρ̂), we obtain,

    R^{(1)}_{D_{S×T}}(G_ρ̂) ≤ (1 / (1 − e^{−2α})) [ 1 − e^{ − ( 2α R^{(1)}_{S×T}(G_ρ̂) + (2 KL(ρ‖π) + ln(1/δ))/m ) } ],

and from the inequality 1 − e^{−x} ≤ x,

    R^{(1)}_{D_{S×T}}(G_ρ̂) ≤ (1 / (1 − e^{−2α})) [ 2α R^{(1)}_{S×T}(G_ρ̂) + (2 KL(ρ‖π) + ln(1/δ))/m ].

It then follows from Equations (9) and (10) that, with probability at least 1 − δ/2 over the choice of S × T ∼ (D_S × D_T)^m, we have,

    (d^{(1)} + 1)/2 ≤ (2α / (1 − e^{−2α})) [ (d^{(1)}_{S×T} + 1)/2 + (2 KL(ρ‖π) + ln(2/δ)) / (m × 2α) ].

We now bound,

    d^{(2)} ≝ E_{(h,h')∼ρ²} [ R_{D_T}(h,h') − R_{D_S}(h,h') ],

using exactly the same argument as for d^{(1)}, except that we instead consider the following "abstract" loss of ĥ on a pair of examples (x^s, x^t) ∼ D_{S×T} = D_S × D_T:

    L_{d^{(2)}}(ĥ, x^s, x^t) ≝ [ 1 + L_{0-1}(h(x^t), h'(x^t)) − L_{0-1}(h(x^s), h'(x^s)) ] / 2.

We then obtain that, with probability at least 1 − δ/2 over the choice of S × T ∼ (D_S × D_T)^m,

    (d^{(2)} + 1)/2 ≤ (2α / (1 − e^{−2α})) [ (d^{(2)}_{S×T} + 1)/2 + (2 KL(ρ‖π) + ln(2/δ)) / (m × 2α) ].

To finish the proof, note that, by definition, we have d^{(1)} = −d^{(2)}, hence |d^{(1)}| = |d^{(2)}| = dis_ρ(D_S, D_T), and,

    |d^{(1)}_{S×T}| = |d^{(2)}_{S×T}| = dis_ρ(S, T).

Then, the maximum of the bound on d^{(1)} and the bound on d^{(2)} gives a bound on dis_ρ(D_S, D_T). Finally, by the union bound, we have that, with probability 1 − δ over the choice of S × T ∼ (D_S × D_T)^m,

    ( |d^{(1)}| + 1 )/2 ≤ (α / (1 − e^{−2α})) [ |d^{(1)}_{S×T}| + 1 + (2 KL(ρ‖π) + ln(2/δ)) / (m × α) ],

or, which is equivalent,

    dis_ρ(D_S, D_T) ≤ (2α / (1 − e^{−2α})) [ dis_ρ(S, T) + (2 KL(ρ‖π) + ln(2/δ)) / (m × α) + 1 ] − 1,

and we are done.

3. Other PAC-Bayesian Bounds

3.1. PAC-Bayesian Bounds with the kl term

Let us recall the PAC-Bayesian bound proposed by Seeger (2002), in which the trade-off between the complexity and the risk is handled by the kl function defined by Equation (7) of this supplementary material.

Theorem 6 (Seeger (2002)). For any domain P_S over X × Y, any set of hypotheses H, any prior distribution π over H, and any δ ∈ (0,1], with a probability at least 1 − δ over the choice of S ∼ (P_S)^m, for every ρ over H, we have,

    kl( R_S(G_ρ) ‖ R_{P_S}(G_ρ) ) ≤ (1/m) [ KL(ρ‖π) + ln( 2√m / δ ) ].

Here is a "Seeger's type" PAC-Bayesian bound for our domain disagreement dis_ρ.

Theorem 7. For any distributions D_S and D_T over X, any set of hypotheses H, any prior distribution π over H, and any δ ∈ (0,1], with a probability at least 1 − δ over the choice of S × T ∼ (D_S × D_T)^m, for every ρ on H, we have,

    kl( (dis_ρ(S,T) + 1)/2 ‖ (dis_ρ(D_S, D_T) + 1)/2 ) ≤ (1/m) [ 2 KL(ρ‖π) + ln( 2√m / δ ) ].
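Before turning to the proof, note that a bound of this implicit form is easy to evaluate numerically, since kl(q‖p) is non-decreasing in p for p ≥ q and can therefore be inverted by bisection. The sketch below uses hypothetical values for the empirical disagreement, the KL term, m and δ; it is our illustration, not part of the paper:

```python
from math import log, sqrt

def kl_bernoulli(q, p):
    """kl(q||p) of Eq. (7), with clamping for numerical safety."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * log(q / p) + (1 - q) * log((1 - q) / (1 - p))

def kl_inv_upper(q, epsilon, tol=1e-9):
    """Largest p >= q with kl(q||p) <= epsilon, found by bisection."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if kl_bernoulli(q, mid) <= epsilon:
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical values: dis_hat stands for dis_rho(S, T), KL for KL(rho||pi).
m, delta, KL, dis_hat = 1000, 0.05, 5.0, 0.1
eps_bound = (2 * KL + log(2 * sqrt(m) / delta)) / m
dis_upper = 2 * kl_inv_upper((dis_hat + 1) / 2, eps_bound) - 1
print(dis_upper)  # upper bound on dis_rho(D_S, D_T) implied by Theorem 7
```

The shift by (·+1)/2 maps the disagreement, which lives in [−1, 1] before taking absolute values, into the [0, 1] range on which kl is defined.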


Proof. Similarly as in the proof of Theorem 3, we will first bound,

    d^{(1)} ≝ E_{(h,h')∼ρ²} [ R_{D_S}(h,h') − R_{D_T}(h,h') ],

by its empirical counterpart,

    d^{(1)}_{S×T} ≝ E_{(h,h')∼ρ²} [ R_S(h,h') − R_T(h,h') ],

and some extra terms related to the Kullback-Leibler divergence between the posterior and the prior. However, a notable difference with the proof of Theorem 3 is that the obtained bound will be simultaneously valid as an upper and a lower bound. Because of this, there will be no need here to redo the whole proof to bound,

    d^{(2)} ≝ E_{(h,h')∼ρ²} [ R_{D_T}(h,h') − R_{D_S}(h,h') ],

and the present proof will not require the use of the union bound argument.

Again, we consider "abstract" classifiers ĥ ∈ H² whose loss on a pair of examples (x^s, x^t) ∼ D_{S×T} is defined by,

    L_{d^{(1)}}(ĥ, x^s, x^t) ≝ [ 1 + L_{0-1}(h(x^s), h'(x^s)) − L_{0-1}(h(x^t), h'(x^t)) ] / 2.

Note that, again, L_{d^{(1)}} lies in [0,1], and that R^{(1)}_{S×T}(ĥ) and R^{(1)}_{D_{S×T}}(ĥ) are as defined in the proof of Theorem 3.

Now, let us consider the non-negative random variable,

    E_{ĥ∼π̂} e^{ m · kl( R^{(1)}_{S×T}(ĥ) ‖ R^{(1)}_{D_{S×T}}(ĥ) ) }.

We apply Markov's inequality (Lemma 1). For every δ ∈ (0,1], with a probability at least 1 − δ over the choice of S × T ∼ (D_{S×T})^m, we have,

    E_{ĥ∼π̂} e^{ m · kl( R^{(1)}_{S×T}(ĥ) ‖ R^{(1)}_{D_{S×T}}(ĥ) ) }
    ≤ (1/δ) E_{S×T∼(D_{S×T})^m} E_{ĥ∼π̂} e^{ m · kl( R^{(1)}_{S×T}(ĥ) ‖ R^{(1)}_{D_{S×T}}(ĥ) ) }.

By taking the logarithm on each side of the previous inequality, and transforming the expectation over π̂ into an expectation over ρ̂, we then obtain that,

    ln E_{ĥ∼ρ̂} [ (π̂(ĥ)/ρ̂(ĥ)) e^{ m · kl( R^{(1)}_{S×T}(ĥ) ‖ R^{(1)}_{D_{S×T}}(ĥ) ) } ]   (13)
    ≤ ln [ (1/δ) E_{S×T∼(D_{S×T})^m} E_{ĥ∼π̂} e^{ m · kl( R^{(1)}_{S×T}(ĥ) ‖ R^{(1)}_{D_{S×T}}(ĥ) ) } ]
    ≤ ln( 2√m / δ ).

The last inequality comes from Maurer's lemma (Lemma 4).

Let us now re-write a part of the left-hand side of Equation (13) as KL(ρ‖π), and find a lower bound by using Jensen's inequality (Lemma 2) twice, first on the concave logarithm function, and then on the convex function kl,

    ln E_{ĥ∼ρ̂} [ (π̂(ĥ)/ρ̂(ĥ)) e^{ m · kl( R^{(1)}_{S×T}(ĥ) ‖ R^{(1)}_{D_{S×T}}(ĥ) ) } ]
    ≥ E_{ĥ∼ρ̂} [ m · kl( R^{(1)}_{S×T}(ĥ) ‖ R^{(1)}_{D_{S×T}}(ĥ) ) ] − 2 KL(ρ‖π)
    ≥ m · kl( E_{ĥ∼ρ̂} R^{(1)}_{S×T}(ĥ) ‖ E_{ĥ∼ρ̂} R^{(1)}_{D_{S×T}}(ĥ) ) − 2 KL(ρ‖π)
    = m · kl( R^{(1)}_{S×T}(G_ρ̂) ‖ R^{(1)}_{D_{S×T}}(G_ρ̂) ) − 2 KL(ρ‖π).

This implies that,

    kl( R^{(1)}_{S×T}(G_ρ̂) ‖ R^{(1)}_{D_{S×T}}(G_ρ̂) ) ≤ (1/m) [ 2 KL(ρ‖π) + ln( 2√m / δ ) ].

Since, as in the proof of Theorem 3 for d^{(1)}, we have d^{(1)} = 2 R^{(1)}_{D_{S×T}}(G_ρ̂) − 1 and d^{(1)}_{S×T} = 2 R^{(1)}_{S×T}(G_ρ̂) − 1, the previous line directly implies a bound on d^{(1)} from its empirical counterpart d^{(1)}_{S×T}. Hence, with probability at least 1 − δ over the choice of S × T ∼ (D_S × D_T)^m, we have,

    kl( (d^{(1)}_{S×T} + 1)/2 ‖ (d^{(1)} + 1)/2 ) ≤ (1/m) [ 2 KL(ρ‖π) + ln( 2√m / δ ) ].   (14)

We claim that we also have,

    kl( (|d^{(1)}_{S×T}| + 1)/2 ‖ (|d^{(1)}| + 1)/2 ) ≤ (1/m) [ 2 KL(ρ‖π) + ln( 2√m / δ ) ],   (15)

which, since |d^{(1)}| = dis_ρ(D_S, D_T) and |d^{(1)}_{S×T}| = dis_ρ(S, T), implies the result. Hence, to finish the proof, let us prove the claim of Equation (15). There are four cases to consider.

Case 1: d^{(1)}_{S×T} ≥ 0 and d^{(1)} ≥ 0. There is nothing to prove, since in that case Equations (14) and (15) coincide.

Case 2: d^{(1)}_{S×T} ≤ 0 and d^{(1)} ≤ 0. This case reduces to Case 1 because of the following property of kl(·‖·):

    kl( (a+1)/2 ‖ (b+1)/2 ) = kl( (−a+1)/2 ‖ (−b+1)/2 ).   (16)

Case 3: d^{(1)}_{S×T} ≤ 0 and d^{(1)} ≥ 0. Write q = (d^{(1)}_{S×T} + 1)/2 and p = (d^{(1)} + 1)/2, and note that (|d^{(1)}_{S×T}| + 1)/2 = 1 − q and (|d^{(1)}| + 1)/2 = p. From straightforward calculations, one can show that,

    kl( (d^{(1)}_{S×T}+1)/2 ‖ (d^{(1)}+1)/2 ) − kl( (|d^{(1)}_{S×T}|+1)/2 ‖ (|d^{(1)}|+1)/2 )
    = kl( q ‖ p ) − kl( 1 − q ‖ p )
    = [ q ln q + (1−q) ln(1−q) − q ln p − (1−q) ln(1−p) ]
      − [ (1−q) ln(1−q) + q ln q − (1−q) ln p − q ln(1−p) ]
    = (1 − 2q) ln( p / (1−p) )
    = −d^{(1)}_{S×T} ln( (d^{(1)} + 1) / (−d^{(1)} + 1) )
    ≥ 0.   (17)

The last inequality follows from the fact that we have d^{(1)}_{S×T} ≤ 0 and d^{(1)} ≥ 0, so that the logarithm is non-negative. Hence, from Equations (17) and (14), we have,

    kl( (|d^{(1)}_{S×T}|+1)/2 ‖ (|d^{(1)}|+1)/2 )
    ≤ kl( (d^{(1)}_{S×T}+1)/2 ‖ (d^{(1)}+1)/2 )
    ≤ (1/m) [ 2 KL(ρ‖π) + ln( 2√m / δ ) ],

as wanted.

Case 4: d^{(1)}_{S×T} ≥ 0 and d^{(1)} ≤ 0. Again because of Equation (16), this case reduces to Case 3, and we are done.

From the preceding "Seeger's type" results, one can then obtain the following PAC-Bayesian DA-bound.

Theorem 8. For any domains P_S and P_T (respectively with marginals D_S and D_T) over X × Y, any set of hypotheses H, any prior distribution π over H, and any δ ∈ (0,1], with a probability at least 1 − δ over the choice of S × T ∼ (P_S × D_T)^m, we have,

    R_{P_T}(G_ρ) − R_{P_T}(G_{ρ*_T}) ≤ sup R_ρ + sup D_ρ + λ_ρ,

where λ_ρ ≝ R_{D_T}(G_ρ, G_{ρ*_T}) + R_{D_S}(G_ρ, G_{ρ*_T}), and,

    R_ρ ≝ { r : kl( R_S(G_ρ) ‖ r ) ≤ (1/m) [ KL(ρ‖π) + ln( 4√m / δ ) ] },
    D_ρ ≝ { d : kl( (dis_ρ(S,T)+1)/2 ‖ (d+1)/2 ) ≤ (1/m) [ 2 KL(ρ‖π) + ln( 4√m / δ ) ] }.

Proof. The result is obtained by inserting Ths. 6 and 7 (with δ := δ/2) in Th. 4 of the main paper.

3.2. PAC-Bayesian Bounds when m ≠ m′

In the main paper, for the sake of simplicity, we restrict ourselves to the case where m (the size of the source set S) and m′ (the size of the target set T) are equal. All the results generalize to the m ≠ m′ case. In this subsection, we show how it can be done from a "McAllester's type" of bound; similar results can be achieved for a "Catoni's type" or a "Seeger's type".

First, we recall the PAC-Bayesian bound proposed by McAllester (2003), which is stated without a term allowing to control the trade-off between the complexity and the risk.

Theorem 9 (McAllester (2003)). For any domain P_S over X × Y, any set of hypotheses H, any prior distribution π over H, and any δ ∈ (0,1], with a probability at least 1 − δ over the choice of S ∼ (P_S)^m, for every ρ over H, we have,

    R_{P_S}(G_ρ) − R_S(G_ρ) ≤ sqrt( (1/(2m)) [ KL(ρ‖π) + ln( 2√m / δ ) ] ).

Now we can prove the following consistency bound for dis_ρ(D_S, D_T) when m ≠ m′.

Theorem 10. For any marginal distributions D_S and D_T over X, any set of hypotheses H, any prior distribution π over H, and any δ ∈ (0,1], with a probability at least 1 − δ over the choice of S ∼ (D_S)^m and T ∼ (D_T)^{m′}, for every ρ over H, we have,

    | dis_ρ(D_S, D_T) − dis_ρ(S, T) |
    ≤ sqrt( (1/(2m)) [ 2 KL(ρ‖π) + ln( 4√m / δ ) ] ) + sqrt( (1/(2m′)) [ 2 KL(ρ‖π) + ln( 4√m′ / δ ) ] ).

Proof. Let us consider the non-negative random variable,

    E_{(h,h')∼π²} e^{ 2m ( R_{D_S}(h,h') − R_S(h,h') )² }.

We apply Markov's inequality (Lemma 1). For every δ ∈ (0,1], with a probability at least 1 − δ over the choice of S ∼ (D_S)^m, we have,

    E_{(h,h')∼π²} e^{ 2m ( R_{D_S}(h,h') − R_S(h,h') )² }
    ≤ (1/δ) E_{S∼(D_S)^m} E_{(h,h')∼π²} e^{ 2m ( R_{D_S}(h,h') − R_S(h,h') )² }.

By taking the logarithm on each side of the previous inequality and transforming the expectation over π² into an expectation over ρ², we obtain that, for every δ ∈ (0,1], with a probability at least 1 − δ over the choice of S ∼ (D_S)^m, and for every posterior distribution ρ, we have,

    ln E_{(h,h')∼ρ²} [ ( π(h)π(h') / (ρ(h)ρ(h')) ) e^{ 2m ( R_{D_S}(h,h') − R_S(h,h') )² } ]
    ≤ ln [ (1/δ) E_{S∼(D_S)^m} E_{(h,h')∼π²} e^{ 2m ( R_{D_S}(h,h') − R_S(h,h') )² } ].

Since ln(·) is a concave function, we can apply Jensen's inequality (Lemma 2) to the left-hand side,

    ln E_{(h,h')∼ρ²} [ ( π(h)π(h') / (ρ(h)ρ(h')) ) e^{ 2m ( R_{D_S}(h,h') − R_S(h,h') )² } ]
    ≥ E_{(h,h')∼ρ²} ln [ π(h)π(h') / (ρ(h)ρ(h')) ] + E_{(h,h')∼ρ²} 2m ( R_{D_S}(h,h') − R_S(h,h') )².

By the Equation (8),

    E_{(h,h')∼ρ²} ln [ π(h)π(h') / (ρ(h)ρ(h')) ] = −2 KL(ρ‖π).

Hence, for every δ ∈ (0,1], with a probability at least 1 − δ over the choice of S ∼ (D_S)^m, and for every posterior distribution ρ,

    −2 KL(ρ‖π) + E_{(h,h')∼ρ²} 2m ( R_{D_S}(h,h') − R_S(h,h') )²
    ≤ ln [ (1/δ) E_{S∼(D_S)^m} E_{(h,h')∼π²} e^{ 2m ( R_{D_S}(h,h') − R_S(h,h') )² } ].

Since 2m(a − b)² is a convex function, we again apply Jensen's inequality,

    ( E_{(h,h')∼ρ²} [ R_{D_S}(h,h') − R_S(h,h') ] )² ≤ E_{(h,h')∼ρ²} ( R_{D_S}(h,h') − R_S(h,h') )².

Thus, for every δ ∈ (0,1], with a probability at least 1 − δ over the choice of S ∼ (D_S)^m, and for every posterior distribution ρ, we have,

    2m ( E_{(h,h')∼ρ²} R_{D_S}(h,h') − E_{(h,h')∼ρ²} R_S(h,h') )²
    ≤ 2 KL(ρ‖π) + ln [ (1/δ) E_{S∼(D_S)^m} E_{(h,h')∼π²} e^{ 2m ( R_{D_S}(h,h') − R_S(h,h') )² } ].

Let us now bound,

    ln [ (1/δ) E_{S∼(D_S)^m} E_{(h,h')∼π²} e^{ 2m ( R_{D_S}(h,h') − R_S(h,h') )² } ].

To do so, we have,

    E_{S∼(D_S)^m} E_{(h,h')∼π²} e^{ 2m ( R_{D_S}(h,h') − R_S(h,h') )² }
    = E_{(h,h')∼π²} E_{S∼(D_S)^m} e^{ 2m ( R_{D_S}(h,h') − R_S(h,h') )² }   (18)
    ≤ E_{(h,h')∼π²} E_{S∼(D_S)^m} e^{ m · kl( R_S(h,h') ‖ R_{D_S}(h,h') ) }   (19)
    ≤ 2√m.   (20)

Line (18) comes from the independence between D_S and π². Pinsker's inequality,

    2 (q − p)² ≤ kl(q ‖ p)   for any p, q ∈ [0,1],

gives Line (19). The last Line (20) comes from Maurer's lemma (Lemma 4). Thus, for every δ ∈ (0,1], with a probability at least 1 − δ over the choice of S ∼ (D_S)^m, and for every posterior distribution ρ, we obtain,

    2m ( E_{(h,h')∼ρ²} R_{D_S}(h,h') − E_{(h,h')∼ρ²} R_S(h,h') )² ≤ 2 KL(ρ‖π) + ln( 2√m / δ ),

or, equivalently,

    | E_{(h,h')∼ρ²} R_{D_S}(h,h') − E_{(h,h')∼ρ²} R_S(h,h') | ≤ sqrt( (1/(2m)) [ 2 KL(ρ‖π) + ln( 2√m / δ ) ] ).   (21)

Following the same proof process for bounding | E_{(h,h')∼ρ²} R_{D_T}(h,h') − E_{(h,h')∼ρ²} R_T(h,h') |, we obtain the following result. For every δ ∈ (0,1], with a probability at least 1 − δ over the choice of T ∼ (D_T)^{m′}, and for every posterior distribution ρ,

    | E_{(h,h')∼ρ²} R_{D_T}(h,h') − E_{(h,h')∼ρ²} R_T(h,h') | ≤ sqrt( (1/(2m′)) [ 2 KL(ρ‖π) + ln( 2√m′ / δ ) ] ).   (22)

Finally, let us substitute δ by δ/2 in Inequalities (21) and (22). This, together with the union bound that assures that both results hold simultaneously, gives the

result because,

    E_{(h,h')∼ρ²} [ R_{D_T}(h,h') − R_{D_S}(h,h') ] = dis_ρ(D_S, D_T),
    E_{(h,h')∼ρ²} [ R_T(h,h') − R_S(h,h') ] = dis_ρ(S, T),

and because, if |a_1 − b_1| ≤ c_1 and |a_2 − b_2| ≤ c_2, then |(a_1 − a_2) − (b_1 − b_2)| ≤ c_1 + c_2.

Then we can obtain the following PAC-Bayesian DA-bound.

Theorem 11. For any domains P_S and P_T (respectively with marginals D_S and D_T) over X × Y, any set H of hypotheses, any prior distribution π over H, and any δ ∈ (0,1], with a probability at least 1 − δ over the choice of S_1 ∼ (D_S)^m, S_2 ∼ (D_S)^m, and T ∼ (D_T)^{m′}, for every ρ over H, we have,

    R_{P_T}(G_ρ) − R_{P_T}(G_{ρ*_T}) ≤ R_{S_1}(G_ρ) + dis_ρ(S_2, T) + λ_ρ
      + sqrt( (1/(2m)) [ KL(ρ‖π) + ln( 4√m / δ ) ] )
      + sqrt( (1/(2m)) [ 2 KL(ρ‖π) + ln( 8√m / δ ) ] )
      + sqrt( (1/(2m′)) [ 2 KL(ρ‖π) + ln( 8√m′ / δ ) ] ),

where λ_ρ ≝ R_{D_T}(G_ρ, G_{ρ*_T}) + R_{D_S}(G_ρ, G_{ρ*_T}).

Proof. The result is obtained by inserting Ths. 9 and 10 (with δ := δ/2) in Th. 4 of the main paper.

4. PBDA Algorithm Details

4.1. Objective function and gradient

Given a source sample S = {(x_i^s, y_i^s)}_{i=1}^m, a target sample T = {x_i^t}_{i=1}^m, and fixed parameters A > 0 and C > 0, the learning algorithm PBDA consists in finding the weight vector w minimizing,

    ‖w‖²/2 + C Σ_{i=1}^m Φ_cvx( y_i^s (w·x_i^s) / ‖x_i^s‖ )
    + A | Σ_{i=1}^m [ Φ_dis( (w·x_i^t) / ‖x_i^t‖ ) − Φ_dis( (w·x_i^s) / ‖x_i^s‖ ) ] |,   (23)

where, Erf being the Gauss error function,

    Φ(a) ≝ (1/2) [ 1 − Erf( a / √2 ) ],
    Φ_cvx(a) ≝ max( Φ(a), 1/2 − a/√(2π) ),
    Φ_dis(a) ≝ 2 × Φ(a) × Φ(−a).

Figure 1 illustrates these three functions.

[Figure 1. Behaviour of the functions Φ(·), Φ_cvx(·) and Φ_dis(·).]

The gradient of the Equation (23) is given by,

    w + C Σ_{i=1}^m Φ'_cvx( y_i^s (w·x_i^s)/‖x_i^s‖ ) · y_i^s x_i^s / ‖x_i^s‖
    + s × A Σ_{i=1}^m [ Φ'_dis( (w·x_i^t)/‖x_i^t‖ ) · x_i^t / ‖x_i^t‖ − Φ'_dis( (w·x_i^s)/‖x_i^s‖ ) · x_i^s / ‖x_i^s‖ ],

where Φ'_cvx(a) and Φ'_dis(a) are respectively the derivatives of the functions Φ_cvx and Φ_dis evaluated at point a, and,

    s = sgn( Σ_{i=1}^m [ Φ_dis( (w·x_i^t)/‖x_i^t‖ ) − Φ_dis( (w·x_i^s)/‖x_i^s‖ ) ] ).

4.2. Using a kernel function

The kernel trick allows us to work with a dual weight vector α ∈ R^{2m} that represents a linear classifier in an augmented space. Given a kernel k : R^d × R^d → R, we have,

    h_w(x) = Σ_{i=1}^m α_i k(x_i^s, x) + Σ_{i=1}^m α_{i+m} k(x_i^t, x).

Let us denote K the kernel matrix of size 2m × 2m such that K_{i,j} = k(x_i, x_j), where,

    x_# = x_#^s  if # ≤ m,  and  x_# = x_{#−m}^t  otherwise.

In that case, the objective function of Equation (23) is rewritten in terms of the vector α = (α_1, α_2, ..., α_{2m}) as,

    (1/2) Σ_{i=1}^{2m} Σ_{j=1}^{2m} α_i α_j K_{i,j}
    + C Σ_{i=1}^m Φ_cvx( y_i^s ( Σ_{j=1}^{2m} α_j K_{i,j} ) / √(K_{i,i}) )
    + A | Σ_{i=1}^m [ Φ_dis( ( Σ_{j=1}^{2m} α_j K_{i+m,j} ) / √(K_{i+m,i+m}) ) − Φ_dis( ( Σ_{j=1}^{2m} α_j K_{i,j} ) / √(K_{i,i}) ) ] |.


The gradient of the latter equation is given by the vector α' = (α'_1, α'_2, ..., α'_{2m}), with α'_# equal to,

    α'_# = Σ_{i=1}^{2m} α_i K_{i,#}
    + C Σ_{i=1}^m Φ'_cvx( y_i^s ( Σ_{j=1}^{2m} α_j K_{i,j} ) / √(K_{i,i}) ) · y_i^s K_{i,#} / √(K_{i,i})
    + s × A Σ_{i=1}^m [ Φ'_dis( ( Σ_{j=1}^{2m} α_j K_{i+m,j} ) / √(K_{i+m,i+m}) ) · K_{i+m,#} / √(K_{i+m,i+m})
                        − Φ'_dis( ( Σ_{j=1}^{2m} α_j K_{i,j} ) / √(K_{i,i}) ) · K_{i,#} / √(K_{i,i}) ],

where,

    s = sgn( Σ_{i=1}^m [ Φ_dis( ( Σ_{j=1}^{2m} α_j K_{i+m,j} ) / √(K_{i+m,i+m}) ) − Φ_dis( ( Σ_{j=1}^{2m} α_j K_{i,j} ) / √(K_{i,i}) ) ] ).

4.3. Implementation details

For our experiments, we minimize the objective function using a Broyden-Fletcher-Goldfarb-Shanno (BFGS) method implemented in the scipy Python library¹. Our code is available at the following URL: http://graal.ift.ulaval.ca/pbda/

When selecting hyperparameters by reverse cross-validation, we search on a 20 × 20 parameter grid for a parameter A between 0.01 and 10^6 and a parameter C between 1.0 and 10^8, both on a logarithmic scale.
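The hyperparameter grid described above can be reproduced with NumPy's logspace; a small sketch (construction ours, ranges as stated):

```python
import numpy as np

# 20 x 20 grid, A in [0.01, 1e6] and C in [1.0, 1e8], on a logarithmic scale.
A_grid = np.logspace(np.log10(0.01), np.log10(1e6), num=20)
C_grid = np.logspace(np.log10(1.0), np.log10(1e8), num=20)

candidates = [(A, C) for A in A_grid for C in C_grid]
print(len(candidates))  # 400
```

Each of the 400 (A, C) pairs would then be scored by the reverse cross-validation procedure of the main paper.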

References

Maurer, A. A note on the PAC-Bayesian theorem. CoRR, cs.LG/0411099, 2004.

McAllester, D. PAC-Bayesian stochastic model selection. Machine Learning, 51:5-21, 2003.

Seeger, M. PAC-Bayesian generalization bounds for Gaussian processes. Journal of Machine Learning Research, 3:233-269, 2002.

¹ Available at http://www.scipy.org/