Heuristic Search Value Iteration
Trey Smith

Presenter: Guillermo Vázquez, November 2007

What is HSVI?
• Heuristic Search Value Iteration is an algorithm that approximates POMDP solutions.
• HSVI stores an upper and a lower bound on the optimal value function V*.
• It selects belief points at which to update the upper and lower bounds, bringing the bounds closer to V*.
• The belief points to be updated are selected by heuristic techniques that explore the POMDP's search graph.

HSVI's basic idea

[Figure: V^U(b) is the upper bound, V*(b) is the exact optimal value function, and V^L(b) is the lower bound, plotted over beliefs between b1 and b2.]

HSVI's basic idea: locally updating at b

[Figure: the same bounds after a local update at belief b; the update brings V^U(b) and V^L(b) closer to V*(b) at b, shown over beliefs between b1 and b2.]

Why is HSVI a point-based algorithm?
• The main problem with exact value iteration algorithms is that they generate an exponential number of vectors in each iteration.
• Say we have |V'| vectors representing the value function at horizon t; in the worst case the next value function V will have |V| = |A||V'|^|O| vectors, where A is the set of actions and O is the set of observations.
• Every iteration (update) therefore results in exponential growth in the number of vectors representing V.
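To make the worst-case growth concrete, here is a tiny numeric illustration; the sizes below (|A| = 3, |O| = 2, |V'| = 5) are made up for the example and are not from the slides.

```python
# Worst-case vector count after one exact backup: |V| = |A| * |V'|^|O|
num_actions = 3        # |A|   (hypothetical)
num_observations = 2   # |O|   (hypothetical)
num_vectors_prev = 5   # |V'|  at horizon t (hypothetical)

num_vectors_next = num_actions * num_vectors_prev ** num_observations
print(num_vectors_next)  # 3 * 5**2 = 75 vectors after a single backup
```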

Why is HSVI a point-based algorithm? (cont.)
• Exact value iteration algorithms plan for all beliefs in the belief simplex.
• But some beliefs are much less likely to be reached than others, so it seems unnecessary to plan equally for all beliefs.
• Point-based value iteration algorithms focus on the most probable beliefs.

HSVI – Notation
• Lower bound denoted by V^L
• Upper bound denoted by V^U
• Define the interval function V̂(b) = [V^L(b), V^U(b)]
• Define the width (i.e. difference) of the interval function at b to be width(V̂(b)) = V^U(b) − V^L(b)
• The width at b is the uncertainty at b

HSVI – Algorithm Outline
• Initialize bounds V^L, V^U
• While width(V̂(b0)) > ε: explore(b0, ε, 0)
• Return policy π

function explore(b, ε, t):
• if width(V̂(b)) ≤ ε γ^(−t), return
• select an action a* and an observation o* according to the search heuristics
• call explore(τ(b, a*, o*), ε, t+1)
• perform a point-based update of V̂ at b

HSVI – Bounds
• The lower bound V^L is represented by the usual set Γ of alpha vectors
• Updating the lower bound V^L means adding a vector to the set Γ
• The upper bound V^U is represented by a finite set Υ of belief/value points (b_i, υ_i)
• Updating the upper bound V^U means adding a new point to the set Υ
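A minimal numpy sketch of the two representations (class and method names are illustrative): the lower bound is evaluated as the best dot product over Γ, while the upper bound simply stores the point set Υ; evaluating V^U between stored points (done with a sawtooth-style projection in the paper) is omitted here.

```python
import numpy as np

class LowerBound:
    # V^L: a set Gamma of alpha vectors; the value at b is the best dot product.
    def __init__(self, alpha_vectors):
        self.gamma = [np.asarray(a, dtype=float) for a in alpha_vectors]

    def value(self, b):
        return max(float(alpha @ np.asarray(b, dtype=float)) for alpha in self.gamma)

    def add(self, alpha):
        # A point-based update adds one new alpha vector.
        self.gamma.append(np.asarray(alpha, dtype=float))

class UpperBound:
    # V^U: a finite set Upsilon of (belief, value) points.
    def __init__(self, points):
        self.upsilon = [(np.asarray(b, dtype=float), float(v)) for b, v in points]

    def add(self, b, v):
        # A point-based update adds one new belief/value point.
        self.upsilon.append((np.asarray(b, dtype=float), float(v)))
```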

HSVI – Lower Bound V^L initialization
• The lower bound V^L is initialized using the blind policy method suggested in [Hauskrecht, 1997]
• Compute the value functions of all "one-action" policies. A one-action "policy" is to always select a particular action a.
• Such a method then gives a lower bound with |A| vectors:
• V^L_blind := max{α_1, α_2, ..., α_|A|}
• The idea is that the least "worst" you can do is to always choose the "safest" (i.e. maximum expected value) action.

Why use the Blind Policy Method?
• All POMDPs have blind policies
• The value function of a blind policy is easy to compute and is linear, so the blind policy method generates a PWLC representation
• The class contains only |A| policies, so it is easy to evaluate them all: O(|A||S|³)
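A minimal numpy sketch of this initialization (array names and shapes are assumptions, not from the slides): each one-action policy's alpha vector is computed by iterating its Bellman equation α_a(s) = R(s,a) + γ Σ_{s'} P(s'|s,a) α_a(s') until it is close to its fixed point.

```python
import numpy as np

def blind_policy_alphas(R, T, gamma=0.95, iters=500):
    """R[s, a]: immediate reward; T[a, s, s']: P(s'|s,a)  (assumed shapes).
    Returns one alpha vector per action; V^L(b) = max_a alpha_a . b."""
    num_states, num_actions = R.shape
    alphas = np.zeros((num_actions, num_states))
    for _ in range(iters):
        for a in range(num_actions):
            # alpha_a(s) = R(s,a) + gamma * sum_{s'} P(s'|s,a) * alpha_a(s')
            alphas[a] = R[:, a] + gamma * T[a] @ alphas[a]
    return alphas
```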

V^L of the Tiger Problem

• Using a discount factor γ = 0.95

HSVI – Upper Bound V^U initialization
• The upper bound V^U is initialized using the Fast Informed Bound (FIB) approximation [Hauskrecht, 2000]
• Solve the underlying MDP problem; denote its value function by V_MDP
• Use that vector to initialize each α_a, a ∈ A
• Hauskrecht computes the upper bound V_FIB based on the observation that it is equal to the optimal value function of a certain MDP with |A||O||S| states; this MDP can be constructed and solved in time polynomial in |A||O||S|
• HSVI uses a simple iterative approach to approximate V_FIB
• This approximation keeps one vector α_a for each action:

α_{t+1}^a(s) = R(s, a) + γ Σ_o max_{a'} Σ_{s'} Pr(s', o | s, a) α_t^{a'}(s')

HSVI – Upper Bound V^U initialization (cont.)
• Such a method then gives an upper bound with |A| vectors: V_FIB = {α_1, α_2, ..., α_|A|}
• When the FIB iteration is stopped, each corner point of the belief simplex, corresponding to a state s, is initialized to the maximum value max_a α_a(s)
• The basic concept of this approach is to be optimistic about a solution (i.e. we could do better with more information)
• Simply solving the underlying MDP is too optimistic and produces a weak upper bound
• FIB tries to give a tighter upper bound by not being too optimistic and taking some uncertainty into account
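A minimal numpy sketch of this iterative FIB approximation (array names and shapes are assumptions): one alpha vector per action, updated with the equation above until it roughly converges.

```python
import numpy as np

def fib_alphas(R, T, O, gamma=0.95, iters=500):
    """R[s, a]: reward; T[a, s, s']: P(s'|s,a); O[a, s', o]: P(o|a,s') (assumed shapes).
    Returns one alpha vector per action; the corner value for state s is max_a alpha_a(s)."""
    num_states, num_actions = R.shape
    num_obs = O.shape[2]
    alphas = np.zeros((num_actions, num_states))
    for _ in range(iters):
        new = np.empty_like(alphas)
        for a in range(num_actions):
            total = np.zeros(num_states)
            for o in range(num_obs):
                joint = T[a] * O[a, :, o]    # joint[s, s'] = P(s'|s,a) * P(o|a,s')
                inner = joint @ alphas.T     # inner[s, a'] = sum_{s'} joint[s,s'] * alpha_{a'}(s')
                total += inner.max(axis=1)   # max over a' for each state s
            new[a] = R[:, a] + gamma * total
        alphas = new
    return alphas
```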

V^U of the Tiger Problem

• Only the endpoints of the upper bound are added to Υ

HSVI – Algorithm Outline
• Initialize bounds V^L, V^U
• While width(V̂(b0)) > ε: explore(b0, ε, 0)
• Return policy π

function explore(b, ε, t):
• if width(V̂(b)) ≤ ε γ^(−t), return
• select an action a* and an observation o* according to the search heuristics
• call explore(τ(b, a*, o*), ε, t+1)
• perform a point-based update of V̂ at b

HSVI – While loop
• While the width (i.e. distance) between V^U and V^L at the given belief point b0 is greater than a specified regret (precision) ε, repeatedly explore the search graph
• A trial starts at b0 and explores forward
• At each forward step, the current node is updated and a successor node is chosen via heuristics for picking an action a* and an observation o*
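A minimal numpy sketch of the belief update τ(b, a, o) used for the forward step (array names and shapes are assumptions, matching the earlier sketches):

```python
import numpy as np

def next_belief(b, a, o, T, O):
    """tau(b, a, o): belief after taking action a in belief b and observing o.
    T[a, s, s'] = P(s'|s,a); O[a, s', o] = P(o|a,s')  (assumed shapes)."""
    b = np.asarray(b, dtype=float)
    unnormalized = O[a, :, o] * (b @ T[a])   # P(o|a,s') * sum_s b(s) P(s'|s,a)
    prob_o = unnormalized.sum()              # Pr(o | b, a)
    return unnormalized / prob_o if prob_o > 0 else unnormalized
```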

HSVI – Search graph for Tiger Problem

HSVI – Algorithm Outline
• Initialize bounds V^L, V^U
• While width(V̂(b0)) > ε: explore(b0, ε, 0)
• Return policy π

function explore(b, ε, t):
• if width(V̂(b)) ≤ ε γ^(−t), return
• select an action a* and an observation o* according to the search heuristics
• call explore(τ(b, a*, o*), ε, t+1)
• perform a point-based update of V̂ at b

HSVI – What does explore() do?
• The explore function selects action a* and observation o* to decide which child of the current node b to visit next; the child node is τ(b, a*, o*) (i.e. the belief state that results from taking action a* and "seeing" observation o* at belief b)
• We formally define the regret of a policy π at belief b to be regret(π, b) = V^{π*}(b) − V^π(b)
• That is, the regret is the difference between the optimal value at belief b and the value at belief b under policy π
• Because we want to return a policy π with small regret, HSVI prioritizes the updates that will most reduce the regret at b0 (i.e. reduce the uncertainty at b0, denoted width(V̂(b0)))

HSVI – How to select action a*?
• Define the interval function Q̂(b, a) = [Q^{V^L}(b, a), Q^{V^U}(b, a)]
• We greedily select the action a* that looks best under the upper bound: a* = argmax_a Q^{V^U}(b, a)
• The idea behind choosing a* greedily is that actions that currently seem to perform well are more likely to be part of an optimal policy
• Thus selecting such actions will lead HSVI to update beliefs whose values are relevant to good policies
• This is sometimes called the IE-MAX heuristic [Kaelbling, 1995]
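A minimal sketch of this greedy choice (the helpers reward(b, a), obs_prob(b, a, o), upper_value(b), next_belief(b, a, o) and the iterables actions and observations are assumptions, not from the slides):

```python
GAMMA = 0.95  # discount factor (the tiger example uses 0.95)

def q_upper(b, a):
    # Q^{V^U}(b, a) = R(b, a) + gamma * sum_o Pr(o|b,a) * V^U(tau(b,a,o))
    return reward(b, a) + GAMMA * sum(
        obs_prob(b, a, o) * upper_value(next_belief(b, a, o))
        for o in observations)

def select_action(b):
    # IE-MAX heuristic: pick the action that looks best under the upper bound.
    return max(actions, key=lambda a: q_upper(b, a))
```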

HSVI – How to select observation o*?
• HSVI uses the weighted excess uncertainty heuristic
• Excess uncertainty at belief b with depth t in the search tree is defined to be excess(b, t) = width(V̂(b)) − ε γ^(−t)
• Excess uncertainty has the property that if all the children of a node b have negative excess uncertainty, then after an update b will also have negative excess uncertainty
• Negative excess uncertainty at the root implies the desired convergence to regret ε
• This heuristic is designed to focus attention on the child node with the greatest contribution to the excess uncertainty at the parent: o* = argmax_o [ Pr(o | b, a*) · excess(τ(b, a*, o), t+1) ]
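A minimal sketch of this choice, under the same assumed helpers as the action-selection sketch plus width(b) = V^U(b) − V^L(b) and a target regret epsilon:

```python
GAMMA = 0.95  # discount factor (the tiger example uses 0.95)

def excess(b, t, epsilon):
    # excess(b, t) = width(V̂(b)) - epsilon * gamma^(-t)
    return width(b) - epsilon * GAMMA ** (-t)

def select_observation(b, a_star, t, epsilon):
    # Weighted excess uncertainty: follow the child that contributes most
    # to the excess uncertainty at the parent node.
    return max(observations,
               key=lambda o: obs_prob(b, a_star, o)
                             * excess(next_belief(b, a_star, o), t + 1, epsilon))
```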

HSVI – Example run on the Tiger Problem

HSVI – Convergence to V*
• It can be proved that if the upper bound V^U and the lower bound V^L are uniformly improvable (as the bounds presented here are), they converge to the true value function V*:
• V^L_0(b) ≤ V^L_1(b) ≤ ... ≤ V^L_∞(b) ≤ V*(b) ≤ V^U_∞(b) ≤ ... ≤ V^U_1(b) ≤ V^U_0(b)

HSVI – Example run on the Tiger Problem

HSVI – Resulting policy graph

• The five alpha vectors of the previous graph result in this policy graph
• Note that this policy is for the starting belief [0.5, 0.5]

HSVI – Resulting policy graph
• We see that the policy graph generated by the alpha vectors given by HSVI for the tiger problem, for the starting belief [0.5, 0.5] with a discount factor of 0.95, is a subset of the policy graph computed by exact methods

HSVI – Some Results

References
• [Hauskrecht, 1997] Hauskrecht, M. (1997). Incremental methods for computing bounds in partially observable Markov decision processes. In Proc. of AAAI, pages 734–739.
• [Hauskrecht, 2000] Hauskrecht, M. (2000). Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13:33–94.
• [Pineau et al., 2003] Pineau, J., Gordon, G., and Thrun, S. (2003). Point-based value iteration: An anytime algorithm for POMDPs. In Proc. of IJCAI.
• [Smith, 2007] Smith, T. (2007). Probabilistic Planning for Robot Exploration. PhD thesis, Carnegie Mellon University.
• [Smith and Simmons, 2004] Smith, T. and Simmons, R. (2004). Heuristic search value iteration for POMDPs. In Proc. of UAI.
• [Smith and Simmons, 2005] Smith, T. and Simmons, R. (2005). Point-based POMDP algorithms: Improved analysis and implementation. In Proc. of UAI.