A numerical approach to stochastic reach-avoid

5 downloads 0 Views 1MB Size Report
We consider finite horizon reach-avoid problems for discrete time stochastic ... of the one-step reach-avoid reward in the case of hyper-rectangular safe and ...
1

A numerical approach to stochastic reach-avoid problems for Markov Decision Processes Nikolaos Kariotoglou, Maryam Kamgarpour Tyler Summers and John Lygeros

arXiv:1411.5925v2 [math.OC] 17 Dec 2014

Abstract We consider finite horizon reach-avoid problems for discrete time stochastic systems with additive Gaussian mixture noise. Our goal is to approximate the optimal value function of such problems on dimensions higher than what can be handled via state-space gridding techniques. We achieve this by relaxing the recursive equations of the finite horizon reach-avoid optimal control problem into inequalities, projecting the optimal value function to the span of a finite number of basis functions and sampling the associated infinite set of constraints. We focus on a specific value function parametrization using Gaussian radial basis functions that enables the analytical computation of the one-step reach-avoid reward in the case of hyper-rectangular safe and target sets. We analyze the performance of the overall method numerically by approximating benchmark reach-avoid control problems and comparing the results to benchmark controllers based on well-studied methods. The full power of the method is demonstrated on a nonlinear control problem inspired from constrained path planning for autonomous race cars, with a highly nonlinear model with 6 states and 2 inputs; to the best of our knowledge this is the first experimental validation of ADP controllers at such high dimensions. Our results demonstrate that the resulting approximation scheme has considerable computational advantages compared to standard grid based methods. Index Terms Reachability, approximate dynamic programming, Markov decision processes, stochastic control.

I. I NTRODUCTION A wide range of controlled physical systems can be modeled using the framework of Markov Decision processes (MDPs) [1], [2]. One can formulate optimal control problems where the objective is to compute control policies for the MDP that maximize an expected reward, or minimize an expected cost over a finite or infinite time horizon. In this work, we consider the stochastic reach-avoid problem for MDPs in which one maximizes the probability of reaching a target set while staying in a safe set. This problem has been formally stated in [3] for discrete time stochastic hybrid systems which are then reformulated as MDPs. Current state-of-the-art methods express the solution to the stochastic reach-avoid problem as a dynamic programming (DP) recursion and then approximate it on a finite grid of the state and control spaces. State-space gridding techniques are theoretically attractive since they can provide explicit error bounds for the approximation of the value function under general Lipschitz continuity assumptions [4], [5], [6]. In practice, the complexity of gridding based techniques suffers from the infamous Curse of Dimensionality and thus, gridding is only applicable to relatively low dimensional problems. For general stochastic reach-avoid problems, the sum of state and control space dimensions that can be addressed with existing tools is limited (to the best of our knowledge to no more than 5). As a result, it is worth investigating alternative approximation techniques to push this limit further. Several researchers have developed approximate dynamic programming (ADP) techniques for various classes of stochastic control problems [7], [8]. Most of the existing work has focused on problems where the state and control spaces are finite but too large to directly solve DP recursions. Our work is motivated by the technique discussed in [9] where the authors develop an approximation method based on Linear Programming (LP) for finite state and control spaces. Although the LP approach has The authors are with the Automatic Control Laboratory, Department of Information Technology and Electrical Engineering, ETH Z¨urich, Z¨urich 8092, Switzerland (e-mail: [email protected]; [email protected]; [email protected]; [email protected]) The work of N. Kariotoglou was supported by the Swiss National Science Foundation under grant number 200021 137876.

2

been extensively used to approximate standard MDP problems with additive stage costs, it has not yet been considered for stochastic reach-avoid problems. In general, LP approaches to ADP are desirable since several commercially available software packages can handle LP problems with large numbers of decision variables and constraints. Here we concentrate on using linear programming to solve stochastic reach-avoid problems over uncountable state and control spaces. Preliminary results in this direction were presented in [10]. Here, we extend the results of [10] in several ways. First, we quantify the relation between the original reach-avoid DP recursion and an infinite dimensional LP formulated over the space of Borel-measurable functions defined on the MDP state space. We then show that restricting the infinite dimensional decision space to a finite dimensional subspace spanned by a collection of basis functions is equivalent to the projection of the value functions onto the intersection of the span with the feasible region of the infinite LP. We prove that recursively projecting the approximate value functions results in computing an upper bound for the optimal ones. As the sequence of infinitely constrained LP problems constructed by our method is not computationally tractable in general, we rely on the scenario sampling approach [11] to compute solutions which are feasible in probability. For the numerical implementation of our method, we propose the use of finite dimensional subspaces spanned by Gaussian radial basis functions (RBFs). We show that this particular choice of basis allows the analytic computation of integrals for a wide class of safe and target sets and a wide class of MDP transition kernels. Using the RBF structure we present a novel algorithm to approximate the value function of reach-avoid problems which results in solving finite dimensional LP problems. The performance of the developed method is illustrated via numerical examples where we approximate the solution to reach-avoid dynamic recursions that are intractable using state-space gridding. In particular, we formulate two types of reach-avoid problems for which heuristic solutions can be efficiently computed using existing control methods and confirm that our approximations are accurate by comparing their performance. We also formulate a more complex reach-avoid problem inspired by car racing where constructing a benchmark method is not straightforward but our method is still applicable. The rest of the paper is organized as follows: In Section II we recall the basics of stochastic reach-avoid problems for MDPs with uncountable state and control spaces and formulate an infinite dimensional LP for which the reach-avoid value function is an optimal solution. In Section III we discuss the effects of restricting the decision space of the infinite LP using basis functions and sampling the infinite constraints and propose a recursive algorithm to compute approximate value functions. Section IV restricts the form of basis elements to Gaussian RBFs and analyzes some of their approximation and numerical computation properties. We use the proposed methodology in Section V to approximate the value function of three reach-avoid problems of varying structure and complexity. In the concluding section of the paper we summarize the main results and outline potential future work. II. S TOCHASTIC R EACH -AVOID P ROBLEM A. Dynamic programming approach We consider a discrete-time controlled stochastic process xt+1 ∼ Q(dx|xt , ut ), (xt , ut ) ∈ X × U with a transition kernel Q : B(X ) × X × U → [0, 1] where B(X ) denotes the Borel σ-algebra of X . Given a state control pair (xt , ut ) ∈ X × U, Q(A|xt , ut ) measures the probability of xt+1 falling in a set A ∈ B(X ). The transition kernel Q is a Borel-measurable stochastic kernel, that is, Q(A|·) is a Borel-measurable function on X × U for each A ∈ B(X ) and Q(·|x, u) is a probability measure on X for each (x, u). For the rest of the paper all measurability conditions refer to Borel measurability. We allow the state space X to be any subset of Rn and assume that the control space U ⊆ Rm is compact; extensions to hybrid state and input spaces where some state or inputs are finite valued are possible [3]. We consider a safe set K 0 ∈ B(X ) and a target set K ⊆ K 0 . We define an admissible T -step control policy to be a sequence of measurable functions µ = {µ0 , . . . , µT −1 } where µi : X → U for each i ∈ {0, . . . , T − 1}. The reach-avoid problem over a finite time horizon T is to find an admissible T -step control policy that

3

maximizes the probability of xt reaching the set K at some time tK ≤ T while staying in K 0 for all t ≤ tK . For any initial state x0 we denote the reach-avoid probability associated with a given µ as: rxµ0 (K, K 0 ) = Pµx0 {∃j ∈ [0, T ] : xj ∈ K ∧ ∀i ∈ [0, j − 1], xi ∈ K 0 \ K} and operate under the assumption that [0, −1] = ∅ which implies that the requirement on i is automatically satisfied when x0 ∈ K. In [12], rxµ0 (K, K 0 ) is shown to be equivalent to the following sum multiplicative cost function: rxµ0 (K, K 0 ) = Eµx0

" T j−1 X Y j=0

!

#

1K 0 \K (xi ) 1K (xj )

(1)

i=0

Qj

where i=k (·) = 1 if k > j. The function 1A (x) denotes the indicator function of a set A ∈ B(X ) with 1A (x) = 1 if x ∈ A and 1A (x) = 0 otherwise. The sets K and K 0 can be time dependent or even stochastic [13] but for simplicity we assume here that they are constant. We denote the difference between ¯ := K 0 \ K to simplify the presentation of our results. the safe and target sets by X The solution to the reach-avoid problem is given by a dynamic recursion [12]. Define Vk∗ : X → [0, 1] for k = T − 1, . . . , 0 by :   Z ∗ ∗ Vk (x) = sup 1K (x) + 1X¯ (x) Vk+1 (y)Q(dy|x, u) (2) u∈U X ∗ VT (x) = 1K (x). The value of the above recursion at k = 0 and for any initial state x0 is the supremum of (1) over all admissible policies, i.e. V0∗ (x0 ) = supµ rxµ0 (K, K 0 ). In [12] the authors proved universal measurablility of the value functions in (2). Here we impose a mild additional assumption on the continuity of the transition kernel and show that the value functions are measurable and the supremum at each step is attained. Assumption 1. For every x ∈ X , A ∈ B(X) the mapping u 7→ Q(A|x, u) is continuous. Proposition 1. Under Assumption 1, the supremum in (2) is attained for every k by a measurable function µ∗k : X → U and Vk∗ : X → [0, 1] is measurable.

Proof. By induction. First, note that the indicator function VT∗ (x) = 1K (x) is measurable. Assuming that R ∗ ∗ ∗ Vk+1 is measurable we will show that Vk is also measurable. Define F (x, u) = X Vk+1 (y)Q(dy|x, u). Due to continuity of the map u 7→ Q(A|x, u) by Assumption 1, the mapping u 7→ F (x, u) is continuous for every x (by [14, Fact 3.9]). Now, since U is compact, by [15, Corollary 1], there exists a measurable function µ∗k (x) that achieves the supremum. Furthermore, by [16, Proposition 7.29], the mapping (x, u) 7→ F (x, u) is measurable. It follows that F (x, µ∗k (x)) (and hence Vk∗ ) is measurable as it is the composition of measurable functions. We conclude by induction that at each time step k there exists a measurable function µ∗k achieving the supremum and that Vk∗ is measurable for every k. Proposition 1 allows one to compute an optimal feedback policy at each stage k by solving,   Z ∗ ∗ µk (x) = arg max 1K (x) + 1X¯ (x) Vk+1 (y)Q(dy|x, u) . u∈U

(3)

X

The dynamic recursion in (2) implies that the functions Vk∗ (x) are defined on three disjoint regions of X , namely   x∈K 1, R ∗ ∗ ¯ Vk (x) = maxu∈U X Vk+1 (y)Q(dy|x, u), x ∈ X (4)  0, ¯ x ∈ X \X and it can be shown [12] that the value of each Vk∗ on X is restricted to [0, 1].

4

Proposition 2. Assume that for every A ∈ B(X) the mapping (x, u) 7→ Q(A|x, u) is continuous (stronger than Assumption 1), then Vk∗ (x) is piecewise continuous on X . Proof. From continuity of (x, u) 7→ Q(A|x, u) we conclude that the mapping (x, u) 7→ F (x, u) is continuous (Fact 3.9 in [14]). From the Maximum Theorem [17], it follows that F (x, u∗ (x)) and thus ¯ . By construction, each V ∗ is then piecewise continuous on X . each Vk∗ (x), is continuous on X k

The only established way to approximate the solution of (2) is by gridding X × U and working backwards from the known value function VT∗ (x) = 1K (x). In this way, the value of V0∗ (x) is approximated on the grid points of X while an approximate control policy at each step k is computed by taking the maximum of (3) over the grid points of U. The advantages of this approach are that it is straightforward to estimate the approximation accuracy as a function of the grid size (under suitable regularity assumptions [4], [5]) and that the approximate feedback control policy can be stored as a look-up table rather than computed online at each state. The disadvantage is that it quickly becomes intractable as the dimensions of the state and control spaces increase.

B. Linear programming approach We express the reach-avoid DP value function implicitly as a solution to an infinite dimensional LP, which we will later approximate using tools from function approximation and randomized convex optimization. Let F := {f : X → R, f is measurable}. Under the conditions of Proposition 1, we introduce two operators defined for any measurable function V ∈ F to simplify the presentation of our results: Z Tu [V ](x) = V (y)Q(dy|x, u) X (5) T [V ](x) = max Tu [V ](x). u∈U

The optimal value function at each step k ∈ {0, . . . , T − 1} can be written as Vk∗ (x) = 1K (x) + ∗ ](x). Since 1K and 1X¯ are known functions, one only needs to compute the value of 1X¯ (x)T [Vk+1 ∗ ¯ recursively for every k ∈ {0, . . . , T − 1} in order to construct each V ∗ using 1K T [Vk+1 ](x) on X k and 1X¯ . In the following proposition, we show that for every k ∈ {0, . . . , T − 1} the solution to the recursive step in (2) can be constructed from the solution of an infinite dimensional LP using a relaxed ∗ ∗ ¯ . For the ](x), ∀x ∈ X ](x), ∀(x, u) ∈ X¯ × U of the equation Vk∗ (x) = T [Vk+1 version Vk∗ (x) ≥ Tu [Vk+1 ¯. rest of the paper let ν be a non-negative measure supported on X ∗ Proposition 3. Suppose Assumption 1 holds. For k ∈ {0, . . . , T − 1}, let Vk+1 be the optimal value function at step k + 1 in (2). Define the following infinite dimensional linear program: Z ∗ J := inf V (x)ν(dx) V (·)∈F ¯ (6) X ∗ ¯ subject to V (x) ≥ Tu [Vk+1 ](x), ∀(x, u) ∈ X × U.

¯. Then, Vk∗ is a solution to (6) and any other solution to (6) is equal to Vk∗ ν-almost everywhere on X

Proof. From Proposition 1, Vk∗ ∈ F and is equal to the supremum over u ∈ U of the right hand side ¯ of the constraint in (6). Hence, for any feasible V ∈ F, we Rhave that V (x) ≥ Vk∗ (x) for all x ∈ X R R ∗ ∗ ∗ ∗ ¯ V (x)ν(dx) ≥ X ¯ Vk (x)ν(dx) which means J ≥ X ¯ Vk (x)ν(dx). On the other hand, J ≤ X Rand thus ∗ ¯ Vk (x)ν(dx) since it is the X R least cost among the set of feasible measurable functions. Putting these together we have that J ∗ = X¯ Vk∗ (x)ν(dx) and Vk∗ is an optimal solution. We now show that any other ¯ . Assume that there exists a function V ∗ optimal solution to (6) is equal to Vk∗ ν-almost everywhere on X ∗ for (6) that is strictly greaterR than Vk on a setRAm ∈ B(X ) of non-zero ν-measure. Since V ∗ and Vk∗ are both optimal, we have that X¯ V ∗ (x)ν(dx) = X¯ Vk∗ (x)ν(dx) = JR∗ . We can then reduce V ∗ to the value of Vk∗ on Am while keeping it measurable, reducing the value of X¯ V ∗ (x)ν(dx) below J ∗ , contradicting

5

the fact that V ∗ was optimal. We can then conclude that any solution to (6) has to be equal to Vk∗ ν-almost ¯. everywhere on X Consider the following semi-norm on F induced by the non-negative measure ν: R Definition 1 (ν-norm). kV k1,ν := X¯ |V (x)|ν(dx).

Any feasible function in (6) is a point-wise upper bound on the value function Vk∗ and Proposition 3 implies that for any solution V ∗ to the infinite dimensional LP in (6) we have kV ∗ − Vk∗ k1,ν = 0. The results of Proposition 3 hold for any non-negative measure but choosing ν poorly affects the quality of ¯ , the a solution to (6) as discussed in [9]. For example, if ν is concentrated on a very small part of X objective value of (6) will only be minimized there. Typically, ν is selected according to the importance ¯ and is selected to be uniform if all states are equally important. We therefore refer of different areas of X to ν in the rest of the paper as the state-relevance measure. In theory, we could start at k = T − 1 and recursively solve problem (6) to construct the required value function V0∗ , up to a set of ν-measure zero. However, computing a solution to infinite dimensional LP problems in the form of (6) is generally intractable [18], [19] unless the set F has a finite dimensional ¯ × U can be reformulated or evaluated efficiently. The representation and the infinite constraints over X next section discusses how approximate solutions to the infinite LP can be computed. III. A PPROXIMATION WITH A FINITE LINEAR PROGRAM We present a finite dimensional approximation of the infinite dimensional LP constructed in two steps. First, we replace the infinite dimensional function space F with a finite dimensional one and discuss the ¯ × U with a finite effects of this restriction. Second, we replace the infinite constraints enforced on X number of constraints, and provide probabilistic feasibility guarantees. A. Restriction to a finite dimensional function class A typical way to restrict the decision space is by gridding the domain of functions in F. One can then assign a constant function to every grid cell and determine an appropriate scaling factor for each cell. Such piecewise constant approximation schemes result in strong statements about the overall approximation quality [20], [21] but require a large number of cells to achieve small approximation errors. An alternative approach is to restrict F to polynomial basis functions of a given degree which allows reformulating the infinite constraints in (6) to linear matrix inequality (LMI) constraints using sum-of-squares programming [22] and tools from polynomial optimization [23]. In this way one circumvents the need to sample the constraints (see next subsection) at the expense of solving semi-definite programming (SDP) problems that become large and difficult to solve as the degree of polynomials increases. The approach has been successfully used in [24], [25], [26] but to the best of our knowledge it has not yet been applied to reachavoid problems. We present the results of this section using abstract subspaces F M ⊂ F and discuss in the next section a selection of basis elements that are particularly useful for the reach-avoid problem under consideration. Let F M be a finite dimensional subspace of F spanned by M measurable basis elements denoted by {φi }M following semi-finite LP defined over functions V ∈ F M i=1 . For a fixed function f ∈ F, consider the P M M expressed using the basis {φi }M i=1 as V (x) = i=1 wi φi (x) for some w ∈ R Z M X min wi φi (x)ν(dx) w1 ,...,wM

subject to

i=1 M X i=1

¯ X

¯ ×U wi φi (x) ≥ Tu [f ](x), ∀(x, u) ∈ X

w ∈ RM .

(7)

6

Note that we use min in (7) instead of inf since the minimum is attained in F M under mild assumptions ¯ on the basis functions {φi }M i=1 and f ∈ F. Namely, if the basis functions and f are bounded on X and have positive ν-measure, one can show that the constraint set of (7) is closed and bounded. The following proposition relates the solution of (7) to the corresponding infinite dimensional LP defined over the whole space F.

Proposition 4. Suppose Assumption 1 holds. Fix a function f ∈ F and let V ∗ be a solution to the following optimization problem: Z min V (x)ν(dx) V (·)∈F ¯ (8) X ¯ subject to V (x) ≥ Tu [f ](x), ∀(x, u) ∈ X × U. P Denote by Vˆ (x) = M ˆi φi (x) the function constructed using the optimal solution wˆ of (7) for the i=1 w ˆ ¯ and Vˆ is a solution to the following optimization same function f . Then V (x) ≥ V ∗ (x) everywhere on X problem:

V − V ∗ min 1,ν V (·)∈F M (9) ¯ × U. subject to V (x) ≥ Tu [f ](x), ∀(x, u) ∈ X Proof. Note that, we use min in (8) since the infimum is attained by V ∗ (x) := T [f ](x) = maxu∈U Tu [f ](x) which can be shown to be measurable under Assumption 1 in the same way as in Proposition 1. Now, ¯ , i.e. V (x) ≥ V ∗ (x) for all x ∈ X ¯ and in particular, since any feasible V in (8) upper bounds V ∗ on X ∗ ˆ ˆ ¯ ˆ V ∈ F, V (x) ≥ V (x) for all x ∈ X . To show that V is a solution to (9) consider Z Z Z ∗ ∗ V (x)ν(dx) − C V (x)ν(dx) = V (x)ν(dx) − kV − V k1,ν = ¯ X

¯ X

¯ X

¯ with respect to ν and is a constant since V ∗ is fixed. Since where C is equal to the integral of V ∗ on X P M any function in V ∈ F can be written as V (x) = M i=1 wi φi (x), problem (7) is equivalent to (9) for the same f ∈ F. The semi-infinite optimization problem in (7) can be used to recursively construct an approximation Vˆ0 ¯ where each approximation step Vˆk satisfies Proposition 4. to V0∗ on X

Proposition 5. For every k ∈ {0, . . . , T − 1} let F Mk denote the span of a fixed set of Mk measurable k basis elements denoted by {φki }M the known value function VT∗ and recursively construct i=1 . Start with P M VˆT −1 (x), . . . , Vˆ0 (x) where wˆik and Vˆk (x) = i=1k wˆik φki (x) are obtained by substituting f = Vˆk+1 in (7). Each function Vˆk is then a solution to the corresponding optimization problem:

V − Vk∗ min 1,ν V (·)∈F Mk (10) ¯ ×U subject to V (x) ≥ Tu [Vˆk+1 ](x) ∀(x, u) ∈ X ¯. and Vˆk (x) ≥ Vk∗ (x) for all x ∈ X Proof. The statement is a recursive application of Proposition 4. For k = T − 1, the constraint in (10) ∗ ∗ implies that VˆT −1 (x)h≥ T [V Applying T on both sides we have by the monotonicity of i T ](x) = VT −1 (x).  ∗ T (see [12]) that T VˆT −1 (x) ≥ T VT −1 (x) = VT∗−2 (x). By the constraints in (10) at k = T − 2 we h i ˆ ˆ have that VT −2 (x) ≥ T VT −1 (x) which implies that VˆT −2 (x) ≥ VT∗−2 (x). Repeating, we arrive at the ¯ we conclude in the same way result Vˆ0 (x) ≥ V0∗ (x). Using the fact that Vˆk (x) ≥ Vk∗ (x) for all x ∈ X ∗ ˆ as in Proposition 4 that Vk (x) minimizes the ν-norm distance to Vk while respecting the corresponding constraint.

7

Note that as in (7) and Proposition 4, we use min in the optimization problems since the infimum can be shown to be attained under mild assumptions on the basis elements and the input functions. Under such assumptions, the feasibility of each problem in (10) can be guaranteed by adding the constant function φ(x) = 1X¯ (x) to the set of basis elements; each function Vˆk is then bounded by construction and 1X¯ (x) can be scaled by an arbitrarily large number to form a trivial upper bound on Vk∗ . The process described in Proposition 5 is not directly applicable since it involves solving a sequence of optimization problems with an infinite number of constraints parametrized by the pairs (x, u) ∈ X × U. Even when one can solve such problems, there is no guarantee that the obtained solution accurately approximates the optimal value function. The recursive approximation method described in Proposition ¯ but the equality will not be 5 at k = T − 1 guarantees that VˆT −1 ≥ Tu [VT∗ ] =⇒ VˆT −1 ≥ VT∗−1 on X ∗ MT −1 Mk achieved unless VT −1 is a member of F . If each F is not rich enough, the difference might be growing, resulting in poor approximation of V0∗ . B. Restriction to a finite number of constraints Finite dimensional problems with infinite constraints (similar to the ones appearing in the recursive process of Proposition 5) are called semi-infinite or robust optimization problems and are difficult to solve ¯ × U and the basis functions ([27], [28], [29]). An alternative unless specific structure is imposed on X ¯ × U and impose the constraints only on those points. approach is to select a finite set of points from X Problems in the form of (7) are then reduced to finite LP problems and can be solved to optimality using commercially available solvers. We will use the scenario approach [30] to quantify the feasibility properties of solutions constructed using sampled data. For a discussion on the related performance properties, see ¯ [31], [32]. Let S := {(xi , ui )}N i=1 denote a set of N ∈ N elements from X × U and for a fixed function M f ∈ F, consider over functions V ∈ F expressed using the basis {φi }M i=1 PM the following finite LP defined as V (x) = i=1 wi φi (x) for some w ∈ RM Z M X min wi φi (x)ν(dx) w1 ,...,wM

subject to

i=1 M X i=1

¯ X

wi φi (x) ≥ Tu [f ](x), ∀(x, u) ∈ S

(11)

w ∈ RM . The authors in [30] generalize the feasibility properties of a solution to (11) with respect to a solution to (7), as a function of the set S. The following statement is a straightforward application of Theorem 3.5 in [30] to the problem under consideration here. Theorem 1. Assume that for any function f ∈ F and any set S, the feasible region of (11) is nonempty and the optimizer is unique. Fix a function f ∈ F and choose a collection of functions {φi }M i=1 and required violation and confidence levels ε, β ∈ (0, 1). Construct a set S by drawing N independent ¯ × U according to some probability measure PX¯ ×U where N identically distributed sample points from X is chosen as ( ) M −1   X N i N ≥ S(ε, β, M ) = min N ∈ N ε (1 − ε)N −i ≤ β . i i=0 P Then the function V˜ (x) = M ˜i φi (x) constructed using the optimal solution w˜i to (11) satisfies i=1 w h i ˜ PX¯ ×U V (x) < Tu [f ](x) ≤ ε (12) with confidence 1 − β, measured with respect to P{X¯ ×U }N , the measure constructed by taking N products of PX¯ ×U .

8

[0, 1]

Tu[f](x) V˜ (x) Tui [f](xi T [f](x) Vˆ (x)

T [f](x) V˜ (x)

[0, 1]

Tui [f](xi) Vˆ (x)



U



Fig. 1: Approximation after sampling depicted ¯ × U. on X

Fig. 2: Approximation after sampling, projected ¯. on X

The inner P probability statement in (12) can be interpreted as follows: The approximate solution function ˜i φi (x) violates the constraint Tu [f ](x) on a set of at most ε measure, with respect to PX¯ ×U . V˜ (x) = M i=1 w ¯ in probability. To In contrast to the solution in (7), the constructed value functions are upper bounds on X ¯ further clarify the statement of Theorem 1, consider the illustration in Figures 1 and 2. Assuming that X and U are one dimensional and for a hypothetical function f , we have plotted the value of the constraint Tui [f ](xi ) (denoted by crosses) for a collection of sampled points (xi , ui ). The dotted lines correspond to the solution of (11) for the particular f , denoted by V˜ (x). The dashed lines correspond to the solution of (7) for the particular f , denoted by Vˆ (x), while the solid lines correspond to T [f ](x). As stated in Theorem 1, V˜ (x) is greater than or equal to Tu [f ](x) for most pairs (x, u). The relation of V˜ (x) to T [f ](x) remains unclear since V˜ (x) could exceed or stay below T [f ](x), as long as the measure of pairs (x, u) where it is below is smaller than εk . In Figure 1, we have plotted the sets where V˜ (x) < Tu [f ](x) on ¯ × U. In Figure 2, the shaded areas only indicate that there exist pairs (x, u) such that V˜ (x) < Tu [f ](x). X Proposition 6. For every k ∈ {0, . . . , T − 1} let F Mk denote the span of a fixed set of Mk ∈ N+ basis Mk ∗ k elements {φki }M i=1 and choose levels {εi , βi ∈ (0, 1)}i=1 . Start with the known value function VT ∈ F and M M M k k k recursively construct V˜k ∈ F by solving (11) using f = V˜k+1 , {φi }i=1 = {φi }i=1 , S = Sk where Sk is constructed using ε = εk , β = βk and M = Mk in Theorem 1. Denote by Nk the number of elements in Sk . Each function V˜k then satisfies: h i PX¯ ×U V˜k (x) < Tu [V˜k+1 ](x) ≤ εk (13)

with confidence 1−βk , measured with respect to P{X¯ ×U }Nk , the measure constructed by taking Nk products of PX¯ ×U . k ˜ Since at every step k, the basis elements {φki }M i=1 , the function Vk+1 ∈ F and the levels {εi , βi ∈ Mk (0, 1)}i=1 are fixed, Proposition 6 is simply a recursive application of Theorem 1 for k ∈ {0, . . . , T − 1}. k In the following section we discuss a particular choice for the basis sets {φki }M i=1 that provides a number of computational advantages in the context of reach-avoid problems.

IV. I MPLEMENTATION WITH RADIAL BASIS FUNCTIONS We discuss the implementation steps of the recursive approximation technique described in Proposition 6. Our design choices are motivated by the structure of the constraints in the optimization problems of Section III and the discrete-time stochastic reach-avoid problem for MDPs presented in Section II.

9

A. Basis function choice We consider basis function sets comprising parametrized Gaussian radial basis functions. Our choice is motivated by the strong approximation capabilities of such functions, discussed in [33], [34], [35], [36], as well as the fact that for certain types of sets one can compute integrals over Gaussian RBFs analytically. Consider the following assumption: Assumption 2. We impose the following restrictions on the structure of the considered reach-avoid problems: P 1) The stochastic kernel Q can be written as a Gaussian mixture kernel Jj=1 αj N (µj , Σj ) with known P diagonal covariance matrices Σj , means µj and weights αj such that Jj=1 αj = 1 for a finite J ∈ N+ . 0 2) The targetSand safe sets finite unions of disjoint hyper-rectangle sets,   written as S SM SPK andn K pcanp be n P m m 0 [c , d ] for some K = i.e. K = p=1 Kp = p=1 l=1 [al , bl ] and K 0 = M m l m=1 m=1 l=1 l finite P, M ∈ N+ with n = dim(X ) and ap , bp , cm , dm ∈ Rn for every p and m. Q 3) The state-relevance measure ν can be written as a product measure, i.e. ν(dx) = nl=1 νl (dxl ).

×

×

The restrictions imposed by Assumption 2 apply to a wide range of reach-avoid problems. For example the kernel of systems subject to additive Gaussian mixture noise can be written as a Gaussian mixture. Moreover, whenever the safe and target sets cannot be written as unions of disjoint hyper-rectangles, one can approximate them as such to arbitrary accuracy1 [37]. Since the state-relevance measure ν is a design choice, it is generally possible to select one that can be written as a product measure; if this is not the case, our results extend to measures ν that can be approximated by Gaussian mixtures. We consider basis sets constructed using the fundamental block of Gaussian mixtures. For each k ∈ k {0, . . . , T − 1}, F Mk denotes the span of a set of Mk Gaussian radial basis functions {φki }M i=1 where for k n each i ∈ {1, . . . , M }, φi is a mapping from R to R that has the form: φki (x) =

n Y

1 q

l=1

− 21

e

i,k 2 (xl −c ) l i,k s l

(14)

2πsi,k l

where x, ci,k , si,k ∈ Rn and all ci,k , si,k ∈ R. Note that the expression in (14) is equivalent to a multivariate Gaussian function with diagonal covariance matrix. The candidate value function at PM approximate k k stage k ∈ {0, . . . , T − 1} is constructed by taking a weighted sum i=1 wi φi (x) of the corresponding basis elements. For every k ∈ {0, . . . , T − 1}, we fix the centers and variances of the RBF elements k {φki }M i=1 randomly before initializing the recursive process described in Proposition 6 to optimize over ¯ and variances from a bounded set that their scalar weights. We sample centers uniformly from the set X varies according to problem data and constitutes a design choice (see Section V). One could try to further optimize the approximation performance by adaptively choosing centers and variances for the constructed basis [38], [39]. Since we are addressing the approximation of a value function of a particular optimal control problem, we can exploit the system structure in choosing the RBF parameters. In particular, one could analyze the system dynamics offline and compute a rough outer approximation of reachable sets (for each time step) within which the RBFs can be placed. Developing such methods would improve our approximation scheme and will be investigated in future work. A useful property of Gaussian RBFs is that pairwise products are proportional to a Gaussian RBF k such that f 1 (x) = with center and variance [33, Section 2]. Consider two functions f 1 , f 2 ∈ F M PMk known P P Mk Mk PMk 1 k 2 2 k 1 2 ˜k i=1 wi φi (x) and f (x) = j=1 wj φj (x). We then have that f (x) = f (x)f (x) = i=1 j=1 wi,j φi,j (x) 1 The number of hyper-rectangles needed for a given approximation accuracy will in general grow exponentially in the dimension of the state space.

10

where wi,j = wi1 wj2 and φ˜ki,j is equal to φ˜ki,j (x) =

n Y

− 12 γ i,j,k q l e 2πsi,j,k l

l=1

i,j,k 2 (xl −c ) l i,j,k s l

with ci,j,k = l

s

j,k j,k i,k ci,k l s l + cl s l

si,k l

+

, si,j,k = l

sj,k l

j,k si,k l sl

si,k l

+

sj,k l

1



, γli,j,k = q e i,k j,k 2π(sl + sl )

i,k j,k c −c l l i,k j,k 2(s +s ) l l

.

Under the setup of Assumption 2, the integral of functions in F Mk over K and K 0 decomposes into one dimensional integrals of Gaussian functions which can be computationally useful in the method of ¯ × U. In particular, Proposition 6 P that requires evaluating the constraint in (11) over samples from X SD M k k k ˜k (x) = let V w ˜ φ (x) denote the approximate value function at time k and A = i i i=1 d=1 Ad = SD  d d d d a union of hyper-rectangle sets for some D ∈ N+ . The integral of V˜k d=1 [a1 , b1 ] × · · · × [an , bn ] over A can then be written as Z D Z X ˜ V˜k (x)ν(dx) Vk (x)ν(dx) = A

d=1

=

Ad

Mk D X X

w˜ik

Z

φki (x)ν(dx)

=

Mk D X X

w˜ik

n Z Y

bdl

− 12

1

i,k 2 (xl −c ) l i,k s l

q νl (dxl ) (15) e i,k d=1 i=1 d=1 i=1 l=1 2πsl     Mk D X n i,k i,k d d X Y 1 x l − c − bl  1 x l − c − al  = w˜ik − erf  q l + erf  q l 2 2 d=1 i=1 l=1 2si,k 2si,k l l R 2 x where erf denotes the error function defined as erf(x) = √2π 0 e−t dt. The integral of V˜k over X can then be written as Z Z Z Mk Mk M X P X X X k k k V˜k (y)Q(dy|x, u) = w˜i φi (y)Q(dy|x, u) − w˜i φki (y)Q(dy|x, u) Ad

X

m=1 i=1

+

P Z X p=1

0 Km

adl

p=1 i=1

Kp

(16)

Q(dy|x, u)

Kp

and since Q can be written as a Gaussian RBF, every product in (16) is a product of Gaussian RBFs and 0 the integrals over Kp and Km can be computed using (15). If the setup of Assumption 2 is not applicable, one can calculate upper bounds on the value of the integral in the constraints in (11) over general polytopic sets as suggested in [40], [41], [42]. Calculating such bounds requires solving SDP problems which are computationally demanding and will significantly increase the total computation time of the proposed method. B. Recursive value function approximation and policy computation ¯ × U, transition kernel Q and target and safe sets K and K 0 , the For a given horizon T , space X approximate value functions corresponding to (2) are computed using Proposition (6). The method relies on recursively applying Theorem 1 which depends on several parameters that constitute design choices. −1 The first and most important choice is the number of basis elements {Mk }Tk=0 used at every step of the approximation that, as seen in (11) and the sample bound in Theorem 1, affects both the number of decision variables and constraints. The samples are generated according to a chosen measure PX¯ ×U and,

11

Algorithm 1 Approximate value function Input Data: ¯ × U. • State and control space X • Reach-avoid time horizon T . • Centers and variances of the MDP kernel Q. 0 • Sets K and K written as unions of disjoint hyper-rectangles. Design parameters: T −1 • Number of basis elements {Mk }k=0 . T −1 T −1 • Violation and confidence levels {εi }i=0 , {1 − βi }i=0 . • Probability measure PX ¯ ×U used in Theorem 1. k Mk ¯ and • Probability measure of centers and variances for the basis functions {φi }i=1 , supported on X a bounded set respectively. • State-relevance measure ν decomposed as a product measure. Initialize V˜T (x) ← 1K (x). for k = T − 1 to k = 0 do Mk k Construct F Mk by sampling Mk centers {ci }M i=1 and variances {si }i=1 according to the chosen probability measures. ¯ × U using the measure PX¯ ×U . Sample S(εk , βk , Mk ) pairs (xs , us ) from X s s for all (x , u ) hdo i Evaluate Tus V˜k+1 (xs ) using (15) and (16). end for  k . Solve the finite LP in (11) to obtain w˜ k = w˜1k , . . . , w˜M k P Mk ¯ ˜ Set the approximated value function on X to Vk (x) = i=1 w˜ik φki (x). end for −1 −1 apart from {Mk }Tk=0 , their number is also affected by the choice of violation and confidence levels {εi }Ti=0 , −1 . Typically, violation and confidence levels are chosen to be close to 0 and 1 respectively to {1 − βi }Ti=0 enhance the feasibility guarantees of Theorem 1 (measured using PX¯ ×U ) at the expense of more constraints in (11). The quality of the resulting approximation depends on the choice of the center and variance k parameters of each basis set {φki }M i=1 as well as the state-relevance measure ν, given as a product of measures. At each recursive step k ∈ {0, . . . , T − 1} we construct F Mk by sampling Mk centers from ¯ and variances from a bounded set, according to a chosen probability measure; our method is still X applicable if centers and variances are not sampled but fixed in another way. We then sample points ¯ × U according to PX¯ ×U and the sample bound of Theorem 1 and evaluate the constraints in (11) from X using (15) and (16). The approximate value function at step k is constructed using the optimal solution of the resulting finite LP and the process is repeated until k = 0. The steps of the proposed method are summarized in Algorithm 1. Using the sequence of approximate value functions one can compute an approximate control policy for ¯ by solving: any x ∈ X   Z µ ˜k (x) = arg max 1K (x) + 1X¯ (x) V˜k+1 (y)Q(dy|x, u) u∈U X Z (17) = arg max V˜k+1 (y)Q(dy|x, u). u∈U

X

Although the optimization problem in (17) is non-convex, standard gradient based algorithms can be employed to obtain a local solution. In particular, the cost function is by construction smooth with respect ¯ and the gradient and Hessian information can be analytically calculated using the to u for a fixed x ∈ X erf function. Moreover, the decision space U is typically low dimensional (in most mechanical systems

12

Algorithm 2 Approximate controller for k ∈ {0, . . . , T − 1} do Measure the system state xk . R ∗ Compute the gradient and hessian functions of X V˜k+1 (y)Q(dy|xk , u) with respect to u. Solve the optimization problem in (17) using second-order methods to obtain µ ˜∗ (xk ). Apply the calculated control input to the system. end for for example dim U < dim X ) and mature software is available [43] to compute locally optimal solutions. The process of calculating a control input at time k for a fixed state xk is summarized in Algorithm 2. An alternative approach would be to use randomized techniques similar to the approach in [11], [30], [44], [45]. One way to evaluate the approximate value function is to investigate the performance of the associated ¯ , one can control policies defined by µ ˜k (x) for k ∈ {0, . . . , T − 1}. That is, starting at some x0 ∈ X implement the policy in simulation and gather statistics via Monte Carlo methods of how many times the system satisfies the desired reach-avoid objective. V. N UMERICAL C ASE S TUDIES We analyze the numerical performance of the proposed method on three reach-avoid problems that differ in structure and complexity. The first two examples correspond to regulation problems with different constraint types and are used to demonstrate the accuracy of the method as the state and control space dimensions grow. The final example demonstrates the potential of the proposed approach by addressing a reach-avoid problem for a highly non-linear system with 6 states and 2 control inputs. To compute the ADP reach-avoid controllers for all problems, we have used the method described in Algorithm 1 without exploiting any additional structure in the systems (e.g. seperability of dynamics). All value function approximations and control policy simulations were done on an Intel Core i7 Q820 CPU clocked @1.73 GHz with 16GB of RAM memory, using IBM’s CPLEX optimization toolkit in its default settings. A. Reach-avoid origin regulation Consider the reach-avoid problem of maximizing the probability that the state of a controlled linear system subject to additive Gaussian noise reaches a target set around the origin within T = 5 discrete time steps while staying in a safe set. For simplicity, we consider systems described by the equation xk+1 = xk + uk + ωk

(18)

where for each k ∈ {0, . . . , 4}, xk ∈ X = Rn , uk ∈ U = [−0.1, 0.1]n and each ωk is distributed as a Gaussian random variable ωk ∼ N (0n×1 , Σ) with diagonal covariance matrix. We consider a target set K = [−0.1, 0.1]n around the origin and a safe set K 0 = [−1, 1]n (see Figure 3) and approximate the optimal value function using the method described in Proposition 6 and the steps of Algorithm 1 for system dimensions of dim(X × U) = 4, 6, 8. The transition kernel of (18) is a Gaussian mixture xk+1 ∼ N (xk + uk , Σ) with center xk + uk and covariance matrix Σ while the sets K and K 0 are hyperrectangles by construction. We choose 100, 500 and 1000 Gaussian RBF elements for the reach-avoid ¯ and [0.02, 0.095]n problems of dim(X × U) = 4, 6, 8 (Table I) and uniform measures supported on X respectively to sample the RBF centers and variances. The violation and confidence levels for every k ∈ {0, . . . , 4} are chosen to be εk = 0.05, 1 − βk = 0.99 and the measure PX¯ ×U required to generate ¯ × U is chosen to be uniform. Since there is no reason to favor some states more than samples from X ¯ . We then follow the steps of Algorithm 1 to obtain others, we also choose ν to be uniform, supported on X a sequence of approximate value functions {V˜k }4k=0 . To estimate the performance of the approximation, ¯ and for each one generate 100 different noise we sample 100 initial conditions x0 uniformly from X

13

1 0.8

X K

dim(X × U) Mk Nk kV˜0 − VADP k Constr. time (sec) LP time (sec) Memory



0.6 0.4 0.2

K

0 2 4

4D 100 3960 0.0692 4 2 3.2MB

6D 8D 500 1000 19960 39960 0.1038 0.2241 85 450 50 520 80MB 320MB

TABLE I: Parameters and properties of the constructed value function approximation for origin regulation problems of dim(X × U) = 4, 6, 8.

6 8

5

0

0.5

1

Fig. 3: Regulation example. trajectory realizations, using the corresponding policy defined in (17) computed by Algorithm 2. We then count the number of trajectories that successfully complete the reach-avoid objective, i.e. reach K without leaving K 0 . In Table I we denote by kV˜0 − VADP k the mean absolute difference between the empirical success probability of the ADP controller, denoted by VADP , and the predicted performance V˜0 , evaluated over the considered initial conditions. The memory and computation time reported is that of constructing and solving each LP in the recursive process. The structure of the considered regulation problem is similar to that of an LQG control problem with state constraints. Since the system is linear, the noise Gaussian and the target and safe sets symmetric and centered around the origin, we assume that a tuned LQG controller will perform close to optimal for the reach-avoid objective. Under this assumption we directly compare the ADP and LQG controllers in an attempt to build confidence on the ADP method and later use it on more complex problems. Consider an LQG problem of the form ! T −1 X > min Ewk x> + x> k Qxk + uk Ruk T QxT T −1 (19) {uk }k=0 k=0 ¯ subject to (18), x0 ∈ X where Q and R have been chosen to correspond to the largest ellipsoids2 inscribed in K and U respectively. Whenever the resulting LQG control input (calculated via the Riccati difference equation) is infeasible, we project it on the feasible set U. Starting from the same initial conditions as with the ADP controller, we simulate the performance of the LQG controller by generating 100 trajectories using the same noise samples as in the ADP method, counting the number that reach K without leaving K 0 . Since the design criteria of the LQG and reach-avoid based methods are different, our comparison is qualitative. Figure 4 shows the mean absolute difference between VLQG and VADP over the initial conditions, as a function of the basis elements used to construct the approximate value functions. Each line on the graph corresponds to a problem of different dim(X × U) and we can clearly see the trend of increasing accuracy as the number of basis elements increases. As the accuracy keeps improving, it may very well be that the ADP and LQG policies converge to the optimal one. In Table II we indicate the trade-off between accuracy and computational resources specifically for the 6D problem; the situation is analogous in the 4D and 8D problems. Figure 5 shows the same difference metric as Figure 4, this time as a function of the total number of samples for a fixed number of basis elements. As expected (inspect Theorem 1), changing the total number of samples Nk has a direct effect on the violation probability εk (assuming constant 2

Computed via convex optimization [40, Ch. 8]

14

0.8

4D 6D 8D

0.7

kVADP − VLQGk

0.6 0.5 0.4 0.3 0.2 0.1 0

0

200

400

600

800

1000

Basis elements

Fig. 4: Mean absolute difference between the empirical reach-avoid probabilities achieved by the ADP (VADP ) and LQG (VLQG ) policies as a function of approximation basis elements for problems of dim(X × U) = 4, 6, 8. Mk kVADP − VLQG k 50 0.504 100 0.288 200 0.052 500 0.036 1000 0.035

Constr. time (sec) LP time (sec) Memory 1.2 0.143 784KB 4 2.324 3.2MB 15 3.41 12.8MB 85 50 80MB 450 850 320MB

TABLE II: Accuracy and computational requirements as a function of basis elements for dim(X × U) = 6. βk = 0.01) and consequently on the approximation quality. Again, as the number of samples increases, the accuracy increases at the expense of computation resources (Table III). B. Reach-avoid origin regulation with obstacles Consider the same origin regulation problem of Section V-A with the addition of obstacles placed randomly within the state space (Figure 6). The reach-avoid objective in this case is to maximize the probability that the system in (18) reaches K without leaving K 0 or reaching any of the obstacle sets Kα . In the reach-avoid framework, this problem S is equivalent to the one in Section V-A, using the same target set K and a different safe set K 0 = X \ Jj=1 Kαj where each Kαj denotes one of the hyper-rectangular obstacle sets and J ∈ N+ . For a time horizon of T = 7, we choose the same basis numbers, basis parameters, sampling and reward measures as well as violation and confidence levels as in Section V-A in order to apply Algorithm 1 and compute approximate value functions for the obstacle avoidance problem (see Table IV). We simulate the performance of the ADP controller starting from 100 different initial conditions selected such that at least one obstacle blocks the direct path to the origin. For every initial condition we sample 100 different noise trajectory realizations and use the corresponding control policies computed by Algorithm 2 to compute the empirical ADP reach-avoid success probability (denoted by VADP ) by counting the total number of trajectories that reach K while avoiding reaching any of the obstacles or leaving K 0 . The problem of regulating (18) to the origin without reaching any obstacles is in the category of path planning and collision avoidance problems that have been studied thoroughly in the control community [46], [47], [48], [49]. We adopt the formulation in [47] and formulate the problem as an MiQP using mixed logic dynamics (MLD) [50] that can be solved to optimality using standard branch and bound techniques. In order to take noise into account, we truncate the density function of the random variables

15

1.4

4D 6D 8D

1.2

kVADP − VLQGk

1

0.8

0.6

0.4

0.2

0

0

0.5

1

1.5

2

2.5

Sample points

3

3.5

4 4

x 10

Fig. 5: Mean absolute difference between the empirical reach-avoid probabilities achieved by the ADP (VADP ) and LQG (VLQG ) policies as a function of samples for problems of dim(X × U) = 4, 6, 8. Samples εk kVADP − VLQG k 400 1 0.283 4000 0.25 0.206 40000 0.025 0.0360

Constr. time (sec) LP time (sec) Memory 2.2 3.565 1.6MB 17 97 16MB 170 162 160MB

TABLE III: Accuracy and computational requirements as a function of sample number for dim(X × U) = 6, with the parameters of Table I. 1 0.8

X

0.6 0.4

Ka

Ka

0.2 0

dim(X × U) Mk Nk kV˜0 − VADP k Constr. time (sec) LP time (sec) Memory

K′

K Ka

Ka

2 4

6D 500 19960 0.118 130 80 80MB

8D 1000 39960 0.191 671 700 320MB

TABLE IV: Parameters and properties of the constructed value function approximation for obstacle avoidance problems of dim(X × U) = 4, 6, 8.

Ka

6 8

5

4D 100 3960 0.095 4.2 3.2 3.2MB

0

0.5

1

Fig. 6: Non-convex state constraints ωk at 95% of their total mass and enlarge each obstacle set Kα by the maximum value of the truncated ωk in each dimension. An alternative approach to handle process noise would be to include the bounded uncertainty implicitly in the control problem formulation, see [51], or employ randomized techniques and solve the problem using sampling. Starting from the same initial conditions as in the ADP approach, we simulate the performance of the MiQP based control policy by using 100 noise trajectory realizations (same as in the ADP controller), implementing the policy in receding horizon. The empirical success probability of trajectories that reach K while avoiding reaching any of the obstacles or leaving K 0 is denoted by VMiQP . Since the objectives of the MiQP and ADP formulations are different (even though both methods exploit knowledge about the target and the avoid sets) our comparison is again qualitative. The mean differences kV˜0 − VADP k and kVADP − VMiQP k reported in Tables IV and V are computed by

16

0.8

4D 6D 8D

0.7

kVADP − VMiQPk

0.6 0.5 0.4 0.3 0.2 0.1 0

0

200

400

600

800

1000

Basis elements

Fig. 7: Mean absolute difference between the empirical reach-avoid probabilities achieved by the ADP and MiQP policies as a function of approximation basis elements for problems of dim(X × U) = 4, 6, 8. Mk 50 100 200 500 1000

kVADP − VMiQP k 0.214 0.168 0.084 0.07 0.045

Constr. time (sec) LP time (sec) Memory 1.67 0.18 784KB 5.59 2.66 3.2MB 22 4.3 12.8MB 130 80 80MB 507 1213 320MB

TABLE V: Accuracy and computational requirements as a function of basis elements for dim(X × U) = 6. 0.8

K1 × K2

0.7

x2

0.6

K1′ × K2′

0.5



0.4

0.3

0.2 0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

x1

Fig. 8: Race-car cornering example

Fig. 9: Velocity dependent cornering trajectories

averaging the corresponding empirical reach-avoid success probabilities over the initial conditions. Figure 7 and Table V indicate a similar trade-off between accuracy and complexity as observed in Section V-A. Note that there is a residual error in Figure 7 that could be caused by the truncation of uncertainties that makes the MiQP control policies suboptimal. C. Race car cornering Consider the problem of driving a race car through a tight corner in the presence of static obstacles, illustrated in Figure 8. As part of the ORCA project of the Automatic Control Lab (see http://control. ee.ethz.ch/∼racing/), a 6 state nonlinear model with 2 control inputs has been identified to describe the

17

0.7

Safe region K10 (m) K20 (m) K30 (rad) K40 (m/s) K50 (m/s) K60 (rad/s)

Raw data density N (−0.26, 0.53)

0.6

Probability density

0.5

0.4

0.3

Basis Variance [0.0008,0.0012] [0.0008,0.0012] [0.005,0.015] [0.005,0.015] [0.005,0.015] [2,4]

TABLE VI: Dimension safety limits and basis variances used in ADP approximation.

0.2

0.1

0

min max 0.2 1 0.2 0.6 −π π 0.3 3.5 -1.5 1.5 -8 8

0

2

4

6

8

Angular velocity

Fig. 10: Example of noise fit for the angular velocity. movement of 1:43 scale race cars. The model derivation is discussed in [52] and is based on a unicycle approximation with parameters identified on the experimental platform of the ORCA project using model cars manufactured by Kyosho. We denote the state space by X ⊂ R6 , the control space by U ⊂ R2 and the identified dynamics by a function f : X × U 7→ X . The first two elements of each state x ∈ X correspond to spatial dimensions, the third to orientation, the fourth and fifth to body fixed longitudinal and lateral velocities and the sixth to angular velocity. The two control inputs u ∈ U are the throttle duty cycle and the steering angle. As is typically observed in practice, the state predicted by the identified dynamics and the state measurements recorded on the experimental platform are different due to process and measurement noise. Analyzing the deviation between predictions and measurements, we identified a stochastic variant to the original model using additive Gaussian noise, g(x, u) = f (x, u) + ω,

ω ∼ N (µ, Σ).

(20)

The noise mean µ and diagonal covariance matrix Σ have been selected such that the probability density function of the process in (20) resembles the empirical probability density function estimated using measurements. Figure 10 illustrates the fit for the angular velocity where µ6 = −0.26 and Σ(6, 6) = 0.53; the rest of the states are handled in the same way. The kernel of the resulting stochastic process is by construction a Gaussian mixture with a single term. We cast the problem of driving the race car through a tight corner without reaching obstacles as a stochastic reach-avoid problem. The setup is similar to the one in Section V-B with the crucial difference that the system is highly nonlinear and the MiQP method used before is not applicable without suitable linearization. On the contrary, the developed ADP technique can be readily applied to this problem once the input data and design parameters of Algorithm 1 are fixed. We consider a horizon of T = 6 and a sampling time of 0.08 seconds. The safe region on the spatial dimensions is defined as (K10 × K20 ) \ A where A ⊂ R2 denotes the obstacle set that has to be avoided by the moving car, see Figures 8, 9. The full safe set is then defined as K 0 = ((K10 × K20 ) \ A) × K30 × K40 × K50 × K60 where K30 , K40 , K50 , K60 describe the physical limitations of the model car (see Table VI). Similarly, the target region for the spatial dimensions is denoted by K1 × K2 and corresponds to the end of the turn - Figure 9. The full target set is then defined as K = K1 × K2 × K30 × K40 × K50 × K60 which contains all states x ∈ K 0 for which (x1 , x2 ) ∈ K1 × K2 . We use a total of 2000 Gaussian RBF elements for each approximation step with ¯ and the hyper-rectangle centers and variances sampled according to uniform measures supported on X defined by the product of intervals in the rows of Table VI respectively. As in Sections V-A and V-B, we use a uniform state-relevance measure and a uniform sampling measure to construct each one of the

18

finite linear programs in Algorithm 1. All violation and confidence levels are chosen to be εk = 0.2 and 1 − βk = 0.99 respectively for k = {0, . . . , 5}. Having fixed all design parameters we implement the steps of Algorithm 1 and compute a sequence of approximate value functions. To evaluate the quality of the approximations we initialized the car at two different initial states x1 = (0.33, 0.4, −0.2, 0.5, 0, 0), x2 = (0.33, 0.4, −0.2, 2, 0, 0)

corresponding to entering the corner at low (x14 = 0.5 m/s) and high (x24 = 2 m/s) longitudinal velocities. The approximate value function values V˜0 (x1 ) = 0.98, V˜0 (x2 ) = 1 are both high at these initial states but the associated trajectories computed via Algorithm 2 vary significantly. In the low velocity case, the car avoids the obstacle by driving above it while in the high velocity case, by driving below it; see Figure 9. Such a behavior is expected since the car model would slip if it turns aggressively at high velocities. We also computed empirical reach-avoid probabilities in simulation, VADP (x1 ) = 1 and VADP (x2 ) = 0.99, by sampling 100 noise trajectories from each initial state and implementing the ADP control policy of (17) using the associated value function approximation (Figure 9). The controller was tested on the ORCA setup by running 10 experiments from each initial condition and sequentially applying the computed control inputs. Implementing the whole feedback policy online would require solving problems in the form of (17) within the sampling time of 0.08 seconds. This is conceptually possible since the control space is only two dimensional but requires developing an embedded nonlinear programming solver compatible with the ORCA setup. As demonstrated by the videos in (youtube:ETHZurichIfA), the car is successfully driving through the corner, avoiding the obstacle, even when the control inputs are applied in open loop. VI. C ONCLUSION We developed and analyzed a value function approximation algorithm for the reach-avoid dynamic programming recursion using linear programming and randomized convex optimization. The method suggested is based on function approximation properties of Gaussian radial basis functions and exploits their structure to encode the transition kernel of Markov decision processes and compute integrals over the reach-avoid safe and target sets. The fact that our method relies on solving linear programs allows us to tackle reach-avoid problems with larger dimensions than state-space gridding methods. The accuracy and reliability of the approach is investigated by comparing to heuristics based on well-established methods in the control of linear systems. The potential of the approach is demonstrated by tackling a reach-avoid control problem for a six dimensional nonlinear system with two control inputs. We are currently focusing on the problem of systematically choosing the center and variance parameters of basis function elements, exploiting knowledge about the dynamics of the considered systems. Apart from improving initial basis selection, we are looking into adaptive methods for choosing basis parameters and state-relevance measures to increase objective performance. In terms of computational efficiency, we are looking into decomposition methods for the large linear programming problems that arise in our approximation method to allow addressing reach-avoid problems of even higher dimensions. Finally, we are looking into tractable reformulations of the infinite constraints in the semi-infinite linear programs presented to avoid sampling-based methods. ACKNOWLEDGMENT The authors would like to thank Alexander Liniger from the Automatic Control Laboratory in ETH Zurich for his help in testing the algorithm on the ORCA platform. R EFERENCES [1] Feinberg, E. A., Shwartz, A., and Altman, E., 2002. Handbook of Markov decision processes: methods and applications. Kluwer Academic Publishers Boston, MA. [2] Puterman, M., 1994. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, Inc. [3] Abate, A., Prandini, M., Lygeros, J., and Sastry, S., 2008. “Probabilistic reachability and safety for controlled discrete time stochastic hybrid systems”. Automatica, 44(11), pp. 2724–2734.

19

[4] Abate, A., Amin, S., Prandini, M., Lygeros, J., and Sastry, S., 2007. “Computational approaches to reachability analysis of stochastic hybrid systems”. In Hybrid Systems: Computation and Control. Springer, pp. 4–17. [5] Prandini, M., and Hu, J., 2006. “Stochastic reachability: Theory and numerical approximation”. Stochastic hybrid systems, Automation and Control Engineering Series, 24, pp. 107–138. [6] Kushner, H. J., and Dupuis, P., 2001. Numerical methods for stochastic control problems in continuous time, Vol. 24. Springer. [7] Powell, W. B., 2007. Approximate Dynamic Programming: Solving the curses of dimensionality, Vol. 703. John Wiley & Sons. [8] Bertsekas, D., 2012. Dynamic programming and optimal control, Vol. 2. Athena Scientific Belmont, MA. [9] de Farias, D., and Van Roy, B., 2003. “The linear programming approach to approximate dynamic programming”. Operations Research, 51(6), pp. 850–865. [10] Kariotoglou, N., Summers, S., Summers, T., Kamgarpour, M., and Lygeros, J., 2013. “Approximate dynamic programming for stochastic reachability”. In IEEE European Control Conference, pp. 584–589. [11] Calafiore, G., and Campi, M., 2006. “The scenario approach to robust control design”. IEEE Transactions on Automatic Control, 51(5), pp. 742–753. [12] Summers, S., and Lygeros, J., 2010. “Verification of discrete time stochastic hybrid systems: A stochastic reach-avoid decision problem”. Automatica, 46(12), pp. 1951–1961. [13] Summers, S., Kamgarpour, M., Tomlin, C., and Lygeros, J., 2013. “Stochastic system controller synthesis for reachability specifications encoded by random sets”. Automatica, 49(9), pp. 2906–2910. [14] Nowak, A. S., 1985. “Universally measurable strategies in zero-sum stochastic games”. The Annals of Probability, pp. 269–287. [15] Brown, L., Purves, R., et al., 1973. “Measurable selections of extrema”. The annals of statistics, 1(5), pp. 902–912. [16] Bertsekas, D., 1976. “Dynamic programming and stochastic control”. [17] Sundaram, R. K., 1996. A first course in optimization theory. Cambridge university press. [18] Hern´andez-Lerma, O., and Lasserre, J. B., 1998. “Approximation schemes for infinite linear programs”. SIAM Journal on Optimization, 8(4), pp. 973–988. [19] Anderson, E. J., and Nash, P., 1987. Linear programming in infinite-dimensional spaces: theory and applications. Wiley New York. [20] Park, M., Darbha, S., Krishnamoorthy, K., Khargonekar, P., Pachter, M., and Chandler, P., 2012. “Sub-optimal stationary policies for a class of stochastic optimization problems arising in robotic surveillance applications”. In ASME 2012 5th Annual Dynamic Systems and Control Conference joint with the JSME 2012 11th Motion and Vibration Conference, American Society of Mechanical Engineers, pp. 263–272. [21] Krishnamoorthy, K., Park, M., Darbha, S., Pachter, M., and Chandler, P., 2013. “Approximate dynamic programming applied to uav perimeter patrol”. In Recent Advances in Research on Unmanned Aerial Vehicles. Springer, pp. 119–146. [22] Parrilo, P. A., 2003. “Semidefinite programming relaxations for semialgebraic problems”. Mathematical programming, 96(2), pp. 293– 320. [23] Lasserre, J. B., 2001. “Global optimization with polynomials and the problem of moments”. SIAM Journal on Optimization, 11(3), pp. 796–817. [24] Wang, Y., O’Donoghue, B., and Boyd, S., 2014. “Approximate dynamic programming via iterated Bellman inequalities”. International Journal of Robust and Nonlinear Control. [25] Summers, T., Kunz, K., Kariotoglou, N., Kamgarpour, M., Summers, S., and Lygeros, J., 2013. “Approximate dynamic programming via sum of squares programming”. IEEE European Control Conference. [26] Lasserre, J. B., Henrion, D., Prieur, C., and Tr´elat, E., 2008. “Nonlinear optimal control via occupation measures and LMI-relaxations”. SIAM Journal on Control and Optimization, 47(4), pp. 1643–1666. [27] Bertsimas, D., Brown, D. B., and Caramanis, C., 2011. “Theory and applications of robust optimization”. SIAM review, 53(3), pp. 464–501. [28] Ben-Tal, A., and Nemirovski, A., 2002. “Robust optimization–methodology and applications”. Mathematical Programming, 92(3), pp. 453–480. [29] Hettich, R., and Kortanek, K. O., 1993. “Semi-infinite programming: theory, methods, and applications”. SIAM review, 35(3), pp. 380– 429. [30] Campi, M., and Garatti, S., 2008. “The exact feasibility of randomized solutions of uncertain convex programs”. SIAM Journal on Optimization, 19(3), pp. 1211–1230. [31] Mohajerin Esfahani, P., Sutter, T., and Lygeros, J., 2014. “Performance Bounds for the Scenario Approach and an Extension to a Class of Non-convex Programs”. IEEE Transactions on Automatic Control (to appear). [32] Sutter, T., Esfahani, P. M., and Lygeros, J., 2014. “Approximation of Constrained Average Cost Markov Control Processes”. In IEEE Conference on Decision and Control. [33] Hartman, E. J., Keeler, J. D., and Kowalski, J. M., 1990. “Layered neural networks with Gaussian hidden units as universal approximations”. Neural computation, 2(2), pp. 210–215. [34] Sandberg, I. W., 2001. “Gaussian radial basis functions and inner product spaces”. Circuits, Systems and Signal Processing, 20(6), pp. 635–642. [35] Park, J., and Sandberg, I., 1991. “Universal approximation using radial-basis-function networks”. Neural computation, 3(2), pp. 246– 257. [36] Cybenko, G., 1989. “Approximation by superpositions of a sigmoidal function”. Mathematics of control, signals and systems, 2(4), pp. 303–314. [37] Bemporad, A., Filippi, C., and Torrisi, F. D., 2004. “Inner and outer approximations of polytopes using boxes”. Computational Geometry, 27(2), pp. 151–178. [38] Chen, S., Cowan, C., and Grant, P., 1991. “Orthogonal least squares learning algorithm for radial basis function networks”. IEEE Transactions on Neural Networks, 2(2), pp. 302–309.

20

[39] Wettschereck, D., and Dietterich, T., 1991. “Improving the performance of radial basis function networks by learning center locations”. In NIPS, Vol. 4, Citeseer, pp. 1133–1140. [40] Boyd, S. P., and Vandenberghe, L., 2004. Convex optimization. Cambridge university press. [41] Chen, X., 2007. “A new generalization of Chebyshev inequality for random vectors”. arXiv preprint arXiv:0707.0805. [42] Van Parys, B. P., Goulart, P. J., and Kuhn, D., 2014. “Generalized Gauss inequalities via semidefinite programming”. Mathematical Optimization. [43] Waechter, A., Laird, C., Margot, F., and Kawajir, Y., 2009. Introduction to IPOPT: A tutorial for downloading, installing, and using IPOPT. [44] Petretti, A., and Prandini, M., 2014. “An approximate linear programming solution to the probabilistic invariance problem for stochastic hybrid systems”. In IEEE Conference on Decision and Control. [45] Deori, L., Giulioni, L., and Prandini, M., 2014. “Optimal building climate control: a solution based on nested dynamic programming and randomized optimization”. In IEEE Conference on Decision and Control. [46] Borenstein, J., and Koren, Y., 1991. “The vector field histogram-fast obstacle avoidance for mobile robots”. IEEE Transactions on Robotics and Automation, 7(3), pp. 278–288. [47] Richards, A., and How, J. P., 2002. “Aircraft trajectory planning with collision avoidance using mixed integer linear programming”. In American Control Conference, 2002. Proceedings of the 2002, Vol. 3, IEEE, pp. 1936–1941. [48] Van Den Berg, J., Ferguson, D., and Kuffner, J., 2006. “Anytime path planning and replanning in dynamic environments”. In Robotics and Automation, 2006. ICRA 2006. Proceedings 2006 IEEE International Conference on, IEEE, pp. 2366–2371. [49] Patel, R. B., and Goulart, P. J., 2011. “Trajectory generation for aircraft avoidance maneuvers using online optimization”. Journal of guidance, control, and dynamics, 34(1), pp. 218–230. [50] Bemporad, A., and Morari, M., 1999. “Control of systems integrating logic, dynamics, and constraints”. Automatica, 35(3), pp. 407–427. [51] Rakovic, S., and Mayne, D., 2005. “Robust time optimal obstacle avoidance problem for constrained discrete time systems”. In Decision and Control, 2005 and 2005 European Control Conference. CDC-ECC’05. 44th IEEE Conference on, IEEE, pp. 981–986. [52] Liniger, A., Domahidi, A., and Morari, M., 2014. “Optimization-based autonomous racing of 1: 43 scale RC cars”. Optimal Control Applications and Methods.