Reinforcement Learning and Visual Object Recognition

Lucas Paletta
Computer Vision Group, Institute for Computer Graphics and Vision, Technical University Graz
Münzgrabenstraße 11, A-8010 Graz, Austria
Email: [email protected]

http://www.icg.tu-graz.ac.at/lpaletta

1 Introduction

This presentation provides an introduction to reinforcement learning methods and a proposal for dissertation work on using this concept for visual object recognition (figure 1). The first section of the presentation (figure 2) is concerned with the theoretical foundations of Markov decision problems (MDPs). Two different solutions are considered: dynamic programming and reinforcement learning. Subsequently, object recognition is described in the context of MDPs. Finally, a summary stresses the most important ideas. While focusing on methods to find optimal solutions of sequential decision problems in the framework of MDPs (figure 3), several applications are presented, current research is discussed, and the core ideas of the proposal are described.

2 Markov Decision Process

2.1 Policies

Imagine a mobile robot whose task is to find a battery charger in an office (figure 4). Assume that decisions for aiming at certain directions are based on visual information. For each time step t, a visual pattern, i.e. an image of brightness values, is captured from a camera mounted on the platform of the robot (figure 5). It is assigned an entry of a lookup table representing the states xi on the way to the goal. The diagram to the right illustrates all possible states of the task by pictorial cells; the adjacency of cells can be interpreted as temporal vicinity in visiting the corresponding states. When the robot occupies a particular state, it has the choice between different actions a (figure 6), i.e. transitions according to the 4 directions to states in the neighborhood: the north (aN), south (aS), east (aE) or west (aW) state. The set of fixed decisions for every choice of actions, i.e. in each state, is called a policy π. Hence π is a mapping from the set of states X to the set of actions A, π : X → A. The optimal strategy (red) leads to the goal by the shortest path, whereas a suboptimal one (blue) needs an additional number of actions to attain it.
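As a minimal illustration of these notions, the following Python sketch encodes a grid-world version of the task: states are cells, actions are the four compass moves, and a policy is a fixed mapping from states to actions. All names and dimensions (ROWS, COLS, GOAL, step) are hypothetical choices for the sketch, not taken from the original system.

```python
# Minimal grid-world sketch of states, actions and a policy (illustrative only).
from typing import Dict, Tuple

State = Tuple[int, int]          # a cell (row, col) of the office grid
ACTIONS = {"aN": (-1, 0), "aS": (1, 0), "aE": (0, 1), "aW": (0, -1)}
ROWS, COLS = 4, 4
GOAL: State = (0, 3)             # cell containing the battery charger

def step(x: State, a: str) -> State:
    """Deterministic transition delta(x, a): move unless it leaves the grid."""
    dr, dc = ACTIONS[a]
    r, c = x[0] + dr, x[1] + dc
    return (r, c) if 0 <= r < ROWS and 0 <= c < COLS else x

# A policy pi maps every state to one fixed action (here: a hand-coded example).
policy: Dict[State, str] = {
    (r, c): "aN" if r > 0 else "aE" for r in range(ROWS) for c in range(COLS)
}

x = (3, 0)                       # start in the lower-left cell
while x != GOAL:
    x = step(x, policy[x])       # follow the policy until the goal is reached
print("reached goal", x)
```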

2.2 Mathematical Background

A formal description of an MDP (figure 7) consists of the set of possible states X and the set of possible actions A, together with the transition function δ describing the change from state xi to state xj when executing action a. In deterministic MDPs the subsequent state is reached with probability 1, whereas in nondeterministic MDPs, i.e. the more interesting ones, the transition is probabilistic according to a distribution over the following states. With each action a, the decision maker, i.e. the agent, receives a payoff or reward r, which contributes to the definition of a utility function. An optimal strategy can be found by means of a value function (figure 8). It is defined for every state as the cumulative reward that is received in subsequent steps until attaining the task goal when following a particular decision strategy π.
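In compact notation (a reconstruction of the definition shown in figure 7):

\[
  \Theta = \{X, A, \delta, r\}, \qquad
  \delta(x_i, a) = x_j, \qquad
  r(x_i, a) = r, \qquad
  \pi : X \rightarrow A ,
\]

where in the nondeterministic case the deterministic transition function δ is replaced by a transition distribution $P(x_j \mid x_i, a)$.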

There exist tasks that require the optimization of a behavior instead of searching a path to the goal. For this purpose, a discount factor γ is introduced to exponentially decrease the contribution of future rewards so as to keep the resulting sum finite. In nondeterministic MDPs, the expected sum over all possible successor states is computed. How can an optimal strategy be retrieved (figure 9)? A first solution, dynamic programming, requires knowledge of all rewards r and transition descriptions δ. When starting from an arbitrary state, neither the number of steps nor the path to the goal is known in advance. Fortunately it suffices to optimize the immediate next step, so a global solution is found by recursion: the value of a state xt is described by the value of the successor state xt+1 plus the reward r received during the transition to the next state. If the optimal values V* were already known, i.e. those evaluating the cumulative rewards of the optimal future action sequence, an optimal strategy π* could be performed by selecting the action which maximizes the sum of the reward and the successor value, i.e. the action that promises maximum return. The optimal value function is computed using the Bellman equation, i.e. a recurrence relation from dynamic programming. Starting with an arbitrary estimate Vk, the value function is recursively updated until it eventually converges to V*, the optimal value function.
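A reconstruction of the formulas shown in figures 8 and 9, with y = δ(x, a) the successor state and γ the discount factor:

\begin{align*}
  V^{\pi}(x) &= E\{r_t \mid x\} = E\Big\{\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}\Big\},\\
  \pi^{*}(x) &= \arg\max_{a}\,\big(r_{t+1} + \gamma V^{*}(y)\big),\\
  V^{*}(x)   &= r_{t+1} + \gamma V^{*}(y) \quad \text{(Bellman optimality equation)},\\
  V_{k}(x)   &= \max_{a}\,\big(r_{t+1} + \gamma V_{k-1}(y)\big) \quad \text{(value iteration)}.
\end{align*}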

3 Solving the MDP

3.1 Dynamic Programming

Dynamic programming methods (figure 10) are preferably applied to problems that possess optimal substructure, i.e. global solutions are recursively computed from solutions of specified subproblems. In contrast to divide-and-conquer methods, they take advantage of solutions of commonly shared subproblems, so that these solutions are computed once but can be reused multiple times thereafter. Dynamic programming provides a solution to the MDP (figure 9) when only the values V* are unknown, i.e. r and δ are given; the optimal strategy π* is then derived from V*.
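Continuing the hypothetical grid-world sketch from section 2.1, the value-iteration recursion can be written in a few lines (a sketch that assumes r and δ are known, as dynamic programming requires; the discount factor and reward are arbitrary choices):

```python
# Value iteration on the grid world sketched above (illustrative only).
GAMMA = 0.9                      # discount factor gamma
REWARD = -1.0                    # reward of -1 for every move, 0 at the goal

def value_iteration(n_sweeps: int = 100) -> dict:
    V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
    for _ in range(n_sweeps):
        for x in V:
            if x == GOAL:
                continue         # terminal state keeps value 0
            # Bellman backup (in place): V(x) <- max_a [ r + gamma * V(delta(x, a)) ]
            V[x] = max(REWARD + GAMMA * V[step(x, a)] for a in ACTIONS)
    return V

V_star = value_iteration()
# Greedy policy extraction: pi*(x) = argmax_a [ r + gamma * V*(delta(x, a)) ]
pi_star = {x: max(ACTIONS, key=lambda a: REWARD + GAMMA * V_star[step(x, a)])
           for x in V_star if x != GOAL}
```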

3.2 Reinforcement Learning

In most applications, the parameters r and δ are initially unknown and thus have to be learned. In the framework of reinforcement learning (figure 11), a so-called agent estimates these quantities by statistical evaluation of its experience while executing its task. It tries different actions, observes their consequences and adjusts its strategy π accordingly, e.g. in proportion to deviations from its expectations. A second solution to an MDP is thus provided by temporal difference (TD) learning (figure 12), a particular method of reinforcement learning. A consistent value function should obey the consistency condition described above (figure 9). Starting with an arbitrary estimator Vn of the value function, there may result an error δ caused by a deviation from the consistency equation. The quantities r and V* are not known in advance, so an estimate δ̂ is used instead; it contributes to the update of the current values of the value function. This estimator provably converges to the optimal values V* [12] (a minimal sketch of this update follows at the end of this subsection).

In figure 13, the diagrams illustrate results of the reinforcement learning process in a simple application. Top left, the cells' contents depict the optimal values of the corresponding states. Top right, the optimal action for each state is shown by arrows pointing to the successor state; starting from any state, the arrows guide the agent to the goal of the task. The diagram bottom left visualizes the learning effect by plotting the decreasing lengths of trials, i.e. the number of steps to the goal, while the estimators of the value function are continuously updated. The diagram bottom right shows the optimal strategy when some transitions are not permitted, i.e. in the presence of an obstacle, which changes the policy for some of the states.

Current research in reinforcement learning is focused on finding universal function approximators (figure 14) for estimating the value function [3]. To date, rigorous convergence proofs exist only for lookup-table approaches, although several successful implementations of generalizing estimators are reported in the literature [14, 3, 7]. Another issue is balancing exploration and exploitation: the state space has to be explored to register the payoffs received by executing actions; to avoid an exhaustive search over the state space, this discovery should be efficient, visiting only those states that are needed for a sufficiently precise estimate. The knowledge about these state transitions can be exploited to define a strategy, which is evaluated in turn. Multi-agent learning deals with the communication and organization of sets of agents performing subtasks in a hierarchy of goals.
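A minimal sketch of the TD(0) update δ̂ = [r_{t+1} + γ Vn(x_{t+1})] − Vn(x_t), again reusing the hypothetical grid world and policy defined above (the learning rate and trial count are arbitrary choices, not values from the original work):

```python
import random

# TD(0) learning of the state-value function for a fixed policy pi
# (reuses ROWS, COLS, GOAL, ACTIONS, step and policy from the earlier sketch).
ALPHA, GAMMA, REWARD = 0.1, 0.9, -1.0

V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
for trial in range(500):
    x = (random.randrange(ROWS), random.randrange(COLS))   # random start state
    while x != GOAL:
        a = policy[x]                     # action chosen by the current policy
        y = step(x, a)                    # observed successor state
        # temporal-difference error: delta_hat = [r + gamma * V(y)] - V(x)
        delta_hat = REWARD + GAMMA * V[y] - V[x]
        V[x] += ALPHA * delta_hat         # move the estimate toward consistency
        x = y
```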

3.3 Applications of Reinforcement Learning

The most cited application (figure 15) of temporal difference learning methods is TD-Gammon, a neural network that learned to play the game of backgammon [13]. The network was shown to achieve master-level play and has won against some of the best human players in the world. Another important example is the elevator dispatching system for use in skyscrapers [7]. In computer vision, a breakthrough of reinforcement learning methods has not been achieved so far. Draper [8, 9] describes the optimal assembly of visual procedures by reinforcement learning; Peng and Bhanu [10] find the best parameter settings for color image segmentation. Bandera et al. [5] use the framework to find saccade sequences of shortest length for the purpose of 2-D object recognition. The most important research labs currently working on this topic are the Center for Visual Science at the University of Rochester (Ballard, Whitehead), the Center for Automated Learning and Discovery at Carnegie Mellon University (Thrun, Davies), the University of California (Peng), the University of Massachusetts at Amherst (Sutton, Draper), MIT (Singh, Jordan), and others.

4 Object Recognition

4.1 Recognition Process

The theory of reinforcement learning is now applied to visual object recognition (figure 16). Object recognition is the task of classifying a given pattern as an instance of an object class out of a database of known objects. In many cases, the interpretation of a single 2-D pattern does not suffice for a confident decision, so the information from multiple views should be integrated to achieve an improved global classification. Object recognition in this context becomes the task of attaining a most reliable decision at minimal time cost. The decision process is defined on the basis of visual information, i.e. for each 2-D view we are interested in finding the action that provides access to the most discriminative next view.

The dynamics of the recognition process (figure 17) now emerges from the interpretation of a sequence of subsequent 2-D views. The visual patterns induce corresponding probability distributions over the object hypotheses. A distribution which integrates the information of all previous, local hypotheses is computed by information fusion. In parallel, the information from the sequence of visual patterns is fused into a sequence of recognition states which reflect the perceptual progress during a trial. If the resulting object hypotheses attain sufficient confidence, the agent reaches the goal of the task. Reinforcement learning then provides a mapping from recognition states to actions, e.g. camera movements, evaluated by the corresponding increase in the confidence of the object hypotheses. An optimal strategy selects exactly those actions that lead directly to the goal, represented by a predefined level of entropy in the posterior distribution over object hypotheses (a minimal sketch of this fusion and reward computation follows below). Reinforcement learning not only finds the optimal actions but actually learns the resulting mapping, which can be performed as reactive behavior by autonomous systems.

Necessary prerequisites (figure 18) for implementing reinforcement learning methods at the ICG are available: the Active Vision Laboratory enables visual experiments with multiple degrees of freedom, including control of illumination and controlled rotation and translation of the objects. Theory about the integration of sensor information is provided by the results of the active fusion research group.
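A minimal sketch of such a recognition state update, assuming the per-view posteriors p(O_i | x) are fused by elementwise multiplication and renormalization and that the entropy loss between successive fused distributions serves as the reward (the function names, distributions and threshold are hypothetical, not taken from the original system):

```python
import numpy as np

def fuse(prior: np.ndarray, view_posterior: np.ndarray) -> np.ndarray:
    """Fuse the accumulated hypothesis distribution with a new view's
    posterior p(O_i | x) by elementwise product and renormalization."""
    fused = prior * view_posterior
    return fused / fused.sum()

def entropy(p: np.ndarray) -> float:
    """Shannon entropy of the object-hypothesis distribution (in bits)."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Example: 4 object hypotheses, two successive views.
h = np.full(4, 0.25)                        # uniform prior over objects
view1 = np.array([0.4, 0.3, 0.2, 0.1])      # posterior from the first 2-D view
view2 = np.array([0.7, 0.1, 0.1, 0.1])      # posterior from a more discriminative view

h_new = fuse(h, view1)
reward = entropy(h) - entropy(h_new)        # loss of entropy rewards the chosen action
h_final = fuse(h_new, view2)
done = entropy(h_final) < 0.5               # goal: entropy below a predefined threshold
```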

4.2 Objectives

The following objectives (figure 19) are identified for a project applying reinforcement learning methods to automated object recognition.

Learning optimal fusion strategies. By reinforcement learning, the optimal action sequence can be found. Perception is fused by statistical inference, and the recognition system becomes adaptive to changes in the probabilistic environment. Eventually the reactive system performs optimal control in real time.

Learning selective perception. For large object databases, the size of the state space explodes; reduction by clustering techniques or the extraction of discriminative features should therefore improve the scaling of the method.

Scene exploration. A complex scene, consisting of a set of objects, should be interpreted by a complex behavioral architecture. The goal is to make the system learn to interact with the environment on a complex level. One important problem to face is occlusion. Hence particular reinforcement methods should be exploited or even developed to structure the strategic concept.

We now answer the following four important questions (figure 20).

1. What is the original contribution of the work? The intention is to apply global optimization to the task of object recognition, in contrast to heuristic assumptions about the reasoning. Thus a global evaluation function contributes to the emergence of a complex behavioral architecture which is in accordance with the purpose of the task. Reinforcement learning has not yet been implemented for three-dimensional object recognition; there exist valuable frameworks dealing with 2-D objects [5] from which some ideas can be transferred.

2. Why is the work important? We follow the paradigm of purposive vision [1, 4], i.e. to use the representations and methods that are necessary to perform the task at hand, without the construction of a general-purpose system. The objective parameter, i.e. the reward, plays the role of evaluating computational structures for the purpose of discrimination between the different object models. MDPs provide a mathematical framework to state the problems on a quantitative basis.

3. What is the most related work? Learning to recognize objects has already been outlined in the broad framework of aspect representations [11]. Sequential recognition minimizing perceptual entropy measures for the special case of three-dimensional, predefined geometric shape models is described in [6, 2] in the framework of active recognition. Reinforcement learning was used to find optimal saccade sequences in 2-D object recognition [5]. To the knowledge of the author, no work has been done so far on (1) optimal recognition of (2) arbitrary 3-D objects (3) from appearance.

4. Who benefits from this work? Time-critical systems are dependent on minimal execution time, which is guaranteed by the proposed methods. Once the optimal strategy is learned, the system follows a mapping from perceptual states to actions that promise processing of the most discriminating features. Reinforcement learning not only finds the most distinguishing views by incorporating knowledge about future payoffs into the decision, but also learns the mapping, i.e. the policy for automatic recognition without any further reasoning. The mathematical framework should provide further insight into the mechanisms underlying object recognition.

5 Conclusion

The Markov decision task (figure 21) was described as a fundamental problem class for object recognition, while reinforcement learning was outlined as an efficient tool to find an optimal strategy without having a model of the environment. Object recognition is thus a decision process in which the described mathematical framework enables the acquisition of optimal fusion strategies.

Current work (figure 22) is focused on finding an optimal strategy for discriminating wire models. After a preprocessing stage of background subtraction, edge detection and normalization in brightness and scale, the digital image, which is considered a high-dimensional vector of pixel brightness values, is projected onto a low-dimensional eigenspace by principal component analysis (PCA). The eigenspace representation of the object is probabilistically interpreted by a radial basis function (RBF) network which performs classification via a conditional distribution over the object hypotheses. The information of each perception is fused into an integrated probability distribution from which an entropy value is computed. The loss of entropy between two subsequent distributions is used by reinforcement methods to reinforce actions that lead to more discriminative views. The actions considered are rotations of a turntable by shifts of k × 30°.
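A sketch of the described chain from image to object hypotheses, using placeholder dimensions and randomly initialized PCA and RBF parameters (none of these values come from the actual system; the real eigenvectors, centers and weights would be learned from object views):

```python
import numpy as np

rng = np.random.default_rng(0)

def pca_project(image: np.ndarray, mean: np.ndarray, eigvecs: np.ndarray) -> np.ndarray:
    """Project a flattened image onto the low-dimensional eigenspace."""
    return eigvecs.T @ (image.ravel() - mean)

def rbf_posterior(g: np.ndarray, centers: np.ndarray, widths: np.ndarray,
                  weights: np.ndarray) -> np.ndarray:
    """RBF network mapping an eigenspace point g to p(O_i | g)."""
    act = np.exp(-np.sum((centers - g) ** 2, axis=1) / (2.0 * widths ** 2))
    scores = np.maximum(weights @ act, 1e-12)   # one output unit per object hypothesis
    return scores / scores.sum()

# Placeholder model: 64x64 images, 10-D eigenspace, 20 RBF centers, 4 objects.
D, K, M, N_OBJ = 64 * 64, 10, 20, 4
mean = rng.normal(size=D)
eigvecs = np.linalg.qr(rng.normal(size=(D, K)))[0]        # orthonormal basis
centers, widths = rng.normal(size=(M, K)), np.full(M, 1.0)
weights = rng.random(size=(N_OBJ, M))

image = rng.normal(size=(64, 64))               # stands in for a preprocessed view
g = pca_project(image, mean, eigvecs)
p = rbf_posterior(g, centers, widths, weights)  # conditional distribution p(O_i | view)
# The entropy loss after fusing p with the previous hypotheses (see the fusion
# sketch in section 4.1) would serve as the reward for the turntable action.
```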

References

1. Y. Aloimonos. Purposive and qualitative active vision. In International Conference on Pattern Recognition, pages 346–360, 1990.
2. T. Arbel and F. P. Ferrie. Informative views and sequential recognition. In European Conference on Computer Vision, pages 469–481, 1996.
3. L. Baird. Residual algorithms: Reinforcement learning with function approximation. In 12th International Conference on Machine Learning, pages 30–37, 1995.
4. D. H. Ballard and C. M. Brown. Principles of animate vision. CVGIP: Image Understanding, 56(1):3–21, 1992.
5. C. Bandera, F. J. Vico, J. M. Bravo, M. E. Harmon, and L. C. Baird III. Residual Q-learning applied to visual attention. In 13th International Conference on Machine Learning, pages 20–27, 1996.
6. F. G. Callari and F. P. Ferrie. Autonomous recognition: driven by ambiguity. In Conference on Computer Vision and Pattern Recognition, pages 701–707, 1996.
7. R. H. Crites and A. G. Barto. Improving elevator performance using reinforcement learning. In Advances in Neural Information Processing Systems, volume 8, pages 1017–1023. The MIT Press, 1996.
8. B. A. Draper. Learning grouping strategies for 2D and 3D object recognition. In Proceedings ARPA Image Understanding Workshop, pages 1447–1454, 1996.
9. B. A. Draper. Learning control strategies for object recognition. In K. Ikeuchi and M. Veloso, editors, Symbolic Visual Learning, chapter 3, pages 49–76. Oxford University Press, New York, 1997.
10. J. Peng and B. Bhanu. Closed-loop object recognition using reinforcement learning. In Conference on Computer Vision and Pattern Recognition, pages 538–543, 1996.
11. M. Seibert and A. M. Waxman. Adaptive 3-D object recognition from multiple views. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):107–124, 1992.
12. R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
13. G. Tesauro. Practical issues in temporal difference learning. Machine Learning, 8:257–277, 1992.
14. G. Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215–219, 1994.

[Presentation slides 1–22, referenced as figures 1–22 in the text above; only the slide titles are reproduced here: Reinforcement Learning and Visual Object Recognition (title); Overview; Benefits; Robot Task; Visual State Space; Policies; Markov Decision Process; Value Function; Solution 1: Dynamic Programming; Dynamic Programming; Reinforcement Learning; Solution 2: Temporal Difference (TD) Learning; Demo: Shortest Path; Research in Reinforcement Learning; Applications; Dissertation Proposal; Recognition Process; Prerequisites: Active Vision Lab / Active Fusion; Objectives; Why?; Summary; Current Work.]