Reinforcement Learning Neural Network For Distributed Decision Making

Galina L. Rogova
Center for Multisource Information Fusion
Encompass Consulting
Honeoye Falls, NY, U.S.A.
[email protected]

Jyotsna Kasturi
Center for Multisource Information Fusion
State University of New York at Buffalo
Amherst, NY, U.S.A.
[email protected]

Abstract - The paper addresses the problem of learning in a distributed system for decision making in an uncertain environment. The system comprises autonomous cooperating agents and a fusion center. The agents acquire information about the environment, make local decisions, and transmit them to the fusion center, which produces final decisions and presents them to the environment for evaluation. The environment provides binary feedback on the correctness of each decision, which is used to modify the decision-making scheme of the agents. A reinforcement learning neural network is introduced for modeling such a system. The neural network is composed of learning units representing the agents and the fusion center. The model explicitly represents uncertainty related to the decisions of the agents by utilizing pignistic probabilities of the agents' decisions. The neural network is trained with an adaptation of the complementary reinforcement method to distributed reinforcement learning. The developed adaptive learning process improves not only the performance of the whole system but also the performance of each individual agent.

Keywords: Distributed systems, Reinforcement learning, Neural network, Belief function

1. Introduction

The paper presents an approach to learning in a multi-agent system for decision making in an uncertain environment. The learning goal is to adjust the system's decision-making process in order to improve its performance in future situations. A general multi-agent system consists of a group of distributed intelligent agents that have to coordinate their knowledge, goals, skills, and plans in order to make decisions, take actions, and solve problems. Agents in a distributed system may have different areas of expertise, specific a priori knowledge, and different problem-solving abilities. They may be able to observe only certain characteristics of the environment, they may observe different parts of the environment (spatially or temporally), or only some of them may be able to interact with the environment. Since no single agent has complete information about the environment, the agents have to cooperate to achieve their goals. Many coordination schemes have been developed

in the field of distributed artificial intelligence (see, e.g., [1-4]). Most of these schemes require information sharing among agents. Information sharing can be explicit, where agents communicate partial results, observations, or resource availability, or implicit, where agents use knowledge about each other's capabilities. However, this type of information exchange may not be available or may be manipulated by malevolent agents [5]. In our work, we consider a hierarchical coordination scheme in which agents do not share information with each other but communicate with a designated agent, the fusion center. We investigate the problem of distributed learning in a two-level hierarchical model in which intelligent agents make local observations at the lower level. The agents do not interact with the environment and, when they learn separately, make their decisions about the state of the environment by using only the internal structure of the observed features. The fusion center collects and combines the decisions made by the agents to produce decisions of the system based on their collective knowledge. The system decisions are presented to the environment, which in turn sends feedback to the fusion center in the form of a reinforcement signal (right/wrong). The fusion center broadcasts the reinforcement signal back to the agents, which use it to change their decision functions. We consider here a case of associative reinforcement learning in which the reinforcement signal is evaluated by the system immediately after it is obtained. Figure 1 shows the architecture of the system described in the paper. The reinforcement learning setting corresponds well to the uncertain and imprecise environments we deal with in many fusion problems (for example, in distributed target detection, identification, and tracking, or in situation assessment), where supervised feedback is not necessarily available. Our goal is to teach the system to maximize the accuracy of the system decisions by taking advantage of the agents' collective knowledge and the feedback obtained by the fusion center. Unlike an individual learning process with subsequent decision fusion, which does not change the agents' decision-making ability, the developed distributed learning process improves not only the performance of the whole system but also the performance of each individual agent. In order to design such an information fusion-based coordination and learning scheme, it is necessary to define a fusion model, a function relating each agent's decisions to the

reinforcement signal obtained by the fusion center, and a learning model for the agents. The learning method is built upon our previous studies [6,7], where we introduced a hybrid system combining heuristic and connectionist approaches. In the hybrid system, the agents were modeled by a reinforcement competitive neural network [8,9]. The function relating agent decisions to the reinforcement signal obtained from the environment by the fusion center was defined by a set of heuristic rules, and the fusion algorithm was based on the Dempster rule of combination [10]. In the approach introduced here, we replace the combination of heuristic rules and neural networks with an evidential connectionist system. This system uses the reinforcement signal sent by the environment to the fusion center to adjust both the weights defining the decision ability of each agent and the weights representing the "trust" in the individual decisions of each agent.



[Figure 1. Distributed Reinforcement Learning System: the agents (Agent 1, ..., Agent k) pass their knowledge (hypotheses, confidence levels) to the fusion center; the fusion center's decisions are presented to the environment, which returns a reinforcement signal (RS) that is broadcast back to the agents.]

[Figure 2. General structure of the connectionist representation: subnets 1, ..., k (the agents) receive inputs X_1, ..., X_k and output pignistic probabilities BetP^1, ..., BetP^k, which feed subnet k+1 (the fusion center); the reinforcement signal is fed back to all subnets.]

2. Connectionist system architecture

The evidential connectionist reinforcement learning model for the distributed system presented in Figure 1 is represented by a multi-level reinforcement neural network comprising k + 1 connected subnet units modeling the k lower-level agents and the fusion center (Figure 2). The network operates by (see also the sketch after this list):
• receiving input from the environment, with the input to each subnet corresponding to the observations of one of the agents ($X_k$ for agent $k$);
• modeling the decisions of the agents based on the distance between an agent's observations and the class reference vectors, within the framework of the Transferable Belief Model [11];
• propagating the pignistic probabilities ($\mathrm{BetP}_k$) [11] of all possible decisions of the agents and their combination through the net;
• producing outputs corresponding to the combined agent decisions on the state of the environment;
• making a decision based on the output values according to a defined decision rule;
• presenting this decision for evaluation to the environment, which sends back a scalar reinforcement signal about the correctness of the decision; this signal is propagated back to the learning units according to the learning algorithm; the reinforcement signal obtained by the units is used for modifying their decision surfaces.
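For orientation, the cycle above can be written as a short control loop. This is only an illustrative sketch: `agents`, `fusion`, and `environment` are hypothetical callables standing in for the agent subnets, the fusion-center subnet, and the external evaluator; only the control flow follows the list.

```python
import numpy as np

def run_cycle(x_inputs, agents, fusion, environment):
    """One operating cycle of the distributed network (schematic only).

    x_inputs:    list of observation vectors, one per agent.
    agents:      list of callables mapping an observation to BetP^i over K classes.
    fusion:      callable mapping the list of BetP^i vectors to K output scores.
    environment: callable mapping the decided class index to a reinforcement
                 signal R in {+1, -1}.
    """
    betp = [agent(x) for agent, x in zip(agents, x_inputs)]  # agents' pignistic outputs
    outputs = fusion(betp)                                   # combined decision scores
    decision = int(np.argmax(outputs))                       # decision rule: max output
    reward = environment(decision)                           # scalar reinforcement signal
    return betp, outputs, decision, reward                   # consumed by the learning rules
```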

We consider two candidate architectures for implementing the system shown in Figure 2. In the first one, a subnet representing an individual agent is modeled by an evidential competitive neural network, while the fusion center is a fully connected feed-forward network with $K \cdot I$ input and $K$ output nodes, where $K$ is the number of decision hypotheses and $I$ is the number of agents. The second architecture employs the same representation of the agents, while the subnet modeling the fusion center is implemented as a connectionist representation of the Dempster combination rule. This network has 2 hidden layers and the same number of input and output nodes as the previous one. The first hidden layer is organized into $I$ blocks of $K + 1$ nodes, with activations $m^i(\theta_k) = \lambda_i \mathrm{BetP}_k^i$ and $m^i(\Theta) = 1 - \sum_k \lambda_i \mathrm{BetP}_k^i$, where $\mathrm{BetP}^i$ are the outputs of the lower-level subnets and $\lambda_i$ ($\lambda_i \ge 0$ and $\sum_i \lambda_i \le 1$) can be regarded as a measure of the relative reliability of agent $i$. Fixed-weight connections between layer 2 and layer 3 and within layer 3 are similar to those of the neural network k-nearest neighbor classifier based on the Dempster-Shafer theory [12] and provide for the combination of the basic probability assignments computed in hidden layer 1.
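As an illustration of this first hidden layer, the block activations can be computed as follows. The sketch uses our own array layout and function name; the constraints on the $\lambda_i$ are assumed to hold and are not enforced.

```python
import numpy as np

def discounted_masses(betp, lam):
    """First hidden layer of the Dempster-rule fusion subnet (sketch).

    betp: (I, K) array of agent outputs BetP^i_k.
    lam:  (I,) array of reliabilities lambda_i (assumed lambda_i >= 0).
    Block i of the result holds m^i(theta_k) = lambda_i * BetP^i_k for k = 1..K
    and, as its last entry, m^i(Theta) = 1 - sum_k lambda_i * BetP^i_k.
    """
    betp = np.asarray(betp, dtype=float)
    lam = np.asarray(lam, dtype=float)
    m_singletons = lam[:, None] * betp                            # m^i(theta_k)
    m_theta = 1.0 - m_singletons.sum(axis=1, keepdims=True)       # m^i(Theta)
    return np.concatenate([m_singletons, m_theta], axis=1)        # (I, K+1)
```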

The focus of this paper is the description of the reinforcement learning neural network in which the fusion center is modeled by a fully connected feed-forward neural network. Sections 3, 4, and 5 describe in detail its architecture, the reinforcement learning algorithms, and our experiments and results.

3. Architecture of an evidential reinforcement learning neural network

Each individual agent observes the environment and has the ability to extract a particular type of information that can be represented by a feature vector $X^i = (x_1^i, x_2^i, \ldots, x_{N_i}^i)$, where $N_i$ is the dimension of the feature vector extracted by agent $i$ and $i = 1, \ldots, I$, with $I$ the number of agents. Let $\Theta = \{\theta_1, \ldots, \theta_K\}$ be a frame of discernment, where $\theta_k$ is the hypothesis that a pattern belongs to class $k$. For each agent $i$ and each class $k$, a proximity measure

$\Phi(X^i, W_k^i) = \Phi(d(X^i, W_k^i))$   (1)

between the feature vector $X^i$ and a class representative vector $W_k^i$ is computed. $\Phi$ is a decreasing function of the distance $d(X^i, W_k^i)$ between $X^i$ and $W_k^i$, with $0 \le \Phi(d(X^i, W_k^i)) \le 1$ and $\Phi(d(X^i, W_k^i)) = 1$ if $d(X^i, W_k^i) = 0$. The particular form of the function $\Phi(X^i, W_k^i)$ will be discussed in Section 5. $\Phi(X^i, W_k^i)$ serves as a weight of support for hypothesis $\theta_k$ and yields a simple support function $m_k^i$:

$m_k^i(\theta_k) = \Phi(X^i, W_k^i)$, $m_k^i(\Theta) = 1 - \Phi(X^i, W_k^i)$, and $m_k^i(A) = 0 \;\; \forall A \ne \theta_k, A \subset \Theta$.   (2)

A combination of all the simple support functions with the "unnormalized" Dempster rule [10,11] leads to the basic probability assignment

$m^i = \bigoplus_{k=1}^{K} m_k^i$.   (3)
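Because each $m_k^i$ is a simple support function focused on the singleton $\theta_k$, the combination (3) with the unnormalized Dempster rule can be written in closed form: mass remains only on the singletons, on $\Theta$, and on the empty set. A minimal NumPy sketch of this computation (the function name and interface are ours):

```python
import numpy as np

def agent_mass(phi):
    """Combine K simple support functions with the unnormalized Dempster rule.

    phi[k] = Phi(X^i, W_k^i) is the support for hypothesis theta_k, eq. (2).
    Returns (m_singletons, m_theta, m_empty) with
    m_singletons[k] = m^i({theta_k}), m_theta = m^i(Theta), m_empty = m^i(emptyset).
    """
    phi = np.asarray(phi, dtype=float)
    # Conjunctive combination of simple support functions focused on singletons:
    # m({theta_k}) = Phi_k * prod_{l != k} (1 - Phi_l),  m(Theta) = prod_k (1 - Phi_k).
    one_minus = 1.0 - phi
    m_theta = np.prod(one_minus)
    m_singletons = np.array(
        [phi[k] * np.prod(np.delete(one_minus, k)) for k in range(len(phi))]
    )
    m_empty = 1.0 - m_singletons.sum() - m_theta   # conflict mass, kept unnormalized
    return m_singletons, m_theta, m_empty
```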

Decisions on the state of the environment (the class the pattern in question belongs to) are made at the pignistic level, where the combined beliefs are transformed into a pignistic probability function. The transformation is based on the generalized Insufficient Reason Principle, according to which $\forall A \subseteq \Theta$ the mass of belief $m(A)$ is distributed equally among the elements of $A$ [12]. The pignistic probability of hypothesis $\theta_k$ is computed according to [11]

$\mathrm{BetP}_k = \mathrm{BetP}(\theta_k) = \sum_{\theta_k \subseteq A,\; A \subseteq \Theta} \dfrac{m(A)}{|A| \, (1 - m(\emptyset))}$.   (4)
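With only singletons, $\Theta$, and $\emptyset$ as focal elements, equation (4) reduces to a one-line computation. Continuing the previous sketch (again an illustration, not the paper's code):

```python
import numpy as np

def pignistic(m_singletons, m_theta, m_empty):
    """Pignistic transformation, eq. (4), for the focal elements produced by agent_mass().

    Each singleton keeps its own mass, the mass on Theta is shared equally among
    the K classes, and the total is rescaled by 1 / (1 - m(emptyset)).
    """
    K = len(m_singletons)
    return (np.asarray(m_singletons, dtype=float) + m_theta / K) / (1.0 - m_empty)
```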

The connectionist representation of agent $i$ is presented in Figure 3. Each agent is implemented as a feed-forward network with four hidden layers, $M^i$ input nodes, and $K$ output nodes, where $M^i$ is the dimension of the observations of agent $i$ and $K$ is the number of hypotheses in $\Theta$. The transfer function of the first hidden layer (HL1) is $\Phi(X^i, W_k^i)$. The activations of hidden layer HL2 are the simple support functions obtained as in (2); the activations of hidden layer HL3 are the combinations of the simple support functions (3) and represent the basic probability assignments for each hypothesis. Hidden layer HL4 is an auxiliary layer used for calculating the pignistic probabilities of the agent's decisions as output activations. The connections between all hidden layers and between HL4 and the output layer are fixed. The weight vectors $W_k^i$ can be viewed as the centers of clusters corresponding to the classes of input observations. When trained separately, the agents constitute unsupervised self-organizing neural networks. The pignistic probabilities representing the outputs of the lower-level agents form the basis for the inputs to the neural network corresponding to the fusion center subnet. We consider two different models: deterministic and stochastic. The deterministic model is a two-layer neural network that uses the pignistic probabilities as direct input. The stochastic variant is a three-layer network with hidden layer activations

$z_k^i = 1$ if $\mathrm{BetP}_k^i \ge 1/2$, and $z_k^i = 0$ otherwise.   (5)

The nodes of the hidden layer are fully connected to the $K$ output nodes, which have the sigmoid transfer function.
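An illustrative sketch of the fusion-center subnet's forward pass in both variants follows; the output-layer weight matrix `V` and bias `b` are hypothetical names, and the deterministic case simply skips the binarization of eq. (5).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fusion_forward(betp, V, b, stochastic=True):
    """Forward pass of the fusion-center subnet (sketch).

    betp: (I, K) array of the agents' pignistic probabilities.
    V:    (K, I*K) output-layer weights; b: (K,) biases.
    The stochastic variant first binarizes BetP as in eq. (5); the deterministic
    variant feeds the pignistic probabilities directly to the sigmoid output layer.
    """
    betp = np.asarray(betp, dtype=float)
    h = (betp >= 0.5).astype(float).ravel() if stochastic else betp.ravel()
    return sigmoid(V @ h + b)   # K output nodes with sigmoid transfer function
```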

4. Learning algorithm

The neural network described in the previous section is trained by a reinforcement learning method in which the weights are adjusted based on uncertain feedback from the environment about the correctness of its decisions (right/wrong). Let $X = (X^1, \ldots, X^I)$ be the input feature vector of a training pattern belonging to class $p$, let $O$ be the corresponding output vector of the fusion subnet, and let $j = \arg\max_{1 \le k \le K} O_k$ be the system decision on the class of the input pattern. The environment evaluates the pair $(p, j)$ and produces a reinforcement signal $R = 1$ (reward) if the system decision is correct and $R = -1$ (penalty) otherwise. If reward occurs, we want to change the weights to make $O$ closer to a target vector $T$, similar to a target vector in supervised learning. If penalty occurs, then by the choice of a target vector $T$ we can push the search of the weights in any direction, and we cannot choose the appropriate direction with certainty. In our system we have chosen to make $O$ closer to the complement of a supervised-type target vector and introduce a learning rule representing an adaptation of the complementary reinforcement learning algorithm [13]. The introduced learning rule comprises two parts: a rule for updating the weights of the fusion subnet and a rule for updating the weights of the subnets modeling the agents' decisions (a short sketch of both rules follows the list):

1. Rule for updating the weights of the fusion subnet:
   a. Generate a target vector $T = (t_1, \ldots, t_K)$:
      If $R = 1$: $t_k = 1$ if $k = j$, and $t_k = 0$ otherwise.
      If $R = -1$: $t_k = 1$ if $O_k \le \sigma O_j$, where $\sigma \in [0,1]$ is a uniform random variable, and $t_k = 0$ otherwise.
   b. Back-propagate the error to update the weights connecting the hidden layer and the output layer of the fusion subnet.

2. Rule for updating the weights of agent $i$:
   a. Find $c = \arg\max_{1 \le k \le K} \mathrm{BetP}_k^i$ and generate a target vector $T^i = (t_1^i, \ldots, t_K^i)$:
      If $R = 1$: $t_k = 1$ if $k = c$, and $t_k = 0$ otherwise.
      If $R = -1$: $t_k = 1$ if $O_k \le \sigma O_j$, where $\sigma \in [0,1]$ is a uniform random variable, and $t_k = 0$ otherwise.
   b. If $R = 1$:
      $W_c^i = W_c^i + \rho (X^i - W_c^i)$, if $c = j$,
      $W_c^i = W_c^i - \rho (X^i - W_c^i)$, if $c \ne j$,
      $W_k^i = W_k^i$, if $k \ne c$, where $\rho$ is a constant.
   c. If $R = -1$, then compute $z_k^i$ as in (5) and $\rho_1 = \rho \sigma$, where $\sigma \in [0,1]$:
      $W_m^i = W_m^i + \rho_1 (X^i - W_m^i)$, if $t_m^i \cdot z_m^i = 1$,
      $W_m^i = W_m^i - \rho_1 (X^i - W_m^i)$, if $(z_m^i = 1)$ and $(t_m^i = 0)$,
      $W_m^i = W_m^i$, if $z_m^i = 0$.
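An illustrative sketch of these two rules follows. Function names, array shapes, and the value of the learning rate $\rho$ are assumptions; the back-propagation step of rule 1b is standard and therefore omitted, and the agent update operates on the agent's prototype matrix $W^i$ (one row per class).

```python
import numpy as np

rng = np.random.default_rng(0)

def complementary_target(O, j, R):
    """Target vector of rule 1a (and, with O and j replaced accordingly, rule 2a).

    Reward (R = +1): one-hot target at the decided class j.
    Penalty (R = -1): complementary target, t_k = 1 where O_k <= sigma * O_j
    for a uniform random sigma in [0, 1], and t_k = 0 otherwise.
    """
    O = np.asarray(O, dtype=float)
    t = np.zeros_like(O)
    if R == 1:
        t[j] = 1.0
    else:
        sigma = rng.uniform()
        t[O <= sigma * O[j]] = 1.0
    return t

def update_agent_prototypes(W, x, betp, z, t, j, R, rho=0.05):
    """Prototype update of rules 2b and 2c for one agent.

    W:    (K, N_i) prototype matrix, one row W_k^i per class.
    x:    (N_i,) observation X^i of the agent.
    betp: the agent's pignistic outputs; z: their binarization from eq. (5);
    t:    the agent's target vector from rule 2a; j: system decision; R: +/-1.
    """
    W = np.asarray(W, dtype=float).copy()
    x = np.asarray(x, dtype=float)
    c = int(np.argmax(betp))
    if R == 1:
        # Rule 2b: move the winning prototype toward x when the agent agreed with
        # the (correct) system decision, away from x otherwise.
        if c == j:
            W[c] = W[c] + rho * (x - W[c])
        else:
            W[c] = W[c] - rho * (x - W[c])
    else:
        # Rule 2c: complementary update with a randomly shrunk step rho_1 = rho * sigma.
        rho1 = rho * rng.uniform()
        for m in range(W.shape[0]):
            if z[m] == 1 and t[m] == 1:
                W[m] = W[m] + rho1 * (x - W[m])
            elif z[m] == 1 and t[m] == 0:
                W[m] = W[m] - rho1 * (x - W[m])
            # z[m] == 0: prototype left unchanged
    return W
```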

[Figure 3. Agent's architecture: inputs $X_1^i, \ldots, X_{N_i}^i$; hidden layer HL1 computes $\Phi_1^i, \ldots, \Phi_K^i$ with weight vectors $W_1^i, \ldots, W_K^i$; HL2 holds the simple support functions $m_k^i(\theta_k)$ and $m_k^i(\Theta)$; HL3 holds the combined masses $m^i(\theta_k)$ and $m^i(\Theta)$; HL4 outputs the pignistic probabilities $\mathrm{BetP}^i(\theta_1), \ldots, \mathrm{BetP}^i(\theta_K)$.]

5. Experiments and results

The approach described in the paper is problem independent and does not impose any restrictions on the kind of features or information used by each agent. In order to evaluate the performance of the neural network, we used a proprietary database containing a set of 486 images of size 256×256. The images were obtained from the VisTex image database from the MIT Media Lab. They contained 4 classes representing 4 states of the environment: metal (144 patterns), sand (96 patterns), vegetation (133 patterns), and water (113 patterns).

We have conducted experiments with several agents that have different expertise and extract different features from the images: one texture agent and several color agents. The texture agent extracts texture features represented by a vector of fractal dimensions estimated with alternating sequential filters [14]. Each color agent extracts different color features. The color features are represented by the red, green, and blue channels converted to HSV color space. The spectral components are viewed separately and considered to be three independent feature sets. We have conducted two types of experiments. The first series of experiments corresponds to the situation when the agents do not communicate with the environment and are trained as unsupervised neural networks with the Kohonen learning rule as the weight-updating procedure. The second series of experiments was conducted with the reinforcement neural network introduced in this paper. We considered two different proximity measures as the transfer functions for computing the activations of HL1 in the neural networks modeling the individual agents:

1. $\Phi(X^i, W_k^i) = \gamma_k^i \left( \dfrac{(X^i, W_k^i)}{\|X^i\| \cdot \|W_k^i\|} \right)^2$,   (6)

where $0 < \gamma_k^i$ and $W_k^i$ and $X^i$ are as in (1).

2. $\Phi(X^i, W_k^i) = \alpha^i \exp(-\gamma_k^i (d_k^i)^2)$, where $0 < \gamma_k^i$ and $0 <$