
2014 Joint IEEE International Conferences on Development and Learning and Epigenetic Robotics (ICDL-Epirob) October 13-16, 2014. Palazzo Ducale, Genoa, Italy

Extending Cortical-Basal Inspired Reinforcement Learning Model with Success-Failure Experience

Shoubhik Debnath
Institute for Cognitive Systems, Technical University Munich, 80333 Munich, Germany
Email: [email protected]

John Nassour
Artificial Intelligence, Computer Science, Chemnitz University of Technology, 09107 Chemnitz, Germany
Email: [email protected]

Abstract—Neurocognitive studies have shown that neurons of the orbitofrontal cortex are activated by the expectation of immediate reward, making this region a key reward structure in the brain. It has also been shown that neurons in the anterior cingulate cortex work as an early warning system that prevents the repetition of mistakes. This paper introduces an extended model of reinforcement learning in the cortex-basal ganglia network based on the hypothetical involvement of two cortical regions, the orbitofrontal cortex and the anterior cingulate cortex. To demonstrate the effectiveness of the approach, we propose an enhanced actor-critic method that is guided by experiences of success and failure. Failures help the agent to explore regions while avoiding past mistakes; successful experiences allow the agent to exploit regions that guarantee reaching its goal. First, the method was applied to a 2-D grid problem, where an agent had to reach its goal while avoiding obstacles in its path. Second, the proposed RL model was used by the NAO humanoid robot to optimize its policy for learning to play bowling. The results show a significant improvement with the enhanced actor-critic method, both in terms of performance and rate of learning, compared with the standard actor-critic method.

I. INTRODUCTION

Neurocognitive studies have shown that the basal ganglia are the center of reinforcement learning in the human brain [1] [2] [3]. In this paper, we propose an extended model for reinforcement learning, inspired by the cortex-basal ganglia network, by incorporating the role of two distinct cortical regions, namely the orbitofrontal cortex (OFC) and the anterior cingulate cortex (ACC) (Fig. 1). The OFC is related to reward processing in the brain and encodes reward features into a scalar value [4]. Studies have also shown that the neurons of this cortical region are activated by the expectation of immediate reward [5]. The ACC is activated during situations involving high-risk decisions and also after making mistakes [6]. It works as an early warning system that adjusts behaviour to avoid negative consequences [7]. The ACC is involved not only in external error detection, but also in internal error prediction [8]. From our reinforcement learning model, we propose an enhanced actor-critic method that takes into account the experiences of success and failure before taking an action on a particular state. The success-failure experience helps the agent to decide, on a particular state, whether it wants to exploit its past actions or explore new actions.

The proposed method was then applied to two different problems. The first one is a 2-D grid problem where an agent had to reach a particular destination starting from a random position in the 2-D space. In the second experiment, the humanoid robot NAO learned to play bowling in a simulated environment.

Earlier work by Barto [9] mapped the actor-critic implementation of temporal difference (TD) learning onto the basal ganglia. The model presented by Gurney et al. [10] focused on action selection in the basal ganglia. A detailed review of existing actor-critic models by Joel et al. [11] showed that there is almost no compatibility between the computational perspective of the actor-critic model and the anatomical data. A comparatively better hypothetical model of reinforcement learning in the cortex-basal ganglia network was proposed by Doya [12], where the process of decision making was decomposed into four steps. Overall, most of the earlier works mainly focused on the computational aspect of learning. Our work, in contrast, carefully correlates the anatomical model of reinforcement learning in the cortex-basal ganglia network with the computational perspective, using the notion of success-failure experience in the enhanced actor-critic method.

The rest of the paper is structured as follows: Section II describes the proposed model of reinforcement learning in the cortex-basal ganglia network involving the two cortical regions, ACC and OFC. Section III introduces the existing actor-critic method. Section IV proposes an enhanced actor-critic method for reinforcement learning with success-failure experiences. Section V compares the results of the existing actor-critic method and the enhanced actor-critic method on two different problems. Section VI discusses the results, and Section VII concludes the paper.

II. PROPOSED BRAIN-INSPIRED REINFORCEMENT LEARNING MODEL

In this paper, an extended model of reinforcement learning in the cortex-basal ganglia network is developed. This model is an extension of the work of Doya [12], where the information flow related to decision making involves four steps. The first step is linked to the sensory cortex, which identifies the present state. In step two, the striatum calculates the expected future reward that an action would bring when taken on a particular state. Step three involves the pallidum, which selects an action from the possible options. Step four evaluates the action after execution, and an outcome is obtained with the help of the substantia nigra. However, our extended model is more detailed and takes into account the different roles played by the ventral striatum (VS) and the dorsal striatum (DS). Furthermore, the proposed model considers the involvement of the two distinct cortical regions, ACC and OFC, in reinforcement learning through their connectivity with the different striatal regions (Fig. 1).

Neuroanatomical studies of the cortical and subcortical loops involved in reward-based learning [13] [14] [15] [16] showed that the cortical coding of striatal inputs can be mainly categorized into two parts. The first part involves the OFC, which targets the ventral striatum and is involved in reward prediction. The second part involves the ACC, which targets the dorsal striatum and is responsible for finding the optimal policy. Based on these studies, our model highlights three important connections: the connection between OFC and VS, the connection between ACC and DS, and the connection between VS and DS. The OFC is involved in coding reward information: it becomes responsive to the outcomes of the performed actions and generates reward information that flows down to the VS. Studies showed that there exists a ventral-to-dorsal gradient within the connections of the striatum [17], which suggests an information flow from VS to DS. On the other side, activation of the ACC takes place when a human tries to take a risky action and also after taking an action that has resulted in a failure [8]; therefore it is involved in success-failure coding. The ACC, along with the DS, helps in carefully choosing an action depending on the current state and the past experience of success and failure.

To summarize the information flow in our model, the input from the OFC containing reward-related information flows to the VS. The VS is then linked to the dorsal region, which involves the DS and the ACC and is important for action selection and action control. The proposed model considers the roles played by the VS and the DS inside the striatum. It also incorporates the two important cortical regions, ACC and OFC, in the cortex-basal ganglia network that help in reward-based learning. The next section presents the existing biologically inspired actor-critic method for reinforcement learning.

FIG. 1: The proposed RL model. (a) A cross-sectional view of the brain showing the hypothetical connections between the various brain regions in the cortex-basal ganglia network. In our proposed model, we suggest including the role played by the ACC and OFC and their links with the DS and VS during reward-based learning. (b) A functional diagram of our proposed reinforcement learning model based on cortex-basal ganglia connectivity.

III. ACTOR-CRITIC METHOD

The goal of reinforcement learning is to improve the agent's (which can be an animal, a human, or an artificial system such as a computer program or a robot [18], [19]) action probability on a particular state, commonly known as its policy, in order to maximize the expected cumulative future reward. A reinforcement learning method called actor-critic was proposed by Barto and Sutton [20]. As the name suggests, the method has two components: the Actor, which takes actions based on the policy, and the Critic, which predicts the expected future reward and helps in improving the actor's policy. Equations 1 to 4 show the four steps involved in the actor-critic method. Step one calculates the policy of the agent, p(a|s), which is the probability of mapping from state s to action a. Step two calculates the state-value function, V(s), which is a measure of the expected future reward obtained by following the current policy from a given state s. The temporal discount factor γ is a measure of how far into the future the agent is concerned, and r(t) is the reward given at time step t. Step three calculates the temporal-difference error, δ(t), which is the deviation of the actual value from the expected value. Step four updates the state-action probability, which later results in an updated policy.

p(Action(t) = a | State(t) = s) = e^p(s,a) / Σ_b e^p(s,b)    (1)

V(s) = E[r(t) + γ r(t+1) + γ² r(t+2) + ... | State(t) = s]    (2)

δ(t) = r(t) + γ V(State(t+1)) − V(State(t))    (3)

p(State(t), Action(t)) = p(State(t), Action(t)) + β δ(t)    (4)
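For concreteness, the four steps of Eqs. (1)-(4) can be written as a small tabular learner. The sketch below is ours, not the authors' code: the discrete state/action indexing, the incremental critic update with a learning rate α (the paper only specifies β and γ), and all names are assumptions.

```python
import numpy as np

class ActorCritic:
    """Tabular actor-critic following Eqs. (1)-(4): a softmax policy over
    action preferences p(s, a), a state-value table V(s), the TD error
    delta(t), and the preference update with step size beta."""

    def __init__(self, n_states, n_actions, gamma=0.9, alpha=0.1, beta=0.1):
        self.p = np.zeros((n_states, n_actions))   # action preferences p(s, a)
        self.V = np.zeros(n_states)                # state-value estimates V(s)
        self.gamma, self.alpha, self.beta = gamma, alpha, beta

    def policy(self, s):
        # Eq. (1): softmax of the preferences gives p(a | s)
        e = np.exp(self.p[s] - self.p[s].max())
        return e / e.sum()

    def act(self, s):
        # Sample an action from the current policy
        return int(np.random.choice(len(self.p[s]), p=self.policy(s)))

    def update(self, s, a, r, s_next):
        # Eq. (3): TD error, using V as the running estimate of Eq. (2)
        delta = r + self.gamma * self.V[s_next] - self.V[s]
        self.V[s] += self.alpha * delta            # critic moves V(s) towards Eq. (2)
        self.p[s, a] += self.beta * delta          # Eq. (4): actor preference update
        return delta
```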

To correlate with the anatomical perspective, steps 2, 3 and 4 correspond to the roles played by the VS, the substantia nigra, and the DS, respectively, in the cortex-basal ganglia network. These four steps, together with the success-failure experience, are incorporated into our proposed enhanced actor-critic method, which is described in the next section.

IV. ENHANCED ACTOR-CRITIC METHOD

We propose an enhanced actor-critic method that considers the concept of success-failure experience, which was presented in the work of Nassour et al. [21], [22]. For an agent trying to learn the optimal policy using reinforcement learning, it is not only important to explore the environment by taking actions and receiving rewards, but also to keep note of the actions taken at a given state that led to a successful or failed exploration. The advantage of success-failure coding, which is motivated by the role of the ACC, is the following: when the agent visits the same state again, and if the agent has an experience of a success (lines 4-6 in Algorithm 1) or a failure (lines 7-9) on this state, then the agent may prefer exploitation, by taking or avoiding the past actions respectively, rather than exploration of new actions. The agent explores the states and the actions that can be taken on those states by following the four steps of the actor-critic method (lines 11-14), which are calculated using Equations 1 to 4. At each chance during a trial, the agent checks whether it has reached the goal. On reaching the goal (lines 15-20), the agent adds all the states and actions taken in that trial to the Success Experience. If, at the end of all the chances in a trial, the agent fails to reach the goal (lines 23-29), then it compares the reward (which corresponds to the immediate reward generated in the adaptive reward coding phase) received at each chance of the present trial with the reward received at the same chance of the previous trial. If the immediate reward received by taking an action action(c) on a state state(c) during chance c of trial t is less than that during chance c of the previous trial t − 1, i.e. r(c|t) < r(c|t − 1), then the action(c) taken on state(c) might not be the correct one, and hence state(c) and the corresponding action(c) taken on that state are added to the Failure Experience.

Data: States, Actions, Trials T, Chances per trial C, Goal
Result: Reach the goal and find the optimal policy
 1  initialize policy to random;
 2  for t = 1 to T do
 3      for c = 1 to C do
 4          if state(c) exists in Success Experience then
 5              take action(c) on state(c) according to Success Experience;
 6          end
 7          else if state(c) exists in Failure Experience then
 8              avoid action(c) on state(c) according to Failure Experience;
 9          end
10          else
11              calculate policy, probability(action(c) = a | state(c) = s);
12              calculate V(s) and take action given the policy and state(c) = s;
13              calculate temporal-difference (TD) error;
14              update probability(state(c), action(c));
15              if goal is reached then
16                  for chance = 1 to c do
17                      update Success Experience with state(chance) and the corresponding action(chance) taken on state(chance);
18                  end
19                  break;
20              end
21          end
22      end
23      if goal is not reached then
24          for c = 1 to C do
25              if r(c|t) < r(c|t − 1) then
26                  update Failure Experience with state(c) and the corresponding action(c) taken on state(c);
27              end
28          end
29      end
30  end

Algorithm 1: Pseudo-code for the Enhanced Actor-Critic method. The input consists of the states, the actions, the total number of trials for the experiment, and the number of chances per trial. The objective of the algorithm is to find an optimized policy to reach the goal. The algorithm consists of two phases: the exploration phase, which involves the four steps of the actor-critic method, and the exploitation phase, where the notion of success-failure experience is used to take or avoid an action on a particular state. If the agent succeeds in reaching the goal, all the states and the actions taken on those states are added to the Success Experience. If, at the end of all the chances in a trial, the agent fails to reach the goal, the algorithm uses the immediate reward values to decide whether to add a state and the action taken on that state to the Failure Experience.
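To illustrate how Algorithm 1 could be realized in code, the sketch below wraps a tabular actor-critic learner (such as the one sketched after Eq. (4)) with Success and Failure Experience tables. The environment interface (env.reset(), env.step() returning the next state, the immediate reward, and a goal flag) and all other names are hypothetical; this is a sketch of the idea under those assumptions, not the authors' implementation.

```python
import numpy as np

def enhanced_actor_critic(env, agent, trials, chances):
    """Sketch of Algorithm 1: exploit the Success/Failure Experience when a
    state has been seen before; otherwise explore with the actor-critic."""
    success_exp = {}                  # state -> action that once led to the goal
    failure_exp = {}                  # state -> set of actions labelled as failed
    prev_rewards = [None] * chances   # r(c | t-1), per chance of the previous trial

    for _ in range(trials):
        s = env.reset()
        path, rewards, reached = [], [None] * chances, False
        for c in range(chances):
            explored = False
            if s in success_exp:                       # lines 4-6: follow the success path
                a = success_exp[s]
            elif s in failure_exp:                     # lines 7-9: avoid failed actions
                probs = agent.policy(s).copy()         # (re-sampling the policy with the
                for bad in failure_exp[s]:             #  failed actions masked out is our
                    probs[bad] = 0.0                   #  own reading of "avoid")
                if probs.sum() == 0:
                    probs = np.ones(len(probs))
                a = int(np.random.choice(len(probs), p=probs / probs.sum()))
            else:                                      # lines 11-14: explore with Eqs. (1)-(4)
                a = agent.act(s)
                explored = True
            s_next, r, reached = env.step(a)           # hypothetical environment interface
            if explored:
                agent.update(s, a, r, s_next)          # TD error and preference update
            path.append((s, a))
            rewards[c] = r
            s = s_next
            if reached:                                # lines 15-20: label the success path
                for ps, pa in path:
                    success_exp[ps] = pa
                break
        if not reached:                                # lines 23-29: label failed state-actions
            for c, (ps, pa) in enumerate(path):
                if prev_rewards[c] is not None and rewards[c] < prev_rewards[c]:
                    failure_exp.setdefault(ps, set()).add(pa)
        prev_rewards = rewards
```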


To summarize, if the trial is successful, the proposed algorithm saves the state-action pairs for all the chances during the trial and labels them as a success path. During the following trials, if the agent lands on a state that exists in the success path, the action is selected according to the success path, which guarantees a successful trial. However, if the trial fails, the proposed algorithm goes through each state that the agent visited during the failed trial along with the action taken on that state, and uses the notion of immediate reward to label an action (say a) on a state (say s) as failed if the condition r(c|t) < r(c|t − 1) is satisfied; action a will then be avoided when the agent visits state s during further trials.

V. EXPERIMENTS AND RESULTS

In order to show the effectiveness of our proposed model, both the actor-critic (AC) and the enhanced actor-critic (EAC) methods were applied to two different problems. The first one was a 2-D grid world problem where an agent had to reach a goal with obstacles in its path. The second problem was to make the humanoid robot NAO learn how to play bowling in a simulated environment (Webots simulator, http://www.cyberbotics.com).

A. 2-D Grid World with Obstacles

FIG. 2: Paths taken by the agent using actor-critic and enhanced actor-critic to reach the goal. The goal is marked in red and the obstacles in orange. The arrows show the action taken by the agent on each state. (a) and (c) show instances of the agent trying to reach the goal using the actor-critic method, whereas (b) and (d) show instances of the agent trying to reach the goal using the enhanced actor-critic method. In both scenarios, the enhanced actor-critic method leads to a more optimized policy for reaching the goal compared with the actor-critic method.

This is one of the standard problems to which reinforcement learning algorithms are applied. As the name suggests, the 2-D grid world is a 2-D space. The agent starts from a random position, usually given as an (x, y) point in the 2-D space, and has to reach a particular location in that space. In our case, shown in Figure 2, the agent starts from any random position in the 2-D space and has to reach the location marked in red. The position in red (say A) has the maximum reward, set to 1. The reward at the obstacles is -1. For any other point (say B) in the space, the reward is set to 0. To make it harder for the agent to reach the goal, there are two obstacles in the 2-D space. For the experiment, a state was defined as a position in the 2-D space, and an action was defined as one of the four directions (right, left, up, down) that the agent can take in each state.

During the learning process, the agent receives a reward of 0 at any point in the space until it reaches the center, where it receives a positive reward of 1. The agent also saves the location of the center when it is reached for the first time. It then uses the location of the center and the maximum possible reward, i.e. 1, to calculate the immediate reward at any point in space. The value of the immediate reward (between 0 and the maximum reward) is inversely proportional to the Euclidean distance between a point in space (B) and the center (A). However, the notion of immediate reward only comes into play when deciding whether an action taken on a particular state was a failed one.

For both methods, 200 experiment trials were carried out. Each time, the agent started from a random position and the goal was to reach the center of the grid. Figure 2 shows the paths taken by the agent during the test trial at the end of the learning process using the actor-critic and enhanced actor-critic methods. The enhanced actor-critic method (Fig. 2(b) and 2(d)) had optimized the policy to reach the goal when compared with the actor-critic method (Fig. 2(a) and 2(c)). In Figure 3, the trial number is shown along the x-axis, whereas the y-axis gives the number of successful trials up to a given trial number. In this experimental setup, the agent reached the goal 56 times out of 200 trials with the AC method and 163 times out of 200 trials with the EAC method. The plot also reflects the rate of learning of the agent. AC learns very slowly when compared with the EAC method. With an increasing number of trials, the learning rate of AC improves, as shown after trial number 180 in the blue line, whereas in the case of EAC, the rate of learning is steady and high from the initial trials. So, the steeper the curve, the faster the learning of the agent. We found that the EAC method has two major advantages compared with AC: it has better performance and lower latency than the classical AC, which does not employ the success-failure experience.
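The reward structure described above can be sketched as follows. The grid size, the obstacle positions, and the exact scaling of the immediate reward are our assumptions; the paper only states that the immediate reward lies between 0 and the maximum reward and is inversely proportional to the Euclidean distance to the goal.

```python
import math

GRID_SIZE = 11                           # assumed; the paper does not give the grid size
GOAL = (GRID_SIZE // 2, GRID_SIZE // 2)  # goal at the center of the grid, reward 1
OBSTACLES = {(3, 5), (7, 4)}             # two obstacles, reward -1 (positions assumed)

def environment_reward(pos):
    """Reward actually delivered to the agent: 1 at the goal, -1 on an
    obstacle, and 0 everywhere else."""
    if pos == GOAL:
        return 1.0
    if pos in OBSTACLES:
        return -1.0
    return 0.0

def immediate_reward(pos, goal=GOAL, r_max=1.0):
    """Internal immediate reward used only to label failed actions, once the
    goal location is known: between 0 and r_max and inversely proportional
    to the Euclidean distance between pos and the goal."""
    d = math.dist(pos, goal)
    return r_max / (1.0 + d)             # assumed scaling; any monotone inverse would do
```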


FIG. 3: The rate of learning of the AC and EAC methods. The slope of the curve for the AC method is less than that of the EAC method, which suggests that the EAC method learns faster than the AC method.

B. NAO humanoid robot learns to bowl

The simulated environment for this task contains ten pins and a ball, which NAO uses to knock down the pins. Ten trials were performed during the experiment. In each trial, the robot had two chances to knock down the pins. In Fig. 4, Play 1 and Play 2 show instances of NAO playing during chance 1 and chance 2 of a trial, respectively. So, NAO can knock down a minimum of zero and a maximum of ten pins at the end of each trial. States were defined by the pins that were still standing and ready to be hit. Actions were defined by NAO's shoulder joint angles; different values of the joint angle make NAO hit different regions of the pins, which can be left, right, or center, and it may even target a no-pin zone. Negative values of NAO's shoulder joint angle target pins on the right, whereas positive values target pins on the left. The reward was given to the robot in terms of the number of pins knocked down after each chance.

In the implementation of the EAC method for the bowling task, a tree data structure was used. Each level of the tree contains nodes storing the value of the action angle taken during that chance of a trial: nodes at level 1 represent actions taken during chance 1, and nodes at level 2 represent actions taken during chance 2. Edges connecting two nodes carry the reward obtained. The goal of this task was to find the longest (highest-reward) path between the root node and a node at level 2. The tree at the end of the experiment might not give the optimal policy, but it gives the best possible policy given the limited number of trials.

FIG. 4: Simulated environment for the bowling task showing different situations during the experiment. Play 1 (left) shows the instance before the ball hits the pins during chance 1 of a trial. Play 1 (middle) shows an instance after some of the pins were knocked down during chance 1 of the trial. Play 2 (right) shows the situation at the start of chance 2 of a trial.

TABLE I: Reward received after each chance during the bowling task using both the EAC and AC methods. Using the EAC method, NAO was able to receive more reward.

Trial   EAC Chance 1   EAC Chance 2   AC Chance 1   AC Chance 2
  1          3              5              5             2
  2          5              3              2             4
  3          2              4              4             4
  4          0              6              3             2
  5          3              5              4             5
  6          7              2              7             0
  7          5              4              6             1
  8          5              3              3             0
  9          4              5              1             6
 10          6              4              5             3
Total           81 / 100                     67 / 100

Table I shows the reward obtained after each chance of a trial during the experiment. At the end of all the trials, NAO received a cumulative reward of 81 out of the maximum possible 100 using the EAC method, whereas it received a cumulative reward of 67 using the AC method. Hence, the EAC method performs better than the AC method in terms of the cumulative reward gained at the end of the experiment. Moreover, the learning process becomes clearly visible during the course of the experiment, as shown in Figure 5. Figure 5(a) shows the variation of NAO's shoulder joint angle during chance 1 of each trial. Initially, both the EAC and AC methods explore the state and action space by targeting both the left and the right regions of the pins. However, in the case of the EAC method, after 5 trials NAO learned a policy according to which it took actions with negative shoulder joint values, which suggests that it targeted the pins on the right during chance 1 of a trial. Similarly, during chance 2, after 5 trials the robot was able to learn a policy according to which it took actions with positive shoulder joint values, which suggests that it targeted the pins on the left during chance 2 of a trial (Figure 5(b)). On the contrary, the AC method was not able to figure out a well-defined policy for targeting the pins during either chance.

To summarize, using the EAC method, NAO figured out a policy of targeting the pins in the right region during the first chance, followed by targeting the pins in the left region during the second chance. The improved rewards in the later trials of Table I are a consequence of this learned policy. So, even in this experiment, the overall performance was better using the EAC method. Moreover, NAO was successful in finding a reasonable policy given the limited number of trials, unlike with the AC method. This suggests that the EAC method has a faster rate of learning than the AC method.
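The chance-level tree described above can be sketched as a small data structure. The node/edge layout follows the description in the text; the example angles are illustrative values only (negative for the right region, positive for the left), and the two example rewards are taken from Table I.

```python
class ChanceNode:
    """Node of the chance tree: stores the shoulder-joint angle chosen at this
    chance; edges to children carry the reward (pins knocked down)."""
    def __init__(self, angle=None):
        self.angle = angle
        self.children = []                 # list of (reward, ChanceNode)

    def add(self, reward, angle):
        child = ChanceNode(angle)
        self.children.append((reward, child))
        return child

def best_policy(root):
    """Highest-reward path from the root to a level-2 node:
    returns (total reward, [chance-1 angle, chance-2 angle])."""
    best = (float("-inf"), [])
    for r1, n1 in root.children:           # level 1: chance-1 actions
        for r2, n2 in n1.children:         # level 2: chance-2 actions
            if r1 + r2 > best[0]:
                best = (r1 + r2, [n1.angle, n2.angle])
    return best

# Two trials as an example (rewards from Table I, angles illustrative):
root = ChanceNode()
n = root.add(reward=7, angle=-0.04); n.add(reward=2, angle=0.03)
n = root.add(reward=6, angle=-0.05); n.add(reward=4, angle=0.04)
print(best_policy(root))                   # -> (10, [-0.05, 0.04])
```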

FIG. 5: The variation of NAO's shoulder joint angle during chance 1 (a) and chance 2 (b) of a trial with the classical actor-critic and the enhanced actor-critic. Using the EAC method, NAO's shoulder joint angle converges to a negative value in (a) and to a positive value in (b). Both plots show that the EAC method is able to figure out a policy: targeting pins on the right during chance 1 and then targeting pins on the left during chance 2. With the AC method, NAO was not able to figure out a well-defined policy.

VI. DISCUSSION

The main difference between the AC and EAC methods is the state-action pairs that are saved in relation to each successful or failed trial in the case of the EAC method. Although the EAC method was able to outperform the AC method in both experiments, much depends on the way the agent explores the state-action space to capture successful or failed trials. The EAC method is more likely to perform better when the agent explores a wider domain of the state-action space, because this gives the agent more information to exploit efficiently by avoiding the failed actions and following the successful ones.

VII. CONCLUSION

This paper proposes an extended reinforcement learning model inspired by the cortex-basal ganglia network. It considers the roles played by two cortical regions in reward encoding and in error monitoring, the orbitofrontal cortex and the anterior cingulate cortex, and their projections into the dorsal and the ventral striatum. This provides an elaborated explanation of the information flow in reward-based learning in the cortex-basal ganglia network. From our model, we proposed an enhanced actor-critic method for reinforcement learning with success-failure experiences. The notion of success gives the agent a guaranteed path to reach the goal, whereas experiences of failure make the agent more alert to avoid repeating past mistakes. To show the effectiveness of the proposed extension, the method was applied to two different problems: the 2-D grid problem with obstacles and the NAO humanoid robot learning to bowl. The performance and rate of learning were found to be better compared with the classical actor-critic. In conclusion, the introduction of the notion of success-failure experiences into cortex-basal-ganglia-inspired reinforcement learning enhanced the learning process in terms of efficiency and latency.

REFERENCES

[1] W. Schultz, P. Dayan, and P. R. Montague, “A neural substrate of prediction and reward,” Science, vol. 275, no. 5306, pp. 1593–1599, 1997.
[2] J. R. Hollerman and W. Schultz, “Dopamine neurons report an error in the temporal prediction of reward during learning,” Nature Neuroscience, vol. 1, pp. 304–309, 1998.
[3] H. M. Bayer and P. W. Glimcher, “Midbrain dopamine neurons encode a quantitative reward prediction error signal,” Neuron, vol. 47, no. 1, pp. 129–141, Jul. 2005.
[4] S. J. Thorpe, E. T. Rolls, and S. Maddison, “The orbitofrontal cortex: Neuronal activity in the behaving monkey,” Experimental Brain Research, vol. 49, pp. 93–115, 1983.
[5] C. A. Winstanley, D. E. H. Theobald, R. N. Cardinal, and T. W. Robbins, “Contrasting roles of basolateral amygdala and orbitofrontal cortex in impulsive choice,” Journal of Neuroscience, vol. 24, no. 20, pp. 4718–4722, May 2004.
[6] M. Cohen, A. Heller, and C. Ranganath, “Functional connectivity with anterior cingulate and orbitofrontal cortices during decision-making,” Cognitive Brain Research, vol. 23, no. 1, pp. 61–70, 2005.
[7] J. W. Brown and T. S. Braver, “Risk prediction and aversion by anterior cingulate cortex,” Cognitive, Affective, & Behavioral Neuroscience, vol. 7, no. 4, pp. 266–277, 2007.
[8] ——, “Learned predictions of error likelihood in the anterior cingulate cortex,” Science, vol. 307, pp. 1118–1121, 2005.
[9] A. G. Barto, Adaptive Critics and the Basal Ganglia, J. C. Houk, J. L. Davis, and D. G. Beiser, Eds. Cambridge, MA: MIT Press, 1995.
[10] K. Gurney, T. J. Prescott, and P. Redgrave, “A computational model of action selection in the basal ganglia. II. Analysis and simulation of behaviour,” Biological Cybernetics, vol. 84, no. 6, pp. 411–423, Jun. 2001.
[11] D. Joel, Y. Niv, and E. Ruppin, “Actor-critic models of the basal ganglia: new anatomical and computational perspectives,” Neural Networks, vol. 15, pp. 535–547, 2002.
[12] K. Doya, “Modulators of decision making,” Nature Neuroscience, vol. 11, pp. 410–416, Mar. 2008.
[13] D. Hämmerer and B. Eppinger, “Dopaminergic and prefrontal contributions to reward-based learning and outcome monitoring during child development and aging,” Developmental Psychology, vol. 48, no. 3, pp. 862–874, May 2012.
[14] F. A. Middleton and P. L. Strick, “Basal ganglia and cerebellar loops: motor and cognitive circuits,” Brain Research Reviews, vol. 31, pp. 236–250, 2000.
[15] B. P. Kolomiets, J. M. Deniau, J. Glowinski, and A. M. Thierry, “Basal ganglia and processing of cortical information: functional interactions between trans-striatal and trans-subthalamic circuits in the substantia nigra pars reticulata,” Neuroscience, vol. 117, pp. 931–938, 2003.
[16] J. G. McHaffie, T. R. Stanford, B. E. Stein, V. Coizet, and P. Redgrave, “Subcortical loops through the basal ganglia,” Trends in Neurosciences, vol. 28, no. 8, pp. 401–407, 2005.
[17] E. Lynd-Balta and S. N. Haber, “The organization of midbrain projections to the ventral striatum in the primate,” Neuroscience, vol. 59, no. 3, pp. 609–623, 1994.
[18] K. Doya, “Reinforcement learning: Computational theory and biological mechanisms,” HFSP Journal, vol. 1, no. 1, pp. 30–40, 2007.
[19] C. Balkenius and S. Winberg, “Fast learning in an actor-critic architecture with reward and punishment,” in Tenth Scandinavian Conference on Artificial Intelligence, Stockholm, vol. 173, pp. 20–27, 2008.
[20] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[21] J. Nassour, V. Hugel, F. B. Ouezdou, and G. Cheng, “Qualitative adaptive reward learning with success failure maps: Applied to humanoid robot walking,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 1, pp. 81–93, 2013.
[22] J. Nassour, “Success-failure learning for humanoid: study on bipedal walking,” Ph.D. dissertation, Technische Universität München, München, Germany, 2014.
