
On the Emergence of Rules in Neural Networks

Stephen Jose Hanson and Michiro Negishi
Psychology Department, Rutgers University, 101 Warren St., Smith Hall 301, Newark, NJ 07102, U.S.A.

A simple associationist neural network learns to factor abstract rules (i.e. grammars) out of sequences of arbitrary input symbols by inventing abstract representations that accommodate unseen symbol sets as well as unseen but similar grammars. The network is shown to transfer grammatical knowledge both to new symbol vocabularies and to new grammars. Analysis of the state space shows that the network learns generalized abstract structures of the input and is not simply memorizing the input strings. These representations are context sensitive and hierarchical, and are based on the state variable of the finite state machines that the neural network has learned. Generalization to new symbol sets or grammars arises from the spatial nature of the internal representations used by the network, which allows new symbol sets to be encoded close to already-learned symbol sets in the network's hidden unit space. The results run counter to arguments that learning algorithms based on weight adaptation after each exemplar presentation (such as long-term potentiation found in the mammalian nervous system) cannot in principle extract symbolic knowledge from positive examples as prescribed by prevailing human linguistic theory and evolutionary psychology.

1. Introduction

A basic puzzle in the cognitive neurosciences is how simple associationist learning at the synaptic level of the brain can be used to construct known properties of cognition which appear to require abstract reference, variable binding, and symbols. The ability of humans to parse sentences and to abstract knowledge from specific examples appears to be inconsistent with local associationist algorithms for knowledge representation (Pinker 1994, Fodor and Pylyshyn 1988, but see Hanson and Burr 1990). Part of the puzzle is how neuron-like elements could, from simple signal processing properties, emulate symbol-like behavior. Symbols and symbol systems have been defined by many different authors; Harnad's version (Harnad 1990) is a useful summary of many such properties:


(1) a set of arbitrary physical tokens (scratches on paper, holes on a tape, events in a digital computer, etc.) that are (2) manipulated on the basis of explicit rules that are (3) likewise physical tokens and strings of tokens. The rule-governed symbol-token manipulation is based (4) purely on the shape of the symbol tokens (not their "meaning") i.e., it is purely syntactic, and consists of (5) rulefully combining and recombining symbol tokens. There are (6) primitive atomic symbol tokens and (7) composite symbol-token strings. The entire system and all its parts -- the atomic tokens, the composite tokens, the syntactic manipulations (both actual and possible) and the rules -- are all (8) semantically interpretable: The syntax can be systematically assigned a meaning (e.g., as standing for objects, as describing states of affairs).

As this definition implies, a key element in the acquisition of symbolic structure is a kind of independence between the task in which the symbols appear and the vocabulary they are drawn from. Fundamental to this independence is the learning system's ability to factor the generic structure (the rules) of the task away from the symbols, which are arbitrarily bound to the external referents of the task. Consider the simple problem of learning a grammar from valid, "positive set only" sentences: strings of symbols drawn randomly from an infinite population of such valid strings. This sort of learning might well underlie the acquisition of language by children exposed to grammatically correct sentences during normal discourse with their language community.1 As an example we examine the learning of strings generated by Finite State Machines (FSMs; for an example see Fig. 1), which are known to correspond to regular grammars. Humans gain a memorial advantage from exposure to strings drawn from an FSM over random strings (Miller and Stein 1963, Reber 1967), as though they are extracting abstract knowledge of the grammar. A more stringent test of knowledge of a grammar is to expose subjects to an FSM with one external symbol set and then to see whether they transfer knowledge to a novel external symbol set assigned to the same FSM. In this type of task it is in principle impossible for subjects to use the symbol set itself as a basis for generalization; they must instead note the patterns that are commensurate with the properties of the FSM.2 An example of this type of transfer (vocabulary transfer) is shown in Fig. 1.

1 Although controversial, language acquisition must surely involve the exposure of children to valid sentences in their language. Chomsky (1957) and other linguists have stressed the importance of the a priori embodiment of the possible grammars in some form more generic than the exact target grammar. Although not the main point of the present report, it must surely be true that some learning bias over the distribution of possible grammars helps guide the acquisition and selection of one grammar over another in the presence of data. What the nature of this learning bias is might be a more profitable avenue of research in language acquisition than the recent polarizations inherent in the nativist/empiricist dichotomy (Pinker 1997, Elman et al. 1996).

2 Reber (1969) showed that humans transfer significantly in such a task; however, his symbol sets allowed subjects to use similarity as a basis for transfer, as they were composed of contiguous letters from the alphabet. Recent reviews of the literature nonetheless indicate that this type of transfer is common even across modalities (Redington and Chater 1996).


In this task, the syntactic structure of the generated sentences is kept constant while the vocabularies are switched. This is the first simulation in this paper. In the second simulation, we examine a type of syntax transfer, in which the vocabulary is kept the same but the syntax is altered (Fig. 2). The purpose of the second simulation is to examine the effect of syntactic similarity on the degree of knowledge transfer.
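Both tasks rest on sampling training sentences from an FSM. As a concrete illustration, here is a minimal Python sketch of such a generator; since the figure graphics did not survive reproduction here, the particular arcs below are hypothetical stand-ins in the spirit of the left-hand FSM of Fig. 1, not its actual transition table.

```python
import random

# Hypothetical three-state transition table: state 1 initial, state 3
# accepting; each entry maps (state, symbol) -> next state.
TRANSITIONS = {
    (1, 'A'): 1, (1, 'B'): 2,
    (2, 'C'): 2, (2, 'D'): 3,
    (3, 'E'): 1, (3, 'F'): 3,
}
INITIAL, ACCEPTING = 1, {3}

def generate_sentence(max_len=20):
    """Random walk through the FSM, emitting one symbol per transition;
    stop (with some probability) once an accepting state is reached."""
    state, symbols = INITIAL, []
    while len(symbols) < max_len:
        arcs = [(sym, nxt) for (q, sym), nxt in TRANSITIONS.items() if q == state]
        sym, state = random.choice(arcs)
        symbols.append(sym)
        if state in ACCEPTING and random.random() < 0.5:
            break
    return ''.join(symbols)
```

A vocabulary transfer condition would relabel the arcs with a new symbol set while keeping the state geometry; a syntax transfer condition does the reverse.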


Fig. 1: A vocabulary transfer task. The figure shows two finite state machine representations, each of which has three states (1, 2, 3), with transitions indicated by arrows and legal transition symbols (A, B, ..., F) for each state. The network's task is first to learn the FSM on the left-hand side to criterion and then to transfer to the FSM on the right-hand side. Note that this task permits no generalization from the transition symbols; all that is available are the state configuration geometries. This type of transfer task explicitly forces the network to process the symbol set independently from the transition rule. States with tail-less incoming arrows are initial states (state 1 in the above FSMs). States within circles are accepting states (state 3 in the above FSMs).


Fig. 2: A syntax transfer task. In this task, the vocabulary is the same in both the left and the right FSMs, but the structure of the FSMs (starting states, accepting states, allowed inputs at each state, and the resulting state transitions) differs between the two. As in the vocabulary transfer task, the network first learns the FSM on the left-hand side to criterion and then transfers to the FSM on the right-hand side. In a syntax transfer task the network may make use of prior knowledge associated with input symbols wherever the same input causes the same transition in both FSMs.


2. Related Work

Early neural network research in language learning (e.g. Hanson & Kegl, 1987) involved training networks to encode English sentences drawn randomly from a large text corpus (the Brown corpus). [...]

10 Here the phrase "context sensitivity" is used as in dynamical systems theory and not as in linguistics. That is, the term denotes the effect of the attractor structure in a dynamical system, not the effect of environments on a local syntactic structure.


sets (Fig. 5(c); colors signify vocabularies and shapes signify FSM states). In Figures 5(a) through 5(d), LDA was applied to the hidden unit activations of one network to demonstrate the development of the phase space structure. To show that this phase space organization was not formed by chance, LDA was applied to 20 networks and a measure of the separability of the state space with respect to FSM states was computed. One such measure is the rate of correct discrimination using linear discriminants. One linear discriminant divides the state space into two regions: one in which the discriminant is positive and one in which it is negative. Likewise, two linear discriminants divide the phase space into four regions. Because there are only three FSM states, if the FSM states are completely (pairwise) linearly separable in the phase space, two linear discriminants with respect to the FSM states should suffice to correctly discriminate all FSM states. Likewise, two linear discriminants with respect to vocabularies should suffice to correctly discriminate the three vocabularies.

The correct discrimination rates after eight vocabulary switches clearly show that the state space is organized by FSM states rather than by vocabulary sets: FSM states were correctly classified by the two linear discriminants with respect to FSM states with an accuracy of 91% (SD=7%, n=20), whereas vocabulary sets were correctly classified only 73% of the time (SD=10%, n=20). Note that LDA takes the number of classes into account when determining accuracy, although in the present case there was an equal number of classes for vocabulary and state. This means that the hidden layer learned a mapping from the input and current-state space, which is not pairwise separable by states, into the next-state space, which is pairwise separable by states, greatly facilitating the state transition and output computation at the next input.

Notice in both Fig. 5(b) and 5(c), relative to 5(a), that the symbol sets have spread out and occupy more of the hidden unit space, with significant gaps between clusters of the same and of different symbol sets. Moreover, Fig. 5(b) shows that within each state (coded by color), each vocabulary (coded by plotting symbol) clusters together. Although the vocabularies are not always pairwise separable in this plot, the separation is remarkable considering that the linear discriminants are optimized solely to discriminate the states. We can test the vocabulary separability within each state directly by performing an LDA within each state with vocabulary as the discriminant variable. In this case, vocabularies were correctly discriminated with an average accuracy of 95% (SD=7%, n=20). This is strong evidence that the hidden layer activity states are hierarchically organized into FSM states and vocabularies. This hierarchical structure accommodates both the already-learned vocabularies and any new ones the RNN is asked to learn.

LDA after the test vocabulary has been learned once also shows that the network state is predominantly organized by FSM states (Fig. 5(d)), although the linear separation by FSM states is compromised for a small fraction of activities, as can be seen at the boundary of the red crosses, blue stars, and triangles in the figure. This interference by the new vocabulary is not surprising considering that the old vocabularies were not relearned after the new vocabulary was learned.
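The separability measure just described is straightforward to reproduce. Below is a minimal sketch using scikit-learn's LinearDiscriminantAnalysis (our choice of implementation; the paper does not name one). With three classes, the fitted classifier uses at most two discriminant directions, matching the two-discriminant argument above.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def separability(hidden, states, vocabs):
    """Classification accuracy of LDA on hidden activations, once with FSM
    states as labels and once with vocabulary sets as labels.

    hidden: (n_samples, n_hidden) hidden unit activations
    states: (n_samples,) FSM state labels, e.g. 1, 2, 3
    vocabs: (n_samples,) vocabulary labels, e.g. 'ABC', 'DEF', 'GHI'
    """
    accuracy = {}
    for name, labels in (('state', states), ('vocabulary', vocabs)):
        lda = LinearDiscriminantAnalysis().fit(hidden, labels)
        accuracy[name] = lda.score(hidden, labels)
    return accuracy

# Repeated over 20 trained networks, the paper reports ~91% (state) versus
# ~73% (vocabulary) accuracy: the space is organized by FSM state.
```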
What is more interesting is the spatial location of the new vocabulary ("stars"). The hidden unit activity again clearly shows that the state discriminant structure is dominant and organizes the symbol sets. The fourth vocabulary appears simply to fill empty spots in the hidden unit space, coding it in a position relative to the existing state structure. This can be seen by comparison with Fig. 5(b), which is almost identical to the linear projections that were


found without the new symbol set. This retention of the old projections is not surprising, since the configuration prior to exposure to the new symbol set should not change, given the extensive prior vocabulary training. Apparently, the precedence of the existing abstraction encourages use of the hierarchical state representation and its existing context sensitivity. The network seems to bootstrap from existing nearby vocabularies in a way that allows generalization to the same FSM that all symbol sets share. Several factors may make such bootstrapping possible. First, the initial connection weights from novel symbols are small random weights, creating quite different patterns of activation in the hidden layer than old symbols do through the more articulated learned weights (i.e. weights with a wider dynamic range). Second, the probability distribution of FSM states is the same regardless of the vocabulary set, and this information is reflected in the first order weights. Third, the gradient descent algorithm finds a most effective modification of the weights for the prediction task. For these reasons, we can view what the network is doing as a kind of analogical learning (e.g. D is to F as A is to B).


Fig. 5: Linear Discriminant Analysis (LDA) of hidden layer activities. (a): LDA of hidden activities of a network that learned a single FSM/symbol set. The colors code state (1: red, 2: blue, 3: green), while the "+" sign codes the single symbol set. The FSM states are pairwise separable in the hidden layer activity space. (b): LDA of the hidden units after the RNN has learned three FSMs with three different symbol sets, using state as the discriminant variable. Notice how the hidden unit space spreads out compared to (a). Notice further that the space is organized into clusters corresponding to states (coded by color: red=1, blue=2, green=3), which are internally differentiated by symbol sets (represented by different graphic symbols: +=ABC, triangle=DEF, square=GHI). (c): LDA of hidden unit activities after the RNN has learned three FSMs and three different symbol sets, using the symbol set as the discriminant variable. Note that in this figure color codes the vocabularies (red: ABC, blue: DEF, green: GHI) and shape codes the FSM states (+: 1, triangle: 2, square: 3). In this case, the discriminant function based on the symbol set produces no simple spatial classification, in contrast to Fig. 5(b), which shows the same activations classified by the state of the FSM. (d): LDA of the hidden state space after training on three independent symbol sets for three cycles and then transfer to a new, untrained symbol set (coded by stars, encircled for clarity). Note how the new symbol set slides into the gaps between previously learned symbol sets (see (b), which is essentially identical except for the new symbol set). Apparently, this provides initial context sensitivity for the new symbol set, creating the 60% savings.

3.4 Simulation 2: The Syntax Transfer Task

Humans can apply abstract knowledge (such as the structure of symbol sequences) to solving problems even when the abstract structure required by the task differs slightly from the acquired one (in playing card games, for example, knowledge of strategy in one game often transfers to another). Does a neural network have the same flexibility? In the second simulation, we carried out a syntax transfer task in which the target syntactic structure was changed from the acquired one while the vocabulary was kept constant.


Fig. 6: Syntax transfer results as a function of the difference between source and target grammar. The number at the tail of each horizontal arrow gives the number of training sentences required when that grammar was the source; the number at the arrowhead gives the number required when it was the target. For instance, the left grammar in (a) required 20456 training sentences as the source grammar but only 10760 as the target grammar; thus, having first learned the grammar on the right-hand side produced a 47.4% reduction in required training sentences. Numbers in parentheses are standard errors (N=20).
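For concreteness, the savings percentages in Fig. 6 follow from the relative reduction in required training sentences; for the example above:

$$\text{savings} = \frac{N_{\text{source}} - N_{\text{target}}}{N_{\text{source}}} = \frac{20456 - 10760}{20456} \approx 47.4\%$$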


In the first syntax transfer task, only the initial and final states of the FSM were changed, so the subsequence distributions are almost the same (Fig. 6(a)). In this case, 47% and 49% savings were observed in the two transfer directions. In the second syntax transfer task, the directions of all arcs in the FSM were reversed (Fig. 6(b)), so that the mirror image of a sentence accepted by one grammar is accepted by the other. Although the grammars were very different, there is a significant amount of overlap in the permissible subsequences; accordingly, there were 19% and 25% savings in training. In the third and fourth syntax transfer tasks, the source and target grammars share fewer subsequences. In the third case (Fig. 6(c)), the subsequences were very different because the source grammar has two one-state loops (at states 1 and 3) with the same symbol A, whereas the two one-state loops in the target grammar consist of different symbols (A and B). Also, the two-state loops (states 2 and 3) consist of different symbols (BCBC... in the source grammar, CCCC... in the target grammar). In this case there was an 18% reduction in the number of trials required in one transfer direction but a 29% increase in the other. In the fourth case (Fig. 6(d)), one grammar includes two one-state loops with the same symbol (AAA... at states 1 and 3), whereas in the other grammar they consist of different symbols (AAA... at state 1 and BBB... at state 3). In this case, there were 13% and 14% increases in the number of trials required. The fact that there is interference (an increase in the number of required training trials) as well as transfer indicates that finding a correct mapping is a hard problem for the network.

From the observations above, we speculated that transfer is easier, and savings therefore greater, when the acquired grammar allows many subsequences of symbols that are also allowed by the target grammar.11 The simulation results are consistent with the finding in human artificial grammar learning that the transfer effect persists even when the new strings violate the syntactic rules slightly (Brooks & Vokey 1991).

11 To confirm this conjecture, we sought a measure of the overlap of subsequences between the source and the target grammar. Our efforts to produce such a numeric measure have met with only moderate success (see Negishi and Hanson (2001) for more details).
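As a rough illustration of the kind of overlap measure the footnote above alludes to, one could compare the sets of length-k symbol strings each grammar can emit and take their Jaccard overlap. This sketch is our own illustrative construction, not the measure explored in Negishi and Hanson (2001):

```python
def k_subsequences(transitions, k=3):
    """All length-k symbol strings an FSM can produce starting from any state.
    transitions: dict mapping (state, symbol) -> next state."""
    states = {q for (q, _) in transitions}
    paths = {('', q) for q in states}
    for _ in range(k):
        paths = {(s + sym, nxt)
                 for (s, q) in paths
                 for (q0, sym), nxt in transitions.items() if q0 == q}
    return {s for (s, _) in paths}

def subsequence_overlap(trans_a, trans_b, k=3):
    """Jaccard overlap of the two grammars' length-k subsequence sets;
    higher overlap should predict greater transfer savings."""
    a, b = k_subsequences(trans_a, k), k_subsequences(trans_b, k)
    return len(a & b) / len(a | b)
```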


4. Conclusion

It has been shown that the amount of training needed to learn the end prediction task on a new grammar is reduced in the vocabulary transfer paradigm as well as in the syntax transfer paradigm (provided that the source and target grammars are similar). We have shown that previous experience with examples drawn from rules (governing an FSM) can shape the acquisition process and speed of new, though similar, rules in simple associationist neural networks. The linear discriminant analysis of the hidden layer activities showed that the activity space was hierarchically organized in terms of FSM states and vocabularies. The trajectories in the state space showed context sensitivity, and the reorganization of the state space showed a simple type of analogical process that would support the vocabulary transfer process.

It has been argued that unstructured (minimal bias) neural networks with general learning mechanisms are incapable of representing, processing, and generalizing symbolic information (Pinker 1994, Fodor and Pylyshyn 1988, Marcus et al. 1999). Evolutionary psychologists argue that humans have an innate symbol processing mechanism that has been shaped by evolution. Pinker, for one, argues that there must be two distinct mechanisms for language, an associative mechanism and a rule-based mechanism, the latter being equipped with a non-associative learning mechanism or arising from a genetic basis (Pinker 1991). The alternative, as demonstrated by the present experiments, is that neural networks incorporating associative mechanisms can both be sensitive to the statistical substrate of the world and exploit data structures that have the property of following a deterministic rule. As demonstrated, these new data structures can arise even when the explicit rule has only been expressed implicitly by examples and learned by mere exposure to regularities in the data.

Acknowledgments

We thank Michael Casey for contributions to an earlier version of this work and Ben Martin Bly for comments and edits on earlier versions of this paper. We would also like to acknowledge an anonymous reviewer for reading over this paper and providing many useful comments and edits, and Gary Cottrell for many useful discussions and corrections to the present paper.

Appendix: Network equations

In the equations below, t is a time step indicating the number of input words seen so far; for instance, t=1 when the network is processing the first word.

Output layer node i activation:

$$X_i^O(t) = f\big(P_i^O(t)\big) = f\Big(\sum_{jk} w_{ijk}^O\, X_j^S(t-1)\, I_k(t)\Big)$$

The product of feedback layer node j activity $X_j^S(t-1)$ (the state hidden layer node j activity at the previous time step) and input layer node k activity $I_k(t)$ is weighted by a second order weight $w_{ijk}^O$ and added to the output layer node i potential $P_i^O(t)$. A transfer function f() is applied to the potential and yields the output layer node i activity $X_i^O(t)$. In the current network there is only one output layer node (i=1). $X_0^S(t-1)$ and $I_0(t)$ are special constant nodes whose output values are always 1.0. As a result, $w_{i0k}^O$ serves as a first order connection from input node k to output layer node i, and $w_{ij0}^O$ serves as a first order connection from feedback layer node j to output layer node i. The weight $w_{i00}^O$ serves as a bias term for output layer node i.

State hidden layer node activation:

$$X_i^S(t) = f\big(P_i^S(t)\big) = f\Big(\sum_{jk} w_{ijk}^S\, X_j^S(t-1)\, I_k(t)\Big)$$


The product of feedback layer node j activity $X_j^S(t-1)$ and input layer node k activity $I_k(t)$ is weighted by a second order weight $w_{ijk}^S$ and added to the state hidden layer node i potential $P_i^S(t)$. The transfer function f() is applied to the potential and yields the state hidden layer node i activity $X_i^S(t)$. Because of the special definitions of $X_0^S(t-1)$ and $I_0(t)$ described above, $w_{i0k}^S$ serves as a first order connection from input node k to state hidden layer node i, and $w_{ij0}^S$ serves as a first order connection from feedback layer node j to state hidden layer node i. The weight $w_{i00}^S$ serves as a bias term for state hidden layer node i.
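To make the second order architecture concrete, here is a minimal NumPy sketch of one forward step. It assumes a logistic transfer function for f (the text does not specify f) and follows the convention above that node 0 of both the feedback and input layers is clamped to 1.0; all names are ours.

```python
import numpy as np

def sigmoid(p):
    return 1.0 / (1.0 + np.exp(-p))

def forward_step(w_o, w_s, x_prev, inp):
    """One time step of the second order RNN.

    w_o: (n_out, n_state, n_in) second order weights into the output layer
    w_s: (n_state, n_state, n_in) second order weights into the state layer
    x_prev: state activities X^S(t-1), with x_prev[0] == 1.0 (constant node)
    inp: input activities I(t), with inp[0] == 1.0 (constant node)
    """
    # P_i = sum_jk w_ijk X_j^S(t-1) I_k(t): contract over feedback and input.
    x_out = sigmoid(np.einsum('ijk,j,k->i', w_o, x_prev, inp))
    x_state = sigmoid(np.einsum('ijk,j,k->i', w_s, x_prev, inp))
    x_state[0] = 1.0  # re-clamp the constant node
    return x_out, x_state
```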

Output layer weight update:

$$\Delta w_{lmn}^O = -\alpha \sum_t \frac{\partial J(t)}{\partial w_{lmn}^O} = -\alpha \sum_t \sum_i E_i(t)\, \frac{\partial X_i^O(t)}{\partial w_{lmn}^O}$$

$$\frac{\partial X_i^O(t)}{\partial w_{lmn}^O} = \frac{\partial}{\partial w_{lmn}^O} f\Big(\sum_{jk} w_{ijk}^O\, X_j^S(t-1)\, I_k(t)\Big) = \bar{f}\big(P_i^O(t)\big)\, \delta_{il}\, X_m^S(t-1)\, I_n(t)$$

The change in a second order weight $w_{lmn}^O$ (from feedback layer node m and input layer node n to output layer node l) is the negative of the learning constant $\alpha$ times the partial derivative of the cost function J(t), summed over the whole sentence. The partial derivative equals the error (output value minus desired output) of the output layer nodes, $E_i(t)$, times the partial derivatives of the outputs $X_i^O(t)$, summed over all outputs i (in the current network there is only one output node). The partial derivative of the output node activity is computed using the chain rule (dx/dy = (dx/dz)(dz/dy)). See the equations for the output layer node activation for variable descriptions. In the last line above, $\bar{f}()$ denotes the derivative of f(), and $\delta_{il}$ is a Kronecker delta whose value is one if i=l and zero otherwise.

State hidden weight update:

$$\Delta w_{lmn}^S = -\alpha \sum_t \frac{\partial J(t)}{\partial w_{lmn}^S} = -\alpha \sum_t \sum_i E_i(t)\, \frac{\partial X_i^O(t)}{\partial w_{lmn}^S}$$

$$\frac{\partial X_i^O(t)}{\partial w_{lmn}^S} = \frac{\partial}{\partial w_{lmn}^S} f\Big(\sum_{jk} w_{ijk}^O\, X_j^S(t-1)\, I_k(t)\Big) = \bar{f}\big(P_i^O(t)\big) \sum_{jk} w_{ijk}^O\, I_k(t)\, \frac{\partial X_j^S(t-1)}{\partial w_{lmn}^S}$$

$$\frac{\partial X_i^S(t)}{\partial w_{lmn}^S} = \frac{\partial}{\partial w_{lmn}^S} f\Big(\sum_{jk} w_{ijk}^S\, X_j^S(t-1)\, I_k(t)\Big) = \bar{f}\big(P_i^S(t)\big)\Big[\delta_{il}\, X_m^S(t-1)\, I_n(t) + \sum_{jk} w_{ijk}^S\, I_k(t)\, \frac{\partial X_j^S(t-1)}{\partial w_{lmn}^S}\Big]$$

The change in a second order weight $w_{lmn}^S$ (from feedback layer node m and input layer node n to state hidden layer node l) is the negative of the learning constant times the partial derivative of the cost function J(t), summed over the whole sentence. The partial derivative equals the error of the output nodes $E_i(t)$ times the partial derivatives of the output layer node activities $X_i^O(t)$, this time with respect to a state hidden layer weight, summed over all outputs i. The partial derivative of the output node activity is computed using the chain rule; in this case we further need the partial derivatives of the state hidden layer node activities $X_j^S(t-1)$, which are again computed by the chain rule. The time index (t-1) of the state hidden layer activation in the last equation indicates that the partial derivative of a state hidden layer node activity with respect to a state hidden layer weight is computed recursively. The initial value of the partial derivative (at the beginning of the sentence, t=0) is taken to be zero.
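These updates amount to real-time recurrent learning (Williams and Zipser 1989) specialized to second order weights. Below is a minimal NumPy sketch of one sentence's worth of accumulation, reusing sigmoid and the conventions from the forward-pass sketch above; it assumes a squared-error cost so that $E_i(t)$ is simply output minus target, and all names are ours.

```python
import numpy as np

def train_on_sentence(w_o, w_s, inputs, targets, alpha=0.1):
    """Accumulate the appendix's weight changes over one sentence, then apply.

    inputs: sequence of input vectors I(t), each with inp[0] == 1.0
    targets: sequence of desired output vectors (one per time step)
    """
    n_state = w_s.shape[0]
    x_prev = np.zeros(n_state); x_prev[0] = 1.0
    # dxdw[i, l, m, n] = dX_i^S(t-1) / dw_lmn^S, zero at the sentence start.
    dxdw = np.zeros((n_state,) + w_s.shape)
    d_wo = np.zeros_like(w_o); d_ws = np.zeros_like(w_s)
    for inp, target in zip(inputs, targets):
        x_out = sigmoid(np.einsum('ijk,j,k->i', w_o, x_prev, inp))
        x_state = sigmoid(np.einsum('ijk,j,k->i', w_s, x_prev, inp))
        x_state[0] = 1.0
        err = x_out - target                 # E_i(t)
        fp_o = x_out * (1.0 - x_out)         # f'(P^O) for the logistic f
        fp_s = x_state * (1.0 - x_state)     # f'(P^S)
        # Output weights: dX_i^O/dw_lmn^O = f'(P_i^O) delta_il X_m^S(t-1) I_n(t)
        d_wo -= alpha * np.einsum('i,m,n->imn', err * fp_o, x_prev, inp)
        # State weights: dX_i^O/dw_lmn^S = f'(P_i^O) sum_jk w_ijk^O I_k dX_j^S/dw_lmn^S
        dxo = np.einsum('i,ijk,k,jlmn->ilmn', fp_o, w_o, inp, dxdw)
        d_ws -= alpha * np.einsum('i,ilmn->lmn', err, dxo)
        # Recursion for dX_i^S(t)/dw_lmn^S (last equation above).
        new = np.einsum('ijk,k,jlmn->ilmn', w_s, inp, dxdw)
        idx = np.arange(n_state)
        new[idx, idx] += np.einsum('m,n->mn', x_prev, inp)  # delta_il term
        dxdw = fp_s[:, None, None, None] * new
        dxdw[0] = 0.0                        # constant node carries no gradient
        x_prev = x_state
    return w_o + d_wo, w_s + d_ws
```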


References

Special Issue on Cognitive Neuroscience, Science, 275, 1580-1608 (1997).
Berko, J. (1958) The child's learning of English morphology. Word, 14, 150-177.
Brooks, L. R., and Vokey, J. R. (1991) Abstract analogies and abstracted grammars: Comments on Reber (1989) and Mathews et al. (1990). Journal of Experimental Psychology: General, 120, 316-323.
Burns, B. D., Hummel, J. E., and Holyoak, K. J. (1993) Establishing analogical mappings by synchronizing oscillators. Proceedings of the Fourth Australian Conference on Neural Networks (ACNN'93), 49-52. Sydney University Electrical Engineering, Sydney, NSW, Australia.
Casey, M. (1996) The dynamics of discrete-time computation, with applications to recurrent neural networks and finite state machine extraction. Neural Computation, 8 (6), 1135-1178.
Chomsky, N. (1957) Syntactic Structures. Mouton, The Hague.
Cleeremans, A. (1993) Mechanisms of Implicit Learning (MIT Press, Cambridge).
Denker, J., Schwartz, D., Wittner, B., Solla, S., Howard, R., Jackel, L., and Hopfield, J. J. (1987) Automatic learning, rule extraction and generalization. Complex Systems, 1 (5), 877-922.
Dienes, Z., Altmann, G., and Gao, S-J. (1999) Mapping across domains without feedback: A neural network model of transfer of implicit knowledge. Cognitive Science, 23, 53-82.
Elman, J. (1990) Finding structure in time. Cognitive Science, 14, 179-211.


Elman, J., Bates, E., Johnson, M., Karmiloff-Smith, A., Parisi, D., and Plunkett, K. (1996) Rethinking Innateness (MIT Press, Cambridge).
Fodor, J., and Pylyshyn, Z. (1988) Connectionism and cognitive architecture: A critical analysis. In Pinker & Mehler (Eds.), Connections and Symbols (MIT Press, Cambridge).
Giles, C. L., Horne, B. G., and Lin, T. (1995) Learning a class of large finite state machines with a recurrent neural network. Neural Networks, 8 (9), 1359-1365.
Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., and Lee, Y. C. (1992) Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4, 393-405.
Hanson, S. J. (1990) A stochastic version of the delta rule. Physica D, 42, 265-272.
Hanson, S. J., and Burr, D. (1990) What connectionist models learn: Learning and representation in connectionist models. Behavioral and Brain Sciences, 13 (3), 471-518.
Hanson, S. J., and Kegl, J. (1987) PARSNIP: A connectionist network that learns natural language grammar from exposure to natural language sentences. Proceedings of the Ninth Annual Conference of the Cognitive Science Society, Seattle, WA.
Harnad, S. (1990) The symbol grounding problem. Physica D, 42, 335-346.
Jordan, M. I. (1986) Serial order: A parallel distributed processing approach. ICS Technical Report (UCSD).
Marcus, G. F., Vijayan, S., Bandi Rao, S., and Vishton, P. M. (1999) Rule learning by seven-month-old infants. Science, 283, 77-80.
McCloskey, M., and Cohen, N. (1989) Catastrophic interference in connectionist networks: The sequential learning problem. The Psychology of Learning and Motivation, 24, 109-165.
Miller, G. A., and Stein, M. (1963) Grammarama I: Preliminary studies and analysis of protocols. Technical Report No. CS-2, Cambridge: Harvard University, CCS.
Negishi, M. (1999) A comment on G. F. Marcus, S. Vijayan, S. Bandi Rao, and P. M. Vishton, "Rule learning by seven-month-old infants". Science, 284, 435.
Negishi, M., and Hanson, S. J. (2001) A study of grammar transfer effects in a second order recurrent network. Proceedings of the International Joint Conference on Neural Networks 2001, 1, 326-330.
Pinker, S. (1991) Rules of language. Science, 253, 530-535.
Pinker, S. (1994) The Language Instinct (Morrow & Co.).
Pinker, S. (1997) How the Mind Works (W. W. Norton & Co.).
Pratt, L. Y., Mostow, J., and Kamm, C. A. (1991) Direct transfer of learned information among neural networks. Proceedings of AAAI-91, 2, 584-589.
Pratt, L. Y. (1993) Discriminability-based transfer between neural networks. In Advances in Neural Information Processing Systems (NIPS) 5, 204-211. Denver, Colorado: Morgan Kaufmann.
Reber, A. (1967) Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behavior, 6, 855-863.
Reber, A. (1969) Transfer of syntactic structure in synthetic languages. Journal of Experimental Psychology, 81, 115-119.
Redington, M., and Chater, N. (1996) Transfer in artificial grammar learning: A reevaluation. Journal of Experimental Psychology: General, 125 (2), 123-138.


Rumelhart, D., Hinton, G., and Williams, R. J. (1986) Learning representations by back-propagating errors. Nature, 323, 533-536.
Saffran, J. R., Aslin, R. N., and Newport, E. L. (1996) Statistical learning by 8-month-old infants. Science, 274, 1926-1928.
Thagard, P., and Verbeurgt, K. (1998) Coherence as constraint satisfaction. Cognitive Science, 22, 1-24.
Williams, R. J., and Zipser, D. (1989) A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270-280.