Staged assimilation: a system for detecting invariant ... - IEEE Xplore

STAGED ASSIMILATION: A system for detecting invariant features in temporally coherent visual stimuli

James N. Templeman*, Murray H. Loew
George Washington University, Department of Electrical Engineering and Computer Science

"the invariant structure separates off best when the frozen perspective begins to flow" - James J. Gibson

Abstract: Staged assimilation is a means of extracting information from a sequence of related images. It is used to develop feature detectors of transformation-invariant properties. It is applied within a hierarchical network, referred to as the bi-panel architecture. Temporal response properties of neurons are used to capture this information. High-level feature detectors arise by grouping sequentially detected low-level features into equivalence classes. When the system is exposed to imagery undergoing a continuous transformation, it develops detectors for properties that tend to remain constant over time. This is how it forms invariant feature detectors. This paper explains how the approach derives from studies of the visual cortex and J. J. Gibson's theory of perception. An example is given of how it works, and details of the system's operation are discussed. Finally, its relationship to other learning paradigms and networks is described.

Key words: Adaptive Pattern Recognition, Bi-panel Architecture, Classifiers, Competitive Learning, Complex Cells, Computer Vision, Feature Maps, Invariance, Neocognitron, Perception, Self-organizing Systems, Sensory Scan-to-Scan Correlation, Sensory Fusion, Staged Assimilation, Temporal Processing, Visual Cortex.

Foreword: Sometimes things turn out to be more direct than you expect them to be.

A Functional Architecture for Perception
A new functional architecture for adaptive perception is proposed. It learns to register both changing (variant) and unchanging (invariant) characteristics in response to a series of structured patterns. It is constructed as a tiered arrangement of processing panels. Each panel takes in a sequence of spatial patterns, applies an array of matching filters, and expresses its result as a sequence of spatial patterns. There are two types of panels: simple and complex.
A simple panel is a self-organizing array of adaptive spatial filters [1], much like a feature map. A complex panel is an array of adaptive spatio-temporal filters (i.e., they retain state information).

* NeuroNetics, 7914 Brompton Street, Springfield VA 22152

Typically (for the purposes of this paper), the two are configured in a bi-panel arrangement, with a simple panel in front and a complex panel at the back. The responses of the front panel serve as inputs to the back panel. The responses of the back panel may be fed into the front panel of another bi-panel processing module. A number of bi-panel processing modules can be connected together. The resulting system is capable of discerning structure over several levels of abstraction. This functional architecture can learn to perform transformation-invariant shape recognition without supervision.

Each member of the simple panel takes in momentary input from only a localized section of the pattern. Each filter specializes to respond to a particular set of pattern elements which tend to be active simultaneously (i.e., a matching template). In this way the simple front panel discriminates patterns based on their content. Simple filters learn to characterize the momentary spatial patterns transmitted to them.

Each member of the back panel takes in consecutive input from a localized section of the front panel. Each filter specializes to respond to a particular set of simple filters which tend to activate in succession. In this way the complex back panel generalizes by assimilating features detected as a scene undergoes continuous transformation. Complex filters learn to characterize persistent properties of patterns transmitted to them via the front panel.

The adaptive system relies on the continuity of the image sequences presented to it. It uses the continuous transformation to tell it which features are related. In the absence of supervision, only continuous transformation discloses the structural properties of form. The complex back panel is the key to generalizing from discriminant to invariant features. As a shape or figure undergoes a continuous transformation, it generates a series of instances that form an equivalence class over that transformation.
The characteristics of the figure also undergo transformation. The system seeks to construct equivalence classes of characteristics. It does so by assimilating features detected successively by the front panel into an equivalence class for each member of the back panel. Each complex filter becomes tuned to respond to a related class of features over continuously changing views.
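The dataflow through one bi-panel module can be sketched in a few lines of code. Everything in the sketch is our own illustrative assumption (the class names `SimplePanel` and `ComplexPanel`, the winner-take-all rule, the decay constant); the paper does not prescribe an implementation:

```python
import numpy as np

class SimplePanel:
    """Array of spatial matched filters; responds to the momentary frame."""
    def __init__(self, templates):
        self.templates = templates            # shape (n_filters, n_inputs)

    def respond(self, frame):
        scores = self.templates @ frame       # linear spatial matching
        out = np.zeros_like(scores)
        out[np.argmax(scores)] = 1.0          # winner-take-all (competitive)
        return out

class ComplexPanel:
    """Array of spatio-temporal filters; integrates simple responses over time."""
    def __init__(self, weights, decay=0.5):
        self.weights = weights                # shape (n_complex, n_simple)
        self.decay = decay
        self.trace = np.zeros(weights.shape[1])  # persisting input activation

    def respond(self, simple_out):
        self.trace = self.decay * self.trace + simple_out  # temporal persistence
        return self.weights @ self.trace

# One bi-panel module: simple panel in front, complex panel at the back.
templates = np.eye(4)                         # 4 toy filters over a 4-element frame
sp = SimplePanel(templates)
cp = ComplexPanel(np.ones((1, 4)))            # one complex unit pooling all four
for t in range(3):                            # a "transforming" stimulus sequence
    frame = np.roll(np.array([1.0, 0, 0, 0]), t)
    c = cp.respond(sp.respond(frame))
```

Stacking modules then amounts to feeding `cp`'s output sequence on as the frame input of the next module's simple panel.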

Figure 1. Conceptual Diagram of Feature Perception. [Figure: an image sequence (a rotating square) yields a variant feature sequence; S-panel response clusters feed C-panel response clusters, producing an invariant feature sequence.]

Derivation from Biological and Psychological Sources
A review of both the current findings of neuroscience [2] and models of the visual cortex [3] shows that many of the insights derived from studies of the primary visual cortex have been incorporated into models of vision. There is, however, a fundamental observation that has long been played down: the temporal characteristics of cortical neurons. This alone makes the topic worth exploring, but what makes it compelling is the way it neatly fits into the larger context of a theory of perception [4].

The Visual Cortex (background)
The response properties of individual cortical cells have been extensively studied by neurophysiologists over the past thirty years. The seminal work of Hubel and Wiesel (1962 ...) found that cortical cells respond selectively to visual stimulation. Cells of the visual cortex do not respond to diffuse light or even steady patterns of light presented to the immobilized eyes of test subjects. Cortical cells respond only to changing patterns of light. The eyes of an unrestrained, alert animal are always in motion. The eye makes both fine and coarse movements, termed micro- and macro-saccades. Even when the eyes fixate on a target, the micro-saccades still occur.

Neurons of the visual cortex were found to be differentially responsive to certain visual properties. This is the broad meaning of the expression 'feature detection' to be used throughout this discussion. Individual cells respond to a number of different visual attributes. They typically have a broadly-tuned response curve for any particular attribute. Only by examining the collective response of a group of such cells can an accurate characterization of the visual stimulus be determined. The cells of the primary visual areas were found to be selectively responsive to: the orientation of edges, ocular dominance, spatial frequency (i.e., stimulus width), end-stopping, direction selectivity, velocity tuning, and color (in primates). Of all these properties, orientation stands out as one of the most commonly encountered, especially in area 17 of the cat.

Invariance: It is common to define an invariant as a property that remains constant over a transformation. But this gives little indication of how such a construct is derived. When an irregular triangle is rotated about its center, no section of its image really 'stays the same'. Only high-level constructs like the lengths of its sides and the angles formed at its vertices remain constant. (A property is an attribute common to all members of a class. An equivalence class is a set of elements characterized as being the same. Thus a property establishes an equivalence class when it is used as the basis of characterization. Conversely, an equivalence class may be used to define a property by means of induction.) It is more useful to say that an invariant is defined by an equivalence class of features generated as a form undergoes transformation. Invariants are not discovered so much as they are forged to match what is detected.

Consider a series of features detected as an object undergoes transformation. By recording features corresponding to a particular aspect of the object during transformation, a feature set is derived that indicates a higher-level feature. Such a construct(ed feature) is by definition invariant over the transformation. It is difficult, however, to ensure that the features all correspond to one particular aspect. Some feature sets may occur by coincidence, and do not define invariants. The features detected are taken to be instances cast by a higher-level, invariant feature over a transformation. Unfortunately, most scenes contain many different visual attributes, all undergoing transformation at once, and unrelated features are detected. The trick is to induct only the features cast by the same invariant feature into an equivalence class. The system seeks to isolate these sets by assuming that only features occurring proximally (in space and time) are related.

This architecture is modelled after the primary visual cortex (of cats and primates). The simple panel corresponds to the network of simple cells, and the complex panel to the complex cell network. It is both a model of how perception occurs and of how sensory cortex operates.
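The rotating-triangle observation above can be verified numerically: under rotation, every vertex coordinate changes, yet the set of side lengths, a candidate invariant, does not. The particular triangle and angles below are arbitrary illustrative choices:

```python
import math

def rotate(points, theta):
    """Rotate a list of (x, y) points about the origin by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def side_lengths(tri):
    """Sorted side lengths of a triangle given as three (x, y) vertices."""
    a, b, c = tri
    d = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
    return sorted([d(a, b), d(b, c), d(c, a)])

tri = [(1.0, 0.0), (3.0, 0.5), (2.0, 2.0)]   # an irregular triangle
for theta in (0.3, 1.1, 2.7):                # several continuous "views"
    rt = rotate(tri, theta)
    # every vertex coordinate changes under the transformation...
    assert all(p != q for p, q in zip(tri, rt))
    # ...yet the side lengths (the invariant) are preserved
    assert all(abs(u - v) < 1e-9
               for u, v in zip(side_lengths(tri), side_lengths(rt)))
```

The invariant here is exactly an equivalence class in the paper's sense: all rotated views share the same side-length description, even though no point of the image stays fixed.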

Simple and complex cells: Early in their studies, Hubel and Wiesel [5] divided the cells of the visual cortex into two major classes. A cell is classified as simple or complex based on its response characteristics. In the neuroscience literature, the receptive field (RF) of a neuron is the spatial region of the visual field over which a stimulus can affect the cell's response. The response characteristic of a visual cell is described in terms of its receptive field organization: the cell's response as different portions of its visual field are stimulated [2, chpt. 4]. A cell is considered simple if it responds linearly to spatial stimulation over its receptive field (for its preferred range of stimuli). A cell is considered complex if its response is not found to be linear. In spite of many refinements and new subdivisions added since then, this basic division continues to prove useful.

Simple cells (S cells) have comparatively small receptive fields and low levels of background firing activity [2]. A 'simple' RF can be divided into distinct on- and off-subfields, over which the cell is selectively responsive to local changes in illumination. A cell is excited when a spot of light either brightens in its on-subfield or dims in its off-subfield. It is inhibited when a spot of light dims in its on-subfield or brightens in its off-subfield. Applying a common stimulus over an extended region of a single subfield produces a cumulative effect, but if it falls across both on- and off-subfields, the resulting excitation and inhibition cancel each other. The stimulus preferences of a simple cell can be quantitatively predicted from a map of its local responses [6]. For the RFs of simple cells, spatial summation is linear. Simple cells perform linearly, however, only over their preferred range of stimuli [7]. Simple-cell subfields are typically arranged so as to respond most strongly to oriented visual edges.
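The linear spatial summation just described can be sketched as a signed template: on-subfields contribute positively, off-subfields negatively, and the output is thresholded. The one-dimensional RF below is a toy assumption, not a measured receptive field:

```python
import numpy as np

# A toy 1-D "receptive field": an on-subfield (+1) flanked by off-subfields (-1),
# i.e. the cell prefers a bright bar over its central region.
rf = np.array([-1.0, -1.0, +1.0, +1.0, -1.0, -1.0])

def simple_cell(stimulus, rf=rf):
    """Linear spatial summation over the RF, followed by an output threshold."""
    drive = float(rf @ stimulus)       # excitation and inhibition sum linearly
    return max(drive, 0.0)             # thresholded (half-wave rectified) output

bar_on_center = np.array([0, 0, 1, 1, 0, 0])   # light in the on-subfield: excites
bar_on_flank  = np.array([1, 1, 0, 0, 0, 0])   # light in an off-subfield: inhibits
wide_bar      = np.array([1, 1, 1, 1, 1, 1])   # spans both: net drive is
                                               # non-positive, so no response
```

Here `simple_cell(bar_on_center)` is positive, while the flank bar and the wide (diffuse) bar both produce zero output, mirroring the cancellation of excitation and inhibition described above.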
Cells which do not respond linearly (to preferred stimuli) are classified as complex (C cells). Typically, C cells respond actively to both light increment and light decrement throughout their receptive field. Complex cells tend to have larger RFs than S cells [2]. A larger percentage of C cells have appreciable levels of background firing activity [2]. Like simple cells, complex cells tend to respond best to oriented visual edges.

Temporal characteristics of S and C cells: Hubel and Wiesel [5] went on to describe the response of simple and complex cells to moving edges. Simple cells register a brief (transient) response as an edge crosses over a very narrow region, while complex cells produce a sustained response to movement over a much wider region. (This statement can be misinterpreted as suggesting that S cells detect stationary edges, while C cells detect motion. On the contrary, in area 17 of the cat, velocity low-pass and velocity-tuned cells are associated with S cell types, while velocity broad-band and velocity high-pass cells are associated with C cell types [2, p. 198].) Other kinds of moving stimuli produce similarly distinctive temporal responses. Schiller, et al. [8] described the temporally modulated responses of cells to moving sine-wave gratings as follows: "S-type cells discharge in sharp bursts to each cycle that traverses the RF. CX-type cells discharge in a rather continuous fashion." This was found to be the best single measure for discriminating between S and C cell types in their extensive studies of quantitative cell properties. Note that in both cases the same stimulus is effective in activating both cell types. It is the temporal response of the cell that distinguishes the two.

The hierarchical processing scheme: Hubel and Wiesel postulated a model in which S cells receive their inputs directly from the cells of the Lateral Geniculate Nucleus (LGN), and C cells are driven solely by S cell responses. This is often referred to as the 'hierarchical processing model', and contrasted with the 'parallel processing scheme', in which both S and C cells are directly driven by LGN afferents. Although complex cells receive the majority of their input from simple cells, they can also be driven by alternate pathways leading from the retina. Today a complete model of this cortical network must combine both of these processing schemes. Having acknowledged this, a strictly hierarchical scheme is adopted for the version of the model presented here.

Abstracting the Architecture
In this model the primary differences between C and S cells are: 1) C cells produce responses over longer time periods, and 2) C cells take the responses of sets of S cells as their primary input. Either one of these two properties is sufficient to make C cells respond nonlinearly to spatial illumination (remembering that S cells threshold their output). The critical step in deriving the model is to interpret the observation that C cells produce responses over longer time periods as indicating that C cells integrate their response to their inputs over longer time periods. When a cell that operates in this manner is inserted into a hierarchical network, and allowed to adapt in response to a dynamic visual stimulus, it develops receptive field properties like those of biological complex cells. A cursory explanation of this process is given by an example later in this paper. This suggests that the spatial receptive field properties of complex cells derive from their temporal processing characteristics.

The Requirements of Perception
James J. Gibson formulated a comprehensive theory of perceptual psychology. It goes something like this: 1) An organism exists and operates within an environment. 2) There is structure in the environment. 3) Perception is the process of picking up this information. "... optical information pickup entails ... the concurrent registering of both persistence and change in the flow of structured stimulation." [4, p. 239] "Perception depends primarily on the pickup of structural invariances across views changing with time." [9]


Structure in the Environment: The environment consists of objects that are juxtaposed with each other (and with the organism itself). Each object has its own physical structure; each occupies a bounded region of space, is composed of certain materials, has certain surface textures, etc. Objects arranged next to one another define a layout. They may move relative to one another. Linear and angular displacements are continuous, objects rarely pass through one another, objects can rest on other objects, etc. The transformations are also structured. As far as the organism is concerned, the environment is a stream of structured stimulation (with which the organism interacts). The eyes continually stroke their receptors over the visual field. There is continuity in the flow of visual stimulation: "One arrangement does not become a wholly different arrangement by a displacement of viewpoint. There is no jump from one to another, only a variation of structure that serves to reveal the nonvariation of structure. The pattern of the array does not ordinarily scintillate; the forms of the array do not go from triangular to quadrangular, for example." - Gibson [4, p. 73]

Why not take full advantage of the structured nature of information imparted by the environment? The continuity of a transforming sensory image conveys useful information. Models that fail to take this information into account limit their potential results. (It does, however, give them a more tractable set of formulations to deal with. That is the advantage of working with independent (i.e., unstructured) samples.)

The synthesis of perception by the visual cortex
On the one hand, we have a psychological theory that requires a perceptual system to concurrently register both the momentary and persistent properties in the flow of structured stimulation it receives. On the other hand, we find that the visual cortex is made up of two major types of neuron: one registers a momentary response as an edge crosses over a very narrow region, and the other produces a sustained response to movement over a much wider region. It is very tempting to interpret the role of cortical neurons as carrying out the requirements stated above. Once the function of S and C cells is interpreted in this light, it is not hard to come up with a scheme that allows the cells to adaptively develop such capacities. (See the example in the next section.)

In the early days, a great deal of excitement was generated by the thought of viewing neurons as feature detectors [10, 11]. In 1963 Hubel and Wiesel [12] suggested that complex cells generalize the features registered by simple cells: "What appears to be a first step in perceptual generalization results from the response of cortical cells to the orientation of a stimulus apart from its exact retinal position." Since that time, both the scientists measuring neuronal properties and those developing neural network theories have focused their attention on the discriminating aspects of cortical cells [2]. In terms of empirical studies, the ongoing exploration and analysis of visual cortex has continued unabated. When considered out of context, it is difficult to say whether the receptive field properties of complex cells

serve primarily to discriminate or generalize over their input. The question need not be addressed before proceeding to measure any specific attribute of the neuron. But when the properties of cortical cells are viewed in the context of a comprehensive theory of perception, and they fit, then we may have learned something useful about how such systems are built. Ironically, although this interpretation is unnecessary for analysis, it may be the key to synthesizing perceptual networks.

Developing detectors for invariant properties is a special form of generalization. An invariant property only exists with respect to a transformation. It defines, and is defined by, the transformation. At first this might seem to be a difficult kind of generalization to make, but our nervous systems are very good at doing just that. Change over time discloses the transformations applied. It presents series of related samples. Invariant features are those classes of features that persist over time. This process allows us to discover the underlying properties, or deep structure, in our environment. If we want to understand how something works, we want to see it in action, presented at a rate which we can follow. Even though on the surface everything may change, there are things underneath which remain constant. These are the invariants.

Example: Translation-Invariant Orientation Detectors
The following example describes how staged assimilation can be applied to generalize the feature of orientation. It is also meant as a highly simplified and abstracted description of how C cells might operate. (This high level of abstraction is also used to avoid dealing with unresolved issues of cortical operation. One such issue is the question of whether or not on- and off-center surround cells jointly converge onto S cells [2, p. 268].) Consider the view of a small region of a line drawing (or an edge-enhanced image in which the edges separating visual regions have been converted into border lines). What kinds of features, and invariant features, might occur? In the case of translating lines, the invariant property to be derived is orientation; the orientation of a line remains constant under translation. The system should learn to group parallel lines into equivalence classes.

First, consider how this might be learned without using temporal clues. As an example of competitive learning, Rumelhart and Zipser [13] describe how a network is trained to distinguish between horizontal and vertical lines. Lines are represented by bit-map patterns, with parallel lines sharing no elements in common. The authors point out that the competitive learning network cannot group parallel lines together because their patterns have no elements in common. Supplementary pattern elements must be added to provide overlap between these patterns. These augmented patterns can then be used to train a two-layer network to classify parallel lines together. This is one way of providing a competitive learning system with additional information.
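Rumelhart and Zipser's observation can be checked directly: as bit-maps, parallel lines at different positions have zero overlap (dot product), so a competitive filter matched to one gains nothing from exposure to the other. The 4 x 4 grid below is our own hypothetical illustration:

```python
import numpy as np

N = 4  # side of a toy binary image grid

def vline(col):
    """Bit-map of a vertical line occupying one column."""
    img = np.zeros((N, N))
    img[:, col] = 1.0
    return img.ravel()

def hline(row):
    """Bit-map of a horizontal line occupying one row."""
    img = np.zeros((N, N))
    img[row, :] = 1.0
    return img.ravel()

# Parallel lines at different positions share no active elements:
overlap_parallel = float(vline(0) @ vline(2))   # zero overlap
# ...while perpendicular lines actually overlap at their crossing pixel:
overlap_crossing = float(vline(0) @ hline(1))   # nonzero overlap
```

Note the irony: raw pattern overlap, if anything, favors grouping *crossing* lines rather than parallel ones, which is why supplementary elements (or, in staged assimilation, temporal persistence) must supply the missing overlap.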


Staged assimilation acquires this information from the temporal sequence in which images are received. To capture related sub-patterns, the system must first develop discriminant feature detectors for the variety of momentary patterns (instances), and then merge them to form classes.

Figure 2. Registering Related Patterns. [Figure: a stimulus sequence; the simple filters which match the patterns; the resulting class of lines.]

If temporal integration were applied upon reception by the simple filters, a set of smeared templates with poor discrimination would result. The first stage is used instead to register momentary patterns. The follow-on stage of complex filters can then integrate these responses over time to generalize from momentary to persistent properties. Assume that conventional techniques have already been applied to develop a set of discriminant filters for lines at a variety of orientations and locations [1]. The task then is to group parallel lines together into classes based on their orientation. Functionally, this requires individual C filters to register a response whenever a stimulus in their range of orientation occurs. To accomplish this, the C filters must become sensitive to only the responses of S filters registering lines in a single range of orientations.

The operation of the adaptive network is depicted in Figure 3. (a) The stimulus sequence is a line oriented at about 45° moving from the upper left to the lower right of the visual field. (b) A simple 3 x 4 element discriminant feature map is depicted. A set of filter templates has been set up to match lines at four different orientations and three positions. Filters matching lines with the same orientation are organized in columns to make the example easier to follow. (c) The response of this simple panel is given for each successive moment of stimulation. (d) Each complex filter receives an input from every cell in the simple panel. In this case the signal activation levels reaching the C filters persist over two time intervals. (e) A template for a complex filter is shown which detects this stimulus sequence of diagonally oriented lines. It could have been developed from scratch by repeated exposure (and response) to the stimulus sequence depicted.

In competitive learning systems, relationships between stimulus patterns are established by mutual overlap [1, p. 91]. It is this overlap that allows different instances to be grouped together. Temporal persistence gives rise to the necessary overlap of pattern for assimilation to occur.

Since only a small section of an image was considered, the range over which the invariance holds is limited. In the primary visual cortex, complex cells respond to oriented lines (or edges) over a wider range than their simple cell counterparts. But they too have fixed receptive fields limiting their range of invariance. This kind of learning is so direct that it may take place very rapidly in the primary visual cortex. The learning effect may occur so rapidly as to thwart attempts at measuring the pre-developed state in infant cats and primates, leading researchers to consider it an innate capability [14]. Notice that once the C cell learns to respond to the class of parallel lines, even an individual momentary line can be classified. (In the example above, even a momentary stimulation by a 45° line would excite this complex filter.) Although the system learns to perceive through exposure to continuously varying stimulus patterns, once developed it can deal with discontinuous imagery (e.g., tachistoscope experiments).
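A toy version of this learning process can be sketched in code. The sketch is our own construction, not the authors' implementation: it shrinks the panel to two orientations at three positions, models two-interval persistence with a decaying input trace, and uses a normalized Hebbian update.

```python
import numpy as np

# Toy staged assimilation: a simple panel of 2 orientations x 3 positions,
# one adaptive complex filter, and a decaying trace for temporal persistence.
rng = np.random.default_rng(0)
n_orient, n_pos = 2, 3
n_simple = n_orient * n_pos

def simple_response(orient, pos):
    """One-hot simple-panel response to a line at a given orientation/position."""
    r = np.zeros(n_simple)
    r[orient * n_pos + pos] = 1.0
    return r

w = rng.random(n_simple) * 0.01      # complex filter weights, near zero
decay, lr = 0.5, 0.2                 # persistence and learning-rate assumptions
for _ in range(20):                  # repeated exposure to the moving line
    trace = np.zeros(n_simple)
    for pos in range(n_pos):         # a diagonal line translating across the RF
        trace = decay * trace + simple_response(0, pos)
        y = w @ trace                # complex unit's response to persisting input
        w += lr * y * trace         # Hebbian: strengthen co-active inputs
    w /= np.linalg.norm(w)           # keep the weight vector bounded

# The complex filter now responds to ANY position of the trained orientation
# (even a momentary one), and only weakly to the other orientation.
resp_trained = [w @ simple_response(0, p) for p in range(n_pos)]
resp_other = [w @ simple_response(1, p) for p in range(n_pos)]
```

Because the trace makes successive simple responses overlap in time, the Hebbian update assimilates all three positions of the trained orientation into one equivalence class, while the untrained orientation's weights decay toward zero under normalization.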

Figure 3. Operation of the Adaptive Network. [Figure: (a) stimulus sequence; (b) simple filter panel (feature map); (c) simple panel responses; (d) persisting signals received by a complex filter; (e) complex filter for 45° edges.]


Ideally, a feature map self-organizes the detectors over its entire surface. Panels can only self-organize over local regions. This pattern of organization tends to repeat itself over large areas of the panel. Predefining the RFs for the cells in a panel builds a priori knowledge into the network. It structures and constrains what can be discovered by specifying the sets of stimuli to be combined. This delimits which stimuli are related to each other and which are not. The network design must take into account what sensory information is to be combined, and at what processing level this will occur. Signals are combined by feeding them into a common filter (set). Only then can the contents of each filter adapt to capture specific relationships.

The system learns through exposure to a visual environment undergoing continuous transformation. For this to happen there must be a correspondence between the RF layout and the general forms of continuity present in the environment. Proximity in space and time is one such characteristic of the environment incorporated into the network's design. A continuous transformation is one in which changes take place smoothly over space and time. Once a localized feature is generalized to represent an invariant property, it is combined with other information and fed to the next bi-panel stage.

Although this is a very elementary example, it demonstrates the basic mechanism and its ability to generalize from discriminant to invariant properties. A more sophisticated example has been developed which shows how a second bi-panel stage can perform rotation-invariant classification of simple shapes, but there is not room enough to present it here.

Processing Details
The system applies competitive learning [13] over an array of matched filters. When a number of filters fire (and adapt) in locally adjacent clusters, an organized feature map develops [15] over a local region. This occurs in both the S and C panels. S panels are discriminant feature maps, while C panels assimilate S filter responses to form invariant feature maps. In many neural network schemes, iterative negative feedback is used to determine the maximally activated (winning) cluster. Such a settling process occurring in the S panel introduces a secondary temporal effect that interferes with the integration of successive responses. Recently, however, non-iterative schemes have been suggested for computing winning clusters that are biologically plausible [16, 17].

Complex filters derive their response properties by combining the input from simple filters over successive time intervals. This can be accomplished by extending the duration of activation of either the C cell or its incoming signals (via the extended activation of afferent synapses). Either technique will do the job. The two effects are related by the temporal integration of post-synaptic potentials [18]. As it is often easier to visualize the process in terms of persisting input signals, this form is used in presenting examples. Staged assimilation can be accomplished using a variety of simple synaptic adjustment policies, including versions of Hebbian modification [19].

The two filter types have different temporal response properties. When this difference interacts with the adaptive mechanism, in response to structured dynamic stimulation, the distinct functional roles of the filters emerge. The S panel develops an array of discriminant filters. The C panel forms an array of filters which act as OR gates by combining discriminated instances into more general classes. The alternating arrays of conjunctive and disjunctive gating elements give the system a flexible means of encoding information.
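The claim above that extending the C cell's own activation and extending its incoming signals "will do the job" equally can be checked for a linear unit: persisting the inputs and then filtering gives the same output sequence as filtering each momentary frame and then persisting the activation. The weights, frames, and decay constant below are arbitrary illustrative choices:

```python
import numpy as np

w = np.array([0.2, 0.5, 0.3])        # a complex filter's weights
frames = [np.array([1.0, 0.0, 0.0]), # a sequence of momentary simple responses
          np.array([0.0, 1.0, 0.0]),
          np.array([0.0, 0.0, 1.0])]
decay = 0.6                          # persistence constant

# Mechanism 1: persist the incoming signals, then apply the filter.
trace = np.zeros(3)
out1 = []
for f in frames:
    trace = decay * trace + f        # extended activation of afferent signals
    out1.append(w @ trace)

# Mechanism 2: apply the filter to each momentary frame, then persist the
# cell's own activation.
act = 0.0
out2 = []
for f in frames:
    act = decay * act + w @ f        # extended activation of the C cell itself
    out2.append(act)

# By linearity, w @ (decay*trace + f) == decay*(w @ trace) + w @ f,
# so the two mechanisms produce the same response sequence.
```

This is the code-level form of the observation that the two effects are related by temporal integration of post-synaptic potentials; nonlinearities in a real cell would break the exact equality but not the qualitative equivalence.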

Learning Paradigms
Both supervised and unsupervised systems learn to classify patterns through exposure to a collection of sample patterns. In supervised learning the system is told the appropriate class label for each sample, whereas with unsupervised learning it is not [20]. The bi-panel architecture learns without supervision; instead it relies on the assumption that patterns received in succession are likely to be related to one another via some transformation. Thus staged assimilation is 'limited' to making only transformation-invariant classifications.

Extending unsupervised learning systems
Typically, unsupervised learning techniques work by dividing feature space up into convex regions containing samples which tend to cluster together (being nearer to each other than they are to other samples). Patterns are then classified in terms of which region they fall into. In the absence of additional information this is a reasonable way of grouping instances together. The observation made earlier that there must be overlap between patterns for grouping to occur corresponds to the requirement that two samples must be near each other in feature space to form a cluster. One can continue the process of dividing up feature space indefinitely, but there is usually little point in doing so. This is why you rarely find a number of unsupervised discriminant feature maps placed in succession. At some point it becomes more useful to start putting feature space back together again. Concave regions can be constructed by grouping several convex regions together. This is a much more general way of defining classifications.
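The convex-to-concave point can be made concrete with a nearest-centroid sketch. The centroids and class assignments below are hypothetical: nearest-centroid assignment carves feature space into convex (Voronoi) regions, and OR-ing two of those regions into one class yields a non-convex region.

```python
import numpy as np

# Three cluster centroids; nearest-centroid assignment partitions feature
# space into convex (Voronoi) regions, one per centroid.
centroids = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 1.0]])

def nearest(x):
    """Index of the closest centroid, i.e. which convex region x falls into."""
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

# Putting feature space back together: group the two outer convex regions
# into a single class "A". Their union is concave: it contains both outer
# centroids but not the midpoint between them, which falls in region "B".
class_of = {0: "A", 1: "A", 2: "B"}

p, q = np.array([0.0, 0.0]), np.array([4.0, 0.0])
mid = (p + q) / 2                    # midpoint of two class-A points
```

Here `p` and `q` both classify as "A" while `mid` classifies as "B", demonstrating that the grouped class occupies a non-convex region no single competitive cluster could express.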

Panel Organization: The panels of filter arrays used in this architecture are similar to feature maps. Both spread their response over a number of elements to develop a distributed representation of features. (This aspect of a panel was omitted in the prior example.) There is one difference between them, however: typically each element in a feature map is exposed to the entire stimulus pattern, but an individual element in a panel usually receives only part of the pattern. Each filter has a limited receptive field. In the early stages of vision, individual cells view and record only a small portion of the entire visual field. This decomposes the problem into a multitude of smaller chunks. This division of labor requires that the receptive fields of all cells in each processing stage be predefined.


The question then becomes: what is the basis for grouping things together? Often this is the point at which supervised learning is invoked, because the labelling information provided to make a classification is just what is needed. Conventional unsupervised learning systems are good at developing discrimination but weak at generalization. If concave regions are required, this technique must be assisted either by building in predefined grouping mechanisms (e.g., the non-adaptive complex cells of the neocognitron [26]) or by adding additional information to the input pattern [13]. When a competitive learning network is taught using an augmented training stimulus, it moves towards supervised learning.

Related Architectures

There are several prior networks that employ a 'feature map' paired with a second-level mechanism used to group together discriminant responses. They fall into the categories of unsupervised and supervised systems.

Fukushima's neocognitron: The bi-panel architecture has a lot in common with Fukushima's neocognitron [22, 23, 24]. This is not surprising, since both of them are modelled after the primary visual cortex [25, 26]. The two architectures are similar in that both 1) can learn without supervision, 2) apply alternating arrays of simple and complex cells, and 3) use cells with limited receptive fields. The bi-panel differs from the neocognitron in that it: 1) uses adaptive complex cells, 2) develops locally organized feature maps, and 3) does not use feedback signals. The neocognitron's capacity for translation-invariant recognition derives from its use of complex cells to gather related responses together and to feed the result to the next processing layer. But these complex cells are stereotypic; they all generalize in the same way. Staged assimilation incorporates adaptive complex cells into the system. This addition greatly increases the network's ability to generalize over classes of input patterns. It will be interesting to adaptively grow complex cells that generalize over a variety of transformations at different processing stages.

Supervised Networks: A number of other systems resemble the bi-panel architecture in applying an adaptive grouping mechanism at the second level of the network. Unlike the bi-panel architecture, however, they apply supervised learning at the second level to accomplish this. The Hierarchical Feature Map classifier [27] is a straightforward example of such a system. Nestor's Reduced Coulomb Energy (RCE) classifier [27, pp. 437-443] has a similar overall architecture. It is different in that its first (internal) level is not, strictly speaking, a feature map. This level is an array of discriminating cells that learn through supervision. The fact that this style of architecture can be used either with or without supervision indicates its power and flexibility.

A new Training technique: Supplying the network with a coherent image sequence provides a new way of entering information into a neural network. It requires that a temporally coherent visual environment be presented to the network, instead of a randomized ensemble of patterns. This is in accord with the way natural perceptual systems work. Information about relationships between features enters at the front of the network and propagates forward. This provides an alternative to the conventional supervised approach of inserting additional information at the output stage and propagating it back through the network. Since invariant detectors are derived via 'forward-propagation', it is straightforward to hook up a number of different bi-panel stages. The outputs of C filters are combined with information from other sources, including C filters from other areas, and fed into a follow-on stage. In this way the system derives higher-level invariant detectors for more abstract constructs.

Control over learning: Although this approach comes under the heading of unsupervised learning, the temporal sequencing of input patterns provides an additional, distinct channel for entering information into the system. This gives the researcher (or environment) a new means of directing, and thus controlling, the learning process.

Supervised learning: Supervised learning systems can be trained to group sets of distinct stimuli into classes. With the proper training they will learn to detect invariant features. The supervision dictates which classifications to make. The problem is that we do not know what to teach each stage of a network to allow it to perceive nontrivial visual environments. Something as complex as visual perception almost certainly must be dealt with a stage at a time. Putting a collection of static images in at the front end, and a list of the objects present in each image in at the back, is just not going to work. It is too little, too late. The advantage of staged assimilation is that it discovers its own set of invariant features through exposure to a continuous visual transformation (provided a framework exists for combining the right kinds of information at the right levels; see above).
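One simple way such temporal grouping could be realized is sketched below. This is an assumption-laden illustration, not the authors' exact staged-assimilation rule: low-level features that win on successive frames of a continuous transformation are merged into one equivalence class, playing the role of an adaptive complex cell.

```python
# Hedged sketch of learning from temporal coherence. Feature indices and
# sequences are hypothetical.

def group_by_succession(sequences):
    """Union features that occur on consecutive frames into one class."""
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            a = parent[a]
        return a

    def union(a, b):
        parent[find(a)] = find(b)

    for seq in sequences:                 # one sequence per coherent episode
        for prev, cur in zip(seq, seq[1:]):
            union(prev, cur)
    return {f: find(f) for seq in sequences for f in seq}

# A bar translating across the field activates features 0, 1, 2 in turn;
# after a scene change, features 7 and 8 alternate.
classes = group_by_succession([[0, 1, 2, 1, 0], [7, 8, 7]])
```

The class labels emerge from the temporal structure of the input alone; no teacher supplies them, which is the sense in which the sequencing itself carries the training information.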

Areas of Research

The application of this approach to vision can be extended in breadth by developing detectors selectively responsive to other visual properties. The properties of ocular dominance, spatial frequency, end-stopping, and color are immediate candidates. There is no reason why staged assimilation should be limited to visual processing. It is intended as a general paradigm for perceptual processing. The temporal aspects of hearing are even more critical, and the approach is likely to require extension to work effectively in this domain. The real test of the approach, however, is how well it performs as the depth of processing is increased. The architecture has the potential for dealing with multiple image transformations, because it only has to deal with one type of invariance per stage. Each processing stage develops a set of discriminant features and generalizes them over a single type of transformation before passing its result upstream for further analysis and synthesis. The outputs of different areas of complex filters are then combined and fed into the next S panel of the follow-on bi-panel stage. This next stage can deal with another class of invariant features. For example, the first stage might be trained to handle translation and the second stage rotational invariance. The way the information is combined must be defined by the system's architect. An additional source of guidance in formulating such designs may come from studies of the visual cortical tracts of organically evolved brains. The first multi-stage learning task that we are investigating is character recognition. There already exists a large body of neural network research in this area [22-27]. This prior research provides both insights into the design of network layouts and a basis for making comparisons.

Extensions

The alternating arrangement of simple and complex filters is too limited a model of the cortex. As mentioned earlier, a mix of hierarchical and parallel processing schemes coexists in the visual cortex. New pathways would allow the system to discover a greater range of relationships among different kinds of information. It may also be useful to feed the output directly from one C panel to another to seek out higher-level generalizations. The system can also be extended by adding a third type of cell to perform a temporal differencing operation. Such an element would be useful for analyzing motion and optical flow. The Y cells of the retina are thought to operate in this way [21, pp. 170-175].

References:

[1] von der Malsburg, C. (1973) Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, Vol. 14.
[2] Orban, G. (1984) Neuronal Operations in the Visual Cortex. Studies of Brain Function, Vol. 11. Springer-Verlag, NY.
[3] Rose, D., Dobson, V. G. (eds.) (1985) Models of the Visual Cortex. John Wiley and Sons.
[4] Gibson, J. J. (1986) The Ecological Approach to Visual Perception. Hillsdale, New Jersey: Lawrence Erlbaum Associates.
[5] Hubel, D. H., Wiesel, T. N. (1962) Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. (Lond.) 160, pp. 106-154.
[6] Movshon, J. A., Thompson, I. D., Tolhurst, D. J. (1978) Spatial summation in the receptive field of simple cells in the cat's striate cortex. J. Physiol. 283, pp. 53-77.
[7] Kulikowski, J. J., Murray, I. J. (1985) Contribution of striate cortical cells to pattern, movement and colour processing. In Rose, D., Dobson, V. G. (eds.), Models of the Visual Cortex. John Wiley and Sons, chpt. 25, p. 259.
[8] Schiller, P. H., Finlay, B. L., Volman, S. F. (1976) Quantitative studies of single cell properties in monkey striate cortex. J. Neurophysiology 39, p. 1365.
[9] Hagen, M. A. (1986) Varieties of Realism: Geometries of Representational Art. Cambridge University Press, p. 17.
[10] Barlow, H. B. (1953) Summation and inhibition in the frog's retina. J. Physiol. (Lond.) 119, pp. 69-88.
[11] Barlow, H. B. (1972) Single units and sensation: a neuron doctrine for perceptual psychology? Perception 1, pp. 371-394.
[12] Hubel, D. H. (November 1963) The visual cortex of the brain. Scientific American.
[13] Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986) Parallel Distributed Processing, Vol. 1. The MIT Press, chpt. 5.
[14] Kuffler, S. W., Nichols, J. G. (1976) From Neuron to Brain. Sinauer Associates, Inc., chpt. 19.
[15] Kohonen, T. (1987) Self-Organization and Associative Memory, second edition. Springer-Verlag, New York.
[16] Kurogi, S. (1987) A model of a neural network for spatiotemporal pattern recognition. Biol. Cybernetics 57, pp. 103-114.
[17] Templeman, J. (July 1988) Race networks: a theory of competitive recognition networks. IEEE ICNN Proceedings, Vol. 2, pp. 9-16.
[18] Kandel, E. R., Schwartz, J. H. (1981) Principles of Neural Science. Elsevier North Holland, Inc., chpt. 7, pp. 78-79.
[19] Hebb, D. (1949) The Organization of Behavior. Wiley, NY.
[20] Duda, R. O., Hart, P. E. (1973) Pattern Classification and Scene Analysis. New York: Wiley, p. 45.
[21] Marr, D. (1982) Vision. San Francisco: Freeman.
[22] Fukushima, K. (1988) Neocognitron: a hierarchical neural network capable of visual pattern recognition. Neural Networks, Vol. 1, No. 2, pp. 119-130.
[23] Fukushima, K. (March 1988) A neural network for visual pattern recognition. Computer, Vol. 21, No. 3, pp. 65-75.
[24] Fukushima, K. (1986) A neural network for selective attention in visual pattern recognition. Biol. Cybernetics, Vol. 55, pp. 5-15.
[25] Fukushima, K., Miyake, S., Ito, T. (1983) Neocognitron: a neural network model for a mechanism of visual pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics, Vol. 13, No. 5, pp. 826-834.
[26] Fukushima, K., Miyake, S. (1982) Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition. In S. Levin (ed.), Competition and Cooperation in Neural Networks, Lecture Notes in Biomathematics, Vol. 45. Springer-Verlag, chpt. 18, pp. 267-285.
[27] DARPA Neural Network Study Group (1988) DARPA Neural Network Study. AFCEA International Press, Fairfax, Virginia, pp. 90-91.

Summary

Multiple stages of discriminant and invariant feature detectors are generated by exposing a network architecture to a dynamic, structured environment. The tiered arrays of filters both abstract and represent sensory information discovered in this way. This new capability holds promise for future perceptual processing systems. How generally useful is this technique? Only time will tell.

Acknowledgements

I (J.T.) would like to thank Prof. Peter Bock for developing my interest in machine learning, Prof. David Atkins for teaching me about neurobiology, Prof. Margaret Hagen for presenting J. J. Gibson's views to me, Dr. Harold Szu for his thoughtful and enthusiastic review of the ideas presented here, and Ms. Helen Cobb for the many interesting discussions that set the stage for these ideas.
