news and views

So many pixels, so little time

James A Mazer

The author is at the Department of Neurobiology, Yale School of Medicine, POB 208001, New Haven, Connecticut 06520-8001, USA. e-mail: [email protected]

We effortlessly classify and identify objects hidden in the silver grains of photographs and the discrete pixels that appear on our computer screens, but the apparent simplicity of visual recognition belies the underlying computational complexity that is inherent in high-level vision. Even for a tiny 10 × 10 pixel patch of a typical computer screen, the number of possible binary black and white images that can be displayed (>10^30) is staggeringly large, making it virtually inconceivable that our fast and effortless ability to perform high-level image recognition is based on an explicit pixel- or photoreceptor-based representation. So how are real visual objects represented in the brain? A report by Yamane et al.1 in this issue provides new insight into the nature of the representation of complex visual objects in macaque inferotemporal cortex. The authors used a new experimental approach to characterize the selectivity of inferotemporal neurons for a set of three-dimensional object primitives that can be combined in various ways to form real three-dimensional objects. This study may provide a critical missing link between the pixel-like representations found in the earliest stages of visual processing and the high-level object selectivity found in the anterior portions of the temporal lobe in humans and non-human primates.

The primate visual system is composed of 25–40 distinct areas2, depending on how they are counted. Since the onset of modern neurophysiological studies of the visual system, neuroscientists have recognized that the complexity of visual representations systematically increases as you ascend the visual hierarchy2,3.
This is perhaps best illustrated by the progression of neuronal response properties as you move from neurons in the retina and thalamus (sensitive to spots of light and darkness) to those in primary visual cortex (selective for stimulus orientation, disparity and direction) and eventually to inferotemporal areas (face- and object-selective neurons).


© 2008 Nature Publishing Group http://www.nature.com/natureneuroscience

Previous work has focused on neuronal encoding of two-dimensional shapes. Using a new search algorithm and three-dimensional object primitives, a study in this issue identifies potential subunits of complex object recognition.


Figure 1 Range of stimuli used to characterize cortical visual representations. Stimuli are arranged from left to right in order of increasing complexity. Stimuli in the upper row are ‘single component’ stimuli and those in the lower row contain two or more components designed to simultaneously engage multiple sensory channels. Each column represents a specific experimental approach used to identify certain aspects of the neural representation of complex shapes. (a–d) Application of light and dark spots revealed center-surround organization in the retina and thalamus10. Sinusoidal gratings (shown here) and oriented bars were critical for recognizing the importance of edge orientation in primary visual cortex3,11. (e) Responses to non-Cartesian stimuli revealed evidence of nonlinear tuning for orientation combinations related to surface curvature in area V4 (ref. 12). (f) Synthetic stimuli used by Yamane et al.1 were composed of three-dimensional component features embedded in an elliptical solid. (g) A subpopulation of neurons in anterior inferotemporal cortex, often thought to represent the pinnacle of visual selectivity, is highly selective for faces and other complex visual objects13.

Hubel and Wiesel’s3 early intuition was that the overwhelmingly complex pixel-like representation of the visual field observed in the retina and thalamus is transformed by cortical neurons into a more computationally tractable map of oriented edges. This fundamental insight into the nature of visual processing continues to shape how modern neuroscientists study vision. Their seminal studies suggested that thalamocortical projection neurons with pixel-like center-surround opponency could be summed by a simple biological circuit to generate the cortical neurons with elongated, oriented receptive fields that were observed experimentally. These neurons responded preferentially to an edge of appropriate orientation falling in their receptive field.

Since the 1960s, visual physiologists have used a variety of experimental and computational tools in an effort to identify the visual primitives (such as spots and edges) that are used by different stations in the visual hierarchy to represent the multitude of features that can appear in the visual scene during natural vision (Fig. 1). What is often not recognized is that receptive field mapping (that is, identifying the specific stimuli or class of stimuli that visual neurons are tuned for) and recognizing the corresponding underlying organizational principles become more difficult as you ascend the visual hierarchy. In the early stages of visual processing, receptive fields can typically be mapped using spots of light. However, in higher areas, where neurons are generally more selective and show marked nonlinear tuning properties, simple stimuli, such as spots, may fail to elicit the spiking activity that is required to estimate tuning from extracellular recordings. If stimuli are too simple, they may fail to drive highly selective neurons (or provide little new insight). On the other hand, if stimuli are too complex, such as white noise or even natural scenes, the number of possible stimuli may become so large as to preclude obtaining enough data


to characterize highly nonlinear neurons. Therefore, selection of an appropriate stimulus subspace4, on the basis of previous studies, theories or even intuition, is a critical step in the experimental process.

The study by Yamane et al.1 combines an on-line evolutionary learning algorithm with a set of three-dimensional visual primitives to efficiently characterize the spatial and feature selectivity of high-level visual neurons in macaque inferotemporal cortex. Their results reveal that inferotemporal neurons are both highly tuned for specific three-dimensional structural primitives and strongly influenced by the global context in which these primitives appear. This result represents a crucial step in understanding how the brain encodes the wide variety of complex objects and shapes that can appear in the visual field during natural vision.

There are several interesting aspects of this line of research. The first is related to the stimuli themselves. In this study, the authors devised a stimulus set that was based on a three-dimensional generalization of two-dimensional stimuli that were used previously to study contour integration in V4 (ref. 5) and inferotemporal cortex6. Starting from a simple three-dimensional elliptical solid, the authors systematically introduced shape distortions (ridges, grooves, bumps and dimples) at various points on the surface to generate a parametric family of three-dimensional shapes (see Fig. 1f). These stimuli were then presented to inferotemporal neurons using realistic lighting and disparity cues to provide a new characterization of neuronal selectivity for three-dimensional object shape. A critical property of this stimulus set is that each ridge, groove, bump and dimple is an independent structural primitive, meaning that each stimulus was composed of multiple three-dimensional primitives or parts.
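One way to picture this part-based stimulus design is as a base solid carrying a list of independent surface perturbations. The sketch below is a plausible reading of that scheme, not the authors' actual parameterization; every field name and value here is an illustrative assumption.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical names for the four perturbation types described in the text.
PRIMITIVE_KINDS = ("ridge", "groove", "bump", "dimple")

@dataclass
class SurfacePrimitive:
    """One independent structural primitive placed on the object surface."""
    kind: str                      # one of PRIMITIVE_KINDS
    position: Tuple[float, float]  # (azimuth, elevation) on the solid, radians
    orientation: float             # in-surface orientation, radians
    amplitude: float               # height (+) or depth (-) of the perturbation

@dataclass
class Stimulus:
    """A base elliptical solid plus several independent primitives."""
    axes: Tuple[float, float, float]              # ellipsoid semi-axes
    parts: List[SurfacePrimitive] = field(default_factory=list)

# Each stimulus combines multiple parts, so responses can later be tested for
# dependence on neighboring primitives (e.g. a dimple next to a ridge).
stim = Stimulus(axes=(1.0, 0.7, 0.5),
                parts=[SurfacePrimitive("dimple", (0.2, 0.1), 0.0, -0.15),
                       SurfacePrimitive("ridge",  (0.5, 0.1), 1.2,  0.20)])
print([p.kind for p in stim.parts])  # ['dimple', 'ridge']
```

Because each part is specified independently, the same description supports both questions the authors asked: tuning for a primitive in isolation, and modulation by its neighbors.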
This allowed the authors to characterize both tuning for the primitives themselves and the degree to which responses to a preferred primitive were modulated by neighboring primitives (for example, was the response to a dimple different when it appeared next to a ridge or a groove?).

Implicit in many of our models of sensory processing is the idea that sensory neurons can be modeled as linear or quasi-linear filters or feature detectors. This seductively simple idea means that responses to component stimuli presented in isolation can be used to predict responses to complex combination stimuli, markedly simplifying the task of characterizing selectivity. Knowing the response of a linear cell to a set of properly chosen components (Fig. 1a,b), we should be able to predict the response of that cell to a pair of spots (Fig. 1c) or a pair of gratings (Fig. 1d) as the sum of the component responses. This model has been applied (implicitly or explicitly) to the study of virtually all sensory systems (vision7, audition8 and somatosensation9) with varying degrees of success. The attraction of the linear model becomes clear if we again consider the case of a 10 × 10 pixel image. When studying a linear neuron, it is sufficient to measure responses to each of the 100 pixels presented in isolation (or, more precisely, responses to 100 light and 100 dark isolated pixels), and with that information we could predict the cell's response to any of the more than 10^30 possible 10 × 10 images. Although this model works well for some neurons, it typically starts to fail beyond striate cortex. Therefore, it is critical that physiological studies of high-level shape selectivity include combinations of component stimuli to assess and quantify nonlinearities in neuronal tuning. By incorporating multiple primitives into each test stimulus, Yamane et al.1 were able to apply a simple, yet elegant, regression approach to test the linearity of responses to three-dimensional component stimuli and found significant nonlinear interactions that could be well fit by a second-order model. This result reveals that inferotemporal neurons are selective for both component features and the specific spatial relationship of each feature relative to other features in the local neighborhood.

Finally, the authors devised an evolutionary learning algorithm to efficiently explore the highly complex space defined by their three-dimensional primitives.
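The skeleton of such an adaptive, response-guided search is simple: show stimuli, rank them by response, keep a subset spanning the response range, mix in fresh random stimuli and repeat (up to ten generations in the actual experiment). The toy sketch below follows that skeleton; the population size, keeper count and the stand-in "neuron" are all illustrative assumptions, not the authors' values.

```python
import random

def evolve_stimuli(respond, random_stimulus, pop_size=40, n_keep=10,
                   n_generations=10):
    """Toy version of an adaptive, response-guided stimulus search.

    respond(s) stands in for the neuron's firing rate to stimulus s;
    random_stimulus() draws a stimulus the cell has not seen before.
    """
    population = [random_stimulus() for _ in range(pop_size)]
    for _ in range(n_generations):
        ranked = sorted(population, key=respond, reverse=True)
        # Keep stimuli spanning the full range of response levels, not just
        # the best ones, so tuning profiles can be estimated afterwards.
        step = max(1, len(ranked) // n_keep)
        keepers = ranked[::step][:n_keep]
        # Next generation: keepers plus brand-new random stimuli.
        population = keepers + [random_stimulus()
                                for _ in range(pop_size - len(keepers))]
    return max(population, key=respond)

# Toy usage: stimuli are scalars and the simulated "neuron" prefers values
# near 3; over generations the search homes in on the preferred stimulus.
random.seed(0)
best = evolve_stimuli(lambda s: -abs(s - 3.0),
                      lambda: random.uniform(-10, 10))
print(best)
```

The design choice of retaining weak and intermediate stimuli, rather than only the best responders, is what distinguishes this from a pure hill-climbing search.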
They measured each cell's response to an initial set of stimuli, ranked those stimuli on the basis of response level and selected a subset of ‘keepers’ (spanning the range of response rates), which were combined with a new set of random stimuli that had not been shown to the cell previously. This second generation, containing both new and old stimuli, was then shown to the cell and the entire procedure was repeated up to ten times. This adaptive strategy, which uses the cell's own responses to guide stimulus selection, allowed for fast and efficient identification of optimal stimuli for most inferotemporal neurons in a stimulus space that is far too large to explore by measuring conventional tuning curves for each stimulus parameter. Interestingly, because each generation contained good, bad and intermediate

stimuli from the previous generation, the authors were able to compute robust tuning profiles on the basis of responses to the generational stimuli, something not typically possible using conventional gradient descent methods that explore only a relatively narrow trajectory through the stimulus-response space. These tuning profiles confirmed that inferotemporal neurons are highly selective for component features located at specific positions in specific orientations relative to the global object and neighboring features. This high degree of selectivity was driven by the full set of three-dimensional structural cues; far less tuning was observed when disparity and shading cues were removed, leaving only two-dimensional contour information.

These results build on a long history of attempts at identifying the geometric primitives used to encode three-dimensional shapes in the brain. The results of this study suggest a set of three-dimensional object primitives that can be readily combined to localize and identify real three-dimensional objects in the environment. To use the language of linear systems, the ridges, grooves, bumps and dimples that comprise Yamane et al.'s1 stimuli could be the basis functions underlying our ability to classify and recognize specific three-dimensional objects. These results extend a general principle observed in early visual areas: the representation of complex objects in the brain is systematically constructed through linear and nonlinear interactions between simpler feature detectors, resulting in an overlapping code of both feature type and feature position in the context of the larger visual object or even the visual scene.

1. Yamane, Y., Carlson, E.T., Bowman, K.C., Wang, Z. & Connor, C.E. Nat. Neurosci. 11, 1352–1360 (2008).
2. Felleman, D.J. & Van Essen, D.C. Cereb. Cortex 1, 1–47 (1991).
3. Hubel, D.H. & Wiesel, T.N. J. Physiol. (Lond.) 160, 106–154 (1962).
4. Ringach, D.L., Sapiro, G. & Shapley, R. Vision Res. 37, 2455–2464 (1997).
5. Pasupathy, A. & Connor, C.E. J. Neurophysiol. 82, 2490–2502 (1999).
6. Brincat, S.L. & Connor, C.E. Neuron 49, 17–24 (2006).
7. Jones, J. & Palmer, L. J. Neurophysiol. 58, 1233–1258 (1987).
8. DeBoer, E. & Kuyper, P. IEEE Trans. Biomed. Eng. 15, 159–179 (1968).
9. DiCarlo, J.J., Johnson, K.O. & Hsiao, S.S. J. Neurosci. 18, 2626–2645 (1998).
10. Kuffler, S.W. J. Neurophysiol. 16, 37–68 (1953).
11. Enroth-Cugell, C. & Robson, J.G. J. Physiol. (Lond.) 187, 517–552 (1966).
12. Gallant, J.L., Braun, J. & Van Essen, D.C. Science 259, 100–103 (1993).
13. Desimone, R., Albright, T.D., Gross, C.G. & Bruce, C. J. Neurosci. 4, 2051–2062 (1984).

volume 11 | number 11 | NOVEMBER 2008 nature neuroscience