Acoustic Tactile Representation of Visual Information

NORTHWESTERN UNIVERSITY

Acoustic Tactile Representation of Visual Information

A DISSERTATION SUBMITTED TO THE GRADUATE SCHOOL IN PARTIAL FULFILLMENT OF THE REQUIREMENTS for the degree DOCTOR OF PHILOSOPHY

Field of Electrical Engineering and Computer Science

By Pubudu Madhawa Silva

EVANSTON, ILLINOIS August 2014


© Copyright by Pubudu Madhawa Silva 2014

All Rights Reserved


ABSTRACT

Acoustic Tactile Representation of Visual Information

Pubudu Madhawa Silva

Our goal is to explore the use of hearing and touch to convey graphical and pictorial information to visually impaired people. Our focus is on the dynamic, interactive display of visual information using existing, widely available devices, such as smart phones and tablets with touch-sensitive screens. We propose a new approach for the acoustic-tactile representation of visual signals that can be implemented on a touch screen and allows the user to actively explore a two-dimensional layout consisting of one or more objects with a finger or a stylus while listening to auditory feedback via stereo headphones. The proposed approach is acoustic-tactile because sound is used as the primary source of information for object localization and identification, while touch is used for pointing and kinesthetic feedback. A static overlay of raised-dot tactile patterns can also be added. A key distinguishing feature of the proposed approach is the use of spatial sound (directional and distance cues) to facilitate the active exploration of the layout. We consider a variety of configurations for acoustic-tactile rendering of object size, shape, identity, and location, as well as for the overall perception of simple layouts and scenes. While our primary goal is to explore the fundamental capabilities and limitations of representing visual information in acoustic-tactile form, we also consider a number of relatively simple configurations that can be tied to specific applications. In particular, we consider a simple scene layout consisting of objects in a linear arrangement, each with a distinct tapping sound, which we compare to a "virtual cane." We also present a configuration that can convey a "Venn diagram." We present systematic subjective experiments to evaluate the effectiveness of the proposed display for shape perception, object identification and localization, and 2-D layout perception, as well as the applications. Our experiments were conducted with visually blocked subjects. The results are evaluated in terms of accuracy and speed, and they demonstrate the advantages of spatial sound for guiding the scanning finger or pointer in shape perception, object localization, and layout exploration. We show that these advantages increase with the amount of detail (smaller object size) in the display.


Our experimental results show that the proposed system outperforms the state of the art in shape perception, including variable friction displays. We also demonstrate that, even though they are currently available only as static overlays, raised-dot patterns provide the best shape rendition in terms of both accuracy and speed. Our experiments with layout rendering and perception demonstrate that simultaneous representation of objects, using the most effective approaches for directionality and distance rendering, approaches the optimal performance level provided by visual layout perception. Finally, experiments with the virtual cane and Venn diagram configurations demonstrate that the proposed techniques can be used effectively in simple but nontrivial real-world applications. One of the most important conclusions of our experiments is that there is a clear performance gap between experienced and inexperienced subjects, which indicates that there is considerable room for improvement with appropriate and extensive training. By exploring a wide variety of design alternatives and focusing on different aspects of acoustic-tactile interfaces, our results offer many valuable insights and great promise for the design of future systematic tests with visually impaired and visually blocked subjects, utilizing the most effective configurations.


Acknowledgements

I would like to thank my advisor and committee chair, Prof. Thrasyvoulos N. Pappas, for all the guidance, encouragement, and valuable insights. I am grateful for his consistent support and belief in me and this research, and for his assistance in writing this dissertation. I would also like to thank Prof. James E. West, who was my co-advisor from the beginning of this research. I also had two very productive summer internships at the early stages of this research in his laboratory at Johns Hopkins University. I am also grateful to Prof. William M. Hartmann for his valuable research advice, especially for his feedback and input on this dissertation, and for letting me use his facilities for measurements and calibration. I really appreciate Karen L. Gourgey's unique insights, which were invaluable in developing the acoustic-tactile display. I also had the privilege to work with Prof. West's student, Joshua Atkins, who co-authored a couple of papers with me and contributed to the initial stages of this research. I would also like to thank Prof. Pantelis N. Vassilakis for his advice on my research, James D. Johnston for his valuable advice on sound design, and especially for his guidance in the implementation of artificial reverberation, and Profs. John E. Tumblin, Bryan A. Pardo, and Huib de Ridder for their feedback in the various stages of this research. I am very grateful to my whole family for their persistent support, understanding, and encouragement throughout the last six years. I especially acknowledge my brother, Sanjeewa Anupama Silva, for his valuable feedback and useful brainstorming sessions from the very beginning of this research. Last but not least, I greatly appreciate the time and effort of all of my unpaid volunteer subjects, most of whom are my family, friends, lab mates, and roommates.


Flowers and fruits are admired by all... while the roots which feed them go unnoticed under the ground.

To my late aunt & father, my two role models, the reason, the cause & the blessing. May they attain the bliss of the noble nibbana.
To my dear mother, the guiding light of my path.
To my only brother, who made me proud of myself by being proud of me.


Table of Contents

ABSTRACT
Acknowledgements
List of Tables
List of Figures
Chapter 1. Introduction
  1.1. Contributions
Chapter 2. Background
  2.1. Categorization of Visual Substitution Techniques
  2.2. Existing Visual Substitution Techniques
Chapter 3. Acoustic-Tactile Display
  3.1. Basic Setup
  3.2. Key Challenges
  3.3. Key Tasks and Performance Metrics
  3.4. Acoustic Signal Design
Chapter 4. Proposed Configurations
  4.1. Shape Configurations
  4.2. Navigation Configurations (Object Localization)
  4.3. Layout Configurations
  4.4. Special Configurations
Chapter 5. Subjective Experiments
  5.1. Subjects
  5.2. General procedure
  5.3. Equipment and Materials
  5.4. Shape Perception Experiments
  5.5. Navigation Experiments
  5.6. 2-D Layout Perception Experiments
  5.7. Experiments with Special Configurations
Chapter 6. Results and Discussion
  6.1. Shape Perception
  6.2. Navigation Experiments
  6.3. 2-D Layout Perception Experiments
  6.4. Special Applications
  6.5. Conclusions
Chapter 7. Conclusions and Future Research
References


List of Tables

6.1 Configuration summary
6.2 First set of experiments: Accuracy and time averaged over all subjects
6.3 Second set of experiments: Average accuracy, timing, and difficulty for all 11 subjects (Sq: square, Cir: circle, Tri: triangle)
6.4 Second set of experiments: Average accuracy, timing, and difficulty for 6 of the 11 subjects (Sq: square, Cir: circle, Tri: triangle)
6.5 Trial time statistics for Navigation Experiment 2 (FO: fixed orientation, TO: trajectory orientation)
6.6 Trial time statistics: comparison of guided vs. unguided and large vs. small object size
6.7 Results of layout experiments
6.8 Results of experiments with A1-cane and A2-cane
6.9 Results of the experiment A3-venn (in relative locations, "A to B" means A relative to B)


List of Figures

1.1 Visual to acoustic-tactile mapping. (a) Original color image (b) Segmentation (c) Acoustic-tactile display
1.2 Simple shapes
1.3 Simple scene ("virtual cane")
3.1 Playback volume variation of tone and white noise with distance from the center of the object to the scanning finger
3.2 Variation of SPL and volume with distance. Maximum distance is assumed to be 1000 pixels
3.3 Playback signals (tone and white noise) for the three tempos used in our experiments
3.4 Tremolo
3.5 High-pass filter: first-order IIR with the pole at 18 kHz
3.6 Implementation of the allpass-comb reverberation system
3.7 Comb filter
3.8 Generating spatialized reverberant sound
4.1 Training object in the touch screen for different configurations
4.2 Two possible finger scanning paths inside a border strip
4.3 Training object in the touch screen for Configuration S3-2trem
4.4 Localization of a single object on the touch screen
4.5 Simplified 2-D layout for configurations L1-ung and L2-seq
4.6 Configuration L2-seq: Sequential exploration of a 2-D layout
4.7 Configuration L3-vor: Simultaneous presentation of objects in a 2-D layout
4.8 Test scenes for "virtual cane" configurations (a) A1-cane and (b) A2-cane (green: wood, blue: glass, red: metal)
4.9 Configuration A3-venn: Venn diagram
4.10 2-D layout with constraints (A4); flow chart (A5); circuit (A6); US map (A7)
4.11 Long cane (A8); campus map (A9); photograph (A10)
5.1 Layout for the sequential exploration experiment with L2-seq (blue: 452 Hz, green: 652 Hz, orange: 852 Hz, and red: 1052 Hz)
5.2 Layout for the simultaneous exploration experiment with L3-vor (blue: bongo roll, green: trumpet, red: oboe, and black: clarinet)
5.3 Training layout for the simultaneous exploration experiment with L3-vor (blue: bongo roll, green: trumpet, and red: oboe)
5.4 Training layout for the unguided (blue: bongo roll, green: trumpet, and red: oboe) and visual exploration experiments
5.5 Layout for the unguided (blue: bongo roll, green: trumpet, red: oboe, and black: clarinet) and visual exploration experiments
5.6 Layout for the Venn diagram experiment (green: bongo roll, blue: trumpet, red: oboe)
5.7 Training layout for the Venn diagram experiment (green: bongo roll, blue: trumpet, red: oboe)
6.1 First set of experiments: Accuracy for each shape for S1-2cons, S2-3cons, S3-2trem, and S5-hrtf (S: square, C: circle, and T: triangle)
6.2 First set of experiments: Time distribution for each shape of S1-2cons, S2-3cons, S3-2trem, and S5-hrtf (S: square, C: circle, and T: triangle)
6.3 Selected subject drawings for Configurations S1-2cons, S2-3cons, S3-2trem, S4-3int, and S5-hrtf (s21A: Subject 21, first set of experiments; s11B: Subject 11, second set of experiments)
6.4 First set of experiments: Accuracy for each subject for S1-2cons, S2-3cons, S3-2trem, and S5-hrtf, averaged over configurations and shapes
6.5 First set of experiments: Average time and standard deviation for each subject for S1-2cons, S2-3cons, S3-2trem, and S5-hrtf, averaged over configurations and shapes
6.6 First set of experiments: Average time vs. average accuracy for each subject in S1-2cons, S2-3cons, S3-2trem, and S5-hrtf (points labeled with subject number)
6.7 First set of experiments: Selected subject drawings, all marked wrong, in response to the circle stimulus in S1-2cons, S2-3cons, S3-2trem, and S5-hrtf
6.8 Second set of experiments: Accuracy for each shape for S2-3cons, S4-3int, and S5-hrtf (S: square, C: circle, and T: triangle)
6.9 Second set of experiments: Time distribution for each shape of S2-3cons, S4-3int, and S5-hrtf (S: square, C: circle, and T: triangle)
6.10 Results of Navigation Experiment 1: Trial time distribution for each subject
6.11 Results of Navigation Experiment 2 with bongo: Trial time distribution for each subject
6.12 Results of Navigation Experiment 2 with trumpet: Trial time distribution for each subject
6.13 Results of Navigation Experiment 3, unguided background: Trial time distribution for each subject
6.14 Results of Navigation Experiment 4: Trial time distribution for each subject (note factor of 6 scale difference in y-axes)
6.15 Results of the experiment with L2-seq (blue: 452 Hz, green: 652 Hz, orange: 852 Hz, and red: 1052 Hz)
6.16 Results of the experiment with L3-vor (blue: bongo roll, green: trumpet, red: oboe, and black: clarinet)
6.17 Error distribution for each subject in the experiment with L3-vor (Sub01* and Sub03* show the errors after the mix-ups have been corrected)
6.18 Results of the experiment with L1-ung (blue: bongo roll, green: trumpet, red: oboe, and black: clarinet)
6.19 Results of the experiment with visual layout
6.20 Selected subject drawings in experiments with A1-cane: first row marked correct, second and third marked wrong (A1-1: first object in A1-cane)


CHAPTER 1

Introduction

The world is increasingly dominated by multimedia technology for communication, commerce, entertainment, art, education, and medicine. Since modern electronic media are rich in graphical and pictorial information, it has been hard for the visually impaired (VI) community to keep up. The goal of this thesis is to explore the use of hearing and touch for conveying graphical and pictorial information to VI people using existing, widely available devices, such as smart phones and tablets with touch-sensitive screens. Existing assistive techniques, such as Braille and text-to-speech, can be used to convey textual information, while raised-line drawings and other tactile patterns can be used to convey graphical information in textbooks, maps, and other documents [1]. However, the most effective of the existing tactile representations of graphical information are static, and cannot handle the interactive nature and continually changing content of electronic media and the Internet. In contrast, the focus of this thesis is on the interactive (and hence dynamic) display of visual information via hearing and touch, using existing devices.

The advantage of using hearing and touch to convey visual information, which falls under the banner of visual substitution (VS), is that it can substantially increase the amount of information that is available to the VI without the need for invasive approaches, which typically require surgery, to restore some degree of functional vision by stimulating the visual cortex, e.g., as in the cortical or retinal electrode matrix displays for partial restoration of vision [2, 3]. Moreover, some noninvasive VS approaches that rely on the presentation of electrical and other tactile stimuli on body parts like the tongue, back, abdomen, or forehead [4–6] are also quite objectionable. For example, the tongue display consists of an array of electrodes that applies voltages to stimulate the tongue [5]. While such displays can be quite effective for certain visual tasks, a large part of the VI community finds them quite "invasive," and prefers to scan/explore with the finger [7]. In addition, finger scanning provides valuable kinesthetic feedback. Accordingly, active exploration, which parallels the use of the fingertips for reading Braille and the long cane for exploring the immediate surroundings, is a key focus of this thesis.

The wide availability of dynamic tactile sensing devices (tablets, tablet PCs, cell phones with touch screens) enables the presentation of dynamic acoustic signals in response to finger movements. Thus, an object can be represented as a region on a touch screen associated with a characteristic sound.



Figure 1.1. Visual to acoustic-tactile mapping. (a) Original color image (b) Segmentation (c) Acoustic-tactile display

A two-dimensional (2-D) layout (map, diagram, graph, chart) consisting of one or more objects can be represented by partitioning the touch screen into regions, each with a particular sound field, representing an object, part of an object, background, or other element of a graphical display or visual scene. As we will discuss below, a static overlay of raised-dot tactile patterns will also be briefly considered for object representation, but the primary display will be acoustic. The user can then explore the 2-D layout by moving her/his finger (or a stylus) on the touch screen and listening to the acoustic feedback played back on stereo headphones. The active exploration and kinesthetic feedback are critical for building a mental picture of the layout. Since we will rely on sound as the primary source of information for object localization and identification and on touch for pointing and kinesthetic feedback, we refer to the proposed approach as acoustic-tactile.

The ultimate goal of the proposed approach would be to use a still or video camera to capture a scene, and then to translate it into acoustic-tactile form. However, due to the limited spatial resolution of touch [8] and hearing [9], we do not believe that it will be possible to display all the visual detail. Moreover, the direct translation of visual signals into acoustic and tactile signals cannot be done in an intuitive way, e.g., image intensity to sound intensity, frequency, or another attribute. Thus, the visual to acoustic-tactile translation will have to be based on image segmentation into perceptually uniform regions [10, 11] and then mapping of each segment (based on its features or semantics) into a distinct acoustic signal, tactile pattern, or a combination thereof, as shown in Fig. 1.1. Such a representation could provide key information about the location, shape, and identity of the key objects in the scene, while eliminating unnecessary detail. Of course, this approach would require image analysis to produce a region-based representation. Instead, we will assume that a semantic representation (e.g., as in maps and graphics) is available, and we will focus on conveying simple shapes and layouts, such as those shown in Figure 1.2, and simple scenes, like the one shown in Figure 1.3.

The task of conveying a simple layout or scene can be broken down into rendering the size, shape, identity, and location of each object (or part thereof) in the layout or scene. For object localization and boundary tracing, we make use of spatial sound. This is in contrast to existing devices like the Talking Tactile Tablet (TTT) [12], where the acoustic signal consists of speech that explains the tactile pattern (typically a block diagram) explored with the finger.
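To make the mapping of Fig. 1.1 concrete, the following minimal sketch (not the implementation used in this thesis) shows how a precomputed segmentation label map can be associated with per-region sounds and queried at the finger position. The label values, region shapes, and sound file names are purely illustrative assumptions.

```python
# Minimal sketch of the Fig. 1.1 pipeline: a segmentation label map assigns
# every screen pixel to a region, and each region label is associated with a
# characteristic (not necessarily realistic) sound.  All names below
# (label_map, SEGMENT_SOUNDS, the .wav files) are assumptions.
import numpy as np

# Label map produced by a segmentation step: one integer label per pixel.
# 0 = background, 1 = sky, 2 = building, 3 = grass (hypothetical labels).
label_map = np.zeros((768, 1024), dtype=np.int32)
label_map[0:300, :] = 1          # sky region
label_map[300:600, 200:700] = 2  # building region
label_map[600:768, :] = 3        # grass region

SEGMENT_SOUNDS = {
    0: "silence.wav",
    1: "wind.wav",
    2: "knock_on_glass.wav",     # semantic mapping: glass -> knocking sound
    3: "soft_music.wav",         # arbitrary but intuitive mapping
}

def sound_at(x, y):
    """Return the sound associated with the region under the finger."""
    label = int(label_map[y, x])
    return SEGMENT_SOUNDS[label]

print(sound_at(500, 400))  # finger over the building region -> "knock_on_glass.wav"
```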


Figure 1.2. Simple shapes

Figure 1.3. Simple scene (“virtual cane”)

Spatial sound provides directionality and distance cues. The former are based on spherical head models or head-related transfer functions (HRTFs), and the latter on variations of loudness and tempo. We also add reverberation for better directionality and loudness perception, as well as a more realistic interface. For object identification, we generate perceptually distinct sounds. This can be done by utilizing acoustic signals with different physical (frequency, interaural phase, spectral composition, intensity) and perceptual/psychophysical (pitch, location, timbre, and loudness) attributes [13]. At the same time, sound selection should facilitate spatial sound rendering. One of the findings of this thesis research is that realistic sound rendering is not necessary for achieving our goals. Thus, object distance rendering does not have to obey the physical laws for sound variation with distance from the object, and object identification can be based on the most characteristic sound associated with an object, e.g., striking the object with a stick, even though the user is actually rubbing the object with the finger. In addition, completely arbitrary sounds can be used to represent each object, even though more intuitive selections are preferable, as they are easier to learn and reduce the cognitive load of the user during exploration. Finally, the acoustic-tactile configurations we propose make use of recorded speech feedback.
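As an illustration of the two kinds of cues just described, the sketch below computes a crude directional cue (a constant-power pan standing in for the HRTF-based rendering) and an intuitive, deliberately non-physical distance cue (gain and tempo ramps). The pan law, the ramp shapes, and the maximum distance are assumptions for illustration, not the settings used in our experiments.

```python
# Illustrative sketch of directionality and distance cues for a finger and a
# target object given in screen coordinates.  The actual display uses HRTFs
# or a spherical head model for direction; the simple pan below is a stand-in.
import math

def spatial_cues(finger, obj, max_dist=1000.0):   # max_dist is an assumed constant
    dx, dy = obj[0] - finger[0], obj[1] - finger[1]
    dist = math.hypot(dx, dy)
    azimuth = math.degrees(math.atan2(dx, -dy))   # 0 deg = straight "ahead" (up the screen)

    # Directionality: constant-power pan in place of HRTF filtering.
    pan = max(-1.0, min(1.0, azimuth / 90.0))     # -1 = full left, +1 = full right
    left_gain = math.cos((pan + 1.0) * math.pi / 4.0)
    right_gain = math.sin((pan + 1.0) * math.pi / 4.0)

    # Distance: intuitive, not physically accurate -- louder and faster
    # as the finger approaches the object.
    proximity = max(0.0, 1.0 - dist / max_dist)
    volume = 0.2 + 0.8 * proximity                # never fully silent
    tempo_bps = 1.0 + 5.0 * proximity             # repetition rate of the cue, beats/s

    return left_gain * volume, right_gain * volume, tempo_bps

print(spatial_cues(finger=(100, 700), obj=(600, 300)))
```

In the proposed display, the panning stage is replaced by HRTF or spherical-head filtering, and the distance-to-loudness and distance-to-tempo mappings are calibrated experimentally, as described in Chapter 3.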


Even though our primary focus is on acoustic display, we will also briefly consider the addition of tactile feedback. Static tactile signals can be added by superimposing a raised-dot pattern embossed in a sheet of paper (using a Braille printer, e.g., [14]) on the screen, as in the Talking Tactile Tablet (TTT) [12, 15]. We will show that raised-dot patterns provide the most effective shape rendering. However, since they are static (a new overlay is needed each time the scene changes), they can be used for displaying information that is mostly unchanging, e.g., a map or building layout, while dynamic sound rendering can be used for rendering moving objects and scene exploration.

Recent advances in tactile technology have made it possible to add dynamic tactile signals. For example, the touch screen can be combined with a variable friction display [16–19]. Variable friction displays developed by Colgate and Peshkin's group [20–23] can also be combined with acoustic feedback in response to finger movements. However, as we will show in Chapters 2 and 6, using a touch screen with sound feedback yields significantly better shape recognition results than those obtained with variable friction displays [18]. This is because sound offers more physical and perceptual dimensions than friction. Another possibility is the addition of vibration feedback, which can be incorporated in the touch screen device [24]. However, vibration does not offer good shape rendition, either. Ultimately, the goal is to exploit the best that each modality can offer in order to obtain the most effective and intuitive interfaces. Embossed raised-dot pattern overlays provide the best complement to acoustic signals. While no dynamic device that can combine sound with raised-dot patterns is currently available, emerging technologies based on electroactive polymers, pneumatics, and microelectromechanical (MEM) devices hold great promise for the future. Even though this thesis research cannot make use of such dynamic devices, the results are expected to have a significant impact on their effectiveness when they do become available.

In this thesis we will consider a variety of configurations for acoustic-tactile rendering of object size, shape, identity, and location, as well as for the overall perception of simple layouts and scenes. The configurations have been implemented on a touch screen device, but, as we mentioned above, we will also consider the addition of a raised-dot pattern overlay. While our primary goal is to explore the fundamental capabilities and limitations of representing visual information in acoustic-tactile form, we will also consider a number of relatively simple configurations that can be tied to specific applications. These include maps, block diagrams, charts, and simple scenes. In particular, we will present experimental results with a simple scene layout consisting of objects in a linear arrangement like the one shown in Fig. 1.3, each with a distinct tapping sound, which we will compare to a "virtual cane." We will also present a "Venn diagram" configuration, where the subjects explore a diagram with three overlapping circles of different sizes.

We will present systematic experiments with the proposed configurations. All the experiments were conducted with visually blocked (VB) subjects. The experiments evaluate the effectiveness of the proposed display for shape perception, object identification and localization, and 2-D layout perception. The shape exploration experiments demonstrate the importance of instantaneous acoustic cues (directionality cues via HRTF and distance cues via intensity variations).


We show that the proposed approach outperforms existing approaches, and argue that using sound for shape rendering has considerable advantages over variable friction displays. We also demonstrate that raised-dot patterns provide the best shape rendition in terms of both accuracy and speed. However, current technology does not allow the combination of raised-dot patterns and acoustic feedback in a dynamic, affordable display. The object localization and 2-D layout perception experiments demonstrate the advantages of directionality cues (via HRTF) and distance cues (based on intensity and tempo) relative to nondirectional sound exploration. We explore two modes of directionality rendering, one assuming fixed virtual head orientation and the other assuming that the virtual head is oriented in the direction of the finger movement, and show that the former is simpler, more intuitive, and more effective. Experiments with layout rendering and perception demonstrate that the simultaneous representation of objects, using the most effective approaches for directionality and distance rendering, approaches the optimal performance level provided by visual layout perception. Finally, experiments with the virtual cane and Venn diagram configurations demonstrate that the proposed techniques can be used effectively in simple but nontrivial real-world applications.

This thesis is organized as follows. Chapter 2 presents a literature review of VS technologies. We first discuss broad categorizations of VS techniques into acoustic, tactile, and acoustic-tactile, static versus dynamic, active versus passive, as well as techniques that rely on intuitive versus arbitrary and direct versus semantic mapping of information. We then present an extensive review of existing VS techniques, from traditional aids like Braille and the long cane to the current state of the art.

Chapter 3 introduces the proposed acoustic-tactile display. The display is to be scanned actively. The main feedback modality is acoustic, while kinesthetic feedback is generated by the scanning finger movements. The sequential exploration mode of the display is compared to visual and tactile perception. In conveying visual information acoustically, a virtual acoustic scene is rendered, where the virtual listener is at the location of the scanning pointer and the objects/segments in the scene being conveyed become virtual acoustic sources. We discuss the "what" and "where" concepts in visual perception and attempt to build analogies with the proposed display. In addition, we discuss a number of hardware and software challenges that had to be addressed in the implementation of the proposed display on consumer tablets, which have limited memory and computation power.

Chapter 3 also presents the fundamental principles and core technologies that we rely on for the proposed acoustic-tactile display. One of the key contributions of the proposed approach is the use of spatial sound. As we discussed, spatial sound provides directionality and distance cues. We discuss the principal approaches for acoustic directionality rendering, which rely on natural acoustic cues, and alternative approaches for modeling them. In contrast, for distance rendering we resort to intuitive cues that do not correspond to physically accurate representations of the acoustic scene. This is because human judgment of distance based on acoustics is poor even in natural settings [25, 26].


We consider the relative advantages of variations in tempo, intensity, and tremolo as virtual distance cues. We also discuss important implementation issues, such as intensity calibration, dynamic range, and quantization, for both directionality and distance rendering. In addition, we use simulated reverberation to add realism and to emphasize directionality and distance cues. Our implementation is based on one of Moorer's reverberation modules [27] with several modifications.

Chapter 4 presents the proposed acoustic-tactile configurations. We first consider a variety of configurations for shape rendition. We then consider configurations for object localization and simple layout perception. We also present a number of configurations for simple scene perception and graphical applications. Finally, we propose a number of more elaborate configurations for more advanced applications, which are beyond the scope of this thesis and will be the subject of future research.

Chapter 5 describes the subjective experiments we conducted to evaluate the proposed configurations. First, we discuss the basic experimental setup and procedures that are common to all of the experiments. We then present the specifics of each experiment or group of experiments, organized according to the configurations they are based on, into shape, localization, layout, and application experiments.

Chapter 6 presents the results of the subjective experiments, along with a detailed statistical analysis of the results, as well as feedback from the participants and general observations and conclusions. Comparisons of the different configurations, experimental modes, and parameter settings are used to obtain the most effective interfaces for each application and task.

Finally, Chapter 7 summarizes the conclusions and contributions of this research, as well as future work.

1.1. Contributions

We now summarize the main contributions of this thesis research.

• We have proposed a new approach for acoustic-tactile representation of visual information. The key distinguishing features of the proposed approach are the active exploration of the object/layout, the use of spatialized sounds, and the semantic, intuitive, but not necessarily realistic, mapping of objects and actions to sounds.
• We have proposed configurations for shape rendering, object localization and identification, 2-D layout perception, simple scene perception, and graphical applications.
• We have designed acoustic signals for these configurations.
• We have conducted systematic subjective experiments with visually blocked subjects to evaluate the various configurations and to determine the most effective selections/settings/parameters. In particular,
– We have established the need for guiding the finger/stylus for shape perception and object localization, as well as overall scene perception.


– We have investigated the relative advantages of different approaches for distance rendering (tempo, loudness, tremolo).
– We have shown that raised-dot patterns provide superior shape rendering and contrasted them with other popular tactile feedback techniques, such as vibration and variable friction displays.
– We have explored the use of different sounds (simple tones, broadband signals, musical instruments) for object identification and effective localization.
– We have compared different assumptions for the orientation of the listener's virtual head in the design of directionality cues for object localization, and determined that a fixed orientation of the virtual head is more effective than one derived from the finger trajectory.
– We have analyzed the effect of object size on object localization.
– We have shown the effect of demographics and training on subject performance.
• We have shown that acoustic rendering can be used for better shape perception than available dynamic tactile technologies, such as variable friction displays, which have important shortcomings (limited physical and perceptual dimensions for conveying information).
• We have shown that the most effective interfaces do not have to imitate reality or to follow physical laws, but instead must be simple, intuitive, and have minimal impact on cognitive load.
• We have demonstrated the use of the proposed techniques in the context of specific applications.


CHAPTER 2

Background

The focus of this thesis is on visual substitution (VS), that is, representing visual information with non-visual stimuli. It is an example of sensory substitution (SS), which is defined as the use of one or more senses to convey information that is normally available in another sensory modality. In the following, we first discuss broad categorizations of VS techniques, and then review the existing techniques, from traditional aids like Braille and the long cane to the current state of the art. As explained in Chapter 1, surgical approaches that attempt to restore some degree of vision by stimulating parts of the brain are beyond the scope of this research.

2.1. Categorization of Visual Substitution Techniques

The first and simplest classification of VS techniques is into techniques that substitute visual stimuli with tactile, acoustic, and tactile-acoustic stimuli. Another important classification is into static and dynamic techniques. As we discussed, dynamic techniques are important for the design of interactive systems.

Since most of our observations are multi-sensory, there is general agreement among the different senses in the perception of the natural world. Of course, the different senses also complement and reinforce each other, but in most cases it is possible to find features associated with one sense that can substitute for corresponding features in another sense. On the other hand, as we mentioned in the introduction, one can map the features of one sense to completely arbitrary features in the other sense, especially when no corresponding features exist. For example, there are really no tactile or acoustic features to represent color, even though the term "color" is used in music, where it has a meaning only remotely connected to visual color. Of course, intuitive selections are mostly preferable, as they are easier to learn and reduce the cognitive load of the user during exploration. Thus, VS techniques can be classified into those that use intuitive and those that use arbitrary feature mappings.

Another classification of VS techniques is into those that use direct versus semantic mapping of visual signals. The direct translation of visual to tactile signals is possible, for example in raised-line drawings, but it is not always intuitive, for example in the translation of image intensities to voltages applied to the tongue [5] or to sound intensities [1]. Moreover, the limited spatial resolution of touch [8] and hearing [9] makes direct translation even more ineffective and difficult. Instead, semantic mapping can be used to provide more effective and intuitive, if not realistic, interfaces.


Finally, an important distinction is between active and passive VS techniques. For example, Braille reading with the finger is active, while electric stimulation of the tongue is passive. The advantage of the former is that it gives the user full control of information access, while in the latter the order and the speed of presentation are determined by the system. Moreover, active exploration with a finger or stylus provides valuable kinesthetic feedback.

2.2. Existing Visual Substitution Techniques

Assistive technologies for VI people make extensive use of VS. Perhaps the most common example is Braille, which substitutes vision by touch for reading optical characters and symbols. The Braille system, introduced by Louis Braille in 1821, is one of the oldest and most widely used aids by the VI. Each Braille character is represented as a cell containing a 3 × 2 (three rows, two columns) matrix of dots, each of which may be raised or not [28]. Thus, the mapping is semantic but arbitrary. A different Braille alphabet has been designed for most languages. Among several advantages, Braille facilitates active exploration. Moreover, users can use multiple fingers from both hands for look-ahead reading, in a manner that is analogous to peripheral vision. Numerous devices, such as Braille printers, are available to render Braille. Braille embossers are relatively inexpensive and can also be used for mass production, but are static. On the other hand, refreshable Braille displays based on pin arrays are interactive, but are bulky and expensive. In addition, pin arrays can be used for 2-D display of graphical information, but suffer from the same drawbacks as Braille displays.

Linvill and Bliss [29] developed a device called the Optacon as an alternative reading aid for the VI. The Optacon uses an array of photocells to capture an optical character. The photocells are then (directly) coupled one-to-one with a matrix of piezoelectric pins that generates a vibro-tactile pattern. Oleg Tretiakoff designed a modified version of the Optacon, in which acoustic stimuli are used instead of vibro-tactile stimuli [30]. Another reading device that uses direct translation of optical characters into acoustic stimuli is the Stereotoner [31]. The Stereotoner senses the light reflected from a character with a moving column of 10 sensors and maps the binary output of each sensor to a pure tone. For example, if three sensors are activated, it plays back the corresponding three tones simultaneously. These devices rely on active exploration. With the advancement of the state of the art in speech synthesis, an alternative VS approach for reading is text-to-speech, which substitutes vision by sound. Compared to Braille, text-to-speech is more natural and does not require training, but the presentation is mostly passive.

Another simple and effective VS tool used by the VI community as a navigational aid is the long cane. The long cane relies on both touch and hearing. However, it is limited to the immediate vicinity in front of the VI person. A number of navigational devices have been proposed that attempt to extend the range of the long cane by direct translation of distance to an acoustic or tactile signal. The Mowat sensor, introduced in 1977 [32, 33], is an ultrasonic cane that detects objects from ultrasonic reflections and converts them directly to vibro-tactile feedback.


The Nottingham obstacle detector is based on similar principles, except that it provides auditory instead of vibro-tactile feedback [34]. In 1984, a head-mounted sonar system called the Sonic Pathfinder, which only indicates the closest object (and is thus used in conjunction with the long cane in navigation), was introduced by Tony Heyes [35]. In this system, the distance to the nearest object is mapped directly to tones on a musical scale (that is, pitch is proportional to the distance) and the angular placement of the object relative to the user is denoted by stereophonic sound [36]. KSONAR is another ultrasonic device based on direct distance-to-acoustic mapping that is used with the long cane [37]. Meers et al. [38] designed another direct mapping system, in which head-mounted stereo cameras are used to build a depth map of the objects in front of the user. The system then conveys the distance of the ten most important objects (left to right) to the ten fingers of the user via gloves that produce electro-neural stimulation proportional to the distance of each object. Walker et al. [39] introduced a system in which a person is guided to a final destination through several way-points by directional (HRTF-based) wide-band sound beacons and voice. This system uses GPS to acquire the user's location and orientation. Additional information, such as nearby objects and walking terrain, is conveyed by sound icons and natural sounds.

However, our main interest is in the representation of graphical and pictorial information. As we discussed, raised-line drawings and other tactile patterns can be used to convey graphical information in textbooks, maps, and other documents [1]. Such tactile graphics can be generated with a number of different methods that include stereo-copy, thermoform, silk screen with foam ink (screen-printing), and milling machines. The tactile patterns are embossed on a variety of media, each of which is matched with the method of embossing (micro-capsule paper for stereo-copy, thermoplastic polymers for thermoform, wax-based paper for screen-printing) [40]. Jehoel et al. [41] report that rough paper and micro-capsule paper are the most suitable substrates and suggest rough plastic when a more durable substrate is needed. However, in all of these techniques, the conversion from the visual graphic to the tactile graphic is not always done via direct mapping, and the conversion is generally a time-consuming and expensive task that requires human mediation [1]. Nevertheless, the results are quite useful and worth the effort.

Kurze proposed a tactile drawing tool that can be used by VI subjects to draw pictures and feel their drawings [42]. Fritz et al. [43] proposed a haptic interface with a three-degree-of-freedom force-feedback mechanism, to be used in communicating 2-D and 3-D data plots. Jansson et al. [44] developed a haptic mouse, which provides feedback to the two fingers on the mouse using a matrix of pins for each finger. They tested it on geographical map exploration, but found that it was ineffective without visual feedback.

As we discussed in the introduction, recent advances in tactile technology have enabled the presentation of dynamic tactile signals. Colgate and Peshkin's group developed a number of variable friction displays, which control the surface friction of a rigid material such as glass [20–23]. TeslaTouch is another variable friction display, which relies on the electro-vibration principle to modify the attractive force (and hence the friction) between the finger and a flexible panel that can be superimposed on a touch screen [16, 18, 19].

23

A closely related technology was developed by Senseg [17]. The advantages and disadvantages of variable friction displays will be discussed in Chapter 6. However, as we will show in Chapters 2 and 6, using a touch screen with sound feedback yields significantly better shape recognition results than those obtained with variable friction displays [18]. This is because sound offers more physical and perceptual dimensions than friction.

An ambitious VS task is to use a camera or other device to capture an image of the environment, and then to present it to the user in acoustic-tactile form. As we discussed, one approach for doing this is by direct translation of visual signals into touch and/or sound. An example of this approach is the tongue display [5] we mentioned in the introduction, which relies on direct translation, and passive presentation, of grayscale image intensities captured by a camera into voltages. It consists of an array of electrodes that can apply different voltages to stimulate the tongue, which is the most sensitive tactile organ and has the highest spatial resolution. Even though the size of the electrode array is limited (12 × 12), when the video camera is placed on the head, the brain makes use of super-resolution principles to obtain higher-resolution images. The tongue display has proven to be quite effective in providing visual information and assisting visually impaired people with certain visual tasks [5]. However, the aversion factor for this device is quite high. The same is true for the presentation of electrical and other tactile stimuli on other parts of the body (back, abdomen, or forehead) [4, 6]. The majority of visually impaired people find such presentations quite invasive, and prefer to scan with the finger [7] (active exploration).

Another example of direct translation is SoundView, developed by Doel et al. [45], whereby the user actively explores a color image on a tablet with a pointer, listening to acoustic feedback. The sound depends on the color of the pixel and the velocity of the scanning pointer, which acts like a gramophone needle creating sounds as it "scratches" the image surface. Meijer's imaging system, named vOICe [46], maps a 64×64 image with 16 gray levels to a sequence of tones. The vertical dimension of the image is represented by frequency and the horizontal dimension by time and stereo panning. The loudness of each tone is proportional to the brightness of the corresponding pixel. In contrast to SoundView, the user does not have active control of the presentation. The performance of these two systems will be discussed in Chapter 6. Another passive system was proposed by Hernandez et al. [47]; in this system sound rays emanate from the object surface in the same manner as light rays. Among the direct mappings, we should also mention the rendition of grayscale values as tactile patterns proposed by Barner et al. [48–50], using digital halftoning. However, as we discussed above, all of these direct mapping systems are limited by the low spatial resolution of touch and hearing – in contrast to vision. In addition, they are limited by the lack of an intuitive mapping from the visual to the acoustic or tactile domain.
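For concreteness, the rough sketch below illustrates the kind of column-by-column image-to-tone scan described above for the vOICe. The image size matches the description, but the frequency range, column duration, and sampling rate are assumed values, not the actual parameters of that system.

```python
# Rough sketch of a vOICe-style direct image-to-sound scan: each column of a
# small grayscale image becomes one time slice, row index sets the tone
# frequency, and pixel brightness sets its loudness.  Numeric parameters are
# assumptions for illustration only.
import numpy as np

def column_scan(image, f_low=500.0, f_high=5000.0, col_dur=0.03, sr=16000):
    rows, cols = image.shape
    freqs = np.geomspace(f_high, f_low, rows)      # top row = highest pitch
    t = np.arange(int(col_dur * sr)) / sr
    out = []
    for c in range(cols):                          # left-to-right scan over time
        col = image[:, c].astype(float) / max(float(image.max()), 1.0)
        tones = (col[:, None] * np.sin(2 * np.pi * freqs[:, None] * t)).sum(axis=0)
        out.append(tones / max(rows, 1))
    return np.concatenate(out)                     # mono; stereo panning omitted

demo = (np.random.rand(64, 64) * 16).astype(int)   # 64x64 image with 16 gray levels
signal = column_scan(demo)
print(signal.shape)
```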

24

In contrast to the direct mapping approaches, the proposed approach assumes that the image has been analyzed to obtain a semantic mapping, which is then presented to the user in acoustic-tactile form. A number of systems have been proposed along these lines. Jacobson [51] implemented an audio-enabled map on a touch pad, which plays back voice and natural sounds that correspond to the finger position. In addition to the semantic mapping, the system is dynamic, but it does not make use of spatial sound. Parente et al. [52] developed an audio map using 3-D spatial sounds to locate auditory icons and speech callouts. The user actively explores the map with a mouse, trackball, keyboard, or touch screen, querying information about the points visited. When a joystick is used, additional map information is provided via tactile vibrations and texture. This system, too, is dynamic and uses a semantic mapping.

The above two systems are interactive displays based on active exploration. Note, however, that conventional pointing devices such as the mouse, trackball, and joystick, by design, require visual feedback for operation. Hence, in contrast to exploring a touch screen with a finger or stylus, they are not effective in VS tasks, as Jansson et al. found [44]. Along these lines, NOMAD [53], the Talking Tactile Maps [54], and the TTT [12] we mentioned above are static displays that provide an embossed surface that the user scans with the finger, while an auditory signal (typically in the form of speech, even though other sounds are possible) is played back at certain finger positions. However, in these systems the user needs to manually change the tactile overlay in order to explore a different layout. On the contrary, auditory graphical user interfaces (GUIs), such as Soundtrack [55], Kershmer and Oliver's system [56], Savidis and Stephanidis's system [57], and the Mercator Project [58], are dynamic, and hence more interactive.

Finally, Ribeiro et al. [59] proposed a passive system that sonifies objects in a real-world scene using spatialized three-dimensional sounds. The spatialized sound rendition was done using HRTFs from the CIPIC database [60], intensity, and the direct-to-reverberant ratio. Their approach is semantic in that computer vision techniques are used to identify objects in the scene. While this is an indirect semantic approach like ours, and the acoustic rendition methods are similar to ours, a direct comparison of the results is not possible because of the different tasks in the two experiments.

25

CHAPTER 3

Acoustic-Tactile Display

The goal of this thesis is to design interfaces for conveying graphical and pictorial information (graphs, diagrams, charts, and eventually photos and videos) to a VI or VB user by acoustic signals as she/he scans a touch screen using the finger or a stylus as the pointing device. As we discussed in the introduction, a sheet of paper with embossed raised-dot patterns may also be superimposed on the touch screen (without affecting its ability to detect the moving finger or stylus), but the emphasis will be on acoustic display with touch as the pointing mechanism. In this chapter we present the basic setup, the key challenges, the fundamental principles on which the interfaces are based, and the design of the acoustic signals.

3.1. Basic Setup

In the proposed display, the touch screen is partitioned into regions, each with a particular sound field, and a raised-dot pattern when a tactile overlay is added. In the remainder of this chapter, we will assume that the response to finger movements is purely acoustic. Each region represents an object, part of an object, background, or other element of a visual scene or graphical display. As the user scans the screen with the finger, he/she listens to auditory feedback, played back on stereo headphones, that corresponds to the finger location on the screen. Spatial sound is a key element of the proposed acoustic-tactile display. The intensity, tempo, or tremolo rate of the sound may change to indicate distance from an object or another region, while the directionality of the sound may be used to guide the finger to an object or region. By scanning the touch screen with the finger or stylus, the user can actively explore the visual information, which has been mapped into acoustic signals, with the goal of building a mental picture of the layout. Initial indications from existing research are that it is possible to perceive scene layout using sound. In particular, Sanchez et al. [61] found that with the use of spatialized acoustic stimuli, blind children can construct spatial imagery. As we discussed, the primary use of touch is as a pointing mechanism; however, when used for active exploration, it also provides important kinesthetic feedback, which facilitates scene perception. The use of the finger/stylus to scan the display should be contrasted with traditional pointing mechanisms used in personal computers, like the mouse, trackball, and joystick, which have been found to be ineffective in VS tasks [44].
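The skeleton below sketches this event-driven setup: on every touch-move event, the region under the finger is looked up and playback switches to that region's sound. The touch and audio interfaces (on_touch_move, play_loop, etc.) are placeholders rather than a real tablet API, and the region-lookup function stands for a label-map query as in Chapter 1.

```python
# Event-driven skeleton of the basic interaction loop (illustrative only).
class AcousticTactileDisplay:
    def __init__(self, region_of, sounds, player):
        self.region_of = region_of      # (x, y) -> region label
        self.sounds = sounds            # region label -> looped sound
        self.player = player            # stereo playback backend (assumed)
        self.current = None

    def on_touch_move(self, x, y):
        region = self.region_of(x, y)
        if region != self.current:      # finger crossed a region boundary
            if self.current is not None:
                self.player.stop()
            self.player.play_loop(self.sounds[region])
            self.current = region

    def on_touch_up(self, x, y):
        self.player.stop()              # silence when the finger is lifted
        self.current = None


if __name__ == "__main__":
    class _PrintPlayer:                 # stand-in backend for illustration
        def play_loop(self, s): print("playing", s)
        def stop(self): print("stop")

    display = AcousticTactileDisplay(
        region_of=lambda x, y: "object" if x > 500 else "background",
        sounds={"object": "oboe_loop.wav", "background": "background.wav"},
        player=_PrintPlayer())
    display.on_touch_move(100, 100)     # enters the background region
    display.on_touch_move(600, 100)     # crosses into the object region
```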

26

The task of perceiving visual information can be divided into two main subtasks, the "what" and the "where" [62, 63]. The "what" involves identification of the different objects/segments in the scene, for example by their shape, size, color, material, sound, etc. The "where" involves the determination of the segment/object positions in the scene [62, 63]. Of course, there are also interactions between the two; for example, the perception of the relative sizes of two objects depends on their sizes on the retina and their locations relative to the user. These "what" and "where" tasks of perception are in fact common to all three modalities under consideration (visual, acoustic, and tactile) [64, 65], as well as to olfaction. In the configurations that embody the proposed approach, presented in Chapter 4, we will consider these two tasks both separately and together. For example, we have designed configurations for shape perception (what), for object localization (where), and for simple layout perception (what and where). In addition to shape and size, object identification can be based on the characteristic sound associated with the object. In Section 3.4.1 below, we will discuss the selection of sounds that facilitate object identification and discrimination. For object localization, we will rely on spatial sound (directionality, distance) to guide the finger/stylus to an object. The same cues will also be used for shape exploration, to guide the finger along the object boundary.

3.2. Key Challenges

As we discussed in Chapter 1, one of the key challenges for the acoustic-tactile representation of visual information is the mismatch between the high spatial resolution of vision and the low spatial resolution of hearing [9] and touch [8]. The challenge is then to provide key information about the location, shape, and identity of the key objects in the scene, without the ability to display much detail. Moreover, as we saw, the direct translation of visual signals into acoustic and tactile signals cannot be done in an intuitive way, e.g., image intensity to sound intensity, sound frequency, raised-dot density, voltage applied to a body part (tongue display [5]), etc. Thus, the proposed approach relies on the availability of semantically meaningful segmentations, and on the mapping of segment features or semantics into distinct acoustic signals. This mapping can be direct (when possible, e.g., rough visual textures into rough acoustic textures), semantic (a glass surface into the sound of knocking on glass), or arbitrary but intuitive (green grass into smooth music). Although semantic analysis of visual information is beyond the scope of this thesis, our focus is on applications (such as conveying maps, graphs, and diagrams) for which semantic information is readily available.

Another challenge is the sequential exploration of the display. In the proposed mode of exploration, the user scans the 2-D layout with one finger or stylus, listening to one sound source at a time, the one that corresponds to its location on the screen. The touch screen typically responds to the centroid of the area that is in contact with the finger/stylus tip. The use of the other fingers, as for example in Braille reading, where the other fingers can be used for looking ahead, is possible, but we should point out that, while in Braille reading each finger senses a different signal, when multiple fingers are used to scan a touch screen, the acoustic feedback is not necessarily associated with a particular finger.

27

However, if used carefully, the playback of two or more simultaneous sounds, whether activated by multiple fingers or by the proximity of one finger to multiple objects, could help navigation in the virtual environment, e.g., by indicating the presence of additional objects, dangers to avoid, etc. Indeed, in Section 4.4.3 we will present a configuration (Venn diagram) where two or more sounds are played simultaneously to indicate overlapping sets. Nevertheless, with a few exceptions, this thesis will primarily be concerned with sequential exploration using one finger. The sequential presentation of information in our setup should be contrasted with vision, where there is a mix of sequential, high-resolution, limited-view scanning by the fovea and gathering of low-resolution data over a wide receptor field in the periphery [66]. Serial presentation has also been recognized as a significant bottleneck in experiments with the perception of raised-line drawings [67]. Interestingly, Loomis et al. [68] demonstrated that the limitations imposed by point-by-point serial presentation are not inherent to any one presentation modality. When they narrowed down the visual field of view to the size of a fingertip and asked subjects to recognize line drawings, the recognition accuracy was no better than that achieved in recognizing raised-line drawings with touch.

In shape, layout, and scene exploration, an important challenge for the proposed approach is the amount of cognitive effort that is required for guiding the finger to various points on the screen, which may interfere with overall shape and layout perception. This is in contrast to vision, where object and scene exploration is typically effortless. Interestingly, haptic tracing of simple raised-line drawings (e.g., of a simple object) requires little, if any, cognitive effort, while the perception of shape is quite challenging and requires significant cognitive effort [67]. For acoustic-tactile representations, our goal is to design intuitive interfaces that minimize the cognitive load of exploration, and of course also require reasonable cognitive effort for shape and layout perception. As we will see in Chapter 4, one of the key differences with raised-line drawings is that our display uses different signals for object boundaries, interiors, and the background, which facilitates shape and layout perception. On the other hand, boundary tracing requires substantial effort. To reduce the cognitive effort of exploration, we use spatial sound (directionality and distance) for guiding the finger.

Overall, the perception of visual information by the proposed approach will by necessity be slow and considerably less accurate than vision. As we will see in Chapter 6, performance will vary across subjects and will depend on training, previous sensory experience, as well as experience with acoustic-tactile devices. For example, while vision-blocked sighted or late-blind subjects may have an initial advantage over congenitally blind subjects (as Lederman et al. [69] found for haptic perception of raised-line drawings), visually impaired subjects may eventually have an advantage, as they have strong motivation for training.


3.3. Key Tasks and Performance Metrics

The tasks for the user of the proposed acoustic-tactile interface include the following separate but related parts.
(1) Starting from an arbitrary point in the virtual space, the subject must navigate to the target object(s) by moving the finger on the touch surface, as guided by the sound.
(2) The subject must trace out the border of an object with the finger based on sound guidance.
(3) The subject must perceive the object size and shape, as well as possible overlap with neighboring objects.
(4) The subject must perceive the object identity and material.
(5) The subject must perceive the scene or graph layout, i.e., the relative positions of the objects on the touch screen.
These tasks will be accomplished using primarily non-verbal auditory feedback. We will use verbal feedback only for general instructions, as Tran et al. [70] have found that speech is annoying and inappropriate for guidance. Another advantage of sounds that do not rely on speech is that they are expected to be equally valid and useful regardless of the linguistic background of the participants. A central component of this thesis research is to determine the sounds, and the rules for their parametric variation with finger position, that lead to optimum performance. The performance of the proposed techniques will be measured in terms of the accuracy and speed of performing the particular tasks, as well as various metrics of user experience, such as the level of difficulty and the cognitive load reported by the subjects, and how intuitive they found the interface to be.

3.4. Acoustic Signal Design

As we discussed, sound will be used as the primary mode of feedback in the proposed acoustic-tactile display. The primary functions of the acoustic signals are localization and identification. We plan to use acoustic signals for guiding the scanning finger in the exploration of the virtual environment, and for object identity, size, material, and shape recognition, as well as for perceiving an object's relationship to other objects, e.g., overlap. To achieve these goals, we will utilize acoustic signals with different physical attributes, such as spectral composition, intensity, tempo, and interaural differences, and different perceptual/psychophysical attributes, such as pitch, timbre, roughness, sharpness, loudness, and location [13]. Thus, in contrast to other modalities, such as friction, sound offers multiple perceptual dimensions that can be used for the goals stated above. In the following, we first consider sound design for object identification and then for navigation in the virtual environment. We also discuss the selection of the sound attributes that can facilitate these two tasks. As we will see, some of these selections are closely related to both tasks.


3.4.1. Sound Discrimination and Identification

Our first goal is to generate sounds that are perceptually distinguishable from one another, and can be associated with particular objects in the virtual environment, as well as object borders, graph nodes, and other important points in the scene or graphical display. We will also use acoustic signals for object and material recognition. To that effect, recorded characteristic sounds of materials, combined with appropriate tactile textures, can be used to invoke the sensation (illusion) of touching the actual material. However, as we have already pointed out, while realistic perception of an object or material is desirable, the end goal of object and material recognition can be achieved using unrealistic, but easier to discriminate, remember, and recognize sounds. For example, it is a lot more difficult to recognize wood or metal from the sound it makes when you rub your finger against it than from the sound it makes when you strike it with a stick. Of course, we can also use completely arbitrary sounds, which the subjects can learn to identify with a particular visual attribute, material, or object.

3.4.2. Sound as a Tool for Navigating the Virtual Environment

As we discussed above, spatial sound will be the primary tool for navigating the virtual environment. We will consider distance cues (the distance between the listener and the source) and directionality cues (the angle from which the sound is coming) separately. While distance and directionality can be encoded using arbitrary acoustic cues, which users can learn, it is important to synthesize and use natural cues wherever possible, in order to minimize the learning curve and the cognitive effort required in using the display. Thus, for directionality we will rely on natural acoustic cues using a spherical head model or HRTF, which attempt to synthesize sounds coming from a given direction in the natural environment. However, for distance we will rely on intuitive cues that do not necessarily obey the physical laws for sound variations with distance from the object.

3.4.3. Distance Rendering

In real-world conditions, changes in distance from a sound source cause significant changes in the acoustic signal that reaches the listener's ears, thus providing strong proximity cues. There are four main acoustic distance cues for stationary sources and listeners: intensity, direct-to-reverberant energy ratio, spectrum, and binaural differences [25]. Under ideal conditions (point source and acoustic free field), variations of sound intensity with distance follow an inverse square law. The direct-to-reverberant energy ratio, which is more noticeable in indoor environments, undergoes a systematic decrease as the distance increases. The distortion of the acoustic signal spectrum also increases with distance. Finally, in the near-field, the interaural time difference (ITD) and interaural level difference (ILD) are also distance-dependent. However, the exact relationship between distance and


all of these acoustic cues is largely dependent on the environment and the properties of the sound source [26]. Humans make use of multiple distance cues, including non-acoustical factors such as vision and perceptual organization, for distance perception [26]. Moreover, different listeners weigh each cue differently in their judgments [26]. Thus, due to the complex and interdependent relationship among physical parameters and perceived acoustic distance, it is very difficult to effectively model human perception of acoustic distance. At the same time, human judgments of absolute acoustic distance in natural settings have been shown to be largely inaccurate. Distance is typically underestimated in the far-field (over 1 m) and overestimated in the near-field (less than 1 m) [26], and there is also significant variability in distance judgments [25]. Indeed, the accuracy of absolute distance judgments using acoustic cues largely depends on the availability of a reference sound (familiarity with a sound at a known distance). For example, human distance judgments based on familiar sounds like the human voice are more accurate than those based on unfamiliar sounds [71]. Based on the difficulty of modeling acoustic distance perception and the inaccuracy of human perception of absolute distance, we decided to focus on rendering changes in distance rather than absolute distance, and to rely on intuitive conventions rather than accurate physical models. For conveying the actual distances between the objects in the presented virtual scene, we rely on kinesthetic feedback from the moving finger/stylus on the touch screen. Note that since this is a virtual environment, there is no need to follow physical laws for signal variations. Instead, we can focus on the most intuitive conventions, or those that are the easiest to learn. Among several possible conventions for indicating changes in distance, we tried three: intensity, tempo, and tremolo. Most of the configurations described in Chapter 4 and used in the subjective experiments of Chapters 5 and 6 render distance with intensity; one configuration relies on tempo and one on tremolo.

3.4.3.1. Intensity. Sound intensity is a natural choice for rendering changes in distance because, as we mentioned above, it is one of the four cues of distance judgments by humans. Another advantage is that, in contrast to the tempo cue we discuss below, it is an instantaneous cue. The simplest way to modulate the intensity of the sound output at the headphones is via the playback volume control of the tablet. The playback volume was thus changed based on the distance of the finger from the object. (As far as the tablet implementation is concerned, this change is essentially instantaneous.) In one of the configurations we discuss in the next chapter (Section 4.2.2), we used a tone signal to denote the object and the tone with added white noise to denote the background associated with the object. While in the background, changes in proximity were indicated by changes in the signal-to-noise ratio; that is, as the finger approached the object, the tone intensity increased and the noise intensity decreased. Ideally, intensity changes should be based on their effect on loudness, but this is difficult for several reasons. First, we do not know what the relationship between the volume controls and intensity is.

Figure 3.1. Playback volume variation of tone and white noise with distance (in pixels) from the center of the object to the scanning finger

Second, the relationship between intensity and loudness is not easy to determine, especially because of the complicated interactions between the tone and noise signals. Instead, we worked directly with the volume controls on the device. We tried several functions for varying the volume of the tone and noise with distance, and selected the functions shown in Figure 3.1. Note that the volume of the white noise decreases rapidly when the finger (subject) is far from the object and more slowly as it approaches the object, while the volume of the tone increases slowly when the finger is far from the object and more rapidly as it approaches the object. Note also that a step in noise volume is necessary to indicate that the finger has entered the object. When the finger is inside the object, the subject hears only a constant, non-directional tone.

To get a better grasp of the relationship between the volume control and sound intensity, we measured the relationship between the tablet volume controls and the sound output at the headphones. We used a TENMA 72-935 dBA sound pressure level meter to record the sound level at the headphones for 50 uniformly sampled volume levels of a 1 KHz sinusoidal signal. We then used these measurements to calibrate the volume with distance so that the SPL varies uniformly with distance. Figure 3.2 shows the SPL and volume variations with distance in pixels, assuming a range of 0 to 1000 pixels. For an Apple iPad 1, this corresponds to 7.6 inches, and for a Samsung Galaxy Note 10.1, to 3.34 inches. Since the maximum possible distance is equal to the length of the diagonal of the display, we used that as the distance range in the display calibration.

Since human perception of intensity changes is better when there is a reference intensity, in one of the configurations we present in Chapter 4 we used reverberation as a reference. As discussed above, the direct-to-reverberant energy ratio is a natural distance cue. Note that, since the reverberation level does not change as the distance or directionality between the source and the listener changes (refer to Section 3.4.8 for a more detailed discussion of reverberation), it serves as the fixed reference.
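The exact tone and noise volume curves of Figure 3.1 are given only graphically, so the following minimal Python sketch uses assumed quadratic shapes that reproduce the qualitative behavior described above: the noise falls rapidly far from the object and slowly near it, the tone rises slowly far away and rapidly near the object, and there is a step in noise volume at the object boundary. The object radius and maximum distance are also placeholders, not the actual calibration.

```python
# Minimal sketch of a distance-to-volume mapping with the qualitative behavior of
# Figure 3.1. The quadratic curve shapes, the object radius, and the volume limits
# are illustrative assumptions, not the calibration actually used in the thesis.

MAX_DIST_PX = 1000.0     # assumed maximum finger-to-object distance (display diagonal)

def playback_volumes(distance_px, object_radius_px=50.0):
    """Return (tone_volume, noise_volume), each in [0, 1], for a finger at the
    given distance (in pixels) from the center of the object."""
    if distance_px <= object_radius_px:
        # Inside the object: constant clean tone, and a step down to zero noise.
        return 1.0, 0.0

    # Normalized background distance: 0 at the object edge, 1 at MAX_DIST_PX.
    d = min(1.0, (distance_px - object_radius_px) / (MAX_DIST_PX - object_radius_px))

    tone = 0.3 + 0.7 * (1.0 - d) ** 2    # rises slowly far away, rapidly near the object
    noise = 0.2 + 0.8 * d ** 2           # falls rapidly far away, slowly near the object
    return tone, noise

if __name__ == "__main__":
    for dist in (0, 100, 300, 600, 1000):
        t, n = playback_volumes(dist)
        print(f"distance {dist:4d} px -> tone volume {t:.2f}, noise volume {n:.2f}")
```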

Figure 3.2. Variation of SPL and volume with distance (in pixels from the center of the object to the scanning pointer). Maximum distance is assumed to be 1000 pixels

Thus, the level of reverberation is constant, while the direct (dry) sound intensity varies with distance according to the curve of Figure 3.2. With the limited dynamic range of sound intensity of the tablets, the design of the distance-versus-intensity curve becomes a challenge. Having a uniform gradient of intensity across the entire (diagonal) length of the tablet results in smaller, and thus more difficult to perceive, intensity changes. This was indeed the case in our initial experiments, and subjects complained. Thus, in one configuration, we designed the distance-versus-volume curve to have larger gradients closer to the object and smaller gradients away from the object. Of course, some subjects then complained about having difficulty perceiving changes away from the object. Thus, the best distribution of the gradient across distance remains an open issue.

3.4.3.2. Tempo. Even though tempo variations are not naturally experienced in acoustic distance judgments, it is easy for subjects to learn to associate tempo changes with changes in distance. Moreover, tempo has already been used successfully in other acoustic interfaces, such as SONAR, audio vehicle backup assist, and games. In our implementations, the tempo of the sound increases as the scanning finger approaches the object. A simple way to generate a tempo is by creating a periodic signal with alternating segments of signal and silence. Note that the playback signals in the proposed display are already periodic, because for practical implementation reasons a finite-length signal has to be stored in the device and played back in a loop. Thus, the only additional requirement for creating a clear tempo is the inclusion of a silence. To vary the tempo, we fix the length of the signal segment and change the length of the silence segment, as shown in Figure 3.3. (This bypasses the need for sophisticated and computationally intensive algorithms that are capable of manipulating the tempo of a sound without changing the pitch.)
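As a concrete illustration of this construction, the sketch below builds one playback period by appending a distance-dependent silence to a fixed-length audible segment. The segment length, the three quantized silence lengths, and the distance thresholds are illustrative assumptions; the thesis specifies only that a few quantized tempo levels were used.

```python
# Sketch of the quantized tempo cue: one playback period is a fixed-length audible
# segment followed by a silence whose length depends on the finger-to-object distance.
# The segment length, the three silence lengths, and the distance thresholds are
# illustrative assumptions; the thesis specifies only that a few quantized levels were used.

import numpy as np

FS = 44100            # sampling rate in Hz
SIGNAL_MS = 150       # length of the audible segment, in ms

# Assumed mapping from normalized distance (0 = at the object, 1 = farthest) to silence length.
SILENCE_LEVELS_MS = [(0.25, 100), (0.5, 250), (1.01, 500)]

def tempo_period(signal, norm_distance):
    """Return one period of the looped playback signal (audible segment + silence)."""
    segment = signal[: int(FS * SIGNAL_MS / 1000)]
    for threshold, silence_ms in SILENCE_LEVELS_MS:
        if norm_distance < threshold:
            silence = np.zeros(int(FS * silence_ms / 1000))
            return np.concatenate([segment, silence])
    raise ValueError("normalized distance must be in [0, 1]")

if __name__ == "__main__":
    tone = np.sin(2 * np.pi * 440 * np.arange(FS) / FS)   # 1 s, 440 Hz tone
    for d in (0.1, 0.4, 0.9):
        period = tempo_period(tone, d)
        print(f"distance {d:.1f} -> period {1000 * len(period) / FS:.0f} ms")
```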

Figure 3.3. Playback signals (tone and white noise) for the three tempos used in our experiments (amplitude vs. time in seconds)

Ideally, we would like to have continuous tempo variation with distance, in direct analogy to the volume variations described above. However, for simplicity, in our first implementation we quantized the tempo variations to a few levels. This means that the user must move her/his finger a certain distance before a tempo change occurs; on the other hand, when a tempo change does take place, it is clearly perceptible. More research is needed to determine whether a finer quantization of tempo would result in more effective distance cues. However, we should point out that, no matter how it is implemented, the perception of tempo is not instantaneous; the user has to listen to the sound for at least one period in order to decode/perceive the tempo.

3.4.3.3. Tremolo. Tremolo is a sound effect that is popular among musicians and can be described theoretically as low-frequency amplitude modulation without zero crossings [72]. We selected a tremolo signal of the following form:

Tr(t) = [1 + D sin(2π fR t)] sin(2π fc t)    (3.1)

where fc is the carrier frequency, fR is the rate (modulation frequency), D is the depth (modulation degree), and fc ≫ fR. Note that the selected tremolo consists of a sinusoidal tone modulated by a slowly fluctuating signal; however, almost any sound could be used in place of the sinusoidal tone. In practical implementations, it is important to keep the signal amplitude within the [−1, +1] range to avoid clipping artifacts. Thus, we implemented a slight modification of (3.1), shown in Fig. 3.4, as follows:

Tr(t) = [1 − D + D sin(2π fR t)] sin(2π fc t)    (3.2)
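The following is a minimal sketch of the tremolo of Eq. (3.2), together with the rate-versus-distance convention described below (faster tremolo closer to the object). The carrier frequency, the depth, and the linear rate mapping are illustrative assumptions; the 3–10 Hz rate range is taken from the text.

```python
# Sketch of the tremolo of Eq. (3.2) and of the rate-versus-distance convention
# described below (the closer the finger, the faster the tremolo). The carrier
# frequency, the depth, and the linear rate mapping are illustrative assumptions;
# the 3-10 Hz rate range is taken from the text.

import numpy as np

FS = 44100

def tremolo(duration_s, rate_hz, depth=0.4, carrier_hz=440.0):
    """Amplitude-modulated tone per Eq. (3.2); stays within [-1, 1] to avoid clipping."""
    t = np.arange(int(FS * duration_s)) / FS
    envelope = 1.0 - depth + depth * np.sin(2 * np.pi * rate_hz * t)
    return envelope * np.sin(2 * np.pi * carrier_hz * t)

def rate_from_distance(norm_distance, min_rate=3.0, max_rate=10.0):
    """Map normalized distance (0 = at the object, 1 = farthest) to the tremolo rate."""
    return max_rate - (max_rate - min_rate) * norm_distance

if __name__ == "__main__":
    signal = tremolo(1.0, rate_from_distance(0.2))
    print(len(signal), float(signal.max()), float(signal.min()))
```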

Figure 3.4. Tremolo signal (amplitude vs. time in samples)

Typical values for the rate are between 3 Hz and 10 Hz. As we increase the rate fR of the tremolo, the listener perceives a faster periodic sound [73]. We use this rate change to encode distance information: as the finger moves closer to the object, the rate increases, and vice versa. Note that since rate is also a temporal cue, rendering distance based on tremolo variations is not instantaneous either.

3.4.4. Directionality Rendering

The advantage of directionality cues (in addition to distance cues) is that the subject does not need to move the finger to find out if she/he is moving in the right direction. This considerably simplifies the exploration task and allows the subject to focus on the perception of the scene layout. We now discuss the basic principles and implementation issues for directionality rendering. Consider the problem of trying to synthesize a realistic and fully immersive acoustic environment for a single listener binaurally over headphones. To do this accurately, one needs to know (1) the location and orientation of the virtual listener's head, and (2) the location of all the acoustic sources and the transfer characteristics of the sound, i.e., how it is modified by reflections, diffractions, and absorptions of the room and the listener's head and ear. The first we will discuss in a separate section below; here we discuss the second. There are two main ways of modeling the transfer characteristics: using a spherical head model and using a head-related transfer function (HRTF). The former is a simple approximation of the transfer function that provides lateralization (angle) but cannot localize the sound, as the rendered sounds are perceived as if they occur inside the head [74, 75]. The HRTF utilizes a significantly more elaborate model for rendering spatialized sound; however, it comes at an increased computational cost. The computation can be reduced significantly [76], making real-time sound rendering via HRTF possible. In the remainder of this thesis, we opt for the more realistic reproduction of the acoustic environment that the HRTF provides. However, the question of whether a simple approximation based on the spherical head model would be equally effective remains open, and should be investigated in the future.


We now discuss some of the implementation issues in more detail. These are particularly important when one uses portable devices with limited memory and computational power, such as consumer tablets, which are not optimized for signal processing tasks. Each individual person has a unique HRTF that models the effects of the person's head and torso. The HRTF has to be measured in a controlled lab environment or simulated using the user's anatomical data, which must be very accurately measured using specialized equipment [77, 78]. Accordingly, obtaining an individual HRTF for each user is a non-trivial task that is costly and time consuming. In order to simplify the procedures for use of the proposed display, as a first simplification step, we chose to avoid individual HRTF measurements and to use HRTF measurements of representative individuals and mannequins that are available in the Center for Image Processing and Integrated Computing (CIPIC) database [60]. However, for dedicated users of the device, and especially VI users, obtaining the individual HRTF may be a worthwhile investment. Another costly operation associated with the HRTF implementation is calculating new points that are not in the measurement grid, as this involves interpolation or recalculation of the simulated HRTF. To reduce the complexity associated with such operations, as well as to reduce additional complexities associated with multiple sound sources, Atkins [76] proposed a method for obtaining the HRTF using a basis of spherical harmonics [79–81]. This transform-domain approach offers a compact framework for transmitting spatial 3-D sound fields with an arbitrary number of sources, and makes it possible to carry out the bulk of the interpolation computations off-line. Another key issue is synchronization between the acoustic rendering and the tactile sensing of the finger (or stylus) position. As the finger moves relative to the object, the HRTF-based directionality rendering must be continuously updated. However, even with the simplifications of Ref. [76], the amount of computation is prohibitive for real-time rendering of directionality via HRTF in currently available portable devices, such as the iPad. In order to simplify the computation, and to minimize the delay between the sensed finger position and the sound rendering, we had to quantize the directionality rendering to a finite set of directions (pie sections), and to preload the sound files for each section. The quantized rendering with preloaded files raises another implementation issue. As the finger moves and the directionality changes from one pie section to another, instantaneous switching from one pre-computed rendering to another would create disturbing artifacts. Thus, it is necessary to continue the current sound until a natural break (the end of a period), at which point the switch to the new direction can be made. However, this introduces a time lag between directional rendering and pointer movements. To minimize this lag, we use sounds with shorter time periods. Finally, another known shortcoming of rendering sound directionality via HRTFs or a spherical head model is front/back confusion, whereby subjects cannot perceive a difference between sounds that are coming from angles (on the horizontal plane) that are symmetric with respect to the axis that passes through the ears. An important cue for alleviating this confusion is the pinnae effect, according to which sounds coming from behind are less bright than those coming from the front.
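To make the quantized directionality rendering concrete, the sketch below maps the finger-to-source direction to one of a small number of pie sections and selects a pre-rendered stereo file for that section. The number of sections and the file-naming scheme are hypothetical; the thesis states only that directions were quantized and the corresponding sound files preloaded.

```python
# Sketch of the quantized directionality rendering: the finger-to-source direction is
# quantized to one of a few pie sections, and a pre-rendered stereo file for that
# section is selected. The number of sections and the file-naming scheme are
# hypothetical; the thesis specifies only that directions were quantized and the
# corresponding sound files preloaded.

import math

NUM_SECTIONS = 8   # assumed angular resolution of the pre-rendered sounds

def pie_section(finger_xy, source_xy, num_sections=NUM_SECTIONS):
    """Index of the angular sector containing the source, with 0 degrees pointing
    toward the top of the screen (the fixed head orientation of Section 3.4.5)."""
    dx = source_xy[0] - finger_xy[0]
    dy = finger_xy[1] - source_xy[1]        # screen y grows downward; flip so "up" is positive
    azimuth = math.degrees(math.atan2(dx, dy)) % 360.0
    width = 360.0 / num_sections
    return int((azimuth + width / 2.0) // width) % num_sections

def preloaded_file(object_name, section):
    """Hypothetical name of the pre-rendered stereo file for this object and sector."""
    return f"{object_name}_sector{section:02d}.wav"

if __name__ == "__main__":
    sector = pie_section(finger_xy=(400, 600), source_xy=(600, 200))
    print(sector, preloaded_file("piano", sector))
```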

Figure 3.5. High-pass filter: first-order IIR with the pole at 18 KHz (gain vs. frequency)

This pinnae effect is typically captured by the HRTF, but can be made stronger by emphasizing energy in the higher frequencies of the original non-directional sound before the HRTF transformation. To boost the higher frequencies, we can apply a highpass filter, like the one shown in Figure 3.5. We have applied such a filter to the musical instrument recordings we used in our experiments described in Chapter 5.

3.4.5. Listener Head Orientation

As should be obvious from the above discussion, knowing the head location and orientation is necessary for synthesizing the acoustic signals that are presented to the listener over stereo headphones. There are two fundamental questions that need to be answered in order to understand the pathway through which auditory spatial cues can direct finger motion. The first relates to the actual location and orientation of the listener's head, and the second relates to the location and orientation of the virtual listener's head with respect to the objects in the virtual environment. The first question was relatively easy to resolve. There are two possibilities. The first is that the direction information from spatial cues delivered by the headphones should necessarily be referenced to the actual head location and orientation, because auditory direction is always referenced to head location and orientation in real-world listening. The second is that the actual head location and orientation are irrelevant: the spatial information from the headphones is translated directly into guidance for the finger, which has its own system of reference that is tied to the virtual environment and the task the user is trying to perform. It became clear in early experiments that the latter is the case. The second question concerns the head location and orientation of the virtual listener. The location should obviously be at the location of the finger. However, the head orientation is not as clear. One possibility is that the head orientation is fixed, e.g., always facing the top of the display screen (at zero azimuth, facing north). We refer to this as the fixed orientation. The other is that it is based on short-term memory of previous motions. The most reasonable assumption in this respect is that, as the user moves toward the object, she/he is always facing the object. Since the orientation of the virtual listener's head is then derived from the finger trajectory, we will refer to this as the trajectory orientation.
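A minimal sketch of the two head-orientation conventions follows. The thesis specifies only that the trajectory orientation is derived from the finger's recent motion; the particular smoothing over the last few touch samples is an illustrative assumption.

```python
# Sketch of the two head-orientation conventions: fixed (always facing the top of the
# screen) and trajectory-based (facing along the finger's recent motion). The smoothing
# over the last few touch samples is an illustrative assumption.

import math

def fixed_orientation():
    """Virtual listener always faces the top of the screen (zero azimuth)."""
    return 0.0

def trajectory_orientation(recent_points):
    """Heading in degrees (0 = screen top, clockwise), estimated from the last few
    finger positions; falls back to the fixed orientation if the finger is still."""
    if len(recent_points) < 2:
        return fixed_orientation()
    (x0, y0), (x1, y1) = recent_points[0], recent_points[-1]
    dx, dy = x1 - x0, y0 - y1               # flip y so that "up" on the screen is positive
    if dx == 0 and dy == 0:
        return fixed_orientation()
    return math.degrees(math.atan2(dx, dy)) % 360.0

if __name__ == "__main__":
    print(trajectory_orientation([(100, 500), (120, 460), (140, 420)]))  # moving up and to the right
```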


There are other possibilities, for example that the head direction follows the orientation of the finger, but it is easy to reject them, as they lead to very awkward finger twists as the user moves toward the bottom of the display. Moreover, this would not work for a stylus.

3.4.6. Selecting Sound Attributes

We now discuss a number of sound attributes that are important for object identification and localization, as well as for creating a more natural acoustic environment.

3.4.6.1. Tone Frequency: Objects can be identified with a sine tone having a particular frequency and therefore a particular pitch. Sine tones are the simplest sounds that can be used, and they are the first ones we tried. Because of the fundamental tonotopic organization of the auditory system, pitch is a strong identifier in auditory organization, and its role in identifying different discrete objects is a natural one. If pure tones are used, then tone frequencies in the 400–1000 Hz range can be well localized using interaural time differences (ITD), the most powerful localization cue in the free field [82, 83]. The 400–1000 Hz range comprises six critical bands according to the equivalent rectangular bandwidth (ERB) measures of Glasberg and Moore [84]. With each tone required to be in a different critical band, the space can have as many as six different objects. The tones for different objects correspond to different notes of the standard western chromatic scale, to avoid the effects of octave confusion and possible harmonic distortion. The goal is to select tones that can be well enough segregated by the subject that, if two objects are contiguous, sharing a common border, the listener can be presented with both tones, corresponding to the two objects, and realize that the finger is on the border between two identifiable objects.

3.4.6.2. Other Sound Signals: In our initial set of experiments for simple layout perception (discussed in the following chapters), we found that some of the tones designed according to the above rules can be confused. An alternative is to use recorded broadband, natural sounds, e.g., of rubbing or tapping a surface, as well as notes of familiar musical instruments, making use of the fact that timbre is an intuitively natural object identifier. We have also experimented with complex signals, such as chirps and SONAR pings.

3.4.6.3. Onsets: Onsets of sounds occur when the subject taps on the surface. This is natural because percussion instruments produce onsets upon striking. We have used abrupt onsets with negligible rise time. The advantage of abrupt onsets is that they effectively cue the precedence effect [85] and provide another strong localization cue associated with the broadband transient.

3.4.6.4. Added Noise: Peaked broadband noise can be added to a tone or other narrowband acoustic signal to differentiate object from background. In addition, broadband noise provides even stronger localization cues than the pure tones we discussed above, and makes use of helpful details in the head-related transfer function (HRTF), possibly including front-back cues [86].
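As an illustration of the tone-selection rule of Section 3.4.6.1, the sketch below greedily picks chromatic-scale notes in the 400–1000 Hz range that fall in different critical bands, using the Glasberg-Moore ERB width ERB(f) = 24.7(4.37 f/1000 + 1). The 440 Hz reference and the one-ERB separation criterion are assumptions of this sketch, not the exact procedure used in the thesis.

```python
# Sketch of the tone-selection rule of Section 3.4.6.1: chromatic-scale notes in the
# 400-1000 Hz range are accepted only if each new tone lies in a different critical
# band, using the Glasberg-Moore ERB width. The 440 Hz reference and the one-ERB
# separation criterion are assumptions of this sketch.

def erb_hz(f_hz):
    """Equivalent rectangular bandwidth (Hz) at frequency f, after Glasberg and Moore."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def chromatic_scale(low_hz=400.0, high_hz=1000.0, ref_hz=440.0):
    """Equal-tempered chromatic note frequencies within [low_hz, high_hz]."""
    notes = [ref_hz * 2 ** (k / 12.0) for k in range(-24, 25)]
    return [f for f in notes if low_hz <= f <= high_hz]

def pick_object_tones(max_tones=6):
    """Greedily pick notes that are at least one ERB away from every tone chosen so far."""
    chosen = []
    for f in chromatic_scale():
        if all(abs(f - g) >= erb_hz(min(f, g)) for g in chosen):
            chosen.append(f)
        if len(chosen) == max_tones:
            break
    return chosen

if __name__ == "__main__":
    print([round(f, 1) for f in pick_object_tones()])   # up to six well-separated tones
```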


Although the noise is broadband to make best use of the HRTF, it is not spectrally flat. Instead, the noise has a sharp spectral peak at the frequency of the corresponding tone. The role of the peak is to cause the subject to identify the noise with the tone – to segregate them into the same auditory stream [87, 88].

3.4.6.5. Reverberation: Adding reverberation to the sound rendering offers multiple advantages. First, it makes the sound rendering more natural. Second, it can help remove front/back confusion, which is a well-known drawback of HRTF directionality rendering. Third, it can be used to enhance the perception of varying loudness by modulating the direct-to-reverberant ratio. The design of the reverberation will be discussed in more detail in Section 3.4.8 below.

3.4.6.6. Multiple Object Sounds: Another question concerns the ability of listeners to localize and track a sound in the presence of other localized sounds. Tracking can be facilitated by repeated and regular pulsing of sounds in a way that avoids synchronous onsets and profits from the precedence effect.

3.4.7. Using Recorded Musical Instrument Sounds

The original musical instrument recordings are typically a couple of seconds long. However, as we discussed in Section 3.4.4, shorter sound periods are necessary to synchronize the scanning finger movements with the directionality cues. To accomplish this, we extract approximately 150 ms out of the sound recording and add a silence to the end in order to obtain a 250 ms long signal. The insertion of a silent segment creates onsets, which, as we discussed above, provide strong localization cues. In addition, as discussed in the next section, when reverberation is added to the signal, it is also important to have sharp offsets, which make the reverberation more noticeable. Note that the onsets, i.e., the attack phases of musical instrument notes, contain very important cues for recognizing the musical instrument. For example, a violin without vibrato could be indistinguishable from a trumpet playing the same note when the onsets are removed. Therefore, the onsets of the original recordings were kept when cropping the sounds, while the sounds were cropped to 150 ms in a way that creates a sharp offset.

3.4.8. Reverberation Rendering

We now discuss the implementation details for adding reverberation to the proposed system of using spatial sound to guide the finger to a particular object on the screen. In the following, we will assume that directionality is rendered with the HRTF and distance by varying the direct-to-reverberant intensity ratio. Schroeder has proposed two artificial room reverberation systems as combinations of allpass and comb filters: (a) a serially connected set of allpass filters and (b) a bank of parallel comb filters connected in series with a couple of allpass filters (comb-allpass system) [89, 90]. In both systems, allpass and comb filters are used to simulate the wall reflections and wave propagation delays. The filter delays should be picked to be mutually prime in order to achieve a uniform and dense reverberation tail.


Figure 3.6. Implementation of allpass-comb reverberation system

Figure 3.7. Comb filter

Moorer suggested inserting a lowpass filter into the feedback loop of the comb filters in the comb-allpass system to address artifacts like metallic-sounding decays and issues with impulsive sounds [27]. The lowpass filter in the feedback loop of the comb filter is used to take the higher-frequency air attenuation into account. Moorer claims to have the best reverberation results with the system consisting of six such parallel comb filters with one allpass filter joined serially [27]. We found that Moorer's best system works well for us with some modifications. Our implementation consists of two allpass filters in series connected to a parallel bank of eight comb filters, as shown in Figure 3.6. The system sampling rate is 44.1 KHz. To prevent “imaging,” we had to make sure that the left and right reverberation tails are decorrelated. We do this by having two separate reverb systems, with different delays for the two sets of eight comb filters. However, the two allpass filters are common to both channels. We chose delays of 263 and 269 samples and gains of 0.5 and −0.5 for the two allpass filters. The output from the allpass filters is amplified by a different DC gain before it is fed to each comb filter. The DC gain gi is set to (1 − di), where di is the maximum gain of the lowpass feedback-loop filter (F-T60) of the i-th comb filter. The outputs of the comb filters are added together and scaled down by √8 (for the 8 comb filters) to get the final reverberation signal. As discussed above, the delays of the two sets of comb filters (left and right channel) are chosen to be mutually prime, with equal logarithmic spacing, and between 300 and 1000 samples.
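The structure just described can be sketched as follows: two series allpass filters (delays 263 and 269 samples, gains 0.5 and −0.5) feed a parallel bank of eight lowpass-feedback comb filters whose outputs are summed and scaled by 1/√8. The per-comb feedback gain and the one-pole damping coefficient below are placeholders; the thesis derives them from a frequency-dependent T60 profile (see Eqs. (3.3)–(3.4) below), and the input scaling only approximates the gi = 1 − di rule.

```python
# Sketch of the two-allpass / eight-comb reverberator described above (one channel).
# The feedback gain and lowpass damping are placeholder values; in the thesis they
# are derived from a frequency-dependent T60 profile via the F-T60 filter.

import numpy as np

LEFT_COMB_DELAYS = [317, 353, 409, 491, 593, 691, 839, 1009]   # samples, left channel

def allpass(x, delay, gain):
    """Schroeder allpass section: y[n] = -g*x[n] + x[n-M] + g*y[n-M]."""
    y = np.zeros_like(x)
    for n in range(len(x)):
        x_d = x[n - delay] if n >= delay else 0.0
        y_d = y[n - delay] if n >= delay else 0.0
        y[n] = -gain * x[n] + x_d + gain * y_d
    return y

def lowpass_comb(x, delay, feedback=0.8, damp=0.4):
    """Comb filter with a one-pole lowpass in the feedback loop; the input is scaled
    by (1 - feedback), approximating the g_i = 1 - d_i DC-gain rule."""
    y = np.zeros_like(x)
    lp = 0.0
    for n in range(len(x)):
        y_d = y[n - delay] if n >= delay else 0.0
        lp = (1.0 - damp) * y_d + damp * lp          # one-pole lowpass of the fed-back sample
        y[n] = (1.0 - feedback) * x[n] + feedback * lp
    return y

def reverb_channel(x, comb_delays=LEFT_COMB_DELAYS):
    z = allpass(allpass(x, 263, 0.5), 269, -0.5)      # two series allpass filters
    combs = [lowpass_comb(z, d) for d in comb_delays] # parallel comb bank
    return sum(combs) / np.sqrt(len(comb_delays))     # sum and scale by 1/sqrt(8)

if __name__ == "__main__":
    impulse = np.zeros(11025); impulse[0] = 1.0       # 0.25 s impulse at 44.1 KHz
    tail = reverb_channel(impulse)
    print(len(tail), float(np.abs(tail).max()))
```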


We first selected the delays of one channel to satisfy the above requirements and then picked the delays of the second channel to be neighboring prime numbers to the first. The delay values we picked for the left and right channels are {317, 353, 409, 491, 593, 691, 839, 1009} and {313, 347, 419, 503, 599, 709, 853, 991}, respectively. Since the acoustic environment we render does not necessarily have to be realistic, it is sometimes desirable to overemphasize some cues in order to make it easy for subjects to perceive the encoded information. In the case of reverberation, our preliminary experiments suggested that the strength of the reverberation signal generated by our system with T60 values (the time required for reverberations to decay by 60 dB) that correspond to natural settings does not produce a sufficient reverberation level to make the loudness and front/back judgments easy and precise. Therefore, the reverberation signal was amplified by 15 dB before it was added to the direct sound. A longer-than-usual T60 is also preferred to overemphasize the reverberation. We thus selected T60 to be equal to 16 seconds. Note that with a T60 of 16 seconds, a direct sound of 250 ms will generate a reverberation signal of length 3 s (the tail after 3 s is negligible). In the configurations in which we used reverberation, the direct sounds were periodic with a period of 500 ms. This means that the reverberation tail of one period overlaps with several subsequent sounds. These “leaked” tails contribute to building a significant and stable reverberation level. With an unusually long T60 (16 s), this stable reverberation level is reached faster. Since the F-T60 filter is intended to act as a high-frequency air-attenuation simulator, the filter is designed to approximate a given T60 profile by its frequency response. We started with the desired T60 times for four frequencies (DC, 1 KHz, 8 KHz, 20 KHz) and calculated the required gain G of the F-T60 filter to provide those T60s, using the following equations:

Gsample^(Fs T60) = 0.001    (3.3)

G = Gsample^n    (3.4)

where Gsample is the gain per sample, Fs is the sampling frequency, and n is the delay of the F-T60 filter. The four T60 values used for these four frequencies are 16, 8, 4, and 2 seconds, respectively. Note that these T60 values are common to all comb filters, but the resulting gains will be different, as each comb filter has a unique delay. The T60 values for all frequencies are derived using these four specific gains and log-linear interpolation. The F-T60 filter is designed as a 6th-order IIR filter that has the same frequency response as the above-calculated T60 profile, using the Yule-Walker method [91]. In a reverberant environment where reflections are omnidirectional, the change of distance and directionality from the sound source to the listener has a negligible effect on the reverberation [27]. Thus, the reverberation and the direct stereo spatial signal (with directionality and distance cues) are generated separately based on the input mono non-directional sound. Then the direct and reverberation signals are added to get the final (wet) stereo sound, as shown in Figure 3.8.
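A minimal sketch of the gain computation of Eqs. (3.3)–(3.4): the per-sample gain decays to −60 dB (a factor of 0.001) over T60 seconds, and the gain applied at a comb filter with delay n samples is that per-sample gain raised to the n-th power.

```python
# Sketch of Eqs. (3.3)-(3.4): compute the gain applied at a comb filter of delay n
# samples so that the loop decays by 60 dB (a factor of 0.001) in T60 seconds.

FS = 44100  # system sampling rate in Hz

def comb_gain(t60_s, delay_samples, fs=FS):
    g_sample = 0.001 ** (1.0 / (fs * t60_s))   # Eq. (3.3): Gsample^(Fs*T60) = 0.001
    return g_sample ** delay_samples            # Eq. (3.4): G = Gsample^n

if __name__ == "__main__":
    # Gains for the left-channel comb delays at the DC T60 of 16 s used in the thesis.
    for n in (317, 353, 409, 491, 593, 691, 839, 1009):
        print(n, round(comb_gain(16.0, n), 5))
```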


Figure 3.8. Generating spatialized reverberant sound


CHAPTER 4

Proposed Configurations

We now present several configurations for the embodiment of the proposed approach for conveying graphical and pictorial information via acoustic and tactile signals, using the index finger or a stylus as the primary pointing device. The configurations are organized into four groups. In the first group, we will consider the perception of an isolated shape. In the second group, we will consider the localization of a simple object (a dot). The third group will be used to explore the perception of a simple 2-D layout of isolated dots of identical size. Finally, we will consider a number of relatively simple configurations that can be tied to specific applications. The first will be the “virtual cane” configuration, a simple scene layout consisting of objects in a linear arrangement, each with a distinct tapping sound. The second will be the “Venn diagram” configuration, where the subjects explore a diagram with three overlapping circles of different sizes. As examples of other possible applications, we will also briefly discuss other configurations, like maps, block diagrams, charts, and simple scenes; however, detailed experimentation with these configurations is beyond the scope of this thesis. In all of the configurations, the primary feedback will be in the form of sound played back through stereophonic headphones. However, in one of the configurations, we also superimpose a raised-dot pattern embossed on paper on the screen to test the joint perception of acoustic and tactile signals.

4.1. Shape Configurations

The identification of objects and their geometrical shapes is important for almost all VS tasks, such as the perception of maps, sketches, graphs, and images, as well as navigation in a real or virtual environment. It is thus important to find efficient, intuitive, and practical ways of rendering geometrical shapes using the proposed display. Towards this goal, we have designed and implemented several configurations on an Apple iPad 1; however, any other device with a touch-sensitive screen can be used. The configurations we describe below present one object at a time on the touch screen. However, the same algorithms, with minor adjustments, can be used to represent multiple objects on the screen. The goal is to locate the object and to trace and identify its shape. Object shape can be represented as a line drawing or as a solid shape (a region with filled interior). Since the exploration is done by finger scanning, the proposed approach can be compared to haptic perception of raised-line drawings [67, 92, 93]. In a study of haptic picture perception, Thompson et al. found that solid shapes (object interiors filled with embossed tactile textures) were easier to recognize than raised-line drawings [92].


Figure 4.1. Training object in the touch screen for different configurations: (a) S1-2cons; (b) S2-3cons, S4-3int, and S5-hrtf

Based on this observation, we selected the solid shape representation for the first configuration. However, the two representations can be combined to enhance overall perception, which we did for the remainder of the shape configurations.

4.1.1. Configuration S1-2cons: Shape Representation with Two Constant Sounds

The touch screen is partitioned into two regions, background and object, each represented by a distinct sound, as shown in Fig. 4.1(a). One sound is played when the subject's scanning finger is inside the object and the other when the finger is outside. Both sounds are spatially constant. The advantage of this configuration is that any time the finger is touching the screen, the subject has a clear indication of whether it is located inside the object or in the background. The edge of the object can be located at the transition between the two sounds.

4.1.2. Configuration S2-3cons: Shape Representation with Three Constant Sounds

The touch screen is partitioned into three regions, background, object interior, and object border, each represented by a distinct, spatially constant sound, as shown in Fig. 4.1(b). During pilot experiments with S1-2cons, we found that some subjects were attempting to trace the edges of the object. For that, they had to move their finger in a zig-zag fashion around the edge, listening to sound transitions, to make sure that it was on the edge. This is quite awkward and confusing. Thus, in order to facilitate edge tracing, we added a relatively thin strip with a distinct sound around the border. Determining an optimum width for the strip is important, as it can have a significant effect on performance. On the one hand, the strip must be wide enough to be robust to changes in the centroid of the finger contact area due to unintended rolling and turning of the finger, which create tracking instabilities. On the other hand, the strip should be narrow in order to guide the finger correctly. A wide strip allows significant perturbations in the scan direction, as shown in Fig. 4.2. This border strip obviously represents a vertical edge, but the subject can scan it vertically, as shown in Fig. 4.2(a), or diagonally, as shown in Fig. 4.2(b).
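As a concrete illustration of the three-segment partition of S2-3cons, the sketch below classifies a touch point as object interior, border strip, or background for the simple case of a circular object. The circle center and radius are illustrative; the 50-pixel strip width corresponds to the 0.38 inch strip adopted below.

```python
# Sketch of the three-segment partition of S2-3cons for a circular object: the touch
# point is classified as object interior, border strip, or background, and the
# corresponding sound would be selected. The circle center and radius are illustrative;
# the 50-pixel strip width corresponds to the 0.38 inch strip adopted below.

import math

STRIP_WIDTH_PX = 50   # 0.38 in * 132 px/in on the Apple iPad 1 screen

def segment_for_touch(touch_xy, center_xy=(384, 512), radius_px=200):
    d = math.dist(touch_xy, center_xy)
    if d < radius_px - STRIP_WIDTH_PX / 2:
        return "object"        # play the object-interior sound
    if d <= radius_px + STRIP_WIDTH_PX / 2:
        return "border"        # play the distinct border sound
    return "background"        # play the background sound

if __name__ == "__main__":
    for point in [(384, 512), (384, 312), (50, 50)]:
        print(point, segment_for_touch(point))
```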


Figure 4.2. Two possible finger scanning paths inside a border strip

After some trial and error, we converged on a 0.38 inch (0.9 cm) strip, which corresponds to 50 pixels on the 132 pixels/inch Apple iPad 1 screen. However, a problem remains. Even though the subject has a clear indication that the finger is on the border, the finger may still bounce back and forth between the border and the surrounding segments (background and object interior) as it traces the border. We thus consider alternative strategies for guiding the finger along the border.

4.1.3. Configuration S3-2trem: Shape Representation with Tremolo

One approach for guiding the finger along the boundary is adding distance feedback near the boundary. One way of doing this is to use a tremolo signal. (See Section 3.4.3.3 for a technical description of tremolo and how it can be used to render distance.) The idea is that, instead of a thick border defined by a (spatially) constant sound, we reduce the border to a line, but add strips on each side of the border, as shown in Fig. 4.3.

Figure 4.3. Training object in the touch screen for Configuration S3-2trem: (a) full screen; (b) detail

When the finger is inside the strip on the background (object) side, the background (object) sound is modified to give a clear indication that it is moving toward or away from the boundary. When the finger is in the background or the object area, the sound is constant. We use tremolo signals with different carrier frequencies and depths to denote the background and object, on either side of the border (orange line), including the border strips


(cyan and gray). The tremolo rate is constant within each segment (background, object), except when the finger enters the (cyan and gray) border strips, where it varies to indicate movement toward or away from the border. 4.1.4. Configuration S4-3int: Shape Representation with Three Sounds and Intensity In the first set of subjective experiments (to be described in Chapter 5), we observed a drop in the performance (accuracy) of S3-2trem compared to that of S2-3cons, in spite of the additional distance information near the border. This can be attributed to the fact that changes in the rate of the tremolo are not perceived instantaneously. This is because users have to listen to at least a few periods of the signal before they can detect rate changes (defined in Section 3.4.3.3). Thus, given the relatively small border width and relatively fast finger movements, the effectiveness of the tremolo in providing distance information did not meet our expectations. The fourth configuration is an attempt to combine the best attributes of S2-3cons and S3-2trem. The advantage of S2-3cons over S1-2cons and S3-2trem can be attributed to the use of a distinct sound for the border segment, and the use of instantaneous cues (timbre) to distinguish the three sounds. The challenge was thus to maintain the three distinct sounds, and at the same time, to provide a strong, instantaneous distance cue within the border. For this, we chose intensity variations of the border sound, highest at the center of the border segment, and decreasing as the finger approaches the edges. The advantage of intensity is that it is an instantaneous cue that people are familiar with, and thus requires no training (as musical concepts like tremolo might). In contrast to S3-2trem, where the sound identity of the object and background are maintained within the border strips, here there is only one sound in the border segment, but we use intensity variations as a strong cue for keeping the finger within the segment as it traces the border. For the intensity variations we used an exponential drop in volume with the distance of the finger from the center line of the border. We believe that this provides a close analogy with raised-line tracing, where the relief is maximal at the center and decays rapidly with distance from the center. 4.1.5. Configuration S5-hrtf: Shape Representation with HRTF This configuration explores the use of spatialized sound (distance and directionality) for both locating the object in the background and tracing its border. As we discussed in Section 3.4.4, the advantage of directionality cues is that the user does not need to move the finger to find out if she/he is moving in the right direction. This considerably simplifies the exploration task (finding the object and tracing its boundary), and allows the subject to focus on the perception of object shape and scene layout. Also, in contrast to the previous two configurations, which provided distance feedback only in the neighborhood of the object boundary, the idea here is that spatialized sound can be used to guide the


finger to the object from any point on the touch screen. Thus, the subject does not need to wander around the background before finding the object, which reduces the time for finding the object. In S5-hrtf, the screen is divided into three segments (object, background, and border), each with a distinct sound, as in S2-3cons and S4-3int (Fig. 4.1(b)). When the finger is inside the object, the sound is (spatially) constant. In the background and border segments, directionality is introduced via the HRTF. (See Section 3.4.4 for the details of how the HRTF can be used for directionality rendering in the proposed display.) When the finger scans the background segment, a 2-D virtual acoustic scene is formed by assuming that the listener is in the position of the scanning finger, facing toward the top of the screen (the fixed orientation described in Section 3.4.5). The sound source that emits the characteristic sound of the object is assumed to be located inside the object at the point nearest to the finger location (rather than at the center of the object). The sound intensity is proportional to the inverse of the squared distance between the listener and the source (plus a DC offset to prevent the sound from becoming inaudible), thus providing distance information in addition to directionality. When the finger scans the border segment, the same assumptions for the virtual listener hold as in the background segment, but the sound source that emits the characteristic border sound is placed in the direction that the user needs to follow in order to keep tracking the border in the clockwise direction. Thus, the directional sound guides the subject around the border. In this case, the sound intensity remains constant.

4.2. Navigation Configurations (Object Localization)

As we discussed, object localization is an essential part of any graphical display of information. In this section we focus on the localization of an isolated object on the touch screen. A simple object (a dot of a given radius) is placed at a random position on the touch screen, and the user tries to locate it starting from an arbitrary position on the screen (selected by the user). An example of a 2-D layout that corresponds to this configuration is shown in Figure 4.4.

Figure 4.4. Localization of a single object on the touch screen

We will consider two main variations of this configuration. One makes use of spatial sound to guide the finger to the object, and the


other does not. In both cases, kinesthetic feedback enables the user to perceive the object position on the screen.

4.2.1. Configuration N1-ung: Unguided Object Localization

In this configuration there are two segments, object and background. The object is represented by its characteristic (spatially) constant sound, while the background is silent. That is, when the finger is in the background, there is no guidance to the object. The user is expected to scan the screen in some arbitrary fashion (e.g., raster scan) in order to find the object.

4.2.2. Configuration N2-guided: Object Localization Guided by Spatial Sound

This configuration also consists of two segments, object and background, but now different sounds are associated with each segment. The object is represented by its characteristic (spatially) constant sound and no directionality cues. In the subjective experiments of Chapter 5, a tone or a musical instrument recording has been used to represent the object. The background is associated with a spatially varying sound that is intended to guide the user to the object using distance and directionality cues. In order to be consistent with the layout configurations of the next section, we use the same basic sound (tone or musical instrument recording), so that each object is associated with its own background, but modify it in a way that makes it clearly distinguishable from the object. We have tried two basic approaches for this. In one approach, the background sound consists of the object sound plus broadband noise. To indicate distance, we either vary the signal-to-noise ratio (intensity based) or vary the tempo. In the second approach, we add reverberation to the background signal. The direct-to-reverberant energy ratio is then used to indicate distance from the object (see the sketch at the end of this section). In both cases, directionality is introduced via the HRTF. When the finger enters the object segment, there is a clear step from a noisy to a clean signal, or from a reverberant to a non-reverberant environment. In addition, there is always a change in tempo (faster inside the object). As far as directionality rendering is concerned, there are also two options, fixed orientation and trajectory orientation, both of which were described in Section 3.4.5.

4.3. Layout Configurations

In this section we focus on the identification and localization of simple objects in a 2-D scene for the perception of a 2-D layout. Since the perception of size and shape was considered in Section 4.1, we assume that all objects have the same shape and size (a small dot), and rely on the characteristic sound for identification. An example of a 2-D layout is shown in Figure 4.5(a). We present three configurations for rendering a 2-D layout. The first is unguided, as in navigation configuration N1-ung. The other two are guided with spatial sound, as in N2-guided, but in one, the objects are presented sequentially, one at a time, in a prespecified order, while in the other, the user is free to jump around searching for objects in any desired order.
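To make the wet/dry convention of N2-guided concrete, here is a minimal sketch of the reverberation-based distance cue: the dry (direct) signal is attenuated as the finger moves away from the object while the reverberant level stays fixed, so the direct-to-reverberant ratio encodes distance. The 15 dB wet boost follows Section 3.4.8; the linear dry-gain curve and the noise stand-in for the reverberation tail are illustrative assumptions.

```python
# Sketch of the reverberation-based distance cue: the dry signal is attenuated with
# distance while the wet (reverberant) level stays fixed, so the direct-to-reverberant
# ratio falls as the finger moves away from the object. The linear dry-gain curve and
# the noise stand-in for the reverberation tail are illustrative assumptions; the
# 15 dB wet boost is the value used in Section 3.4.8.

import numpy as np

WET_BOOST = 10 ** (15.0 / 20.0)   # reverberation amplified by 15 dB before mixing

def mix_dry_wet(dry, wet, norm_distance):
    """Mix direct and reverberant signals for a finger at normalized distance
    (0 = at the object, 1 = farthest point of the object's background)."""
    d = min(max(norm_distance, 0.0), 1.0)
    dry_gain = 1.0 - 0.8 * d                     # assumed attenuation curve
    n = min(len(dry), len(wet))
    return dry_gain * dry[:n] + WET_BOOST * wet[:n]

if __name__ == "__main__":
    t = np.arange(44100) / 44100.0
    dry = np.sin(2 * np.pi * 440 * t)            # object sound (a 440 Hz tone)
    wet = 0.01 * np.random.randn(len(t))         # stand-in for the reverberation tail
    near, far = mix_dry_wet(dry, wet, 0.1), mix_dry_wet(dry, wet, 0.9)
    print(float(np.std(near)), float(np.std(far)))   # the near mix has a stronger direct component
```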


Figure 4.5. Simplified 2-D layout for configurations L1-ung and L2-seq

4.3.1. Configuration L1-ung: Unguided Simple Layout

In this configuration the objects are identified by their unique characteristic sounds, while the background is silent. Thus, when the finger is in the background, there is no guidance to an object. The user is expected to scan the screen in some arbitrary fashion (e.g., raster scan) in order to find the different objects. Since all of the objects are presented at the same time, the order in which the user discovers the objects depends on the scanning strategy. Note that multiple fingers can be used, in order to increase the chances of finding an object, but if two objects are touched simultaneously, the user will have to move/lift the fingers so as to determine which object is associated with which finger. This configuration was designed to serve as a benchmark for comparing the performance of the other two, more sophisticated, configurations that rely on spatial sound.

4.3.2. Configuration L2-seq: Sequential Layout Exploration

In this configuration, the user is presented with the objects on the screen one at a time using the finger (or stylus). As shown in Figure 4.6(a), the finger is guided first to one object. Once it reaches that object, it is then directed to the next object, as shown in Figure 4.6(b), and so on. The order could be random or deterministic, for example, each time the subject explores the object closest to the finger. To make sure that all objects are visited, the exploration can be organized in cycles, with each object visited once during each cycle. The subject is also allowed to skip objects whose position she/he is already familiar with. The signaling between the user and the system can be done via voice commands (by the system and the user) and tapping on the screen and finger gestures (by the user). For example, in the system setup we used for the experiments we describe in Chapter 5, the user is first directed to the object closest to the finger on the screen. Once the finger reaches this object, the object is marked as “inactive” and the user is directed to the next closest object (from the current position of the finger).
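The following sketch illustrates the cycle logic just described: at each step the closest still-active object to the current finger position becomes the target, and the cycle ends when every object has been visited. The object names and coordinates are illustrative.

```python
# Sketch of the cycle logic for L2-seq: each step targets the closest still-active
# object to the current finger position; when all objects have been reached, the
# cycle is complete. Object names and positions are illustrative.

import math

def next_target(finger_xy, objects, inactive):
    """Return the name of the closest active object, or None when the cycle is done."""
    active = {name: xy for name, xy in objects.items() if name not in inactive}
    if not active:
        return None
    return min(active, key=lambda name: math.dist(finger_xy, active[name]))

if __name__ == "__main__":
    objects = {"drum": (100, 700), "piano": (500, 300), "bell": (650, 900)}
    finger, inactive, order = (400, 400), set(), []
    while (target := next_target(finger, objects, inactive)) is not None:
        order.append(target)
        inactive.add(target)        # the object was reached: mark it inactive
        finger = objects[target]    # the finger is now at the reached object
    print(order)                    # visiting order for one exploration cycle
```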

Figure 4.6. Configuration L2-seq: Sequential exploration of a 2-D layout (panels (a)–(c))

The process is repeated until all the objects have been explored, i.e., they have become inactive. At this point the cycle is complete, and all the objects are activated again for the next cycle. To guide the finger to an object we rely on spatial sound, using one of the approaches described in Section 4.2.2. At any one time only one object is “visible” in the display, and the finger is guided to that object. Thus, at any one time, the screen is divided into the object (a small dot) and the background of this object, exactly as in Section 4.2.2. The advantage of this configuration is that it ensures that the user will explore all the objects in the layout. On the other hand, the exploration is constrained, and going back and forth between objects is not always straightforward. An alternative approach is explored in the next section.

4.3.3. Configuration L3-vor: Voronoi Layout

In this configuration the objects exist simultaneously on the screen, and each has its own background region. The background region of each object includes the points that are closest to this object. This partitions the space into what are called Voronoi regions. One advantage of this representation is that the background of each object is limited to the corresponding Voronoi region, as opposed to L2-seq, where objects are presented one at a time and each background spans the entire screen. This reduces the range over which the sound intensity or tempo variations must be spread, thus making distance rendering more effective.1 Note that, even though all the objects are simultaneously present in the display, the user is listening only to the object that her/his finger is closest to. The sound then guides the finger to the object, as in Section 4.2.2. However, if the finger moves in the wrong direction, another Voronoi region may be entered, and another sound will be heard.

1 The distance rendering is calibrated so that the range can cover the largest possible distance. The variation of the distance cue as a function of distance is the same for all objects.


Figure 4.7. Configuration L3-vor: Simultaneous presentation of objects in a 2-D layout

The advantage of this configuration is that the user is free to explore any region of the screen she/he desires. This configuration also allows the user to get a quick, rough estimate of the locations of the various objects (a “bird's-eye view”) by moving the finger around to discover the different Voronoi regions. Each time, the user hears one object, while the directionality and distance cues give her/him an instant rough idea of where the object can be found (direction and distance). In this configuration, as in L1-ung, the user is free to use multiple fingers. However, since navigation is guided, there is no need to make use of multiple fingers to increase the chances of finding an object, and it would be difficult to navigate to multiple objects simultaneously. Instead, the user can use multiple fingers to establish (feel) the relative locations of different objects, once each of them is located. On the other hand, there is no guarantee that the user will not miss one of the objects. To make sure that the user is aware of all the objects in the display, we have added an initial sequential mode, whereby the user hears each of the objects on the screen by double-tapping the finger on the screen. However, in contrast to L2-seq, no navigation is encouraged. Thus, the main goal of this initial mode is to identify the objects and (via the distance and directionality cues) to give the user a rough idea of where each object is, relative to the tapping point, without having to navigate to each object. Actually, to further discourage the user from trying to navigate to the objects while in this mode, the distance and directionality cues do not adapt if she/he moves the finger.
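A minimal sketch of the Voronoi behavior: the object whose region contains the finger is simply the nearest one, and the distance and direction reported to the sound renderer are computed with respect to that object. Object names and coordinates are illustrative.

```python
# Sketch of the L3-vor behavior: the finger always hears the nearest object (the owner
# of the Voronoi region containing the touch point), and the distance and direction
# cues are computed with respect to that object. Names and coordinates are illustrative.

import math

def active_object(finger_xy, objects):
    """Return (name, distance, azimuth_deg) for the object whose Voronoi region
    contains the finger; azimuth is measured clockwise from the screen-up direction."""
    name = min(objects, key=lambda k: math.dist(finger_xy, objects[k]))
    ox, oy = objects[name]
    dx, dy = ox - finger_xy[0], finger_xy[1] - oy    # flip y so that "up" is positive
    return name, math.dist(finger_xy, objects[name]), math.degrees(math.atan2(dx, dy)) % 360.0

if __name__ == "__main__":
    objects = {"drum": (100, 700), "piano": (500, 300), "bell": (650, 900)}
    print(active_object((450, 500), objects))   # the piano, up and slightly to the right
```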

4.4. Special Configurations

We now show how the basic configurations of the previous sections can be extended to handle a variety of useful applications. The configurations we present in this section build on the techniques for object identification, shape and size perception, navigation, and layout perception discussed above.


4.4.1. Configuration A1-cane: Scene Perception – Virtual Cane

In this configuration, we consider a simple scene consisting of several objects in a linear arrangement, as in Figure 1.3. For example, this could be a scene consisting of buildings and trees that the user would see if she/he were looking out of a window or standing at the opposite side of the street. The goal is to convey the relative position, size, identity, and material composition of each object in the scene. The rendering of shape, considered in Section 4.1, could be handled separately, e.g., in a zoomed-in mode, one object at a time. Here we focus on the layout and other object attributes.

For object identification, one possibility is to map each object to a unique, distinguishable, but otherwise arbitrary sound. However, a more intuitive assignment would be desirable, and since the user moves the finger on the screen to explore the objects, a characteristic rubbing sound would be the obvious choice. We found, however, that striking sounds are better than rubbing sounds for identifying the material from which the object is made. For instance, McAdams et al. [94] used striking sounds to distinguish between aluminum and glass. Since this is a virtual world, the interface has to be effective, not realistic; thus, we can use striking sounds in response to rubbing the finger on the touch screen.

The inspiration for this configuration comes from the "long cane," the oldest and most widely used visual substitution tool. VI people use the long cane for navigation, continuously tapping their surroundings to detect and identify objects, obstacles, and other landmarks. The long cane provides valuable information about the location, shape, size, and even material composition of the objects, as the user not only "feels" the objects, but also listens to the tapping sounds. In this configuration, we imitate the idea of a VI person exploring a scene (e.g., an outdoor scene outside her/his window) using a virtual cane (a very long cane in the case of the outdoor scene) to tap on the objects. This could be realized using a camera to snap a picture, analyzing it to obtain meaningful segments (objects or parts thereof), and then representing each segment with a characteristic tapping sound. Of course, obtaining a meaningful region-based representation can be quite difficult. Here, we assume the availability of a semantic representation, which could also be available in the form of maps, graphics, etc.

We envision two modes of operation, zoomed-out and zoomed-in. In the zoomed-out mode, a number of objects are placed on the touch screen. In our initial setup, we assumed that the objects are disjoint and in a linear arrangement, as shown in Fig. 4.8(a). However, more complicated layouts with connected (touching) objects, as in Fig. 1.3, and even overlapping (with occlusion) objects, can also be handled by the same approach. A prerecorded characteristic tapping sound is assigned to each object, while the background is silent. As we discussed above, in the zoomed-out mode, the user gets information about object position, material/identity, size, and some idea about its shape. To further explore the shape of a selected object in finer resolution, the user can enter the zoomed-in mode, e.g., by double-tapping inside the object. In this mode only the chosen object is present, and it is zoomed in at the center of the screen. Any of the configurations in Section 4.1 can be used for that, preferably the most effective. A reserved gesture (e.g., double-tapping on the screen) can be used to go back and forth between the two modes. In addition, a special sound can be used to confirm the mode change.


Figure 4.8. Test scenes for "Virtual Cane" configurations (a) A1-cane and (b) A2-cane (green: wood, blue: glass, red: metal)

4.4.2. Configuration A2-cane: Scene Perception with Overlaid Tactile Imprint

In this configuration, we test the joint perception of acoustic and tactile signals by superimposing a raised-dot pattern embossed on paper on the touch screen. In our implementation, we assumed that the objects are disjoint, as shown in Fig. 4.8(b), and used a dense tactile pattern to represent the objects on a flat background. The advantage of the tactile overlay is that object shape is much easier to perceive without the need for a zoomed-in mode, while the sound can be used for object and material identification. To discriminate between objects that are touching, we would have to use perceptually distinct tactile patterns. Of course, the more tactile patterns we add, the more difficult it will be to distinguish among them. Different tactile patterns can also be used for object/material identification, but again, the more patterns we add, the more difficult it will be to tell them apart. Finally, a disadvantage of the tactile overlay is that it is static. On the other hand, one could use the tactile patterns to display fixed objects (buildings) and sound to display moving objects (cars, people).

4.4.3. Configuration A3-venn: Representing a Venn diagram

A Venn diagram is a simple graphical representation that can provide useful information to VI subjects. A Venn diagram typically consists of several circles of different sizes that may or may not overlap with each other. This configuration is similar to L3-vor with a few changes. An example is shown in Figure 4.9. The goal is for the user to perceive the relative positions and sizes of the circles and the amount of overlap. The screen is partitioned into a background and overlapping circular foreground regions, each with a distinct sound. The background is silent. Each circular region consists of a center (represented by a small dot) and a surrounding area (which is analogous to the Voronoi region in L3-vor). When the finger (or stylus) is on the center dot, the unique, (spatially) constant, nondirectional sound assigned to that circle is played. When the finger is in the


Figure 4.9. Configuration A3-venn: Venn diagram

surrounding area, directional sound is used to guide the finger to the center of the circle, and intensity is used to indicate distance, as in L3-vor. For this configuration we used the direct-to-reverberant energy ratio to indicate distance from the center. To further differentiate the center from the surround, the tempo of the sound at the center is twice that of the surround. However, in contrast to L3-vor, the circles may overlap, as opposed to the mutually exclusive Voronoi regions. In the overlapping areas, the user hears the sounds that correspond to all the overlapping regions. As we mentioned above, the goal is for the user to perceive the relative positions and sizes of each circle and the amount of overlap. The user can follow one of the sound signals to the center of the corresponding circle, thus getting a good idea of the location of the circle. Conversely, the user can follow the sound signal to the border of the circle, thus getting a good idea of the size of the circle, as well as along the periphery, to discover any overlapping circles. Assigning sounds with distinct spectral distributions will help the user perceive the two (or more) simultaneous sounds as a combination rather than as a new sound. Note that, in contrast to L3-vor, the range of sound intensities is the same for all circles, independent of size. Thus, to get an idea of the size of a circle, the user does not have to trace it all the way to the border; the rate of change of the sound is an indication of the size (rapid variations correspond to smaller sizes, and slow variations to larger sizes).

We now briefly describe a few additional configurations, in order to illustrate the potential of the proposed approach. However, the detailed study of and experimentation with these configurations is beyond the scope of this thesis. Four of these are illustrated in Fig. 4.10.

Configuration A4 (block diagram): This is similar to L2-seq, except that the subject can only move from object to object if there is a link between them.


Figure 4.10. 2-D layout with constraints (A4); flow chart (A5); circuit (A6); US map (A7)

Configuration A5 (flowchart): This is similar to A4, but the links are directional, and red nodes are decision nodes that offer two or more options to proceed. The actions at each node and the options at the decision nodes can be indicated with voice commands.

Configuration A6 (circuit): This is similar to A4. Each circuit element (resistor, capacitor, etc.) is associated with the sound of a particular instrument, with different notes or voice commands indicating different values. Spatial sound is used to guide the finger to an element or away from it towards the nodes, which are marked by special sounds. To get a better sense of the wires connecting the elements, the finger is guided to stay close to the wire. The background is silent.

Configuration A7 (US map): Directional sound points to the centroid of the state or country (indicated with a black dot in Fig. 4.11), while intensity (or tempo) decreases as you move away from the centroid. When you cross the border, you hear the sound of the neighboring state. Note that, as in A3-venn, the user can get an idea of the size of the state from the rate of change of the sound. Thus, in California, the sound changes rapidly in the east-west direction, and slowly in the north-south direction. Of course, the details of the border shape cannot be rendered accurately with sound only. However, the user will have the option of zooming and scrolling with finger gestures, as in visual applications for cell phones and tablets. For identification, each state could have its own sound. Alternatively, voice feedback can be used to name the state, in which case only four different sounds will suffice for differentiating between neighboring states (because of the four color theorem [95]). The addition of tactile overlays (a different raised-dot pattern for each country or raised lines for the borders) is also possible, but limits the flexibility of the display.

We now consider more complex, advanced applications that utilize both modalities. These are illustrated in Fig. 4.11.

Configuration A8: This is a variation of A2-cane, with raised-dot patterns used for fixed objects (buildings) and sounds used for moving objects (cars, people). Alternatively, raised-dot patterns and tapping sounds can be used for fixed objects and other (nontapping) sounds for moving objects.


Figure 4.11. Long cane (A8); campus map (A9); photograph (A10)

Configuration A9 (campus map): We can also combine the tactile and acoustic modalities for rendering a campus map. Buildings, roads, and other segments can each be represented by tactile and acoustic fields, as shown in Fig. 4.11 (A9). Sounds can also be used to represent dynamic objects (red dots in the figure).

Configuration A10 (photo): A complex scene is segmented into regions, each represented by a different tactile pattern and sound, as shown in Fig. 4.11. Border emphasis may be added. This is one of the ultimate goals of the proposed research: to be able to convey an entire image via touch and sound. As we discussed, we will assume that the scene has already been segmented into regions. The goal is then to select a tactile-acoustic texture for each segment so that it is distinct from the surrounding segments and it conveys information about the nature of the segment. The success of this configuration will be affected by the number, size, and shape of the segments, as well as the use of explicit boundaries.
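A common element of A3-venn and the map-like configurations above is deciding which region or regions contain the touch point and mixing the corresponding sounds. The sketch below illustrates this for overlapping circles; the data structure and function names are hypothetical and not taken from the actual implementation.

```python
import math

def sounds_at_point(point, circles, center_dot_radius=10.0):
    """Return the list of (sound, zone) pairs to mix at a touch point
    (illustrative sketch for Venn-diagram-style layouts).

    point:   (x, y) touch position.
    circles: list of dicts with keys "center", "radius", "sound"
             (hypothetical structure, one entry per circle).
    """
    playing = []
    for c in circles:
        d = math.dist(point, c["center"])
        if d <= center_dot_radius:
            # On the center dot: the circle's constant, nondirectional sound.
            playing.append((c["sound"], "center"))
        elif d <= c["radius"]:
            # In the surrounding area: directional sound toward the center,
            # with a distance cue. Overlapping circles all contribute, so
            # the user hears a mixture in the overlap areas.
            playing.append((c["sound"], "surround"))
    # An empty list corresponds to the silent background.
    return playing
```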


CHAPTER 5

Subjective Experiments

In this chapter we present a series of subjective experiments, which were conducted based on the configurations described in Chapter 4. We first describe the experimental setup in general and then provide specific details for each experiment or group of experiments, organized, according to the configurations they are based on, into shape, localization, layout, and applications experiments, as in Chapter 4.

5.1. Subjects

The subjects that participated in these experiments were not experts in acoustic or tactile signal processing and perception, and were not familiar with the detailed goals of the experiments. The subjects had different educational backgrounds and different degrees of experience with touch-screen devices. There was no financial reward for participation in the experiments.

5.2. General Procedure

All the experiments were performed in a reasonably quiet room to avoid disturbances. The subjects interacted with a touch-screen tablet and listened to auditory feedback on stereo headphones. The subjects were asked to scan the screen using only one finger or a stylus; however, they were allowed to select any finger and were also allowed to switch fingers in order to avoid discomfort. The subjects were instructed to avoid any contact with the touch screen apart from the scanning finger or stylus. To eliminate visual contact with the touch screen and the scanning finger or stylus, the touch screen was placed in a box open in the front, so that the subject could insert her/his hand to access the screen [96]. The box was placed on a table, and the subject was seated in front of the table. The touch screen was placed horizontally in the box in landscape orientation, and the subject was asked not to move it during the experiment. Blocking all visual contact of the subjects with the touch screen and the scanning finger/stylus is important for eliminating any visual cues for the perception of the graphical information presented in acoustic-tactile form. Note that simply eliminating the visual rendering of the 2-D shape or layout on the touch screen is not enough, because watching the finger movements can provide strong shape identification clues, which would not be available to VI subjects.

The subjects were given written instructions at each stage of the experiments. They were also given time to ask questions, until the instructions were completely clear. At the


very beginning, they were given a general introduction for the entire set of experiments, and then at the beginning of each experiment, they were given configuration-specific instructions. For each configuration, the participants were shown a training example that had a different shape/layout than the one used in the actual experiment. During the training examples, the subjects were at first able to see the scanning finger and the shape/layout presented on the touch screen; then, they repeated the trial visually blocked, as in the actual experiment.

5.3. Equipment and Materials

All of the shape perception experiments, the first set of navigation experiments, and the experiments with L2-seq, A1-cane, and A2-cane were conducted with an Apple iPad 1, which has a 9.7-inch (diagonal) multi-touch (up to 10 pointers) capacitive display with a fingerprint-resistant oleophobic coating. The resolution of the screen was 1024 × 768 pixels at 132 pixels per inch, and the range of the audio playback frequency response was from 20 Hz to 20 kHz. The remaining experiments were conducted using a Samsung Galaxy Note 10.1 2014 edition tablet, which has a 10.1-inch (diagonal) LCD multi-touch (up to 10 pointers) capacitive touch screen with 2560 × 1600 pixels at 299 pixels per inch. The audio playback frequency response range was from 20 Hz to 20 kHz. The tablet comes with its own stylus, called the "S pen," which provides more resolution and stability when used as a pointer instead of the finger. We thus advised the subjects to use the stylus for scanning, and all of them did, except when they needed to use multiple fingers for exploration in L1-ung and L3-vor. We used Sennheiser HD595 around-the-ear stereo headphones (frequency response 12 Hz – 38,500 Hz, sound pressure level 112 dB at 1 kHz and 1 Veff) for all experiments. The subjects were allowed to adjust the playback volume to a comfortable level at any point during the experiments. All the sounds were played back as pre-stored sound files in Microsoft/IBM Waveform Audio File Format (WAV).

5.4. Shape Perception Experiments

The experiments with the five shape configurations were conducted in two sets. In the first set, we tested configurations S1-2cons, S2-3cons, S3-2trem, and S5-hrtf. On the basis of the experimental results, which we discuss in Chapter 6, we decided to modify S5-hrtf and to add S4-3int. We then conducted the second set of experiments with these two configurations and, for comparison (since we used a new group of subjects), with the original (unaltered) S2-3cons. In all of the experiments we used three different shapes (circle, square, triangle) for testing; there was one trial for each shape and subject. The training shape (cross) was different from the test shapes. The subjects were not given any information about the shapes that were going to be presented to them, except for the fact that they were enclosed by smooth or piecewise smooth boundaries, and were not given any feedback on their responses until the entire set of experiments was completed. This


is because the goal of the experiments was to identify an unknown shape, rather than to distinguish among a set of predetermined shapes.

5.4.1. Subjects

The first set of experiments was conducted with 20 subjects, 16 male and four female. The age of the subjects ranged from 19 to 56 years old (average 29). All except two reported normal or corrected vision and normal hearing. One subject reported nystagmus (uncontrolled eye movement) from birth and another reported tinnitus (ringing in the ears) for the last 20 years; both subjects had been treated for their impairments. The subjects with nystagmus and tinnitus participated in all the experiments. Two additional female subjects, aged 21 and 35, carried out one experiment with S2-3cons.

The second set of experiments was conducted with a different group of 11 subjects, ten male and one female. The original group of subjects could not be used because they were already familiar with the shapes of the experiments, which we had to keep the same in order to be able to compare the results with the first set of experiments. The subject ages were in the range of 22 to 50 years old (average 33). All except one reported normal or corrected vision and normal hearing. One subject reported a hearing deficiency in the left ear that had not been treated. All 11 subjects completed the experiments for S2-3cons, S4-3int, and the modified version of S5-hrtf.

5.4.2. Procedure

The written instructions made it clear that each trial is independent of the other trials, so that the same shape could be presented more than once for a given configuration, and the shapes presented in one configuration could be the same as or different from those presented in another configuration. The subjects were only told that they would be presented with objects enclosed by boundaries with smooth or piecewise smooth edges. At the end of each trial, the subjects were asked to draw, on a sheet of paper of given size, the shapes they perceived, and then to describe the shapes. The subjects had full visual contact with the paper and the pen while drawing. They were asked to draw something even if they were not sure about the presented shape. No feedback was given about the correctness of subject responses until the completion of the entire set of experiments. There were no tight time limits for the experiments, but the actual time durations to complete the experiments (including the time to draw and describe the shapes) were recorded. Since the total time for the first set of experiments was one to two hours, the subjects were given the chance to take a break after the first three configurations. However, all of the subjects completed the entire set of experiments in one sitting, except one who completed the experiments in two sittings on consecutive days. The second set of experiments was a lot shorter, so there was no need for a break. In the second set, after


the trials for each configuration were complete, we asked the subjects to rate the difficulty of the configuration on a scale of 1 to 10, with 10 being the most difficult.

5.4.3. Sound Design

The selection of the sounds for representing the elements of the layout was critical for the success of the experiments. In the experiments with S1-2cons, S2-3cons, S4-3int, and the first set of experiments with S5-hrtf, we used SONAR pings for the background because they have been found to work well in navigation tasks (e.g., Tran et al. [70]). For S2-3cons, S4-3int, and S5-hrtf (first set), we picked synthesized chirp sounds for the border, in order to clearly differentiate it from the object and background. The duration of the chirp was 350 ms and the frequency sweep was from 100 to 400 Hz. For S1-2cons, S2-3cons, S3-2trem, and S4-3int we used monophonic sound. As we discussed, in S4-3int we used intensity variations to indicate distance. For the experiments with S3-2trem, we used constant tremolos with D1 = 0.25, fc1 = 200 Hz, and fR0 = 6 Hz for the background region, and D2 = 0.42, fc2 = 150 Hz, and fR0 = 6 Hz for the object region. In the strips on either side of the border, the tremolo rate fR varied from 6 to 22 Hz, while the depth and carrier frequency were the same as those of the corresponding (background or object) segment. Please refer to Section 3.4.3.3 for notation and definitions.

For the experiments with S5-hrtf, we used directional stereophonic sound in the background and border segments, and constant monophonic sound inside the object. Directionality was provided via the HRTF. As we discussed in Section 3.4.4, in order to minimize the delay between sensed finger position and sound rendering, we quantized the directionality to 30° sections (12 levels), and preloaded the sound files for each section. As we discussed in Section 3.4.3, to vary the sound intensity on the background segment, we worked directly with the device volume controls, and selected the functions shown in Figure 3.1. In the first set of experiments with S5-hrtf, each subject selected one of two HRTFs, which corresponded to long and short pinnae measured on the KEMAR mannequin [60]. In the second set of experiments, we considered two modifications in order to improve the perception of directional sound. First, we increased the number of possible HRTFs to five, using pre-measured HRTFs that correspond to humans. This is because the HRTF of another human is expected to be a closer match to the HRTF of a given subject than that of a mannequin. The five HRTFs were selected from the CIPIC database [60], and were chosen to represent a wide range of anthropometric data. To select an HRTF, we designed a simple application, in which the subject virtually moves around a sound source listening to rendered directional sounds for each of the available HRTFs, and picks the one that works best.
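As an illustration of the tremolo parameters listed above, the sketch below synthesizes a tremolo with depth D, carrier frequency fc, and rate fR. It assumes the standard amplitude-modulation form of a tremolo; the exact definition used in the experiments is the one given in Section 3.4.3.3.

```python
import numpy as np

def tremolo(duration_s, D, fc, fR, sample_rate=44100):
    """Sinusoidal carrier at fc Hz whose amplitude is modulated with
    depth D at rate fR Hz (standard AM tremolo form, assumed here)."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    envelope = 1.0 - D * (0.5 + 0.5 * np.cos(2 * np.pi * fR * t))
    return envelope * np.sin(2 * np.pi * fc * t)

# Parameter values used in S3-2trem:
background = tremolo(1.0, D=0.25, fc=200.0, fR=6.0)   # background region
interior   = tremolo(1.0, D=0.42, fc=150.0, fR=6.0)   # object region
# In the border strips the rate varies from 6 to 22 Hz, with the depth
# and carrier frequency of the adjacent (background or object) segment.
```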


Second, we reconsidered the sound selection for the background and border segments in order to enhance sound directionality. We ran a small experiment with six subjects, in which we asked them to rate five sounds on the basis of their effectiveness in rendering directionality. The five sounds were a sonar ping, a low-frequency chirp, a high-frequency chirp, wood tapping, and Gaussian noise. Wood tapping and Gaussian noise proved to be the two best directional sounds, and were assigned to the border and background segments, respectively. The mono chirp signal was used inside the object. The sounds used in the configurations were of various durations, from 500 ms to 1.25 s, and all were played in a continuous loop.

5.4.4. Experimental Details

The first set of experiments was conducted in the following order, which was the same for all subjects: S1-2cons, S2-3cons, S3-2trem, and S5-hrtf. For the second set of experiments, the order was S2-3cons, S4-3int, and S5-hrtf, for all subjects. The shape used for training was a cross, and the test shapes were a square, a circle, and an equilateral triangle, presented one at a time. With the exception of the training shape (cross), all of the shapes had roughly the same area in square pixels and were centered in the touch screen. The width of the border strips for S2-3cons, S3-2trem, S4-3int, and S5-hrtf was 0.38 inches (50 pixels on the Apple iPad). For each configuration, each of the three shapes was presented once, in random order. However, the subjects were told that repetitions were possible. The order of the configurations was fixed. In S2-3cons, S3-2trem, S4-3int, and S5-hrtf, the subjects were told that the border strip(s) were introduced to facilitate edge tracing. However, they were told that they were free to use any technique they wished (e.g., scanning the screen left to right or top to bottom) in order to perform the task of identifying the object.

5.4.5. Discussion

To get a more accurate representation of what the subjects perceived, we asked the subjects to draw the perceived shape on a piece of paper before they tried to name it or to describe it in words. The rating of each response was binary (correct or incorrect), based on the verbal description and the drawing. In cases of ambiguity, the subjects were asked to explain. The inclusion of the drawing made the task more demanding, in the sense that rotated or transposed versions of an object shape were not considered correct answers. However, apart from such transformations, our performance evaluation criteria did not include any contour matching. Interestingly, Wijntjes et al. [67] found that asking the subjects to sketch the shape they perceived improved the identification of raised line drawings. We believe that this relates to the kinesthetic memory of the shape, which helps subjects to directly compare the shape they traced with the memory of tracing similar shapes, visually (for sighted subjects) and kinesthetically (for VI subjects).


5.5. Navigation Experiments

The purpose of the navigation experiments was to test the ability of the subjects to locate a single randomly placed object. Such experiments are important on their own, and they are also useful for the empirical evaluation of different design alternatives for the layout configurations of Section 4.3. The navigation experiments were carried out using different variations of the navigation configurations we presented in Section 4.2. Each experiment consisted of multiple trials. The task of each trial was to navigate to the object in the shortest possible time. The goal was to conduct as many trials as possible within a given time (10 minutes). However, in addition to the time limit, we later added a maximum number of trials (31) as a termination criterion. Once the finger/stylus reached the object, the subject was asked to notify the system by double tapping anywhere on the screen, so that a new trial could begin. The system acknowledged each double-tapping with the phrase "end of trial." Note that the time for the trial stopped only when the subject double tapped, not when the finger entered the object. This was necessary in order to make sure that the subject was fully aware that she/he had located the object, rather than accidentally moving the finger over the object without realizing what happened. A new trial began when the subject lifted the finger and placed it back on the screen. If the subject did not lift the finger after double tapping, the system alerted her/him with the phrase "please lift your finger." When the subject lifted the finger, the word "ready" was used to indicate that a new trial could be started. In the written instructions the subjects were asked to double tap as soon as they located the object, then to lift their finger and quickly put it down at an arbitrary position of their choice. The subjects were asked to conduct as many trials as they could until an announcement informing them that the "experiment is over" was played back over the headphones. The system automatically kept track of the time for each trial, as well as the cumulative time for each run. The time between trials was not counted in the cumulative time. Note that all trials were carried to completion, even if the termination criterion (10-minute limit or 31 trials) was reached in the middle of a trial. No feedback about the subject's performance was given during the experiment. The subject was interviewed after each experiment; first, she/he was given a chance to provide general feedback, and then (for all navigation experiments except the first one) was asked specific questions about the difficulty, intuitiveness, and cognitive load of the task.

5.5.1. Navigation Experiment 1: Intensity versus Tempo

This experiment was conducted with the N2-guided configuration, and consisted of two runs, one using intensity variations and the other using tempo variations as the distance cue. The goal was to decide which cue is more effective for distance rendering. In both cases, the sound directionality was rendered using the HRTFs of the KEMAR mannequin, with the directionality quantized into 30° sections, that is, 12 uniform quantization levels. A


fixed head orientation, facing the top of the tablet, was assumed for the directionality rendering. As we discussed, each run consisted of multiple trials within a ten-minute block of time. The configurations were evaluated based on the average navigation time over the trials presented in the final three minutes of the test. This was done to negate any learning effects. However, the subjects were not aware of this until the very end of the entire experiment. Moreover, as we will see in the next chapter, we kept and analyzed all the data in order to determine whether there was indeed a learning effect. All trials that started or ended within the final three minutes were counted in the calculation of the average and median navigation times. Eight subjects, four male and four female, with ages between 23 and 35 years old, took part in the experiments.

In both configurations, we used a 652 Hz sinusoidal tone to identify the object. The radius of the dot representing the object was 0.23 inches (30 pixels on the iPad). In the variable-tempo configuration, we used constant tone and noise intensities in the background region, fixed the length of the signal segments at 250 ms, and changed the tempo by varying the length of the silence segments. We used three different lengths for the silence segments: 750, 250, and 83 ms. Thus, the periods were 1 s, 500 ms, and 333 ms (1, 2, and 3 Hz). In the variable-intensity configuration, we used a constant tempo with 250 ms signal segments and 250 ms silence segments. The intensity variations were implemented by modulating the signal-to-noise ratio as described in Section 3.4.3. In all cases, a constant tone (played in a loop without any silence) was played in the interior of the object. However, since the tone was played in a continuous loop, there was an audible periodic artifact (4 Hz). This artifact did not have any negative effect on the experiments.
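The two distance cues of this experiment can be summarized in a small sketch. The three tempo levels follow the silence-segment lengths given above; the mapping from distance to levels (equal thirds of the maximum distance) and the linear intensity mapping are assumptions for illustration, standing in for the calibration of Section 3.4.3.

```python
def tempo_cue(distance, max_distance):
    """Variable-tempo cue: 250 ms signal segments, with the silence length
    chosen from the three values used in Navigation Experiment 1
    (750, 250, 83 ms), i.e., periods of 1 s, 500 ms, and 333 ms.
    The equal-thirds distance mapping is assumed for illustration."""
    signal_ms = 250
    if distance > 2 * max_distance / 3:
        silence_ms = 750      # far from the object: slow tempo (1 Hz)
    elif distance > max_distance / 3:
        silence_ms = 250      # intermediate distance (2 Hz)
    else:
        silence_ms = 83       # close to the object: fast tempo (3 Hz)
    return signal_ms, silence_ms

def intensity_cue(distance, max_distance):
    """Variable-intensity cue: constant tempo (250 ms signal, 250 ms silence),
    with the tone-to-noise ratio increasing as the finger approaches the
    object. The linear mapping is an assumed placeholder."""
    return 1.0 - min(distance / max_distance, 1.0)   # 0 = far, 1 = at object
```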

5.5.2. Navigation Experiment 2: Fixed versus Trajectory Orientation, Plus Sound Selection

In the first navigation experiment, as well as in the layout perception experiment with L2-seq described below, most of the subjects used the same, unexpected strategy for navigating to the object. Once they put down the finger on the touch screen, they moved their finger horizontally to the left or right, towards the direction from which the sound appeared to be coming, until they perceived a change in the (left/right) direction of the sound. At the transition point, they knew that the finger was vertically aligned with the object. They then relied on the distance cue to determine whether to move up or down in the vertical direction towards the object. They then kept moving in the same vertical direction until they reached the object or until there was a reversal in the distance cue. Since all the movements were in the horizontal and vertical directions, we refer to this as the Manhattan navigation strategy. However, our initial intention was to guide the finger to the object in a straight line, rather than in a combination of horizontal and vertical movements, which results in longer scanning paths and presumably longer trial times.


One explanation for the selection of the Manhattan navigation strategy is that the directionality rendering via HRTF was poor and the subjects could only lateralize sounds to two extreme directions, left and right. Another explanation is that human lateralization is poor near the ±90° azimuth, so by moving in the horizontal direction, the subjects were trying to bring the object close to the 0° azimuth range, where they could utilize the best human lateralization accuracy. One approach for encouraging the subjects to navigate to the target in a straight line, while utilizing the most sensitive lateralization range, is to vary the virtual head orientation so that the subject faces the object. This could be done by allowing the subjects to rotate their virtual head towards the object by rotating the finger. However, this could lead to awkward situations, where the subjects may have to twist their finger, e.g., when moving towards the bottom of the screen. Instead, we can assume that the subject is always facing forward as she/he is moving towards the object. Since this is a natural thing to do, it seems that this may indeed be a reasonable approach. To implement this approach, all one has to do is follow the finger trajectory and orient the virtual head in the direction of the finger movement. We will refer to this as trajectory orientation, as opposed to fixed orientation. However, for the directionality decoding to be done effortlessly, this assumes that the subject can keep mental track of the direction of the scanning finger/pointer movement. As we will discuss below and in the next chapter, this is not true for most subjects, at least without training.

To evaluate the efficacy of the two navigation approaches (fixed orientation and trajectory orientation), we conducted experiments with the N2-guided configuration. The radius of the dot representing the object was 0.2 inches (60 pixels on the Samsung Galaxy). In both cases, the distance rendering was done using the direct-to-reverberant energy ratio and the calibrations described in Section 3.4.3.1. Directionality rendering was done using the HRTF, but in order to increase the accuracy of the rendering, the number of quantization levels was increased. As we discussed, human resolution of directionality perception is not uniform and is higher around the zero azimuth. We thus designed a new nonuniform quantization scheme with 54 levels based on human lateralization limitations. The azimuth was quantized to 5° sections for azimuth angles between −45° and 45° and between −135° and 135° (through the rear), and 10° sections for all other azimuths. As in Navigation Experiment 1, the time for each trial and the cumulative time for each run were tracked by the system. The termination criterion was either ten minutes or 31 trials, whichever happened first. Again, each trial was carried to completion. Ten subjects, eight male and two female, in the age range of 23 to 32 years (average 26), took part in the experiment.

In addition to evaluating the efficacy of the two navigation modes, the goal of our experiments was to determine the effect of sound selection on the subjects' ability to navigate to the object. We considered two main categories of musical instrument recordings for representing the objects: wind instruments that generate relatively steady sounds, and


percussion instruments that generate impulsive sounds. The wind instrument recordings were anechoic, and were obtained from the open-access Musical Instrument Samples Database of the Electronic Music Studios at the University of Iowa. The first was "G3" (note G of the 3rd octave), played on a bass clarinet, the second was "B3," played on an oboe, and the third was "D5," played without vibrato on a B-flat trumpet. For the percussion sound we used a bongo roll recording downloaded from http://www.freesound.org/; we do not have any other information about the recording conditions. We conducted experiments with four of the ten subjects (all male, 26 to 32 years old, average 29) using one sound from each category, the bongo roll and the trumpet recording. Overall, for each of the four subjects we tested two head orientations (fixed, trajectory) and two sound selections, for a total of four conditions/runs. However, as we will discuss in Chapter 6, we found no notable difference in the results using the two sound selections. We thus decided to run the rest of the experiment, with the six remaining subjects, with just one sound (bongo roll) and two head orientations. For all ten subjects, the order of the conditions/runs was randomized, in order to neutralize any learning effects.

5.5.3. Navigation Experiment 3: Unguided

This navigation experiment was conducted with the N1-ung configuration, that is, without any acoustic guidance to the target, in order to determine the benefits of acoustic guidance for navigating to the target. The sound assignment was silence in the background and bongo roll for the object. The radius of the dot representing the object was 0.2 inches (60 pixels on the Samsung Galaxy). The experimental procedure was similar to Navigation Experiment 2. Seven subjects participated in this experiment, five male and two female, with ages ranging from 25 to 58 years old (average 36).

5.5.4. Navigation Experiment 4: Small Dot, Guided versus Unguided

Much to our initial surprise, as we will discuss in more detail in Chapter 6, the layout perception experiments (with L1-ung, L2-seq, and L3-vor) showed that, overall, the performance with guided navigation was not better than that of unguided navigation.1 However, it soon became clear that the size of the object (dot) could have a significant effect on the unguided navigation, while the performance of the guided navigation should remain more or less the same. To verify this hypothesis, we conducted this experiment with the N1-ung and N2-guided configurations using a smaller object. The radius of the dot representing the object in this experiment was 15 pixels (0.05 inches), one fourth of the dot size used in Navigation Experiments 2 and 3. For the guided navigation we used fixed orientation and bongo roll for the object sound. The implementation and

1 In fact, as we will discuss in Chapter 6, in the navigation experiments the times for unguided navigation were six times longer than those for guided navigation, but performance improves when multiple large objects are presented simultaneously in the layout experiments.


experimental procedure was otherwise similar to Navigation Experiments 2 and 3. The subjects were the same ones that participated in Navigation Experiment 3.

5.6. 2-D Layout Perception Experiments

The goal of the layout perception experiments was to test the subjects' ability to perceive a 2-D layout of simple objects on the touch screen, that is, to locate the objects on the touch screen, to identify them, and to determine their absolute and relative positions on the screen. The layout experiments were carried out with the three layout configurations we presented in Section 4.3: L1-ung, L2-seq, and L3-vor. The task of the subjects was to explore the layout in order to learn the locations of the objects until they were confident that they could reproduce the layout. The subjects did not know how many objects were present in the layout, and were told that the number could be equal to or different from that of the training example. At the end of the experiment, the subjects were first asked to say how many objects were in the layout. Then they were asked to reproduce the layout. There were no time limits for the experiments, but the trial times, including the time to reproduce the scene, were recorded. No feedback about the subject's performance was given during the experiment. After each experiment, the subject was interviewed; first, she/he was given a chance to provide general feedback, and then was asked specific questions and asked to rate the difficulty, intuitiveness, and cognitive load of the tasks on a scale from one to ten, with ten being the highest.

5.6.1. Sequential Exploration of 2-D Layout Experiment (L2-seq)

This experiment was conducted with the L2-seq configuration, and was implemented on the iPad 1. Based on the results of Navigation Experiment 1, we used sound intensity to indicate distance changes. The proposed system was implemented with four circular objects, as shown in Figure 5.1. The radius of the dot representing each object was 0.2 inches (60 pixels on the iPad 1). Each object was represented by a sinusoidal tone with white noise added in the background region, as was done for Navigation Experiment 1, discussed in Section 5.5.1. We assigned tones of 452, 652, 852, and 1052 Hz to the four objects (blue, green, orange, and red, respectively). As in the variable-intensity run of Navigation Experiment 1 (Section 5.5.1), we used a constant tempo in the background region, with alternating segments of signal (tone and noise) and silence, each of 250 ms duration. Thus, the period of the signal was 500 ms, and the frequency was 2 Hz. The interior of the objects was represented with a constant (played in a loop without silence) tone signal of the same frequency as the one in the background. Directionality rendering was the same as that used in Navigation Experiment 1, assuming a fixed head orientation.

As we discussed, the subjects' task was to explore the layout until they were confident that they could reproduce it on a sheet of paper. At the end of the experiment, the subjects were first asked to provide the number of objects in the layout, and then they were given


Figure 5.1. Layout for sequential exploration experiment with L2-seq (blue: 452 Hz, green: 652 Hz, orange: 852 Hz, and red: 1052 Hz)

a customized graph paper of the same size as the active area of the iPad (5.8 × 7.6 inches), with grid spacing 0.38 inches. According to the written instructions, the subjects were free to place the dots anywhere on the grid.2 To eliminate the additional burden of associating the tone of each object with a label, the subjects were given the ability (via another iPad application) to play the tone for each object before they were asked to place it on the graph paper. Four out of the eight subjects of Navigation Experiment 1 participated in this experiment; two were male and two female, with ages ranging from 23 to 35 years old. In order to familiarize the subjects with the system, before the start of the experiment they were presented with a training example that consisted of a layout of three objects.

5.6.2. Simultaneous 2-D Layout Exploration Experiment (L3-vor)

This experiment was conducted with the L3-vor configuration, and was implemented on the Samsung Galaxy tablet. Based on the results of Navigation Experiment 2, we used a fixed head orientation. The distance and directionality rendering was the same as in Navigation Experiment 2, described in Section 5.5.2. The layout presented to the subjects in this experiment consisted of four objects, as shown in Figure 5.2, while the training layout consisted of three objects, as shown in Figure 5.3. We used musical instrument recordings to identify the objects: the bass clarinet ("G3"), the oboe ("B3"), the trumpet ("D5"), and the bongo roll. The blue, green, red, and black segments were assigned the recordings of the bongo roll, trumpet, oboe, and bass clarinet, respectively.

This experiment consisted of two modes, a sequential and a Voronoi mode. The goal of the sequential mode was to help the subject identify the different objects and to get a rough idea of their locations. When the stylus (or finger) was placed on the screen, the sound corresponding to the closest object (to the stylus landing location) was

2 As we will see in Chapter 6, the subjects actually ended up centering the dots on the pixels.


Figure 5.2. Layout for simultaneous exploration experiment with L3-vor (blue: bongo roll, green: trumpet, red: oboe, and black: clarinet)

Figure 5.3. Training layout for simultaneous exploration experiment with L3-vor (blue: bongo roll, green: trumpet, and red: oboe)

played; the directionality and distance rendering were based on the same stylus location. In order to discourage further exploration of the object during this mode, the object sound (directionality and distance) did not change with stylus movements. When the stylus was lifted and placed back down on the screen, the previously presented object was marked “inactive” and the (next) closest object (to the new stylus landing location) was played. The process was repeated until all the objects had been heard, i.e., had become inactive. At this point the cycle was complete, and at the next stylus tap, all objects were reactivated, and a notification sound was played to indicate the beginning of the next cycle. The subject was given the option of cycling through objects as many times as


Figure 5.4. Training layout for unguided (blue: bongo roll, green: trumpet, and red: oboe) and visual exploration experiments

she/he desired, in order to familiarize her/himself with the number of objects and their rough locations. The subject could then switch to the Voronoi mode by double-tapping anywhere on the touch screen. The mode change was acknowledged by a special notification sound. Note that, once the subject switched to the Voronoi mode, it was impossible to go back to the sequential mode. The details of the Voronoi mode were provided in Section 4.3.3.

As in the experiment with L2-seq, the subjects' task was to explore the layout until they were confident that they could reproduce the layout. At the end of the experiment, the subjects were first asked to provide the number of objects in the layout. Then they were asked to reproduce the layout using another application, which consisted of dragging and dropping the dots corresponding to the objects. The subjects were able to see the display during this stage of the experiment. The dots were initially placed in a horizontal straight line in the bottom left corner of the touch screen. Each dot was assigned a different color and the characteristic sound of one of the objects in the layout. The subjects were then asked to drag and drop each dot to the appropriate location on the touch screen. Note that this application was also provided during the training phase.

5.6.3. Unguided Layout Exploration Experiment (L1-ung)

This experiment was based on the L1-ung configuration, and the goal was to explore the layout without any sound guidance to the objects in the layout. Since the participants in this experiment were the same as those of the second experiment, to eliminate any prior familiarity with the object layouts, both the training and the experimental layouts were transposes (mirror images taken at the middle of the vertical and horizontal axes) of the corresponding layouts used in the experiment with L3-vor. The transposed layouts are shown in Figures 5.4 and 5.5. The experimental procedure and object sound assignments were similar to those of the experiment with L3-vor.
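The transposed layouts can be produced by mirroring each object position about the vertical and horizontal midlines of the screen, as in the short sketch below; the coordinates shown are hypothetical and only illustrate the operation.

```python
def transpose_layout(objects, width, height):
    """Mirror a layout about the vertical and horizontal midlines of the
    screen: (x, y) -> (width - x, height - y)."""
    return {name: (width - x, height - y) for name, (x, y) in objects.items()}

# Hypothetical coordinates on a 2560 x 1600 screen:
layout = {"bongo": (400, 300), "trumpet": (1800, 500),
          "oboe": (900, 1200), "clarinet": (2100, 1400)}
mirrored = transpose_layout(layout, 2560, 1600)
```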


Figure 5.5. Layout for unguided (blue: bongo roll, green: trumpet, red: oboe, and black: clarinet) and visual exploration experiments

5.6.4. Experiment with Visual Layout

In the layout experiments described above, the performance of the subjects depends on (a) their ability to explore and perceive the layout by navigating to and identifying the objects, and (b) their ability to memorize and reproduce the perceived layout. However, our interest is only in the first. In the visual layout experiment, on the other hand, where the layout is presented visually and the subjects are asked to reproduce what they saw, we anticipate that the first task will become trivial, and thus the subject performance will correspond to the second task. The results of this experiment could thus be used as an upper limit for the performance of the visually blocked experiments.

For the visual layout experiment we used the same training and experimental layouts as in the unguided exploration experiment of the previous subsection, shown in Figures 5.4 and 5.5, respectively. In the experiment, the subjects were first asked to look at the layout without touching the screen or listening to any sounds. They were then asked to reproduce it, by dragging and dropping the (silent) dots, as in the experiments with L3-vor and L1-ung, based on their memory of the layout.

5.7. Experiments with Special Configurations

We now discuss the subjective experiments with the special configurations we introduced in Section 4.4.

5.7.1. Virtual Cane Experiment 1: Acoustic Object Representation

This experiment was conducted with Configuration A1-cane, which consists of two modes, as we discussed in Section 4.4.1. In the zoomed-out mode, we used prerecorded mono


tapping sounds of wood (for the green object in Figure 4.8), glass (blue), and metal (red) to represent up to three different objects. The background was silent. The mode change was triggered by double-tapping inside the object. A short-duration sound was used for the mode change notification. The zoomed-in mode was the same as S5-hrtf in the first set of experiments, except for the choice of object sound. The training scene contained two objects, and the testing scene contained three objects that were different from those used in the training scene. In both cases, the objects were disjoint, with roughly equal horizontal spacing. In the instructions at the beginning of the experiment, the subjects were told that the number of objects in the experiment could be different from that of the training example. The subjects had no prior information about the set of possible sounds for the objects. The analogy of a virtual cane was also provided. The subjects were told that the purpose of the zoomed-in mode was to facilitate shape identification; it could not be used for the determination of object size, as the degree of zooming varied with object size. At the end of the experiment (just one trial), the subjects were asked to draw all the objects in the scene, indicating the material they were made of. Their performance evaluation was based on the accuracy of the number of objects in the scene, their relative positions, shapes, and materials.

5.7.2. Virtual Cane Experiment 2: Tactile-Acoustic Object Representation

This experiment was conducted with Configuration A2-cane, which was introduced in Section 4.4.2. The sound signals for this configuration were the same as those for A1-cane (wood, glass, and metal). However, a sheet of paper with embossed tactile patterns (embossed using a "VersaPoint Duo Braille Embosser," Model# BP2B-01) was superimposed on the touch screen, so that the tactile and acoustic patterns were aligned. Since the objects were disjoint, the same tactile texture (dot pattern, density, height, and size) was used for all objects, while the background was flat. As we saw in Section 3.4, there is no need for a zoomed-in mode. The training scene contained the same two objects as in A1-cane, while the testing scene contained the three objects shown in Fig. 4.8(b), which are different from those in A1-cane. However, in order to address the significant errors in material identification observed in the experiments with A1-cane, we decided to narrow down the set of possible materials. Thus, at the beginning of the experiment, the subjects were presented with tapping sounds labeled wood, glass, metal, cardboard, plastic, and composite, as the only choices for the objects in the virtual scene. The performance evaluation was the same as in A1-cane.

5.7.3. Experiment with Venn Diagrams

This experiment was conducted with Configuration A3-venn, introduced in Section 4.4.3. The Venn diagrams we used in this experiment, both for the actual experiment and the training phase, contained three circles, as shown in Figures 5.6 and 5.7, respectively. Each


Figure 5.6. Layout for Venn diagram experiment (green: bongo roll, blue: trumpet, red: oboe)

circle was represented by a different instrument: bongo roll for the green circle, trumpet for the blue, and oboe for the red. The subjects were told that the Venn diagram contained three circles. For each pair of circles, the subjects were asked to determine the relative size, relative location, and amount of overlap. A finite number of choices was given for each answer. For the size, the choices were small, medium, and large. For the relative location, the choices were one inside the other, north, northeast, east, southeast, south, southwest, west, and northwest. Finally, for the amount of overlap, the choices were none, 10–40%, 40–60%, 60–90%, and 100%. The subjects were given the option of answering the above (multiple-choice) questions during the layout exploration or at the end of the experiment. Once the answers were complete, the tablet was removed, and the subject was asked to draw the Venn diagram on a piece of paper. Since our interest was in determining the relative sizes and locations of the circles, the paper did not have to be of the same size as the touch screen. After drawing the Venn diagram, the subject was given a chance to modify the answers, to make them consistent with the drawing. There were no time restrictions for the experiment, but the total time for exploring the diagram, answering the questions, and drawing the Venn diagram was recorded.
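For reference, the overlap between two circles can be computed in closed form from their radii and center distance. The sketch below shows one way ground-truth overlap percentages could be obtained; normalizing by the smaller circle's area is an assumption, since the exact definition of the percentage ranges is not specified here.

```python
import math

def overlap_fraction(c1, r1, c2, r2):
    """Fraction of the smaller circle's area covered by the other circle
    (illustrative sketch; the normalization choice is an assumption).
    c1, c2: (x, y) centers; r1, r2: radii."""
    d = math.dist(c1, c2)
    if d >= r1 + r2:
        return 0.0                      # disjoint circles
    if d <= abs(r1 - r2):
        return 1.0                      # one circle entirely inside the other
    # Area of the lens-shaped intersection of the two circles.
    a1 = r1 * r1 * math.acos((d * d + r1 * r1 - r2 * r2) / (2 * d * r1))
    a2 = r2 * r2 * math.acos((d * d + r2 * r2 - r1 * r1) / (2 * d * r2))
    a3 = 0.5 * math.sqrt((-d + r1 + r2) * (d + r1 - r2)
                         * (d - r1 + r2) * (d + r1 + r2))
    return (a1 + a2 - a3) / (math.pi * min(r1, r2) ** 2)
```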


Figure 5.7. Training layout for Venn diagram experiment (green: bongo roll, blue: trumpet, red: oboe)


CHAPTER 6

Results and Discussion

In this chapter we present and analyze the results of the subjective experiments, which are organized into four groups that focus on shape, localization, layout, and applications, as in previous chapters. At the outset, we should point out that the subjective experiments were very time consuming, and as a result, we could only conduct a limited number of trials in each sitting. In addition, all of our subjects were unpaid volunteers; it was difficult to recruit them, and we could not ask them to volunteer for multiple sittings. Thus, our experimental data are limited compared to what a typical subjective experiment would produce, and as we will see, some of the results are not statistically significant. Nevertheless, they can be considered as pilot experiments that offer many valuable insights for the design of further systematic tests with visually impaired and visually blocked subjects. Hence we use an alpha level of 0.1 for all statistical tests.

6.1. Shape Perception

As we discussed in Chapter 5, we conducted the shape perception experiments in two sets. In the first set, we tested Configurations S1-2cons, S2-3cons, S3-2trem, and S5-hrtf. On the basis of the results, we then added S4-3int, and conducted a second set of experiments with that, a modified S5-hrtf, and the unaltered S2-3cons. The five configurations are summarized in Table 6.1. As we discussed in Chapter 5, in all of the shape experiments the subjects' performance was measured by the accuracy of their responses and the time they took to complete the experiment. As we discussed, the rating of each response was binary, correct or incorrect, based on the verbal description and the drawing. Thus, typical statistical methods such as ANOVA, ANCOVA, and regression cannot be used to analyze the accuracy of the responses, because they assume a normal distribution and equal variance [97]. Instead, we used the chi-square test of independence for categorical analysis of the accuracies. The timing data, on the other hand, are continuous, so ANOVA and the t-test can be used for analysis.

Table 6.1. Configuration summary
  S1-2cons   2 constant sounds
  S2-3cons   3 constant sounds
  S3-2trem   2 tremolo sounds with varying border rate
  S4-3int    3 sounds with varying border intensity
  S5-hrtf    3 sounds with HRTF in border and background

Table 6.2. First set of experiments: Accuracy and time averaged over all subjects
                        S1-2cons                    S2-3cons
                        Square  Circle  Triangle    Square  Circle  Triangle
  Accuracy (%)          85      40      75          86      76      81
  Aver. Accuracy (%)            66.7                        80.9
  Time (s)              253     230     225         222     227     234
  Aver. Time (s)                236                         227.7

                        S3-2trem                    S5-hrtf
                        Square  Circle  Triangle    Square  Circle  Triangle
  Accuracy (%)          80      55      80          95      60      85
  Aver. Accuracy (%)            71.7                        80.0
  Time (s)              201     202     140         127     211     207
  Aver. Time (s)                181                         181.7

6.1.1. First Set of Experiments with Shape Configurations

The results of the first set of subjective experiments with S1-2cons, S2-3cons, S3-2trem, and S5-hrtf are summarized in Table 6.2, which shows accuracy and time averaged over all subjects for each shape and configuration. Figure 6.1 shows the accuracy of the different configurations for each shape. The data were analyzed using the chi-square test of independence, which showed that accuracy is independent of the configuration (χ2(3, 243) = 4.55, p > 0.2), that is, there are no significant differences in accuracy among configurations. However, in a pairwise comparison of S1-2cons and S2-3cons, the performance of S2-3cons is significantly better (χ2(1, 123) = 3.26, p = 0.07). The performance of S5-hrtf is also significantly better than that of S1-2cons (χ2(1, 120) = 2.73, p = 0.09). These observations justify the addition of a narrow border strip with a distinct sound. On the other hand, the difference between S3-2trem and S1-2cons is not significant, which indicates that tremolo may not be as effective.

Figure 6.2 shows box plots for the time it took the subjects to identify each shape for each configuration, as well as averages across the different shapes. Box plots illustrate the distribution of the results: the red line indicates the median, the box edges indicate the 25th and 75th percentiles, each whisker extends the box by 1.5 times its length, and the crosses show the outliers (outside the range defined by the whiskers).
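As an illustration of the categorical analysis, a chi-square test of independence can be run on a contingency table of correct and incorrect counts per configuration. The counts below are placeholders, not the experimental data, and scipy is used only as one possible tool.

```python
from scipy.stats import chi2_contingency

# Rows: configurations; columns: [correct, incorrect] trial counts.
# Placeholder counts for illustration only (not the experimental data).
table = [[40, 20],   # S1-2cons
         [50, 12],   # S2-3cons
         [43, 17],   # S3-2trem
         [48, 12]]   # S5-hrtf

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.3f}")
# With the alpha level of 0.1 used in this chapter, p < 0.1 would indicate
# that accuracy depends on the configuration.
```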


Figure 6.1. First set of experiments: Accuracy for each shape for S1-2cons, S2-3cons, S3-2trem, and S5-hrtf (S: square, C: circle, and T: triangle)

Figure 6.2. First set of experiments: Time distribution for each shape of S1-2cons, S2-3cons, S3-2trem, and S5-hrtf (S: square, C: circle, and T: triangle)

As we discussed in Section 5.4, the recorded times include the scanning of the touch screen and the drawing and naming of the shapes. A one-way ANOVA showed significant differences among the configurations (F(3, 236) = 2.72, p = 0.045). In particular, a comparison between S2-3cons and S5-hrtf shows that the latter requires significantly shorter time (t-test assuming unequal variance: t(114) = 1.80, p = 0.07) for about the same accuracy. It thus appears that the addition of spatial sounds in S5-hrtf helped expedite shape tracing. On the other hand, there is no significant difference between S3-2trem and S5-hrtf (paired t-test: t(59) = −0.02, p = 0.98). In addition, S3-2trem requires significantly shorter time than S2-3cons (t(111) = −1.85, p = 0.07); hence the use of tremolo variations also expedites shape tracing.


Based on the statistical analysis, we have established that adding a distinct border sound improves accuracy and that the addition of spatial sounds (directionality and distance cues in S5-hrtf and S3-2trem, respectively) improves timing. What we have not established on firm grounds is the superiority of S5-hrtf over S3-2trem. However, an analysis of the subjects' feedback after the experiments revealed numerous negative comments about S3-2trem, such as "inside/outside of shapes were not differentiable by assigned tremolos," "tremolo rate changes very fast within a small area," and "tremolo rate changes were not noticeable." Such comments provide an indication of the ineffectiveness of S3-2trem. We expect that more extensive subjective tests will establish the superiority of S5-hrtf over S3-2trem in terms of accuracy. If we take the subjects' negative feedback into account, the explanation for the relatively faster time of S3-2trem may be that, when faced with an ineffective or annoying interface, a subject gives up or makes a haphazard guess. Overall, out of the four configurations, S5-hrtf received the most positive feedback, with special emphasis on the ease of use. This can be attributed to the addition of an explicit border and spatial sounds. According to the subject comments, the addition of spatial sounds in the border segment was quite helpful in guiding the finger in tracing the edges, and also provided cues about edge orientation. On the other hand, the subjects reported that spatial sounds did not help much in the background region. This can be explained by the fact that the object occupied much of the screen and was always centered, which made it easy to locate the object even without spatial sound. However, as we mentioned in Section 5.5.4 and will discuss in detail in Section 6.2.4, spatial sound is more important for guiding the finger to the object in sparser object layouts. Overall, there are clear indications that the addition of spatial sounds simplifies the exploration task, thus allowing the user to focus on the perception of object shape and scene layout, as claimed in Section 4.1.5. It is important to point out that the accuracies are much greater than what would be achieved by mere guessing. The results are strengthened by the fact that the subjects did not have any prior knowledge about the shapes they were going to be tested on and were not given any feedback on their accuracy during the test. Figure 6.3 shows some interesting drawings from the experiments with the shape configurations (from both sets of experiments). The first row shows sketches that were marked as correct; all the remaining ones were marked as wrong. The diversity of the sketches is a clear indication that the subjects did not have any prior knowledge of the test shapes. The effectiveness of a configuration is reflected in both the accuracy and the timing of the object shape perception task. The time it takes to identify a shape depends on the degree of confidence that each subject wants to achieve before naming the shape of an object. Figures 6.4 and 6.5 show individual subject performance in terms of average accuracy and duration. As can be seen in the figures, there were wide performance variations. Four of the subjects performed perfectly on all of the shapes, while three of the subjects had accuracies below 50%. There are also wide variations in timing performance.


Figure 6.3. Selected subject drawings for Configurations S1-2cons, S2-3cons, S3-2trem, S4-3int, and S5-hrtf (s21A: Subject 21, first set of experiments; s11B: Subject 11, second set of experiments)

It is thus interesting to look at the relationship between the time and accuracy for each subject; this is shown in Figure 6.6. Note that about a third of the subjects have over 80% average accuracy and average time of less than 3 minutes. The data in Table 6.2 also indicate that the accuracy varied with shape. Indeed, statistical analysis shows that the subjects were significantly less accurate in identifying the circle compared to the other two shapes (χ2(2, 243) = 19.22, p < 0.001). Overall, one would expect that the detection of curved edges is more difficult than that of straight edges and corners. Figure 6.7 shows some interesting sketches drawn by subjects in response to the circle stimulus, all of which were marked wrong. The first row shows the sketches of Subject S7A (Subject 7, first set of experiments). The subject called the first sketch a hexagon and the other three octagons. Note that all four of these drawings can be considered as straight-line approximations of the circle. The second row shows the sketches of Subject S2A. These can be considered as pixel approximations of the circle. We should also note that for both subjects the crudest approximation corresponds to S1-2cons. Overall, the drawings in Figure 6.7 were drawn by eight subjects, and include 11 line and five pixel approximations of the circle.

Figure 6.4. First set of experiments: Accuracy for each subject for S1-2cons, S2-3cons, S3-2trem, and S5-hrtf, averaged over configurations and shapes

Figure 6.5. First set of experiments: Average time and standard deviation for each subject for S1-2cons, S2-3cons, S3-2trem, and S5-hrtf, averaged over configurations and shapes

Actually, Subject S2A (as well as other subjects) also drew pixel approximations of the triangle, as can be seen in Figure 6.3. However, we should point out that the training example may have biased the subjects towards using line and pixel approximations, because in all cases the training shape was a cross, which consists of straight edges only. To further probe the accuracy variations with shape, we used the chi-square test of independence to analyze the shape accuracies separately for each configuration. The accuracies varied significantly among shapes in S1-2cons (χ2(2, 60) = 10.05, p = 0.006) and in S5-hrtf (χ2(2, 60) = 8.12, p = 0.017); in both cases the circle accuracies were significantly lower. On the other hand, the accuracies of the other two configurations with explicit border rendering were consistent across shapes (S2-3cons: χ2(2, 63) = 0.62, p > 0.7; S3-2trem: χ2(2, 60) = 4.1, p = 0.13).

Figure 6.6. First set of experiments: Average time vs. average accuracy for each subject in S1-2cons, S2-3cons, S3-2trem, and S5-hrtf (points labeled with subject number)

There is no clear explanation for these results. The lack of an explicit border in S1-2cons could explain the difficulty in tracing a circular edge, perhaps more than the straight lines of the other two shapes. Also, even though S5-hrtf includes an explicit border and tracing guidance in the border region, many of the subjects still drew polygonal or pixelized approximations of the circles, shown in Figure 6.7. One explanation for this may be the angular quantization (12 angular sectors) of the HRTF rendering. Finally, regarding timing differences among shapes, a one-way ANOVA shows that there is no significant difference among shapes when considering all configurations (F(2, 240) = 0.38, p = 0.68). However, in S5-hrtf there is a significant difference in the timing data (one-way ANOVA gives F(2, 57) = 3.48, p = 0.04), with the square recording about 40% faster times compared to the circle and triangle. The faster identification times for the square may be due to the use of directionality cues in the border segment to judge horizontal/vertical edge orientation without tracing the entire edge, as several subjects explicitly mentioned in their feedback. The other three configurations show no significant differences among shapes.


Figure 6.7. First set of experiments: Selected subject drawings, all marked wrong, in response to the circle stimulus in S1-2cons, S2-3cons, S3-2trem, and S5-hrtf

6.1.2. Second Set of Experiments with Shape Configurations

Table 6.3 summarizes the results of the second set of experiments, conducted with S2-3cons, S4-3int, and the modified S5-hrtf.

Table 6.3. Second set of experiments: Average accuracy, timing, and difficulty for all 11 subjects (Sq: square, Cir: circle, Tri: triangle)

                          S2-3cons               S4-3int                S5-hrtf
                          Sq    Cir   Tri        Sq    Cir   Tri        Sq    Cir   Tri
Accuracy (%)              100   45    64         100   64    82         91    64    64
Average accuracy (%)      69.7                   81.8                   72.7
Normalized accuracy (%)   80.9                   95.0                   84.5
Time (s)                  209   496   289        180   314   283        108   170   285
Average time (s)          331.2                  259.0                  188.7
Normalized time (s)       227.7                  178.1                  129.0
Difficulty (1-10)         6.23                   5.64                   3.77

As we discussed in Chapter 5, we had to run the second set of experiments with a different set of subjects, and we included S2-3cons for comparison between the two sets of subjects. This turned out to be important, as the performance in accuracy and timing dropped when compared with the first set. Since the S2-3cons configuration remained unchanged, we analyzed the difference in performance for S2-3cons between the two sets of experiments. Even though the mean accuracy dropped by over 10% in the second set, the difference was not statistically significant (χ2(1, 96) = 1.55, p = 0.21). However, a two-sample t-test (assuming unequal variance) showed that the timing of the subjects in the second set of experiments (mean = 331.2, stdev = 314.5) was significantly slower than that of the first set (mean = 227.7, stdev = 162.4), with t(41) = −1.77, p = 0.08. The variance was also higher. The drop in performance could be attributed to the natural abilities of the subjects or to demographics (prior experience with touch-screen devices). Indeed, a closer look at the background of the subjects revealed that six of them owned and used touch-screen devices (smart phones, tablets, or tablet PCs) in their day-to-day life (experienced group), while the remaining five had very little experience with touch-screen devices (inexperienced group). In contrast, in the first set of experiments, only one subject belonged to the latter category. In fact, the performance of the inexperienced group (accuracy: mean = 62.2%; time: mean = 347.4, stdev = 291) was significantly lower than that of the experienced group (accuracy: mean = 85.2%; time: mean = 185.9, stdev = 201.7), with χ2(1, 99) = 6.86, p = 0.009 for accuracy and t(76) = 3.15, p = 0.002 for time.

Table 6.4 summarizes the results of the experiments for the experienced group of six subjects.

Table 6.4. Second set of experiments: Average accuracy, timing, and difficulty for 6 of the 11 subjects (Sq: square, Cir: circle, Tri: triangle)

                          S2-3cons               S4-3int                S5-hrtf
                          Sq    Cir   Tri        Sq    Cir   Tri        Sq    Cir   Tri
Accuracy (%)              100   50    83.3       100   66.7  100        100   83.3  83.3
Average accuracy (%)      77.8                   88.9                   88.9
Time (s)                  198   312   218        163   264   210        60    160   88
Average time (s)          242.7                  212.2                  102.7
Difficulty (1-10)         6.92                   5.92                   3.75

Note that the mean performance in S2-3cons is now comparable to that in the first set of experiments, with S5-hrtf showing a significant improvement in time compared to the group of all 11 subjects (t(133) = −1.94, p = 0.05). Moreover, the modified S5-hrtf recorded a 43% reduction (for the selected six subjects) in mean time compared to the S5-hrtf in the first set of experiments (t(34) = −2.87, p = 0.007), which justifies the modifications.


Another approach for comparing the performance of the new configurations to the old ones (in the first set of experiments) is to use all the subjective data, but to normalize the results on the basis of the common configuration (S2-3cons), as shown in Table 6.3; that is, each second-set average accuracy (time) is scaled by the ratio of the first-set S2-3cons average accuracy (time) to the second-set S2-3cons average accuracy (time). Having explained the differences between the two sets of experiments, we now focus on the second set and a performance comparison of the three configurations. In our statistical analysis we will use all 11 subjects, as it is difficult to draw significant conclusions with just six subjects; however, we will list some significant results with six subjects in footnotes. When the accuracies of the three configurations were analyzed using the chi-square test of independence, it was shown that the subject performance was independent of configuration (χ2(2, 99) = 1.4, p = 0.5).¹ On the other hand, using ANOVA to test the time differences among configurations, the differences were shown to be significant (F(2, 96) = 2.64, p = 0.08).² Our statistical analysis demonstrates that S5-hrtf significantly outperforms S2-3cons in terms of timing (paired t-test: t(32) = 2.74, p = 0.01). There are also significant differences in timing for S4-3int over S2-3cons (paired t-test: t(32) = 2.15, p = 0.039) and S5-hrtf over S4-3int (paired t-test: t(32) = 1.91, p = 0.066). In addition, we asked the subjects to rate the difficulty of the task, and a one-way ANOVA showed significant differences in the ratings of the three configurations (F(2, 30) = 3.61, p = 0.039). S5-hrtf was the easiest, followed by S4-3int and S2-3cons. Our results establish that S5-hrtf is the fastest and easiest-to-use configuration, and that S4-3int is better than S2-3cons. Thus, the importance of utilizing instantaneous acoustic cues (directionality in S5-hrtf and distance in S4-3int) for shape exploration becomes apparent. In fact, it may be beneficial to combine the two, for example, using directional sound to guide the finger along the border and loudness to keep it on the border. Another direction for improvement is better rendition of sound directionality. This can be accomplished by using individually calibrated HRTFs, as well as finer angular quantization and interpolation, as discussed in Section 5.5.2. Such improvements are costly and time consuming, but they may be worthwhile, especially for VI subjects. Overall, as we argued in the previous section and Section 4.1.5, spatial sound can be used to provide a natural, intuitive interface for exploration that allows the user to focus on the perception of object shape and scene layout. More importantly, the fact that subjects familiar with touch-screen interfaces significantly outperformed subjects without such experience points to the importance of training. Since none of the subjects in either experiment received any systematic training, it should be clear that there is a lot of room for performance improvement with extended training and experience.

¹ In this case, the differences are also not significant within the group of six experienced subjects (χ2(2, 54) = 1.17, p = 0.56).
² In this case, the differences are also significant within the group of six experienced subjects (F(2, 51) = 2.53, p = 0.09).
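To illustrate the suggested combination, the sketch below pairs a 12-sector angular quantization (as in the HRTF rendering of S5-hrtf) with a distance-based loudness gain (in the spirit of S4-3int). The function names, the border width, and the linear gain roll-off are hypothetical choices for illustration only; they are not the parameters used in our implementation.

```python
import math

NUM_SECTORS = 12    # angular quantization used for the HRTF cues in S5-hrtf
BORDER_WIDTH = 0.3  # hypothetical border half-width (inches)

def border_cue(finger_xy, nearest_border_xy):
    """Return (sector index, gain) for guiding the finger along the border.

    The sector index would select one of 12 pre-measured HRTF pairs
    (directional cue); the gain attenuates the border sound as the finger
    drifts away from the border (distance cue).  Hypothetical mapping.
    """
    dx = nearest_border_xy[0] - finger_xy[0]
    dy = nearest_border_xy[1] - finger_xy[1]
    # Azimuth of the nearest border point relative to a fixed, North-facing
    # virtual head (y assumed to increase toward the top of the tablet).
    azimuth = math.degrees(math.atan2(dx, dy)) % 360.0
    sector = int(azimuth // (360.0 / NUM_SECTORS))
    # Loudness falls off linearly as the finger moves away from the border.
    dist = math.hypot(dx, dy)
    gain = max(0.0, 1.0 - dist / BORDER_WIDTH)
    return sector, gain

# Example: finger slightly below and to the left of the nearest border point.
print(border_cue((1.0, 1.0), (1.05, 1.08)))
```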


Figure 6.8. Second set of experiments: Accuracy for each shape for S2-3cons, S4-3int, and S5-hrtf (S: square, C: circle, and T: triangle)

Figure 6.9. Second set of experiments: Time distribution for each shape of S2-3cons, S4-3int, and S5-hrtf (S: square, C: circle, and T: triangle)

6.1.3. Comparison With Existing Techniques

We now compare our results with those reported for Soundview [98], which was implemented on a graphical tablet using a pointing device for scanning. Note that the authors report only the mean values for their results; hence, we cannot perform a statistical analysis for the comparison, and we simply compare their reported means to ours. Soundview used two sounds, one inside the shape and one in the background, as in S1-2cons. However, the sound played to the subject at a given time depended on both the location and the velocity of the pointer. In addition, they used six shapes (square, circle, and triangle, with and without a hole in the middle).

In contrast to our experiment, they allowed subjects to have visual contact with the tablet (with the shape visually hidden) and the scanning pointer, thus indirectly using vision in shape identification, which is unrealistic for VI subjects. They used three different experimental setups. In the first, the subjects did not know the shapes they were going to be tested on, but they were told that they would be simple shapes, and they had to draw the shape after each trial. In the second, the participants were asked to choose the shape they perceived from a set of 18 shapes. In the third, they had to pick one from a set of six possible shapes. Each shape was presented once, and the subjects were not explicitly told whether shapes could be repeated or not. The overall accuracy for the three experiments was 30.0%, 38.3%, and 66.2%, respectively. In all of their setups, there was a time limit of 90 seconds for perceiving a shape and 90 seconds to record the response. We believe that the poor performance of Soundview can be attributed to the dependency of the acoustic stimuli on the velocity of the pointer, which makes it too complicated for subjects to decode. In addition, they tested the performance of the vOICe system (described in Chapter 1) in their six-shape setup, and found that the overall accuracy was only 31.0%. In terms of difficulty, our experimental setup was harder than the first and hardest of their setups, where the possible shapes were unknown to the subjects. This is because our experiments did not make indirect use of vision, in contrast to their allowing visual contact with the tablet and scanning pointer. Yet, our experiments demonstrated significantly better performance than the easiest of their setups. We also compare with the TeslaTouch experiments reported in [18], as they also attempted to convey simple shapes (triangle, square, and circle). As in the third Soundview experiment [98], the subjects had to select one out of a small set of shapes (three). They used three types of shape rendering: solid (as in S1-2cons), outline only, and solid with outline (as in S2-3cons). They tested these configurations with three blind subjects. There were no time limitations for the experiments. They reported just below 80% accuracy for the solid rendering, and just above 40% for the other two configurations. The average time per trial was less than two minutes. While their results for the solid configuration were about as good as those of our best configuration, one should keep in mind that the task of discriminating among their known shapes is a lot easier than that of identifying (and drawing correctly) an unknown shape. It is interesting that in the variable friction rendition, the solid shape outperforms the solid with outline configuration, while in our acoustic renditions, the solid with outline performs best. In fact, the authors also report that subjects had difficulty following the object edge in the outline configurations. This can be attributed to the limited spatial resolution of friction and the lack of directionality cues. The fact that the use of a third level for the object outline does not improve performance is an indication that friction is sensitive to gradients rather than absolute levels. More importantly, friction displays (and the tactile sense in general) lack the multiplicity of dimensions that acoustic signals offer (intensity, frequency, directionality, timbre, etc.) for conveying different types of information.


6.2. Navigation Experiments

We conducted four navigation experiments to study different design alternatives for object localization, including intensity versus tempo for distance rendering, fixed versus trajectory head orientation, the effects of object characteristic sound selection, guided versus unguided navigation, and the effects of object size. The performance criterion was the time to locate the object.

6.2.1. Navigation Experiment 1: Intensity versus Tempo

The results of Navigation Experiment 1 are shown in Figure 6.10, which shows the distribution of trial times across the different subjects for the intensity and tempo distance renderings. There is no significant difference between the time distributions for intensity (mean = 18.2, stdev = 21.5) and tempo (mean = 18.0, stdev = 26.4) variations, as shown by a two-sample t-test: t(529) = 0.11, p = 0.914. On the other hand, single-factor ANOVA analysis shows significant variations among subject performances for both the intensity (F(7, 265) = 14.86, p < 0.001) and the tempo (F(7, 269) = 26.32, p < 0.001) experiments. To achieve a conclusive comparison between intensity and tempo, we would have to reduce inter-subject variability by either pre-training subjects or recruiting more subjects and excluding outliers. Based on the existing results, however, we can only conclude that intensity and tempo are equally effective in rendering distance. Several interesting issues with intensity and tempo as distance cues came up in subject interviews after the completion of this navigation experiment. Most of the subjects complained that they could not perceive significant intensity changes in the background for small finger movements and had to move over a large distance before they could feel a change in intensity. This appears to nullify the advantage of continuous variation. Another issue is the use of two sounds (tone and noise) in the background. Three of the subjects complained about having to monitor two sounds instead of one. Two other subjects liked the idea of having two sounds in the background, but suggested that we increase the volume of the noise and decrease that of the tone as the finger approaches the object. Finally, the relationship between the intensity and loudness of the tone and noise signals is complicated by masking effects [99]. As we discussed in the previous chapter, rather than trying to understand and model such complicated effects, we decided to use an alternative approach, utilizing the direct-to-reverberation energy ratio. The subjects also indicated that tempo may provide a more comfortable interface that requires less concentration from the user, who may thus devote more attention to comprehending the scene organization. Regarding the quantization of tempos to three levels, some of the subjects used it to their advantage, adjusting their scanning speeds based on the tempo, i.e., their perceived distance from the object (the further from the object, the faster the scanning speed).

Figure 6.10. Results of Navigation Experiment 1: Trial time distribution for each subject ((a) intensity, (b) tempo)

In spite of the problems with intensity rendering described above, most of which were addressed in the subsequent navigation experiments, intensity rendering offers important advantages that motivated us to select it over tempo for distance rendering. First, the perception of intensity changes is instantaneous, while temporal cues like tempo, tremolo, and pitch are not. Second, intensity is a natural and intuitive auditory cue for distance. Finally, it is straightforward to implement continuous intensity variations on the touch-screen device, as opposed to quantized tempo variations.
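The two renderings can be summarized by the simplified mappings below: a continuous distance-to-gain curve for intensity and a three-level quantization for tempo. The exponential curve and the beat rates are hypothetical stand-ins; in particular, the calibrated volume-to-distance curve and the later direct-to-reverberation-based rendering are not reproduced here.

```python
import math

MAX_DIST = 10.0  # hypothetical maximum finger-to-object distance (inches)

def intensity_gain(dist, rolloff=2.0):
    """Continuous gain in (0, 1]: louder as the finger approaches the object.
    The exponential roll-off is only a stand-in for the calibrated
    volume-to-distance curve used in the actual experiments."""
    return math.exp(-rolloff * dist / MAX_DIST)

def tempo_level(dist, beats_per_sec=(8.0, 4.0, 2.0)):
    """Quantized tempo: three beat rates for near, mid, and far distances."""
    if dist < MAX_DIST / 3:
        return beats_per_sec[0]
    if dist < 2 * MAX_DIST / 3:
        return beats_per_sec[1]
    return beats_per_sec[2]

for d in (0.5, 3.0, 6.0, 9.0):
    print(f"d = {d:4.1f} in  gain = {intensity_gain(d):.2f}  "
          f"tempo = {tempo_level(d):.0f} beats/s")
```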


Table 6.5. Trial time statistics for Navigation Experiment 2 (FO: fixed orientation, TO: trajectory orientation)

                         First four subjects                     All ten subjects (bongo)
                         FO                TO
                         Bongo   Trumpet   Bongo   Trumpet       FO      TO
Trial time mean (s)      14.0    13.4      20.8    22.2          12.1    17.9
Trial time STD (s)       12.1    14.1      17.7    22.1          9.2     15.1
Difficulty (1-10)        4.5     4.2       6.8     7.2           3.4     5.8
Intuitiveness (1-10)     8.5     8.8       7.0     6.0           8.4     6.3
Cognitive load (1-10)    2.9     2.5       7.0     7.2           3.1     5.7

6.2.2. Navigation Experiment 2: Fixed versus Trajectory Orientation, Plus Sound Selection

As we saw in Section 5.5.2, the goals of this experiment were to determine the relative advantages of the two navigation modes (fixed versus trajectory head orientation) and the effects of sound selection. The latter is important for ensuring a broad selection of sounds for the intuitive representation of multiple objects in a layout, which will be considered in the next section. As explained in Section 5.5, subjects were asked to rate the difficulty, intuitiveness, and cognitive load for each mode on a scale of one to ten, ten being the highest. As we discussed, in the experiments with the first four subjects, we conducted four runs with two different sounds (bongo roll and trumpet) and the two navigation modes. The means and standard deviations of the trial times, as well as the means of difficulty, intuitiveness, and cognitive load for the four runs of the four subjects, are listed in Table 6.5. As can be seen in the table, the numbers for the two sounds were comparable, which means that both types of sounds can be used. The idea was to find out whether there were any dramatic differences in performance between the two instruments, rather than to establish performance equivalence in a statistically rigorous manner. We then proceeded with the experiments with the remaining six subjects using just one sound (bongo roll) and the two head orientations. Table 6.5 also shows the data for all ten subjects that conducted two runs with the bongo roll. The distributions of trial times for each mode and each subject are shown as box plots in Figure 6.11. For completeness, the corresponding data for the four subjects that conducted runs with the trumpet sound are shown in Figure 6.12. A two-sample t-test shows significant differences between the two modes (t(449) = −5.49, p < 0.001). According to Table 6.5, the mean trial times for the trajectory orientation are 48% longer than those for the fixed orientation. Fixed orientation also performs better when considering the mean subjective ratings for difficulty (paired t-test: t(9) = −3.00, p = 0.015), intuitiveness (paired t-test: t(9) = 2.35, p = 0.043), and cognitive load (paired t-test: t(9) = −3.20, p = 0.011).

Figure 6.11. Results of Navigation Experiment 2 with bongo roll: Trial time distribution for each subject ((a) fixed orientation, (b) trajectory orientation)

Nine out of the ten subjects preferred the fixed orientation and complained about the trajectory orientation being confusing, while Sub05 favored the trajectory orientation because of its flexibility in rotating the acoustic scene. In conclusion, after considering all the measured factors, we can safely conclude that the fixed orientation is the preferable navigation mode. Post-experiment interviews with the subjects indicate that only two of the subjects (Sub03 and Sub05) really did navigate to the object in a single straight line during the trajectory orientation run, as was hoped. Most subjects lost track of the virtual orientation, and hence the directionality information, within the first minute or so. After losing track of the orientation, three subjects (Sub01, Sub02, and Sub04) used random movements, while three other subjects (Sub07, Sub08, and Sub05) relied only on distance cues for navigation.

Figure 6.12. Results of Navigation Experiment 2 with trumpet: Trial time distribution for each subject (trajectory orientation and fixed orientation)

In the fixed orientation runs, only one subject (Sub06) navigated to the object in a single straight line, while eight subjects used the Manhattan navigation strategy. The remaining subject (Sub08) relied only on directional cues, claiming that distance cues were "shadowed" by the directional cues. The ineffectiveness of the trajectory orientation can be mainly attributed to the extra cognitive load that it demands. As is clearly evident from their feedback, most subjects had to explicitly keep mental track of the scanning history to derive the virtual orientation in order to decode the sound directionality. For subjects who jumped around instead of dragging the stylus on the screen, and for those who tried to stop and correct their scanning path (once they got lost) using spatial information, it was very difficult to determine the virtual orientation. Even for the subjects who were able to derive the orientation, rotating the acoustic scene accordingly (especially by large angles) proved to be difficult. For example, when scanning vertically in the direction towards the bottom of the tablet, a 180° rotation was required, which means that the subject had to mentally swap the sounds in the left and right ears. Thus, most subjects found it highly counterintuitive and disturbing to mentally reconstruct a scene with a virtual orientation that deviates by a large angle from their physical orientation. Since the tablet is fixed in front of the subject, the physical orientation is North (straight ahead), and it is natural to assume that the virtual (perceived) orientation is the same. Even if the subject physically turns the head, the perceived orientation still remains the same (North). In fact, it is the location of the tablet in front of the subject that determines the perceived orientation, not the head rotation. This alignment of the virtual world with the physical location of the tablet reduces the cognitive load for decoding the directionality, and explains the significantly lower "difficulty" and "cognitive load" ratings.
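The following sketch makes the difference between the two modes explicit: in the fixed orientation mode the rendered azimuth is simply the compass bearing from the finger to the object (North straight ahead, toward the top of the tablet), while in the trajectory orientation mode that bearing is additionally rotated by the virtual heading inferred from the most recent finger movement. The coordinate convention and the example points are assumptions for illustration only.

```python
import math

def bearing(from_xy, to_xy):
    """Compass bearing (degrees) from one screen point to another; North = 0
    and y is assumed to increase toward the top of the tablet."""
    dx, dy = to_xy[0] - from_xy[0], to_xy[1] - from_xy[1]
    return math.degrees(math.atan2(dx, dy)) % 360.0

def fixed_orientation_azimuth(finger, obj):
    # Virtual head always faces North (the top of the tablet).
    return bearing(finger, obj)

def trajectory_orientation_azimuth(prev_finger, finger, obj):
    # Virtual head faces the direction of the last finger movement, so the
    # acoustic scene is rotated by that heading before rendering.
    heading = bearing(prev_finger, finger)
    return (bearing(finger, obj) - heading) % 360.0

# Scanning straight down the tablet toward an object directly below:
prev, cur, obj = (2.0, 5.0), (2.0, 4.0), (2.0, 1.0)
print(fixed_orientation_azimuth(cur, obj))             # 180: rendered as behind
print(trajectory_orientation_azimuth(prev, cur, obj))  # 0: rendered as ahead
```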


Even though the results of this navigation experiment are fairly conclusive and in line with the subjects' post-experiment interviews, it is still possible that a creative subject may find an effective way to utilize the trajectory orientation mode. For example, Sub03 (who had the second best performance) moved the stylus in a circle until the sound was coming from the front, and then continued in that direction towards the object. This is why subjective experiments with VI subjects, who are highly motivated to use this interface in their daily life, are imperative in the future. Finally, the fixed orientation setup of this experiment is similar to the intensity setup of Navigation Experiment 1. Comparing the results of the two, we see that the subjects performed about 50% better in this experiment (t(360) = 4.33, p < 0.001). The faster navigation times justify the design changes we implemented for this experiment (the new volume-to-distance curve based on calibration, the elimination of background noise and the addition of reverberation, the finer directionality quantization, and the selection of new sounds), and strengthen the arguments for the continued use of intensity for distance rendering.

Table 6.6. Trial time statistics: Comparison of guided vs. unguided and large vs. small object size

                        Guided                        Unguided
                        Large Object  Small Object    Large Object  Small Object
Trial time mean (s)     12.1          17.1            71.9          406.3
Trial time STD (s)      9.2           12.4            61.4          218.4
Difficulty (1-10)       3.4           2.7             6.9           10.0
Cognitive load (1-10)   3.1           5.0             5.9           6.9
Intuitiveness (1-10)    8.4           9.8             6.0           4.0

6.2.3. Navigation Experiment 3: Unguided

The goal of this experiment was to locate a randomly placed object without any acoustic guidance in the background. As we discussed in Section 5.5.3, the object was represented by the bongo roll, and the background was silent. Figure 6.13 shows the distribution of trial times across subjects. A statistical comparison of the results with the fixed orientation results of Navigation Experiment 2 shows a significant increase in time for the silent background (two-sample t-test: t(66) = −7.88, p < 0.001). As shown in Table 6.6 (first and third columns, large object), the time for locating an object increases by a factor of six when the spatial acoustic guidance in the background is removed.

Figure 6.13. Results of Navigation Experiment 3 (unguided background): Trial time distribution for each subject

6.2.4. Navigation Experiment 4: Small Dot, Guided versus Unguided

The goal of Navigation Experiment 4 was to explore the effect of object size on performance.

The setups for Navigation Experiment 2 (with fixed orientation and bongo roll) and Navigation Experiment 3 were repeated with a smaller object size (dot radius reduced to one fourth). Box plots illustrating the timing distributions among subjects for the two setups are shown in Figure 6.14. Table 6.6 summarizes the corresponding results from the three experiments. Note that the smaller object size results in significant increases of trial times in both the guided (two-sample t-test: t(343) = −4.82, p < 0.001) and the unguided (two-sample t-test: t(15) = −5.88, p < 0.001) setups. However, the results are vastly different: the increase is relatively small (42%) for the guided setup, while it is huge (465%) for the unguided setup, that is, a factor of 11 difference in the time increases.

6.3. 2-D Layout Perception Experiments

As we discussed in Section 5.6, we carried out four experiments, one for each of the three layout configurations (L1-ung, L2-seq, and L3-vor) and a visual layout experiment. The layout perception experiments present a composite task that involves the subject's ability (a) to locate objects, (b) to rely on kinesthetic feedback to integrate the object locations into a perception of 2-D space, (c) to recognize individual objects, and (d) to memorize and reproduce the perceived layout. The first of these tasks was considered in the navigation experiments. A detailed understanding of each of the other tasks is beyond the scope of this thesis. However, the visual layout experiment was an attempt to isolate the last task. This is because in the visual experiment, object localization, object recognition, and perception of 2-D space are straightforward, while the memorization and reproduction of the perceived layout remain challenging. Success was evaluated based on the accuracy of reproduction and the time it took to complete the experiment. It is thus important to have an objective measure of the subjects' performance in reproducing the 2-D layout, so that we can compare the performance among subjects and configurations.

Figure 6.14. Results of Navigation Experiment 4: Trial time distribution for each subject ((a) guided, (b) unguided; note the factor of 6 scale difference in the y-axes)

To obtain such a measure, we calculate the layout reproduction error as the average distance (in inches) between the subject's positioning of each object and the actual location of the object. Since we conducted the layout experiments using two tablets with different dimensions and different resolutions, the error should be normalized. This can be done by dividing the error by the diagonal of the display (the maximum possible length). The total layout reproduction error can then be calculated as the average of the errors for all objects and subjects. We will refer to this as the error of reproduction (EOR).
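A minimal sketch of the EOR computation for a single subject is shown below; the screen dimensions and the object coordinates are hypothetical, and the total EOR is obtained by further averaging this quantity over subjects.

```python
import math

def eor(placed, actual, screen_w_in, screen_h_in):
    """Error of reproduction for one subject: mean placed-vs-actual distance
    (inches), normalized by the display diagonal (the maximum possible error).
    The total EOR is this value averaged over all subjects as well."""
    diagonal = math.hypot(screen_w_in, screen_h_in)
    dists = [math.hypot(px - ax, py - ay)
             for (px, py), (ax, ay) in zip(placed, actual)]
    return sum(dists) / (len(dists) * diagonal)

# Hypothetical example: one subject placing four objects on a tablet with an
# 8.0 x 5.0 inch active area (illustrative dimensions only).
actual = [(1.0, 1.0), (6.5, 1.5), (2.0, 4.0), (7.0, 4.5)]
placed = [(1.3, 1.2), (6.0, 1.9), (2.4, 3.6), (7.2, 4.4)]
print(f"EOR = {eor(placed, actual, 8.0, 5.0):.3f}")
```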


Figure 6.15. Results of the experiment with L2-seq (blue: 452 Hz, green: 652 Hz, orange: 852 Hz, and red: 1052 Hz)

6.3.1. Experiment with L2-seq

The results of the layout experiment with L2-seq are shown in Figure 6.15, which is a composite drawing of the object placements by the four subjects. The solid circles denote the actual object positions and the empty circles show the placements by the four subjects, marked by the subject number in the center of the circle. Note that two of the subjects confused the green and blue objects. This can be attributed to the fact that the objects were represented by adjacent frequencies (452 and 652 Hz). Aside from this, it appears that Subject 2 had considerably better placement accuracy than the other subjects, but overall the performance is reasonable, at least in the relative placement of the objects, if not in the accuracy of placement. For L2-seq, the EOR was 0.166 and the average time for scene exploration was seven minutes and eight seconds.

6.3.2. Experiments with L3-vor

The results of the layout experiment with L3-vor are shown in the composite drawing of Figure 6.16 for the six subjects that participated in the experiment.


Figure 6.16. Results of the Experiment with L3-vor (blue: bongo roll, green: trumpet, red: oboe, and black: clarinet)

Note that Subject 1 has placed the green object almost on top of the red object and has placed the red object just 0.3 inches (100 pixels) away from the green. It is thus safe to assume that the subject has confused the two objects (red and green). Similarly, Subject 3 has confused the red object with the black object. As we remarked in the discussion of the results of L2-seq, these mistakes can be attributed to the subjects confusing the characteristic sounds of the two objects. In this case, the oboe recording was confused with the trumpet and the clarinet. None of the subjects confused the bongo with any other sound, probably because the bongo was the only percussion instrument, while all the other sounds corresponded to wind instruments with relatively steady sounds. In order to prevent such mix-ups, it is important to pick sounds that are easy to distinguish and to memorize. Apart from the two mix-ups, most of the subjects did a good job in reproducing the perceived layout. Excluding the confusions, the largest and smallest deviations of a subject from the actual locations were 1.8 inches (548 pixels) and 0.03 inches (10 pixels), respectively. The average (across objects and subjects) deviation was 0.5 inches (163 pixels). The EOR for L3-vor was 0.099 and the average time was eight minutes. Figure 6.17 shows the error distribution among subjects in placing the objects. From Figures 6.16 and 6.17, it is clear that Subjects 2 and 4 performed the best, while Subjects 1 and 3 had the highest errors due to the confusions discussed above.


Figure 6.17. Error distribution for each subject in the experiment with L3-vor (Sub01* and Sub03* show the errors after the mix-ups have been corrected)

6.3.3. Experiments with L1-ung

The results of the layout experiment with L1-ung are shown in the composite drawing of Figure 6.18 for the three subjects that participated in the experiment. Since there is no acoustic guidance in the background, a comparison of the results of this experiment with the previous two can be used to evaluate the effect of acoustic guidance for object localization on the overall layout perception. Observe that there were no object confusions in this experiment. Even though the number of subjects was too small (three) to draw any firm conclusions, we should compare with L2-seq, where two out of four subjects confused objects.

6.3.4. Experiments with Visual Layout

The results of the visual layout experiment are shown in the composite drawing of Figure 6.19 for the three subjects that participated in the experiment. This experiment was conducted to get a measure of the best possible accuracy with the experimental setup we were utilizing for all of the layout experiments. As should be expected, there were no object confusions in this experiment.

6.3.5. Statistical Analysis and Discussion

Table 6.7 summarizes the results for all four layouts. As discussed above, we use the EOR as the objective measure in comparing the accuracy of the layout configurations; the lower the EOR, the better the performance. Single-factor ANOVA on the EOR (with mix-ups) shows significant differences among the four layouts (F(3, 60) = 3.74, p = 0.016).


Figure 6.18. Results of the experiment with L1-ung (blue: bongo roll, green: trumpet, red: oboe, and black: clarinet)

Table 6.7. Results of layout experiments

                             L1-ung   L2-seq   L3-vor   Visual
EOR mean (with mix-ups)      0.056    0.166    0.099    0.065
EOR mean (w/out mix-ups)     NA       0.110    0.049    NA
EOR STD (with mix-ups)       0.037    0.129    0.110    0.027
EOR STD (w/out mix-ups)      NA       0.052    0.039    NA
Time (seconds)               623      428      479.8    24.3
Difficulty (1-10)            4.3      NA       4.6      3.3
Cognitive load (1-10)        6.7      NA       3.8      4.3
Intuitiveness (1-10)         8.3      NA       8.5      7.0

The same conclusion holds when the mix-ups are not included (F(3, 60) = 7.29, p < 0.001). However, as shown by similar ANOVA analyses, there is no significant difference among the configurations for timing or for the subjective ratings of difficulty, intuitiveness, and cognitive load. Note that L1-ung reports the lowest error, a result that would suggest that navigation guidance via spatial sounds in L2-seq and L3-vor is unnecessary. However, one should keep in mind that in all of the layouts there were four objects, each of which was presented with a dot of 0.2 inch radius.


Figure 6.19. Results of the Experiment with visual layout

Thus, the objects take up 4.5% of the total area of the screen. Moreover, subjects were allowed to scan using multiple fingers in all of the nonvisual layout experiments. This means that the chances of bumping into an object during scanning without acoustic guidance are quite high. The results of the unguided layout exploration should be contrasted with the results of unguided navigation to a single object, which were presented in Section 5.5.3. The average time to localize one object was six times higher without acoustic guidance (N1-ung, Navigation Experiment 3) than with acoustic guidance (N2-gui, Navigation Experiment 2). This is because in the layout exploration the localization is parallelized. The results of Navigation Experiment 4 (Section 5.5.4) demonstrated that, in the unguided case, the time for locating a single object increases dramatically when the dot size is reduced, while in the guided case, the increase is only moderate. Therefore, we expect that the performance of the unguided layout exploration (L1-ung) will drop rapidly as the size of the dots representing the objects decreases, while the performance of the guided exploration (L2-seq and L3-vor) will not be severely affected by the size change. The confusions observed in the guided layouts (L2-seq and L3-vor) contribute heavily to the EOR. However, if we correct for the two confusions in L3-vor, the mean EOR drops to 0.049, which makes it comparable to L1-ung. Similarly, in L2-seq, the mean EOR drops to 0.11.


Note that the confusions were observed only in the guided configurations. While the statistical data may not be adequate to draw any firm conclusions, an explanation might lie in the difference between having to listen to the sounds continuously in the guided layouts and only when the finger is on an object in the unguided layout. In fact, two of the three subjects that participated in the L1-ung experiment (all three of whom had also participated in the L3-vor experiment) explicitly mentioned this as the reason for the mix-ups. Using easier-to-distinguish sounds for the different objects may help alleviate this problem in the future. A two-sample t-test performed on the EOR data for L2-seq and L3-vor shows a marginally significant difference, with t(28) = 1.67, p = 0.107. However, when the mix-ups are removed, the difference becomes significant (t(26) = 3.88, p < 0.001). Given that there is no significant difference in the timing data between the two configurations (t(7) = −0.272, p = 0.793), we can argue that L3-vor outperforms L2-seq. The simultaneous presentation of the objects in L3-vor could explain the better performance. On the other hand, Navigation Experiment 2 and the layout experiment with L3-vor were conducted with a number of enhancements over Navigation Experiment 1 and the layout experiment with L2-seq. The enhancements, which included intensity calibration, using the direct-to-reverberation energy ratio instead of the signal-to-noise ratio, finer directionality quantization, and new sound selections, may also account for the superior performance of L3-vor. We now return to the opening statement of this section, that the visual layout experiment was conducted in order to isolate the memorization and reproduction of the perceived layout from the object localization, recognition, and 2-D layout perception. A comparison with the visual layout performance shows no significant difference between the EOR distributions of L3-vor (with mix-ups) and the visual layout (t(28) = 1.391, p = 0.175). Of course, the same holds true when the mix-ups are removed. However, there is a significant difference between L2-seq and the visual layout, even when the mix-ups are corrected (t(24) = 1.71, p = 0.009). It would thus seem that with the design enhancements and the simultaneous representation of objects, the object localization, object recognition, and 2-D layout perception via L3-vor have approached the optimal performance level provided by the visual layout.

6.4. Special Applications

Here we discuss the results and implications of the experiments conducted on special applications.

6.4.1. Virtual cane: A1-cane and A2-cane

Table 6.8 summarizes the results of the experiments with A1-cane and A2-cane. In the zoomed-out mode of both configurations, the subjects had no problem determining the number of objects in the scene. In A1-cane, there were significant errors in material identification, which prompted us to narrow down the number of possible materials in A2-cane.


Table 6.8. Results of experiments with A1-cane and A2-cane

                                 Configuration A1-cane         Configuration A2-cane
                                 Obj. 1   Obj. 2   Obj. 3      Obj. 1   Obj. 2   Obj. 3
Accuracy (%)
  Number of objects              100                           100
  Material identification        90       80       70          100      60       60
  Shape identification           20       30       20          100      100      100
  Shape (average)                23.3                          100
Overall time (s)                 745                           240

However, material identification errors persisted, this time confined to confusions between glass and metal. This is because the metal and glass sounds were not easy to distinguish. In the future, we plan to use more distinguishable sounds, even if they are not as realistic. In the zoomed-in mode of A1-cane, which all of the subjects utilized for object shape identification, the accuracy was remarkably low compared to the configurations we discussed in Section 6.1. As we explained in Section 4.4.1, the zoomed-in mode is the same as S5-hrtf. However, the objects are considerably more complicated, consisting of combinations of the three basic shapes used in the earlier configurations. This is most likely the reason for the lower performance (even though two of the subjects had perfect responses). Shape identification performance is expected to improve with the use of individual HRTFs. Figure 6.20 shows some interesting sketches from the experiments with A1-cane. The first row shows successful sketches, and the second and third rows show failed attempts to identify the shapes in the first row. Note, however, that the sketches in the second and third rows can be considered to consist of straight-line and pixel approximations of curved or diagonal lines in the original shapes, as well as one curved-line approximation of a piecewise linear segment (roof). Note that the third row was drawn by Subject S2A, who also drew four pixel approximations of the circle in Figure 6.7 and two pixel approximations of the triangles in Figure 6.3. The average time for the experiments with A1-cane was 12.5 minutes. The recorded time included the exploration (in both modes) and the time taken to draw and label the scene on paper. In contrast, A2-cane resulted in perfect shape identification and a much faster time (4 minutes). Comparative statistical analysis showed that A2-cane has significantly better performance than A1-cane in both accuracy (χ2(1, 72) = 25.48, p < 0.001) and timing (two-sample t-test: t(21) = 4.91, p < 0.001). This is because the use of raised-dot patterns on a flat background is much better suited for shape identification than sound rendition, thus eliminating the need for the time-consuming zoomed-in mode. However, the performance is expected to decrease when two or more distinct raised-dot patterns are used to differentiate adjacent objects. More importantly, the tactile feedback is static, impairing the interactivity of the device. The development of dynamic tactile devices is expected to remedy this problem in the not-too-distant future. An alternative approach is to provide feedback through the use of vibrations or variable friction. However, neither of these is expected to have the shape rendition accuracy of the raised-dot patterns, for the reasons discussed in Section 6.1.3.


Figure 6.20. Selected subject drawings in experiments with A1-cane: First row marked correct, second and third rows marked wrong (A1-1: first object in A1-cane)

6.4.2. Experiment A3-venn

Table 6.9 summarizes the results of the experiment with A3-venn. All subjects worked on their answers while exploring the Venn diagram and drew the diagram they perceived at the end of the exploration. Note that circles B (blue) and C (red) had 59% overlap (calculated as the percentage of the smaller circle's area). Half of the subjects judged it to be in the 40–60% range and the other half picked the 60–90% range. This was not surprising, as the actual percentage was right at the edge of the two ranges. The overlap of circles A and B was 16%, and was correctly judged to be in the 10–40% range by all subjects. All relative locations were judged correctly, except by one subject who thought that circle A (green) was northwest of B instead of directly west. Finally, in the relative size questions, all the subjects provided the correct answers, except for one subject who incorrectly judged the size of A to be large (that is, the same as B).
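For reference, overlap percentages of this kind can be computed in closed form from the circle radii and the distance between their centers, using the standard circle-circle intersection (lens) area expressed as a percentage of the smaller circle's area. The radii and center distances in the sketch below are hypothetical, chosen only so that the two examples land near the reported 59% and 16% values; they are not the dimensions of the actual diagram.

```python
import math

def overlap_percentage(r1, r2, d):
    """Intersection area of two circles (radii r1, r2, center distance d),
    expressed as a percentage of the smaller circle's area."""
    small = min(r1, r2)
    if d >= r1 + r2:          # disjoint circles
        return 0.0
    if d <= abs(r1 - r2):     # smaller circle fully contained
        return 100.0
    # Standard lens-area formula for partially overlapping circles.
    a1 = r1**2 * math.acos((d**2 + r1**2 - r2**2) / (2 * d * r1))
    a2 = r2**2 * math.acos((d**2 + r2**2 - r1**2) / (2 * d * r2))
    a3 = 0.5 * math.sqrt((-d + r1 + r2) * (d + r1 - r2)
                         * (d - r1 + r2) * (d + r1 + r2))
    lens = a1 + a2 - a3
    return 100.0 * lens / (math.pi * small**2)

# Hypothetical circle pairs: a large overlap near the 40-60% / 60-90%
# boundary and a smaller overlap in the 10-40% range.
print(f"{overlap_percentage(1.5, 1.0, 1.24):.0f}%")  # about 59%
print(f"{overlap_percentage(1.5, 1.0, 2.00):.0f}%")  # about 16%
```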


Table 6.9. Results of the experiment with A3-venn (in relative locations, "A to B" means A relative to B)

                Overlaps                Relative Locations             Relative Sizes
                A-B    B-C    C-A       A to B   B to C   C to A       A      B      C
Accuracy (%)    100    50     100       83.3     100      100          83.3   100    100

Overall, the Venn diagram representation was quite successful, and demonstrated the ability of the proposed configuration to convey interesting relationships in a constrained space.

6.5. Conclusions

In this chapter we presented and analyzed the results of subjective experiments with multiple configurations that were designed to test different aspects of the proposed acoustic-tactile display, as well as simple applications. Throughout the chapter we demonstrated that spatial sound (directional and distance cues) for guiding the scanning finger or pointer provides significant performance advantages, especially in terms of the time needed to complete the task. Such advantages were shown to increase with the amount of detail (smaller object size) in the display. The virtual cane application demonstrated that an effective interface does not have to be realistic (striking sounds in response to rubbing the finger on the screen), while our comparison of the fixed and trajectory head orientations for sound directionality rendering demonstrated that simple and intuitive approaches can yield better results than trying to imitate navigation in the real world. We also showed that raised-dot patterns provide the best shape rendition in terms of both accuracy and speed, and argued that they offer substantial advantages over variable friction displays. Unfortunately, however, current technology does not allow the combination of raised-dot patterns and acoustic feedback in a dynamic, affordable display. Our experiments also demonstrated significant inter-subject performance variations. We showed that there is a clear performance gap between the experienced and inexperienced groups of subjects, which indicates that there is a lot of room for improvement with systematic training and extensive experience. We also saw varying levels of commitment across subjects. Some of the subjects took their time during the training phase to figure out the most effective strategies for accomplishing the goals of the experiment, and sometimes came up with unexpected, creative approaches. All this is very important because the primary intended application of the proposed acoustic-tactile representation is assisting visually impaired subjects, who are highly motivated and willing to undergo extensive training in a methodical, controlled fashion.


CHAPTER 7

Conclusions and Future Research

We have proposed a new approach for dynamic acoustic-tactile representation of visual signals. Using this approach, a visually impaired or otherwise visually blocked user can actively explore a two-dimensional layout consisting of one or more objects on a touch screen, using a finger or stylus as a pointing device and listening to acoustic feedback on stereo headphones. A key distinguishing feature of the proposed approach is the use of spatial sound to facilitate the active exploration of the layout. While the emphasis is on interactive display of visual information via hearing and touch using existing devices, we also experimented with static superimposed raised-dot tactile patterns embossed on paper. Our research has addressed object shape and size perception, object localization and identification, and simple 2-D layout perception, as well as relatively simple applications, including the rendering of a simple scene layout that consists of objects in a linear arrangement, each with a distinct tapping sound, which we compare to a "virtual cane," and a Venn diagram application. We have designed and implemented a number of configurations for the embodiment of the proposed approach, and conducted subjective experiments with visually blocked subjects to test their effectiveness and to select the best modes and parameter settings. Our research has demonstrated that spatial sound (directional and distance cues) for guiding the scanning finger or pointer provides significant performance advantages, especially in terms of the time needed to complete the task. In particular, we have shown the importance of instantaneous acoustic cues (directionality cues via HRTF and distance cues via intensity variations) for shape exploration. We considered each of these cues in separate configurations. However, in the future it may be beneficial to combine the two, using directional sound to guide the finger along the border and loudness to keep it on the border. Based on the available data, we concluded that the proposed approach outperforms existing approaches, and argued that using sound for shape rendering has considerable advantages over variable friction displays. We also demonstrated that, even though they are currently available only as static overlays, raised-dot patterns provide the best shape rendition in terms of both accuracy and speed. We also showed that directionality cues (HRTF) and distance cues (direct-to-reverberation energy ratio) provide the best object localization, and that their advantages relative to nondirectional sound exploration increase with the amount of detail (smaller object size) in the display.


We also explored two modes of directionality rendering, one assuming a fixed virtual head orientation and the other assuming that the virtual head is oriented in the direction of the finger movement, and showed that the former is simpler, more intuitive, and the most effective. Our experiments with layout rendering and perception demonstrated that simultaneous representation of objects, using the most effective approaches for directionality and distance rendering, approaches the optimal performance level provided by visual layout perception. Finally, experiments with the virtual cane and Venn diagram configurations demonstrated that the proposed techniques can be used effectively in simple but nontrivial real-world applications. Our subjective experiments revealed a clear performance gap between experienced and inexperienced subjects, which indicates that there is a lot of room for improvement with appropriate and extensive training. In addition, the varying levels of commitment across subjects make it imperative that, in the future, we conduct systematic tests with visually impaired subjects, who are highly motivated and willing to undergo extensive training. Our subjective experiments, which were very time consuming, were carried out with visually blocked, unpaid volunteers. Thus, our experimental data were limited compared to what typical subjective experiments produce, and as a result, some of our conclusions are not statistically significant. Overall, the focus of this thesis was on the development and implementation of new approaches for acoustic-tactile display of visual information, rather than on rigorous statistical tests for what was, by necessity, a limited number of configurations. However, by exploring a wide variety of design alternatives and focusing on different aspects of the acoustic-tactile interfaces, our results offer many valuable insights for the design of future systematic tests, utilizing the most effective configurations, and, subject to the availability of funding, with paid, committed, visually impaired and visually blocked subjects.