Hearing is a human and therefore highly subjective process. Testing any facet of hearing requires special considerations and analysis of test results. Before investigating spatial hearing, one must examine hearing in the more general context of physics, sensation, and perception. Auditory events and their measurements in the context of a given coordinate system are also described.
Physics is the study of matter, energy, and their interaction. Physics is concerned with quantitatively predicting the "evolution of a given physical system (or "postdict" its past history), based on the conditions in which the system is found at a given time [12]." Measurements of the system are used to induce the physical laws that govern it, and mathematical equations based on these laws establish relationships between measured physical magnitudes at a given time. Simplified models of the system often are made in lieu of detailed equations when the relationship between the system's values is too complex. The mathematical equations or model(s) thus devised are used to predict the future state of the system. In classical physics, exact and unique predictions can be made within the accuracy of the measurement instruments. In quantum physics, only probabilities for the physical magnitudes can be predicted because measurements on the atomic scale are never exact or unique.
Sensation is both the experience associated with a physical stimulus and the initial steps by which the sense organs and neural pathways take in stimulus information [13]. The process of sensation can be considered a chain of three classes of events: physical stimulus, physiological response, and sensory experience. Gray describes these events:
(1) The physical stimulus is the matter or energy that impinges on sense organs;
(2) the physiological response is the pattern of electrical activity that occurs in sense organs, nerves, and the brain as a result of the stimulus; and
(3) the sensory experience is the subjective, psychological sensation -- the sound, sight, taste, or whatever, that is experienced by the individual whose sense organs have been stimulated [13].
The first two events are measurable directly by physical means. The third cannot be measured directly but can be measured indirectly through observing behavior. The manner of this measurement is explained below.
Different areas of academic inquiry have grown that focus on the relationships between these events [13]. These relationships are shown in Figure 2.1. Sensory physiology is the study of the relationship between the stimulus and the physiological response. Sensory physiological psychology is the study of the relationship between the physiological response and the sensory experience. (The more general category, physiological psychology, is the study of the physiological mechanisms that mediate behavior and psychological experiences.) Psychophysics is the study of the relationship between the physical stimulus and the sensory experience, disregarding the physiological response that actually mediates that relationship.
Fig. 2.1. The three events in sensation and their relationships [13].
Roederer [12] has characterized psychophysics by comparing it with the more traditional area of physics:
Like physics, psychophysics requires that the causal relationship between physical stimulus input and psychological (or behavioral) output be established through experimentation and measurement. Like physics, psychophysics must make simplifying assumptions and devise models in order to be able to establish quantitative relationships and venture into the business of prediction-making. Unlike classical physics, but strikingly like quantum physics, psychophysical predictions can never be expected to be exact or unique -- only probability values can be established. Unlike classical physics, but strikingly like quantum physics, most measurements in psychophysics will substantially perturb the system under observation. The result of a measurement does not reflect the state of the "system per se," but, rather, the more complex state of the "system under observation [12]."
Crucial differences do exist between psychophysics and both branches of physics. Repeated measurements may condition the response of the observed psychophysical system, and the motivation of the subject and its consequences may interfere with measurements in an unpredictable way [12].
Perception is the organization and recognition of one or more sensory experiences. To understand these sensations, "perceptual processes must integrate and organize the sensory input, extracting the useful information that resides not in the individual stimulus elements but in the arrangement of those elements in space and time [13]." These processes occur entirely in the brain and, as such, are far less amenable to physiological study than the sense organs. While the study of sensation is primarily a physiological approach, the study of perception is primarily a cognitive one that looks at our understanding of physical reality.
Patterns are perceived according to various methods of grouping stimuli, and objects are recognized based on these patterns. The selectivity of perception is useful in perceiving patterns and recognizing objects. Attention is the process by which the mind chooses which among all stimuli should enter into higher stages of information processing [13]. Humans can focus their attention on one stimulus and ignore others, or they can monitor the stimuli that they are not attending to and use them as a basis for shifting attention. Object recognition and attention are closely related to the understanding of stimuli in relation to us in three-dimensional space.
Psychophysical Magnitude: Psychophysical or Perceptual Measurement?
Psychophysical magnitudes are the learned classification and ordering of sensations, typically corresponding to physical stimuli, based on learned frames of reference. While the concept of psychophysical magnitude typically is discussed along with psychophysics, it is better understood in the larger cognitive context of perception, memory, and intellect. The "ordering of sensations" clearly falls under perception, while "learned classification" encompasses memory and the human intellect. Unlike the physical magnitudes of stimuli, these psychophysical magnitudes, which typically correspond to the stimuli, cannot be measured directly. Humans learn to make judgments such as "twice as long" and "half as loud" relative to perceptions (or memories of perceptions) of sensations, and it is the indirect measurement of these judgments that constitutes a psychophysical magnitude measurement.
Acoustics is the branch of physics that is concerned with sound, the wave motion in air or other elastic media that is in the frequency range of human hearing [14] [15]. Psychoacoustics is the branch of psychophysics that studies the relationship between acoustic stimuli (sounds) and auditory sensations. Much scientific inquiry under the title of psychoacoustics also includes auditory perception because psychophysical magnitudes cannot be measured without the intervention of perceptual processes, as described above.
Jens Blauert, author of a comprehensive reference on spatial hearing [14], categorizes some of these ideas. He names the acoustic stimulus the "sound event" and that which is perceived auditorily as the "auditory event." Blauert notes that the relationship between sound events and auditory events cannot be assumed to be a causal one. "A careful description would go no farther than to say that particular precisely definable sound events and particular precisely definable auditory events occur with one another or one after the other under certain precisely defined conditions [14]." The best example of an auditory event not corresponding to a sound event is the ventriloquism effect, wherein localization is affected by visual cues. Auditory events associated with sounds produced by loudspeakers are variously called phantom images or virtual sound sources.
Blauert's simple model of an auditory event includes a perceiving system and a describing system. Only the person hearing the sound event can observe the output of the perceiving system. In his more complex models, Blauert characterizes the describing system as a psychophysical measuring instrument and adds response modifying factors based on higher-order, cognitive brain functions. Figure 2.2 shows the combination of Blauert's models with Gray's concepts of sensation and perception.
Fig. 2.2. Model for an auditory experiment (after [14], [13]).
The coordinate system that is commonly used to describe the spatial relationships between the sound source and the subject's head is shown in Figure 2.3. The horizontal plane is the region that is level with the listener's ears. The median plane is the region in which sound sources (events) are equidistant from both ears. The frontal or lateral plane divides the listener's head vertically between the front and the back. When the sound source is not equidistant from the ears, by definition one ear must be closer to the sound source than the other. The ipsilateral ear is nearest the sound source, and the contralateral ear is farthest from the sound source. Sounds arriving at the ipsilateral ear arrive earlier and are generally more intense because of the inverse square law.
Fig. 2.3. Coordinate system for spatial hearing experiments [16].
A vector describing the position of the sound source relative to the center of the listener's head is expressed in polar fashion as an azimuth, elevation, and distance. Figure 2.4 shows this localization vector. Azimuth is measured as an angle between a projection of the vector onto the horizontal plane and a second vector extending in front of the listener [16]. This second vector along the median plane usually is given as 0º, and azimuths are measured relative to this vector going counter-clockwise from 0º to 359º around the listener's head. (In this project, azimuth is the most important measurement because we are only concerned with localization of virtual sound sources from horizontally arrayed loudspeakers.) Elevation is measured as the angle between the localization vector and its projection onto the horizontal plane (0º), increasing with the height of the sound source. Here, 90º and -90º elevations are directly above and below the listener, respectively.
Fig. 2.4. Position of the sound event relative to the center of the head, with reference azimuths labeled [16].
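To make the coordinate convention concrete, the following minimal Python sketch (not part of the original text) converts a head-centered Cartesian source position into azimuth, elevation, and distance as defined above. The axis assignment (x forward, y toward the listener's left, z up) is an assumption chosen so that azimuth increases counter-clockwise from 0º at the front.

```python
import math

def to_spherical(x, y, z):
    """Convert a head-centered Cartesian position (meters) to
    (azimuth_deg, elevation_deg, distance_m).

    Assumed axes: x points forward (0 deg azimuth), y toward the listener's
    left, z up. Azimuth increases counter-clockwise from 0 to 359 degrees;
    elevation runs from -90 (below) to +90 (above) degrees.
    """
    distance = math.sqrt(x * x + y * y + z * z)
    azimuth = math.degrees(math.atan2(y, x)) % 360.0
    elevation = math.degrees(math.atan2(z, math.hypot(x, y)))
    return azimuth, elevation, distance

# A source 1 m away, directly to the listener's left, at ear height:
print(to_spherical(0.0, 1.0, 0.0))   # (90.0, 0.0, 1.0)
```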
Human Localization Performance
How well does a human being actually hear spatially? Two definitions must be given before answering this question. "Localization" is the law or rule by which the location of an auditory event is related to specific attributes of a sound event or other stimulus correlated with the auditory event [14]. The typical question regarding localization is, "where does the auditory event appear, given a specific position of the sound source?"
"Localization blur," which depends on the type of sound source and its direction, can be thought of as the resolution of spatial hearing. It is the smallest change in specific attributes of a sound event or other event correlated to an auditory event that is sufficient to produce a change in the location of the auditory event [14]. The typical question regarding localization blur is "what is the smallest possible change of position of the sound source that produces a just-noticeable change of position of the auditory event?" It is measured as the amount of displacement of the sound source that is recognized by 50 percent of experimental subjects as a change in position of the auditory event.
Localization blur in the horizontal plane for a sound source at 0º azimuth and elevation has been measured between about 0.9º and 4º depending on the type of signal [14]. For 100 ms white noise pulses at 70 phons, localization blur increases to about ± 10º at the sides (90º or 270º) and decreases again to about ± 6º in the rear (180º). (Phons are defined in Appendix A.) Blauert also found that localization for these sounds arriving from 0º, 90º, 180º, and 270º azimuth was 359º, 80.7º, 179.3º, and 281.6º, respectively.
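Because localization blur is defined by the 50 percent criterion above, it can be read off a psychometric function interpolated from experimental responses. The sketch below illustrates the idea; the displacement values and response proportions are hypothetical and are not results from [14].

```python
import numpy as np

# Hypothetical data: source displacements (degrees) and the fraction of
# subjects who reported a change in the auditory event's position.
displacement = np.array([0.5, 1.0, 2.0, 3.0, 4.0, 6.0])
prop_noticed = np.array([0.05, 0.20, 0.45, 0.70, 0.85, 0.95])

# Localization blur: the displacement at which 50 percent of subjects notice
# a change, estimated here by linear interpolation of the psychometric data.
blur = np.interp(0.5, prop_noticed, displacement)
print(f"Estimated localization blur: {blur:.2f} degrees")   # 2.20 degrees
```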
Localization blur varies with elevation quite differently. For signals of familiar speech, localization blur varies along the median plane from ± 10º at 0º elevation to as much as ± 27º at 180º. Localizations for this same experiment showed greater errors as the sounds moved over the head from front to back, with as much as 32º in error for a source at 144º elevation (above and behind the listener). This is a typical problem in localization and shows up similarly in experiments with different source signals.
Spatial hearing should be examined in the context of biological evolution. "The most important higher cortical functions of an animal brain are environmental representation and prediction, and the planning of behavioral response with the goal of maximizing the chances of survival and perpetuation of the species [12]." Our sensory systems, necessary for environmental representation, evolved based on their usefulness in picking out the information that is most useful to our survival from the sea of energy around us [13].
McEachern [17] argues that signal detection, identification, and location (localization) were the most critical signal processing tasks for the eyes and ears of early humans. (I recall an unknown reference that observed that an object must emit or disturb energy to be detected by a biological or man-made sensor.) To hunt and avoid being eaten, our ancestors had to detect nearby movement, identify it as prey, predator, or human, and determine its location so they could run towards or away from it as necessary. While his ideas are reasonable, they are more useful in engineering applications than in describing actual perceptual processes because identification and localization are not independent.
All of these views point to the evolutionary advantage of environmental representation through perception of spatial object-person relationships. Spatial hearing exists because it is advantageous to humankind's survival.
The study of spatial hearing began in the area of psychoacoustics and has since been embraced by other disciplines, including acoustics. One of the oldest psychoacoustic theories of spatial hearing is Rayleigh's duplex theory of sound localization [16]. Based on experiments with sine wave stimuli, Rayleigh found that differences in the signals reaching the listener's ears strongly affected spatial perception. Interaural intensity differences (IIDs) and interaural time differences (ITDs) were found to have a significant impact in specific frequency ranges. IIDs also are called interaural level differences (ILDs) in the literature. (See Appendix A for a description of how intensity level influences subjective loudness.)
IIDs were found to dominate localization for frequencies above about 1.5 kHz. Because the head is much larger than wavelengths of sound in this range, most of the energy of the sound wave is reflected away. (Taking the speed of sound to be 343 m/s at room temperature, and the distance between the ears (an approximate diameter of the head) to be 17.5 cm, we find the wavelength of equivalent size to the head to correspond to 1.96 kHz.) Thus the contralateral ear is said to be acoustically shadowed by the head. ITDs are not as important in this frequency range because it is more difficult to judge time delay based on phase differences of higher frequency stimuli. ITDs dominate localization for frequencies below 1.5 kHz, where sound waves more easily diffract around the head and there is less of an intensity difference between the ears.
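The crossover frequency quoted in parentheses follows directly from the relation λ = c/f; a two-line check using the same values assumed in the text:

```python
c = 343.0               # speed of sound in m/s (value used in the text)
head_diameter = 0.175   # approximate distance between the ears, in m

# Frequency whose wavelength equals the head diameter: f = c / lambda
print(f"{c / head_diameter:.0f} Hz")   # 1960 Hz, i.e. about 1.96 kHz
```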
Later investigations of spatial hearing discovered that IIDs and ITDs do not explain localization sufficiently. In fact, they only affect the lateralization of the sound source, where lateralization is the perception of position along the interaural axis on the frontal plane. (The interaural axis can be conceptualized as a ray passing through both ears.) When signals presented to the ears have only IIDs or ITDs, listeners can describe the extent to which the signals are to their left or right, but not whether they are in front of, behind, above, or below them. Woodworth [16] called this ambiguity of location at a given degree of lateralization the "cone of confusion," as shown in Figure 2.5. It was given its name because all points producing the same difference in distance to the left and right ears form a cone opening outward from each ear.
It can be shown that the cone of confusion forms a hyperbola on the horizontal plane. For experiments concerned only with horizontal azimuth, this "hyperbola of confusion" results in confusions between auditory event locations in front of or behind the frontal plane. These confusions are called front-back reversals (or front-back confusion).
Fig. 2.5. The cone of confusion results when only IIDs or ITDs are present. Here, lateralization to the person's left would occur to match the projection of the sound source vector onto the interaural axis [16].
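The "hyperbola of confusion" mentioned above follows from the definition of the constant-ITD locus; a brief sketch of the reasoning (a point-source geometry is assumed for illustration, with the cone as its far-field limit):

```latex
% Place the ears at the foci (\pm a, 0) of the horizontal plane, and let
% r_L and r_R be the distances from a source at (x, y) to each ear.
% A fixed interaural time difference fixes the path-length difference:
\[
  \lvert r_L - r_R \rvert = c \cdot \mathrm{ITD} = 2k, \qquad 0 < k < a,
\]
% which is the focal definition of a hyperbola:
\[
  \frac{x^2}{k^2} - \frac{y^2}{a^2 - k^2} = 1 .
\]
% In three dimensions the locus is a hyperboloid of two sheets; far from the
% head it approaches its asymptotic cone, the cone of confusion.
```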
Rayleigh's duplex theory was insufficient to describe spatial hearing because it did not describe how sound waves reaching the ears are affected acoustically by the listener's body and the room, how these effects change under head motion, and other temporal cues to localization. Acoustic examinations of the signals reaching the ears show that IIDs and ITDs are really functions of frequency.
Head-related transfer functions (HRTFs) describe the acoustic interactions that a sound wave has with the listener's torso, head, pinnae (outer ears), and ear canals. The complexity of these interactions makes the HRTF at each ear strongly dependent on the direction of the sound [16].
HRTFs are measured by recording test signals using one of three techniques [16]. Using live subjects, miniature microphone capsules may be placed at the entrance of the ear canal or a probe tube may be placed within the ear canal. Alternatively, a probe tube may be placed at the ear drum position of an artificial "dummy head," a mannequin torso and head with anatomically correct pinnae and ear canals (portions of the outer ear). A fixed ratio exists between the magnitude spectra of the two methods employing probe microphones for frequencies below about 7 kHz [16]. All measurements are made under anechoic conditions with the head in a perfectly fixed position. Blauert notes that a 25 μs rectangular pulse is a sufficient signal for such measurements because its Fourier transform shows a drop of only 2.4 dB at 16 kHz [14].
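The adequacy of such a short pulse can be checked from the Fourier transform of a rectangular pulse of width T, whose magnitude relative to DC is |sin(πfT)/(πfT)|; the short check below (assuming T = 25 μs) reproduces the 2.4 dB figure:

```python
import math

T = 25e-6   # pulse width in seconds (25 microseconds)
f = 16e3    # frequency at which to evaluate the spectrum, in Hz

# Magnitude of a rectangular pulse's spectrum relative to its value at 0 Hz
x = math.pi * f * T
drop_db = 20.0 * math.log10(abs(math.sin(x) / x))
print(f"Level at 16 kHz relative to 0 Hz: {drop_db:.1f} dB")   # about -2.4 dB
```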
Like all transfer functions, HRTFs may be examined in the time domain or the frequency domain. (This author has not found any examinations of HRTFs using joint time-frequency analysis methods.) In the time domain, the original impulses are spread over 1 to 3 ms by acoustic interactions with the listener's body [16]. Differences between the HRTF time domain representations are expressed as IIDs and ITDs. When a sound source is directly to the listener's side, the ITD reaches its maximum value between 0.6 and 0.8 ms depending on the signal. Alternatively, the IID must be between 15 and 20 dB for localization to be completely to one side of the listener [14].
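The quoted maximum ITD is consistent with a simple spherical-head estimate. The sketch below uses the commonly cited Woodworth ray-path approximation, ITD ≈ (a/c)(θ + sin θ); this formula is an assumption introduced here for illustration and is not the measurement method described in the text.

```python
import math

def woodworth_itd(azimuth_deg, head_radius=0.0875, c=343.0):
    """Spherical-head (Woodworth) estimate of the ITD in seconds.

    azimuth_deg is measured from the median plane; head_radius and c are the
    assumed head radius (m) and speed of sound (m/s). Illustrative model only.
    """
    theta = math.radians(azimuth_deg)
    return (head_radius / c) * (theta + math.sin(theta))

# Source directly to one side of the listener (90 degrees off the median plane):
print(f"{woodworth_itd(90.0) * 1e3:.2f} ms")   # about 0.66 ms
```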
More subtleties are seen in HRTFs when they are examined in the frequency domain. Comparing HRTFs from two ears, we see that magnitude spectra are more similar for frequencies below about 1,500 Hz. Thus magnitude spectra differences are more evident for higher frequencies. Phase spectra may be interpreted as either phase delay or group delay. Phase delay differences are greatest for sound waves at low frequencies because their diffraction around the head slows them relative to those at high frequencies [16]. The transition between these low and high frequency regions exists from 500 to 2,500 Hz and is centered around 1,500 Hz [16].
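For reference, phase delay and group delay are two readings of the same interaural phase spectrum φ(f); the standard signal-processing definitions (not specific to [16]) are:

```latex
\[
  \tau_{\text{phase}}(f) = -\frac{\varphi(f)}{2\pi f},
  \qquad
  \tau_{\text{group}}(f) = -\frac{1}{2\pi}\,\frac{d\varphi(f)}{df}.
\]
```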
HRTFs change very little if the sound event is more than 2 to 3 m from the head, where the sound wave is approximately planar. Planar approximations to spherical sound waves are given in the literature at distances of 2 m [16] and 3 m [14]. Comparing HRTFs from different people, we see that the spectral features do not entirely match although overall trends are quite similar. Differences are expected, however, since people's heads, ears, and torsos are all different sizes and shapes.
Obviously the frequency domain perspective on HRTFs corresponds well with classical duplex theories. However, IIDs and ITDs vary in complex ways across frequency because of constructive and destructive interference of the direct wave with sound reflected off the body. Above 4 kHz sound is reflected mainly by the pinnae, and below 2 kHz sound is reflected mainly from the torso. In between there is a region of overlapping influence. The pinna is especially important to spatial hearing and may be considered as an acoustic, linear filter. "By distorting incident sound signals linearly, and differently depending on their direction and distance, the pinna codes spatial attributes of the sound field into temporal and spectral attributes [14]." The frequency dependence of IIDs and ITDs is important to the resolution of front-back reversals and was not captured by Rayleigh's duplex theory.
Because HRTFs are such strong functions of the direction of the sound source relative to position of the head, head movement plays an important role in spatial hearing. When a listener moves his or her head, the acoustic interference of the sound wave with the head changes and the HRTFs change accordingly. Whereas the duplex theory alone led to front-back reversals, the changing HRTFs resulting from head movement considerably reduce their occurrence [16]. Assuming the duration of a sound event is long enough, exploratory head movements almost always increase localization precision [14].
Lambert [18] hypothesized that both the azimuth and range of a sound source could be determined wholly by rotating one's head about its vertical axis, without any prior knowledge of the sound source. Rodgers [19] studied head motion effects on the perception of sounds reproduced by a stereo loudspeaker system. She showed that phantom image instability occurs under head movement "due to the drastically changing pinna transformations as a function of source azimuth and height."
Interaural Envelope Time Shift Theories
The duplex theory also ignores the effects of amplitude modulation of signals on spatial perception. While ITDs are only important below about 1.5 kHz, time differences in signal envelopes with carriers above 1.6 kHz do affect localization [14]. Blauert notes that the entire envelope of the signal is not evaluated:
First, the spectrum of the signal is dissected to a degree determined by the finite spectral resolution of the inner ear; then the envelopes of the separate spectral components are evaluated individually. A unified auditory event appears only if the shifts of the envelopes in the different frequency ranges show sufficient similarity to one another [14].
Lateralization depends in complex ways on the steepness of the slopes of the envelope. Lateralization blur depends on the onset time interval between amplitude-modulated signals presented interaurally. In this case, lateralization blur is the smallest change in interaural phase delay or group delay that leads to a change in lateralization of the auditory event [14].
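One way to make the band-wise envelope comparison concrete is sketched below: band-pass each ear signal around a high-frequency carrier, take the Hilbert envelope, and estimate the interaural envelope shift from the cross-correlation peak. This is an illustrative signal-processing sketch under assumed parameters, not the auditory model Blauert describes.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

fs = 48000  # sample rate in Hz (assumed)

def envelope_itd(left, right, band=(3000.0, 5000.0)):
    """Estimate the interaural envelope time shift in one frequency band.

    Band-pass both ear signals, extract their Hilbert envelopes, and take the
    lag of the envelope cross-correlation peak. Returns the time (seconds) by
    which the left-ear envelope trails the right-ear envelope. Sketch only.
    """
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    env_l = np.abs(hilbert(sosfiltfilt(sos, left)))
    env_r = np.abs(hilbert(sosfiltfilt(sos, right)))
    env_l -= env_l.mean()
    env_r -= env_r.mean()
    xcorr = np.correlate(env_l, env_r, mode="full")
    lag = np.argmax(xcorr) - (len(env_r) - 1)
    return lag / fs

# Demo: a 4 kHz carrier amplitude-modulated at 100 Hz, with the left-ear
# envelope delayed by 0.5 ms relative to the right-ear envelope.
t = np.arange(int(0.2 * fs)) / fs
carrier = np.sin(2 * np.pi * 4000 * t)
delay = 0.0005
left = (1 + np.sin(2 * np.pi * 100 * (t - delay))) * carrier
right = (1 + np.sin(2 * np.pi * 100 * t)) * carrier
print(f"Estimated envelope shift: {envelope_itd(left, right) * 1e3:.2f} ms")  # ~0.50 ms
```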
Figure 2.6 summarizes the frequency ranges in which the primary localization cues are evaluated. Although IIDs are evaluated below 1.5 kHz, localization is increasingly dominated by ITD carrier cues at lower frequencies.
Fig. 2.6. Frequency ranges of primary localization cues [14].
Spatial Cues to Auditory Stream Segregation
Auditory stream segregation is (1) the perceptual process in which relationships are formed between different sensations and (2) the effect of relationships on what is included and excluded from our perceptual descriptions of distinct auditory events [20]. It is the method by which our brains group together or "fuse" multiple sensations of acoustic stimuli because they are understood to have come from a common source. Spatial location of a sound source is one of the many perceptual cues to auditory stream segregation; others include timbre, melodic direction, and even position or movement perceived visually. Localization of auditory events at different positions is really only important in this project as it relates to their segregation into different streams.
Spatial cues affect stream segregation for sensations resulting from non-simultaneous or simultaneous events. Segregation by localization does occur for the case of non-simultaneous events. However, it seems weak unless supported by other bases for segregation. For simultaneous sound events, or ones that run in and out of simultaneity such as two people talking, localization effects on stream segregation are much stronger. The "cocktail party effect" is the most famous example of this phenomenon. "It arises from the fact that a desired signal S with a certain direction of incidence is less effectively masked by an undesired noise N from a different direction when subjects listen binaurally (with two functioning ears) than when they listen monaurally (e.g., with one ear plugged) [14]." The effect got its name from the ability to understand the speech of one person in a crowded, noisy room.
The "cocktail party effect" falls into the more general case of sound events of lower subjective loudness not being masked by those of higher loudness because of different localizations. The binaural masking level difference (BMLD) describes how much the intensity level of one sound event must differ from another to reverse this beneficial effect of spatial cues. The BMLD is directly proportional to the distance between two sound sources. However, the nature of this relationship depends on the sound sources themselves because the BMLD is frequency-dependent. For the case of pure tones being masked by broadband noise, the BMLD has values as high as 18 dB at 200 Hz and lower than 5 dB for frequencies above 2 kHz [14]. The BMLDs frequency dependence is evidence that different localization estimates are made in different frequency regions [20].
Room-Related Effects on Localization
According to the Huygens-Fresnel principle, spatial hearing in a reverberant room may be considered part of the more general case of spatial hearing of multiple sound sources (where each reflection is considered another source) [14]. In the case of an impulsive sound source in a room, the direct sound arrives first at the position of the subject, generating the primary auditory event. This elicits an inhibitory effect on perception of all subsequent reflections of the sound in the room, described equivalently by the law of first wavefront, precedence effect, or Haas effect. This law states that as the delay time between two identical sounds is increased beyond about 1 ms, the auditory event is localized in the direction of the sound that arrived first. This inhibitory effect is an example of fusion into a single auditory stream. Blauert describes this effect in detail:
For a certain length of time the forming of further auditory events is suppressed. After a time interval corresponding to the echo threshold [when two auditory events are localized separately], either a strong reflection leads to the forming of an echo and to further inhibition, or else the intervening reverberation has been strong enough that a precisely located auditory event is no longer formed. Instead the largely incoherent ear input signals resulting from the reverberation generate a diffusely located auditory event whose components more or less fill the subject's entire auditory space. The primary auditory event merges into the reverberant auditory event in such a way that the primary event appears to disperse spatially [14].
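A simple way to hear the precedence effect is to synthesize a two-channel click pair with an adjustable inter-channel delay. The sketch below (an illustration with assumed parameters, not an experimental protocol from the text) writes such a stimulus to a WAV file so the delay can be varied around and beyond the roughly 1 ms region.

```python
import numpy as np
from scipy.io import wavfile

fs = 48000           # sample rate in Hz
delay_ms = 2.0       # lead/lag delay; try values from 0 up to tens of ms
click_len = int(0.001 * fs)          # 1 ms rectangular click
delay = int(delay_ms * 1e-3 * fs)

signal = np.zeros((fs // 2, 2), dtype=np.float32)   # 0.5 s of stereo silence
signal[:click_len, 0] = 0.5                         # leading click, left channel
signal[delay:delay + click_len, 1] = 0.5            # lagging click, right channel

# Over headphones or loudspeakers, delays beyond about 1 ms should leave the
# auditory event localized toward the leading (left) click; much longer delays
# (tens of ms) are heard as a separate echo.
wavfile.write("precedence_demo.wav", fs, signal)
```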
For non-impulsive sounds, localization precision increases as the intensity of the primary sound increases relative to that of the diffuse field [14]. Blauert noted that differences will occur with different signals, rooms, and loudspeakers (if applicable) and that more research on room effects on localization is necessary. The effects of loudspeaker placement in a room on phantom image localization are considered in the "Loudspeaker placement" section of Chapter 4.
Before leaving the topic of room-related effects, we should state that non-sound-producing objects also can be localized roughly [21]. The presence of room boundaries may be detected based on their effect on sound events within the room, and the boundaries themselves may be localized roughly if they are near the listener and particularly reflective or absorptive. Individuals who have learned this complex skill can often describe the size, shape, and texture of objects from the pattern of reflected and diffracted sound in the room. Blind people often possess this skill to some degree.