Cue integration during spoken word recognition

One way for listeners to cope with variability in the speech signal is to use multiple acoustic cues when identifying speech sounds. Multiple cues often contribute to a single phonetic distinction in speech, and listeners can combine different sources of acoustic information to help resolve ambiguity. For example, one of the primary acoustic cues to the voicing distinction in English, the difference between the sounds ‘b’ and ‘p’, is voice-onset time (VOT). Short VOTs (~0 ms) correspond to ‘b’ sounds, while long VOTs (~50 ms) correspond to ‘p’ sounds. VOT values in between are ambiguous and could be either ‘b’ or ‘p’. Vowel length (VL) is a secondary cue that distinguishes voiced from voiceless sounds word-initially: vowels tend to be longer following voiced sounds.
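
As a rough illustration of how a cue like VOT maps onto a voicing category, consider a logistic mapping from VOT to the probability of hearing ‘p’. This is only a sketch; the boundary and slope values below are invented for illustration, not parameters from our experiments.

```python
import math

def p_voiceless(vot_ms, boundary=25.0, slope=0.3):
    """Probability of categorizing a stop as 'p' given its VOT.

    boundary and slope are illustrative values, not fitted parameters."""
    return 1.0 / (1.0 + math.exp(-slope * (vot_ms - boundary)))

for vot in (0, 15, 25, 35, 50):
    print(f"VOT = {vot:2d} ms -> P('p') = {p_voiceless(vot):.2f}")
# 0 ms is near-certain 'b', 50 ms is near-certain 'p',
# and 25 ms is maximally ambiguous.
```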

Thus, listeners may deal with ambiguous VOT values by relying on other cues in the signal, like VL, that may be more informative about the intended speech sound. How do listeners combine cues during perception? Do they wait until all cues are available and then make a voicing judgment? Or, do they make an initial judgment and update it as more information becomes available? The latter approach would provide faster estimates, but listeners may make more errors early in processing.
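
The contrast between these two hypotheses can be sketched in a few lines of code, under the simplifying assumption that each cue contributes independent log-odds evidence for ‘p’ over ‘b’. The cue weights below are invented for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical evidence from each cue (log-odds favoring 'p'),
# listed in order of arrival in the signal:
evidence = [("VOT", 0.4), ("vowel length", 1.1)]

# Buffered strategy: wait until all cues are in, then judge once.
total = sum(e for _, e in evidence)
print(f"buffered:           P('p') = {sigmoid(total):.2f}")

# Incremental strategy: commit to an estimate after each cue, then update.
running = 0.0
for name, e in evidence:
    running += e
    print(f"after {name}: P('p') = {sigmoid(running):.2f}")
# The incremental listener has an estimate earlier, but the early
# estimate (based on VOT alone) can be less accurate than the final one.
```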

Using an eye-tracking approach known as the visual world paradigm, we have been examining these two hypotheses. In these experiments, subjects view a set of objects on a computer screen (Fig. 1). Their task is to click on the picture corresponding to a spoken word they hear. Eye movements are monitored with an eye-tracker, allowing us to determine the proportion of time spent fixating each object in the display over the course of the trial. These proportions correspond well with lexical activation in models of spoken word recognition like TRACE. Thus, this measure gives us an estimate of lexical activation that has a high level of temporal precision and can be obtained online during word recognition.
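
To make the measure concrete, here is a minimal sketch of how fixation proportions might be computed from eye-tracking samples. The data format (one fixated object per sample, pooled over trials) and the 50 ms bin size are assumptions for illustration, not our actual analysis pipeline.

```python
from collections import Counter

# Hypothetical samples: (time_ms, fixated_object), pooled over trials.
samples = [(100, "beach"), (100, "peach"), (150, "beach"),
           (400, "peach"), (400, "peach"), (450, "beach")]

BIN_MS = 50
bins = {}  # bin start time -> Counter of fixated objects
for t, obj in samples:
    start = (t // BIN_MS) * BIN_MS
    bins.setdefault(start, Counter())[obj] += 1

# Proportion of looks to each object within each time bin:
for start in sorted(bins):
    counts = bins[start]
    total = sum(counts.values())
    props = {obj: n / total for obj, n in counts.items()}
    print(f"{start} ms: {props}")
```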

Fig. 1. Sample visual world display used in the experiments

By comparing the likelihood of fixating particular objects as a function of each cue, we can obtain an estimate of the effect of that cue. For example, if VOT has an effect on voicing perception, we expect to see differences in the proportion of looks to the ‘b’ object (e.g. beach) and the ‘p’ object (peach) as a function of VOT. By examining the effect at each point in time over the course of the trial, we can see the effect of each cue as it unfolds over the course of word recognition (Fig. 2).
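
One simple way to quantify such an effect over time (a sketch of the idea, not our analysis code) is to compare, at each time bin, the proportion of looks to the ‘p’ object on long-VOT trials versus short-VOT trials. The fixation proportions below are hypothetical.

```python
# prop_p_looks[condition][time_ms] = proportion of looks to the 'p' object.
prop_p_looks = {
    "long_vot":  {300: 0.50, 400: 0.68, 500: 0.80},  # hypothetical values
    "short_vot": {300: 0.50, 400: 0.42, 500: 0.25},
}

for t in sorted(prop_p_looks["long_vot"]):
    effect = prop_p_looks["long_vot"][t] - prop_p_looks["short_vot"][t]
    print(f"{t} ms: VOT effect = {effect:+.2f}")
# An effect near 0 means VOT is not yet influencing fixations;
# it grows as the cue is heard and processed.
```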

Fig. 2. Effect of VOT (blue line) and vowel length (orange line) over the time course of a trial

The figure shows the results from an experiment with stimuli varying in VOT and VL. Since these two cues occur at different points in time (i.e. VOT information is available before VL), we can assess the two hypotheses presented above. The effect of VOT occurs at approximately 400 ms and the effect of VL occurs at approximately 700 ms. This suggests that listeners use each cue as it becomes available rather than waiting until they hear every cue.

We can also examine how the effect of a cue unfolds over time for different cue values. The accompanying movie shows the change in the proportion of looks to /b/ and /p/ items as a function of VOT (+1: more looks to /p/ items; -1: more looks to /b/ items) over the course of a trial. Using these data, we can examine how the effect of VOT evolves from early time points in processing to later ones.
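
Our reading of the +1/-1 index is a normalized difference between looks to /p/ items and /b/ items; the exact formula is not spelled out above, so the sketch below is an assumption.

```python
def bias_index(p_looks, b_looks):
    """+1 = all looks to /p/ items, -1 = all looks to /b/ items."""
    total = p_looks + b_looks
    return 0.0 if total == 0 else (p_looks - b_looks) / total

print(bias_index(9, 1))  # 0.8  -> strongly /p/-biased fixations
print(bias_index(2, 8))  # -0.6 -> /b/-biased fixations
```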

We are conducting similar experiments examining the time course of processing in the presence of additional variability. For example, the speaking rate of the surrounding sentence has been shown to affect voicing perception, and both VOT and VL values are affected by speaking rate (since they are temporal cues). This approach may help us understand how listeners deal with variations in speaking rate that cause some acoustic cues to take on different values.
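
One possibility, sketched below, is that listeners shift their category boundary with speaking rate. The linear scaling and all constants here are illustrative assumptions, not a claim about how listeners actually normalize.

```python
def vot_boundary(rate_factor, base_boundary=25.0):
    """Hypothetical 'b'/'p' category boundary in ms.

    rate_factor > 1 means slower speech; slower speech stretches
    temporal cues, so the boundary shifts toward longer VOTs."""
    return base_boundary * rate_factor

for rate in (0.8, 1.0, 1.2):  # fast, normal, slow speech
    print(f"rate factor {rate}: boundary = {vot_boundary(rate):.1f} ms")
# The same 30 ms VOT would fall on the 'p' side in fast speech
# but be ambiguous in slow speech.
```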
